Prefix Cache-Aware Routing in Precise Mode - Container Service for Kubernetes (ACK)

Maximize KV cache hits by routing each request to the vLLM replica with the longest matching prefix.

Key concepts

KV cache

During inference, the model generates key-value pairs for each token. Caching these pairs skips redundant computation, speeding up inference and reducing latency.

Automatic Prefix Caching (APC)

vLLM's APC stores KV cache from prior requests. When a new request shares a prefix with a cached one, vLLM reuses that KV cache, skipping recomputation.
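
The reuse condition can be illustrated with a small shell sketch. It is a simplified model, not vLLM's implementation: whitespace-separated words stand in for token IDs, `sha256sum` (GNU coreutils) stands in for vLLM's hash function, and the function names `block_hashes` and `shared_blocks` are hypothetical. Each full block's hash is chained to the previous block's hash, so two prompts produce the same leading hashes only while their prefixes are identical.

```shell
#!/usr/bin/env bash
# Simplified model of block-level prefix hashing (not vLLM's real hash).

# block_hashes BLOCK_SIZE TOKEN...
# Prints one chained hash per full block; a trailing partial block is skipped.
block_hashes() {
  local size=$1; shift
  local parent="" tok
  local -a block=()
  for tok in "$@"; do
    block+=("$tok")
    if [ "${#block[@]}" -eq "$size" ]; then
      # Chain the previous hash so a block only matches after an identical prefix.
      parent=$(printf '%s|%s' "$parent" "${block[*]}" | sha256sum | cut -c1-16)
      printf '%s\n' "$parent"
      block=()
    fi
  done
}

# shared_blocks HASHES_A HASHES_B
# Counts how many leading hashes two newline-separated lists have in common.
shared_blocks() {
  local n=0 a b
  while IFS= read -r a <&3 && IFS= read -r b <&4; do
    [ -n "$a" ] && [ "$a" = "$b" ] || break
    n=$((n + 1))
  done 3< <(printf '%s\n' "$1") 4< <(printf '%s\n' "$2")
  printf '%s\n' "$n"
}

# Two prompts that diverge at the fifth word, with a toy block size of 2.
cached=$(block_hashes 2 you are a helpful assistant today)
incoming=$(block_hashes 2 you are a helpful robot today)
echo "reusable blocks: $(shared_blocks "$cached" "$incoming")"   # prints: reusable blocks: 2
```

The third block differs between the two prompts, so only the first two blocks would be reusable in this model.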

Precise mode vs. estimated mode

	Precise mode	Estimated mode
Cache monitoring	Receives KV cache block distribution directly from each vLLM replica	Infers cache state without direct reporting
Cache hit accuracy	Higher — routes based on actual cache state	Lower — cannot precisely track KV cache distribution
Requirements	vLLM v0.10.0 or later with KV cache event reporting enabled at startup	No additional vLLM configuration required
Best for	Workloads with many shared-prefix requests	Scenarios where vLLM version constraints prevent precise mode

Use precise mode when running vLLM v0.10.0 or later with workloads that share system prompts or conversation history.
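
As a rough illustration of routing on actual cache state, the sketch below picks the replica that holds the longest run of a request's leading block hashes. The index file format (`<replica> <block-hash>` per line), the function name `best_replica`, and the pod names and hashes are assumptions made for this example; the gateway's real index and scoring logic are not described in this topic.

```shell
#!/usr/bin/env bash
# Toy longest-prefix replica selection over a reported block index.

# best_replica INDEX_FILE HASH...
# INDEX_FILE lines: "<replica> <block-hash>" (hypothetical format).
# Prints "<replica> <matched-blocks>" for the replica with the longest match.
best_replica() {
  local index=$1; shift
  local best="" best_n=-1 replica n h
  for replica in $(cut -d' ' -f1 "$index" | sort -u); do
    n=0
    for h in "$@"; do
      # Stop at the first block this replica does not hold.
      grep -qxF -- "$replica $h" "$index" || break
      n=$((n + 1))
    done
    if [ "$n" -gt "$best_n" ]; then
      best=$replica
      best_n=$n
    fi
  done
  printf '%s %s\n' "$best" "$best_n"
}

# Made-up index: qwen3-a holds the first two blocks, qwen3-b only the first.
index=$(mktemp)
cat > "$index" <<'EOF'
qwen3-a 1f3a
qwen3-a 9c2e
qwen3-b 1f3a
qwen3-c 77d0
EOF

best_replica "$index" 1f3a 9c2e 5b10   # prints: qwen3-a 2
rm -f "$index"
```

Ties go to the first replica in sort order here; a production scorer would likely weigh other signals as well.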

Prerequisites

Ensure you have:

An ACK managed cluster with a GPU node pool, or ACK with ACS GPU computing power via ACK Virtual Node.

Qwen3-32B requires over 64 GB of GPU memory. Use the ecs.gn8is-2x.8xlarge instance type for GPU node pools, or the GU8TF card type for ACS virtual nodes.
Gateway with Inference Extension 1.4.0-aliyun.3 or later, installed with the Enable Gateway API Inference Extension option selected. See Install the Gateway with Inference Extension add-on.

Deploy the model service

Step 1: Prepare the Qwen3-32B model files

Install git-lfs if not already installed.

# On RHEL/CentOS-based systems
yum install git-lfs

# On Debian/Ubuntu-based systems
apt-get install git-lfs

For other methods, see Installing Git Large File Storage.

Download the Qwen3-32B model from ModelScope.

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
cd Qwen3-32B/
git lfs pull
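
Because the clone uses `GIT_LFS_SKIP_SMUDGE=1`, large files are small pointer stubs until `git lfs pull` replaces them. A quick check before uploading can catch an incomplete pull. The helper below is a sketch: the function name `lfs_pointers` is hypothetical, and it relies on the standard first line of a Git LFS pointer file.

```shell
#!/usr/bin/env bash
# List files that are still Git LFS pointer stubs (not yet downloaded).

# lfs_pointers [DIR]
lfs_pointers() {
  local dir=${1:-.} f
  # Pointer stubs are tiny text files, so only inspect files under 1024 bytes.
  find "$dir" -type f ! -path '*/.git/*' -size -1024c -print |
    while IFS= read -r f; do
      if head -n 1 "$f" | grep -q '^version https://git-lfs\.github\.com/spec/v1'; then
        printf '%s\n' "$f"
      fi
    done
}

# Usage, from inside the cloned Qwen3-32B directory:
#   lfs_pointers .
```

If the function prints any paths, those files have not been downloaded yet and `git lfs pull` should be rerun.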

Create an OSS folder and upload the model files. Install ossutil if needed.

cd ..   # the previous step ended inside Qwen3-32B/; upload from its parent directory
ossutil mkdir oss://<YOUR-BUCKET-NAME>/Qwen3-32B
ossutil cp -r ./Qwen3-32B oss://<YOUR-BUCKET-NAME>/Qwen3-32B

Create llm-model.yaml to define an OSS-backed Secret, persistent volume (PV), and persistent volume claim (PVC). See Use ossfs 1.0 to create a statically provisioned volume.

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <YOUR-OSS-AK>       # AccessKey ID for OSS access
  akSecret: <YOUR-OSS-SK>   # AccessKey Secret for OSS access
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <YOUR-BUCKET-NAME>      # OSS bucket name
      url: <YOUR-BUCKET-ENDPOINT>     # OSS endpoint, e.g., oss-cn-hangzhou-internal.aliyuncs.com
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <YOUR-MODEL-PATH>         # Path to model files, e.g., /Qwen3-32B/
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

Apply the manifest.
```
kubectl create -f llm-model.yaml
```

Step 2: Deploy the vLLM inference service

Create vllm.yaml.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwen3
  name: qwen3
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: qwen3
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: qwen3
    spec:
      containers:
        - command:
            - sh
            - '-c'
            - >-
              vllm serve /models/Qwen3-32B --served-model-name Qwen3-32B
              --trust-remote-code --port=8000 --max-model-len 8192
              --gpu-memory-utilization 0.95 --enforce-eager --kv-events-config
              "{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557\",\"topic\":\"kv@${POD_IP}@Qwen3-32B\"}"
              --prefix-caching-hash-algo sha256_cbor_64bit --block-size 64
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
            - name: PYTHONHASHSEED
              value: '42'
          image: 'registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:0.10.0'
          imagePullPolicy: IfNotPresent
          name: vllm
          ports:
            - containerPort: 8000
              name: restful
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            initialDelaySeconds: 30
            periodSeconds: 30
            successThreshold: 1
            tcpSocket:
              port: 8000
            timeoutSeconds: 1
          resources:
            limits:
              nvidia.com/gpu: '1'
            requests:
              nvidia.com/gpu: '1'
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /models/Qwen3-32B
              name: model
            - mountPath: /dev/shm
              name: dshm
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        - emptyDir:
            medium: Memory
            sizeLimit: 30Gi
          name: dshm
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: qwen3
  name: qwen3
spec:
  ports:
    - name: http-serving
      port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: qwen3
  type: ClusterIP

The following table describes the startup parameters and environment variables that precise-mode routing depends on. The --block-size and PYTHONHASHSEED values must match the InferenceTrafficPolicy fields in the next section.

Parameter / variable	Description
`--kv-events-config`	KV cache event publishing configuration. Set `enable_kv_cache_events` to `true` and `publisher` to `zmq`. For `endpoint`, use `tcp://epp-<InferencePool namespace>-<InferencePool name>.envoy-gateway-system.<cluster local domain>:5557`. For `topic`, use `kv@${POD_IP}@<served model name>`. In this example, with InferencePool `qwen-inference-pool` in the `default` namespace and model `Qwen3-32B`, the values are `tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557` and `kv@${POD_IP}@Qwen3-32B`.
`--prefix-caching-hash-algo`	Hash algorithm for KV cache prefix blocks. Must be `sha256_cbor_64bit`.
`--block-size`	Number of tokens per KV cache prefix block. Must match `blockSize` in `InferenceTrafficPolicy`. In this example: `64`.
`PYTHONHASHSEED`	Python hash seed. Must be non-zero and match `hashSeed` in `InferenceTrafficPolicy`. In this example: `42`.
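
The naming rules for `endpoint` and `topic` can be captured in a small helper that prints the JSON value for `--kv-events-config`. This is a sketch: the function name `kv_events_config` is hypothetical, and the default cluster domain `svc.cluster.local` is taken from the example in this topic. `${POD_IP}` is printed literally so that the container shell expands it at startup.

```shell
#!/usr/bin/env bash
# Build the --kv-events-config JSON from the InferencePool and model names.

# kv_events_config POOL_NAMESPACE POOL_NAME SERVED_MODEL [CLUSTER_DOMAIN]
kv_events_config() {
  local ns=$1 pool=$2 model=$3 domain=${4:-svc.cluster.local}
  # Single quotes keep ${POD_IP} unexpanded; only the %s fields are filled in.
  printf '{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://epp-%s-%s.envoy-gateway-system.%s:5557","topic":"kv@${POD_IP}@%s"}\n' \
    "$ns" "$pool" "$domain" "$model"
}

# Values used in this topic: pool qwen-inference-pool in namespace default.
kv_events_config default qwen-inference-pool Qwen3-32B
```

With these arguments the output matches the JSON passed to `--kv-events-config` in vllm.yaml, before the quotes are escaped for YAML.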

Deploy the vLLM inference service.
```
kubectl create -f vllm.yaml
```

Deploy inference routing

Step 1: Deploy the inference routing policy

Create inference-policy.yaml. blockSize and hashSeed must match --block-size and PYTHONHASHSEED in your vLLM deployment.

# InferencePool selects the vLLM workload pods for routing
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen3
---
# InferenceTrafficPolicy configures KV cache-aware load balancing for the pool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  profile:
    single:                  # Backend is a single-model vLLM deployment
      trafficPolicy:
        prefixCache:
          mode: tracking     # Enables KV cache-aware load balancing (precise mode)
          trackingConfig:
            indexerConfig:
              tokenProcessorConfig:
                blockSize: 64            # Must match vLLM --block-size
                hashSeed: 42             # Must match vLLM PYTHONHASHSEED
                model: Qwen/Qwen3-32B   # Official ModelScope model name

Apply the policy.
```
kubectl apply -f inference-policy.yaml
```
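
Since `blockSize` and `hashSeed` must match the vLLM settings, a quick comparison of the two manifests can catch drift. The sketch below uses plain text matching rather than a YAML parser, so it assumes the files are laid out as in this topic (in particular, the `value:` line directly follows `name: PYTHONHASHSEED`). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Compare block size and hash seed between vllm.yaml and inference-policy.yaml.

# first_number: print the first run of digits found on stdin.
first_number() { grep -oE '[0-9]+' | head -n 1; }

# check_prefix_cache_settings VLLM_YAML POLICY_YAML
# Returns 0 when both values match, 1 otherwise.
check_prefix_cache_settings() {
  local vllm=$1 policy=$2 rc=0
  local vllm_block vllm_seed pol_block pol_seed
  vllm_block=$(grep -oE -- '--block-size[ =]+[0-9]+' "$vllm" | first_number)
  vllm_seed=$(grep -A1 'name: PYTHONHASHSEED' "$vllm" | grep 'value:' | first_number)
  pol_block=$(grep 'blockSize:' "$policy" | first_number)
  pol_seed=$(grep 'hashSeed:' "$policy" | first_number)

  if [ -z "$vllm_block" ] || [ "$vllm_block" != "$pol_block" ]; then
    echo "block size mismatch: vLLM='$vllm_block' policy='$pol_block'"
    rc=1
  fi
  if [ -z "$vllm_seed" ] || [ "$vllm_seed" != "$pol_seed" ]; then
    echo "hash seed mismatch: vLLM='$vllm_seed' policy='$pol_seed'"
    rc=1
  fi
  [ "$rc" -eq 0 ] && echo "settings match: blockSize=$pol_block hashSeed=$pol_seed"
  return "$rc"
}

# Usage:
#   check_prefix_cache_settings vllm.yaml inference-policy.yaml
```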

Step 2: Deploy the gateway and routing rules

Create inference-gateway.yaml with the Gateway, HTTPRoute, and a backend timeout policy.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: ack-gateway
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 24h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway

Apply the manifest.
```
kubectl apply -f inference-gateway.yaml
```

Step 3: Verify routing

Send two requests with the same prefix and verify both reach the same vLLM replica.

Create two request payloads that share the same content in the first message.

echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt

Get the gateway's public IP address.

export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')

Send both requests.

curl -X POST "http://${GATEWAY_IP}:8080/v1/chat/completions" -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST "http://${GATEWAY_IP}:8080/v1/chat/completions" -H 'Content-Type: application/json' -d @./round2.txt

Check the inference extension logs. Both entries should show the same endpoint.Address, confirming both requests reached the same vLLM replica.

kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system | grep "handled"

Expected output:

2025-08-19T10:16:12Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "00d5c24e-b3c8-461d-9848-7bb233243eb9", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
2025-08-19T10:16:19Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "401925f5-fe65-46e3-8494-5afd83921ba5", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}

In this example, both requests show Address:10.0.0.5, confirming they were routed to the same pod.
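
To avoid comparing log lines by eye, the check can be scripted. The sketch below reads `Request handled` lines on standard input and reports whether they all name the same `Address`. The function name `same_endpoint` is hypothetical, and the pattern assumes IPv4 addresses in the log format shown above.

```shell
#!/usr/bin/env bash
# Report whether all "Request handled" log lines went to the same pod address.

# same_endpoint: reads log lines on stdin; returns 0 only if exactly one
# distinct Address value appears.
same_endpoint() {
  local addrs count
  addrs=$(grep 'Request handled' | grep -oE 'Address:[0-9.]+' | sort -u)
  # Count non-empty lines; "|| true" keeps a zero count from failing the call.
  count=$(printf '%s' "$addrs" | grep -c . || true)
  echo "distinct endpoints: $count"
  [ -n "$addrs" ] && echo "$addrs"
  [ "$count" -eq 1 ]
}

# Usage, limited to the two most recent handled requests:
#   kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system \
#     | grep 'Request handled' | tail -n 2 | same_endpoint
```

A non-zero exit status means the requests were spread across more than one replica, or that no matching log lines were found.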