Maximize KV cache hits by routing each request to the vLLM replica with the longest matching prefix.
Key concepts
KV cache
During inference, the model computes key and value tensors for each token in its attention layers. Caching these tensors avoids recomputing them for tokens that have already been processed, which reduces latency.
Automatic Prefix Caching (APC)
vLLM's APC stores KV cache from prior requests. When a new request shares a prefix with a cached one, vLLM reuses that KV cache, skipping recomputation.
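To make the idea of a shared prefix concrete, the following sketch splits two requests into fixed-size blocks and hashes each block together with the hash of the block before it. It is an illustration only: whitespace-separated words stand in for tokens, `sha256sum` stands in for the real block hash, and the function name `block_hashes` and the block size of 4 are assumptions made for this example, not vLLM's actual implementation.

```shell
#!/usr/bin/env bash
# Illustrative only: words stand in for tokens and sha256sum stands in for
# the real block hash. This is not vLLM's hashing scheme.

# Print one chained hash per full block of $1 tokens.
block_hashes() {
  local block_size=$1; shift
  local parent="" block="" count=0 tok
  for tok in "$@"; do
    block="$block $tok"
    count=$((count + 1))
    if [ "$count" -eq "$block_size" ]; then
      # Each hash covers the previous block's hash plus this block's tokens,
      # so a block only matches when the whole prefix before it matches too.
      parent=$(printf '%s|%s' "$parent" "$block" | sha256sum | cut -c1-16)
      printf '%s\n' "$parent"
      block=""; count=0
    fi
  done
}

req1="you are a helpful assistant answer briefly what is a pod please"
req2="you are a helpful assistant answer briefly what is a node please"

# Word splitting of $req1 and $req2 is intentional here.
h1=$(block_hashes 4 $req1)
h2=$(block_hashes 4 $req2)

# Count how many leading block hashes the two requests have in common.
shared=0
i=1
while :; do
  a=$(printf '%s\n' "$h1" | sed -n "${i}p")
  b=$(printf '%s\n' "$h2" | sed -n "${i}p")
  [ -n "$a" ] && [ "$a" = "$b" ] || break
  shared=$((shared + 1))
  i=$((i + 1))
done

total=$(printf '%s\n' "$h1" | wc -l | tr -d ' ')
echo "shared leading blocks: $shared of $total"
```

The two requests differ only in their last block, so the first two block hashes are identical and could be reused from cache. Chaining the hashes also shows why the block size and hash seed have to be identical on every component that computes them: changing either one changes every hash.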
Precise mode vs. estimated mode
| | Precise mode | Estimated mode |
|---|---|---|
| Cache monitoring | Receives KV cache block distribution directly from each vLLM replica | Infers cache state without direct reporting |
| Cache hit accuracy | Higher — routes based on actual cache state | Lower — cannot precisely track KV cache distribution |
| Requirements | vLLM v0.10.0 or later with KV cache event reporting enabled at startup | No additional vLLM configuration required |
| Best for | Workloads with many shared-prefix requests | Scenarios where vLLM version constraints prevent precise mode |
Use precise mode when running vLLM v0.10.0 or later with workloads that share system prompts or conversation history.
Prerequisites
Ensure you have:

- An ACK managed cluster with a GPU node pool, or an ACK cluster that uses ACS GPU computing power through ACK Virtual Node.

  Qwen3-32B requires over 64 GB of GPU memory. Use the ecs.gn8is-2x.8xlarge instance type for GPU node pools, or the GU8TF card type for ACS virtual nodes.

- Gateway with Inference Extension version 1.4.0-aliyun.3 or later, installed with the Enable Gateway API Inference Extension option selected. See Install the Gateway with Inference Extension add-on.
Deploy the model service
Step 1: Prepare the Qwen3-32B model files
- Install git-lfs if not already installed.

      # On RHEL/CentOS-based systems
      yum install git-lfs

      # On Debian/Ubuntu-based systems
      apt-get install git-lfs

  For other methods, see Installing Git Large File Storage.
- Download the Qwen3-32B model from ModelScope.

      git lfs install
      GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
      cd Qwen3-32B/
      git lfs pull
- Create an OSS folder and upload the model files. Install ossutil if needed.

      # Run from the directory that contains Qwen3-32B/ (the parent of the cloned repository)
      ossutil mkdir oss://<YOUR-BUCKET-NAME>/Qwen3-32B
      ossutil cp -r ./Qwen3-32B oss://<YOUR-BUCKET-NAME>/Qwen3-32B
- Create `llm-model.yaml` to define an OSS-backed Secret, persistent volume (PV), and persistent volume claim (PVC). See Use ossfs 1.0 to create a statically provisioned volume.

      apiVersion: v1
      kind: Secret
      metadata:
        name: oss-secret
      stringData:
        akId: <YOUR-OSS-AK> # AccessKey ID for OSS access
        akSecret: <YOUR-OSS-SK> # AccessKey Secret for OSS access
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: llm-model
        labels:
          alicloud-pvname: llm-model
      spec:
        capacity:
          storage: 30Gi
        accessModes:
          - ReadOnlyMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: llm-model
          nodePublishSecretRef:
            name: oss-secret
            namespace: default
          volumeAttributes:
            bucket: <YOUR-BUCKET-NAME> # OSS bucket name
            url: <YOUR-BUCKET-ENDPOINT> # OSS endpoint, e.g., oss-cn-hangzhou-internal.aliyuncs.com
            otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
            path: <YOUR-MODEL-PATH> # Path to model files, e.g., /Qwen3-32B/
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: llm-model
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 30Gi
        selector:
          matchLabels:
            alicloud-pvname: llm-model

- Apply the manifest.

      kubectl create -f llm-model.yaml
Step 2: Deploy the vLLM inference service
- Create `vllm.yaml`. The following startup parameters and environment variables enable precise-mode routing. The `--block-size` and `PYTHONHASHSEED` values must match the `InferenceTrafficPolicy` fields in the next section.

  | Parameter / variable | Description |
  |---|---|
  | `--kv-events-config` | KV cache event publishing configuration. Set `enable_kv_cache_events` to `true` and `publisher` to `zmq`. For `endpoint`, use `tcp://epp-<InferencePool namespace>-<InferencePool name>.envoy-gateway-system.<cluster local domain>:5557`. For `topic`, use `kv@${POD_IP}@<served model name>`. In this example, with InferencePool `qwen-inference-pool` in the `default` namespace and model `Qwen3-32B`, the values are `tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557` and `kv@${POD_IP}@Qwen3-32B`. |
  | `--prefix-caching-hash-algo` | Hash algorithm for KV cache prefix blocks. Must be `sha256_cbor_64bit`. |
  | `--block-size` | Number of tokens per KV cache prefix block. Must match `blockSize` in `InferenceTrafficPolicy`. In this example: `64`. |
  | `PYTHONHASHSEED` | Python hash seed. Must be non-zero and match `hashSeed` in `InferenceTrafficPolicy`. In this example: `42`. |

- Deploy the vLLM inference service.

      kubectl create -f vllm.yaml
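The `--kv-events-config` values described above follow a fixed naming pattern, so they can be assembled from the InferencePool namespace, the InferencePool name, and the served model name. The sketch below builds the endpoint, the topic, and the JSON value for the flag. The variable names are chosen for this example, and the fallback `POD_IP` is a placeholder (in a real pod the address would be injected, for instance through the Kubernetes Downward API, which is an assumption about how `vllm.yaml` is written).

```shell
#!/usr/bin/env bash
# Values from this example; change them to match your InferencePool and model.
POOL_NAMESPACE="default"
POOL_NAME="qwen-inference-pool"
SERVED_MODEL_NAME="Qwen3-32B"
CLUSTER_DOMAIN="svc.cluster.local"   # cluster local domain used in this example
POD_IP="${POD_IP:-10.0.0.5}"         # placeholder; normally injected into the pod

# Endpoint and topic follow the patterns given for --kv-events-config.
KV_ENDPOINT="tcp://epp-${POOL_NAMESPACE}-${POOL_NAME}.envoy-gateway-system.${CLUSTER_DOMAIN}:5557"
KV_TOPIC="kv@${POD_IP}@${SERVED_MODEL_NAME}"

# JSON value passed to --kv-events-config.
KV_EVENTS_CONFIG=$(printf '{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"%s","topic":"%s"}' \
  "$KV_ENDPOINT" "$KV_TOPIC")

# Print the settings so they can be copied into the container args and env.
echo "--kv-events-config '${KV_EVENTS_CONFIG}'"
echo "--prefix-caching-hash-algo sha256_cbor_64bit"
echo "--block-size 64"
echo "PYTHONHASHSEED=42"
```

Generating the values from three inputs keeps the endpoint and topic consistent when the pool is renamed or moved to another namespace.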
Deploy inference routing
Step 1: Deploy the inference routing policy
- Create `inference-policy.yaml`. `blockSize` and `hashSeed` must match `--block-size` and `PYTHONHASHSEED` in your vLLM deployment.

      # InferencePool selects the vLLM workload pods for routing
      apiVersion: inference.networking.x-k8s.io/v1alpha2
      kind: InferencePool
      metadata:
        name: qwen-inference-pool
      spec:
        targetPortNumber: 8000
        selector:
          app: qwen3
      ---
      # InferenceTrafficPolicy configures KV cache-aware load balancing for the pool
      apiVersion: inferenceextension.alibabacloud.com/v1alpha1
      kind: InferenceTrafficPolicy
      metadata:
        name: inference-policy
      spec:
        poolRef:
          name: qwen-inference-pool
        profile:
          single: # Backend is a single-model vLLM deployment
            trafficPolicy:
              prefixCache:
                mode: tracking # Enables KV cache-aware load balancing (precise mode)
                trackingConfig:
                  indexerConfig:
                    tokenProcessorConfig:
                      blockSize: 64 # Must match vLLM --block-size
                      hashSeed: 42 # Must match vLLM PYTHONHASHSEED
                    model: Qwen/Qwen3-32B # Official ModelScope model name

- Apply the policy.

      kubectl apply -f inference-policy.yaml
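Because `blockSize` and `hashSeed` must match the vLLM settings, it can help to compare them mechanically. The sketch below is one possible check, not part of the product: the helper `yaml_int` is a hypothetical name, it does a simple line match rather than real YAML parsing, and the policy fragment is embedded inline so the snippet runs without a cluster or any files.

```shell
#!/usr/bin/env bash
# Extract the first integer value of a YAML key from stdin
# (simple line match, not a full YAML parser).
yaml_int() {
  sed -n "s/^[[:space:]]*$1:[[:space:]]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Values used when starting vLLM in this example.
VLLM_BLOCK_SIZE=64
VLLM_PYTHONHASHSEED=42

# Fragment of inference-policy.yaml from this example. In practice, read the file:
#   policy_block_size=$(yaml_int blockSize < inference-policy.yaml)
policy_fragment=$(cat <<'EOF'
tokenProcessorConfig:
  blockSize: 64 # Must match vLLM --block-size
  hashSeed: 42 # Must match vLLM PYTHONHASHSEED
EOF
)

policy_block_size=$(printf '%s\n' "$policy_fragment" | yaml_int blockSize)
policy_hash_seed=$(printf '%s\n' "$policy_fragment" | yaml_int hashSeed)

# Report a mismatch if either value differs from the vLLM setting.
status="match"
[ "$policy_block_size" = "$VLLM_BLOCK_SIZE" ] || status="mismatch"
[ "$policy_hash_seed" = "$VLLM_PYTHONHASHSEED" ] || status="mismatch"
echo "blockSize=$policy_block_size hashSeed=$policy_hash_seed -> $status"
```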
Step 2: Deploy the gateway and routing rules
- Create `inference-gateway.yaml` with the Gateway, HTTPRoute, and a backend timeout policy.

      apiVersion: gateway.networking.k8s.io/v1
      kind: Gateway
      metadata:
        name: inference-gateway
      spec:
        gatewayClassName: ack-gateway
        listeners:
          - name: http-llm
            protocol: HTTP
            port: 8080
      ---
      apiVersion: gateway.networking.k8s.io/v1
      kind: HTTPRoute
      metadata:
        name: inference-route
      spec:
        parentRefs:
          - name: inference-gateway
        rules:
          - matches:
              - path:
                  type: PathPrefix
                  value: /v1
            backendRefs:
              - name: qwen-inference-pool
                kind: InferencePool
                group: inference.networking.x-k8s.io
      ---
      apiVersion: gateway.envoyproxy.io/v1alpha1
      kind: BackendTrafficPolicy
      metadata:
        name: backend-timeout
      spec:
        timeout:
          http:
            requestTimeout: 24h
        targetRef:
          group: gateway.networking.k8s.io
          kind: Gateway
          name: inference-gateway

- Apply the manifest.

      kubectl apply -f inference-gateway.yaml
Step 3: Verify routing
Send two requests with the same prefix and verify both reach the same vLLM replica.
-
Create two request payloads that share the same
contentin the first message.echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txtecho '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. 
This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt -
- Get the gateway's public IP address.

      export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
- Send both requests.

      curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
      curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
- Check the inference extension logs. Both entries should show the same `endpoint.Address`, confirming both requests reached the same vLLM replica.

      kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system | grep "handled"

  Expected output:

      2025-08-19T10:16:12Z LEVEL(-2) requestcontrol/director.go:278 Request handled {"x-request-id": "00d5c24e-b3c8-461d-9848-7bb233243eb9", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
      2025-08-19T10:16:19Z LEVEL(-2) requestcontrol/director.go:278 Request handled {"x-request-id": "401925f5-fe65-46e3-8494-5afd83921ba5", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}

  In this example, both requests show `Address:10.0.0.5`, confirming they were routed to the same pod.
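Comparing addresses by eye becomes tedious with more requests, so the check can be scripted. The sketch below extracts the `Address` field from "Request handled" lines and counts the distinct values. The function name `distinct_addresses` is an assumption made for this example, and the sample lines are shortened copies of the expected output above so that the snippet runs without a cluster.

```shell
#!/usr/bin/env bash
# Print the distinct replica addresses found in "Request handled" log lines on stdin.
distinct_addresses() {
  grep 'Request handled' | grep -o 'Address:[0-9.]*' | sort -u
}

# Sample lines, shortened from the expected output in this example.
sample_logs=$(cat <<'EOF'
2025-08-19T10:16:12Z LEVEL(-2) requestcontrol/director.go:278 Request handled {"model": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
2025-08-19T10:16:19Z LEVEL(-2) requestcontrol/director.go:278 Request handled {"model": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
EOF
)

addresses=$(printf '%s\n' "$sample_logs" | distinct_addresses)
replica_count=$(printf '%s\n' "$addresses" | wc -l | tr -d ' ')

echo "$addresses"
if [ "$replica_count" -eq 1 ]; then
  echo "OK: all requests reached the same replica"
else
  echo "Requests were spread across $replica_count replicas"
fi
```

Against a live cluster, the same function can read from the log command used in this step, for example `kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system | distinct_addresses`. The log may also contain entries for unrelated requests, so filter by `x-request-id` or by time if other traffic is present.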