Configure prefix cache-aware routing in precise mode


Maximize KV cache hits by routing each request to the vLLM replica with the longest matching prefix.

Key concepts

KV cache

During inference, the model computes attention key and value (KV) tensors for each token. Caching these tensors avoids recomputing them for tokens that have already been processed, which speeds up inference and reduces latency.

Automatic Prefix Caching (APC)

vLLM's APC stores KV cache from prior requests. When a new request shares a prefix with a cached one, vLLM reuses that KV cache, skipping recomputation.
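
The prefix-matching idea can be illustrated with a minimal sketch. It assumes tokens are grouped into fixed-size blocks and that each block's hash also covers the hash of the preceding block, so two requests share a cached prefix exactly as far as their leading block hashes agree. The helper names `block_hashes` and `shared_prefix_blocks` are hypothetical, and the plain SHA-256 string encoding used here is a stand-in, not vLLM's actual sha256_cbor_64bit scheme.

```python
import hashlib
from typing import List, Sequence


def block_hashes(token_ids: Sequence[int], block_size: int) -> List[str]:
    """Return one chained hash per full block of tokens (hypothetical helper)."""
    hashes: List[str] = []
    parent = ""  # hash of the preceding block; empty for the first block
    full_blocks = len(token_ids) // block_size  # a trailing partial block is ignored in this sketch
    for i in range(full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        payload = parent + "|" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


def shared_prefix_blocks(a: Sequence[str], b: Sequence[str]) -> int:
    """Count how many leading block hashes two requests have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Two requests that agree on their first 8 tokens (2 blocks of 4).
first = block_hashes(list(range(10)), block_size=4)
second = block_hashes(list(range(8)) + [99, 100, 101, 102], block_size=4)
print(shared_prefix_blocks(first, second))  # 2
```

Because each hash chains in its parent, a difference in an early block changes every later hash, which is why only a common leading run of blocks counts as a reusable prefix.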

Precise mode vs. estimated mode

| Aspect | Precise mode | Estimated mode |
| --- | --- | --- |
| Cache monitoring | Receives KV cache block distribution directly from each vLLM replica | Infers cache state without direct reporting |
| Cache hit accuracy | Higher: routes based on actual cache state | Lower: cannot precisely track KV cache distribution |
| Requirements | vLLM v0.10.0 or later with KV cache event reporting enabled at startup | No additional vLLM configuration required |
| Best for | Workloads with many shared-prefix requests | Scenarios where vLLM version constraints prevent precise mode |

Use precise mode when running vLLM v0.10.0 or later with workloads that share system prompts or conversation history.
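
As a rough illustration of what routing on actual cache state means, the sketch below picks the replica whose reported cache covers the longest leading run of a request's block hashes. The index layout (replica address mapped to a set of cached block hashes), the function names, and the tie-breaking rule are all assumptions made for this example; the real inference extension may weigh other signals such as load.

```python
from typing import Dict, List, Set


def matched_blocks(request_blocks: List[str], cached: Set[str]) -> int:
    """Count consecutive leading request blocks that are present in a replica's cache."""
    count = 0
    for block_hash in request_blocks:
        if block_hash not in cached:
            break
        count += 1
    return count


def pick_replica(request_blocks: List[str], index: Dict[str, Set[str]]) -> str:
    """Return the replica with the longest matching prefix.

    Ties go to the lexicographically smallest replica name (an arbitrary
    choice for this sketch).
    """
    return max(sorted(index), key=lambda name: matched_blocks(request_blocks, index[name]))


# Hypothetical cache state reported by three replicas.
index = {
    "10.0.0.5": {"b1", "b2", "b3"},
    "10.0.0.6": {"b1"},
    "10.0.0.7": set(),
}
print(pick_replica(["b1", "b2", "b4"], index))  # 10.0.0.5
```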

Prerequisites

Ensure you have:

Deploy the model service

Step 1: Prepare the Qwen3-32B model files

  1. Install git-lfs if not already installed.

    # On RHEL/CentOS-based systems
    yum install git-lfs
    
    # On Debian/Ubuntu-based systems
    apt-get install git-lfs

    For other installation methods, see "Installing Git Large File Storage".

  2. Download the Qwen3-32B model from ModelScope.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
    cd Qwen3-32B/
    git lfs pull
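  3. Create an OSS folder and upload the model files. Install ossutil if it is not already installed.

    # Return to the parent directory so that ./Qwen3-32B resolves to the cloned repository
    cd ..
    ossutil mkdir oss://<YOUR-BUCKET-NAME>/Qwen3-32B
    ossutil cp -r ./Qwen3-32B oss://<YOUR-BUCKET-NAME>/Qwen3-32B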
  4. Create llm-model.yaml to define an OSS-backed Secret, persistent volume (PV), and persistent volume claim (PVC). For details, see "Use ossfs 1.0 to create a statically provisioned volume".

    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
    stringData:
      akId: <YOUR-OSS-AK>       # AccessKey ID for OSS access
      akSecret: <YOUR-OSS-SK>   # AccessKey Secret for OSS access
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: <YOUR-BUCKET-NAME>      # OSS bucket name
          url: <YOUR-BUCKET-ENDPOINT>     # OSS endpoint, e.g., oss-cn-hangzhou-internal.aliyuncs.com
          otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
          path: <YOUR-MODEL-PATH>         # Path to model files, e.g., /Qwen3-32B/
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model
  5. Apply the manifest.

    kubectl create -f llm-model.yaml

Step 2: Deploy the vLLM inference service

  1. Create vllm.yaml.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen3
      name: qwen3
    spec:
      progressDeadlineSeconds: 600
      replicas: 3
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: qwen3
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          labels:
            app: qwen3
        spec:
          containers:
            - command:
                - sh
                - '-c'
                - >-
                  vllm serve /models/Qwen3-32B --served-model-name Qwen3-32B
                  --trust-remote-code --port=8000 --max-model-len 8192
                  --gpu-memory-utilization 0.95 --enforce-eager --kv-events-config
                  "{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557\",\"topic\":\"kv@${POD_IP}@Qwen3-32B\"}"
                  --prefix-caching-hash-algo sha256_cbor_64bit --block-size 64
              env:
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: status.podIP
                - name: PYTHONHASHSEED
                  value: '42'
              image: 'registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:0.10.0'
              imagePullPolicy: IfNotPresent
              name: vllm
              ports:
                - containerPort: 8000
                  name: restful
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 30
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8000
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: '1'
                requests:
                  nvidia.com/gpu: '1'
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /models/Qwen3-32B
                  name: model
                - mountPath: /dev/shm
                  name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: llm-model
            - emptyDir:
                medium: Memory
                sizeLimit: 30Gi
              name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen3
      name: qwen3
    spec:
      ports:
        - name: http-serving
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen3
      type: ClusterIP

    The following startup parameters and environment variables enable precise-mode routing. The --block-size and PYTHONHASHSEED values must match the blockSize and hashSeed fields of the InferenceTrafficPolicy that you create in the next section.

    | Parameter / variable | Description |
    | --- | --- |
    | --kv-events-config | KV cache event publishing configuration. Set enable_kv_cache_events to true and publisher to zmq. For endpoint, use tcp://epp-<InferencePool namespace>-<InferencePool name>.envoy-gateway-system.<cluster local domain>:5557. For topic, use kv@${POD_IP}@<served model name>. In this example, with InferencePool qwen-inference-pool in the default namespace and model Qwen3-32B, the values are tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557 and kv@${POD_IP}@Qwen3-32B. |
    | --prefix-caching-hash-algo | Hash algorithm for KV cache prefix blocks. Must be sha256_cbor_64bit. |
    | --block-size | Number of tokens per KV cache prefix block. Must match blockSize in InferenceTrafficPolicy. In this example: 64. |
    | PYTHONHASHSEED | Python hash seed. Must be non-zero and match hashSeed in InferenceTrafficPolicy. In this example: 42. |
  2. Deploy the vLLM inference service.

    kubectl create -f vllm.yaml
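
The --kv-events-config value is easy to mistype by hand, so it can help to generate it. The sketch below assembles the JSON from the endpoint and topic patterns described in the parameter table. The function name `kv_events_config` and its arguments are hypothetical, and the default cluster domain svc.cluster.local is taken from this example's manifest and may differ in your cluster.

```python
import json


def kv_events_config(pool_namespace: str, pool_name: str, served_model_name: str,
                     cluster_domain: str = "svc.cluster.local") -> str:
    """Build the JSON string passed to --kv-events-config (hypothetical helper)."""
    endpoint = (
        f"tcp://epp-{pool_namespace}-{pool_name}"
        f".envoy-gateway-system.{cluster_domain}:5557"
    )
    # ${POD_IP} is left as a literal so the container shell expands it at startup.
    topic = "kv@${POD_IP}@" + served_model_name
    config = {
        "enable_kv_cache_events": True,
        "publisher": "zmq",
        "endpoint": endpoint,
        "topic": topic,
    }
    return json.dumps(config, separators=(",", ":"))


print(kv_events_config("default", "qwen-inference-pool", "Qwen3-32B"))
```

When the string is embedded in the double-quoted shell argument used in vllm.yaml, the inner double quotes still need to be escaped as in the manifest.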

Deploy inference routing

Step 1: Deploy the inference routing policy

  1. Create inference-policy.yaml. blockSize and hashSeed must match --block-size and PYTHONHASHSEED in your vLLM deployment.

    # InferencePool selects the vLLM workload pods for routing
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen3
    ---
    # InferenceTrafficPolicy configures KV cache-aware load balancing for the pool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      profile:
        single:                  # Backend is a single-model vLLM deployment
          trafficPolicy:
            prefixCache:
              mode: tracking     # Enables KV cache-aware load balancing (precise mode)
              trackingConfig:
                indexerConfig:
                  tokenProcessorConfig:
                    blockSize: 64            # Must match vLLM --block-size
                    hashSeed: 42             # Must match vLLM PYTHONHASHSEED
                    model: Qwen/Qwen3-32B   # Official ModelScope model name
  2. Apply the policy.

    kubectl apply -f inference-policy.yaml
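
Because a mismatch between the vLLM settings and the policy would prevent block hashes from lining up, a quick consistency check before deploying can be useful. The following sketch compares the two manifests using plain text matching rather than a YAML parser, so it assumes the manifests are laid out as in this example. The helper names and the inline sample snippets are hypothetical; in practice you would read vllm.yaml and inference-policy.yaml from disk.

```python
import re
from typing import Dict, Optional


def _first_int(pattern: str, text: str) -> Optional[int]:
    """Return the first integer captured by pattern, or None if nothing matches."""
    match = re.search(pattern, text)
    return int(match.group(1)) if match else None


def routing_settings(vllm_yaml: str, policy_yaml: str) -> Dict[str, Optional[int]]:
    """Extract the values that must agree between the two manifests."""
    return {
        "vllm_block_size": _first_int(r"--block-size[ =](\d+)", vllm_yaml),
        "vllm_hash_seed": _first_int(r"PYTHONHASHSEED\s+value:\s*['\"]?(\d+)", vllm_yaml),
        "policy_block_size": _first_int(r"blockSize:\s*(\d+)", policy_yaml),
        "policy_hash_seed": _first_int(r"hashSeed:\s*(\d+)", policy_yaml),
    }


def settings_match(settings: Dict[str, Optional[int]]) -> bool:
    """True only if both pairs were found and are equal."""
    return (
        settings["vllm_block_size"] is not None
        and settings["vllm_block_size"] == settings["policy_block_size"]
        and settings["vllm_hash_seed"] is not None
        and settings["vllm_hash_seed"] == settings["policy_hash_seed"]
    )


# Abbreviated stand-ins for the two manifests in this guide.
vllm_sample = """
            - >-
              vllm serve /models/Qwen3-32B --block-size 64
          env:
            - name: PYTHONHASHSEED
              value: '42'
"""
policy_sample = """
                blockSize: 64
                hashSeed: 42
"""
settings = routing_settings(vllm_sample, policy_sample)
print(settings_match(settings))  # True
```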

Step 2: Deploy the gateway and routing rules

  1. Create inference-gateway.yaml with the Gateway, HTTPRoute, and a backend timeout policy.

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: ack-gateway
      listeners:
      - name: http-llm
        protocol: HTTP
        port: 8080
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - name: inference-gateway
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /v1
        backendRefs:
        - name: qwen-inference-pool
          kind: InferencePool
          group: inference.networking.x-k8s.io
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
  2. Apply the manifest.

    kubectl apply -f inference-gateway.yaml

Step 3: Verify routing

Send two requests with the same prefix and verify both reach the same vLLM replica.

  1. Create two request payloads that share the same content in the first message.

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  2. Get the gateway's public IP address.

    export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  3. Send both requests.

    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. Check the inference extension logs. Both entries should show the same endpoint.Address, confirming both requests reached the same vLLM replica.

    kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system | grep "handled"

    Expected output:

    2025-08-19T10:16:12Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "00d5c24e-b3c8-461d-9848-7bb233243eb9", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
    2025-08-19T10:16:19Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "401925f5-fe65-46e3-8494-5afd83921ba5", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}

    In this example, both requests show Address:10.0.0.5, confirming they were routed to the same pod.
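
The comparison can also be automated. The sketch below extracts the Address field from each "Request handled" log line and checks that all of them agree. The function names are hypothetical, the sample log lines are abbreviated from the expected output above, and the parsing assumes the log format shown there.

```python
import re
from typing import List


def endpoint_addresses(log_text: str) -> List[str]:
    """Return the Address value from every 'Request handled' line, in order."""
    addresses: List[str] = []
    for line in log_text.splitlines():
        if "Request handled" not in line:
            continue
        match = re.search(r"Address:(\S+)", line)
        if match:
            addresses.append(match.group(1))
    return addresses


def same_replica(log_text: str) -> bool:
    """True if at least two requests were handled and all reached one address."""
    addresses = endpoint_addresses(log_text)
    return len(addresses) >= 2 and len(set(addresses)) == 1


# Abbreviated versions of the two log lines in the expected output.
sample_log = """\
2025-08-19T10:16:12Z LEVEL(-2) Request handled {"endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3]}"}
2025-08-19T10:16:19Z LEVEL(-2) Request handled {"endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3]}"}
"""
print(endpoint_addresses(sample_log))  # ['10.0.0.5', '10.0.0.5']
print(same_replica(sample_log))  # True
```

To use it against a live cluster, save the output of the kubectl logs command to a file and pass its contents to same_replica.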