Implement KVCache-aware load balancing with intelligent inference routing

KVCache-aware load balancing is designed for generative AI inference scenarios. By dynamically dispatching requests to the optimal compute node, it can significantly improve the serving efficiency of large language models (LLMs). This topic describes how to implement a KVCache-aware load-balancing policy with the Gateway with Inference Extension component.

Background information

vLLM

vLLM is a popular, efficient, and easy-to-use framework for building LLM inference services. It supports many common large language models, including Qwen. With optimization techniques such as PagedAttention, continuous batching, and model quantization, vLLM achieves good inference efficiency for large language models.

KV Cache

During inference, the model caches the keys (Key) and values (Value) it generates so that the context of previous requests can be accessed quickly, which improves the efficiency of text generation. With KV Cache, the model avoids redundant computation, significantly speeding up inference and reducing response latency.

Automatic prefix caching in vLLM

vLLM supports automatic prefix caching (APC). APC caches the KVCache of requests that vLLM has already computed. When a new request shares a prefix with a previous request, it can directly reuse the existing KVCache and skip the KVCache computation for the shared prefix, which accelerates the processing of LLM inference requests.
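
As a minimal sketch (not part of the deployment in this topic), automatic prefix caching can also be turned on explicitly when starting vLLM. This assumes a recent vLLM release in which the --enable-prefix-caching flag is available; in current versions it is typically enabled by default:

    # Sketch: start a vLLM server with automatic prefix caching explicitly enabled
    vllm serve Qwen/Qwen3-32B --enable-prefix-caching --max-model-len 8192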

Relationship between KVCache-aware load balancing and prefix-aware load balancing

KVCache-aware load balancing refers to the load-balancing policy described below:

Each vLLM workload reports information about the KVCache blocks it has cached to Gateway with Inference Extension through event messages. By tracking the KVCache blocks cached by each vLLM workload, the gateway schedules each new request, based on its content, to the vLLM workload with the highest cache hit rate. This maximizes the prefix cache hit rate and reduces request response time. The policy is mainly suitable for scenarios with a large number of requests that share prefixes; evaluate it against your actual business scenario.

KVCache-aware load balancing has the same goal as prefix-aware load balancing: both leverage the prefix caching mechanism of the inference framework to maximize the prefix cache hit rate.

  • Because KVCache-aware load balancing directly receives the distribution of KVCache blocks, it can determine the cache state of a request more precisely and maximize the prefix cache hit rate. However, it requires the inference service to be built on vLLM v0.10.0 or later and to be started with KVCache event reporting configured.

  • Prefix-aware load balancing is decoupled from the inference engine, but it cannot precisely track the distribution of the KVCache.

Important

To use KVCache-aware load balancing, your inference service must be built on vLLM v0.10.0 or later, and the KVCache event parameters must be specified in the vLLM startup arguments. For the detailed settings, see the example in this topic.

Prerequisites

Deploy the model service

Step 1: Prepare the Qwen3-32B model files

  1. Download the Qwen3-32B model from ModelScope.

    Confirm that the git-lfs plugin is installed. If it is not, run yum install git-lfs or apt-get install git-lfs to install it. For more installation methods, see Install git-lfs.
    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
    cd Qwen3-32B/
    git lfs pull
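    Optionally, confirm that the weight files were fully pulled, for example by checking the total size of the model directory:
    du -sh .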
  2. Create a directory in OSS and upload the model to OSS.

    For how to install and use the ossutil tool, see Install ossutil.
    ossutil mkdir oss://<your-bucket-name>/Qwen3-32B
    ossutil cp -r ./Qwen3-32B oss://<your-bucket-name>/Qwen3-32B
  3. Create a PV and a PVC. Configure a persistent volume (PV) and a persistent volume claim (PVC) named llm-model for the target cluster. For details, see Use ossfs 1.0 static volumes.

    1. Create a file named llm-model.yaml. The YAML file contains the Secret, the static PV, and the static PVC.

      apiVersion: v1
      kind: Secret
      metadata:
        name: oss-secret
      stringData:
        akId: <your-oss-ak> # The AccessKey ID used to access OSS
        akSecret: <your-oss-sk> # The AccessKey Secret used to access OSS
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: llm-model
        labels:
          alicloud-pvname: llm-model
      spec:
        capacity:
          storage: 30Gi 
        accessModes:
          - ReadOnlyMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: llm-model
          nodePublishSecretRef:
            name: oss-secret
            namespace: default
          volumeAttributes:
            bucket: <your-bucket-name> # The bucket name
            url: <your-bucket-endpoint> # The endpoint, for example, oss-cn-hangzhou-internal.aliyuncs.com
            otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
            path: <your-model-path> # /Qwen3-32B/ in this example
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: llm-model
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 30Gi
        selector:
          matchLabels:
            alicloud-pvname: llm-model
    2. Create the Secret, the static PV, and the static PVC.

      kubectl create -f llm-model.yaml
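      After the resources are created, you can optionally confirm that the static volume has been bound before moving on (standard kubectl commands, using the names from the YAML above):
      kubectl get pv llm-model
      kubectl get pvc llm-model
      The STATUS of both should be Bound once binding completes.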

Step 2: Deploy the vLLM inference service

  1. Create a file named vllm.yaml.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen3
      name: qwen3
    spec:
      progressDeadlineSeconds: 600
      replicas: 3
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: qwen3
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          creationTimestamp: null
          labels:
            app: qwen3
        spec:
          containers:
            - command:
                - sh
                - '-c'
                - >-
                  vllm serve /models/Qwen3-32B --served-model-name Qwen3-32B
                  --trust-remote-code --port=8000 --max-model-len 8192
                  --gpu-memory-utilization 0.95 --enforce-eager --kv-events-config
                  "{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557\",\"topic\":\"kv@${POD_IP}@Qwen3-32B\"}"
                  --prefix-caching-hash-algo sha256_cbor_64bit --block-size 64
              env:
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: status.podIP
                - name: PYTHONHASHSEED
                  value: '42'
              image: 'registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:0.10.0'
              imagePullPolicy: IfNotPresent
              name: vllm
              ports:
                - containerPort: 8000
                  name: restful
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 30
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8000
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: '1'
                requests:
                  nvidia.com/gpu: '1'
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /models/Qwen3-32B
                  name: model
                - mountPath: /dev/shm
                  name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: llm-model
            - emptyDir:
                medium: Memory
                sizeLimit: 30Gi
              name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen3
      name: qwen3
    spec:
      ports:
        - name: http-serving
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen3
      type: ClusterIP

    The following describes some of the startup parameters and environment variables:

    • --kv-events-config

      The configuration for publishing KVCache events. It must be a valid JSON string, or JSON keys passed individually. Example value:

      {"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557","topic":"kv@${POD_IP}@Qwen3-32B"}

      Where:

      • endpoint: must be set to the zmq endpoint of the inference extension. The naming convention is tcp://epp-<InferencePool namespace>-<InferencePool name>.envoy-gateway-system.<cluster local domain>:5557. In this example, an InferencePool named qwen-inference-pool is created in the default namespace, so endpoint is set to tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557.

      • topic: the naming convention is kv@${POD_IP}@<served model name>. In this example, vLLM is started with --served-model-name Qwen3-32B, so topic is set to kv@${POD_IP}@Qwen3-32B.

    • --prefix-caching-hash-algo

      The algorithm used to compute the hashes of KVCache prefix cache blocks. It must be set to sha256_cbor_64bit.

    • --block-size

      The number of tokens stored in each KVCache prefix cache block. It is set to 64 in this example.

    • PYTHONHASHSEED

      The seed used by Python when computing hashes. It must be set to a non-zero value; this example uses 42.
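    Because --kv-events-config must be a valid JSON string, you can sanity-check it locally before deploying (the ${POD_IP} placeholder is kept as a literal here):
    echo '{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557","topic":"kv@${POD_IP}@Qwen3-32B"}' | python3 -m json.tool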

  2. Deploy the vLLM inference service.

    kubectl create -f vllm.yaml
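    Optionally, wait for all replicas to pass their readiness probes (loading a 32B model can take several minutes) and confirm the vLLM version inside a running pod. The commands below are a sketch and assume python3 is available in the image:
    kubectl get pods -l app=qwen3 -w
    kubectl exec deploy/qwen3 -- python3 -c 'import vllm; print(vllm.__version__)'  # expect 0.10.0 or later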

Deploy the inference routing

Step 1: Deploy the inference routing policy

  1. Create a file named inference-policy.yaml.

    # The InferencePool declares that inference routing is enabled for the workloads
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen3
    ---
    # The InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      profile: 
        single: # The backend inference service is a single-node vLLM deployment
          trafficPolicy: # The load-balancing policy for the inference service
            prefixCache:
              mode: tracking # Enable KVCache-aware load balancing
              trackingConfig:
                indexerConfig:
                  tokenProcessorConfig:
                    blockSize: 64 # Must match the vLLM --block-size startup parameter
                    hashSeed: 42  # Must match the vLLM PYTHONHASHSEED environment variable
                    model: Qwen/Qwen3-32B # The official name of the served model on ModelScope
  2. Deploy the inference routing policy.

    kubectl apply -f inference-policy.yaml
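    Optionally, confirm that both custom resources were created; this assumes the Gateway with Inference Extension CRDs are installed in the cluster:
    kubectl get inferencepool qwen-inference-pool
    kubectl get inferencetrafficpolicy inference-policy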

Step 2: Deploy the gateway and gateway routing rules

  1. Create a file named inference-gateway.yaml. It contains the gateway, the gateway routing rule, and the gateway backend timeout policy.

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: ack-gateway
      listeners:
      - name: http-llm
        protocol: HTTP
        port: 8080
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - name: inference-gateway
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /v1
        backendRefs:
        - name: qwen-inference-pool
          kind: InferencePool
          group: inference.networking.x-k8s.io
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
  2. Deploy the gateway and the routing rules.

    kubectl apply -f inference-gateway.yaml
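    Before verification, you can check that the gateway has been assigned an address and that the route has been attached:
    kubectl get gateway inference-gateway
    kubectl get httproute inference-route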

Step 3: Verify the routing rules

  1. Create round1.txt and round2.txt. Both files contain the same piece of content (a shared prefix). Send round1.txt and then round2.txt as the bodies of LLM requests, then check the logs of the inference extension to verify whether the KVCache-aware load balancing of intelligent routing is triggered.

    round1.txt:

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

    round2.txt:

    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  2. Obtain the public IP address of the gateway.

    export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  3. Send two chat requests to simulate a multi-turn conversation.

    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. Check the logs to confirm whether the prefix-based load balancing has taken effect.

    kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system|grep "handled"

    Expected output:

    2025-08-19T10:16:12Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "00d5c24e-b3c8-461d-9848-7bb233243eb9", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
    2025-08-19T10:16:19Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "401925f5-fe65-46e3-8494-5afd83921ba5", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}

    The two requests, which share the same prefix, are forwarded to the same workload, indicating that KVCache-aware load balancing has taken effect.