Implement KVCache-aware load balancing with intelligent inference routing

KVCache-aware load balancing is designed for generative AI inference scenarios. By dynamically dispatching requests to the optimal compute node, it can significantly improve the serving efficiency of large language models (LLMs). This topic describes how to implement a KVCache-aware load-balancing policy with the Gateway with Inference Extension component.

Background information

vLLM

vLLM is a popular, efficient, and easy-to-use framework for building LLM inference services. It supports many common large language models, including Qwen. With optimization techniques such as PagedAttention, continuous batching, and model quantization, vLLM achieves good inference efficiency for large language models.

KV Cache

During inference, the model caches the keys (Key) and values (Value) it generates so that the context of previous requests can be accessed quickly, which improves the efficiency of text generation. With KV Cache, the model avoids redundant computation, significantly speeding up inference and reducing response latency.

Automatic prefix caching in vLLM

vLLM supports automatic prefix caching (APC). APC caches the KVCache of requests that vLLM has already computed. When a new request shares a prefix with a previous request, it can directly reuse the existing KVCache and skip the KVCache computation for the shared prefix, which accelerates the processing of LLM inference requests.
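
As a minimal sketch (not part of the deployment in this topic), automatic prefix caching can also be turned on explicitly when starting vLLM. This assumes a recent vLLM release in which the --enable-prefix-caching flag is available; in current versions it is typically enabled by default:

    # Sketch: start a vLLM server with automatic prefix caching explicitly enabled
    vllm serve Qwen/Qwen3-32B --enable-prefix-caching --max-model-len 8192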

Relationship between KVCache-aware load balancing and prefix-aware load balancing

KVCache-aware load balancing refers to the load-balancing policy described below:

Each vLLM workload reports information about the KVCache blocks it has cached to Gateway with Inference Extension through event messages. By tracking the KVCache blocks cached by each vLLM workload, the gateway schedules each new request, based on its content, to the vLLM workload with the highest cache hit rate. This maximizes the prefix cache hit rate and reduces request response time. The policy is mainly suitable for scenarios with a large number of requests that share prefixes; evaluate it against your actual business scenario.

KVCache-aware load balancing has the same goal as prefix-aware load balancing: both leverage the prefix caching mechanism of the inference framework to maximize the prefix cache hit rate.

  • Because KVCache-aware load balancing directly receives the distribution of KVCache blocks, it can determine the cache state of a request more precisely and maximize the prefix cache hit rate. However, it requires the inference service to be built on vLLM v0.10.0 or later and to be started with KVCache event reporting configured.

  • Prefix-aware load balancing is decoupled from the inference engine, but it cannot precisely track the distribution of the KVCache.

Important

To use KVCache-aware load balancing, your inference service must be built on vLLM v0.10.0 or later, and the KVCache event parameters must be specified in the vLLM startup arguments. For the detailed settings, see the example in this topic.

Prerequisites

Deploy the model service

Step 1: Prepare the Qwen3-32B model files

  1. Download the Qwen3-32B model from ModelScope.

    Confirm that the git-lfs plugin is installed. If it is not, run yum install git-lfs or apt-get install git-lfs to install it. For more installation methods, see Install git-lfs.
    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
    cd Qwen3-32B/
    git lfs pull
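    Optionally, confirm that the weight files were fully pulled, for example by checking the total size of the model directory:
    du -sh .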
  2. Create a directory in OSS and upload the model to OSS.

    For how to install and use the ossutil tool, see Install ossutil.
    ossutil mkdir oss://<your-bucket-name>/Qwen3-32B
    ossutil cp -r ./Qwen3-32B oss://<your-bucket-name>/Qwen3-32B
  3. Create a PV and a PVC. Configure a persistent volume (PV) and a persistent volume claim (PVC) named llm-model for the target cluster. For details, see Use ossfs 1.0 static volumes.

    1. Create a file named llm-model.yaml. The YAML file contains the Secret, the static PV, and the static PVC.

      apiVersion: v1
      kind: Secret
      metadata:
        name: oss-secret
      stringData:
        akId: <your-oss-ak> # The AccessKey ID used to access OSS
        akSecret: <your-oss-sk> # The AccessKey Secret used to access OSS
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: llm-model
        labels:
          alicloud-pvname: llm-model
      spec:
        capacity:
          storage: 30Gi 
        accessModes:
          - ReadOnlyMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: llm-model
          nodePublishSecretRef:
            name: oss-secret
            namespace: default
          volumeAttributes:
            bucket: <your-bucket-name> # The bucket name
            url: <your-bucket-endpoint> # The endpoint, for example, oss-cn-hangzhou-internal.aliyuncs.com
            otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
            path: <your-model-path> # /Qwen3-32B/ in this example
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: llm-model
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 30Gi
        selector:
          matchLabels:
            alicloud-pvname: llm-model
    2. Create the Secret, the static PV, and the static PVC.

      kubectl create -f llm-model.yaml
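      After the resources are created, you can optionally confirm that the static volume has been bound before moving on (standard kubectl commands, using the names from the YAML above):
      kubectl get pv llm-model
      kubectl get pvc llm-model
      The STATUS of both should be Bound once binding completes.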

Step 2: Deploy the vLLM inference service

  1. Create a file named vllm.yaml.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen3
      name: qwen3
    spec:
      progressDeadlineSeconds: 600
      replicas: 3
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: qwen3
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          creationTimestamp: null
          labels:
            app: qwen3
        spec:
          containers:
            - command:
                - sh
                - '-c'
                - >-
                  vllm serve /models/Qwen3-32B --served-model-name Qwen3-32B
                  --trust-remote-code --port=8000 --max-model-len 8192
                  --gpu-memory-utilization 0.95 --enforce-eager --kv-events-config
                  "{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557\",\"topic\":\"kv@${POD_IP}@Qwen3-32B\"}"
                  --prefix-caching-hash-algo sha256_cbor_64bit --block-size 64
              env:
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: status.podIP
                - name: PYTHONHASHSEED
                  value: '42'
              image: 'registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:0.10.0'
              imagePullPolicy: IfNotPresent
              name: vllm
              ports:
                - containerPort: 8000
                  name: restful
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 30
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8000
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: '1'
                requests:
                  nvidia.com/gpu: '1'
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /models/Qwen3-32B
                  name: model
                - mountPath: /dev/shm
                  name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: llm-model
            - emptyDir:
                medium: Memory
                sizeLimit: 30Gi
              name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen3
      name: qwen3
    spec:
      ports:
        - name: http-serving
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen3
      type: ClusterIP

    The following describes some of the startup parameters and environment variables:

    • --kv-events-config

      The configuration for publishing KVCache events. It must be a valid JSON string, or JSON keys passed individually. Example value:

      {"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557","topic":"kv@${POD_IP}@Qwen3-32B"}

      Where:

      • endpoint: must be set to the zmq endpoint of the inference extension. The naming convention is tcp://epp-<InferencePool namespace>-<InferencePool name>.envoy-gateway-system.<cluster local domain>:5557. In this example, an InferencePool named qwen-inference-pool is created in the default namespace, so endpoint is set to tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557.

      • topic: the naming convention is kv@${POD_IP}@<served model name>. In this example, vLLM is started with --served-model-name Qwen3-32B, so topic is set to kv@${POD_IP}@Qwen3-32B.

    • --prefix-caching-hash-algo

      The algorithm used to compute the hashes of KVCache prefix cache blocks. It must be set to sha256_cbor_64bit.

    • --block-size

      The number of tokens stored in each KVCache prefix cache block. It is set to 64 in this example.

    • PYTHONHASHSEED

      The seed used by Python when computing hashes. It must be set to a non-zero value; this example uses 42.
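    Because --kv-events-config must be a valid JSON string, you can sanity-check it locally before deploying (the ${POD_IP} placeholder is kept as a literal here):
    echo '{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557","topic":"kv@${POD_IP}@Qwen3-32B"}' | python3 -m json.tool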

  2. Deploy the vLLM inference service.

    kubectl create -f vllm.yaml
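    Optionally, wait for all replicas to pass their readiness probes (loading a 32B model can take several minutes) and confirm the vLLM version inside a running pod. The commands below are a sketch and assume python3 is available in the image:
    kubectl get pods -l app=qwen3 -w
    kubectl exec deploy/qwen3 -- python3 -c 'import vllm; print(vllm.__version__)'  # expect 0.10.0 or later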

Deploy the inference routing

Step 1: Deploy the inference routing policy

  1. Create a file named inference-policy.yaml.

    # The InferencePool declares that inference routing is enabled for the workloads
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen3
    ---
    # The InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      profile: 
        single: # The backend inference service is a single-node vLLM deployment
          trafficPolicy: # The load-balancing policy for the inference service
            prefixCache:
              mode: tracking # Enable KVCache-aware load balancing
              trackingConfig:
                indexerConfig:
                  tokenProcessorConfig:
                    blockSize: 64 # Must match the vLLM --block-size startup parameter
                    hashSeed: 42  # Must match the vLLM PYTHONHASHSEED environment variable
                    model: Qwen/Qwen3-32B # The official name of the served model on ModelScope
  2. Deploy the inference routing policy.

    kubectl apply -f inference-policy.yaml
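    Optionally, confirm that both custom resources were created; this assumes the Gateway with Inference Extension CRDs are installed in the cluster:
    kubectl get inferencepool qwen-inference-pool
    kubectl get inferencetrafficpolicy inference-policy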

Step 2: Deploy the gateway and gateway routing rules

  1. Create a file named inference-gateway.yaml. It contains the gateway, the gateway routing rule, and the gateway backend timeout policy.

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: ack-gateway
      listeners:
      - name: http-llm
        protocol: HTTP
        port: 8080
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - name: inference-gateway
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /v1
        backendRefs:
        - name: qwen-inference-pool
          kind: InferencePool
          group: inference.networking.x-k8s.io
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
  2. Deploy the gateway and the routing rules.

    kubectl apply -f inference-gateway.yaml
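    Before verification, you can check that the gateway has been assigned an address and that the route has been attached:
    kubectl get gateway inference-gateway
    kubectl get httproute inference-route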

Step 3: Verify the routing rules

  1. Create round1.txt and round2.txt. Both files contain the same piece of content (a shared prefix). Send round1.txt and then round2.txt as the bodies of LLM requests, then check the logs of the inference extension to verify whether the KVCache-aware load balancing of intelligent routing is triggered.

    round1.txt:

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

    round2.txt:

    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  2. Obtain the public IP address of the gateway.

    export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  3. Send two chat requests to simulate a multi-turn conversation.

    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. Check the logs to confirm whether the prefix-based load balancing has taken effect.

    kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system|grep "handled"

    Expected output:

    2025-08-19T10:16:12Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "00d5c24e-b3c8-461d-9848-7bb233243eb9", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
    2025-08-19T10:16:19Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "401925f5-fe65-46e3-8494-5afd83921ba5", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}

    The two requests, which share the same prefix, are forwarded to the same workload, indicating that KVCache-aware load balancing has taken effect.