Implement prefix-aware load balancing with Gateway with Inference Extension

With the Gateway with Inference Extension component, you can choose different load balancing strategies for inference routing based on the usage scenarios of your generative AI inference service. This topic describes how to use the Gateway with Inference Extension component to implement a prefix-aware load balancing strategy.

Important
  • Before reading this topic, make sure that you are familiar with the InferencePool and InferenceModel concepts.

  • This topic requires Gateway with Inference Extension 1.4.0 or later.

Background information

Automatic prefix caching in vLLM

vLLM supports automatic prefix caching (APC). APC caches the KV cache of requests that vLLM has already computed. If a new request shares a prefix with a previous request, it can directly reuse the existing KV cache and skip the KV cache computation for the shared prefix, which speeds up the processing of LLM inference requests.
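
For reference, APC is turned on with a startup flag when the vLLM server is launched. The following is a minimal sketch (the model path and port are illustrative); the sample Deployment in Step 1 below passes the same --enable_prefix_caching flag as part of its container command:

    # Start a vLLM OpenAI-compatible server with automatic prefix caching enabled.
    # The model path below is an example; replace it with your own model directory.
    vllm serve /models/Qwen-2.5-7B-Instruct \
        --port 8000 \
        --enable_prefix_caching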

Prefix-aware load balancing strategy

A prefix-aware load balancing strategy routes requests that share the same prefix to the same inference server pod whenever possible.

When APC is enabled on the model server, a prefix-aware load balancing strategy raises the prefix cache hit rate as much as possible and reduces request response time. This strategy is mainly suitable for workloads with a large number of requests that share prefixes; evaluate whether it fits your actual business scenario.

Typical use cases:

  • Long document queries: users repeatedly query the same long document (such as a software manual or an annual report) with different questions.

  • Multi-turn conversations: users may interact with the application multiple times within the same chat session.

Prerequisites

Note

For the image used in this topic, A10 GPUs are recommended for ACK clusters, and the L20 (GN8IS) GPU model is recommended for ACS GPU compute.

In addition, because the LLM image is large, we recommend that you copy it to ACR in advance and pull it through the internal network address. Pulling directly from the public internet depends on the bandwidth of the cluster's EIP and may take a long time.
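
As a rough sketch, you could copy the image with docker pull, tag, and push; the registry address and namespace below are placeholders for your own ACR instance:

    # Pull the public image, re-tag it for your own ACR instance, and push it.
    # <your-acr-registry> and <your-namespace> are placeholders; replace them with your own values.
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 \
        <your-acr-registry>/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1
    docker push <your-acr-registry>/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1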

Procedure

Step 1: Deploy the sample inference service

  1. Create vllm-service.yaml.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      progressDeadlineSeconds: 600
      replicas: 5
      selector:
        matchLabels:
          app: qwen
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
            alibabacloud.com/compute-class: gpu
            alibabacloud.com/compute-qos: default
            alibabacloud.com/gpu-model-series: GN8IS
        spec:
          containers:
            - command:
                - sh
                - -c
                - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --enable_prefix_caching --trust-remote-code --served-model-name /model/qwen --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
              image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
              imagePullPolicy: IfNotPresent
              name: custom-serving
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 30
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8000
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: "1"
                  cpu: "8"
                  memory: 30G
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          restartPolicy: Always
          volumes:
            - emptyDir:
                medium: Memory
                sizeLimit: 30Gi
              name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
        - name: http-serving
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen
  2. Deploy the sample inference service.

    kubectl apply -f vllm-service.yaml
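
    Optionally, confirm that all replicas are running and ready before you continue; for example:

    # The app=qwen label matches the Deployment above; wait until all 5 replicas are Ready.
    kubectl get pods -l app=qwen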

Step 2: Deploy the inference route

This step creates an InferencePool resource and an InferenceModel resource.

  1. Create inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
      name: vllm-qwen-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-qwen
    spec:
      modelName: /model/qwen
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-qwen-pool
      targetModels:
      - name: /model/qwen
        weight: 100

    In the InferencePool resource, the inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" annotation enables the prefix-aware load balancing strategy for the pods in the InferencePool.

  2. Deploy the inference route.

    kubectl apply -f inference-pool.yaml
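
    Optionally, confirm that both resources have been created (this assumes the Gateway with Inference Extension CRDs are already installed in the cluster); for example:

    # Resource names match the manifests above.
    kubectl get inferencepool vllm-qwen-pool
    kubectl get inferencemodel inferencemodel-qwen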

Step 3: Deploy the gateway and gateway routing rules

This step creates a gateway that listens on ports 8080 and 8081. On port 8081, an HTTPRoute resource sets the route backend to the InferencePool provided by the inference extension, so inference requests are routed to the set of pods selected by the InferencePool. On port 8080, an HTTPRoute resource sets the route backend to the Service, so inference requests are routed to the same set of pods using the ordinary HTTP least-request load balancing strategy.

  1. Create inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: qwen-inference-gateway-class
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: qwen-inference-gateway
    spec:
      gatewayClassName: qwen-inference-gateway-class
      listeners:
        - name: http
          protocol: HTTP
          port: 8080
        - name: llm-gw
          protocol: HTTP
          port: 8081
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: qwen-backend
    spec:
      parentRefs:
        - name: qwen-inference-gateway
          sectionName: llm-gw
      rules:
        - backendRefs:
            - group: inference.networking.x-k8s.io
              kind: InferencePool
              name: vllm-qwen-pool
          matches:
            - path:
                type: PathPrefix
                value: /
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: qwen-backend-no-inference
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: qwen-inference-gateway
        sectionName: http
      rules:
      - backendRefs:
        - group: ""
          kind: Service
          name: qwen
          port: 8000
          weight: 1
        matches:
        - path:
            type: PathPrefix
            value: /
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 1h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: qwen-inference-gateway
  2. Deploy the gateway.

    kubectl apply -f inference-gateway.yaml
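
    Optionally, check that the gateway and its routes have been created and that the gateway has been assigned an address before you continue; for example:

    # The ADDRESS column should show the gateway address once it is programmed.
    kubectl get gateway qwen-inference-gateway
    kubectl get httproute qwen-backend qwen-backend-no-inference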

Step 4: Verify the routing rules

  1. Create round1.txt and round2.txt. Both files contain the same shared content prefix. Send round1.txt and then round2.txt as the bodies of two LLM requests, and then check the logs of the inference routing extension (extensionRef) to verify whether prefix-aware routing is triggered.

    round1.txt:

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

    round2.txt:

    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  2. Get the public IP address of the gateway.

    export GATEWAY_IP=$(kubectl get gateway/qwen-inference-gateway -o jsonpath='{.status.addresses[0].value}')
  3. Send the two chat requests to simulate a multi-turn conversation.

    curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. Check the logs to confirm whether prefix-aware load balancing has taken effect.

    kubectl logs deploy/epp-default-inference-gateway-ext-proc -n envoy-gateway-system|grep "Do prefix"

    Expected output:

    2025-05-23T03:33:09Z    INFO    scheduling/prefixcache_filter.go:311    Do prefix-aware routing!        {"request": "v68m4zx472", "matching ratio": " 0.54 > 0.50"}

    The Do prefix-aware routing! message in the log indicates that prefix-aware load balancing has taken effect.

(Optional) Step 5: Evaluate inference service performance with a multi-turn conversation test

Using an ACK cluster as an example, this step runs a multi-turn conversation benchmark with a load testing tool to compare the prefix-aware load balancing of the inference route against the ordinary HTTP route.

  1. Deploy the llm-qa-benchmark load testing tool.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: llm-qa-benchmark
      name: llm-qa-benchmark
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm-qa-benchmark
      template:
        metadata:
          labels:
            app: llm-qa-benchmark
        spec:
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-qa-benchmark:v0.1
            imagePullPolicy: IfNotPresent
            name: llm-qa-benchmark
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          restartPolicy: Always
    EOF
  2. Get the internal IP address of the gateway.

    export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=qwen-inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
  3. Run the benchmark.

    Important

    The following test results were generated in a test environment. Actual results depend on your environment.

    Ordinary HTTP route

    kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
        --num-users 8 \
        --num-rounds 15 \
        --qps 0.1 \
        --shared-system-prompt 100 \
        --sharegpt \
        --user-history-prompt 2000 \
        --answer-len 100 \
        --model /model/qwen \
        --time 600 \
        --base-url http://${GW_IP}:8080/v1

    Expected output:

    ==================== Performance summary ======================
      QPS: 0.1000 reqs/s
    
      Processing speed: 0.1080 reqs/s
    
      Requests on-the-fly: 0
    
      Input tokens per second: 259.0703 tokens/s
    
      Output tokens per second: 4.8576 tokens/s
    
      Average generation throughput (per request): 26.6710 tokens/req/s
    
      Average TTFT: 0.3669s
    
    Time range: 1748231183.2753935 - 1748231766.4799275 (583.20s)
    ===============================================================

    Inference service route

    kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
        --num-users 8 \
        --num-rounds 15 \
        --qps 0.1 \
        --shared-system-prompt 100 \
        --sharegpt \
        --user-history-prompt 2000 \
        --answer-len 100 \
        --model /model/qwen \
        --time 600 \
        --base-url http://${GW_IP}:8081/v1

    Expected output:

    ==================== Performance summary ======================
      QPS: 0.1000 reqs/s
    
      Processing speed: 0.1081 reqs/s
    
      Requests on-the-fly: 0
    
      Input tokens per second: 259.3009 tokens/s
    
      Output tokens per second: 4.8548 tokens/s
    
      Average generation throughput (per request): 26.9300 tokens/req/s
    
      Average TTFT: 0.2761s
    
    Time range: 1748231885.874972 - 1748232468.5918882 (582.72s)
    ===============================================================

    As the output shows, the Average TTFT of the inference service route (0.2761s) is noticeably better than that of the ordinary HTTP route (0.3669s), roughly a 25% reduction.