With the Gateway with Inference Extension component, you can apply different load balancing strategies for inference-service routing based on the different usage scenarios of generative AI inference services. This topic describes how to use the Gateway with Inference Extension component to implement a prefix-aware load balancing strategy.
Before reading this topic, make sure that you are familiar with the concepts of InferencePool and InferenceModel.
This topic requires Gateway with Inference Extension 1.4.0 or later.
Background information
Automatic prefix caching in vLLM
vLLM supports automatic prefix caching (APC). APC caches the KV Cache of requests that vLLM has already computed, so that a new request sharing a prefix with a previous request can directly reuse the existing KV Cache and skip the KV Cache computation for the shared prefix, which speeds up the processing of LLM inference requests.
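For reference, APC is an opt-in feature of the vLLM server. The following is a minimal sketch of starting an OpenAI-compatible vLLM server with APC enabled; the model path /model/qwen follows the rest of this topic, and the remaining launch arguments are assumptions to adapt to your deployment:
# Start an OpenAI-compatible vLLM server with automatic prefix caching (APC) enabled.
# --enable-prefix-caching lets requests that share a prefix reuse cached KV Cache blocks.
python3 -m vllm.entrypoints.openai.api_server \
  --model /model/qwen \
  --port 8000 \
  --enable-prefix-caching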
Prefix-aware load balancing strategy
A prefix-aware load balancing strategy sends requests that share the same prefix content to the same inference server Pod whenever possible.
When APC is enabled on the model server, a prefix-aware load balancing strategy raises the prefix cache hit ratio as much as possible and reduces request response time. This strategy is mainly suited to scenarios with a large number of requests that share prefixes; evaluate it against your actual business scenario.
Typical use cases include:
Long-document queries: users repeatedly query the same long document (for example, a software manual or an annual report) with different queries.
Multi-turn conversations: users may interact with the application multiple times within the same chat session.
Prerequisites
An ACK managed cluster with a GPU node pool has been created. Alternatively, you can install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU compute power.
Gateway with Inference Extension 1.4.0 or later has been installed, with the Enable Gateway API Inference Extension option selected. For the operation entry point, see Install components.
For the image used in this topic, the A10 GPU type is recommended for ACK clusters, and the L20 (GN8IS) GPU type is recommended for ACS GPU compute power.
In addition, because LLM images are large, we recommend that you copy the image to ACR in advance and pull it over the internal network. Pulling directly from the Internet depends on the bandwidth of the cluster's EIP and can involve a long wait.
Procedure
Step 1: Deploy a sample inference service
Create vllm-service.yaml.
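The full content of vllm-service.yaml is not reproduced here. The sketch below is a minimal, hypothetical example that matches the names used in later steps (label app: qwen, Service qwen on port 8000, model path /model/qwen, APC enabled); the image placeholder, replica count, and resource settings are assumptions, and the model volume mount is omitted for brevity. Adapt it to your environment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen
  labels:
    app: qwen
spec:
  replicas: 3
  selector:
    matchLabels:
      app: qwen
  template:
    metadata:
      labels:
        app: qwen
    spec:
      containers:
      - name: vllm
        # Hypothetical image reference; use your own vLLM image, ideally via an ACR internal address.
        image: <your-vllm-image>
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model=/model/qwen
        - --port=8000
        # APC must be enabled for the prefix-aware strategy to pay off.
        - --enable-prefix-caching
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: qwen
spec:
  selector:
    app: qwen
  ports:
  - port: 8000
    targetPort: 8000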
Deploy the sample inference service.
kubectl apply -f vllm-service.yaml
Step 2: Deploy the inference route
This step creates the InferencePool and InferenceModel resources.
Create inference-pool.yaml.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
  name: vllm-qwen-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-qwen
spec:
  modelName: /model/qwen
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-qwen-pool
  targetModels:
  - name: /model/qwen
    weight: 100
In the InferencePool resource, the inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" annotation enables the prefix-aware load balancing strategy for the Pods in the InferencePool.
Deploy the inference route.
kubectl apply -f inference-pool.yaml
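You can optionally confirm that both resources were created. The lowercase kind names below are accepted by kubectl for these CRDs; adjust if your CRD version differs:
kubectl get inferencepool vllm-qwen-pool
kubectl get inferencemodel inferencemodel-qwen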
Step 3: Deploy the gateway and gateway routing rules
This step creates a gateway with two ports, 8080 and 8081. On port 8081, an HTTPRoute resource sets the gateway route backend to the InferencePool provided by the inference extension, so inference requests are routed to the set of Pods selected by the InferencePool. On port 8080, an HTTPRoute resource sets the gateway route backend to a Service, so inference requests are routed to the same set of Pods through the ordinary HTTP least-request load balancing strategy.
Create inference-gateway.yaml.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: qwen-inference-gateway-class
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: qwen-inference-gateway
spec:
  gatewayClassName: qwen-inference-gateway-class
  listeners:
  - name: http
    protocol: HTTP
    port: 8080
  - name: llm-gw
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-backend
spec:
  parentRefs:
  - name: qwen-inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-qwen-pool
    matches:
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-backend-no-inference
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen-inference-gateway
    sectionName: http
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: qwen
      port: 8000
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 1h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen-inference-gateway
Deploy the gateway.
kubectl apply -f inference-gateway.yaml
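Before verifying the routes, you can check that the gateway has been accepted and assigned an address. Programmed is a standard Gateway API status condition:
kubectl get gateway qwen-inference-gateway
# Block until the gateway reports Programmed=True.
kubectl wait --for=condition=Programmed gateway/qwen-inference-gateway --timeout=5m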
Step 4: Verify the routing rules
Create round1.txt and round2.txt. Both files contain an identical content segment. By sending round1.txt and then round2.txt as the bodies of LLM requests and then checking the logs of the extensionRef component of the inference route, you can verify whether prefix-aware routing is triggered.
round1.txt:
echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
round2.txt:
echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
Obtain the public IP address of the gateway.
export GATEWAY_IP=$(kubectl get gateway/qwen-inference-gateway -o jsonpath='{.status.addresses[0].value}')
Send two chat requests to simulate a multi-turn conversation scenario.
curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
Check the logs to confirm whether prefix-aware load balancing has taken effect.
kubectl logs deploy/epp-default-inference-gateway-ext-proc -n envoy-gateway-system|grep "Do prefix"
Expected output:
2025-05-23T03:33:09Z INFO scheduling/prefixcache_filter.go:311 Do prefix-aware routing! {"request": "v68m4zx472", "matching ratio": " 0.54 > 0.50"}
The Do prefix-aware routing! message in the log indicates that prefix-aware load balancing has taken effect. The matching ratio field shows why: the computed prefix match ratio (0.54) exceeded the 0.50 threshold, so the second request was routed to the Pod that holds the matching prefix cache.
(Optional) Step 5: Evaluate inference service performance with a multi-turn conversation test
This step uses an ACK cluster as an example and demonstrates how to run a multi-turn conversation test with a benchmark tool to compare the prefix-aware load balancing effect of the inference route against a plain HTTP route.
Deploy the llm-qa-benchmark tool.
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: llm-qa-benchmark
  name: llm-qa-benchmark
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-qa-benchmark
  template:
    metadata:
      labels:
        app: llm-qa-benchmark
    spec:
      containers:
      - command:
        - sh
        - -c
        - sleep inf
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-qa-benchmark:v0.1
        imagePullPolicy: IfNotPresent
        name: llm-qa-benchmark
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
EOF
Obtain the internal IP address of the gateway.
export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=qwen-inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
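Optionally, verify connectivity from inside the benchmark Pod before running the test. The check below assumes the backend exposes the standard OpenAI-compatible /v1/models endpoint, which vLLM's OpenAI-compatible server provides:
# List available models through the gateway's plain HTTP listener as a smoke test.
kubectl exec -it deploy/llm-qa-benchmark -- curl -s http://${GW_IP}:8080/v1/models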
Run the benchmark.
Important: The following test results were produced in a test environment. Actual results depend on your environment.
Plain HTTP route
kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
  --num-users 8 \
  --num-rounds 15 \
  --qps 0.1 \
  --shared-system-prompt 100 \
  --sharegpt \
  --user-history-prompt 2000 \
  --answer-len 100 \
  --model /model/qwen \
  --time 600 \
  --base-url http://${GW_IP}:8080/v1
Expected output:
==================== Performance summary ======================
QPS: 0.1000 reqs/s
Processing speed: 0.1080 reqs/s
Requests on-the-fly: 0
Input tokens per second: 259.0703 tokens/s
Output tokens per second: 4.8576 tokens/s
Average generation throughput (per request): 26.6710 tokens/req/s
Average TTFT: 0.3669s
Time range: 1748231183.2753935 - 1748231766.4799275 (583.20s)
===============================================================
Inference route
kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
  --num-users 8 \
  --num-rounds 15 \
  --qps 0.1 \
  --shared-system-prompt 100 \
  --sharegpt \
  --user-history-prompt 2000 \
  --answer-len 100 \
  --model /model/qwen \
  --time 600 \
  --base-url http://${GW_IP}:8081/v1
Expected output:
==================== Performance summary ======================
QPS: 0.1000 reqs/s
Processing speed: 0.1081 reqs/s
Requests on-the-fly: 0
Input tokens per second: 259.3009 tokens/s
Output tokens per second: 4.8548 tokens/s
Average generation throughput (per request): 26.9300 tokens/req/s
Average TTFT: 0.2761s
Time range: 1748231885.874972 - 1748232468.5918882 (582.72s)
===============================================================
As the output shows, the Average TTFT of the inference route is noticeably better than that of the plain HTTP route: 0.2761s versus 0.3669s, a reduction of roughly 25%.