For traditional HTTP requests, classic load-balancing algorithms can distribute requests evenly across workloads. For LLM inference services, however, the load each request places on the backend is hard to predict. The inference gateway (Gateway with Inference Extension) is an enhanced component built on the Kubernetes community Gateway API and its Inference Extension specification. It uses intelligent routing to improve load balancing across multiple inference workloads, provides different load-balancing strategies for different LLM inference scenarios, and supports capabilities such as model canary release and inference request queuing.
Prerequisites
Step 1: Configure intelligent routing for the inference service
Gateway with Inference Extension provides two intelligent routing load-balancing strategies to suit different inference service requirements.
Load balancing based on request queue length and GPU cache utilization (default strategy).
Prefix-aware load balancing (Prefix Cache Aware Routing).
You enable the inference gateway's intelligent routing for an inference service by declaring InferencePool and InferenceModel resources for it. Adjust the InferencePool and InferenceModel configuration according to how the backend inference service is deployed and which load-balancing strategy you choose.
Load balancing based on request queue length and GPU cache utilization
When the InferencePool has no annotations, the intelligent routing strategy based on request queue length and GPU cache utilization is used by default. This strategy dynamically distributes requests according to the real-time load of the backend inference services (including request queue length and GPU cache utilization) to achieve optimal load balancing.
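To see the signals this strategy relies on, you can inspect a backend Pod's metrics endpoint directly. The snippet below is a minimal sketch for a vLLM backend; the Pod name is a placeholder, and the metric names shown are the queue-length and KV cache usage metrics vLLM commonly exposes (names may vary slightly between vLLM versions):

# Forward the metrics port of one inference Pod (replace the Pod name with your own).
kubectl port-forward pod/<vllm-pod-name> 8000:8000 &
# Request queue length and GPU KV cache utilization as reported by vLLM.
curl -s http://localhost:8000/metrics | grep -E 'vllm:num_requests_waiting|vllm:gpu_cache_usage_perc'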
Create a file named inference_networking.yaml.
Single-node vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Single-node SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
SGLang PD-disaggregated (prefill/decode) deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads.
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool.
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # The backend serving framework is SGLang.
  profile:
    pd: # The backend is deployed in PD-disaggregated mode.
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label that distinguishes the prefill and decode roles within the InferencePool.
      kvTransfer:
        bootstrapPort: 34000 # Bootstrap port used by the SGLang PD-disaggregated service for KV cache transfer; must match the disaggregation-bootstrap-port argument specified in the RoleBasedGroup deployment.
Create the load-balancing configuration based on request queue length and GPU cache utilization.
kubectl create -f inference_networking.yaml
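Optionally, confirm that both resources were created (the kinds and names come from the manifest above):

kubectl get inferencepools.inference.networking.x-k8s.io qwen-inference-pool
kubectl get inferencemodels.inference.networking.x-k8s.io qwen-inference-model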
Prefix-aware load balancing (Prefix Cache Aware Routing)
The prefix-aware load-balancing strategy (Prefix Cache Aware Routing) sends requests that share the same prefix content to the same inference server Pod whenever possible. When the model server has automatic prefix caching (APC) enabled, this strategy increases the prefix cache hit rate and reduces request response time.
The vLLM v0.9.2 release used in this topic and the SGLang framework enable prefix caching by default, so you do not need to redeploy the service to turn it on.
To enable the prefix-aware load-balancing strategy, add the following annotation to the InferencePool: inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
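If an InferencePool without this annotation already exists from the previous step, one way to switch it in place is to add the annotation with kubectl instead of recreating the resource. Whether the gateway picks up the change immediately depends on the controller, so the declarative manifests below remain the documented path:

# Add (or overwrite) the routing-strategy annotation on the existing InferencePool.
kubectl annotate inferencepools.inference.networking.x-k8s.io qwen-inference-pool \
  inference.networking.x-k8s.io/routing-strategy=PREFIX_CACHE --overwrite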
Create a file named Prefix_Cache.yaml.
Single-node vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Single-node SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
SGLang PD-disaggregated (prefill/decode) deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads.
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool.
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # The backend serving framework is SGLang.
  profile:
    pd: # The backend is deployed in PD-disaggregated mode.
      trafficPolicy:
        prefixCache: # Declares the prefix-cache-aware load-balancing policy.
          mode: estimate
      prefillPolicyRef: prefixCache
      decodePolicyRef: prefixCache # Both prefill and decode apply prefix-aware load balancing.
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label that distinguishes the prefill and decode roles within the InferencePool.
      kvTransfer:
        bootstrapPort: 34000 # Bootstrap port used by the SGLang PD-disaggregated service for KV cache transfer; must match the disaggregation-bootstrap-port argument specified in the RoleBasedGroup deployment.
Create the prefix-aware load-balancing configuration.
kubectl create -f Prefix_Cache.yaml
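For the non-PD variants, you can optionally read the annotations back from the InferencePool to confirm that the PREFIX_CACHE strategy is set:

kubectl get inferencepools.inference.networking.x-k8s.io qwen-inference-pool -o jsonpath='{.metadata.annotations}'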
Step 2: Deploy the gateway
Create a file named gateway_networking.yaml.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway-class
spec:
  controllerName: inference.networking.x-k8s.io/gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway-class
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io
Create the GatewayClass, Gateway, and HTTPRoute resources to configure routing for the LLM inference service on port 8080.
kubectl create -f gateway_networking.yaml
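Before proceeding, you can check that the GatewayClass and Gateway are accepted and that the route has been created:

kubectl get gatewayclass inference-gateway-class
kubectl get gateway inference-gateway
kubectl get httproute inference-route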
Step 3: Verify the inference gateway configuration
Run the following command to obtain the gateway's external access address:
export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
Test access to the service on port 8080 with curl:
curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-32B",
    "messages": [
      {"role": "user", "content": "Hello, this is a test"}
    ],
    "max_tokens": 50
  }'
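The gateway forwards the request to a backend and returns an OpenAI-compatible chat completion response. As an optional convenience, the following sketch extracts only the generated text (it assumes the jq tool is installed locally):

curl -s http://${GATEWAY_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Hello, this is a test"}], "max_tokens": 50}' \
  | jq -r '.choices[0].message.content'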
Verify the different load-balancing strategies.
Verify load balancing based on request queue length and GPU cache utilization
The default strategy routes requests intelligently based on request queue length and GPU cache utilization. You can verify it by load-testing the inference service and observing its TTFT and throughput metrics.
For the detailed test method, see Configure observability metrics and dashboards for LLM services.
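As a quick, informal way to put concurrent load on the gateway (not a substitute for the load-testing method referenced above), you could run a simple loop such as:

# Send 20 concurrent requests through the gateway and wait for them to finish.
for i in $(seq 1 20); do
  curl -s -o /dev/null http://${GATEWAY_HOST}:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Hello, this is a test"}], "max_tokens": 50}' &
done
wait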
Verify prefix-aware load balancing
Create test files to verify that prefix-aware load balancing takes effect.
Generate round1.txt:
echo '{"max_tokens":24,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
Generate round2.txt:
echo '{"max_tokens":3,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
Run the following commands to test:
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
Check the Inference Extension Processor logs to confirm whether prefix-aware load balancing has taken effect:
kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"
If the two log entries show the same Pod name, prefix-aware load balancing is working.
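To narrow the output to just the two most recent routing decisions (one per request sent above), you can, for example, append a tail filter:

kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled" | tail -n 2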
For the detailed test method and results of prefix-aware load balancing, see Evaluate inference service performance with multi-turn conversation tests.