For traditional HTTP requests, classic load-balancing algorithms can distribute requests evenly across workloads. For LLM inference services, however, the load each request places on the backend is hard to predict. The inference gateway (Gateway with Inference Extension) is an enhanced component built on the Kubernetes community Gateway API and its Inference Extension specification. It uses intelligent routing to improve load balancing across multiple inference workloads, provides different load-balancing strategies for different LLM inference scenarios, and supports capabilities such as model canary release and inference request queuing.
Prerequisites
Step 1: Configure intelligent routing for the inference service
Gateway with Inference Extension provides two intelligent routing load-balancing strategies to suit different inference service requirements.
Load balancing based on request queue length and GPU cache utilization (default strategy).
Prefix-aware load balancing (Prefix Cache Aware Routing).
You enable the inference gateway's intelligent routing for an inference service by declaring InferencePool and InferenceModel resources for it. Adjust the InferencePool and InferenceModel configuration according to how the backend inference service is deployed and which load-balancing strategy you choose.
Load balancing based on request queue length and GPU cache utilization
When the InferencePool has no annotations, the intelligent routing strategy based on request queue length and GPU cache utilization is used by default. This strategy dynamically distributes requests according to the real-time load of the backend inference services (including request queue length and GPU cache utilization) to achieve optimal load balancing.
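To see the signals this strategy relies on, you can inspect a backend Pod's metrics endpoint directly. The snippet below is a minimal sketch for a vLLM backend; the Pod name is a placeholder, and the metric names shown are the queue-length and KV cache usage metrics vLLM commonly exposes (names may vary slightly between vLLM versions):

# Forward the metrics port of one inference Pod (replace the Pod name with your own).
kubectl port-forward pod/<vllm-pod-name> 8000:8000 &
# Request queue length and GPU KV cache utilization as reported by vLLM.
curl -s http://localhost:8000/metrics | grep -E 'vllm:num_requests_waiting|vllm:gpu_cache_usage_perc'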
Create a file named inference_networking.yaml.
Single-node vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Single-node SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
SGLang PD-disaggregated (prefill/decode) deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads.
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool.
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # The backend serving framework is SGLang.
  profile:
    pd: # The backend is deployed in PD-disaggregated mode.
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label that distinguishes the prefill and decode roles within the InferencePool.
      kvTransfer:
        bootstrapPort: 34000 # Bootstrap port used by the SGLang PD-disaggregated service for KV cache transfer; must match the disaggregation-bootstrap-port argument specified in the RoleBasedGroup deployment.
Create the load-balancing configuration based on request queue length and GPU cache utilization.
kubectl create -f inference_networking.yaml
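Optionally, confirm that both resources were created (the kinds and names come from the manifest above):

kubectl get inferencepools.inference.networking.x-k8s.io qwen-inference-pool
kubectl get inferencemodels.inference.networking.x-k8s.io qwen-inference-model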
Prefix-aware load balancing (Prefix Cache Aware Routing)
The prefix-aware load-balancing strategy (Prefix Cache Aware Routing) sends requests that share the same prefix content to the same inference server Pod whenever possible. When the model server has automatic prefix caching (APC) enabled, this strategy increases the prefix cache hit rate and reduces request response time.
The vLLM v0.9.2 release used in this topic and the SGLang framework enable prefix caching by default, so you do not need to redeploy the service to turn it on.
To enable the prefix-aware load-balancing strategy, add the following annotation to the InferencePool: inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
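If an InferencePool without this annotation already exists from the previous step, one way to switch it in place is to add the annotation with kubectl instead of recreating the resource. Whether the gateway picks up the change immediately depends on the controller, so the declarative manifests below remain the documented path:

# Add (or overwrite) the routing-strategy annotation on the existing InferencePool.
kubectl annotate inferencepools.inference.networking.x-k8s.io qwen-inference-pool \
  inference.networking.x-k8s.io/routing-strategy=PREFIX_CACHE --overwrite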
Create a file named Prefix_Cache.yaml.
Single-node vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Single-node SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
SGLang PD-disaggregated (prefill/decode) deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads.
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool.
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # The backend serving framework is SGLang.
  profile:
    pd: # The backend is deployed in PD-disaggregated mode.
      trafficPolicy:
        prefixCache: # Declares the prefix-cache-aware load-balancing policy.
          mode: estimate
      prefillPolicyRef: prefixCache
      decodePolicyRef: prefixCache # Both prefill and decode apply prefix-aware load balancing.
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label that distinguishes the prefill and decode roles within the InferencePool.
      kvTransfer:
        bootstrapPort: 34000 # Bootstrap port used by the SGLang PD-disaggregated service for KV cache transfer; must match the disaggregation-bootstrap-port argument specified in the RoleBasedGroup deployment.
Create the prefix-aware load-balancing configuration.
kubectl create -f Prefix_Cache.yaml
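For the non-PD variants, you can optionally read the annotations back from the InferencePool to confirm that the PREFIX_CACHE strategy is set:

kubectl get inferencepools.inference.networking.x-k8s.io qwen-inference-pool -o jsonpath='{.metadata.annotations}'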
Step 2: Deploy the gateway
Create a file named gateway_networking.yaml.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway-class
spec:
  controllerName: inference.networking.x-k8s.io/gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway-class
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io
Create the GatewayClass, Gateway, and HTTPRoute resources to configure routing for the LLM inference service on port 8080.
kubectl create -f gateway_networking.yaml
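Before proceeding, you can check that the GatewayClass and Gateway are accepted and that the route has been created:

kubectl get gatewayclass inference-gateway-class
kubectl get gateway inference-gateway
kubectl get httproute inference-route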
Step 3: Verify the inference gateway configuration
Run the following command to obtain the gateway's external access address:
export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
Test access to the service on port 8080 with curl:
curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-32B",
    "messages": [
      {"role": "user", "content": "Hello, this is a test"}
    ],
    "max_tokens": 50
  }'
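The gateway forwards the request to a backend and returns an OpenAI-compatible chat completion response. As an optional convenience, the following sketch extracts only the generated text (it assumes the jq tool is installed locally):

curl -s http://${GATEWAY_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Hello, this is a test"}], "max_tokens": 50}' \
  | jq -r '.choices[0].message.content'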
Verify the different load-balancing strategies.
Verify load balancing based on request queue length and GPU cache utilization
The default strategy routes requests intelligently based on request queue length and GPU cache utilization. You can verify it by load-testing the inference service and observing its TTFT and throughput metrics.
For the detailed test method, see Configure observability metrics and dashboards for LLM services.
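As a quick, informal way to put concurrent load on the gateway (not a substitute for the load-testing method referenced above), you could run a simple loop such as:

# Send 20 concurrent requests through the gateway and wait for them to finish.
for i in $(seq 1 20); do
  curl -s -o /dev/null http://${GATEWAY_HOST}:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Hello, this is a test"}], "max_tokens": 50}' &
done
wait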
Verify prefix-aware load balancing
Create test files to verify that prefix-aware load balancing takes effect.
Generate round1.txt:
echo '{"max_tokens":24,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
Generate round2.txt:
echo '{"max_tokens":3,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
Run the following commands to test:
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
Check the Inference Extension Processor logs to confirm whether prefix-aware load balancing has taken effect:
kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"
If the two log entries show the same Pod name, prefix-aware load balancing is working.
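To narrow the output to just the two most recent routing decisions (one per request sent above), you can, for example, append a tail filter:

kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled" | tail -n 2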
For the detailed test method and results of prefix-aware load balancing, see Evaluate inference service performance with multi-turn conversation tests.