Configure intelligent routing for LLM inference services with an inference gateway

For traditional HTTP requests, classic load balancing algorithms can distribute requests evenly across workloads. For LLM inference services, however, the load that each request places on the backend is hard to predict. The inference gateway (Gateway with Inference Extension) is an enhanced component built on the Kubernetes community Gateway API and its Inference Extension specification. It uses intelligent routing to improve load balancing across multiple inference workloads, provides different load balancing policies for different LLM inference scenarios, and supports capabilities such as canary model releases and inference request queuing.

Prerequisites

Step 1: Configure intelligent routing for the inference service

Depending on the requirements of your inference service, Gateway with Inference Extension provides two intelligent routing load balancing policies:

  • Load balancing based on request queue length and GPU cache utilization (default policy).

  • Prefix-aware load balancing (Prefix Cache Aware Routing).

You can enable the inference gateway's intelligent routing for an inference service by declaring InferencePool and InferenceModel resources for it, and you can flexibly adjust the InferencePool and InferenceModel configuration based on how the backend inference service is deployed and which load balancing policy you choose.

Load balancing based on request queue length and GPU cache utilization

When the annotations of the InferencePool are empty, the intelligent routing policy based on request queue length and GPU cache utilization is used by default. This policy dynamically distributes requests according to the real-time load of the backend inference service (including request queue length and GPU cache utilization) to achieve optimal load balancing.

  1. Create a file named inference_networking.yaml.

    Single-node vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Single-node SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sgl-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sglang-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    SGLang prefill/decode (PD) disaggregated deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads
    ---
    # InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      modelServerRuntime: sglang # Specifies SGLang as the backend serving framework
      profile:
        pd:  # Indicates that the backend is deployed in prefill/decode (PD) disaggregated mode
          pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label used to distinguish the prefill and decode roles in the InferencePool
          kvTransfer:
            bootstrapPort: 34000 # Bootstrap port used for KV cache transfer by the SGLang PD-disaggregated service; must match the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.
  2. Apply the manifest to create the load balancing policy based on request queue length and GPU cache utilization.

    kubectl create -f inference_networking.yaml
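
    Optionally, confirm that the resources were created. This is a minimal check; the fully qualified resource names below assume the Inference Extension CRDs are installed in the cluster:

    kubectl get inferencepools.inference.networking.x-k8s.io qwen-inference-pool
    kubectl get inferencemodels.inference.networking.x-k8s.io qwen-inference-model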

Prefix-aware load balancing (Prefix Cache Aware Routing)

The prefix-aware load balancing policy (Prefix Cache Aware Routing) routes requests that share the same prefix content to the same inference server Pod whenever possible. When automatic prefix caching (APC) is enabled on the model server, this policy increases the prefix cache hit rate and reduces request response time.

Important

The vLLM version used in this topic (v0.9.2) and the SGLang framework enable prefix caching by default, so you do not need to redeploy the service to turn it on.

To enable the prefix-aware load balancing policy, add the following annotation to the InferencePool: inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE".
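
If the InferencePool already exists, you can also add the annotation in place instead of editing and re-applying the manifest. This is a sketch that assumes the extension processor picks up annotation changes on an existing InferencePool:

    # Add (or overwrite) the routing-strategy annotation on the existing InferencePool.
    kubectl annotate inferencepools.inference.networking.x-k8s.io qwen-inference-pool \
      inference.networking.x-k8s.io/routing-strategy=PREFIX_CACHE --overwrite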

  1. Create a file named Prefix_Cache.yaml.

    Single-node vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Single-node SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sgl-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sglang-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    SGLang prefill/decode (PD) disaggregated deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads
    ---
    # InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      modelServerRuntime: sglang # Specifies SGLang as the backend serving framework
      profile:
        pd:  # Indicates that the backend is deployed in prefill/decode (PD) disaggregated mode
          trafficPolicy:
            prefixCache: # Declares the prefix cache load balancing policy
              mode: estimate
          prefillPolicyRef: prefixCache
          decodePolicyRef: prefixCache # Apply prefix-aware load balancing to both prefill and decode
          pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label used to distinguish the prefill and decode roles in the InferencePool
          kvTransfer:
            bootstrapPort: 34000 # Bootstrap port used for KV cache transfer by the SGLang PD-disaggregated service; must match the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.
  2. Apply the manifest to create the prefix-aware load balancing policy.

    kubectl create -f Prefix_Cache.yaml

The following list describes the configuration options of the InferencePool and InferenceModel resources.

  • metadata.annotations.inference.networking.x-k8s.io/model-server-runtime (string): specifies the model server runtime of the backend (for example, sglang).

  • metadata.annotations.inference.networking.x-k8s.io/routing-strategy (string): specifies the routing strategy. Valid values: DEFAULT, PREFIX_CACHE. Default: intelligent routing based on request queue length and GPU cache utilization.

  • spec.targetPortNumber (int): specifies the port number of the inference service.

  • spec.selector (map[string]string): selector used to match the Pods of the inference service.

  • spec.extensionRef (ObjectReference): reference to the inference extension service.

  • spec.modelName (string): model name used for route matching.

  • spec.criticality (string): model criticality. Valid values: Critical, Standard.

  • spec.poolRef (PoolReference): the InferencePool resource that the InferenceModel is associated with.

Step 2: Deploy the gateway

  1. Create a file named gateway_networking.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway-class
    spec:
      controllerName: inference.networking.x-k8s.io/gateway-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway-class
      listeners:
      - name: http-llm
        protocol: HTTP
        port: 8080
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - name: inference-gateway
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /v1
        backendRefs:
        - name: qwen-inference-pool
          kind: InferencePool
          group: inference.networking.x-k8s.io
  2. Create the GatewayClass, Gateway, and HTTPRoute resources to configure routing for the LLM inference service on port 8080.

    kubectl create -f gateway_networking.yaml
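
    Before moving on, you can confirm that the gateway resources were accepted and that the gateway has been assigned an address (the ADDRESS column is populated once the gateway is programmed):

    kubectl get gatewayclass inference-gateway-class
    kubectl get gateway inference-gateway
    kubectl get httproute inference-route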

Step 3: Verify the inference gateway configuration

  1. Run the following command to get the external address of the gateway:

    export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  2. Use curl to test access to the service on port 8080:

    curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/models/Qwen3-32B",
        "messages": [
          {"role": "user", "content": "Hello, this is a test"}
        ],
        "max_tokens": 50
      }'
  3. Verify the load balancing policies.

    Verify load balancing based on request queue length and GPU cache utilization

    The default policy routes requests intelligently based on request queue length and GPU cache utilization. You can verify it by stress-testing the inference service and observing its TTFT (time to first token) and throughput metrics, as sketched below.

    For detailed testing methods, see Configure LLM service observability metrics and dashboards.
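
    To get a quick sense of how the gateway distributes load, you can send a small batch of concurrent requests and record per-request latency. This is a minimal smoke-test sketch only; for meaningful TTFT and throughput measurements, use a dedicated benchmarking tool:

      # Send 20 concurrent chat completion requests through the gateway and
      # print the HTTP status code and total latency for each one.
      for i in $(seq 1 20); do
        curl -s -o /dev/null -w "request ${i}: HTTP %{http_code}, total %{time_total}s\n" \
          http://${GATEWAY_HOST}:8080/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Write a short introduction to Kubernetes."}], "max_tokens": 64}' &
      done
      wait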

    Verify prefix-aware load balancing

    Create test files to verify that prefix-aware load balancing takes effect.

    1. Generate round1.txt:

      echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
    2. Generate round2.txt:

      echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
    3. Run the following commands to test:

      curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
      curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
    4. Check the logs of the Inference Extension Processor to confirm that prefix-aware load balancing takes effect:

      kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"

      If the same Pod name appears in both log entries, prefix-aware load balancing is in effect.

      For detailed testing methods and results of prefix-aware load balancing, see Evaluate inference service performance with multi-turn dialogue tests.