Implement prefix-aware load balancing with Gateway with Inference Extension

With the Gateway with Inference Extension component, you can choose different load balancing strategies for inference routing based on the usage scenarios of your generative AI inference service. This topic describes how to use the Gateway with Inference Extension component to implement a prefix-aware load balancing strategy.

Important
  • Before reading this topic, make sure that you are familiar with the InferencePool and InferenceModel concepts.

  • This topic requires Gateway with Inference Extension 1.4.0 or later.

Background information

Automatic prefix caching in vLLM

vLLM supports automatic prefix caching (APC). APC caches the KV cache of requests that vLLM has already computed. If a new request shares a prefix with a previous request, it can directly reuse the existing KV cache and skip the KV cache computation for the shared prefix, which speeds up the processing of LLM inference requests.
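
For reference, APC is turned on with a startup flag when the vLLM server is launched. The following is a minimal sketch (the model path and port are illustrative); the sample Deployment in Step 1 below passes the same --enable_prefix_caching flag as part of its container command:

    # Start a vLLM OpenAI-compatible server with automatic prefix caching enabled.
    # The model path below is an example; replace it with your own model directory.
    vllm serve /models/Qwen-2.5-7B-Instruct \
        --port 8000 \
        --enable_prefix_caching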

Prefix-aware load balancing strategy

A prefix-aware load balancing strategy routes requests that share the same prefix to the same inference server pod whenever possible.

When APC is enabled on the model server, a prefix-aware load balancing strategy raises the prefix cache hit rate as much as possible and reduces request response time. This strategy is mainly suitable for workloads with a large number of requests that share prefixes; evaluate whether it fits your actual business scenario.

Typical use cases:

  • Long document queries: users repeatedly query the same long document (such as a software manual or an annual report) with different questions.

  • Multi-turn conversations: users may interact with the application multiple times within the same chat session.

Prerequisites

Note

For the image used in this topic, A10 GPUs are recommended for ACK clusters, and the L20 (GN8IS) GPU model is recommended for ACS GPU compute.

In addition, because the LLM image is large, we recommend that you copy it to ACR in advance and pull it through the internal network address. Pulling directly from the public internet depends on the bandwidth of the cluster's EIP and may take a long time.
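
As a rough sketch, you could copy the image with docker pull, tag, and push; the registry address and namespace below are placeholders for your own ACR instance:

    # Pull the public image, re-tag it for your own ACR instance, and push it.
    # <your-acr-registry> and <your-namespace> are placeholders; replace them with your own values.
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 \
        <your-acr-registry>/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1
    docker push <your-acr-registry>/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1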

Procedure

Step 1: Deploy the sample inference service

  1. Create vllm-service.yaml.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      progressDeadlineSeconds: 600
      replicas: 5
      selector:
        matchLabels:
          app: qwen
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
            alibabacloud.com/compute-class: gpu
            alibabacloud.com/compute-qos: default
            alibabacloud.com/gpu-model-series: GN8IS
        spec:
          containers:
            - command:
                - sh
                - -c
                - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --enable_prefix_caching --trust-remote-code --served-model-name /model/qwen --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
              image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
              imagePullPolicy: IfNotPresent
              name: custom-serving
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 30
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8000
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: "1"
                  cpu: "8"
                  memory: 30G
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          restartPolicy: Always
          volumes:
            - emptyDir:
                medium: Memory
                sizeLimit: 30Gi
              name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
        - name: http-serving
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen
  2. Deploy the sample inference service.

    kubectl apply -f vllm-service.yaml
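
    Optionally, confirm that all replicas are running and ready before you continue; for example:

    # The app=qwen label matches the Deployment above; wait until all 5 replicas are Ready.
    kubectl get pods -l app=qwen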

Step 2: Deploy the inference route

This step creates an InferencePool resource and an InferenceModel resource.

  1. Create inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
      name: vllm-qwen-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-qwen
    spec:
      modelName: /model/qwen
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-qwen-pool
      targetModels:
      - name: /model/qwen
        weight: 100

    In the InferencePool resource, the inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" annotation enables the prefix-aware load balancing strategy for the pods in the InferencePool.

  2. Deploy the inference route.

    kubectl apply -f inference-pool.yaml
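
    Optionally, confirm that both resources have been created (this assumes the Gateway with Inference Extension CRDs are already installed in the cluster); for example:

    # Resource names match the manifests above.
    kubectl get inferencepool vllm-qwen-pool
    kubectl get inferencemodel inferencemodel-qwen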

Step 3: Deploy the gateway and gateway routing rules

This step creates a gateway that listens on ports 8080 and 8081. On port 8081, an HTTPRoute resource sets the route backend to the InferencePool provided by the inference extension, so inference requests are routed to the set of pods selected by the InferencePool. On port 8080, an HTTPRoute resource sets the route backend to the Service, so inference requests are routed to the same set of pods using the ordinary HTTP least-request load balancing strategy.

  1. Create inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: qwen-inference-gateway-class
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: qwen-inference-gateway
    spec:
      gatewayClassName: qwen-inference-gateway-class
      listeners:
        - name: http
          protocol: HTTP
          port: 8080
        - name: llm-gw
          protocol: HTTP
          port: 8081
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: qwen-backend
    spec:
      parentRefs:
        - name: qwen-inference-gateway
          sectionName: llm-gw
      rules:
        - backendRefs:
            - group: inference.networking.x-k8s.io
              kind: InferencePool
              name: vllm-qwen-pool
          matches:
            - path:
                type: PathPrefix
                value: /
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: qwen-backend-no-inference
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: qwen-inference-gateway
        sectionName: http
      rules:
      - backendRefs:
        - group: ""
          kind: Service
          name: qwen
          port: 8000
          weight: 1
        matches:
        - path:
            type: PathPrefix
            value: /
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 1h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: qwen-inference-gateway
  2. Deploy the gateway.

    kubectl apply -f inference-gateway.yaml
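
    Optionally, check that the gateway and its routes have been created and that the gateway has been assigned an address before you continue; for example:

    # The ADDRESS column should show the gateway address once it is programmed.
    kubectl get gateway qwen-inference-gateway
    kubectl get httproute qwen-backend qwen-backend-no-inference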

Step 4: Verify the routing rules

  1. Create round1.txt and round2.txt. Both files contain the same shared content prefix. Send round1.txt and then round2.txt as the bodies of two LLM requests, and then check the logs of the inference routing extension (extensionRef) to verify whether prefix-aware routing is triggered.

    round1.txt:

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

    round2.txt:

    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  2. Get the public IP address of the gateway.

    export GATEWAY_IP=$(kubectl get gateway/qwen-inference-gateway -o jsonpath='{.status.addresses[0].value}')
  3. Send the two chat requests to simulate a multi-turn conversation.

    curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. Check the logs to confirm whether prefix-aware load balancing has taken effect.

    kubectl logs deploy/epp-default-inference-gateway-ext-proc -n envoy-gateway-system|grep "Do prefix"

    Expected output:

    2025-05-23T03:33:09Z    INFO    scheduling/prefixcache_filter.go:311    Do prefix-aware routing!        {"request": "v68m4zx472", "matching ratio": " 0.54 > 0.50"}

    The Do prefix-aware routing! message in the log indicates that prefix-aware load balancing has taken effect.

(Optional) Step 5: Evaluate inference service performance with a multi-turn conversation test

Using an ACK cluster as an example, this step runs a multi-turn conversation benchmark with a load testing tool to compare the prefix-aware load balancing of the inference route against the ordinary HTTP route.

  1. Deploy the llm-qa-benchmark load testing tool.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: llm-qa-benchmark
      name: llm-qa-benchmark
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm-qa-benchmark
      template:
        metadata:
          labels:
            app: llm-qa-benchmark
        spec:
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-qa-benchmark:v0.1
            imagePullPolicy: IfNotPresent
            name: llm-qa-benchmark
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          restartPolicy: Always
    EOF
  2. Get the internal IP address of the gateway.

    export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=qwen-inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
  3. Run the benchmark.

    Important

    The following test results were generated in a test environment. Actual results depend on your environment.

    Ordinary HTTP route

    kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
        --num-users 8 \
        --num-rounds 15 \
        --qps 0.1 \
        --shared-system-prompt 100 \
        --sharegpt \
        --user-history-prompt 2000 \
        --answer-len 100 \
        --model /model/qwen \
        --time 600 \
        --base-url http://${GW_IP}:8080/v1

    Expected output:

    ==================== Performance summary ======================
      QPS: 0.1000 reqs/s
    
      Processing speed: 0.1080 reqs/s
    
      Requests on-the-fly: 0
    
      Input tokens per second: 259.0703 tokens/s
    
      Output tokens per second: 4.8576 tokens/s
    
      Average generation throughput (per request): 26.6710 tokens/req/s
    
      Average TTFT: 0.3669s
    
    Time range: 1748231183.2753935 - 1748231766.4799275 (583.20s)
    ===============================================================

    Inference service route

    kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
        --num-users 8 \
        --num-rounds 15 \
        --qps 0.1 \
        --shared-system-prompt 100 \
        --sharegpt \
        --user-history-prompt 2000 \
        --answer-len 100 \
        --model /model/qwen \
        --time 600 \
        --base-url http://${GW_IP}:8081/v1

    Expected output:

    ==================== Performance summary ======================
      QPS: 0.1000 reqs/s
    
      Processing speed: 0.1081 reqs/s
    
      Requests on-the-fly: 0
    
      Input tokens per second: 259.3009 tokens/s
    
      Output tokens per second: 4.8548 tokens/s
    
      Average generation throughput (per request): 26.9300 tokens/req/s
    
      Average TTFT: 0.2761s
    
    Time range: 1748231885.874972 - 1748232468.5918882 (582.72s)
    ===============================================================

    As the output shows, the Average TTFT of the inference service route (0.2761s) is noticeably better than that of the ordinary HTTP route (0.3669s), roughly a 25% reduction.