Use Gateway with Inference Extension to implement inference request queuing and priority scheduling

Gateway with Inference Extension supports load-aware queuing and priority scheduling of inference requests. Besides preventing a generative AI inference service from becoming overloaded, it can schedule the queued inference requests according to model priority, ensuring that high-priority requests are answered first. This topic describes the inference request queuing and priority scheduling capabilities of Gateway with Inference Extension.

Important

This topic requires Gateway with Inference Extension 1.4.0 or later.

Background information

For a generative AI inference service, the request throughput of a single inference server is strictly limited by its GPU resources. When a large number of requests are sent to the same inference server simultaneously, resources such as the inference engine's KV cache become saturated, degrading the response time and token throughput of every request.

Gateway with Inference Extension evaluates the internal state of each inference server through multiple metrics and, when a server becomes saturated, queues incoming inference requests to prevent excess requests from reaching the server and degrading overall service quality.
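
The load signals come from metrics exposed by the inference engine itself. As a minimal sketch (assuming the sample vLLM service from Step 1 below is already deployed and exposes Prometheus metrics on port 8000), you can inspect two commonly used saturation signals, the waiting-request queue length and the KV cache usage, directly:

    # Forward the sample vLLM Service to the local machine.
    kubectl port-forward svc/qwen 8000:8000 &

    # vLLM exposes Prometheus metrics on /metrics. The exact metric names can vary
    # with the vLLM version; these two are typical saturation signals.
    curl -s http://127.0.0.1:8000/metrics | grep -E 'vllm:num_requests_waiting|vllm:gpu_cache_usage_perc'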

Prerequisites

Note

For the image used in this topic, the A10 GPU type is recommended for ACK clusters, and the L20 (GN8IS) GPU type is recommended for ACS GPU compute power.

Because the LLM image is large, we recommend that you transfer it to ACR in advance and pull it over the internal network. Pulling directly over the public internet is limited by the bandwidth of the cluster's EIP and can take a long time.
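
A minimal sketch of transferring the image to ACR (the target registry address and namespace below are placeholders that you need to replace with your own ACR instance):

    # Pull the public image, retag it for your own ACR instance, and push it.
    # <your-acr-registry> and the namespace "llm" are placeholder values.
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 \
      <your-acr-registry>/llm/qwen-2.5-7b-instruct-lora:v0.1
    docker push <your-acr-registry>/llm/qwen-2.5-7b-instruct-lora:v0.1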

Procedure

Step 1: Deploy the sample inference service

  1. Create vllm-service.yaml.


    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      progressDeadlineSeconds: 600
      replicas: 5
      selector:
        matchLabels:
          app: qwen
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
            alibabacloud.com/compute-class: gpu
            alibabacloud.com/compute-qos: default
            alibabacloud.com/gpu-model-series: GN8IS
        spec:
          containers:
            - command:
                - sh
                - -c
                - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --enable_prefix_caching --trust-remote-code --served-model-name qwen --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
              image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
              imagePullPolicy: IfNotPresent
              name: custom-serving
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 30
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8000
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: "1"
                  cpu: "8"
                  memory: 30G
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          restartPolicy: Always
          volumes:
            - emptyDir:
                medium: Memory
                sizeLimit: 30Gi
              name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
        - name: http-serving
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen
  2. Deploy the sample inference service.

    kubectl apply -f vllm-service.yaml
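
    Optionally, confirm that the pods are ready and that vLLM has finished loading the model before continuing. This is a hedged check; the exact startup log format depends on the vLLM version:

    # Wait for all 5 replicas to become ready (model loading can take several minutes).
    kubectl rollout status deploy/qwen

    # Inspect the startup logs of one of the pods.
    kubectl logs deploy/qwen --tail=20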

Step 2: Deploy the inference route

This step creates an InferencePool resource and InferenceModel resources, and enables queuing for the inference service selected by the InferencePool by adding the inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true" and inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true" annotations to the InferencePool.

  1. Create inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      annotations:
        inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true"
        inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true"
      name: qwen-pool
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: qwen-ext-proc
      selector:
        app: qwen
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-model
    spec:
      criticality: Critical
      modelName: qwen
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: qwen
        weight: 100
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: travel-helper-model
    spec:
      criticality: Standard
      modelName: travel-helper
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: travel-helper-v1
        weight: 100

    Alongside the InferencePool, two InferenceModel resources are declared, representing the two models that the sample inference service can serve:

    • qwen-model: declares the base model qwen served by the sample inference service and sets its criticality to Critical through the criticality: Critical field.

    • travel-helper-model: declares the LoRA model travel-helper that the sample service provides on top of the base model and sets its criticality to Standard through the criticality: Standard field.

    A model's criticality can be declared as Critical, Standard, or Sheddable, with priority Critical > Standard > Sheddable. After queuing is enabled and the backend model servers are saturated, requests for higher-criticality models are served before requests for lower-criticality models (see the sketch after this procedure).

  2. Deploy the inference route.

    kubectl apply -f inference-pool.yaml
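
As a minimal sketch of the lowest criticality level referenced above, the following declares an additional InferenceModel with criticality: Sheddable that targets the second LoRA module (travel-helper-v2) already loaded by the sample service. The resource name travel-helper-v2-model is an illustrative assumption; requests for travel-helper-v2 are already covered by the travel-helper.* route match deployed in Step 3.

    kubectl apply -f- <<EOF
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: travel-helper-v2-model
    spec:
      criticality: Sheddable
      modelName: travel-helper-v2
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: travel-helper-v2
        weight: 100
    EOF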

Step 3: Deploy the gateway and gateway routing rules

By matching the model name in the request, requests for the qwen and travel-helper models are routed to the backend InferencePool named qwen-pool.

  1. Create inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 8081
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: llm-route
      namespace: default
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen-pool
        matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: qwen
        - headers:
          - type: RegularExpression
            name: X-Gateway-Model-Name
            value: travel-helper.*
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
  2. Deploy the gateway.

    kubectl apply -f inference-gateway.yaml
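
    Optionally, verify that the gateway has been accepted and programmed, and that Envoy Gateway has created the corresponding data-plane Service (a hedged check; the column layout may differ between versions):

    # The PROGRAMMED condition should turn True once the gateway is ready.
    kubectl get gateway inference-gateway

    # Envoy Gateway creates the gateway Service in the envoy-gateway-system namespace.
    kubectl get svc -n envoy-gateway-system \
      -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway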

Step 4: Verify request queuing and priority scheduling

Using an ACK cluster as an example, run the vLLM benchmark against the qwen and travel-helper models at the same time to saturate the model servers.

  1. Deploy the benchmark workload.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: vllm-benchmark
      name: vllm-benchmark
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: vllm-benchmark
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: vllm-benchmark
        spec:
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-benchmark:random-and-qa
            imagePullPolicy: IfNotPresent
            name: vllm-benchmark
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    EOF
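
    Optionally, wait until the benchmark workload is available before continuing:

    kubectl wait --for=condition=Available deploy/vllm-benchmark --timeout=300s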
  2. Obtain the internal IP address of the gateway.

    export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
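
    Optionally, send a single request through the gateway as a smoke test before starting the benchmark. This is a minimal sketch that assumes curl is available in the benchmark image and uses the OpenAI-compatible /v1/completions endpoint served by vLLM; the model name qwen matches the InferenceModel declared in Step 2:

    kubectl exec -it deploy/vllm-benchmark -- curl -s http://${GW_IP}:8081/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen", "prompt": "Hello", "max_tokens": 16}'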
  3. Open two terminal windows and run the benchmark against both models at the same time.

    Important

    The following data was generated in a test environment and is for reference only. Actual benchmark results may vary depending on your environment.

    qwen:

    kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name qwen \
    --trust-remote-code \
    --dataset-name random \
    --random-prefix-len 1000 \
    --random-input-len 3000 \
    --random-output-len 3000 \
    --random-range-ratio 0.2 \
    --num-prompts 300 \
    --max-concurrency 60 \
    --host $GW_IP \
    --port 8081 \
    --endpoint /v1/completions \
    --save-result \
    2>&1 | tee benchmark_serving.txt

    Expected output:

    ============ Serving Benchmark Result ============
    Successful requests:                     293       
    Benchmark duration (s):                  1005.55   
    Total input tokens:                      1163919   
    Total generated tokens:                  837560    
    Request throughput (req/s):              0.29      
    Output token throughput (tok/s):         832.94    
    Total Token throughput (tok/s):          1990.43   
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          21329.91  
    Median TTFT (ms):                        15754.01  
    P99 TTFT (ms):                           140782.55 
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          58.58     
    Median TPOT (ms):                        58.36     
    P99 TPOT (ms):                           91.09     
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           58.32     
    Median ITL (ms):                         50.56     
    P99 ITL (ms):                            64.12     
    ==================================================

    travel-helper:

    kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name travel-helper \
    --trust-remote-code \
    --dataset-name random \
    --random-prefix-len 1000 \
    --random-input-len 3000 \
    --random-output-len 3000 \
    --random-range-ratio 0.2 \
    --num-prompts 300 \
    --max-concurrency 60 \
    --host $GW_IP \
    --port 8081 \
    --endpoint /v1/completions \
    --save-result \
    2>&1 | tee benchmark_serving.txt

    Expected output:

    ============ Serving Benchmark Result ============
    Successful requests:                     165       
    Benchmark duration (s):                  889.41    
    Total input tokens:                      660560    
    Total generated tokens:                  492207    
    Request throughput (req/s):              0.19      
    Output token throughput (tok/s):         553.41    
    Total Token throughput (tok/s):          1296.10   
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          44201.12  
    Median TTFT (ms):                        28757.03  
    P99 TTFT (ms):                           214710.13 
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          67.38     
    Median TPOT (ms):                        60.51     
    P99 TPOT (ms):                           118.36    
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           66.98     
    Median ITL (ms):                         51.25     
    P99 ITL (ms):                            64.87     
    ==================================================

    As shown above, with the model servers saturated, the mean TTFT of qwen requests is about 50% lower than that of travel-helper requests (21,329.91 ms versus 44,201.12 ms), and the number of failed qwen requests is roughly 95% lower (7 versus 135 failures out of 300 prompts each).