Use Gateway with Inference Extension to implement inference request queuing and priority scheduling

Gateway with Inference Extension supports load-aware queuing and priority scheduling of inference requests. Besides preventing a generative AI inference service from becoming overloaded, it can schedule the queued inference requests according to model priority, ensuring that high-priority requests are answered first. This topic describes the inference request queuing and priority scheduling capabilities of Gateway with Inference Extension.

Important

This topic requires Gateway with Inference Extension 1.4.0 or later.

Background information

For a generative AI inference service, the request throughput of a single inference server is strictly limited by its GPU resources. When a large number of requests are sent to the same inference server simultaneously, resources such as the inference engine's KV cache become saturated, degrading the response time and token throughput of every request.

Gateway with Inference Extension evaluates the internal state of each inference server through multiple metrics and, when a server becomes saturated, queues incoming inference requests to prevent excess requests from reaching the server and degrading overall service quality.
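
The load signals come from metrics exposed by the inference engine itself. As a minimal sketch (assuming the sample vLLM service from Step 1 below is already deployed and exposes Prometheus metrics on port 8000), you can inspect two commonly used saturation signals, the waiting-request queue length and the KV cache usage, directly:

    # Forward the sample vLLM Service to the local machine.
    kubectl port-forward svc/qwen 8000:8000 &

    # vLLM exposes Prometheus metrics on /metrics. The exact metric names can vary
    # with the vLLM version; these two are typical saturation signals.
    curl -s http://127.0.0.1:8000/metrics | grep -E 'vllm:num_requests_waiting|vllm:gpu_cache_usage_perc'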

Prerequisites

Note

For the image used in this topic, the A10 GPU type is recommended for ACK clusters, and the L20 (GN8IS) GPU type is recommended for ACS GPU compute power.

Because the LLM image is large, we recommend that you transfer it to ACR in advance and pull it over the internal network. Pulling directly over the public internet is limited by the bandwidth of the cluster's EIP and can take a long time.
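
A minimal sketch of transferring the image to ACR (the target registry address and namespace below are placeholders that you need to replace with your own ACR instance):

    # Pull the public image, retag it for your own ACR instance, and push it.
    # <your-acr-registry> and the namespace "llm" are placeholder values.
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 \
      <your-acr-registry>/llm/qwen-2.5-7b-instruct-lora:v0.1
    docker push <your-acr-registry>/llm/qwen-2.5-7b-instruct-lora:v0.1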

Procedure

Step 1: Deploy the sample inference service

  1. Create vllm-service.yaml.


    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      progressDeadlineSeconds: 600
      replicas: 5
      selector:
        matchLabels:
          app: qwen
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
            alibabacloud.com/compute-class: gpu
            alibabacloud.com/compute-qos: default
            alibabacloud.com/gpu-model-series: GN8IS
        spec:
          containers:
            - command:
                - sh
                - -c
                - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --enable_prefix_caching --trust-remote-code --served-model-name qwen --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
              image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
              imagePullPolicy: IfNotPresent
              name: custom-serving
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 30
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8000
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: "1"
                  cpu: "8"
                  memory: 30G
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          restartPolicy: Always
          volumes:
            - emptyDir:
                medium: Memory
                sizeLimit: 30Gi
              name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
        - name: http-serving
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen
  2. Deploy the sample inference service.

    kubectl apply -f vllm-service.yaml
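
    Optionally, confirm that the pods are ready and that vLLM has finished loading the model before continuing. This is a hedged check; the exact startup log format depends on the vLLM version:

    # Wait for all 5 replicas to become ready (model loading can take several minutes).
    kubectl rollout status deploy/qwen

    # Inspect the startup logs of one of the pods.
    kubectl logs deploy/qwen --tail=20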

Step 2: Deploy the inference route

This step creates an InferencePool resource and InferenceModel resources, and enables queuing for the inference service selected by the InferencePool by adding the inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true" and inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true" annotations to the InferencePool.

  1. Create inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      annotations:
        inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true"
        inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true"
      name: qwen-pool
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: qwen-ext-proc
      selector:
        app: qwen
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-model
    spec:
      criticality: Critical
      modelName: qwen
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: qwen
        weight: 100
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: travel-helper-model
    spec:
      criticality: Standard
      modelName: travel-helper
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: travel-helper-v1
        weight: 100

    Alongside the InferencePool, two InferenceModel resources are declared, representing the two models that the sample inference service can serve:

    • qwen-model: declares the base model qwen served by the sample inference service and sets its criticality to Critical through the criticality: Critical field.

    • travel-helper-model: declares the LoRA model travel-helper that the sample service provides on top of the base model and sets its criticality to Standard through the criticality: Standard field.

    A model's criticality can be declared as Critical, Standard, or Sheddable, with priority Critical > Standard > Sheddable. After queuing is enabled and the backend model servers are saturated, requests for higher-criticality models are served before requests for lower-criticality models (see the sketch after this procedure).

  2. Deploy the inference route.

    kubectl apply -f inference-pool.yaml
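
As a minimal sketch of the lowest criticality level referenced above, the following declares an additional InferenceModel with criticality: Sheddable that targets the second LoRA module (travel-helper-v2) already loaded by the sample service. The resource name travel-helper-v2-model is an illustrative assumption; requests for travel-helper-v2 are already covered by the travel-helper.* route match deployed in Step 3.

    kubectl apply -f- <<EOF
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: travel-helper-v2-model
    spec:
      criticality: Sheddable
      modelName: travel-helper-v2
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: travel-helper-v2
        weight: 100
    EOF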

Step 3: Deploy the gateway and gateway routing rules

By matching the model name in the request, requests for the qwen and travel-helper models are routed to the backend InferencePool named qwen-pool.

  1. Create inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 8081
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: llm-route
      namespace: default
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen-pool
        matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: qwen
        - headers:
          - type: RegularExpression
            name: X-Gateway-Model-Name
            value: travel-helper.*
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
  2. Deploy the gateway.

    kubectl apply -f inference-gateway.yaml
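
    Optionally, verify that the gateway has been accepted and programmed, and that Envoy Gateway has created the corresponding data-plane Service (a hedged check; the column layout may differ between versions):

    # The PROGRAMMED condition should turn True once the gateway is ready.
    kubectl get gateway inference-gateway

    # Envoy Gateway creates the gateway Service in the envoy-gateway-system namespace.
    kubectl get svc -n envoy-gateway-system \
      -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway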

Step 4: Verify request queuing and priority scheduling

Using an ACK cluster as an example, run the vLLM benchmark against the qwen and travel-helper models at the same time to saturate the model servers.

  1. Deploy the benchmark workload.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: vllm-benchmark
      name: vllm-benchmark
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: vllm-benchmark
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: vllm-benchmark
        spec:
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-benchmark:random-and-qa
            imagePullPolicy: IfNotPresent
            name: vllm-benchmark
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    EOF
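
    Optionally, wait until the benchmark workload is available before continuing:

    kubectl wait --for=condition=Available deploy/vllm-benchmark --timeout=300s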
  2. Obtain the internal IP address of the gateway.

    export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
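
    Optionally, send a single request through the gateway as a smoke test before starting the benchmark. This is a minimal sketch that assumes curl is available in the benchmark image and uses the OpenAI-compatible /v1/completions endpoint served by vLLM; the model name qwen matches the InferenceModel declared in Step 2:

    kubectl exec -it deploy/vllm-benchmark -- curl -s http://${GW_IP}:8081/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen", "prompt": "Hello", "max_tokens": 16}'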
  3. Open two terminal windows and run the benchmark against both models at the same time.

    Important

    The following data was generated in a test environment and is for reference only. Actual benchmark results may vary depending on your environment.

    qwen:

    kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name qwen \
    --trust-remote-code \
    --dataset-name random \
    --random-prefix-len 1000 \
    --random-input-len 3000 \
    --random-output-len 3000 \
    --random-range-ratio 0.2 \
    --num-prompts 300 \
    --max-concurrency 60 \
    --host $GW_IP \
    --port 8081 \
    --endpoint /v1/completions \
    --save-result \
    2>&1 | tee benchmark_serving.txt

    Expected output:

    ============ Serving Benchmark Result ============
    Successful requests:                     293       
    Benchmark duration (s):                  1005.55   
    Total input tokens:                      1163919   
    Total generated tokens:                  837560    
    Request throughput (req/s):              0.29      
    Output token throughput (tok/s):         832.94    
    Total Token throughput (tok/s):          1990.43   
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          21329.91  
    Median TTFT (ms):                        15754.01  
    P99 TTFT (ms):                           140782.55 
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          58.58     
    Median TPOT (ms):                        58.36     
    P99 TPOT (ms):                           91.09     
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           58.32     
    Median ITL (ms):                         50.56     
    P99 ITL (ms):                            64.12     
    ==================================================

    travel-helper:

    kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name travel-helper \
    --trust-remote-code \
    --dataset-name random \
    --random-prefix-len 1000 \
    --random-input-len 3000 \
    --random-output-len 3000 \
    --random-range-ratio 0.2 \
    --num-prompts 300 \
    --max-concurrency 60 \
    --host $GW_IP \
    --port 8081 \
    --endpoint /v1/completions \
    --save-result \
    2>&1 | tee benchmark_serving.txt

    Expected output:

    ============ Serving Benchmark Result ============
    Successful requests:                     165       
    Benchmark duration (s):                  889.41    
    Total input tokens:                      660560    
    Total generated tokens:                  492207    
    Request throughput (req/s):              0.19      
    Output token throughput (tok/s):         553.41    
    Total Token throughput (tok/s):          1296.10   
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          44201.12  
    Median TTFT (ms):                        28757.03  
    P99 TTFT (ms):                           214710.13 
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          67.38     
    Median TPOT (ms):                        60.51     
    P99 TPOT (ms):                           118.36    
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           66.98     
    Median ITL (ms):                         51.25     
    P99 ITL (ms):                            64.87     
    ==================================================

    As shown above, with the model servers saturated, the mean TTFT of qwen requests is about 50% lower than that of travel-helper requests (21,329.91 ms versus 44,201.12 ms), and the number of failed qwen requests is roughly 95% lower (7 versus 135 failures out of 300 prompts each).