Gateway with Inference Extension supports load-aware queueing and priority scheduling of inference requests. In addition to protecting generative AI inference services from overload, it can schedule queued inference requests by model priority, ensuring that high-priority requests are served first. This topic describes the request queueing and priority scheduling capabilities of Gateway with Inference Extension.
This topic requires Gateway with Inference Extension v1.4.0 or later.
Background information
For generative AI inference services, the request throughput of a single inference server is strictly limited by its GPU resources. When a large number of requests hit the same inference server at once, resources such as the inference engine's KV cache become fully occupied, degrading response time and token throughput for all requests.
Gateway with Inference Extension evaluates an inference server's internal state through metrics across multiple dimensions, and queues inference requests when the server is saturated. This prevents excess requests from reaching the server and degrading overall service quality.
Prerequisites
An ACK managed cluster with a GPU node pool has been created. Alternatively, you can install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU compute power.
Gateway with Inference Extension v1.4.0 or later has been installed, with the Enable Gateway API Inference Extension option selected. For the operation entry point, see Install components.
For the image used in this topic, the A10 GPU type is recommended for ACK clusters, and the L20 (GN8IS) GPU type is recommended for ACS GPU compute power.
In addition, because LLM images are large, we recommend that you transfer the image to Container Registry (ACR) in advance and pull it over the internal network. Pulling directly from the public internet depends on the bandwidth of the cluster's EIP and can involve a long wait.
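As an illustration, one way to transfer an image is to pull it locally, re-tag it, and push it to your ACR instance, then pull it in the cluster through the registry's VPC endpoint. All registry addresses and image names below are placeholders; substitute your own.
# Pull the public LLM image once, re-tag it, and push it to your ACR instance.
docker pull <public-registry>/<repo>/<llm-image>:<tag>
docker tag <public-registry>/<repo>/<llm-image>:<tag> registry-vpc.<region>.aliyuncs.com/<your-namespace>/<llm-image>:<tag>
docker push registry-vpc.<region>.aliyuncs.com/<your-namespace>/<llm-image>:<tag>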
Procedure
Step 1: Deploy a sample inference service
Create a file named vllm-service.yaml.
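The exact manifest depends on your environment. The following is a minimal sketch, assuming a single-replica vLLM OpenAI-compatible server that serves the base model as qwen and exposes a LoRA adapter as travel-helper-v1; the image, model path, LoRA path, and GPU request are placeholders. The pod label app: qwen and container port 8000 must match the InferencePool selector and targetPortNumber defined in Step 2.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen
  template:
    metadata:
      labels:
        app: qwen                  # must match the InferencePool selector in Step 2
    spec:
      containers:
      - name: vllm
        # Placeholder image; use your internal ACR address as recommended above.
        image: <your-acr>/<namespace>/vllm-openai:<tag>
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=/models/DeepSeek-R1-Distill-Qwen-7B
        - --served-model-name=qwen
        - --port=8000              # must match targetPortNumber in the InferencePool
        - --enable-lora
        # Placeholder LoRA path; the adapter name must match targetModels in Step 2.
        - --lora-modules=travel-helper-v1=/models/loras/travel-helper
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"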
Deploy the sample inference service.
kubectl apply -f vllm-service.yaml
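Before continuing, you can wait for the inference server pods to become ready. This assumes the app: qwen pod label used by the InferencePool selector in Step 2.
kubectl wait --for=condition=Ready pod -l app=qwen --timeout=15m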
Step 2: Deploy inference routing
This step creates an InferencePool resource and two InferenceModel resources, and enables queueing for the inference service selected by the InferencePool by adding the inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true" and inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true" annotations to the InferencePool.
Create a file named inference-pool.yaml.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true"
    inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true"
  name: qwen-pool
  namespace: default
spec:
  extensionRef:
    group: ""
    kind: Service
    name: qwen-ext-proc
  selector:
    app: qwen
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-model
spec:
  criticality: Critical
  modelName: qwen
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-pool
  targetModels:
  - name: qwen
    weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: travel-helper-model
spec:
  criticality: Standard
  modelName: travel-helper
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-pool
  targetModels:
  - name: travel-helper-v1
    weight: 100
Along with the InferencePool, the manifest declares two InferenceModel resources, representing the two models that the sample inference service can serve:
qwen-model: declares qwen, the base model served by the sample inference service, and sets its criticality level to Critical through the criticality: Critical field.
travel-helper-model: declares travel-helper, a LoRA model that the sample inference service serves on top of the base model, and sets its criticality level to Standard through the criticality: Standard field.
A model's criticality can be declared as Critical, Standard, or Sheddable (lowest priority). The three levels are prioritized as Critical > Standard > Sheddable. With queueing enabled, when the backend model servers are saturated, requests for higher-priority models are served before requests for lower-priority models.
Deploy the inference routing.
kubectl apply -f inference-pool.yaml
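You can confirm that the resources were created; the qwen-pool InferencePool and the two InferenceModel resources should be listed (output columns may vary by version).
kubectl get inferencepool,inferencemodel -n default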
Step 3: Deploy the gateway and gateway routing rules
This step matches the model name in each request and routes requests for the qwen and travel-helper models to the backend InferencePool named qwen-pool.
Create a file named inference-gateway.yaml.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
  - name: llm-gw
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
  namespace: default
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: qwen-pool
    matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: qwen
    - headers:
      - type: RegularExpression
        name: X-Gateway-Model-Name
        value: travel-helper.*
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 24h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
Deploy the gateway.
kubectl apply -f inference-gateway.yaml
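You can verify that the gateway has been programmed before sending traffic; the PROGRAMMED column should show True once the Envoy proxy is ready.
kubectl get gateway inference-gateway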
Step 4: Verify request queueing and priority scheduling
Taking an ACK cluster as an example, use the vLLM benchmark to stress-test the qwen and travel-helper models at the same time, so that the model server becomes saturated.
Deploy the benchmark workload.
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vllm-benchmark
  name: vllm-benchmark
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: vllm-benchmark
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: vllm-benchmark
    spec:
      containers:
      - command:
        - sh
        - -c
        - sleep inf
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-benchmark:random-and-qa
        imagePullPolicy: IfNotPresent
        name: vllm-benchmark
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
EOF
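Wait for the benchmark workload to become available.
kubectl wait --for=condition=Available deploy/vllm-benchmark --timeout=10m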
Obtain the internal IP address of the gateway.
export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
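Optionally, send a single request through the gateway to confirm routing before benchmarking. This sketch runs curl from inside the benchmark pod, assuming curl is available in the image; the gateway derives the model name from the model field in the request body.
kubectl exec -it deploy/vllm-benchmark -- curl -s http://${GW_IP}:8081/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen", "prompt": "Hello", "max_tokens": 16}'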
Open two terminal windows and start the benchmarks against both models at the same time.
Important: The following figures were generated in a test environment and are for reference only. Actual benchmark results may vary with your environment.
qwen:
kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name qwen \
  --trust-remote-code \
  --dataset-name random \
  --random-prefix-len 1000 \
  --random-input-len 3000 \
  --random-output-len 3000 \
  --random-range-ratio 0.2 \
  --num-prompts 300 \
  --max-concurrency 60 \
  --host $GW_IP \
  --port 8081 \
  --endpoint /v1/completions \
  --save-result \
  2>&1 | tee benchmark_serving.txt
Expected output:
============ Serving Benchmark Result ============
Successful requests:                     293
Benchmark duration (s):                  1005.55
Total input tokens:                      1163919
Total generated tokens:                  837560
Request throughput (req/s):              0.29
Output token throughput (tok/s):         832.94
Total Token throughput (tok/s):          1990.43
---------------Time to First Token----------------
Mean TTFT (ms):                          21329.91
Median TTFT (ms):                        15754.01
P99 TTFT (ms):                           140782.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          58.58
Median TPOT (ms):                        58.36
P99 TPOT (ms):                           91.09
---------------Inter-token Latency----------------
Mean ITL (ms):                           58.32
Median ITL (ms):                         50.56
P99 ITL (ms):                            64.12
==================================================
travel-helper:
kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name travel-helper \
  --trust-remote-code \
  --dataset-name random \
  --random-prefix-len 1000 \
  --random-input-len 3000 \
  --random-output-len 3000 \
  --random-range-ratio 0.2 \
  --num-prompts 300 \
  --max-concurrency 60 \
  --host $GW_IP \
  --port 8081 \
  --endpoint /v1/completions \
  --save-result \
  2>&1 | tee benchmark_serving.txt
Expected output:
============ Serving Benchmark Result ============
Successful requests:                     165
Benchmark duration (s):                  889.41
Total input tokens:                      660560
Total generated tokens:                  492207
Request throughput (req/s):              0.19
Output token throughput (tok/s):         553.41
Total Token throughput (tok/s):          1296.10
---------------Time to First Token----------------
Mean TTFT (ms):                          44201.12
Median TTFT (ms):                        28757.03
P99 TTFT (ms):                           214710.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          67.38
Median TPOT (ms):                        60.51
P99 TPOT (ms):                           118.36
---------------Inter-token Latency----------------
Mean ITL (ms):                           66.98
Median ITL (ms):                         51.25
P99 ITL (ms):                            64.87
==================================================
As the results show, with the model service saturated, the mean TTFT of requests for the qwen model is about 50% lower than that of requests for the travel-helper model (21329.91 ms versus 44201.12 ms), and the number of failed requests is about 95% lower (7 out of 300 versus 135 out of 300).