Generative AI inference framework support and configuration - Container Service for Kubernetes (ACK) - Alibaba Cloud

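Gateway with Inference Extension supports multiple generative AI inference frameworks and provides the same capabilities for inference services regardless of the framework they are deployed on. These capabilities include canary release policies, inference load balancing, and routing based on model name. This topic describes which inference frameworks Gateway with Inference Extension supports and how to use each one.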

Supported inference frameworks

Inference framework	Required version
vLLM v0	v0.6.4 or later
vLLM v1	v0.8.0 or later
SGLang	v0.3.6 or later
Triton with the TensorRT-LLM backend	25.03 or later

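vLLM support

vLLM is the backend inference framework that Gateway with Inference Extension supports by default. If your inference service is built on vLLM, you can use the enhanced generative AI service capabilities without any additional configuration.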
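Because vLLM is the default backend, an InferencePool for a vLLM-based service needs no model-server-runtime annotation. The following is a minimal sketch that follows the same structure as the other InferencePool examples in this topic. The pool name, extension Service name, Pod label, and port are hypothetical placeholders and must be replaced with the values of your own deployment. Port 8000 is assumed here only because it is the default port of the vLLM server.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  # No model-server-runtime annotation: vLLM is the default backend.
  name: qwen-vllm-pool            # hypothetical pool name
spec:
  extensionRef:
    group: ""
    kind: Service
    name: qwen-vllm-ext-proc      # hypothetical extension Service name
  selector:
    app: qwen-vllm                # hypothetical label of the vLLM Pods
  targetPortNumber: 8000          # assumed: default vLLM serving port
```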

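SGLang support

If your generative AI inference service is built on SGLang, add the inference.networking.x-k8s.io/model-server-runtime: sglang annotation to the InferencePool to enable intelligent routing and load balancing for the SGLang inference framework.

The following is an example InferencePool for SGLang. No other resources need to be changed.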

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
  name: deepseek-sglang-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: deepseek-sglang-ext-proc
  selector:
    app: deepseek-r1-sglang
  targetPortNumber: 30000

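TensorRT-LLM support

TensorRT-LLM is NVIDIA's open-source optimization engine for large language models (LLMs). It is used to define LLMs and build them into TensorRT engines, which improves inference efficiency on NVIDIA GPUs. TensorRT-LLM can also be combined with Triton as one of its backends, the TensorRT-LLM Backend. Models built with TensorRT-LLM can run on one or more GPUs and support tensor parallelism and pipeline parallelism.

If your generative AI inference service is built on a Triton model server with the TensorRT-LLM backend, add the inference.networking.x-k8s.io/model-server-runtime: trt-llm annotation to the InferencePool to enable intelligent routing and load balancing for TensorRT-LLM.

The following is an example InferencePool for TensorRT-LLM. No other resources need to be changed.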

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: trt-llm
  name: qwen-trt-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: trt-llm-ext-proc
  selector:
    app: qwen-trt-llm
  targetPortNumber: 8000