With the Gateway with Inference Extension component, after you deploy a generative AI inference service that uses the OpenAI API format, you can specify request routing policies based on the model name in the request, including traffic canary release, traffic mirroring, and traffic circuit breaking. This topic describes how to use the Gateway with Inference Extension component to implement model-name-based routing for inference services.
Before reading this topic, make sure you are familiar with the concepts of InferencePool and InferenceModel.
This topic requires Gateway with Inference Extension v1.4.0 or later.
Background information
OpenAI-compatible API
An OpenAI-compatible API is a generative large language model (LLM) inference service API that is highly compatible with the official OpenAI API (such as GPT-3.5 and GPT-4) in terms of interfaces, parameters, and response formats. The compatibility is usually reflected in the following aspects:
Interface structure: uses the same HTTP request methods (such as POST), endpoint formats, and authentication methods (such as API keys).
Parameter support: supports parameters similar to those of the OpenAI API, such as model, prompt, temperature, and max_tokens.
Response format: returns the same JSON structure as OpenAI, for example, including the choices, usage, and id fields.
Currently, mainstream third-party LLM services and mainstream LLM inference engines such as vLLM and SGLang all provide OpenAI-compatible APIs, so that users get a consistent experience during migration and day-to-day use. A minimal example call is sketched below.
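For illustration, a minimal chat completion call against such an API might look like the following sketch; the endpoint address and model name are placeholders. Note that the model name travels in the JSON request body, which matters for the routing discussed below.
curl -X POST http://<llm-service>/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model-name>", "messages": [{"role": "user", "content": "Hello"}]}'
# Response: a JSON object with id, choices, and usage fields.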
Scenario description
For generative AI inference services, the model name in a user request is important request metadata, and specifying routing policies based on it is a common scenario when exposing inference services through a gateway. However, for LLM inference services that provide OpenAI-compatible APIs, the model name resides in the request body, and ordinary routing policies do not support routing based on the request body.
Gateway with Inference Extension supports specifying routing policies based on the model name for OpenAI-compatible APIs. It parses the model name out of the request body and attaches it to a request header, providing out-of-the-box model-name-based routing. To use this capability, you only need to match the X-Gateway-Model-Name request header in an HTTPRoute resource; no client-side changes are required.
This topic demonstrates how to route requests to two inference services, Qwen-2.5-7B-Instruct and DeepSeek-R1-Distill-Qwen-7B, on the same gateway instance based on the model name in the request: requests for the qwen model are routed to the qwen inference service, and requests for the deepseek-r1 model are routed to the deepseek-r1 service. The main routing flow is: the gateway extracts the model name from the request body, writes it to the X-Gateway-Model-Name request header, and then matches HTTPRoute rules against that header to select the target InferencePool.
Prerequisites
An ACK managed cluster with a GPU node pool has been created. You can also install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU computing power.
Gateway with Inference Extension v1.4.0 or later has been installed with Enable Gateway API Inference Extension selected. For the operation entry, see Install components.
For the images used in this topic, the A10 GPU type is recommended for ACK clusters, and the L20 (GN8IS) GPU type is recommended for ACS GPU computing power.
In addition, because LLM images are large, we recommend that you transfer them to ACR in advance and pull them over the internal network. Pulling directly from the public internet depends on the bandwidth of the cluster's EIP and can involve a long wait.
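For example, a transfer to ACR might look like the following sketch; the region, namespace, and image name are placeholders you must replace (the registry-vpc prefix is ACR's internal-network endpoint).
# Hypothetical example; substitute your own region, namespace, and image.
docker pull vllm/vllm-openai:latest
docker tag vllm/vllm-openai:latest registry-vpc.<region>.aliyuncs.com/<namespace>/vllm-openai:latest
docker push registry-vpc.<region>.aliyuncs.com/<namespace>/vllm-openai:latest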
Procedure
Step 1: Deploy the sample inference services
Create vllm-service.yaml. A reference sketch of its contents is shown below.
Deploy the sample inference services.
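The full contents of vllm-service.yaml are not reproduced here. As a reference only, the following is a minimal sketch of what the qwen Deployment in such a file might look like. The app: qwen label, container port 8000, and served model name qwen are what the InferencePool and InferenceModel in Step 2 expect; the image address, model path, and resource sizing are placeholder assumptions you must replace. The DeepSeek-R1-Distill-Qwen-7B Deployment is analogous, with the label app: deepseek-r1 and served model name deepseek-r1.
# Hypothetical sketch of the qwen half of vllm-service.yaml; replace placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen            # must match the InferencePool selector in Step 2
  template:
    metadata:
      labels:
        app: qwen
    spec:
      containers:
      - name: vllm
        # Assumption: a vLLM OpenAI-compatible server image transferred to ACR in advance.
        image: <your-acr-address>/vllm-openai:latest
        command: ["vllm", "serve", "Qwen/Qwen2.5-7B-Instruct"]
        args:
        - --served-model-name=qwen   # the model name that requests and routing rules use
        - --port=8000                # matches targetPortNumber in the InferencePool
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"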
kubectl apply -f vllm-service.yaml
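Model download and engine startup can take several minutes. Before continuing, you can wait until the pods of both services are Running and Ready, for example:
kubectl get pods -l app=qwen
kubectl get pods -l app=deepseek-r1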
Step 2: Deploy the inference routes
This step creates the InferencePool and InferenceModel resources.
Create inference-pool.yaml.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-pool
  namespace: default
spec:
  extensionRef:
    group: ""
    kind: Service
    name: qwen-ext-proc
  selector:
    app: qwen
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen
spec:
  criticality: Critical
  modelName: qwen
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-pool
  targetModels:
  - name: qwen
    weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: deepseek-pool
  namespace: default
spec:
  extensionRef:
    group: ""
    kind: Service
    name: deepseek-ext-proc
  selector:
    app: deepseek-r1
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: deepseek-r1
spec:
  criticality: Critical
  modelName: deepseek-r1
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: deepseek-pool
  targetModels:
  - name: deepseek-r1
    weight: 100
Deploy the inference routes.
kubectl apply -f inference-pool.yaml
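Optionally, confirm that the resources were created (these resource kinds come from the inference.networking.x-k8s.io CRDs installed with the component):
kubectl get inferencepool
kubectl get inferencemodel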
Step 3: Deploy the gateway and gateway routing rules
Create inference-gateway.yaml.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
  - name: llm-gw
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: client-buffer-limit
spec:
  connection:
    bufferLimit: 20Mi   # raise the connection buffer limit to accommodate large LLM request bodies
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 24h   # LLM inference can run for a long time; extend the request timeout
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
Create inference-route.yaml.
In the routing rules specified by the HTTPRoute below, the model name in the request body is automatically parsed into the X-Gateway-Model-Name request header.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: qwen-pool
      weight: 1
    matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: qwen
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: deepseek-pool
      weight: 1
    matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: deepseek-r1
Deploy the gateway and gateway rules.
kubectl apply -f inference-gateway.yaml
kubectl apply -f inference-route.yaml
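Optionally, check that the gateway has been programmed and assigned an address before testing:
kubectl get gateway inference-gateway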
Step 4: Verify the gateway
Obtain the gateway IP address.
export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
Send a request to the qwen model.
curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen",
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "who are you?"
      }
    ]
  }'
Expected output:
{"id":"chatcmpl-475bc88d-b71d-453f-8f8e-0601338e11a9","object":"chat.completion","created":1748311216,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"I am Qwen, a large language model created by Alibaba Cloud. I am here to assist you with any questions or conversations you might have! How can I help you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":33,"total_tokens":70,"completion_tokens":37,"prompt_tokens_details":null},"prompt_logprobs":null}
Send a request to the deepseek-r1 model.
curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-r1",
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "who are you?"
      }
    ]
  }'
Expected output:
{"id":"chatcmpl-9a143fc5-8826-46bc-96aa-c677d130aef9","object":"chat.completion","created":1748312185,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Alright, someone just asked, \"who are you?\" Hmm, I need to explain who I am in a clear and friendly way. Let's see, I'm an AI created by DeepSeek, right? I don't have a physical form, so I don't have a \"name\" like you do. My purpose is to help with answering questions and providing information. I'm here to assist with a wide range of topics, from general knowledge to more specific inquiries. I understand that I can't do things like think or feel, but I'm here to make your day easier by offering helpful responses. So, I'll keep it simple and approachable, making sure to convey that I'm here to help with whatever they need.\n</think>\n\nI'm DeepSeek-R1-Lite-Preview, an AI assistant created by the Chinese company DeepSeek. I'm here to help you with answering questions, providing information, and offering suggestions. I don't have personal experiences or emotions, but I'm designed to make your interactions with me as helpful and pleasant as possible. How can I assist you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":9,"total_tokens":232,"completion_tokens":223,"prompt_tokens_details":null},"prompt_logprobs":null}
As the outputs show, both inference services are serving traffic normally, and external requests are routed to different inference services based on the model name in the request.
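To further confirm which backend handled a request, you can inspect the vLLM pod logs by label, for example:
kubectl logs -l app=qwen --tail=20
kubectl logs -l app=deepseek-r1 --tail=20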