Use Gateway with Inference Extension to implement model-name-based routing for inference services

After you deploy a generative AI inference service that uses the OpenAI API format, the Gateway with Inference Extension component lets you specify routing policies based on the model name in the request, including traffic canarying, traffic mirroring, and circuit breaking. This topic describes how to use Gateway with Inference Extension to implement model-name-based routing for inference services.

Important
  • Before reading this topic, make sure that you are familiar with the concepts of InferencePool and InferenceModel.

  • This topic requires Gateway with Inference Extension v1.4.0 or later.

Background information

OpenAI-compatible API

An OpenAI-compatible API is an inference service API for generative large language models (LLMs) that is highly compatible with the official OpenAI API (such as GPT-3.5 and GPT-4) in its interface, parameters, and response format. The compatibility typically covers the following aspects:

  • Interface structure: the same HTTP request methods (such as POST), endpoint formats, and authentication methods (such as API keys).

  • Parameter support: parameters similar to those of the OpenAI API, such as model, prompt, temperature, and max_tokens.

  • Response format: the same JSON structure as OpenAI responses, including fields such as choices, usage, and id.

Today, mainstream third-party LLM services and mainstream LLM inference engines such as vLLM and SGLang all provide OpenAI-compatible APIs, which keeps the migration and usage experience consistent for users.
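
For illustration, the following is a minimal sketch of a typical OpenAI-compatible chat completion request; the service address is a placeholder, and the response carries the OpenAI-style fields listed above:

    # A typical OpenAI-compatible chat completion request.
    # http://<llm-service-address> is a placeholder for the actual endpoint.
    curl -X POST http://<llm-service-address>/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model": "qwen", "messages": [{"role": "user", "content": "hello"}]}'

    # The response is a JSON object with OpenAI-style fields such as "id",
    # "choices" (the generated messages), and "usage" (token statistics).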

Scenario

For a generative AI inference service, the model name in a request is important metadata, and specifying routing policies based on the model name is a common scenario when exposing inference services through a gateway. However, for LLM inference services that provide an OpenAI-compatible API, the model name is located in the request body, and ordinary routing policies do not support routing based on the request body.

Gateway with Inference Extension supports specifying routing policies based on the model name for OpenAI-compatible APIs. It parses and extracts the model name from the request body and attaches it to the request headers, providing out-of-the-box model-name-based routing. For example, a request whose body contains "model": "qwen" is routed as if it carried the request header X-Gateway-Model-Name: qwen. To use this capability, you only need to match the X-Gateway-Model-Name request header in an HTTPRoute resource; no client-side changes are required.

The example in this topic demonstrates how to route requests to two inference services, Qwen-2.5-7B-Instruct and DeepSeek-R1-Distill-Qwen-7B, on the same gateway instance based on the model name in the request: requests for the qwen model are routed to the qwen inference service, and requests for the deepseek-r1 model are routed to the deepseek-r1 service. The main routing flow is as follows:

(Routing flow: client request → gateway parses the model name from the request body and sets the X-Gateway-Model-Name header → HTTPRoute header match → qwen-pool or deepseek-pool.)

Prerequisites

Note

For the images used in this topic, the A10 GPU type is recommended for ACK clusters, and the L20 (GN8IS) GPU type is recommended for ACS GPU compute power.

In addition, because LLM images are large, we recommend that you transfer them to ACR in advance and pull them over the internal network. Pulling directly from the public internet can take a long time, depending on the bandwidth configuration of the cluster's EIP.
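
As a minimal sketch, you can transfer one of the sample images to your own ACR repository as follows; the registry-vpc address and <your-namespace> below are placeholders that you must replace with your actual ACR instance endpoint and namespace:

    # Pull the public image once over the internet.
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1

    # Retag and push it to your own ACR repository (placeholder address).
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 \
      registry-vpc.cn-hangzhou.aliyuncs.com/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1
    docker push registry-vpc.cn-hangzhou.aliyuncs.com/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1

After the push completes, update the image fields in the YAML in the following steps to the internal address.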

Procedure

Step 1: Deploy the sample inference services

  1. Create a file named vllm-service.yaml.


    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: qwen
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
            alibabacloud.com/compute-class: gpu
            alibabacloud.com/compute-qos: default
            alibabacloud.com/gpu-model-series: GN8IS
        spec:
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --trust-remote-code --served-model-name qwen --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
            imagePullPolicy: IfNotPresent
            name: custom-serving
            ports:
            - containerPort: 8000
              name: http
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8000
              timeoutSeconds: 1
            resources:
              limits:
                cpu: "8"
                memory: 30G
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 30Gi
            name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
      - name: http-serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: qwen
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: deepseek-r1
      name: deepseek-r1
    spec:
      progressDeadlineSeconds: 600
      replicas: 1 
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: deepseek-r1
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: deepseek-r1
            alibabacloud.com/compute-class: gpu
            alibabacloud.com/compute-qos: default
            alibabacloud.com/gpu-model-series: GN8IS
        spec:
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8000 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ds-r1-qwen-7b-without-lora:v0.1
            imagePullPolicy: IfNotPresent
            name: custom-serving
            ports:
            - containerPort: 8000
              name: restful
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8000
              timeoutSeconds: 1
            resources:
              limits:
                cpu: "8"
                memory: 30G
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 30Gi
            name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: deepseek-r1
      name: deepseek-r1
    spec:
      ports:
      - name: http-serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: deepseek-r1
  2. Deploy the sample inference services.

    kubectl apply -f vllm-service.yaml
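
    Before continuing, you can verify that both Deployments are ready. The following is a quick check, assuming the resources were created in the default namespace:

    # Wait for both inference Deployments to become available (model loading can take a while).
    kubectl wait --for=condition=Available deployment/qwen deployment/deepseek-r1 --timeout=20m

    # Confirm that the Pods are Running and Ready.
    kubectl get pods -l 'app in (qwen, deepseek-r1)'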

Step 2: Deploy the inference routes

In this step, you create the InferencePool and InferenceModel resources.

  1. Create a file named inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-pool
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: qwen-ext-proc
      selector:
        app: qwen
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen
    spec:
      criticality: Critical
      modelName: qwen
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: qwen
        weight: 100
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: deepseek-pool
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: deepseek-ext-proc
      selector:
        app: deepseek-r1
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: deepseek-r1
    spec:
      criticality: Critical
      modelName: deepseek-r1
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: deepseek-pool
      targetModels:
      - name: deepseek-r1
        weight: 100
  2. Deploy the inference routes.

    kubectl apply -f inference-pool.yaml
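
    To confirm that the routing resources were created, you can list them. This is a quick check, assuming the InferencePool and InferenceModel CRDs were installed with the component:

    # Both pools and both models should be listed.
    kubectl get inferencepools,inferencemodels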

Step 3: Deploy the gateway and gateway routing rules

  1. Create a file named inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 8080
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: ClientTrafficPolicy
    metadata:
      name: client-buffer-limit
    spec:
      connection:
        bufferLimit: 20Mi
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: inference-gateway
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: inference-gateway
  2. Create a file named inference-route.yaml.

    In the routing rules specified by this HTTPRoute, the model name in the request body is automatically parsed into the X-Gateway-Model-Name request header.

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen-pool
          weight: 1
        matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: qwen
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: deepseek-pool
          weight: 1
        matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: deepseek-r1
  3. Deploy the gateway and the gateway rules.

    kubectl apply -f inference-gateway.yaml
    kubectl apply -f inference-route.yaml
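
    Before sending requests, you can check that the gateway has been programmed and assigned an address. This is a quick check based on the standard Gateway API status conditions:

    # Wait until the controller has programmed the gateway.
    kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=5m

    # The ADDRESS column shows the gateway IP once it is ready.
    kubectl get gateway inference-gateway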

Step 4: Verify the gateway

  1. Obtain the gateway IP address.

    export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  2. Send a request to the qwen model.

    curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
        "model": "qwen",
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "who are you?" 
          }
        ]
    }'

    Expected output:

    {"id":"chatcmpl-475bc88d-b71d-453f-8f8e-0601338e11a9","object":"chat.completion","created":1748311216,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"I am Qwen, a large language model created by Alibaba Cloud. I am here to assist you with any questions or conversations you might have! How can I help you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":33,"total_tokens":70,"completion_tokens":37,"prompt_tokens_details":null},"prompt_logprobs":null}
  3. Send a request to the deepseek-r1 model.

    curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
        "model": "deepseek-r1",
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "who are you?" 
          }
        ]
    }'

    Expected output:

    {"id":"chatcmpl-9a143fc5-8826-46bc-96aa-c677d130aef9","object":"chat.completion","created":1748312185,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Alright, someone just asked, \"who are you?\" Hmm, I need to explain who I am in a clear and friendly way. Let's see, I'm an AI created by DeepSeek, right? I don't have a physical form, so I don't have a \"name\" like you do. My purpose is to help with answering questions and providing information. I'm here to assist with a wide range of topics, from general knowledge to more specific inquiries. I understand that I can't do things like think or feel, but I'm here to make your day easier by offering helpful responses. So, I'll keep it simple and approachable, making sure to convey that I'm here to help with whatever they need.\n</think>\n\nI'm DeepSeek-R1-Lite-Preview, an AI assistant created by the Chinese company DeepSeek. I'm here to help you with answering questions, providing information, and offering suggestions. I don't have personal experiences or emotions, but I'm designed to make your interactions with me as helpful and pleasant as possible. How can I assist you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":9,"total_tokens":232,"completion_tokens":223,"prompt_tokens_details":null},"prompt_logprobs":null}

    As shown above, both inference services are serving traffic normally, and external requests are routed to the appropriate inference service based on the model name in the request body.
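
    As an additional check, a request for a model name that matches neither HTTPRoute rule should not reach either backend. The following is a sketch, assuming no default route is configured on the gateway; in that case the gateway typically returns a 404 for the unmatched request:

    # "unknown-model" matches neither header rule, so no backend is selected.
    curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
        "model": "unknown-model",
        "messages": [{"role": "user", "content": "who are you?"}]
    }'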