Use Gateway with Inference Extension to implement model-name-based routing for inference services

After you deploy a generative AI inference service that uses the OpenAI API format, the Gateway with Inference Extension component lets you specify routing policies based on the model name in the request, including traffic canarying, traffic mirroring, and circuit breaking. This topic describes how to use Gateway with Inference Extension to implement model-name-based routing for inference services.

Important
  • Before reading this topic, make sure that you are familiar with the concepts of InferencePool and InferenceModel.

  • This topic requires Gateway with Inference Extension v1.4.0 or later.

Background information

OpenAI-compatible API

An OpenAI-compatible API is an inference service API for generative large language models (LLMs) that is highly compatible with the official OpenAI API (such as GPT-3.5 and GPT-4) in its interface, parameters, and response format. The compatibility typically covers the following aspects:

  • Interface structure: the same HTTP request methods (such as POST), endpoint formats, and authentication methods (such as API keys).

  • Parameter support: parameters similar to those of the OpenAI API, such as model, prompt, temperature, and max_tokens.

  • Response format: the same JSON structure as OpenAI responses, including fields such as choices, usage, and id.

Today, mainstream third-party LLM services and mainstream LLM inference engines such as vLLM and SGLang all provide OpenAI-compatible APIs, which keeps the migration and usage experience consistent for users.
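
For illustration, the following is a minimal sketch of a typical OpenAI-compatible chat completion request; the service address is a placeholder, and the response carries the OpenAI-style fields listed above:

    # A typical OpenAI-compatible chat completion request.
    # http://<llm-service-address> is a placeholder for the actual endpoint.
    curl -X POST http://<llm-service-address>/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model": "qwen", "messages": [{"role": "user", "content": "hello"}]}'

    # The response is a JSON object with OpenAI-style fields such as "id",
    # "choices" (the generated messages), and "usage" (token statistics).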

Scenario

For a generative AI inference service, the model name in a request is important metadata, and specifying routing policies based on the model name is a common scenario when exposing inference services through a gateway. However, for LLM inference services that provide an OpenAI-compatible API, the model name is located in the request body, and ordinary routing policies do not support routing based on the request body.

Gateway with Inference Extension supports specifying routing policies based on the model name for OpenAI-compatible APIs. It parses and extracts the model name from the request body and attaches it to the request headers, providing out-of-the-box model-name-based routing. For example, a request whose body contains "model": "qwen" is routed as if it carried the request header X-Gateway-Model-Name: qwen. To use this capability, you only need to match the X-Gateway-Model-Name request header in an HTTPRoute resource; no client-side changes are required.

The example in this topic demonstrates how to route requests to two inference services, Qwen-2.5-7B-Instruct and DeepSeek-R1-Distill-Qwen-7B, on the same gateway instance based on the model name in the request: requests for the qwen model are routed to the qwen inference service, and requests for the deepseek-r1 model are routed to the deepseek-r1 service. The main routing flow is as follows:

(Routing flow: client request → gateway parses the model name from the request body and sets the X-Gateway-Model-Name header → HTTPRoute header match → qwen-pool or deepseek-pool.)

Prerequisites

Note

For the images used in this topic, the A10 GPU type is recommended for ACK clusters, and the L20 (GN8IS) GPU type is recommended for ACS GPU compute power.

In addition, because LLM images are large, we recommend that you transfer them to ACR in advance and pull them over the internal network. Pulling directly from the public internet can take a long time, depending on the bandwidth configuration of the cluster's EIP.
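
As a minimal sketch, you can transfer one of the sample images to your own ACR repository as follows; the registry-vpc address and <your-namespace> below are placeholders that you must replace with your actual ACR instance endpoint and namespace:

    # Pull the public image once over the internet.
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1

    # Retag and push it to your own ACR repository (placeholder address).
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 \
      registry-vpc.cn-hangzhou.aliyuncs.com/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1
    docker push registry-vpc.cn-hangzhou.aliyuncs.com/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1

After the push completes, update the image fields in the YAML in the following steps to the internal address.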

Procedure

Step 1: Deploy the sample inference services

  1. Create a file named vllm-service.yaml.


    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: qwen
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
            alibabacloud.com/compute-class: gpu
            alibabacloud.com/compute-qos: default
            alibabacloud.com/gpu-model-series: GN8IS
        spec:
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --trust-remote-code --served-model-name qwen --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
            imagePullPolicy: IfNotPresent
            name: custom-serving
            ports:
            - containerPort: 8000
              name: http
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8000
              timeoutSeconds: 1
            resources:
              limits:
                cpu: "8"
                memory: 30G
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 30Gi
            name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
      - name: http-serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: qwen
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: deepseek-r1
      name: deepseek-r1
    spec:
      progressDeadlineSeconds: 600
      replicas: 1 
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: deepseek-r1
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: deepseek-r1
            alibabacloud.com/compute-class: gpu
            alibabacloud.com/compute-qos: default
            alibabacloud.com/gpu-model-series: GN8IS
        spec:
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8000 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ds-r1-qwen-7b-without-lora:v0.1
            imagePullPolicy: IfNotPresent
            name: custom-serving
            ports:
            - containerPort: 8000
              name: restful
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8000
              timeoutSeconds: 1
            resources:
              limits:
                cpu: "8"
                memory: 30G
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 30Gi
            name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: deepseek-r1
      name: deepseek-r1
    spec:
      ports:
      - name: http-serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: deepseek-r1
  2. Deploy the sample inference services.

    kubectl apply -f vllm-service.yaml
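
    Before continuing, you can verify that both Deployments are ready. The following is a quick check, assuming the resources were created in the default namespace:

    # Wait for both inference Deployments to become available (model loading can take a while).
    kubectl wait --for=condition=Available deployment/qwen deployment/deepseek-r1 --timeout=20m

    # Confirm that the Pods are Running and Ready.
    kubectl get pods -l 'app in (qwen, deepseek-r1)'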

Step 2: Deploy the inference routes

In this step, you create the InferencePool and InferenceModel resources.

  1. Create a file named inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-pool
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: qwen-ext-proc
      selector:
        app: qwen
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen
    spec:
      criticality: Critical
      modelName: qwen
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: qwen
        weight: 100
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: deepseek-pool
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: deepseek-ext-proc
      selector:
        app: deepseek-r1
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: deepseek-r1
    spec:
      criticality: Critical
      modelName: deepseek-r1
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: deepseek-pool
      targetModels:
      - name: deepseek-r1
        weight: 100
  2. Deploy the inference routes.

    kubectl apply -f inference-pool.yaml
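
    To confirm that the routing resources were created, you can list them. This is a quick check, assuming the InferencePool and InferenceModel CRDs were installed with the component:

    # Both pools and both models should be listed.
    kubectl get inferencepools,inferencemodels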

Step 3: Deploy the gateway and gateway routing rules

  1. Create a file named inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 8080
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: ClientTrafficPolicy
    metadata:
      name: client-buffer-limit
    spec:
      connection:
        bufferLimit: 20Mi
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: inference-gateway
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: inference-gateway
  2. Create a file named inference-route.yaml.

    In the routing rules specified by this HTTPRoute, the model name in the request body is automatically parsed into the X-Gateway-Model-Name request header.

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen-pool
          weight: 1
        matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: qwen
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: deepseek-pool
          weight: 1
        matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: deepseek-r1
  3. Deploy the gateway and the gateway rules.

    kubectl apply -f inference-gateway.yaml
    kubectl apply -f inference-route.yaml
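
    Before sending requests, you can check that the gateway has been programmed and assigned an address. This is a quick check based on the standard Gateway API status conditions:

    # Wait until the controller has programmed the gateway.
    kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=5m

    # The ADDRESS column shows the gateway IP once it is ready.
    kubectl get gateway inference-gateway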

Step 4: Verify the gateway

  1. Obtain the gateway IP address.

    export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  2. Send a request to the qwen model.

    curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
        "model": "qwen",
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "who are you?" 
          }
        ]
    }'

    Expected output:

    {"id":"chatcmpl-475bc88d-b71d-453f-8f8e-0601338e11a9","object":"chat.completion","created":1748311216,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"I am Qwen, a large language model created by Alibaba Cloud. I am here to assist you with any questions or conversations you might have! How can I help you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":33,"total_tokens":70,"completion_tokens":37,"prompt_tokens_details":null},"prompt_logprobs":null}
  3. Send a request to the deepseek-r1 model.

    curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
        "model": "deepseek-r1",
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "who are you?" 
          }
        ]
    }'

    Expected output:

    {"id":"chatcmpl-9a143fc5-8826-46bc-96aa-c677d130aef9","object":"chat.completion","created":1748312185,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Alright, someone just asked, \"who are you?\" Hmm, I need to explain who I am in a clear and friendly way. Let's see, I'm an AI created by DeepSeek, right? I don't have a physical form, so I don't have a \"name\" like you do. My purpose is to help with answering questions and providing information. I'm here to assist with a wide range of topics, from general knowledge to more specific inquiries. I understand that I can't do things like think or feel, but I'm here to make your day easier by offering helpful responses. So, I'll keep it simple and approachable, making sure to convey that I'm here to help with whatever they need.\n</think>\n\nI'm DeepSeek-R1-Lite-Preview, an AI assistant created by the Chinese company DeepSeek. I'm here to help you with answering questions, providing information, and offering suggestions. I don't have personal experiences or emotions, but I'm designed to make your interactions with me as helpful and pleasant as possible. How can I assist you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":9,"total_tokens":232,"completion_tokens":223,"prompt_tokens_details":null},"prompt_logprobs":null}

    As shown above, both inference services are serving traffic normally, and external requests are routed to the appropriate inference service based on the model name in the request body.
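
    As an additional check, a request for a model name that matches neither HTTPRoute rule should not reach either backend. The following is a sketch, assuming no default route is configured on the gateway; in that case the gateway typically returns a 404 for the unmatched request:

    # "unknown-model" matches neither header rule, so no backend is selected.
    curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
        "model": "unknown-model",
        "messages": [{"role": "user", "content": "who are you?"}]
    }'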