Implement canary releases of base models and LoRA models by using Gateway with Inference Extension - Container Service for Kubernetes (ACK) - Alibaba Cloud

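With the Gateway with Inference Extension component, you can replace or upgrade the base model behind a generative AI inference service, or roll out updates to multiple LoRA models in stages, while keeping service interruption to a minimum. This topic describes how to use Gateway with Inference Extension to perform a progressive canary release of a generative AI inference service.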

Important

Before you read this topic, make sure that you understand the concepts of InferencePool and InferenceModel.

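Prerequisites

An ACK managed cluster with a GPU node pool is created. Alternatively, you can install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU computing power.
Gateway with Inference Extension is installed with the Enable Gateway API Inference Extension option selected. For the entry point, see Install components.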

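Preparations

Before you try the progressive canary release, deploy and verify the sample inference service.

Deploy the sample inference service that is based on the Qwen-2.5-7B-Instruct base model.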


kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: custom-serving
    release: qwen
  name: qwen
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: custom-serving
      release: qwen
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      labels:
        app: custom-serving
        release: qwen
    spec:
      containers:
      - command:
        - sh
        - -c
        - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --trust-remote-code --served-model-name mymodel --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
        imagePullPolicy: IfNotPresent
        name: custom-serving
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 30
          successThreshold: 1
          tcpSocket:
            port: 8000
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 30Gi
        name: dshm
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: custom-serving
    release: qwen
  name: qwen
spec:
  ports:
  - name: http-serving
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: custom-serving
    release: qwen
EOF

Deploy the InferencePool and InferenceModel resources.

kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: mymodel-pool-v1
  namespace: default
spec:
  extensionRef:
    group: ""
    kind: Service
    name: mymodel-v1-ext-proc
  selector:
    app: custom-serving
    release: qwen
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: mymodel-v1
spec:
  criticality: Critical
  modelName: mymodel
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: mymodel-pool-v1
  targetModels:
  - name: mymodel
    weight: 100
EOF

Deploy the gateway and the gateway routing rules.


kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
    - name: llm-gw
      protocol: HTTP
      port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: mymodel-pool-v1
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /v1/completions
    - path:
        type: PathPrefix
        value: /v1/chat/completions
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: client-buffer-limit
spec:
  connection:
    bufferLimit: 20Mi
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: inference-gateway
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 24h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
EOF

Obtain the gateway IP address.

export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
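
If the Gateway has not been assigned an address yet, GATEWAY_IP ends up empty and the later curl commands fail with a confusing error. As a minimal sketch, the following guard stops early in that case. The helper name require_gateway_ip is hypothetical and is not part of the component.

```shell
# Hypothetical helper: fail fast when GATEWAY_IP was not populated.
require_gateway_ip() {
  if [ -z "${GATEWAY_IP:-}" ]; then
    echo "GATEWAY_IP is empty; check 'kubectl get gateway inference-gateway'" >&2
    return 1
  fi
  echo "Using gateway address: ${GATEWAY_IP}"
}
```

Call it right after the export, for example `require_gateway_ip || exit 1` in a script.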

Verify the inference service.

curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
    "model": "mymodel",
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "你是谁？" 
      }
    ]
}'

Expected output:

{"id":"chatcmpl-6bd37f84-55e0-4278-8f16-7b7bf04c6513","object":"chat.completion","created":1744364930,"model":"mymodel","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"我是Qwen，由阿里云开发的人工智能模型。我被设计用来提供信息、回答问题和进行各种对话任务。如果您有任何问题或需要帮助，都可以尝试和我交流！","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":32,"total_tokens":74,"completion_tokens":42,"prompt_tokens_details":null},"prompt_logprobs":null}

The expected output indicates that the inference service is reachable through Gateway with Inference Extension and responds as expected.

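Scenario 1: Canary release of infrastructure and base models by updating InferencePool

In practice, you can implement a canary release of a model service by updating InferencePool resources. For example, you can configure two InferencePools that share the same InferenceModel definition and the same model name but run on different compute configurations, GPU types, or base models. This approach suits the following scenarios.

Infrastructure canary update: Create a new InferencePool that uses a new GPU type or a new model configuration, and migrate workloads to it step by step. This lets you upgrade node hardware, update drivers, or fix security issues without interrupting inference traffic.
Base model canary update: Create a new InferencePool that loads a new model architecture or fine-tuned model weights, and bring the new inference model online step by step to improve inference performance or fix issues related to the base model.

The main canary update flow is as follows.

Create a new InferencePool for the new base model, and configure the HTTPRoute to split traffic between the InferencePools by weight. You can then shift traffic step by step to the inference service behind the new InferencePool and update the base model without interruption. The following steps show how to update the deployed Qwen-2.5-7B-Instruct base model service to DeepSeek-R1-Distill-Qwen-7B through a canary release. To complete the switch to the new base model, keep updating the traffic weights in the HTTPRoute.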

Deploy the inference service that is based on the DeepSeek-R1-Distill-Qwen-7B base model.


kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: custom-serving
    release: deepseek-r1
  name: deepseek-r1
spec:
  progressDeadlineSeconds: 600
  replicas: 1 
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: custom-serving
      release: deepseek-r1
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      labels:
        app: custom-serving
        release: deepseek-r1
    spec:
      containers:
      - command:
        - sh
        - -c
        - vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8000 --trust-remote-code --served-model-name mymodel --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ds-r1-qwen-7b-without-lora:v0.1
        imagePullPolicy: IfNotPresent
        name: custom-serving
        ports:
        - containerPort: 8000
          name: restful
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 30
          successThreshold: 1
          tcpSocket:
            port: 8000
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 30Gi
        name: dshm
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: custom-serving
    release: deepseek-r1
  name: deepseek-r1
spec:
  ports:
  - name: http-serving
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: custom-serving
    release: deepseek-r1
EOF

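Configure the InferencePool and InferenceModel for the new inference service. The InferencePool mymodel-pool-v2 uses a new label selector to select the inference service that is based on the DeepSeek-R1-Distill-Qwen-7B base model, and an InferenceModel with the same model name, mymodel, is declared for this pool.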

kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: mymodel-pool-v2
  namespace: default
spec:
  extensionRef:
    group: ""
    kind: Service
    name: mymodel-v2-ext-proc
  selector:
    app: custom-serving
    release: deepseek-r1
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: mymodel-v2
spec:
  criticality: Critical
  modelName: mymodel
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: mymodel-pool-v2
  targetModels:
  - name: mymodel
    weight: 100
EOF

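Configure the canary traffic policy.

Configure the HTTPRoute to split traffic between the existing InferencePool (mymodel-pool-v1) and the new InferencePool (mymodel-pool-v2). The weight field in backendRefs controls the share of traffic sent to each InferencePool. The following example sets the weights to 90 and 10 (a 9:1 ratio), so about 10% of traffic is forwarded to the DeepSeek-R1-Distill-Qwen-7B service behind mymodel-pool-v2.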

kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: mymodel-pool-v1
      port: 8000
      weight: 90
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: mymodel-pool-v2
      weight: 10
    matches:
    - path:
        type: PathPrefix
        value: /
EOF
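
To keep shifting traffic toward the new pool, you reapply the same HTTPRoute with different weights. The following is a minimal sketch of that step. The helper name render_route is hypothetical, and the manifest body is copied from the command above with only the two weights parameterized.

```shell
# Hypothetical helper: print the inference-route HTTPRoute with the given
# weight for mymodel-pool-v2; mymodel-pool-v1 receives the remainder of 100.
render_route() {
  new_weight="$1"
  # Accept only whole numbers from 0 to 100.
  case "$new_weight" in
    ''|*[!0-9]*) echo "usage: render_route <0-100>" >&2; return 1 ;;
  esac
  if [ "$new_weight" -gt 100 ]; then
    echo "usage: render_route <0-100>" >&2
    return 1
  fi
  old_weight=$((100 - new_weight))
  cat <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: mymodel-pool-v1
      port: 8000
      weight: ${old_weight}
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: mymodel-pool-v2
      weight: ${new_weight}
    matches:
    - path:
        type: PathPrefix
        value: /
EOF
}
```

For example, `render_route 50 | kubectl apply -f-` moves to an even split, and `render_route 100 | kubectl apply -f-` sends all traffic to mymodel-pool-v2. Running `render_route 50` on its own only prints the manifest, which is useful for review before applying.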

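Verify the base model canary release.

Run the following command repeatedly and use the model output to verify the canary release of the base model: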

curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
    "model": "mymodel",
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "你是谁？" 
      }
    ]
}'

Expected output for most requests:

{"id":"chatcmpl-6e361a5e-b0cb-4b57-8994-a293c5a9a6ad","object":"chat.completion","created":1744601277,"model":"mymodel","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"我是Qwen，由阿里云开发的人工智能模型。我被设计用来提供信息、回答问题和进行各种对话任务。如果您有任何问题或需要帮助，都可以尝试和我交流！","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":32,"total_tokens":74,"completion_tokens":42,"prompt_tokens_details":null},"prompt_logprobs":null}

Expected output for about 10% of requests:

{"id":"chatcmpl-9e3cda6e-b284-43a9-9625-2e8fcd1fe0c7","object":"chat.completion","created":1744601333,"model":"mymodel","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"您好！我是由中国的深度求索（DeepSeek）公司开发的智能助手DeepSeek-R1。如您有任何问题，我会尽我所能为您提供帮助。\n</think>\n\n您好！我是由中国的深度求索（DeepSeek）公司开发的智能助手DeepSeek-R1。如您有任何问题，我会尽我所能为您提供帮助。","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":81,"completion_tokens":73,"prompt_tokens_details":null},"prompt_logprobs":null}

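The output shows that most inference requests are still served by the old Qwen-2.5-7B-Instruct base model, and a small share of requests is served by the new DeepSeek-R1-Distill-Qwen-7B base model.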
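
Because both backends report the same model name, mymodel, the split has to be inferred from the response text. The following is a minimal sketch that counts responses per backend. The helper name tally_backends is hypothetical, and the matching is only a heuristic based on the self-introduction text in the sample outputs above.

```shell
# Hypothetical helper: read chat-completion responses (one JSON document per
# line) on stdin and count which backend most likely produced each one.
# Heuristic: look for the model's self-introduction seen in the sample outputs.
tally_backends() {
  awk '
    /DeepSeek/ { deepseek++; next }
    /Qwen/     { qwen++; next }
               { other++ }
    END { printf "qwen=%d deepseek=%d other=%d\n", qwen, deepseek, other }
  '
}

# Example usage against the gateway (requires GATEWAY_IP to be set):
# for i in $(seq 1 50); do
#   curl -s -X POST ${GATEWAY_IP}:8080/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model": "mymodel", "temperature": 0, "messages": [{"role": "user", "content": "你是谁？"}]}'
#   echo   # put each response on its own line
# done | tally_backends
```

With a 90/10 weighting, the counts only approximate a 9:1 ratio over a small number of requests, so the observed split varies from run to run.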

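Scenario 2: Canary release of LoRA models by configuring InferenceModel

In Multi-LoRA scenarios, Gateway with Inference Extension lets you deploy multiple versions of a LoRA model on the same base model at the same time and split traffic between them flexibly for canary testing. This way you can verify how each version performs in terms of performance optimization, bug fixes, or feature iterations.

The following example uses two LoRA versions fine-tuned from Qwen-2.5-7B-Instruct to show how to implement a canary release of LoRA models by using InferenceModel.

Before you start a canary release of a LoRA model, make sure that the new model version is already deployed to the inference service instances. In this example, the base service is started with two LoRA models, travel-helper-v1 and travel-helper-v2, already loaded.

By updating the traffic ratio between LoRA models in the InferenceModel, you can gradually increase the weight of the new LoRA version and move to the new LoRA model without interrupting traffic.

Deploy an InferenceModel that defines multiple versions of the LoRA model and specifies the traffic ratio between them. After the configuration takes effect, requests for the travelhelper model are split between the backend LoRA versions by weight. In this example the ratio is 9:1: 90% of traffic goes to travel-helper-v1 and 10% goes to travel-helper-v2.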

kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: loramodel
spec:
  criticality: Critical
  modelName: travelhelper
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: mymodel-pool-v1
  targetModels:
  - name: travel-helper-v1
    weight: 90
  - name: travel-helper-v2
    weight: 10
EOF
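
Increasing the share of the new LoRA version means reapplying the same InferenceModel with different weights. The following is a minimal sketch of that step. The helper name render_lora_model is hypothetical, and the manifest is copied from the command above. The sketch restricts the weight to 1 through 99 on the assumption that a weight of 0 may be rejected by CRD validation in some versions.

```shell
# Hypothetical helper: print the loramodel InferenceModel with the given
# weight for travel-helper-v2; travel-helper-v1 receives the remainder of 100.
render_lora_model() {
  v2_weight="$1"
  # Accept only whole numbers from 1 to 99 (assumption: 0 may be invalid).
  case "$v2_weight" in
    ''|*[!0-9]*) echo "usage: render_lora_model <1-99>" >&2; return 1 ;;
  esac
  if [ "$v2_weight" -lt 1 ] || [ "$v2_weight" -gt 99 ]; then
    echo "usage: render_lora_model <1-99>" >&2
    return 1
  fi
  v1_weight=$((100 - v2_weight))
  cat <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: loramodel
spec:
  criticality: Critical
  modelName: travelhelper
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: mymodel-pool-v1
  targetModels:
  - name: travel-helper-v1
    weight: ${v1_weight}
  - name: travel-helper-v2
    weight: ${v2_weight}
EOF
}
```

For example, `render_lora_model 50 | kubectl apply -f-` moves to an even split. To finish the rollout, one option is to apply a manifest whose targetModels lists only travel-helper-v2.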

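Verify the canary release.

Run the following command repeatedly and use the model output to verify the canary release of the LoRA models: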

curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
    "model": "travelhelper",
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "我刚来北京，帮我推荐个景点" 
      }
    ]
}'

Expected output for most requests:

{"id":"chatcmpl-2343f2ec-b03f-4882-a601-aca9e88d45ef","object":"chat.completion","created":1744602234,"model":"travel-helper-v1","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"北京是一个充满历史和文化的城市，有很多值得一游的景点。以下是一些推荐的景点：\n\n1. 故宫：这是中国最大的古代宫殿，也是世界上最大的古代木结构建筑群之一。你可以在这里了解中国古代的宫廷生活和历史。\n\n2. 长城：北京的长城是最著名的长城之一，你可以在这里欣赏到壮丽的山景和长城的雄伟。\n\n3. 天安门广场：这是世界上最大的城市广场，你可以在这里看到天安门城楼和人民英雄纪念碑。\n\n4. 颐和园：这是中国最大的皇家园林，你可以在这里欣赏到美丽的湖泊和山景，以及精美的建筑和雕塑。\n\n5. 北京动物园：如果你喜欢动物，这里有很多种类的动物，你可以在这里看到大熊猫、金丝猴等珍稀动物。\n\n6. 798艺术区：这是一个充满艺术气息的地方，有很多画廊、艺术工作室和咖啡馆，你可以在这里欣赏到各种艺术作品。\n\n7. 北京751D·PARK：这是一个集艺术、文化、科技于一体的创意园区，你可以在这里看到各种展览和活动。\n\n以上就是我为你推荐的北京景点，希望你会喜欢。","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":38,"total_tokens":288,"completion_tokens":250,"prompt_tokens_details":null},"prompt_logprobs":null}

Expected output for about 10% of requests:

{"id":"chatcmpl-c6df57e9-ff95-41d6-8b35-19978f40525f","object":"chat.completion","created":1744602223,"model":"travel-helper-v2","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"北京是一个充满历史和文化的城市，有很多值得一游的景点。以下是一些推荐的景点：\n\n1. 故宫：这是中国最大的古代宫殿建筑群，也是世界上保存最完整的古代皇宫之一。你可以在这里了解到中国古代的宫廷生活和历史。\n\n2. 长城：北京的长城段落是世界上最著名的长城之一，你可以在这里欣赏到壮丽的山景和长城的雄伟。\n\n3. 天安门广场：这是世界上最大的城市广场，你可以在这里看到庄严的人民英雄纪念碑和天安门城楼。\n\n4. 颐和园：这是中国最大的皇家园林，你可以在这里欣赏到精美的园林建筑和美丽的湖景。\n\n5. 北京动物园：如果你喜欢动物，这里是一个很好的选择。你可以看到各种各样的动物，包括大熊猫。\n\n6. 798艺术区：这是一个充满艺术气息的地方，你可以在这里看到各种各样的艺术展览和创意市集。\n\n希望这些建议对你有所帮助！","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":38,"total_tokens":244,"completion_tokens":206,"prompt_tokens_details":null},"prompt_logprobs":null}

The output shows that most inference requests are served by the travel-helper-v1 LoRA model, and a small share of requests is served by the travel-helper-v2 LoRA model.
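
In this scenario the response itself names the LoRA version in its model field, so the split can be counted directly. The following is a minimal sketch. The helper name tally_models is hypothetical, and it assumes compact JSON with no space after the colon, as in the sample outputs above.

```shell
# Hypothetical helper: read chat-completion responses (one JSON document per
# line) on stdin and print "<model>=<count>" for each value of the "model" field.
tally_models() {
  grep -o '"model":"[^"]*"' | cut -d'"' -f4 | sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Example usage against the gateway (requires GATEWAY_IP to be set):
# for i in $(seq 1 50); do
#   curl -s -X POST ${GATEWAY_IP}:8080/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model": "travelhelper", "temperature": 0, "messages": [{"role": "user", "content": "我刚来北京，帮我推荐个景点"}]}'
#   echo   # put each response on its own line
# done | tally_models
```

Over a small number of requests the counts only approximate the configured 9:1 ratio.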