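In addition to intelligent load balancing for inference services, the ACK Gateway with Inference Extension component supports traffic mirroring for inference requests. When you deploy a new inference model in a production environment, you can mirror production traffic to the new model to evaluate its behavior, and promote it only after its performance and stability meet your requirements. This topic describes how to use ACK Gateway with Inference Extension to mirror inference request traffic.
Before reading this topic, make sure that you are familiar with the InferencePool and InferenceModel concepts.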
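Prerequisites
An ACK managed cluster with a GPU node pool is created. Alternatively, you can install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU compute power; see the example of using ACS GPU compute power in ACK.
ACK Gateway with Inference Extension is installed, with Enable Gateway API Inference Extension selected. For the entry point, see Install components.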
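Workflow
The example in this topic deploys the following resources:
Two inference services, vllm-llama2-7b-pool and vllm-llama2-7b-pool-1 (referred to below as APP and APP1).
A gateway whose Service type is ClusterIP.
An HTTPRoute resource that configures the traffic forwarding and mirroring rules.
An InferencePool and its corresponding InferenceModel resource, which enable intelligent load balancing for APP.
A regular Service that fronts APP1. Intelligent load balancing is currently not supported for mirrored traffic, so a regular Service is required.
A sleep application that acts as the test client.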
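The traffic mirroring flow in this example is as follows.
The client sends a request to the gateway, and the HTTPRoute identifies production traffic by using a path prefix match rule.
After the rule matches:
Production traffic is forwarded as usual to the corresponding InferencePool and, after intelligent load balancing, to the backend APP.
The RequestMirror filter in the rule sends a copy of the request to the specified Service, which forwards the mirrored traffic to the backend APP1.
Both APP and APP1 return responses normally, but the gateway processes only the response returned from the InferencePool and ignores the response from the mirror service. The client sees only the result from the primary service.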
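Procedure
Deploy the sample inference services vllm-llama2-7b-pool and vllm-llama2-7b-pool-1.
This step provides only the YAML for vllm-llama2-7b-pool. The configuration of vllm-llama2-7b-pool-1 differs from that of vllm-llama2-7b-pool only in name, so modify the corresponding fields in the YAML yourself before deploying it.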
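Because the two manifests differ only in name, one way to produce the second is a plain text substitution. The following is a minimal sketch, not part of the original procedure. It assumes that the first manifest is saved as vllm-llama2-7b-pool.yaml (a hypothetical filename) and that every occurrence of the name should receive the -1 suffix. The helper name rename_pool is also illustrative. Review the generated file before applying it.

```shell
# Sketch: derive the vllm-llama2-7b-pool-1 manifest from the
# vllm-llama2-7b-pool manifest.
# Reads a manifest on stdin and writes the renamed manifest to stdout.
# Assumption: every occurrence of the name should get the "-1" suffix.
# Do not feed it a file that already contains "vllm-llama2-7b-pool-1",
# because that name would become "vllm-llama2-7b-pool-1-1".
rename_pool() {
  sed 's/vllm-llama2-7b-pool/vllm-llama2-7b-pool-1/g'
}

# Hypothetical usage (filenames are illustrative):
#   rename_pool < vllm-llama2-7b-pool.yaml > vllm-llama2-7b-pool-1.yaml
#   kubectl apply -f vllm-llama2-7b-pool.yaml -f vllm-llama2-7b-pool-1.yaml
```

A substitution keeps the two Deployments identical apart from their names and labels, which matches the Service selector app: vllm-llama2-7b-pool-1 used later in this topic.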
Deploy the InferencePool and InferenceModel resources, along with the Service for the vllm-llama2-7b-pool-1 application.
# =============================================================
# inference_rules.yaml
# =============================================================
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: /model/llama2
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
  - name: /model/llama2
    weight: 100
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool-1
spec:
  selector:
    app: vllm-llama2-7b-pool-1
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
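Deploy the Gateway and HTTPRoute.
The gateway's Service type is ClusterIP, so the gateway is reachable only from within the cluster. You can change the type to LoadBalancer as needed.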
# =============================================================
# gateway.yaml
# =============================================================
kind: GatewayClass
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: example-gateway-class
  labels:
    example: http-routing
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  labels:
    example: http-routing
  name: example-gateway
  namespace: default
spec:
  gatewayClassName: example-gateway-class
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: custom-proxy-config
  listeners:
  - allowedRoutes:
      namespaces:
        from: Same
    name: http
    port: 80
    protocol: HTTP
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: ClusterIP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mirror-route
  labels:
    example: http-routing
spec:
  parentRefs:
  - name: example-gateway
  hostnames:
  - "example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama2-7b-pool
      weight: 1
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          kind: Service
          name: vllm-llama2-7b-pool-1
          port: 8000
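After the Gateway and HTTPRoute manifest is applied, it can help to wait until the Gateway is ready before continuing. The following is a minimal sketch, not part of the original procedure. It assumes that the gateway controller sets the standard Gateway API Programmed condition on the Gateway. The helper name wait_for_gateway and the 120s default timeout are illustrative choices.

```shell
# Sketch: block until the Gateway reports the Programmed condition,
# or fail when the timeout expires.
# Assumption: the controller sets the standard Gateway API "Programmed"
# condition on the Gateway resource.
# $1: Gateway name (defaults to example-gateway from gateway.yaml)
# $2: timeout passed to kubectl wait (the 120s default is arbitrary)
wait_for_gateway() {
  gateway_name="${1:-example-gateway}"
  timeout="${2:-120s}"
  kubectl wait --for=condition=Programmed "gateway/${gateway_name}" --timeout="${timeout}"
}

# Hypothetical usage:
#   kubectl apply -f gateway.yaml
#   wait_for_gateway example-gateway 180s
```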
Deploy the sleep application.
# =============================================================
# sleep.yaml
# =============================================================
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
    service: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: sleep
      containers:
      - name: sleep
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
        command: ["/bin/sleep", "infinity"]
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /etc/sleep/tls
          name: secret-volume
      volumes:
      - name: secret-volume
        secret:
          secretName: sleep-secret
          optional: true
Verify traffic mirroring.
Obtain the gateway address.
export GATEWAY_ADDRESS=$(kubectl get gateway/example-gateway -o jsonpath='{.status.addresses[0].value}')
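If the Gateway has not been assigned an address yet, the jsonpath query returns an empty string and the later curl request fails in a confusing way. The following is a minimal sketch of a guard, not part of the original procedure. The helper name require_gateway_address is illustrative.

```shell
# Sketch: fail early if GATEWAY_ADDRESS is empty, which usually means
# the Gateway status does not list an address yet.
require_gateway_address() {
  if [ -z "${GATEWAY_ADDRESS:-}" ]; then
    echo "GATEWAY_ADDRESS is empty; the Gateway may not be ready yet" >&2
    return 1
  fi
  echo "Gateway address: ${GATEWAY_ADDRESS}"
}

# Hypothetical usage, after the export command:
#   require_gateway_address || exit 1
```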
Send a test request.
kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H "host: example.com" \
  -d '{
    "model": "/model/llama2",
    "max_completion_tokens": 100,
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "introduce yourself"
      }
    ]
  }'
Expected output:
{"id":"chatcmpl-eb67bf29-1f87-4e29-8c3e-a83f3c74cd87","object":"chat.completion","created":1745207283,"model":"/model/llama2","choices":[{"index":0,"message":{"role":"assistant","content":"\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n ","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":115,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}
View the application logs.
echo "original logs↓↓↓" && kubectl logs deployments/vllm-llama2-7b-pool | grep /v1/chat/completions | grep OK
echo "mirror logs↓↓↓" && kubectl logs deployments/vllm-llama2-7b-pool-1 | grep /v1/chat/completions | grep OK
Expected output:
original logs↓↓↓
INFO: 10.2.14.146:39478 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 10.2.14.146:60660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
mirror logs↓↓↓
INFO: 10.2.14.146:39742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 10.2.14.146:59976 - "POST /v1/chat/completions HTTP/1.1" 200 OK
The logs show requests in both vllm-llama2-7b-pool and vllm-llama2-7b-pool-1, which confirms that traffic mirroring is in effect.
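To compare the two services numerically rather than by eye, the successful requests in each log can be counted. The following is a minimal sketch, not part of the original procedure. The helper name count_ok_requests is illustrative, and the match pattern assumes the access log format shown in the expected output above.

```shell
# Sketch: count successful chat-completion requests in log text read
# from stdin. The pattern assumes the access log format shown in this
# topic's expected output.
count_ok_requests() {
  # grep -c prints the number of matching lines; "|| true" keeps the
  # function from returning a non-zero status when the count is 0.
  grep -c '"POST /v1/chat/completions HTTP/1.1" 200 OK' || true
}

# Hypothetical usage:
#   kubectl logs deployments/vllm-llama2-7b-pool   | count_ok_requests
#   kubectl logs deployments/vllm-llama2-7b-pool-1 | count_ok_requests
```

With one replica per Deployment, the two counts are expected to match once mirroring works. If a Deployment runs several replicas, kubectl logs on the Deployment reads from a single pod, so the counts may differ even when mirroring is working.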