The Gateway with Inference Extension component lets you configure circuit breaking rules while intelligent load balancing is enabled for inference services. When a service becomes abnormal, circuit breaking automatically cuts off connections to the faulty service and prevents the failure from spreading. This topic describes how to use Gateway with Inference Extension to configure traffic circuit breaking rules for an inference service.
Before you read this topic, make sure that you are familiar with the concepts of InferencePool and InferenceModel.
Prerequisites
An ACK managed cluster with a GPU node pool has been created. Alternatively, you can install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU compute power.
Gateway with Inference Extension has been installed with the Enable Gateway API inference extension option selected. For the installation entry point, see Install components.
For the image used in this topic, the A10 GPU type is recommended for ACK clusters, and the L20 (GN8IS) GPU type is recommended for ACS GPU compute power.
In addition, because the LLM image is large, we recommend that you transfer it to ACR in advance and pull it over the internal network. Pulling directly from the public internet depends on the bandwidth of the cluster's EIP and can take a long time.
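As an illustration only (the public image name, ACR namespace, and region below are placeholders, not values from this topic), transferring an image to ACR generally consists of pulling, re-tagging, and pushing:
# Placeholders: replace the public image and the ACR internal (VPC) address with your own.
docker pull <public-registry>/<llm-image>:<tag>
docker tag  <public-registry>/<llm-image>:<tag> registry-vpc.<region>.aliyuncs.com/<namespace>/<llm-image>:<tag>
docker push registry-vpc.<region>.aliyuncs.com/<namespace>/<llm-image>:<tag>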
Workflow
This example deploys the following resources:
The inference service vllm-llama2-7b-pool (the APP in the figure below).
A gateway whose Service type is ClusterIP.
An HTTPRoute resource that configures the traffic forwarding rules, together with a circuit breaking rule that limits the number of pending requests to 1 (that is, a maximum of 1 concurrent request).
To simplify the demonstration, this topic uses a small concurrency limit. Adjust it as needed in production environments.
An InferencePool and the corresponding InferenceModel resource, which enable intelligent load balancing for the APP.
A sleep application that serves as the test client.
After traffic circuit breaking is configured, the request path works as follows:
The client sends request ①, and sends request ② before the response to ① is returned.
The circuit breaking rule finds no pending requests before ①, so ① is forwarded to the APP.
The circuit breaking rule finds that request ① is still being processed when ② arrives, so ② is blocked immediately and circuit breaking information ③ is returned to the client (in this example, the request is rejected).
After the APP finishes processing ①, response ④ is returned to the client.
Procedure
Deploy the sample inference service vllm-llama2-7b-pool.
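This topic does not include the full manifest of the inference service. The following is a minimal sketch only: the image address, command, and GPU resource values are placeholders (the model files under /model/llama2 would also have to be mounted, which is omitted here). The points that matter for the rest of this topic are that the Pod labels match the InferencePool selector (app: vllm-llama2-7b-pool), the container listens on port 8000, and the model is served under the name /model/llama2.
# vllm-deployment-sketch.yaml (minimal sketch; image, command, and resources are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool           # must match the InferencePool selector
  template:
    metadata:
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
      - name: vllm
        image: <your-acr-address>/<vllm-image>:<tag>   # placeholder; use your ACR internal address
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=/model/llama2            # served model name referenced by the InferenceModel
        - --port=8000                      # must match targetPortNumber in the InferencePool
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"            # placeholder GPU request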
Deploy the InferencePool and InferenceModel resources.
# =============================================================
# inference_rules.yaml
# =============================================================
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: /model/llama2
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
  - name: /model/llama2
    weight: 100
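You can apply the manifest and confirm that the resources were created with commands like the following:
kubectl apply -f inference_rules.yaml
kubectl get inferencepool vllm-llama2-7b-pool
kubectl get inferencemodel inferencemodel-sample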
Deploy the Gateway and HTTPRoute, and configure the circuit breaking rule.
The gateway's Service type is ClusterIP, so it can only be accessed from within the cluster. You can change it to LoadBalancer as needed.
# =============================================================
# gateway.yaml
# =============================================================
kind: GatewayClass
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: example-gateway-class
  labels:
    example: http-routing
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  labels:
    example: http-routing
  name: example-gateway
  namespace: default
spec:
  gatewayClassName: example-gateway-class
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: custom-proxy-config
  listeners:
  - allowedRoutes:
      namespaces:
        from: Same
    name: http
    port: 80
    protocol: HTTP
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: ClusterIP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: test-httproute
  labels:
    example: http-routing
spec:
  parentRefs:
  - name: example-gateway
  hostnames:
  - "example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama2-7b-pool
      weight: 1
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: circuitbreaker-for-route
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: test-httproute
  circuitBreaker:
    maxPendingRequests: 1
    maxParallelRequests: 1   # Limit concurrent requests to 1
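You can apply the manifest and check the gateway status with commands like the following:
kubectl apply -f gateway.yaml
# Wait until PROGRAMMED is True and an address is assigned.
kubectl get gateway example-gateway
kubectl get httproute test-httproute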
Deploy the sleep application.
# =============================================================
# sleep.yaml
# =============================================================
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
    service: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: sleep
      containers:
      - name: sleep
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
        command: ["/bin/sleep", "infinity"]
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /etc/sleep/tls
          name: secret-volume
      volumes:
      - name: secret-volume
        secret:
          secretName: sleep-secret
          optional: true
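You can apply the manifest and wait for the test client to become ready with commands like the following:
kubectl apply -f sleep.yaml
kubectl wait --for=condition=Available deployment/sleep --timeout=120s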
Verify the traffic circuit breaking configuration.
To simplify the demonstration, this topic uses a small concurrency limit. Adjust it as needed in production environments.
Obtain the gateway address.
export GATEWAY_ADDRESS=$(kubectl get gateway/example-gateway -o jsonpath='{.status.addresses[0].value}')
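You can print the variable to confirm that an address was obtained:
echo ${GATEWAY_ADDRESS}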
Open two terminal windows. Send a test request in window 1, and send another request from window 2 before the first request returns.
kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H "host: example.com" \
  -d '{
    "model": "/model/llama2",
    "max_completion_tokens": 100,
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "introduce yourself"
      }
    ]
  }'
Expected output in window 1:
{"id":"chatcmpl-eb67bf29-1f87-4e29-8c3e-a83f3c74cd87","object":"chat.completion","created":1745207283,"model":"/model/llama2","choices":[{"index":0,"message":{"role":"assistant","content":"\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n ","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":115,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}%
Expected output in window 2:
upstream connect error or disconnect/reset before headers. reset reason: overflow
As you can see, after the circuit breaking rule is configured, once the number of concurrent requests exceeds the configured limit of 1, requests sent later trigger circuit breaking.
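In production, you would normally raise these limits. The following is a sketch only (the values are examples, not recommendations) that shows how the same BackendTrafficPolicy can be adjusted:
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: circuitbreaker-for-route
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: test-httproute
  circuitBreaker:
    maxPendingRequests: 64    # example value; size to your expected request queue depth
    maxParallelRequests: 32   # example value; size to your backend's concurrent capacity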