使用ACK Gateway with Inference Extension实现推理服务的请求熔断

Gateway with Inference Extension组件支持在开启推理服务智能负载均衡的同时配置熔断规则。当服务出现异常时,熔断机制可以自动切断有问题的服务连接,防止故障蔓延。本文介绍如何使用Gateway with Inference Extension为推理服务配置流量熔断规则。

重要

阅读本文前,请确保您已经了解InferencePoolInferenceModel的相关概念。

前提条件

说明

本文使用的镜像推荐ACK集群使用A10卡型,ACS GPU算力推荐使用L20(GN8IS)卡型。

同时,由于LLM镜像体积较大,建议您提前转存到ACR,使用内网地址进行拉取。直接从公网拉取的速度取决于集群EIP的带宽配置,会有较长的等待时间。

操作流程

本文示例将部署以下资源:

  • 推理服务vllm-llama2-7b-pool(下图中的APP)。

  • Service类型为ClusterIP的网关。

  • HTTPRoute资源,配置了具体的流量转发规则和限制Pending请求数为1的熔断规则即(最大并发请求数为1)。

    本文为了方便演示使用了较小的并发数限制,在实际环境中请按需调整。
  • InferencePool和对应的InferenceModel资源,为APP开启智能负载均衡。

  • Sleep应用,作为测试客户端。

配置了以下为流量熔断的请求路径示意图。

image
  • 客户端发起请求①,在响应返回之前,再次发起请求②。

  • 熔断规则判断①之前没有待处理的请求,①被转发到APP。

  • 熔断规则判断②之前已经有请求①在处理,直接阻断请求,并向客户端返回熔断信息③(本例中为请求被拒绝)。

  • ①在APP被处理后,向客户端返回了响应④。

操作步骤

  1. 部署示例推理服务vllm-llama2-7b-pool。

    展开查看YAML内容

    # =============================================================
    # inference_app.yaml
    # =============================================================
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: chat-template
    data:
      llama-2-chat.jinja: |
        {% if messages[0]['role'] == 'system' %}
          {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
          {% set messages = messages[1:] %}
        {% else %}
            {% set system_message = '' %}
        {% endif %}
    
        {% for message in messages %}
            {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
            {% endif %}
    
            {% if loop.index0 == 0 %}
                {% set content = system_message + message['content'] %}
            {% else %}
                {% set content = message['content'] %}
            {% endif %}
            {% if message['role'] == 'user' %}
                {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
            {% elif message['role'] == 'assistant' %}
                {{ ' ' + content | trim + ' ' + eos_token }}
            {% endif %}
        {% endfor %}
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-llama2-7b-pool
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-llama2-7b-pool
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          labels:
            app: vllm-llama2-7b-pool
        spec:
          containers:
            - name: lora
              image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
              imagePullPolicy: IfNotPresent
              command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
              args:
              - "--model"
              - "/model/llama2"
              - "--tensor-parallel-size"
              - "1"
              - "--port"
              - "8000"
              - '--gpu_memory_utilization'
              - '0.8'
              - "--enable-lora"
              - "--max-loras"
              - "4"
              - "--max-cpu-loras"
              - "12"
              - "--lora-modules"
              - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
              - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
              - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
              - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
              - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
              - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
              - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
              - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
              - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
              - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
              - '--chat-template'
              - '/etc/vllm/llama-2-chat.jinja'
              env:
                - name: PORT
                  value: "8000"
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              livenessProbe:
                failureThreshold: 2400
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              readinessProbe:
                failureThreshold: 6000
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: 1
                requests:
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /data
                  name: data
                - mountPath: /dev/shm
                  name: shm
                - mountPath: /etc/vllm
                  name: chat-template
          restartPolicy: Always
          schedulerName: default-scheduler
          terminationGracePeriodSeconds: 30
          volumes:
            - name: data
              emptyDir: {}
            - name: shm
              emptyDir:
                medium: Memory
            - name: chat-template
              configMap:
                name: chat-template
  2. 部署InferencePoolInferenceModel资源。

    # =============================================================
    # inference_rules.yaml
    # =============================================================
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: vllm-llama2-7b-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: vllm-llama2-7b-pool
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-sample
    spec:
      modelName: /model/llama2
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-llama2-7b-pool
      targetModels:
      - name: /model/llama2
        weight: 100
  3. 部署GatewayHTTPRoute,并配置熔断规则。

    网关的Service类型是ClusterIP,只能从集群内访问。您可以根据实际需求修改为LoadBalancer。
    # =============================================================
    # gateway.yaml
    # =============================================================
    kind: GatewayClass
    apiVersion: gateway.networking.k8s.io/v1
    metadata:
      name: example-gateway-class
      labels:
        example: http-routing
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      labels:
        example: http-routing
      name: example-gateway
      namespace: default
    spec:
      gatewayClassName: example-gateway-class
      infrastructure:
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: custom-proxy-config
      listeners:
      - allowedRoutes:
          namespaces:
            from: Same
        name: http
        port: 80
        protocol: HTTP
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyProxy
    metadata:
      name: custom-proxy-config
      namespace: default
    spec:
      provider:
        type: Kubernetes
        kubernetes:
          envoyService:
            type: ClusterIP
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: test-httproute
      labels:
        example: http-routing
    spec:
      parentRefs:
        - name: example-gateway
      hostnames:
        - "example.com"
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /
          backendRefs:
          - group: inference.networking.x-k8s.io
            kind: InferencePool
            name: vllm-llama2-7b-pool
            weight: 1
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: circuitbreaker-for-route
    spec:
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: HTTPRoute
          name: test-httproute
      circuitBreaker:
        maxPendingRequests: 1
        maxParallelRequests: 1 # 限制并发请求为1
  4. 部署sleep应用。

    # =============================================================
    # sleep.yaml
    # =============================================================
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image:  registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
  5. 验证流量熔断配置。

    本文为了方便演示使用了较小的并发数限制,在实际环境中请按需调整。
    1. 获取网关地址。

      export GATEWAY_ADDRESS=$(kubectl get gateway/example-gateway -o jsonpath='{.status.addresses[0].value}')
    2. 开启两个终端窗口,在窗口1发起测试请求,在请求未返回时在窗口2再次发起请求。

      kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions -H 'Content-Type: application/json' -H "host: example.com" -d '{
          "model": "/model/llama2",
          "max_completion_tokens": 100,
          "temperature": 0,
          "messages": [
            {
              "role": "user",
              "content": "introduce yourself"
            }
          ]
      }'

      窗口1预期输出:

      {"id":"chatcmpl-eb67bf29-1f87-4e29-8c3e-a83f3c74cd87","object":"chat.completion","created":1745207283,"model":"/model/llama2","choices":[{"index":0,"message":{"role":"assistant","content":"\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n        ","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":115,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}%

      窗口2预期输出:

      upstream connect error or disconnect/reset before headers. reset reason: overflow

      可以看到,配置熔断规则后,并发请求数超出配置数量1,后发送的请求会触发熔断。