Use ASM circuit breaking for failing or slow external services

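Service Mesh (ASM) lets you configure circuit breaking rules for traffic to external services. When ASM detects that an external service is failing, for example by returning consecutive errors or responding slowly, the circuit breaker blocks subsequent outbound requests for a period of time and immediately returns a preset error response. The sidecar proxy handles this automatically, so no application code changes are required. This topic uses access to httpbin.org as an example to describe how to configure a circuit breaking rule for an application that accesses an external service.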

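Prerequisites

Automatic sidecar injection is enabled for the namespace in which the sleep application is deployed.

Procedure

Step 1: Deploy a client application for testing

Deploy the sleep application.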

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
    service: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: sleep
      containers:
      - name: sleep
        image: registry.cn-hangzhou.aliyuncs.com/acs/curl:8.1.2
        command: ["/bin/sleep", "infinity"]
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /etc/sleep/tls
          name: secret-volume
      volumes:
      - name: secret-volume
        secret:
          secretName: sleep-secret
          optional: true
EOF
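
Before you run the tests, it can help to confirm that the sleep Pod is ready and that a sidecar was injected. The following is a minimal sketch, not an official ASM check. It assumes classic sidecar injection, where the proxy appears under spec.containers with the default Istio name istio-proxy. The helper has_sidecar is a hypothetical name introduced here for illustration.

```shell
# has_sidecar: succeeds if the space-separated container names on stdin
# include istio-proxy (assumed default name of the injected sidecar).
has_sidecar() {
  tr ' ' '\n' | grep -qx 'istio-proxy'
}

# Usage against a live cluster (not executed by this snippet):
#   kubectl rollout status deploy/sleep --timeout=120s
#   kubectl get pod -l app=sleep \
#     -o jsonpath='{.items[0].spec.containers[*].name}' | has_sidecar \
#     && echo "sidecar injected" || echo "sidecar missing"
```

If the check reports a missing sidecar, the circuit breaking rule in the later steps cannot take effect, because the rule is enforced by the sidecar proxy.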

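Step 2: Verify access behavior without circuit breaking

Access the /status/503 path of the external service httpbin.org. This endpoint consistently returns an HTTP 503 error.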

kubectl exec -it deploy/sleep -- sh -c \
  "for i in \$(seq 1 100); do \
    curl -s -o /dev/null -w 'Time: %{time_total}s, Status: %{http_code}\n' \
    httpbin.org/status/503; \
    sleep 0.2; \
  done"

Expected output:

Time: 0.524067s, Status: 503
Time: 0.947159s, Status: 503
Time: 0.459089s, Status: 503
Time: 0.264017s, Status: 503
Time: 0.661447s, Status: 503
Time: 0.484715s, Status: 503
Time: 0.952842s, Status: 503
...

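Every request is forwarded to httpbin.org and returns status code 503.

Access the /delay path of httpbin.org with the delay set to 2s. This endpoint consistently takes more than 2 seconds to respond.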

kubectl exec -it deploy/sleep -- sh -c \
  "for i in \$(seq 1 100); do \
    curl -s -o /dev/null -w 'Time: %{time_total}s, Status: %{http_code}\n' \
    httpbin.org/delay/2; \
    sleep 0.2; \
  done"

Expected output:

Time: 2.467788s, Status: 200 
Time: 4.651051s, Status: 200 
Time: 3.222184s, Status: 200 
Time: 2.592945s, Status: 200 
Time: 2.473543s, Status: 200 
Time: 2.464152s, Status: 200 
...

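The response time of every request is longer than 2 seconds.

These two examples show that, without a circuit breaking rule, the client's sidecar proxy forwards every request even when the external service keeps returning errors or responding slowly. This continuously consumes network and compute resources.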
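
With 100 requests per run, counting status codes is easier than reading the raw output. The following is a minimal sketch under the assumption that each line has the form produced by the curl -w format used above. The function name summarize_status is hypothetical.

```shell
# summarize_status: reads lines such as "Time: 0.524067s, Status: 503"
# on stdin and prints one "<status> <count>" line per status code.
# Carriage returns are stripped because kubectl exec -it allocates a TTY,
# which typically ends lines with \r\n.
summarize_status() {
  tr -d '\r' | awk '/Status:/ { count[$NF]++ }
                    END { for (code in count) print code, count[code] }' | sort
}

# Usage (not executed by this snippet): append "| summarize_status"
# to one of the kubectl exec loops above.
```

Without a circuit breaking rule, the summary is expected to contain a single status code, because every request reaches the external service.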

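Step 3: Deploy a circuit breaking rule and verify again

In this step, you configure a circuit breaking rule for the sleep application. In the previous step, all 100 requests to httpbin.org were forwarded to the external service. With this rule, the sidecar trips the breaker for the client when, within a 10-second window (window_size) that contains at least 3 requests (min_request_amount), 60% or more of the requests from sleep to httpbin.org fail (error_percent), or the number of slow requests, that is, requests that take longer than 1s (slow_request_rt), reaches 10 (max_slow_requests). While the breaker is open (break_duration, 90s), requests receive the custom response with status code 499.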
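
To make the trip conditions concrete, the following sketch simulates them on a list of requests. It is an illustration of the rule as described in this step, not the ASM implementation. It assumes that 5xx responses count as errors, that a request slower than slow_request_rt counts as slow, and that all requests fall into a single window. The function name simulate_breaker is hypothetical.

```shell
# simulate_breaker: reads "<status> <seconds>" pairs on stdin, one request
# per line, and prints the 1-based index of the request after which the
# breaker would trip under the assumed rule, or "none".
# Thresholds mirror the rule in this step: min_request_amount=3,
# error_percent=60, max_slow_requests=10, slow_request_rt=1s.
simulate_breaker() {
  awk -v min_req=3 -v err_pct=60 -v max_slow=10 -v slow_rt=1 '
    {
      total++
      if ($1 >= 500) errors++     # assumption: 5xx counts as an error
      if ($2 > slow_rt) slow++    # slower than slow_request_rt
      if (total >= min_req &&
          (errors * 100 / total >= err_pct || slow >= max_slow)) {
        print NR; tripped = 1; exit
      }
    }
    END { if (!tripped) print "none" }'
}

# Example: three consecutive 503 responses trip the breaker at request 3.
printf '503 0.5\n503 0.9\n503 0.4\n' | simulate_breaker
```

Under these assumptions, three consecutive errors are enough to trip the breaker, because the minimum request count is reached and the error ratio is 100%. Ten responses of about 2s each trip it through the slow-request condition.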

Deploy the ServiceEntry and the circuit breaking rule.

kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: httpbin-org
spec:
  exportTo:
  - '*'
  hosts:
  - httpbin.org
  location: MESH_EXTERNAL
  ports:
  - name: http
    number: 80
    protocol: HTTP
  resolution: DNS
---
apiVersion: istio.alibabacloud.com/v1
kind: ASMCircuitBreaker
metadata:
  name: httpbin-org-breaker
spec:
  workloadSelector:
    labels:
      app: sleep
  applyToTraffic: sidecar_outbound
  configs:
    - target_services:
      - kind: ServiceEntry
        name: httpbin-org
        port: 80
      breaker_config:
        slow_request_rt: 1s
        break_duration: 90s
        window_size: 10s
        max_slow_requests: 10
        min_request_amount: 3
        error_percent:
          value: 60
        custom_response:
          header_to_add:
            x-envoy-circuitbreak: "true"
          body: "hello, break!"
          status_code: 499
EOF

Verify the 503 requests again.

kubectl exec -it deploy/sleep -- sh -c \
  "for i in \$(seq 1 100); do \
    curl -s -o /dev/null -w 'Time: %{time_total}s, Status: %{http_code}\n' \
    httpbin.org/status/503; \
    sleep 0.2; \
  done"

Expected output:

Time: 1.033321s, Status: 503
Time: 0.492785s, Status: 503
Time: 0.786655s, Status: 503
Time: 0.009272s, Status: 499
Time: 0.009629s, Status: 499
Time: 0.010111s, Status: 499
...

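Starting from the fourth request, the status code changes to 499 and the response time drops to about 10 ms, because the sidecar returns the preset response without forwarding the request.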
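
Besides the status code, the custom header configured under custom_response.header_to_add can be used to tell a breaker response apart from a response sent by httpbin.org. The following is a minimal sketch that checks the response headers for x-envoy-circuitbreak: true. The function name is_breaker_response is hypothetical.

```shell
# is_breaker_response: reads HTTP response headers on stdin and succeeds
# if they contain the custom header defined in the rule's custom_response.
# Header names are matched case-insensitively.
is_breaker_response() {
  tr -d '\r' | grep -qi '^x-envoy-circuitbreak: *true$'
}

# Usage against a live cluster (not executed by this snippet).
# curl -D - writes the response headers to stdout; -o /dev/null drops the body.
#   kubectl exec deploy/sleep -- curl -s -D - -o /dev/null \
#     httpbin.org/status/503 | is_breaker_response \
#     && echo "answered by the circuit breaker" \
#     || echo "answered by httpbin.org"
```

Checking the header is more robust than checking the status code alone if the external service itself could ever return the same status code.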

Verify the delay requests again.

kubectl exec -it deploy/sleep -- sh -c \
  "for i in \$(seq 1 100); do \
    curl -s -o /dev/null -w 'Time: %{time_total}s, Status: %{http_code}\n' \
    httpbin.org/delay/2; \
    sleep 0.2; \
  done"

Expected output:

Time: 3.293483s, Status: 200
Time: 2.851457s, Status: 200
Time: 2.772694s, Status: 200
Time: 4.012661s, Status: 200
Time: 2.505847s, Status: 200
Time: 4.203690s, Status: 200
Time: 4.063237s, Status: 200
Time: 2.514796s, Status: 200
Time: 2.783456s, Status: 200
Time: 2.303864s, Status: 200
Time: 0.009872s, Status: 499
Time: 0.009720s, Status: 499
Time: 0.009374s, Status: 499
...

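After ten slow requests, the status code changes to 499.

These two tests show that the circuit breaking rule is in effect. It protects the application from unnecessary waiting and resource consumption.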
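
To record exactly when the breaker tripped in a test run, the position of the first 499 response can be extracted from the loop output. The following is a minimal sketch that assumes the curl -w output format used in this topic. The function name first_break_index is hypothetical, and the optional argument lets you pass a different custom status code.

```shell
# first_break_index: reads the loop output on stdin and prints the 1-based
# position of the first request answered with the custom status code
# (default 499, as set in this rule), or "none" if no such response appears.
first_break_index() {
  tr -d '\r' | awk -v code="${1:-499}" '
    /Status:/ { n++; if ($NF == code) { print n; found = 1; exit } }
    END { if (!found) print "none" }'
}

# Usage (not executed by this snippet): append "| first_break_index"
# to one of the kubectl exec loops above.
```

Comparing this position across runs is one way to check whether a threshold change has the intended effect before you apply it to production traffic.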

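Recommendations

Set thresholds carefully: to avoid false triggers, set slow_request_rt and error_percent based on historical P99 latency and error rates.
Combine with monitoring: use Prometheus monitoring to determine the circuit breaking state.
Roll out gradually: validate the rule in a test environment before you apply it in production.

Circuit breaking is one part of traffic management. Combine it with timeouts, retries, and rate limiting to build a complete resilience strategy.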