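The Mixerless Telemetry technology of Service Mesh (ASM) provides non-intrusive telemetry data for application containers. Prometheus collects this telemetry data as monitoring metrics. Flagger is a tool that fully automates the application release process: it watches the request metrics in Prometheus and uses them to control how traffic is shifted during a canary release. This topic describes how to implement a progressive canary release based on Mixerless Telemetry.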

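Prerequisites

Prometheus collects the application monitoring metrics. For details, see "Use Mixerless Telemetry to observe ASM instances" (基于Mixerless Telemetry实现服务网格的可观测性).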
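
To confirm that the prerequisite holds, it can help to query Prometheus for the same kind of latency figure that the canary analysis relies on. The following is a minimal sketch, not the exact query that Flagger runs. The function name p99_query, the variable PROM_URL, and the metric and label names (taken from the Istio standard metrics) are assumptions that may need adjusting for your setup.

```shell
# Build a PromQL expression for the P99 request duration of one workload.
# Assumption: the mesh exports the Istio standard histogram
# istio_request_duration_milliseconds with destination-side labels.
p99_query() {
  namespace=$1   # namespace of the workload, for example test
  workload=$2    # workload name, for example podinfo
  window=$3      # rate window, for example 30s
  printf 'histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_workload_namespace="%s",destination_workload="%s"}[%s])) by (le))\n' \
    "$namespace" "$workload" "$window"
}

# Usage (PROM_URL is a placeholder for the address of your Prometheus):
#   curl -sG "$PROM_URL/api/v1/query" \
#     --data-urlencode "query=$(p99_query test podinfo 30s)"
```

An empty result from the query usually means that no request traffic has reached the workload within the window, or that the metric is not being scraped.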

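Progressive canary release process

  1. Connect Prometheus so that it collects the application monitoring metrics.
  2. Deploy Flagger and a Gateway.
  3. Deploy flagger-loadtester, which probes the Pods of the application during the canary release.
  4. Deploy version 3.1.0 of the podinfo application as the sample application.
  5. Deploy an HPA that scales out the podinfo Pods when their CPU utilization reaches 99%.
  6. Deploy a Canary that increases the traffic routed to the canary version of podinfo in steps of 10%, as long as the P99 request duration measured over each 30s window stays at or below 500 ms.
  7. Flagger copies the podinfo application and generates an application named podinfo-primary. podinfo serves as the Deployment of the canary version, and podinfo-primary serves as the Deployment of the production version.
  8. Upgrade podinfo: update the canary version of the podinfo application to version 3.1.1.
  9. Flagger watches the request metrics in Prometheus to control the traffic of the canary release. As long as the P99 request duration measured over each 30s window stays at or below 500 ms, Flagger increases the traffic routed to version 3.1.1 of podinfo in steps of 10%. Meanwhile, the HPA gradually scales out the podinfo Pods and scales in the podinfo-primary Pods as the canary release progresses.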
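
The stepwise traffic shift in the process above can be sketched as a small calculation. The helper canary_weights below is hypothetical and only illustrates the arithmetic. It assumes that every metric check passes and that the maximum weight is a multiple of the step, and it does not model failed checks or rollbacks.

```shell
# Print the canary traffic weights (in percent) that a release passes through,
# given the step size and the maximum weight.
canary_weights() {
  step=$1
  max=$2
  # Refuse a non-positive step, which would never reach the maximum.
  [ "$step" -gt 0 ] || return 1
  weight=$step
  out=""
  while [ "$weight" -le "$max" ]; do
    out="${out:+$out }$weight"
    weight=$((weight + step))
  done
  printf '%s\n' "$out"
}

# Usage: canary_weights 10 50   # prints "10 20 30 40 50"
```

With a step of 10 and a maximum of 50, a fully successful analysis needs five increments before the new version can be promoted.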

Procedure

  1. Connect to the Kubernetes cluster by using kubectl.
  2. Run the following commands to deploy Flagger.
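    # Assumed environment variables:
    #   USER_CONFIG  path to the kubeconfig file of the ACK cluster
    #   MESH_CONFIG  path to the kubeconfig file of the ASM instance
    #   FLAGGER_SRC  path to a local copy of the Flagger source repository
    alias k="kubectl --kubeconfig $USER_CONFIG"
    alias h="helm --kubeconfig $USER_CONFIG"
    
    # Store the ASM kubeconfig in a secret so that Flagger can manage Istio resources in the ASM instance.
    cp "$MESH_CONFIG" kubeconfig
    k -n istio-system create secret generic istio-kubeconfig --from-file kubeconfig
    k -n istio-system label secret istio-kubeconfig istio/multiCluster=true
    
    # Install the Flagger CRDs and then the Flagger chart, pointing it at Prometheus.
    h repo add flagger https://flagger.app
    h repo update
    k apply -f "$FLAGGER_SRC/artifacts/flagger/crd.yaml"
    h upgrade -i flagger flagger/flagger --namespace=istio-system \
        --set crd.create=false \
        --set meshProvider=istio \
        --set metricsServer=http://prometheus:9090 \
        --set istio.kubeconfig.secretName=istio-kubeconfig \
        --set istio.kubeconfig.key=kubeconfig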
  3. Connect to the ASM instance by using kubectl.
  4. Deploy the Gateway.
    1. Create a YAML file named public-gateway.yaml with the following content.
      apiVersion: networking.istio.io/v1alpha3
      kind: Gateway
      metadata:
        name: public-gateway
        namespace: istio-system
      spec:
        selector:
          istio: ingressgateway
        servers:
          - port:
              number: 80
              name: http
              protocol: HTTP
            hosts:
              - "*"
    2. Run the following command to deploy the Gateway.
      kubectl --kubeconfig <path to the ASM kubeconfig> apply -f resources_canary/public-gateway.yaml
  5. Run the following command to deploy flagger-loadtester in the ACK cluster.
    kubectl --kubeconfig <path to the ACK kubeconfig> apply -k "https://github.com/fluxcd/flagger//kustomize/tester?ref=main"
  6. Run the following command to deploy podinfo and the HPA in the ACK cluster.
    kubectl --kubeconfig <path to the ACK kubeconfig> apply -k "https://github.com/fluxcd/flagger//kustomize/podinfo?ref=main"
  7. Deploy the Canary in the ACK cluster.
    Note: For details about Canary, see How it works.
    1. Create a YAML file named podinfo-canary.yaml with the following content.
      apiVersion: flagger.app/v1beta1
      kind: Canary
      metadata:
        name: podinfo
        namespace: test
      spec:
        # deployment reference
        targetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: podinfo
        # the maximum time in seconds for the canary deployment
        # to make progress before it is rolled back (default 600s)
        progressDeadlineSeconds: 60
        # HPA reference (optional)
        autoscalerRef:
          apiVersion: autoscaling/v2beta2
          kind: HorizontalPodAutoscaler
          name: podinfo
        service:
          # service port number
          port: 9898
          # container port number or name (optional)
          targetPort: 9898
          # Istio gateways (optional)
          gateways:
          - public-gateway.istio-system.svc.cluster.local
          # Istio virtual service host names (optional)
          hosts:
          - '*'
          # Istio traffic policy (optional)
          trafficPolicy:
            tls:
              # use ISTIO_MUTUAL when mTLS is enabled
              mode: DISABLE
          # Istio retry policy (optional)
          retries:
            attempts: 3
            perTryTimeout: 1s
            retryOn: "gateway-error,connect-failure,refused-stream"
        analysis:
          # schedule interval (default 60s)
          interval: 1m
          # max number of failed metric checks before rollback
          threshold: 5
          # max traffic percentage routed to canary
          # percentage (0-100)
          maxWeight: 50
          # canary increment step
          # percentage (0-100)
          stepWeight: 10
          metrics:
          - name: request-success-rate
            # minimum req success rate (non 5xx responses)
            # percentage (0-100)
            thresholdRange:
              min: 99
            interval: 1m
          - name: request-duration
            # maximum req duration P99
            # milliseconds
            thresholdRange:
              max: 500
            interval: 30s
          # testing (optional)
          webhooks:
            - name: acceptance-test
              type: pre-rollout
              url: http://flagger-loadtester.test/
              timeout: 30s
              metadata:
                type: bash
                cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
            - name: load-test
              url: http://flagger-loadtester.test/
              timeout: 5s
              metadata:
                cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"
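      • stepWeight: the percentage by which the traffic routed to the canary is increased at each step. This example uses 10.
      • max: the upper limit of the P99 request duration, in milliseconds.
      • interval: the time window over which the P99 request duration is evaluated.
    2. Run the following command to deploy the Canary.
      kubectl --kubeconfig <path to the ACK kubeconfig> apply -f resources_canary/podinfo-canary.yaml
  8. Run the following command to upgrade podinfo from version 3.1.0 to version 3.1.1.
    kubectl --kubeconfig <path to the ACK kubeconfig> -n test set image deployment/podinfo podinfod=stefanprodan/podinfo:3.1.1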
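
After the Canary is deployed, it may be useful to check which objects Flagger generated in the application namespace. The following is a minimal sketch. The function name show_canary_objects and the variable KUBECONFIG_PATH are hypothetical; set KUBECONFIG_PATH to the path of the ACK kubeconfig file.

```shell
# List the workloads, services, and autoscalers in the application namespace,
# then show the Canary object itself.
# Assumption: KUBECONFIG_PATH holds the path to the ACK kubeconfig file.
show_canary_objects() {
  ns=${1:-test}
  kubectl --kubeconfig "$KUBECONFIG_PATH" -n "$ns" get deployment,service,hpa
  kubectl --kubeconfig "$KUBECONFIG_PATH" -n "$ns" get canary podinfo
}

# Usage:
#   KUBECONFIG_PATH=/path/to/ack-kubeconfig
#   show_canary_objects test
```

Once initialization has finished, the list is expected to include the podinfo-primary Deployment next to podinfo, along with the podinfo, podinfo-canary, and podinfo-primary Services.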

Verify the progressive release

Run the following command to watch the progressive traffic shift.

while true; do kubectl --kubeconfig <path to the ACK kubeconfig> -n test describe canary/podinfo; sleep 10s; done

Expected output:

Events:
  Type     Reason  Age                From     Message
  ----     ------  ----               ----     -------
  Warning  Synced  39m                flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less then desired generation
  Normal   Synced  38m (x2 over 39m)  flagger  all the metrics providers are available!
  Normal   Synced  38m                flagger  Initialization done! podinfo.test
  Normal   Synced  37m                flagger  New revision detected! Scaling up podinfo.test
  Normal   Synced  36m                flagger  Starting canary analysis for podinfo.test
  Normal   Synced  36m                flagger  Pre-rollout check acceptance-test passed
  Normal   Synced  36m                flagger  Advance podinfo.test canary weight 10
  Normal   Synced  35m                flagger  Advance podinfo.test canary weight 20
  Normal   Synced  34m                flagger  Advance podinfo.test canary weight 30
  Normal   Synced  33m                flagger  Advance podinfo.test canary weight 40
  Normal   Synced  29m (x4 over 32m)  flagger  (combined from similar events): Promotion completed! Scaling down podinfo.test

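The output shows that the traffic routed to version 3.1.1 of the podinfo application increases step by step from 10% to 40%, after which the promotion completes.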
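
As an alternative to reading the full event list, the phase and the current weight can be read from the status of the Canary object. The following is a minimal sketch. The function names canary_status and wait_for_canary and the variable KUBECONFIG_PATH are hypothetical, and the sketch assumes that the Canary status exposes the phase and canaryWeight fields.

```shell
# Print the phase and the current canary weight on one line.
# Assumption: KUBECONFIG_PATH holds the path to the ACK kubeconfig file.
canary_status() {
  ns=${1:-test}
  name=${2:-podinfo}
  kubectl --kubeconfig "$KUBECONFIG_PATH" -n "$ns" get canary "$name" \
    -o jsonpath='{.status.phase}{" "}{.status.canaryWeight}{"\n"}'
}

# Poll every 10 seconds until the phase is Succeeded or Failed.
# Returns 0 on Succeeded and 1 on Failed.
wait_for_canary() {
  while :; do
    status=$(canary_status "$@")
    echo "$status"
    case "$status" in
      Succeeded*) return 0 ;;
      Failed*) return 1 ;;
    esac
    sleep 10
  done
}

# Usage:
#   KUBECONFIG_PATH=/path/to/ack-kubeconfig
#   wait_for_canary test podinfo
```

If an earlier release already finished, the phase may still read Succeeded, so start the loop only after the new revision has been detected.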