Automatically monitor and respond to ECS system events to implement automated O&M such as fault handling and dynamic scheduling

Alibaba Cloud provides ECS system events to record and notify you of cloud resource information, such as instance start/stop, expiration, and the execution status of O&M tasks. In scenarios such as large-scale clusters and real-time resource scheduling, if you need to proactively monitor and respond to ECS system events to implement automated O&M such as fault handling and dynamic scheduling, you can use the Cloud Assistant plugin ecs-tool-event.

Note
  • ECS system events are defined by Alibaba Cloud to record and notify you of information about cloud resources, such as the execution status of O&M tasks, resource exceptions, and resource state changes. For event types and detailed descriptions, see Overview of ECS system events.

  • Cloud Assistant plugins are plugin capabilities integrated into Cloud Assistant that let you complete complex configuration operations with simple commands, improving O&M efficiency. For more information, see Cloud Assistant overview and Use Cloud Assistant plugins.

How it works

You can monitor and respond to ECS system events through the console or by integrating with the OpenAPI. However, both approaches have limitations:

  • Monitoring or responding through the console: requires manual intervention, events are easy to miss in multi-instance scenarios, and automated responses are not possible.

  • Monitoring or responding through the ECS OpenAPI: requires you to develop your own program, which entails development costs and technical expertise.

To address these limitations, Alibaba Cloud provides the Cloud Assistant plugin ecs-tool-event. The plugin polls the metaserver once per minute for ECS system events and converts them into log entries stored inside the operating system. No extra program development is needed: you collect the system event log directly in the guest OS to monitor and respond to ECS system events. For example, if you already run Kubernetes-based O&M automation, you can adapt your system by collecting the streaming log host_event.log, as in the sketch below.
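The following is a minimal sketch of consuming that stream, assuming the plugin is already running and writing to /var/log/host_event.log; the handler line is a placeholder to replace with your own logic.

# Follow the event log and react to each new entry (hypothetical handler).
sudo tail -F /var/log/host_event.log | while read -r line; do
    echo "handling event: $line"    # replace with your own response logic
done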

Procedure

Important
  • Make sure that the Cloud Assistant Agent is installed on your instance. For details, see How do I install the Cloud Assistant Agent?

  • Starting or stopping a Cloud Assistant plugin, or querying plugin status, requires root privileges.

  1. Log on to the ECS instance and start the Cloud Assistant plugin ecs-tool-event.

    Once started, the plugin polls the metaserver once per minute for ECS system events and converts them into log entries stored inside the operating system.

    sudo acs-plugin-manager --exec --plugin=ecs-tool-event --params --start
    Note

    After the plugin starts, you can run ls /var/log to see the automatically generated host_event.log file.

    • Log location: /var/log/host_event.log

    • Log format

      %Y-%m-%d %H:%M:%S - WARNING - Ecs event type is: ${event type},event status is: ${event status}, action ISO 8601 time is ${actual execution time in ISO 8601 format}

      Example:

      2024-01-08 17:02:01 - WARNING - Ecs event type is: InstanceFailure.Reboot,event status is: Executed,action ISO 8601 time is 2023-12-27T11:49:28Z

  2. Query the plugin status.

    sudo acs-plugin-manager --status
  3. Based on your business scenario, collect the streaming log host_event.log to integrate with your own O&M system. One way to parse it is sketched after this procedure.

    For an application example, see Application example: automatically respond to ECS system events in a Kubernetes cluster.

  4. (Optional) If you no longer need to respond to ECS system events proactively, stop (remove) the Cloud Assistant plugin ecs-tool-event.

    sudo acs-plugin-manager --remove --plugin ecs-tool-event
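As referenced in step 3, the following sketch shows one way to pull structured fields out of host_event.log with awk; the field separators are assumptions derived from the log format documented in step 1.

# Print the event type, status, and scheduled time of each log entry.
sudo awk -F 'Ecs event type is: |, ?event status is: |, ?action ISO 8601 time is ' \
    '{ print "type=" $2, "status=" $3, "time=" $4 }' /var/log/host_event.log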

Application example: automatically respond to ECS system events in a Kubernetes cluster

Scenario

When ECS instances serve as nodes of a Kubernetes cluster, a failure on a single node (for example, a reboot, memory exhaustion, or an operating system error) can affect the stability of online services, so proactively monitoring and responding to node exceptions is critical. With the Cloud Assistant plugin ecs-tool-event, you can convert ECS system events into operating system logs and combine them with the open source Kubernetes components NPD (Node Problem Detector), Draino, and Autoscaler to monitor and respond to ECS system events conveniently and without writing any code, improving cluster stability and reliability.

What are NPD, Draino, and Autoscaler?

  • NPD: an open source Kubernetes component that monitors node health and detects node problems such as hardware failures and network issues. For more information, see the NPD documentation.

  • Draino: a Kubernetes controller that watches the nodes in a cluster and migrates Pods off abnormal nodes. For more information, see the Draino documentation.

  • Autoscaler: an open source Kubernetes component that dynamically resizes a cluster. It monitors the Pods in the cluster to ensure that all Pods have sufficient resources to run while no idle, unneeded nodes remain. For more information, see the Autoscaler documentation.

Architecture

The implementation principle and technical architecture of the solution are as follows:

  1. The Cloud Assistant plugin ecs-tool-event polls the metaserver once per minute for ECS system events and stores them as system logs inside the operating system (at /var/log/host_event.log).

  2. The cluster component NPD collects the system event logs and reports the problems to the API server.

  3. The cluster controller Draino receives the Kubernetes events (derived from ECS system events) from the API server and migrates Pods from the abnormal node to healthy nodes.

  4. After the Pods are evicted, you can decommission the abnormal node with your existing offboarding process, or use the open source Kubernetes component Autoscaler to automatically release the abnormal node and create a new instance to join the cluster.

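After the procedure below is complete, each hop of this chain can be observed with standard kubectl commands; <node-name> is a placeholder for an affected node.

kubectl describe node <node-name>    # 1) NPD sets the custom condition (for example, HostEventRebootAfter48)
kubectl get events --all-namespaces  # 2) the reported events reach the API server
kubectl get nodes                    # 3) Draino cordons the node: its STATUS shows SchedulingDisabled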

Procedure

Step 1: Start the ecs-tool-event plugin on the nodes

Log on to a node (that is, an ECS instance) and start the ecs-tool-event plugin.

Important

In a real deployment, the plugin must be started on every node in the cluster. You can use Cloud Assistant to run the following start command on multiple instances in one batch (a CLI sketch follows below). For details, see Create and run a command.

sudo acs-plugin-manager --exec --plugin=ecs-tool-event --params --start

Once started, the ecs-tool-event plugin automatically writes ECS system events as logs inside the operating system.
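As mentioned in the note above, you can start the plugin on many instances in one batch through Cloud Assistant. The following sketch uses the RunCommand operation via the aliyun CLI; the region and instance IDs are placeholders, and the CLI is assumed to be installed and authenticated. Cloud Assistant runs shell commands as root on Linux instances by default, so sudo is omitted.

# Start the plugin on several instances in one batch (placeholder IDs).
aliyun ecs RunCommand \
  --RegionId cn-hangzhou \
  --Type RunShellScript \
  --CommandContent 'acs-plugin-manager --exec --plugin=ecs-tool-event --params --start' \
  --InstanceId.1 i-bp1xxxxxxxxxxxx1 \
  --InstanceId.2 i-bp1xxxxxxxxxxxx2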

Step 2: Configure NPD and Draino for the cluster

  1. Log on to any node in the cluster.

  2. Configure the NPD component for the cluster (this configuration applies to the entire cluster).

    1. Configure the NPD files. The following three files are required.

      Note

      For detailed configuration instructions, see the official documentation.

      • node-problem-detector-config.yaml: defines the metrics that NPD monitors, such as system logs.

      • node-problem-detector.yaml: defines how NPD runs in the cluster.

      • rbac.yaml: defines the permissions that NPD requires in the Kubernetes cluster.

        If NPD is not configured on the cluster

        Create the above three YAML files on the ECS instance.

        node-problem-detector-config.yaml

        apiVersion: v1
        data:
          kernel-monitor.json: |
            {
                "plugin": "kmsg",
                "logPath": "/dev/kmsg",
                "lookback": "5m",
                "bufferSize": 10,
                "source": "kernel-monitor",
                "conditions": [
                    {
                        "type": "KernelDeadlock",
                        "reason": "KernelHasNoDeadlock",
                        "message": "kernel has no deadlock"
                    },
                    {
                        "type": "ReadonlyFilesystem",
                        "reason": "FilesystemIsNotReadOnly",
                        "message": "Filesystem is not read-only"
                    }
                ],
                "rules": [
                    {
                        "type": "temporary",
                        "reason": "OOMKilling",
                        "pattern": "Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB.*"
                    },
                    {
                        "type": "temporary",
                        "reason": "TaskHung",
                        "pattern": "task \\S+:\\w+ blocked for more than \\w+ seconds\\."
                    },
                    {
                        "type": "temporary",
                        "reason": "UnregisterNetDevice",
                        "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
                    },
                    {
                        "type": "temporary",
                        "reason": "KernelOops",
                        "pattern": "BUG: unable to handle kernel NULL pointer dereference at .*"
                    },
                    {
                        "type": "temporary",
                        "reason": "KernelOops",
                        "pattern": "divide error: 0000 \\[#\\d+\\] SMP"
                    },
                    {
                        "type": "temporary",
                        "reason": "MemoryReadError",
                        "pattern": "CE memory read error .*"
                    },
                    {
                        "type": "permanent",
                        "condition": "KernelDeadlock",
                        "reason": "DockerHung",
                        "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
                    },
                    {
                        "type": "permanent",
                        "condition": "ReadonlyFilesystem",
                        "reason": "FilesystemIsReadOnly",
                        "pattern": "Remounting filesystem read-only"
                    }
                ]
            }
          host_event.json: |
            {
                "plugin": "filelog",                     
                "pluginConfig": {
                    "timestamp": "^.{19}",
                    "message": "Ecs event type is: .*",
                    "timestampFormat": "2006-01-02 15:04:05"
                },
                "logPath": "/var/log/host_event.log",   
                "lookback": "5m",
                "bufferSize": 10,
                "source": "host-event",                     
                "conditions": [
                    {
                        "type": "HostEventRebootAfter48",       
                        "reason": "HostEventWillRebootAfter48",
                        "message": "The Host Is Running In Good Condition"
                    }
                ],
                "rules": [
                    {
                        "type": "temporary",
                        "reason": "HostEventRebootAfter48temporary",
                        "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
                    },
                    {
                        "type": "permanent",
                        "condition": "HostEventRebootAfter48", 
                        "reason": "HostEventRebootAfter48Permanent",
                        "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
                    }
                ]
            }
        
          docker-monitor.json: |
            {
                "plugin": "journald",
                "pluginConfig": {
                    "source": "dockerd"
                },
                "logPath": "/var/log/journal",
                "lookback": "5m",
                "bufferSize": 10,
                "source": "docker-monitor",
                "conditions": [],
                "rules": [
                    {
                        "type": "temporary",
                        "reason": "CorruptDockerImage",
                        "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*"
                    }
                ]
            }
        kind: ConfigMap
        metadata:
          name: node-problem-detector-config
          namespace: kube-system

        node-problem-detector.yaml

        apiVersion: apps/v1
        kind: DaemonSet
        metadata:
          name: node-problem-detector
          namespace: kube-system
          labels:
            app: node-problem-detector
        spec:
          selector:
            matchLabels:
              app: node-problem-detector
          template:
            metadata:
              labels:
                app: node-problem-detector
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                      - matchExpressions:
                          - key: kubernetes.io/os
                            operator: In
                            values:
                              - linux
              containers:
              - name: node-problem-detector
                command:
                - /node-problem-detector
                - --logtostderr
                - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json,/config/host_event.json
                image: cncamp/node-problem-detector:v0.8.10
                resources:
                  limits:
                    cpu: 10m
                    memory: 80Mi
                  requests:
                    cpu: 10m
                    memory: 80Mi
                imagePullPolicy: Always
                securityContext:
                  privileged: true
                env:
                - name: NODE_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: spec.nodeName
                volumeMounts:
                - name: log
                  mountPath: /var/log
                  readOnly: true
                - name: kmsg
                  mountPath: /dev/kmsg
                  readOnly: true
                # Make sure node problem detector is in the same timezone
                # with the host.
                - name: localtime
                  mountPath: /etc/localtime
                  readOnly: true
                - name: config
                  mountPath: /config
                  readOnly: true
              serviceAccountName: node-problem-detector
              volumes:
              - name: log
                # Config `log` to your system log directory
                hostPath:
                  path: /var/log/
              - name: kmsg
                hostPath:
                  path: /dev/kmsg
              - name: localtime
                hostPath:
                  path: /etc/localtime
              - name: config
                configMap:
                  name: node-problem-detector-config
                  items:
                  - key: kernel-monitor.json
                    path: kernel-monitor.json
                  - key: docker-monitor.json
                    path: docker-monitor.json
                  - key: host_event.json
                    path: host_event.json
              tolerations:
                - effect: NoSchedule
                  operator: Exists
                - effect: NoExecute
                  operator: Exists

        rbac.yaml

        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: node-problem-detector
          namespace: kube-system
        
        ---
        apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRoleBinding
        metadata:
          name: npd-binding
        roleRef:
          apiGroup: rbac.authorization.k8s.io
          kind: ClusterRole
          name: system:node-problem-detector
        subjects:
          - kind: ServiceAccount
            name: node-problem-detector
            namespace: kube-system

        If NPD is already configured on the cluster

        • In the node-problem-detector-config.yaml file, add the host_event.json log monitor, as shown below. (A quick way to sanity-check its match patterns appears at the end of this procedure.)

          ...
          
          host_event.json: |
              {
                  "plugin": "filelog",   #指定使用的日志采集插件,固定为filelog       
                  "pluginConfig": {
                      "timestamp": "^.{19}",
                      "message": "Ecs event type is: .*",
                      "timestampFormat": "2006-01-02 15:04:05"
                  },
                  "logPath": "/var/log/host_event.log",    #系统事件日志路径,固定为/var/log/host_event.log
                  "lookback": "5m",
                  "bufferSize": 10,
                  "source": "host-event",                     
                  "conditions": [
                      {
                          "type": "HostEventRebootAfter48",    #自定义事件名称,Draino配置中会用到
                          "reason": "HostEventWillRebootAfter48",
                          "message": "The Host Is Running In Good Condition"
                      }
                  ],
                  "rules": [
                      {
                          "type": "temporary",
                          "reason": "HostEventRebootAfter48temporary",
                          "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
                      },
                      {
                          "type": "permanent",
                          "condition": "HostEventRebootAfter48", 
                          "reason": "HostEventRebootAfter48Permanent",
                          "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
                      }
                  ]
              }
          
          ...
        • In the node-problem-detector.yaml file:

          • In the - --config.system-log-monitor line, add /config/host_event.json to tell NPD to monitor the system event log, as shown below:

            containers:
            - name: node-problem-detector
              command:
              ...
              - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json,/config/host_event.json
            
          • In the items: list under - name: config, add the lines marked by the comments below.

            ...
            - name: config
              configMap:
                name: node-problem-detector-config
                items:
                - key: kernel-monitor.json
                  path: kernel-monitor.json
                - key: docker-monitor.json
                  path: docker-monitor.json
                - key: host_event.json     # line to add
                  path: host_event.json    # line to add
            ...
    2. Run the following commands to apply the files.

      sudo kubectl create -f rbac.yaml
      sudo kubectl create -f node-problem-detector-config.yaml
      sudo kubectl create -f node-problem-detector.yaml
    3. Run the following command to check whether the NPD configuration has taken effect.

      sudo kubectl describe nodes

      In the command output, the node conditions now include HostEventRebootAfter48, indicating that the NPD configuration is complete and has taken effect. (If it does not appear yet, wait 3 to 5 minutes and check again.)


  3. Configure the Draino controller for the cluster (this configuration applies to the entire cluster).

    1. Configure or modify the Draino configuration as appropriate.

      If Draino is not configured on the cluster: install Draino

      Add the following YAML file on the instance.

      draino.yaml

      ---
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        labels: {component: draino}
        name: draino
        namespace: kube-system
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        labels: {component: draino}
        name: draino
      rules:
      - apiGroups: ['']
        resources: [events]
        verbs: [create, patch, update]
      - apiGroups: ['']
        resources: [nodes]
        verbs: [get, watch, list, update]
      - apiGroups: ['']
        resources: [nodes/status]
        verbs: [patch]
      - apiGroups: ['']
        resources: [pods]
        verbs: [get, watch, list]
      - apiGroups: ['']
        resources: [pods/eviction]
        verbs: [create]
      - apiGroups: [extensions]
        resources: [daemonsets]
        verbs: [get, watch, list]
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        labels: {component: draino}
        name: draino
      roleRef: {apiGroup: rbac.authorization.k8s.io, kind: ClusterRole, name: draino}
      subjects:
      - {kind: ServiceAccount, name: draino, namespace: kube-system}
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        labels: {component: draino}
        name: draino
        namespace: kube-system
      spec:
        # Draino does not currently support locking/master election, so you should
        # only run one draino at a time. Draino won't start draining nodes immediately
        # so it's usually safe for multiple drainos to exist for a brief period of
        # time.
        replicas: 1
        selector:
          matchLabels: {component: draino}
        template:
          metadata:
            labels: {component: draino}
            name: draino
            namespace: kube-system
          spec:
            containers:
            - name: draino
              image: planetlabs/draino:dbadb44
              # You'll want to change these labels and conditions to suit your deployment.
              command:
              - /draino
              - --debug
              - --evict-daemonset-pods
              - --evict-emptydir-pods
              - --evict-unreplicated-pods
              - KernelDeadlock
              - OutOfDisk
              - HostEventRebootAfter48
              # - ReadonlyFilesystem
              # - MemoryPressure
              # - DiskPressure
              # - PIDPressure
              livenessProbe:
                httpGet: {path: /healthz, port: 10002}
                initialDelaySeconds: 30
            serviceAccountName: draino

      If Draino is already configured on the cluster: modify the Draino configuration

      Open the Draino configuration file, locate the containers: section, and add the condition name that was defined in the node-problem-detector-config.yaml file in step 2 (for example, HostEventRebootAfter48), as shown below:

      containers:
      - name: draino
        image: planetlabs/draino:dbadb44
        # You'll want to change these labels and conditions to suit your deployment.
        command:
        - /draino
        - --debug
        ......
        - KernelDeadlock
        - OutOfDisk
        - HostEventRebootAfter48  # line to add
    2. Run the following command to apply the Draino configuration.

      sudo kubectl create -f draino.yaml
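Before relying on the pipeline end to end, you can sanity-check the match patterns from host_event.json against a sample log line, as referenced in step 2; the pattern below is copied from the rules above, and the sample line is hypothetical.

# Verify that a sample log line would trigger the NPD rule.
pattern='Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*'
line='2024-02-23 12:29:29 - WARNING - Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled, action ISO 8601 time is 2024-02-25T00:00:00Z'
echo "$line" | grep -E -o "$pattern"    # prints the matching part if the rule fires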

Step 3: Decommission the abnormal node and add a new node

After the Pods are evicted, you can decommission the abnormal node with your existing offboarding process (one common manual flow is sketched below), or use the open source Autoscaler component to automatically release the abnormal node and create a new instance to join the cluster. If you want to use Autoscaler, see the Autoscaler documentation.
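If you decommission manually, a standard kubectl drain-and-delete sequence is one option; <node-name> is a placeholder, and the underlying ECS instance would then be released through your usual process.

# Evict remaining Pods, then remove the node object from the cluster.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>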

Verify the result

  1. Log on to any node and run the following command to simulate an ECS system event log entry.

    Important

    Replace the timestamp with the current system time: the NPD configuration above uses a 5-minute lookback window, so older entries are ignored. The simulated event must also match a rule configured in host_event.json; the rules above match SystemMaintenance.Reboot events in the Scheduled or Inquiring state. A variant that stamps the current time automatically is shown after this procedure.

    echo '2024-02-23 12:29:29 - WARNING - Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled, action ISO 8601 time is 2023-12-27T11:49:28Z' | sudo tee /var/log/host_event.log
  2. Run the following command. The output shows that a Kubernetes event has been generated for the detected system event and that the node has been marked unschedulable.

    sudo kubectl describe nodes

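As noted in step 1, the simulated log entry must carry the current time. The variant below stamps it automatically (assuming GNU date); echo ... | sudo tee is used because sudo echo ... > file would apply root privileges to echo but not to the redirection.

echo "$(date '+%Y-%m-%d %H:%M:%S') - WARNING - Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled, action ISO 8601 time is $(date -u '+%Y-%m-%dT%H:%M:%SZ')" | sudo tee /var/log/host_event.log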