Container Service alert management

ACK alert management lets you centrally manage container alerts, including alerts for anomalous events, key metrics of cluster resources, and core components and applications. To detect anomalies promptly, you can also customize the default alert rules in your cluster through a CustomResourceDefinition (CRD).

Billing

The alerting function pulls data from Log Service, Managed Service for Prometheus, and Cloud Monitor. After an alert is triggered, SMS or phone call notifications incur additional charges. Before enabling the alerting function, review the default alert rule template to identify the source of each alert item and enable the required services.

| Data source | Requirements | Details |
| --- | --- | --- |
| Log Service | Requires event monitoring, which is enabled by default when you enable the alerting function. | Pay-by-feature |
| Managed Service for Prometheus | Follow the instructions in Access and configure Alibaba Cloud Prometheus monitoring for your cluster. | Billing details |
| Cloud Monitor | Enable Cloud Monitor for your Container Service for Kubernetes cluster. | Pay-as-you-go |

Enable alert management

After you enable alert management, you can set metric-based alerts for specific resources in your cluster and automatically receive alert notifications when exceptions occur. This helps you efficiently manage and maintain your cluster and ensure service stability. For more information about alerts for related resources, see Default alert rule templates.

ACK managed cluster

You can enable alert configuration for an existing cluster or when you create a new cluster.

Existing cluster

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Operations > Alerts.

  3. On the Alerts page, follow the on-screen instructions to install or upgrade the component.

  4. After the installation or upgrade is complete, go to the Alerts page to configure alerts.

    Tab

    Description

    Alert Rules

    • Status: Enable or disable a target alert rule set.

    • Edit notification object: Set the contact group for alert notifications.

    Before you proceed, you must create contacts and contact groups, and add the contacts to the groups. Notification objects only support contact groups. To send notifications to an individual, create a group that contains only that contact and select the group.

    Alert History

    You can view the 100 most recent alert records sent within the last 24 hours.

    • Click a link in the Alert Rule column to go to the corresponding monitoring system and view the detailed rule configuration.

    • Click Details to navigate to the page of the affected resource, such as a resource with an abnormal event or metric anomalies.

    • Click Intelligent Analytics to use the AI assistant to analyze the issue and receive resolution guidance.

    Alert Contacts

    Create, edit, or delete contacts.

    Contact methods:

    • Phone/SMS: After you set a mobile number for a contact, they can receive alert notifications by phone call or text message.

      Only verified mobile numbers can receive phone call notifications. To verify a mobile number, see Verify a mobile number.

    • Email: After you set an email address for a contact, they can receive alert notifications by email.

    • Chatbot: DingTalk chatbot, WeCom chatbot, and Lark chatbot.

      For a DingTalk chatbot, you must add the following security keywords: Alert, Assign.

    Before you configure email and chatbot settings, you can verify them in the Cloud Monitor console. In the console, choose Alerts > Alert Contacts to ensure that you can receive alert notifications.

    Alert Contact Groups

    Create, edit, or delete contact groups. You can only select contact groups when you configure the Edit notification object setting.

    If no contact group exists, the console automatically creates a default contact group based on your Alibaba Cloud account information.

New cluster

On the Component Configurations page of the cluster creation wizard, next to Alerts, select Use Default Alert Rule Template, and then select an Alert notification contact group. For more information, see Create an ACK managed cluster.

If you enable alert configuration when you create a cluster, the system enables the default alert rules and sends alert notifications to the default contact group. You can also modify the alert contacts or contact groups.

ACK dedicated cluster

For an ACK dedicated cluster, you must first grant permissions to the Worker RAM role and then enable the default alert rules.

Authorize the Worker RAM role

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Cluster Information.

  3. On the Cluster Information page, in the Cluster Resources section, copy the name next to Worker RAM Role. Then, click the link to go to the Resource Access Management (RAM) console to grant permissions to the Worker RAM role.

    1. Create a custom policy with the following content. For more information, see Create a custom policy by using the script editor.

      {
        "Action": [
          "log:*",
          "arms:*",
          "cms:*",
          "cs:UpdateContactGroup"
        ],
        "Resource": [
          "*"
        ],
        "Effect": "Allow"
      }
    2. On the Role page, find the Worker RAM role and attach the custom policy that you created. For more information, see Manage permissions for a RAM role.

      Note: To simplify the procedure, this topic grants broad permissions. In a production environment, we recommend that you follow the principle of least privilege and grant only the required permissions.
  4. Check the logs to verify that the required permissions are configured.

    1. In the left-side navigation pane of the cluster details page, choose Workloads > Deployments.

    2. Set the Namespace to kube-system, and then click alicloud-monitor-controller in the Name column.

    3. Click the Logs tab and confirm that the pod logs indicate successful authorization.
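The policy content in step 3 is a single statement. The sketch below wraps it into a complete policy document before it is pasted into the script editor. It assumes the usual RAM policy layout, with a top-level "Version" of "1" and a "Statement" list. The helper name build_policy is hypothetical and used only for illustration.

```python
import json

# Statement copied from step 3. The wrapper fields ("Version", "Statement")
# are an assumption about the RAM policy document layout.
WORKER_ROLE_STATEMENT = {
    "Action": ["log:*", "arms:*", "cms:*", "cs:UpdateContactGroup"],
    "Resource": ["*"],
    "Effect": "Allow",
}


def build_policy(statements):
    """Wrap one or more statements in a policy document (hypothetical helper)."""
    return {"Version": "1", "Statement": list(statements)}


policy_json = json.dumps(build_policy([WORKER_ROLE_STATEMENT]), indent=2)
print(policy_json)
```

For production use, the wildcard actions would be narrowed in line with the least-privilege note above.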

Enable default alert rules

  1. In the left-side navigation pane of the target cluster page, choose Operations > Alerts.

  2. On the Alerts page, configure the alert settings.

    Tab

    Description

    Alert Rules

    • Status: Enable or disable a target alert rule set.

    • Edit notification object: Set the contact group for alert notifications.

    Before you proceed, you must create contacts and contact groups, and add the contacts to the groups. Notification objects only support contact groups. To send notifications to an individual, create a group that contains only that contact and select the group.

    Alert History

    You can view the 100 most recent alert records sent within the last 24 hours.

    • Click a link in the Alert Rule column to go to the corresponding monitoring system and view the detailed rule configuration.

    • Click Details to navigate to the page of the affected resource, such as a resource with an abnormal event or metric anomalies.

    • Click Intelligent Analytics to use the AI assistant to analyze the issue and receive resolution guidance.

    Alert Contacts

    Create, edit, or delete contacts.

    Contact methods:

    • Phone/SMS: After you set a mobile number for a contact, they can receive alert notifications by phone call or text message.

      Only verified mobile numbers can receive phone call notifications. To verify a mobile number, see Verify a mobile number.
    • Email: After you set an email address for a contact, they can receive alert notifications by email.

    • Chatbot: DingTalk chatbot, WeCom chatbot, and Lark chatbot.

      For a DingTalk chatbot, you must add the following security keywords: Alert, Assign.

    Before you configure email and chatbot settings, you can verify them in the Cloud Monitor console. In the console, choose Alerts > Alert Contacts to ensure that you can receive alert notifications.

    Alert Contact Groups

    Create, edit, or delete contact groups. You can only select contact groups when you configure the Edit notification object setting.

    If no contact group exists, the console automatically creates a default contact group based on your Alibaba Cloud account information.

Configure alert rules

After you enable the alerting feature, the system automatically creates an AckAlertRule resource named default in the kube-system namespace. This resource contains the default alert rule template. You can modify this resource to customize the alert rules for Container Service for Kubernetes (ACK).

Console

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Operations > Alerts.

  3. On the Alert Rules tab, click Configure Alert Rule in the upper-right corner. Then, for the target rule, click YAML in the Actions column to view the AckAlertRule resource configuration for the current cluster.

  4. Modify the YAML file as needed. For parameter details, see the default alert rule template.

    The following YAML shows an example alert rule configuration.

    apiVersion: alert.alibabacloud.com/v1beta1
    kind: AckAlertRule
    metadata:
      name: default
    spec:
      groups:
        # The following is an example of a cluster event alert rule configuration.
        - name: pod-exceptions                             # The name of the alert rule group. This corresponds to the Group_Name field in the alert template.
          rules:
            - name: pod-oom                                # The name of the alert rule.
              type: event                                  # The type of the alert rule (Rule_Type). Valid values: event (event), metric-cms (CloudMonitor metric), and metric-prometheus (Prometheus metric).
              expression: sls.app.ack.pod.oom              # The expression for the alert rule. If the rule type is event, the value of this parameter is the Rule_Expression_Id from the default alert rule template.
              enable: enable                               # The status of the alert rule. Valid values: enable, disable.
            - name: pod-failed
              type: event
              expression: sls.app.ack.pod.failed
              enable: enable
        # The following is an example of a basic cluster resource alert rule configuration.
        - name: res-exceptions                              # The name of the alert rule group. This corresponds to the Group_Name field in the alert template.
          rules:
            - name: node_cpu_util_high                      # The name of the alert rule.
              type: metric-cms                              # The type of the alert rule (Rule_Type). Valid values: event (event), metric-cms (CloudMonitor metric), and metric-prometheus (Prometheus metric).
              expression: cms.host.cpu.utilization          # The expression for the alert rule. If the rule type is metric-cms, the value of this parameter is the Rule_Expression_Id from the default alert rule template.
              contactGroups:                                # The contact group configuration for the alert rule. The ACK console generates this configuration. It is the same for all contacts under an account and can be reused across multiple clusters.
                - arms_contact_group_id_v2: '69xxx'
                  cms_contact_group_name: xxx Contact Group
                  id: '10xxx'
              enable: enable                                # The status of the alert rule. Valid values: enable, disable.
              thresholds:                                   # The thresholds for the alert rule.          
                - key: CMS_ESCALATIONS_CRITICAL_Threshold
                  unit: percent
                  value: '85'                                # CPU utilization threshold: 85% (default)    
                - key: CMS_ESCALATIONS_CRITICAL_Times
                  value: '3'                                # Triggers an alert if the metric exceeds the threshold 3 consecutive times.
                - key: CMS_RULE_SILENCE_SEC                 # Silence period in seconds after the first alert is sent. 
                  value: '900'    

    Use rules.thresholds to customize alert thresholds; the table below describes the parameters. For example, in the preceding sample configuration, an alert notification is sent when the CPU utilization of a cluster node exceeds 85% three consecutive times and more than 900 seconds have passed since the last alert.

    | Parameter | Required | Description | Default |
    | --- | --- | --- | --- |
    | CMS_ESCALATIONS_CRITICAL_Threshold | Yes | The threshold for the alert rule. If you do not configure this parameter, the rule fails to synchronize and becomes disabled. unit: the unit of the threshold (valid values include percent, count, and qps). value: the threshold value. | Depends on the default alert rule template. |
    | CMS_ESCALATIONS_CRITICAL_Times | No | The number of consecutive times the threshold must be exceeded to trigger an alert. If you do not configure this parameter, the system uses the default value. | 3 |
    | CMS_RULE_SILENCE_SEC | No | The silence period in seconds after an alert is first sent. This prevents frequent notifications for a persistent issue. If you do not configure this parameter, the system uses the default value. | 900 |
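The following sketch shows how the three threshold parameters interact. It is an illustration only, not the actual Cloud Monitor evaluation logic. It assumes a strict greater-than comparison and a silence window measured from the last notification that was sent. The names ThresholdRule and AlertSimulator are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    threshold: float = 85.0   # CMS_ESCALATIONS_CRITICAL_Threshold (percent)
    times: int = 3            # CMS_ESCALATIONS_CRITICAL_Times
    silence_sec: int = 900    # CMS_RULE_SILENCE_SEC


class AlertSimulator:
    """Tracks consecutive breaches and the silence window (assumed semantics)."""

    def __init__(self, rule: ThresholdRule) -> None:
        self.rule = rule
        self.breaches = 0
        self.last_alert_at: Optional[float] = None

    def observe(self, timestamp: float, value: float) -> bool:
        """Return True if this sample would send a notification."""
        # A sample at or below the threshold resets the consecutive counter.
        self.breaches = self.breaches + 1 if value > self.rule.threshold else 0
        if self.breaches < self.rule.times:
            return False
        # Suppress repeats until more than silence_sec seconds have passed.
        if (self.last_alert_at is not None
                and timestamp - self.last_alert_at <= self.rule.silence_sec):
            return False
        self.last_alert_at = timestamp
        return True


sim = AlertSimulator(ThresholdRule())
first_three = [sim.observe(t * 60, 90.0) for t in range(3)]
silenced = sim.observe(180, 90.0)
after_silence = sim.observe(1080, 90.0)
```

With samples taken every 60 seconds, the third consecutive breach sends the first notification. A further breach 60 seconds later is suppressed, and a breach 960 seconds after the first notification is sent again.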

CLI

  1. Run the following command to edit the alert rule YAML file.

    kubectl edit ackalertrules default -n kube-system
  2. Modify the YAML file as needed. For parameter details, see the default alert rule template. After you finish editing, save the file and exit.

    The following YAML shows an example alert rule configuration.

    apiVersion: alert.alibabacloud.com/v1beta1
    kind: AckAlertRule
    metadata:
      name: default
    spec:
      groups:
        # The following is an example of a cluster event alert rule configuration.
        - name: pod-exceptions                             # The name of the alert rule group. This corresponds to the Group_Name field in the alert template.
          rules:
            - name: pod-oom                                # The name of the alert rule.
              type: event                                  # The type of the alert rule (Rule_Type). Valid values: event (event), metric-cms (CloudMonitor metric), and metric-prometheus (Prometheus metric).
              expression: sls.app.ack.pod.oom              # The expression for the alert rule. If the rule type is event, the value of this parameter is the Rule_Expression_Id from the default alert rule template.
              enable: enable                               # The status of the alert rule. Valid values: enable, disable.
            - name: pod-failed
              type: event
              expression: sls.app.ack.pod.failed
              enable: enable
        # The following is an example of a basic cluster resource alert rule configuration.
        - name: res-exceptions                              # The name of the alert rule group. This corresponds to the Group_Name field in the alert template.
          rules:
            - name: node_cpu_util_high                      # The name of the alert rule.
              type: metric-cms                              # The type of the alert rule (Rule_Type). Valid values: event (event), metric-cms (CloudMonitor metric), and metric-prometheus (Prometheus metric).
              expression: cms.host.cpu.utilization          # The expression for the alert rule. If the rule type is metric-cms, the value of this parameter is the Rule_Expression_Id from the default alert rule template.
              contactGroups:                                # The contact group configuration for the alert rule. The ACK console generates this configuration. It is the same for all contacts under an account and can be reused across multiple clusters.
                - arms_contact_group_id_v2: 'xxx'
                  cms_contact_group_name: xxx
                  id: 'xxx'
              enable: enable                                # The status of the alert rule. Valid values: enable, disable.
              thresholds:                                   # The thresholds for the alert rule.          
                - key: CMS_ESCALATIONS_CRITICAL_Threshold
                  unit: percent
                  value: '85'                                # CPU utilization threshold: 85% (default)    
                - key: CMS_ESCALATIONS_CRITICAL_Times
                  value: '3'                                # Triggers an alert if the metric exceeds the threshold 3 consecutive times.
                - key: CMS_RULE_SILENCE_SEC                 # Silence period in seconds after the first alert is sent. 
                  value: '900'    

    Use rules.thresholds to customize alert thresholds. For example, in the preceding sample configuration, an alert notification is sent when the CPU utilization of a cluster node exceeds 85% three consecutive times and more than 900 seconds have passed since the last alert.

    | Parameter | Required | Description | Default |
    | --- | --- | --- | --- |
    | CMS_ESCALATIONS_CRITICAL_Threshold | Yes | The threshold for the alert rule. If you do not configure this parameter, the rule fails to synchronize and becomes disabled. unit: the unit of the threshold (valid values include percent, count, and qps). value: the threshold value. | Depends on the default alert rule template. |
    | CMS_ESCALATIONS_CRITICAL_Times | No | The number of consecutive times the threshold must be exceeded to trigger an alert. If you do not configure this parameter, the system uses the default value. | 3 |
    | CMS_RULE_SILENCE_SEC | No | The silence period in seconds after an alert is first sent. This prevents frequent notifications for a persistent issue. If you do not configure this parameter, the system uses the default value. | 900 |
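If the change is scripted instead of made in an editor, it comes down to updating one entry in rules.thresholds. The sketch below works on the resource as a plain Python dict, for example after the YAML has been parsed. The helper name set_threshold is hypothetical, and writing the result back to the cluster is left out.

```python
import copy


def set_threshold(resource, group_name, rule_name, key, value):
    """Return a copy of the resource with one threshold entry set (hypothetical helper)."""
    updated = copy.deepcopy(resource)
    for group in updated["spec"]["groups"]:
        if group["name"] != group_name:
            continue
        for rule in group["rules"]:
            if rule["name"] != rule_name:
                continue
            thresholds = rule.setdefault("thresholds", [])
            for item in thresholds:
                if item["key"] == key:
                    # The sample YAML stores values as quoted strings.
                    item["value"] = str(value)
                    return updated
            thresholds.append({"key": key, "value": str(value)})
            return updated
    raise KeyError(f"rule {group_name}/{rule_name} not found")


# Trimmed-down copy of the sample configuration above.
sample = {
    "apiVersion": "alert.alibabacloud.com/v1beta1",
    "kind": "AckAlertRule",
    "metadata": {"name": "default"},
    "spec": {"groups": [{
        "name": "res-exceptions",
        "rules": [{
            "name": "node_cpu_util_high",
            "type": "metric-cms",
            "expression": "cms.host.cpu.utilization",
            "enable": "enable",
            "thresholds": [
                {"key": "CMS_ESCALATIONS_CRITICAL_Threshold",
                 "unit": "percent", "value": "85"},
            ],
        }],
    }]},
}

raised = set_threshold(sample, "res-exceptions", "node_cpu_util_high",
                       "CMS_ESCALATIONS_CRITICAL_Threshold", 90)
quieter = set_threshold(raised, "res-exceptions", "node_cpu_util_high",
                        "CMS_RULE_SILENCE_SEC", 1800)
```

The helper returns a modified copy instead of changing its input, so the original configuration stays available for comparison.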

Default alert rules

Alerts are synchronized from Log Service, Alibaba Cloud Prometheus, and Cloud Monitor. On the Alerts page, you can view the rule configuration of each alert in Advanced Settings in the Alerts column.

Error events

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Error event | This alert triggers for all error events in the cluster. | SLS | event | error-event | sls.app.ack.error |

Warning events

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Warning event | Triggers on critical warning events in the cluster, excluding certain ignorable ones. | SLS | event | warn-event | sls.app.ack.warn |

Alert rules for core component anomalies (ACK managed clusters)

| Alert | Description | Source | Rule type | ACK CR rule name | SLS event ID |
| --- | --- | --- | --- | --- | --- |
| API server availability anomaly | This alert triggers when the API server has availability issues, which can disrupt cluster management. | Managed Service for Prometheus | metric-prometheus | apiserver-unhealthy | prom.apiserver.notHealthy.down |
| etcd availability anomaly | This alert triggers when etcd is unavailable, affecting the entire cluster's state. | Managed Service for Prometheus | metric-prometheus | etcd-unhealthy | prom.etcd.notHealthy.down |
| kube-scheduler availability anomaly | The kube-scheduler schedules Pods. If unavailable, it may prevent new Pods from starting. | Managed Service for Prometheus | metric-prometheus | scheduler-unhealthy | prom.scheduler.notHealthy.down |
| kube-controller-manager (KCM) availability anomaly | The KCM manages control loops. An anomaly can disrupt the cluster's self-healing and resource adjustment. | Managed Service for Prometheus | metric-prometheus | kcm-unhealthy | prom.kcm.notHealthy.down |
| cloud-controller-manager availability anomaly | This component integrates with external cloud services. An anomaly can disrupt dynamic resource adjustment. | Managed Service for Prometheus | metric-prometheus | ccm-unhealthy | prom.ccm.notHealthy.down |
| CoreDNS request rate drops to zero | CoreDNS provides cluster DNS. An anomaly will disrupt service discovery and domain name resolution. | Managed Service for Prometheus | metric-prometheus | coredns-unhealthy-requestdown | prom.coredns.notHealthy.requestdown |
| CoreDNS panic | This alert triggers when CoreDNS panics. Analyze the logs immediately to find the cause. | Managed Service for Prometheus | metric-prometheus | coredns-unhealthy-panic | prom.coredns.notHealthy.panic |
| High Ingress request error rate | A high HTTP request error rate from the Ingress controller can degrade service availability. | Managed Service for Prometheus | metric-prometheus | ingress-err-request | prom.ingress.request.errorRateHigh |
| Ingress controller SSL certificate expiring soon | An expiring SSL certificate will cause HTTPS requests to fail. Renew the certificate early to prevent service disruption. | Managed Service for Prometheus | metric-prometheus | ingress-ssl-expire | prom.ingress.ssl.expire |
| Number of pending Pods > 1,000 | Too many Pods in the Pending state for a prolonged time may indicate resource shortages or misconfigured scheduling policies. | Managed Service for Prometheus | metric-prometheus | pod-pending-accumulate | prom.pod.pending.accumulate |
| High API server mutating admission webhook response time (RT) | A high response time (RT) from a mutating admission webhook can slow down resource creation and modification. | Managed Service for Prometheus | metric-prometheus | apiserver-admit-rt-high | prom.apiserver.mutating.webhook.rt.high |
| High API server validating admission webhook response time (RT) | A high RT from a validating admission webhook can delay configuration changes. | Managed Service for Prometheus | metric-prometheus | apiserver-validate-rt-high | prom.apiserver.validation.webhook.rt.high |
| Control plane component OOM | A control plane core component ran out of memory (OOM). Investigate the root cause to prevent a service outage. | Log Service | event | ack-controlplane-oom | sls.app.ack.controlplane.pod.oom |

Alert rules for node pool O&M events

| Alert | Description | Source | Rule type | ACK CR rule name | SLS event ID |
| --- | --- | --- | --- | --- | --- |
| Node auto-healing failure | This alert triggers when the node auto-healing process fails. You should investigate the cause immediately and resolve the issue to ensure high availability. | SLS | event | node-repair_failed | sls.app.ack.rc.node_repair_failed |
| Node pool CVE fix failure | This alert triggers when a CVE fix for the node pool fails. This may compromise your cluster's security. You should assess and resolve the issue immediately. | SLS | event | nodepool-cve-fix-failed | sls.app.ack.rc.node_vulnerability_fix_failed |
| Node pool CVE fix success | This alert confirms that a CVE fix was successfully applied to the node pool, mitigating the associated security risk. | SLS | event | nodepool-cve-fix-succ | sls.app.ack.rc.node_vulnerability_fix_succeed |
| Node pool CVE auto-fix skipped | This alert indicates that the automatic fix for a CVE was skipped, possibly due to compatibility issues or specific configurations. You should review your security policy to determine if manual action is needed. | SLS | event | nodepool-cve-fix-skip | sls.app.ack.rc.node_vulnerability_fix_skipped |
| Node pool kubelet configuration failure | This alert triggers when the kubelet configuration for the node pool fails to update. This may affect node performance and resource scheduling. | SLS | event | nodepool-kubelet-cfg-failed | sls.app.ack.rc.node_kubelet_config_failed |
| Node pool kubelet configuration success | The new kubelet configuration was successfully applied to the node pool. Verify that the configuration has taken effect and meets your expectations. | SLS | event | nodepool-kubelet-config-succ | sls.app.ack.rc.node_kubelet_config_succeed |
| Node pool kubelet upgrade failure | A failed kubelet upgrade in the node pool can affect cluster stability and functionality. You should review the upgrade process and configuration. | SLS | event | nodepool-k-c-upgrade-failed | sls.app.ack.rc.node_kubelet_config_upgrade_failed |
| Node pool kubelet upgrade success | The kubelet in the node pool was upgraded successfully. Ensure the new version meets your cluster and application requirements. | SLS | event | nodepool-k-c-upgrade-succ | sls.app.ack.rc.kubelet_upgrade_succeed |
| Node pool runtime upgrade success | The container runtime in the node pool was upgraded successfully. | SLS | event | nodepool-runtime-upgrade-succ | sls.app.ack.rc.runtime_upgrade_succeed |
| Node pool runtime upgrade failure | The container runtime in the node pool failed to upgrade. | SLS | event | nodepool-runtime-upgrade-fail | sls.app.ack.rc.runtime_upgrade_failed |
| Node pool OS image upgrade success | The OS image in the node pool was upgraded successfully. | SLS | event | nodepool-os-upgrade-succ | sls.app.ack.rc.os_image_upgrade_succeed |
| Node pool OS image upgrade failure | The OS image in the node pool failed to upgrade. | SLS | event | nodepool-os-upgrade-failed | sls.app.ack.rc.os_image_upgrade_failed |
| Lingjun node pool configuration change success | The configuration change for the Lingjun node pool completed successfully. | SLS | event | nodepool-lingjun-config-succ | sls.app.ack.rc.lingjun_configuration_apply_succeed |
| Lingjun node pool configuration change failure | The configuration change for the Lingjun node pool failed. | SLS | event | nodepool-lingjun-cfg-failed | sls.app.ack.rc.lingjun_configuration_apply_failed |
| Node instance system exception | This alert triggers when a system exception, such as an ECS or Lingjun system event, occurs on the underlying cloud resource instance of a Container Service node. Investigate immediately to mitigate business impact and initiate auto-healing or replacement procedures as needed. | SLS | event | node-system-exception | sls.app.ack.nlc.out_of_band_event_observer |

Node anomaly alerts

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Docker process anomaly | The Dockerd or containerd runtime on a node is abnormal. | Log Service | event | docker-hang | sls.app.ack.docker.hang |
| Eviction event | An eviction event has occurred in the cluster. | Log Service | event | eviction-event | sls.app.ack.eviction |
| GPU XID error | A GPU XID error has occurred in the cluster. | Log Service | event | gpu-xid-error | sls.app.ack.gpu.xid_error |
| Node goes offline | A node in the cluster has gone offline. | Log Service | event | node-down | sls.app.ack.node.down |
| Node restarts | A node in the cluster has restarted. | Log Service | event | node-restart | sls.app.ack.node.restart |
| NTP service anomaly | A node's time synchronization service is abnormal. | Log Service | event | node-ntp-down | sls.app.ack.ntp.down |
| PLEG anomaly | A node's PLEG is abnormal. | Log Service | event | node-pleg-error | sls.app.ack.node.pleg_error |
| Process anomaly | A node has an abnormal number of processes. | Log Service | event | ps-hang | sls.app.ack.ps.hang |
| Too many file handles | A node has too many file handles. | Log Service | event | node-fd-pressure | sls.app.ack.node.fd_pressure |
| Too many processes | A node has too many processes. | Log Service | event | node-pid-pressure | sls.app.ack.node.pid_pressure |
| Failed to delete a node | The cluster failed to delete a node. | Log Service | event | node-del-err | sls.app.ack.ccm.del_node_failed |
| Failed to add a node | The cluster failed to add a node. | Log Service | event | node-add-err | sls.app.ack.ccm.add_node_failed |
| Managed node pool command execution fails | Command execution failed in a managed node pool. | Log Service | event | nlc-run-cmd-err | sls.app.ack.nlc.run_command_fail |
| Empty task command in managed node pool | A task in a managed node pool is missing a command. | Log Service | event | nlc-empty-cmd | sls.app.ack.nlc.empty_task_cmd |
| Unimplemented task mode in managed node pool | An unimplemented task mode was used in a managed node pool. | Log Service | event | nlc-url-m-unimp | sls.app.ack.nlc.url_mode_unimpl |
| Unknown repair operation in managed node pool | An unknown repair operation was triggered in a managed node pool. | Log Service | event | nlc-opt-no-found | sls.app.ack.nlc.op_not_found |
| Error destroying a node in a managed node pool | An error occurred while destroying a node in a managed node pool. | Log Service | event | nlc-des-node-err | sls.app.ack.nlc.destroy_node_fail |
| Failed to drain a node in a managed node pool | A node in a managed node pool failed to drain. | Log Service | event | nlc-drain-node-err | sls.app.ack.nlc.drain_node_fail |
| Restarted ECS instance not reaching desired state | An ECS instance in a managed node pool failed to reach its desired state after a restart. | Log Service | event | nlc-restart-ecs-wait | sls.app.ack.nlc.restart_ecs_wait_fail |
| Failed to restart ECS instance in managed node pool | An ECS instance in a managed node pool failed to restart. | Log Service | event | nlc-restart-ecs-err | sls.app.ack.nlc.restart_ecs_fail |
| Failed to reset ECS instance in managed node pool | An ECS instance in a managed node pool failed to reset. | Log Service | event | nlc-reset-ecs-err | sls.app.ack.nlc.reset_ecs_fail |
| Self-healing task fails in managed node pool | A self-healing task in a managed node pool failed. | Log Service | event | nlc-sel-repair-err | sls.app.ack.nlc.repair_fail |

Alert rules for cluster resource anomalies

| Alert | Description | Source | Rule type | ACK CR rule name | SLS event ID |
| --- | --- | --- | --- | --- | --- |
| Node CPU utilization ≥ 85% | A node's CPU utilization exceeds the threshold. The default threshold is 85%. When the remaining resources fall below 15%, the CPU utilization may exceed the resources reserved for the container engine. This can cause frequent CPU throttling and severely affect process responsiveness. For more information, see Resource reservation policy. We recommend that you optimize CPU usage or adjust the threshold. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | node_cpu_util_high | cms.host.cpu.utilization |
| Node memory utilization ≥ 85% | A node's memory utilization exceeds the threshold. The default threshold is 85%. If the remaining resources fall below 15%, memory usage may exceed the memory resources reserved for the container engine. In this scenario, the kubelet initiates a forced eviction. For more information, see Resource reservation policy. We recommend that you optimize memory usage or adjust the threshold. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | node_mem_util_high | cms.host.memory.utilization |
| Node disk usage ≥ 85% | A node's disk usage exceeds the threshold. The default threshold is 85%. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | node_disk_util_high | cms.host.disk.utilization |
| Node outbound internet bandwidth utilization ≥ 85% | A node's outbound internet bandwidth utilization exceeds the threshold. The default threshold is 85%. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | node_public_net_util_high | cms.host.public.network.utilization |
| Node inode usage ≥ 85% | A node's inode usage exceeds the threshold. The default threshold is 85%. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | node_fs_inode_util_high | cms.host.fs.inode.utilization |
| SLB Layer-7 QPS utilization ≥ 85% | The QPS of an SLB instance exceeds the threshold. The default threshold is 85%. Note: This alert applies to the SLB instances associated with the API server and Ingress. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | slb_qps_util_high | cms.slb.qps.utilization |
| SLB outbound bandwidth utilization ≥ 85% | An SLB instance's outbound bandwidth utilization exceeds the threshold. The default threshold is 85%. Note: This alert applies to the SLB instances associated with the API server and Ingress. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | slb_traff_tx_util_high | cms.slb.traffic.tx.utilization |
| SLB max connections utilization ≥ 85% | An SLB instance's max connections utilization exceeds the threshold. The default threshold is 85%. Note: This alert applies to the SLB instances associated with the API server and Ingress. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | slb_max_con_util_high | cms.slb.max.connection.utilization |
| SLB listener dropped connections ≥ 1 | The number of dropped connections per second for an SLB instance listener continuously exceeds the default threshold of 1. Note: This alert applies to the SLB instances associated with the API server and Ingress. To adjust the threshold, see Configure alert rules. | CloudMonitor | metric-cms | slb_drop_con_high | cms.slb.drop.connection |
| Insufficient disk space on node | This event is triggered when a node in the cluster has insufficient disk space. | SLS | event | node-disk-pressure | sls.app.ack.node.disk_pressure |
| Insufficient scheduling resources | This event is triggered when the cluster lacks sufficient resources for scheduling. | SLS | event | node-res-insufficient | sls.app.ack.resource.insufficient |
| Insufficient IP resources | This event is triggered when the cluster is running out of available IP addresses. | SLS | event | node-ip-pressure | sls.app.ack.ip.not_enough |
| Disk usage exceeds threshold | This event is triggered when the cluster's disk usage exceeds the threshold. Check the disk usage of your cluster. | SLS | event | disk_space_press | sls.app.ack.csi.no_enough_disk_space |

ACK control plane O&M alert rules

Alert

Description

Source

Rule type

CRD rule name

Event ID

Cluster task notification

Describes plans and changes on the control plane.

SLS

event

ack-system-event-info

sls.app.ack.system_events.task.info

Cluster task failure notification

Indicates a failed cluster operation that requires prompt investigation.

SLS

event

ack-system-event-error

sls.app.ack.system_events.task.error

Auto scaling alert rules

Alert

Description

Source

Rule type

ACK CR rule name

SLS event ID

Scale-out

Nodes are automatically scaled out to handle increased load.

Log Service

event

autoscaler-scaleup

sls.app.ack.autoscaler.scaleup_group

Scale-in

Nodes are automatically scaled in to conserve resources when the load decreases.

Log Service

event

autoscaler-scaledown

sls.app.ack.autoscaler.scaledown

Scale-out timeout

A scale-out operation timed out. This may indicate insufficient resources or an improper policy.

Log Service

event

autoscaler-scaleup-timeout

sls.app.ack.autoscaler.scaleup_timeout

Scale-in of empty nodes

Inactive or underutilized nodes are removed to optimize resource usage.

Log Service

event

autoscaler-scaledown-empty

sls.app.ack.autoscaler.scaledown_empty

Scale-out failure

A scale-out operation failed. Investigate the cause and adjust the resource policy.

Log Service

event

autoscaler-up-group-failed

sls.app.ack.autoscaler.scaleup_group_failed

Unhealthy cluster due to auto scaling

The cluster became unhealthy during an auto scaling operation, requiring immediate investigation.

Log Service

event

autoscaler-cluster-unhealthy

sls.app.ack.autoscaler.cluster_unhealthy

Deletion of long-unstarted nodes

Nodes that fail to start within a specified period are removed to reclaim resources.

Log Service

event

autoscaler-del-started

sls.app.ack.autoscaler.delete_started_timeout

Deletion of unregistered nodes

Nodes that fail to register with the cluster are removed to optimize cluster resources.

Log Service

event

autoscaler-del-unregistered

sls.app.ack.autoscaler.delete_unregistered

Scale-in failure

A scale-in failure can waste resources and cause uneven load distribution.

Log Service

event

autoscaler-scale-down-failed

sls.app.ack.autoscaler.scaledown_failed

Node deleted before drain

An auto scaling operation removed a node before successfully draining its running Pods.

Log Service

event

autoscaler-instance-expired

sls.app.ack.autoscaler.instance_expired

Alert rules for cluster application workloads

Alert

Description

Source

Rule type

ACK CR rule name

SLS event ID

Job failed

Triggers when a Job fails during execution.

Alibaba Cloud Managed Service for Prometheus

metric-prometheus

job-failed

prom.job.failed

Insufficient available Deployment replicas

Triggers when the number of available replicas for a Deployment falls below the desired count. This may cause a partial or complete service outage.

Alibaba Cloud Managed Service for Prometheus

metric-prometheus

deployment-rep-err

prom.deployment.replicaError

Abnormal DaemonSet replica status

Triggers when a DaemonSet replica is in an abnormal state, such as failing to start or entering a crash loop. This can affect the expected behavior of services on the corresponding node.

Alibaba Cloud Managed Service for Prometheus

metric-prometheus

daemonset-status-err

prom.daemonset.scheduledError

DaemonSet scheduling anomaly

Triggers when a DaemonSet fails to schedule on one or more nodes. Common causes include resource constraints and scheduling policy conflicts.

Alibaba Cloud Managed Service for Prometheus

metric-prometheus

daemonset-misscheduled

prom.daemonset.misscheduled

Pod anomaly alert rules

Alert

Description

Source

Rule type

ACK CR rule name

SLS event ID

Pod OOM

A pod or one of its processes runs out of memory (OOM).

SLS

event

pod-oom

sls.app.ack.pod.oom

Pod fails to start

A pod in the cluster fails to start.

SLS

event

pod-failed

sls.app.ack.pod.failed

Unhealthy pod status

A pod is in an unhealthy state, such as Pending, Failed, or Unknown.

Managed Service for Prometheus

metric-prometheus

pod-status-err

prom.pod.status.notHealthy

Pod CrashLoopBackOff

A pod frequently fails to start and enters the CrashLoopBackOff state.

Managed Service for Prometheus

metric-prometheus

pod-crashloop

prom.pod.status.crashLooping

Alert rules for cluster storage anomalies

Alert

Description

Source

Rule type

ACK CR rule name

SLS event ID

Disk capacity less than 20 GiB

Disks with a capacity of less than 20 GiB cannot be mounted due to a fixed cluster limit. Check the capacity of the disk that you are trying to mount.

Log Service

event

csi_invalid_size

sls.app.ack.csi.invalid_disk_size

Subscription disks not supported for container volumes

Subscription disks cannot be mounted as container volumes in the cluster. Check the billing method of the disk.

Log Service

event

csi_not_portable

sls.app.ack.csi.disk_not_portable

Failed to unmount mount target (device busy)

The mount target cannot be unmounted because it is in use by an active process or its resources have not been fully released.

Log Service

event

csi_device_busy

sls.app.ack.csi.device_busy

No available disks

No disks were available for a cluster storage mount operation.

Log Service

event

csi_no_ava_disk

sls.app.ack.csi.no_ava_disk

Disk IOHang

An I/O hang anomaly has occurred on a disk in the cluster, which means I/O operations are not completing.

Log Service

event

csi_disk_iohang

sls.app.ack.csi.disk_iohang

Slow I/O on PVC-bound disk

A PersistentVolumeClaim (PVC) bound to a disk is experiencing high I/O latency, also known as a slow I/O anomaly.

Log Service

event

csi_latency_high

sls.app.ack.csi.latency_too_high

PersistentVolume (PV) in a failed state

A PersistentVolume (PV) in the cluster has entered a failed state.

Managed Service for Prometheus

metric-prometheus

pv-failed

prom.pv.failed

Cluster network anomaly alerts

Alert

Description

Source

Rule type

ACK CR rule name

SLS event ID

Multiple route tables in VPC

Multiple route tables in a VPC can complicate network configuration and cause routing conflicts. We recommend optimizing your network structure.

Log Service

event

ccm-vpc-multi-route-err

sls.app.ack.ccm.describe_route_tables_failed

No available SLB

The cluster failed to create an SLB instance.

Log Service

event

slb-no-ava

sls.app.ack.ccm.no_ava_slb

SLB synchronization failure

The cluster failed to synchronize a newly created SLB instance.

Log Service

event

slb-sync-err

sls.app.ack.ccm.sync_slb_failed

SLB deletion failure

The cluster failed to delete an SLB instance.

Log Service

event

slb-del-err

sls.app.ack.ccm.del_slb_failed

Route creation failure

The cluster failed to create a VPC network route.

Log Service

event

route-create-err

sls.app.ack.ccm.create_route_failed

Route synchronization failure

The cluster failed to synchronize a VPC network route.

Log Service

event

route-sync-err

sls.app.ack.ccm.sync_route_failed

Invalid Terway resource

The system detected an invalid resource in the cluster's Terway network.

Log Service

event

terway-invalid-res

sls.app.ack.terway.invalid_resource

Terway IP allocation failure

The Terway network component failed to allocate an IP address.

Log Service

event

terway-alloc-ip-err

sls.app.ack.terway.alloc_ip_fail

Ingress bandwidth configuration parse failure

The Terway network component failed to parse the ingress bandwidth configuration.

Log Service

event

terway-parse-err

sls.app.ack.terway.parse_fail

Terway resource allocation failure

The Terway network component failed to allocate network resources.

Log Service

event

terway-alloc-res-err

sls.app.ack.terway.allocate_failure

Terway resource reclamation failure

The Terway network component failed to reclaim network resources.

Log Service

event

terway-dispose-err

sls.app.ack.terway.dispose_failure

Terway virtual mode change

The virtual mode of the cluster's Terway network has changed.

Log Service

event

terway-virt-mod-err

sls.app.ack.terway.virtual_mode_change

Terway-triggered pod IP configuration check

The Terway network component triggered an IP configuration check for a pod.

Log Service

event

terway-ip-check

sls.app.ack.terway.config_check

Ingress configuration reload failure

The Ingress configuration failed to reload. Ensure that the configuration is correct.

Log Service

event

ingress-reload-err

sls.app.ack.ingress.err_reload_nginx

Critical cluster audit alert rules

Alert

Description

Source

Rule type

CRD rule name

Event ID

Container logon or command execution

This event triggers when a user executes a command in or logs on to a container. Auditing this activity is critical to track unauthorized access and investigate security incidents.

SLS

event

audit-at-command

sls.app.k8s.audit.at.command

Node scheduling status change

Changes to a node's scheduling status can affect service availability and resource allocation. Monitor these changes, such as cordoning a node, to confirm they are intentional and do not disrupt workloads.

SLS

event

audit-cordon-switch

sls.app.k8s.audit.at.cordon.uncordon

Resource deletion

Resource deletion can be a routine operation or an indicator of malicious activity. As a security best practice, audit all deletions to prevent the accidental or unauthorized removal of critical components.

SLS

event

audit-resource-delete

sls.app.k8s.audit.at.delete

Node drain or eviction

Draining a node or evicting pods is typically part of a maintenance workflow or a response to resource pressure. Monitor these events to understand their impact on service stability and confirm they are intentional.

SLS

event

audit-drain-eviction

sls.app.k8s.audit.at.drain.eviction

Logon from the internet

Logons from the Internet pose a significant security risk. Monitor these events to verify that access is restricted to authorized users and source IPs.

SLS

event

audit-internet-login

sls.app.k8s.audit.at.internet.login

Node label update

Node labels are key to workload scheduling and resource management. Unauthorized or incorrect label changes can disrupt operations. Audit these updates to ensure they are intentional and correctly applied.

SLS

event

audit-node-label-update

sls.app.k8s.audit.at.label

Node taint update

Taints and tolerations are advanced scheduling mechanisms. An incorrect taint can prevent pods from being scheduled on a node. Monitor taint updates to maintain cluster health and ensure intended scheduling behavior.

SLS

event

audit-node-taint-update

sls.app.k8s.audit.at.taint

Resource modification

Modifying critical resources such as Deployments, Services, or ConfigMaps can have a broad impact. Auditing these updates helps you track changes, ensure they align with your operational policies, and detect unauthorized modifications.

SLS

event

audit-resource-update

sls.app.k8s.audit.at.update

Security anomaly alert rules

Alert

Description

Source

Rule type

CRD rule name

Event ID

High-risk configuration detected

Triggered when a security inspection identifies a high-risk configuration in the cluster.

Log Service

event

si-c-a-risk

sls.app.ack.si.config_audit_high_risk

Cluster inspection anomaly alerts

Alert

Description

Source

Rule type

CRD rule name

Event ID

Cluster inspection anomaly

Triggered when the automated cluster inspection detects a potential anomaly. Analyze the issue and adjust your maintenance strategy accordingly.

SLS

event

cis-sched-failed

sls.app.ack.cis.schedule_task_failed

Troubleshooting alerts

Pod eviction triggered by node disk pressure

Alert message

(combined from similar events): Failed to garbage collect required amount of images. Attempted to free XXXX bytes, but only found 0 bytes eligible to free

Symptoms

The Pod's status is Evicted, and the node reports a disk pressure condition: The node had condition: [DiskPressure].

Cause

When disk usage on a node reaches the eviction threshold (85% by default), the kubelet initiates pressure-based eviction. The kubelet attempts to free space, primarily through image garbage collection. If this fails to reclaim enough space, the kubelet evicts the Pod to relieve the disk pressure. You can log on to the target node and run the df -h command to check the disk usage.
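The df check described above can be narrowed to only the file systems that are at or above the threshold. The following is a minimal sketch, assuming a POSIX shell with df and awk available on the node. The function name disks_over_threshold is hypothetical and is not an ACK tool, and the default of 85 only mirrors the threshold stated above.

```shell
# Print mount points whose usage is at or above a threshold (percent).
# Reads `df -P` formatted output on stdin so it can be tried with sample data.
# disks_over_threshold is a hypothetical helper name, not an ACK tool.
disks_over_threshold() {
  threshold="${1:-85}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the trailing percent sign
    if (use + 0 >= t + 0) print $6 " " use "%"
  }'
}

# Usage on a node: df -P | disks_over_threshold 85
```

The sketch assumes mount points without spaces, because it reads the mount point from the sixth whitespace-separated column of the df -P output.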

Solution

  1. Log on to the target node (in a containerd runtime environment) and run the following command to prune unused container images and free up disk space.

    crictl rmi --prune
  2. Clean up logs or scale up the node's disk.

  3. Adjust the relevant thresholds.

    • Adjust the kubelet's image garbage collection thresholds to reduce Pod evictions caused by high disk usage. For more information, see Customize kubelet configurations for a node pool.

    • You will receive an alert when the node disk usage reaches or exceeds 85%. Based on your business requirements, you can modify the alert threshold for the node_disk_util_high rule in the YAML configuration. For more information, see Configure alert rules.

Recommendations and precautions

  • For nodes that frequently experience this issue, assess your application's storage needs and plan your resource requests and node disk capacity accordingly.

  • Regularly monitor storage usage to promptly identify and mitigate potential risks. For more information, see the node storage monitoring dashboard.

Pod OOMKilling

Alert message

pod was OOM killed. node:xxx pod:xxx namespace:xxx uuid:xxx

Symptoms

The Pod status is abnormal, and PodOOMKilling appears in the event content.

Solution

An Out of Memory (OOM) error can be triggered at the node level or the container cgroup level.

  • Cause:

    • Container cgroup-level OOM: The Pod's actual memory usage exceeds its configured memory limit, so the kernel OOM killer terminates a process in the container's cgroup.

    • Node-level OOM: The node's memory is exhausted. Common causes include running too many Pods without resource configurations (requests/limits) or high memory consumption by non-Kubernetes processes.

  • Diagnosis: To determine whether an OOM event has occurred, log on to the target node and run the dmesg -T | grep -i "memory" command. An OOM event has occurred if the output contains a message such as out_of_memory. The event is a container cgroup-level OOM if the output also contains Memory cgroup, and a node-level OOM otherwise.

  • Resolution: For more information about OOM errors and how to resolve them, see Causes and solutions for OOM Killer.
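The check on the dmesg output can be expressed as a small script. The following is a minimal sketch, assuming a POSIX shell with grep. The function name classify_oom is hypothetical. In addition to out_of_memory, the sketch also matches the kernel's "Out of memory" message text, which is an assumption beyond the example string given above.

```shell
# Classify kernel log text read from stdin as "cgroup", "node", or "none".
# classify_oom is a hypothetical helper name, not part of ACK or the kernel.
classify_oom() {
  log=$(cat)
  if ! printf '%s\n' "$log" | grep -qiE 'out_of_memory|out of memory'; then
    echo "none"      # no OOM message found
  elif printf '%s\n' "$log" | grep -q 'Memory cgroup'; then
    echo "cgroup"    # container cgroup-level OOM
  else
    echo "node"      # node-level OOM
  fi
}

# Usage on a node: dmesg -T | grep -i "memory" | classify_oom
```

The function returns a single label for the whole log, so if the kernel log contains both kinds of events, inspect the individual messages and their timestamps.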

Pod status is CrashLoopBackOff

When a container process in a Pod exits unexpectedly, the kubelet restarts the container. If the Pod still fails to reach a ready state after multiple restarts, its status becomes CrashLoopBackOff. Follow these steps to troubleshoot the issue:

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Pods.

  3. In the list of Pods, find the abnormal Pod and click Details in the Actions column.

  4. Go to the Events tab and analyze the details of any abnormal events.

  5. Check the Logs tab for information about why the process failed.

    Note

    If the Pod has been restarted, select Show the log of the last container exit to view logs from the failed attempt.

    You can view only the 500 most recent log entries in the console. To view more historical logs, we recommend that you use a log persistence solution for unified collection and storage.
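If you have kubectl access to the cluster, roughly the same information is available from the command line. The following is a minimal sketch, assuming kubectl is configured for the cluster. The function name inspect_crashloop_pod is hypothetical, and the 500-line limit is only an illustrative value that mirrors the console.

```shell
# Show events and the previous container's logs for a Pod.
# inspect_crashloop_pod is a hypothetical helper name, not an ACK command.
# Arguments: <namespace> <pod-name>
inspect_crashloop_pod() {
  ns="$1"
  pod="$2"
  # Pod details, including recent events.
  kubectl describe pod "$pod" -n "$ns"
  # Logs from the previous, exited container instance. The 500-line limit
  # is an illustrative choice, not a kubectl default.
  kubectl logs "$pod" -n "$ns" --previous --tail=500
}

# Usage: inspect_crashloop_pod <namespace> <pod-name>
```

For a Pod with more than one container, you may need to add the -c option to kubectl logs to select the container.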