ACK alert management - Container Service for Kubernetes (ACK) - Alibaba Cloud Help Center

Monitor cluster events, resource metrics, and component health with configurable CRD-based alert rules.

Billing

The alert feature uses data from Simple Log Service (SLS), Managed Service for Prometheus, and CloudMonitor. Notifications such as SMS messages and phone calls incur additional charges. Review the default alert rule template to identify the source of each alert, then enable the services that your rules require.

Alert data source	Configuration requirements	Billing details
Simple Log Service (SLS)	Enable event monitoring. Event monitoring is enabled by default when you enable the alert feature.	Pay-by-feature billing
Managed Service for Prometheus	Configure Managed Service for Prometheus for your cluster.	Billing instructions
CloudMonitor	Enable CloudMonitor for your ACK cluster.	Pay-as-you-go
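
To see which services a set of rules depends on, the rule expression IDs can be grouped by prefix. The sketch below assumes the prefix convention visible in the default alert rule template tables of this topic (`sls.` for SLS events, `cms.` for CloudMonitor metrics, `prom.` for Managed Service for Prometheus metrics). The function name `required_sources` is hypothetical and not part of any ACK tooling.

```python
# Prefix-to-source mapping inferred from the default alert rule template tables.
SOURCE_BY_PREFIX = {
    "sls.": "SLS",
    "cms.": "CloudMonitor",
    "prom.": "Managed Service for Prometheus",
}


def required_sources(expressions):
    """Return the sorted data sources needed by the given rule expression IDs."""
    sources = set()
    for expression in expressions:
        for prefix, source in SOURCE_BY_PREFIX.items():
            if expression.startswith(prefix):
                sources.add(source)
                break
        else:
            # No known prefix matched, so the source cannot be inferred.
            raise ValueError(f"unrecognized rule expression: {expression}")
    return sorted(sources)


enabled = ["sls.app.ack.pod.oom", "cms.host.cpu.utilization", "prom.pv.failed"]
print(required_sources(enabled))
```

Each source returned here corresponds to a row in the billing table above, so the result can be used as a checklist of services to enable.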

Enable alert management

Configure alerts for cluster resources to receive notifications automatically when anomalies occur. See Default alert rule template for the alerts that are available.

ACK managed clusters

Enable alert configuration for an existing cluster or when creating a new cluster.

Existing cluster

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click Operations > Alerts.
On the Alerts page, follow the on-screen instructions to install or upgrade the required components.

After installation or upgrade, go to the Alerts page to configure alerts.

Tab	Description
Alert Rules	Status: Enable or disable an alert rule set. Edit Contact Group: Set the contact group for alert notifications. Notifications are sent to contact groups only. Create contacts and groups first. To notify an individual, create a dedicated group for that contact.
Alert History	View up to 100 alert records from the last 24 hours. Click a link in the Alert Rule column to view rule configurations in the corresponding monitoring system. Click Details to navigate to the anomaly-related resource page. Click Intelligent Analytics for AI-powered issue analysis and troubleshooting.
Alert Contacts	Create, edit, or delete contacts. Contact methods: Phone call/SMS: Set a mobile number for a contact to receive alerts by phone and SMS. Only verified mobile numbers can receive phone call notifications. See Verify a mobile phone number. Email: Set an email address for a contact to receive alert notifications. Chatbots: DingTalk chatbots, WeCom chatbots, and Lark chatbots. For DingTalk chatbots, add security keywords: alert, dispatch. Verify email and chatbot notifications in the CloudMonitor console under Alerts > Alert Contacts before configuring them.
Alert Contact Groups	Create, edit, or delete contact groups. If no contact group exists, the console creates a default group from your Alibaba Cloud account.

New cluster

When you create a cluster, on the Component Configurations page, select Alerts and then select Use Default Alert Rule Template. Choose an Alert Notification Contact Group. See Create an ACK managed cluster.

After alert configuration is enabled, default rules take effect and the default contact group receives notifications. You can modify the alert contacts or contact groups.

ACK dedicated clusters

For an ACK dedicated cluster, grant permissions to the Worker RAM role before enabling default alert rules.

Grant permissions to the worker RAM role

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click Cluster Information.
On the Cluster Information page, in the Cluster Resources section, copy the name next to Worker RAM Role and click the link to open the RAM console.
1. Create the following custom policy. See Create a custom policy.
```
{
    "Version": "1",
    "Statement": [
        {
            "Action": [
                "log:*",
                "arms:*",
                "cms:*",
                "cs:UpdateContactGroup"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow"
        }
    ]
}
```
2. On the Role page, find the Worker RAM role and attach the custom policy. See Manage permissions of a RAM role.
  
  Note: This example uses broad permissions for simplicity. In production, follow the principle of least privilege.
Verify that alert permissions are configured.
1. In the left-side navigation pane of the cluster management page, choose Workloads > Deployments.
2. Select kube-system from the Namespace drop-down list and click the name of alicloud-monitor-controller in the Name column.
3. Click the Logs tab to view the pod logs that indicate successful authorization.
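
To act on the least-privilege note above, one option is to list the wildcard actions in the custom policy before attaching it. The following is a minimal sketch: it accepts either a full policy document with a `Statement` list or a single bare statement, and the helper name `wildcard_actions` is hypothetical.

```python
import json

POLICY = json.loads("""
{
  "Version": "1",
  "Statement": [
    {
      "Action": ["log:*", "arms:*", "cms:*", "cs:UpdateContactGroup"],
      "Resource": ["*"],
      "Effect": "Allow"
    }
  ]
}
""")


def wildcard_actions(policy):
    """Return Allow actions that contain a wildcard, in document order."""
    # Accept either a full policy document or a single bare statement.
    statements = policy.get("Statement", [policy])
    found = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        found.extend(action for action in actions if "*" in action)
    return found


print(wildcard_actions(POLICY))
```

Every action the sketch reports is a candidate for narrowing to the specific operations the alert components need.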

Enable default alert rules

In the left-side navigation pane of the cluster page, choose Operations > Alerts.

On the Alerts page, configure alert settings.

Tab	Description
Alert Rules	Status: Enable or disable an alert rule set. Edit Contact Group: Set the contact group for alert notifications. Notifications are sent to contact groups only. Create contacts and groups first. To notify an individual, create a dedicated group for that contact.
Alert History	View up to 100 alert records from the last 24 hours. Click a link in the Alert Rule column to view rule configurations in the corresponding monitoring system. Click Details to navigate to the anomaly-related resource page. Click Intelligent Analytics for AI-powered issue analysis and troubleshooting.
Alert Contacts	Create, edit, or delete contacts. Contact methods: Phone call/SMS: Set a mobile number for a contact to receive alerts by phone and SMS. Only verified mobile numbers can receive phone call notifications. See Verify a mobile phone number. Email: Set an email address for a contact to receive alert notifications. Chatbots: DingTalk chatbots, WeCom chatbots, and Lark chatbots. For DingTalk chatbots, add security keywords: alert, dispatch. Verify email and chatbot notifications in the CloudMonitor console under Alerts > Alert Contacts before configuring them.
Alert Contact Groups	Create, edit, or delete contact groups. If no contact group exists, the console creates a default group from your Alibaba Cloud account.

Configure alert rules

Enabling alert configuration creates an AckAlertRule custom resource named default in the kube-system namespace, populated from the default alert rule template. Modify this resource to customize alert rules.

Console

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click Operations > Alerts.
On the Alert Rules tab, click Configure Alert Rule in the upper-right corner. Then, in the Actions column of the target rule, click YAML to view the AckAlertRule CRD resource configuration.

Modify the YAML file. See Default alert rule template for parameter details.

The following example shows the YAML configuration of an alert rule:

Alert rule YAML configuration

apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
spec:
  groups:
    # The following is a sample configuration for a cluster event alert rule.
    - name: pod-exceptions                             # The name of the alert rule group, which corresponds to the Group_Name field in the alert template.
      rules:
        - name: pod-oom                                # The name of the alert rule.
          type: event                                  # The type of the alert rule (Rule_Type). Valid values: event, metric-cms (CloudMonitor metric), and metric-prometheus (Prometheus metric).
          expression: sls.app.ack.pod.oom              # The alert rule expression. If the rule type is event, the value is the Rule_Expression_Id from the default alert rule template in this topic.
          enable: enable                               # The state of the alert rule. Valid values: enable and disable.
        - name: pod-failed
          type: event
          expression: sls.app.ack.pod.failed
          enable: enable
    # The following is a sample configuration for a cluster infrastructure resource alert rule.
    - name: res-exceptions                              # The name of the alert rule group, which corresponds to the Group_Name field in the alert template.
      rules:
        - name: node_cpu_util_high                      # The name of the alert rule.
          type: metric-cms                              # The type of the alert rule (Rule_Type). Valid values: event, metric-cms (CloudMonitor metric), and metric-prometheus (Prometheus metric).
          expression: cms.host.cpu.utilization          # The alert rule expression. If the rule type is metric-cms, the value is the Rule_Expression_Id from the default alert rule template in this topic.
          contactGroups:                                # The contact groups mapped to the alert rule. This configuration is generated by the ACK console. Contacts are consistent for a single account and can be reused across multiple clusters.
            - arms_contact_group_id_v2: '69xxx'
              cms_contact_group_name: xxx Contact Group
              id: '10xxx'
          enable: enable                                # The state of the alert rule. Valid values: enable and disable.
          thresholds:                                   # The threshold of the alert rule.
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: '85'                                # CPU utilization threshold. Default value: 85%.
            - key: CMS_ESCALATIONS_CRITICAL_Times
              value: '3'                                # An alert is triggered if the threshold is exceeded 3 consecutive times.
            - key: CMS_RULE_SILENCE_SEC                 # The silence period after the first alert is reported.
              value: '900'

Use rules.thresholds to customize alert thresholds. For example, the preceding configuration triggers an alert when a node's CPU utilization exceeds 85% three consecutive times and more than 900 seconds have passed since the last alert.
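
The interaction between the three thresholds can be illustrated with a small simulation. This is only a sketch of the semantics described above, not CloudMonitor's actual evaluation logic. The function name, the sample readings, and the 60-second sampling interval are assumptions made for illustration.

```python
def alert_times(samples, threshold=85.0, times=3, silence_sec=900, interval_sec=60):
    """Return the timestamps (in seconds) at which an alert would fire.

    samples: utilization readings taken every interval_sec seconds.
    An alert fires when the reading has exceeded the threshold for `times`
    consecutive samples and more than `silence_sec` seconds have passed
    since the previous alert.
    """
    fired = []
    consecutive = 0
    last_alert = None
    for index, value in enumerate(samples):
        now = index * interval_sec
        # Reset the streak as soon as a reading is at or below the threshold.
        consecutive = consecutive + 1 if value > threshold else 0
        silenced = last_alert is not None and now - last_alert <= silence_sec
        if consecutive >= times and not silenced:
            fired.append(now)
            last_alert = now
    return fired


# Two normal readings, then sustained high CPU utilization for 20 minutes.
readings = [50, 60] + [90] * 20
print(alert_times(readings))
```

In this sketch the first alert fires at the third consecutive high reading, and later alerts are suppressed until the silence period has elapsed, even though utilization stays above the threshold.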

Parameter	Required	Description	Default
`CMS_ESCALATIONS_CRITICAL_Threshold`	Yes	The threshold for the alert rule. If this parameter is omitted, rule synchronization fails and the rule is disabled. `unit`: The unit of the threshold. Valid values: percent, count, and qps. `value`: The threshold value.	Varies based on the default alert rule template.
`CMS_ESCALATIONS_CRITICAL_Times`	Optional	The number of consecutive times the condition must be met before CloudMonitor triggers an alert. If this parameter is omitted, the default value is used.	3
`CMS_RULE_SILENCE_SEC`	Optional	The silence period in seconds after an initial alert is reported for a continuously triggering CloudMonitor rule. This prevents alert fatigue. If this parameter is omitted, the default value is used.	900
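
The defaults in the table can be applied programmatically when generating or reviewing rule definitions. The sketch below treats `thresholds` as the list of `key`/`value` entries shown in the YAML above. The function name `resolve_thresholds` is hypothetical, and the validation only mirrors the table, not the controller's actual behavior.

```python
DEFAULTS = {
    "CMS_ESCALATIONS_CRITICAL_Times": "3",
    "CMS_RULE_SILENCE_SEC": "900",
}
REQUIRED_KEY = "CMS_ESCALATIONS_CRITICAL_Threshold"


def resolve_thresholds(thresholds):
    """Return a key -> value mapping with the documented defaults filled in."""
    resolved = {entry["key"]: entry["value"] for entry in thresholds}
    if REQUIRED_KEY not in resolved:
        # Per the table, a rule without this key fails to synchronize.
        raise ValueError(f"{REQUIRED_KEY} is required")
    for key, default in DEFAULTS.items():
        resolved.setdefault(key, default)
    return resolved


rule_thresholds = [
    {"key": "CMS_ESCALATIONS_CRITICAL_Threshold", "unit": "percent", "value": "85"},
]
print(resolve_thresholds(rule_thresholds))
```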

CLI

Edit the alert rule YAML file:

kubectl edit ackalertrules default -n kube-system

Modify the YAML file, then save and exit. See Default alert rule template for parameter details.

Alert rule YAML configuration

apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
spec:
  groups:
    # The following is a sample configuration for a cluster event alert rule.
    - name: pod-exceptions                             # The name of the alert rule group, which corresponds to the Group_Name field in the alert template.
      rules:
        - name: pod-oom                                # The name of the alert rule.
          type: event                                  # The type of the alert rule (Rule_Type). Valid values: event, metric-cms (CloudMonitor metric), and metric-prometheus (Prometheus metric).
          expression: sls.app.ack.pod.oom              # The alert rule expression. If the rule type is event, the value is the Rule_Expression_Id from the default alert rule template in this topic.
          enable: enable                               # The state of the alert rule. Valid values: enable and disable.
        - name: pod-failed
          type: event
          expression: sls.app.ack.pod.failed
          enable: enable
    # The following is a sample configuration for a cluster infrastructure resource alert rule.
    - name: res-exceptions                              # The name of the alert rule group, which corresponds to the Group_Name field in the alert template.
      rules:
        - name: node_cpu_util_high                      # The name of the alert rule.
          type: metric-cms                              # The type of the alert rule (Rule_Type). Valid values: event, metric-cms (CloudMonitor metric), and metric-prometheus (Prometheus metric).
          expression: cms.host.cpu.utilization          # The alert rule expression. If the rule type is metric-cms, the value is the Rule_Expression_Id from the default alert rule template in this topic.
          contactGroups:                                # The contact groups mapped to the alert rule. This configuration is generated by the ACK console. Contacts are consistent for a single account and can be reused across multiple clusters.
            - arms_contact_group_id_v2: 'xxx'
              cms_contact_group_name: xxx
              id: 'xxx'
          enable: enable                                # The state of the alert rule. Valid values: enable and disable.
          thresholds:                                   # The threshold of the alert rule.
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: '85'                                # CPU utilization threshold. Default value: 85%.
            - key: CMS_ESCALATIONS_CRITICAL_Times
              value: '3'                                # An alert is triggered if the threshold is exceeded 3 consecutive times.
            - key: CMS_RULE_SILENCE_SEC                 # The silence period after the first alert is reported.
              value: '900'

Parameter	Required	Description	Default
`CMS_ESCALATIONS_CRITICAL_Threshold`	Yes	The threshold for the alert rule. If this parameter is omitted, rule synchronization fails and the rule is disabled. `unit`: The unit of the threshold. Valid values: percent, count, and qps. `value`: The threshold value.	Varies based on the default alert rule template.
`CMS_ESCALATIONS_CRITICAL_Times`	Optional	The number of consecutive times the condition must be met before CloudMonitor triggers an alert. If this parameter is omitted, the default value is used.	3
`CMS_RULE_SILENCE_SEC`	Optional	The silence period in seconds after an initial alert is reported for a continuously triggering CloudMonitor rule. This prevents alert fatigue. If this parameter is omitted, the default value is used.	900
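
For scripted changes, an alternative to interactive editing is to export the resource as JSON (for example with `kubectl get ackalertrules default -n kube-system -o json`), modify it, and apply it back. The sketch below covers only the modification step on an already-parsed resource. The helper name `set_rule_state` is hypothetical, and the field layout is taken from the YAML example above.

```python
import copy


def set_rule_state(resource, group_name, rule_name, enabled):
    """Return a copy of an AckAlertRule resource with one rule enabled or disabled."""
    updated = copy.deepcopy(resource)
    for group in updated["spec"]["groups"]:
        if group["name"] != group_name:
            continue
        for rule in group["rules"]:
            if rule["name"] == rule_name:
                # The resource uses the strings "enable" and "disable", not booleans.
                rule["enable"] = "enable" if enabled else "disable"
                return updated
    raise KeyError(f"rule {rule_name!r} not found in group {group_name!r}")


resource = {
    "apiVersion": "alert.alibabacloud.com/v1beta1",
    "kind": "AckAlertRule",
    "metadata": {"name": "default"},
    "spec": {"groups": [{"name": "pod-exceptions", "rules": [
        {"name": "pod-oom", "type": "event",
         "expression": "sls.app.ack.pod.oom", "enable": "enable"},
    ]}]},
}
patched = set_rule_state(resource, "pod-exceptions", "pod-oom", enabled=False)
print(patched["spec"]["groups"][0]["rules"][0]["enable"])
```

Working on a deep copy keeps the exported resource unchanged, which makes it easy to compare the two versions before applying the result.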

Default alert rule template

The following alerts are synchronized from Simple Log Service (SLS), Managed Service for Prometheus, and CloudMonitor. On the Alerts page, in the Alerts column, click Advanced Settings to view rule configurations.

Error events

Alert	Description	Source	Rule type	ACK CR rule name	SLS event ID
Error event	Triggers on any error-level event in the cluster.	SLS	event	error-event	sls.app.ack.error

Warning events

Alert	Description	Source	Rule type	CRD rule name	Event ID
Warning event	Triggered by critical warning-level events in the cluster, excluding certain ignorable events.	SLS	event	warn-event	sls.app.ack.warn

Core component alerts (ACK managed cluster)

Alert	Description	Source	Rule type	ACK CR rule name	Rule expression ID
Cluster API server availability anomaly	Triggered when the API server has availability issues, which can limit cluster management.	Managed Service for Prometheus	metric-prometheus	apiserver-unhealthy	prom.apiserver.notHealthy.down
Cluster etcd availability anomaly	The unavailability of etcd affects the state of the entire cluster.	Managed Service for Prometheus	metric-prometheus	etcd-unhealthy	prom.etcd.notHealthy.down
Cluster kube-scheduler availability anomaly	The kube-scheduler schedules pods. If unavailable, new pods may not start.	Managed Service for Prometheus	metric-prometheus	scheduler-unhealthy	prom.scheduler.notHealthy.down
Cluster KCM availability anomaly	The kube-controller-manager (KCM) runs control loops. Its unavailability can affect the cluster's self-healing and resource adjustment mechanisms.	Managed Service for Prometheus	metric-prometheus	kcm-unhealthy	prom.kcm.notHealthy.down
Cluster cloud-controller-manager availability anomaly	The cloud-controller-manager manages the lifecycle of external cloud components. Its unavailability can affect dynamic service adjustments.	Managed Service for Prometheus	metric-prometheus	ccm-unhealthy	prom.ccm.notHealthy.down
Cluster CoreDNS availability anomaly - Request count drops to zero	CoreDNS provides cluster DNS services. Problems with CoreDNS can disrupt service discovery and domain name resolution.	Managed Service for Prometheus	metric-prometheus	coredns-unhealthy-requestdown	prom.coredns.notHealthy.requestdown
Cluster CoreDNS availability anomaly - Panic error	Triggered when CoreDNS panics. Immediate log analysis is required for diagnosis.	Managed Service for Prometheus	metric-prometheus	coredns-unhealthy-panic	prom.coredns.notHealthy.panic
High error request rate for cluster Ingress	A high HTTP request error rate from the Ingress controller can impact service accessibility.	Managed Service for Prometheus	metric-prometheus	ingress-err-request	prom.ingress.request.errorRateHigh
Cluster Ingress controller certificate is expiring soon	An expired SSL certificate will cause HTTPS requests to fail. Renew the certificate in advance.	Managed Service for Prometheus	metric-prometheus	ingress-ssl-expire	prom.ingress.ssl.expire
Total number of pending pods exceeds 1,000	A large number of pods stuck in the Pending state can indicate insufficient resources or an unsound scheduling policy.	Managed Service for Prometheus	metric-prometheus	pod-pending-accumulate	prom.pod.pending.accumulate
High RT for cluster API server mutating admission webhook	Slow responses from the mutating admission webhook can slow down resource creation and modification.	Managed Service for Prometheus	metric-prometheus	apiserver-admit-rt-high	prom.apiserver.mutating.webhook.rt.high
High RT for cluster API server validating admission webhook	Slow responses from the validating admission webhook can delay configuration changes.	Managed Service for Prometheus	metric-prometheus	apiserver-validate-rt-high	prom.apiserver.validation.webhook.rt.high
Cluster control plane component OOM	A core cluster component is out of memory (OOM). Detailed investigation is required to prevent service failure.	SLS	event	ack-controlplane-oom	sls.app.ack.controlplane.pod.oom

Alert rules for cluster node pool operations

Alert	Description	Source	Rule type	CRD rule name	Event ID
Node auto-healing fails	A node auto-healing failure requires immediate investigation to maintain high availability.	SLS	event	node-repair_failed	sls.app.ack.rc.node_repair_failed
Node CVE fix fails	A failed fix for a critical CVE can compromise cluster security. Investigate and remediate urgently.	SLS	event	nodepool-cve-fix-failed	sls.app.ack.rc.node_vulnerability_fix_failed
Node pool CVE fix succeeds	A successful CVE fix reduces security risks from known vulnerabilities.	SLS	event	nodepool-cve-fix-succ	sls.app.ack.rc.node_vulnerability_fix_succeed
Node pool CVE auto-fix skipped	An automatic fix was skipped, potentially due to compatibility issues or custom configurations. Verify that your security policy is appropriate.	SLS	event	nodepool-cve-fix-skip	sls.app.ack.rc.node_vulnerability_fix_skipped
Node pool kubelet configuration fails	A failed kubelet configuration can impact node performance and resource scheduling.	SLS	event	nodepool-kubelet-cfg-failed	sls.app.ack.rc.node_kubelet_config_failed
Node pool kubelet configuration succeeds	The new kubelet configuration was applied successfully. Verify it has taken effect as expected.	SLS	event	nodepool-kubelet-config-succ	sls.app.ack.rc.node_kubelet_config_succeed
Node pool kubelet upgrade fails	A failed kubelet upgrade can affect cluster stability and functionality. Review the upgrade process and configuration.	SLS	event	nodepool-k-c-upgrade-failed	sls.app.ack.rc.node_kubelet_config_upgrade_failed
Node pool kubelet upgrade succeeds	After a successful upgrade, verify that the new kubelet version meets cluster and application requirements.	SLS	event	nodepool-k-c-upgrade-succ	sls.app.ack.rc.kubelet_upgrade_succeed
Node pool runtime upgrade succeeds	The container runtime upgrade for the node pool was successful.	SLS	event	nodepool-runtime-upgrade-succ	sls.app.ack.rc.runtime_upgrade_succeed
Node pool runtime upgrade fails	The container runtime upgrade for the node pool failed.	SLS	event	nodepool-runtime-upgrade-fail	sls.app.ack.rc.runtime_upgrade_failed
Node pool OS image upgrade succeeds	The OS image upgrade for the node pool was successful.	SLS	event	nodepool-os-upgrade-succ	sls.app.ack.rc.os_image_upgrade_succeed
Node pool OS image upgrade fails	The OS image upgrade for the node pool failed.	SLS	event	nodepool-os-upgrade-failed	sls.app.ack.rc.os_image_upgrade_failed
Lingjun node pool configuration change succeeds	Configuration changes to the Lingjun node pool were applied successfully.	SLS	event	nodepool-lingjun-config-succ	sls.app.ack.rc.lingjun_configuration_apply_succeed
Lingjun node pool configuration change fails	Applying configuration changes to the Lingjun node pool failed.	SLS	event	nodepool-lingjun-cfg-failed	sls.app.ack.rc.lingjun_configuration_apply_failed
Node instance system exception	This alert indicates a system exception (such as an ECS system event or a Lingjun instance event) on an ACK node's underlying resource. Investigate immediately to mitigate impact from potential system or hardware failures and determine if auto-healing or change management is needed.	SLS	event	node-system-exception	sls.app.ack.nlc.out_of_band_event_observer

Cluster node anomaly alerts

Alert	Description	Source	Rule type	ACK CR rule name	SLS event ID
Docker process anomaly	The Dockerd or containerd runtime on a cluster node hangs or becomes unresponsive.	SLS	event	docker-hang	sls.app.ack.docker.hang
Eviction event	An eviction event occurs in the cluster.	SLS	event	eviction-event	sls.app.ack.eviction
GPU XID error	A GPU on a node reports an XID error, indicating an internal GPU issue.	SLS	event	gpu-xid-error	sls.app.ack.gpu.xid_error
Node goes offline	A node in the cluster goes offline.	SLS	event	node-down	sls.app.ack.node.down
Node restarts	A node in the cluster restarts.	SLS	event	node-restart	sls.app.ack.node.restart
NTP service anomaly	The NTP service on a cluster node is malfunctioning.	SLS	event	node-ntp-down	sls.app.ack.ntp.down
PLEG anomaly	The Pod Lifecycle Event Generator (PLEG) on a cluster node is malfunctioning.	SLS	event	node-pleg-error	sls.app.ack.node.pleg_error
Process anomaly	A process on a cluster node hangs or becomes unresponsive.	SLS	event	ps-hang	sls.app.ack.ps.hang
Too many file handles	A node has too many open file handles.	SLS	event	node-fd-pressure	sls.app.ack.node.fd_pressure
Too many processes	A cluster node is running too many processes.	SLS	event	node-pid-pressure	sls.app.ack.node.pid_pressure
Failed to delete a node	The cluster fails to delete a node.	SLS	event	node-del-err	sls.app.ack.ccm.del_node_failed
Failed to add a node	The cluster fails to add a node.	SLS	event	node-add-err	sls.app.ack.ccm.add_node_failed
Managed node pool command execution fails	A command fails to execute in a managed node pool.	SLS	event	nlc-run-cmd-err	sls.app.ack.nlc.run_command_fail
Empty task command in managed node pool	A task is triggered in a managed node pool without a command.	SLS	event	nlc-empty-cmd	sls.app.ack.nlc.empty_task_cmd
Unimplemented task mode in managed node pool	An unimplemented task mode is encountered in the managed node pool.	SLS	event	nlc-url-m-unimp	sls.app.ack.nlc.url_mode_unimpl
Unknown repair operation in managed node pool	An unknown repair operation is encountered in the managed node pool.	SLS	event	nlc-opt-no-found	sls.app.ack.nlc.op_not_found
Error destroying managed node pool node	An error occurs while destroying a node in the managed node pool.	SLS	event	nlc-des-node-err	sls.app.ack.nlc.destroy_node_fail
Failed to drain managed node pool node	A node in a managed node pool fails to drain correctly.	SLS	event	nlc-drain-node-err	sls.app.ack.nlc.drain_node_fail
Restarted ECS instance not reaching desired state	A restarted ECS instance in a managed node pool does not reach its desired state.	SLS	event	nlc-restart-ecs-wait	sls.app.ack.nlc.restart_ecs_wait_fail
Failed to restart ECS instance in managed node pool	An ECS instance in a managed node pool fails to restart.	SLS	event	nlc-restart-ecs-err	sls.app.ack.nlc.restart_ecs_fail
Failed to reset ECS instance in managed node pool	An ECS instance in a managed node pool fails to reset.	SLS	event	nlc-reset-ecs-err	sls.app.ack.nlc.reset_ecs_fail
Self-healing task fails in managed node pool	A self-healing task in a managed node pool fails.	SLS	event	nlc-sel-repair-err	sls.app.ack.nlc.repair_fail

Alert rules for cluster resource anomalies

Alert	Description	Source	Rule type	Rule name	Rule expression ID
Node: CPU utilization ≥ 85%	Triggered when the CPU utilization of a node exceeds the threshold (default: 85%). When remaining resources fall below 15%, utilization may exceed the CPU reserved for the container engine, causing frequent CPU throttling and severely impacting process responsiveness. Optimize CPU usage or adjust the threshold promptly. To adjust the threshold, see Configure alert rules.	CloudMonitor	metric-cms	node_cpu_util_high	cms.host.cpu.utilization
Node: Memory utilization ≥ 85%	Triggered when the memory utilization of a node exceeds the threshold (default: 85%). With remaining resources below 15%, utilization can exceed the memory reserved for the container engine, triggering kubelet forced eviction. Optimize memory usage or adjust the threshold promptly. To adjust the threshold, see Configure alert rules.	CloudMonitor	metric-cms	node_mem_util_high	cms.host.memory.utilization
Node: Disk utilization ≥ 85%	Triggered when the disk utilization of a node exceeds the threshold (default: 85%). To adjust the threshold, see Configure alert rules.	CloudMonitor	metric-cms	node_disk_util_high	cms.host.disk.utilization
Node: Outbound Internet bandwidth utilization ≥ 85%	Triggered when the outbound Internet bandwidth utilization of a node exceeds the threshold (default: 85%). To adjust the threshold, see Configure alert rules.	CloudMonitor	metric-cms	node_public_net_util_high	cms.host.public.network.utilization
Node: inode utilization ≥ 85%	Triggered when the inode utilization of a node exceeds the threshold (default: 85%). To adjust the threshold, see Configure alert rules.	CloudMonitor	metric-cms	node_fs_inode_util_high	cms.host.fs.inode.utilization
SLB: Layer 7 QPS utilization ≥ 85%	Triggered when the Layer 7 QPS utilization of an SLB instance exceeds the threshold (default: 85%). Note: The SLB instance is associated with the API server or Ingress. To adjust the threshold, see Configure alert rules.	CloudMonitor	metric-cms	slb_qps_util_high	cms.slb.qps.utilization
SLB: Outbound bandwidth utilization ≥ 85%	Triggered when the outbound bandwidth utilization of an SLB instance exceeds the threshold (default: 85%). Note: The SLB instance is associated with the API server or Ingress. To adjust the threshold, see Configure alert rules.	CloudMonitor	metric-cms	slb_traff_tx_util_high	cms.slb.traffic.tx.utilization
SLB: Maximum connection utilization ≥ 85%	Triggered when the maximum connection utilization of an SLB instance exceeds the threshold (default: 85%). Note: The SLB instance is associated with the API server or Ingress. To adjust the threshold, see Configure alert rules.	CloudMonitor	metric-cms	slb_max_con_util_high	cms.slb.max.connection.utilization
SLB: Dropped connections per second for listener ≥ 1	Triggered when the number of dropped connections per second for an SLB listener continuously exceeds the threshold (default: 1). Note: The SLB instance is associated with the API server or Ingress. To adjust the threshold, see Configure alert rules.	CloudMonitor	metric-cms	slb_drop_con_high	cms.slb.drop.connection
Node: Insufficient disk space	A node has insufficient disk space.	SLS	event	node-disk-pressure	sls.app.ack.node.disk_pressure
Node: Insufficient scheduling resources	A node has insufficient resources for scheduling.	SLS	event	node-res-insufficient	sls.app.ack.resource.insufficient
Node: Insufficient IP resources	A node has insufficient IP resources.	SLS	event	node-ip-pressure	sls.app.ack.ip.not_enough
Cluster: Disk usage exceeds threshold	The cluster's disk usage has exceeded the threshold.	SLS	event	disk_space_press	sls.app.ack.csi.no_enough_disk_space

Alerts for ACK control plane operations

Alert	Description	Source	Rule type	CRD rule name	Event ID
ACK cluster task notification	Scheduled operations and changes in the cluster.	SLS	event	ack-system-event-info	sls.app.ack.system_events.task.info
ACK cluster task failure	A cluster operation has failed. Investigate promptly.	SLS	event	ack-system-event-error	sls.app.ack.system_events.task.error

Auto scaling alerts

Alert	Description	Source	Rule type	ACK CR rule name	SLS event ID
Auto scale-out	Adds nodes automatically when load increases.	SLS	event	autoscaler-scaleup	sls.app.ack.autoscaler.scaleup_group
Auto scale-in	Removes nodes automatically when load decreases.	SLS	event	autoscaler-scaledown	sls.app.ack.autoscaler.scaledown
Scale-out timeout	Scale-out timed out, possibly due to insufficient resources or an incorrect scaling policy.	SLS	event	autoscaler-scaleup-timeout	sls.app.ack.autoscaler.scaleup_timeout
Scale-in of empty node	Removes inactive or empty nodes to reclaim resources.	SLS	event	autoscaler-scaledown-empty	sls.app.ack.autoscaler.scaledown_empty
Scale-out failed	A scale-out operation failed. Investigate and adjust the resource policy.	SLS	event	autoscaler-up-group-failed	sls.app.ack.autoscaler.scaleup_group_failed
Cluster unhealthy	The autoscaler paused scaling because the cluster is unhealthy.	SLS	event	autoscaler-cluster-unhealthy	sls.app.ack.autoscaler.cluster_unhealthy
Deletion of unstarted node	Deletes nodes that fail to start within a specified period.	SLS	event	autoscaler-del-started	sls.app.ack.autoscaler.delete_started_timeout
Deletion of unregistered node	Deletes nodes that fail to register with the cluster within a specified period.	SLS	event	autoscaler-del-unregistered	sls.app.ack.autoscaler.delete_unregistered
Scale-in failed	A scale-in operation failed, which may waste resources and cause uneven load distribution.	SLS	event	autoscaler-scale-down-failed	sls.app.ack.autoscaler.scaledown_failed
Incomplete node drain	The autoscaler removed a node before evicting all pods. This may cause service interruptions.	SLS	event	autoscaler-instance-expired	sls.app.ack.autoscaler.instance_expired

Cluster application workload alerts

Alert	Description	Source	Rule type	CRD rule name	Rule expression ID
Job failed	A job failed during execution.	Managed Service for Prometheus	metric-prometheus	job-failed	prom.job.failed
Insufficient deployment replicas	A deployment has fewer available replicas than desired, which can cause service degradation.	Managed Service for Prometheus	metric-prometheus	deployment-rep-err	prom.deployment.replicaError
Abnormal DaemonSet replica status	A DaemonSet replica entered an abnormal state, such as failing to start or crashing. This can disrupt services on the node.	Managed Service for Prometheus	metric-prometheus	daemonset-status-err	prom.daemonset.scheduledError
DaemonSet scheduling anomaly	A DaemonSet failed to schedule pods on all eligible nodes, possibly due to resource constraints or node taints.	Managed Service for Prometheus	metric-prometheus	daemonset-misscheduled	prom.daemonset.misscheduled

Pod anomaly alert rules

Alert	Description	Source	Rule type	ACK CR rule name	Rule expression ID
Pod OOM	A pod or process experienced an out-of-memory (OOM) error.	SLS	event	pod-oom	sls.app.ack.pod.oom
Pod startup failure	A pod failed to start.	SLS	event	pod-failed	sls.app.ack.pod.failed
Unhealthy pod status	A pod is in an unhealthy state, such as `Pending`, `Failed`, or `Unknown`.	Managed Service for Prometheus	metric-prometheus	pod-status-err	prom.pod.status.notHealthy
Pod CrashLoopBackOff	A pod repeatedly failed to start and entered `CrashLoopBackOff` or other failure states.	Managed Service for Prometheus	metric-prometheus	pod-crashloop	prom.pod.status.crashLooping

Cluster storage anomaly alert rules

Alert	Description	Source	Rule type	CRD rule name	Event ID
Disk capacity < 20 GiB	Disks smaller than 20 GiB cannot be mounted due to a cluster storage limitation.	SLS	event	csi_invalid_size	sls.app.ack.csi.invalid_disk_size
Subscription disk not supported for container volumes	Subscription disks cannot be used as container volumes.	SLS	event	csi_not_portable	sls.app.ack.csi.disk_not_portable
Failed to unmount mount target (device busy)	The resource is not fully released, or an active process is accessing the mount target.	SLS	event	csi_device_busy	sls.app.ack.csi.device_busy
No available disks	A storage mount failed because no available disks were found.	SLS	event	csi_no_ava_disk	sls.app.ack.csi.no_ava_disk
Disk IOHang	A disk in the cluster is unresponsive due to an I/O hang.	SLS	event	csi_disk_iohang	sls.app.ack.csi.disk_iohang
Slow I/O on PVC-bound disk	A disk bound to a PersistentVolumeClaim (PVC) is experiencing high I/O latency.	SLS	event	csi_latency_high	sls.app.ack.csi.latency_too_high
PersistentVolume in Failed state	A PersistentVolume (PV) entered a Failed state, which may prevent pods from accessing storage.	Prometheus	metric-prometheus	pv-failed	prom.pv.failed

Alert rules for cluster network anomalies

Alert	Description	Source	Rule type	CRD rule name	Event ID
Multiple VPC route tables	Multiple route tables in a VPC can cause routing conflicts. Optimize your network structure.	SLS	event	ccm-vpc-multi-route-err	sls.app.ack.ccm.describe_route_tables_failed
No available SLB instances	The cluster cannot create an SLB instance, possibly due to quota limits or insufficient permissions.	SLS	event	slb-no-ava	sls.app.ack.ccm.no_ava_slb
SLB sync failure	The cluster failed to synchronize an SLB instance configuration, which can cause outdated routing or backend settings.	SLS	event	slb-sync-err	sls.app.ack.ccm.sync_slb_failed
SLB deletion failure	The cluster failed to delete an SLB instance, potentially leaving orphaned resources.	SLS	event	slb-del-err	sls.app.ack.ccm.del_slb_failed
Route creation failure	The cluster failed to create a required route in the VPC route table.	SLS	event	route-create-err	sls.app.ack.ccm.create_route_failed
Route sync failure	The cluster failed to synchronize a route in the VPC route table.	SLS	event	route-sync-err	sls.app.ack.ccm.sync_route_failed
Invalid Terway resource	An invalid network resource was detected in the Terway CNI plugin.	SLS	event	terway-invalid-res	sls.app.ack.terway.invalid_resource
Terway IP allocation failure	Terway failed to allocate an IP address for a pod, often due to IP exhaustion in the configured subnets.	SLS	event	terway-alloc-ip-err	sls.app.ack.terway.alloc_ip_fail
Ingress bandwidth config parse failure	A parsing error occurred in the cluster ingress bandwidth annotation.	SLS	event	terway-parse-err	sls.app.ack.terway.parse_fail
Terway resource allocation failure	The Terway CNI plugin failed to allocate a required network resource.	SLS	event	terway-alloc-res-err	sls.app.ack.terway.allocate_failure
Terway resource reclaim failure	Terway failed to reclaim a network resource after a pod was deleted.	SLS	event	terway-dispose-err	sls.app.ack.terway.dispose_failure
Terway virtual mode change	The Terway network virtual mode changed.	SLS	event	terway-virt-mod-err	sls.app.ack.terway.virtual_mode_change
Terway pod IP config check	Terway initiated a pod IP configuration check.	SLS	event	terway-ip-check	sls.app.ack.terway.config_check
Ingress reload failure	The ingress controller failed to reload its configuration. Check for syntax errors.	SLS	event	ingress-reload-err	sls.app.ack.ingress.err_reload_nginx

Alert rules for key cluster audit operations

Alert	Description	Source	Rule type	CRD rule name	SLS event ID
Container exec or command execution	A user executed a command in or logged into a container. Audit to track maintenance and potential security threats.	SLS	event	audit-at-command	sls.app.k8s.audit.at.command
Node schedulability status change	A node was cordoned or uncordoned, which changes whether new pods can be scheduled to it and can shift load to other nodes.	SLS	event	audit-cordon-switch	sls.app.k8s.audit.at.cordon.uncordon
Resource deletion	A cluster resource was deleted. Audit to investigate planned changes or malicious activity.	SLS	event	audit-resource-delete	sls.app.k8s.audit.at.delete
Node drain or eviction	A node was drained or a pod was evicted, indicating resource pressure or planned maintenance.	SLS	event	audit-drain-eviction	sls.app.k8s.audit.at.drain.eviction
Public network logon	A logon attempt from the public network was detected. Review access permissions and monitor for unauthorized activity.	SLS	event	audit-internet-login	sls.app.k8s.audit.at.internet.login
Node label update	A node label was updated. Label changes affect workload scheduling — verify correctness to prevent unintended behavior.	SLS	event	audit-node-label-update	sls.app.k8s.audit.at.label
Node taint update	A node taint was added, removed, or updated. Taint changes can repel pods without matching tolerations, disrupting workloads.	SLS	event	audit-node-taint-update	sls.app.k8s.audit.at.taint
Resource modification	A cluster resource was modified. Frequent or unexpected changes may indicate configuration drift.	SLS	event	audit-resource-update	sls.app.k8s.audit.at.update

Cluster security alert rules

Alert	Description	Source	Rule type	CRD rule name	Event ID
High-risk configuration found	A security inspection found a high-risk configuration in the cluster.	SLS	event	si-c-a-risk	sls.app.ack.si.config_audit_high_risk

Alert rules for cluster inspection anomalies

Alert	Description	Source	Rule type	CRD rule name	SLS event ID
Cluster inspection anomaly	The automated inspection detected a potential anomaly. Review the issue and adjust maintenance as needed.	SLS	event	cis-sched-failed	sls.app.ack.cis.schedule_task_failed

Troubleshoot alerts

Pod eviction due to disk pressure

Alert message

(combined from similar events): Failed to garbage collect required amount of images. Attempted to free XXXX bytes, but only found 0 bytes eligible to free

Symptoms

The Pod status is Evicted, and the node reports disk pressure (The node had condition: [DiskPressure]).
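The symptoms above can also be checked from the command line. The following is a minimal sketch, assuming `kubectl` access to the cluster. The function names `evicted_pods` and `disk_pressure_status` are illustrative and not part of ACK.

```shell
# List Pods whose STATUS column reads "Evicted", across all namespaces.
# With -A and --no-headers the columns are:
# NAMESPACE NAME READY STATUS RESTARTS AGE
evicted_pods() {
  kubectl get pods -A --no-headers | awk '$4 == "Evicted" { print $1 "/" $2 }'
}

# Print the DiskPressure condition status (True, False, or Unknown) of a node.
# Usage: disk_pressure_status <node-name>
disk_pressure_status() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
}
```

A node that prints `True` for `disk_pressure_status` is the one to inspect in the steps that follow.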

Cause

When a node's disk usage reaches the eviction threshold (85% by default), kubelet starts pressure-based eviction: it runs image garbage collection and may evict Pods. To check disk usage, log on to the target node and run `df -h`.
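To compare each filesystem against the 85% default mentioned above, the `df` output can be filtered. This is a rough sketch: the function name `fs_over_threshold` is hypothetical, and it reads `df -P` text from standard input so that the parsing can be checked without touching a live node. Mount points that contain spaces are not handled.

```shell
# Print mount points whose usage is at or above a threshold (default 85).
# Expects `df -P` output on stdin, whose columns are:
# Filesystem 1024-blocks Used Available Capacity Mounted-on
fs_over_threshold() {
  awk -v limit="${1:-85}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                 # strip the percent sign
    if (use + 0 >= limit + 0) print $6 " " use "%"
  }'
}
```

On a node, run `df -P | fs_over_threshold 85`. An empty result means no filesystem has reached the threshold.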

Resolution

If the node uses the containerd runtime, log on to the node and run the following command to remove unused images:
```
crictl rmi --prune
```
Clean up logs or expand the node's disk capacity.
- Create a snapshot of the node's disk, then delete unnecessary files. See Resolve full disk issues on Linux instances.
- Expand a node's system disk or data disk online to increase storage capacity.
Adjust the relevant thresholds.
- Adjust the kubelet's image garbage collection threshold to reduce Pod evictions. See Customize kubelet configurations for a node pool.
- The default alert threshold is 85%. To change it, modify the node_disk_util_high alert rule when you configure alert rules.
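Before deleting files, it helps to see which directories use the most space. The sketch below is one way to do that with standard tools. The function name `largest_entries` and the default path `/var/log` are assumptions for illustration, so point it at whichever directory you suspect.

```shell
# Print the N largest entries directly under a directory, sizes in KiB.
# -s summarizes each entry, -k reports KiB, -x stays on one filesystem.
# Usage: largest_entries [directory] [count]
largest_entries() {
  du -skx "${1:-/var/log}"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}
```

For example, `largest_entries /var/log 5` lists the five largest entries under `/var/log`. Create the disk snapshot first, as described above, before removing anything.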

Recommendations and prevention

If a node frequently triggers this alert, evaluate your application's storage requirements and plan node disk capacity accordingly.
Regularly monitor storage usage through the Node storage monitoring dashboard.

Pod OOMKilling

Alert message

pod was OOM killed. node:xxx pod:xxx namespace:xxx uuid:xxx

Symptoms

The Pod status is abnormal, and the event details contain PodOOMKilling.
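To look for these events from the command line, one option is to filter cluster events by reason. This sketch assumes that `PodOOMKilling` appears in the event's reason field and that `kubectl` can reach the cluster. The function name `oom_events` is illustrative.

```shell
# List events whose reason is PodOOMKilling in every namespace,
# sorted by the time each event was last seen.
oom_events() {
  kubectl get events -A --field-selector reason=PodOOMKilling \
    --sort-by=.lastTimestamp
}
```

If nothing is returned, the events may have expired from the cluster, in which case the alert history is the better source.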

Resolution

An out-of-memory (OOM) event can be triggered at either the node level or the container cgroup level.

Cause:
- Container cgroup-level OOM: The Pod's memory usage exceeds its configured limits, causing Kubernetes to terminate it.
- Node-level OOM: Typically occurs when too many Pods without resource limits run on the node, or non-Kubernetes processes consume excessive memory.
Diagnosis: Run `dmesg -T | grep -i "memory"` on the target node. If the output contains `out_of_memory`, an OOM event occurred. If it also includes `Memory cgroup`, the OOM is at the container cgroup level; otherwise, it is at the node level.
Recommended actions:
- For container cgroup-level OOM:
  - Increase the Pod's memory limit. Keep actual usage below 80% of the limit. See Manage Pods and Scale node resources.
  - Enable resource profiling for recommended container request and limit configurations.
- For node-level OOM:
  - Increase node memory or distribute workloads across more nodes. See Scale node resources and Schedule applications to specific nodes.
  - Identify high-memory Pods on the node and set appropriate memory limits.
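The diagnosis rule above can be written as a small filter. This is a sketch under the document's own criteria (`out_of_memory` and `Memory cgroup` in the kernel log), not an official tool. The function name `classify_oom` is hypothetical.

```shell
# Classify OOM evidence read from stdin, for example `dmesg -T` output.
# Prints "none", "cgroup", or "node", following the rule described above.
classify_oom() {
  # Keep only memory-related lines, as `grep -i "memory"` does in the text.
  log="$(grep -i 'memory' || true)"
  if ! printf '%s\n' "$log" | grep -q 'out_of_memory'; then
    echo "none"      # no OOM evidence in the log
  elif printf '%s\n' "$log" | grep -q 'Memory cgroup'; then
    echo "cgroup"    # a container exceeded its memory limit
  else
    echo "node"      # the node itself ran out of memory
  fi
}
```

On the target node, run `dmesg -T | classify_oom`. A result of `cgroup` points to the container-level actions, and `node` points to the node-level actions.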

See Causes and solutions for OOM killer events.

Pod in CrashLoopBackOff

When a container process in a Pod exits unexpectedly, ACK restarts the container. If the container keeps exiting and never stabilizes, the Pod enters CrashLoopBackOff. Troubleshoot as follows:

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Pods.
Find the affected Pod and click Details in the Actions column.
On the Events tab, review the details of any abnormal events.
Check the Logs tab for the cause of the failure.

Note
If the Pod has restarted, select Show the log of the last container exit.

The console shows only the 500 most recent entries. For older logs, configure log persistence.
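The console steps above have rough command-line equivalents. The following sketch assumes `kubectl` access to the cluster. The function name `crashloop_triage` is illustrative, and the 500-line tail simply mirrors the console limit.

```shell
# Usage: crashloop_triage <namespace> <pod>
crashloop_triage() {
  ns="$1"
  pod="$2"
  # Restart count per container, to confirm that the Pod is cycling.
  kubectl -n "$ns" get pod "$pod" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
  # Pod details, including recent events (similar to the Events tab).
  kubectl -n "$ns" describe pod "$pod"
  # Logs from the previous container instance, limited to the last 500 lines.
  kubectl -n "$ns" logs "$pod" --previous --tail=500
}
```

For a Pod with more than one container, add `-c <container>` to the `kubectl logs` command to select the container that is restarting.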