ACK alert management lets you centrally manage container alerts, including those for anomalous events, key metrics of cluster resources, and core components and applications. You can also modify the default alert rules in your cluster using a CustomResourceDefinition (CRD) to promptly detect anomalies.
Billing
The alerting function pulls data from Log Service, Managed Service for Prometheus, and Cloud Monitor. After an alert is triggered, SMS or phone call notifications incur additional charges. Before enabling the alerting function, review the default alert rule template to identify the source of each alert item and enable the required services.
| Data source | Requirements | Details |
|---|---|---|
| Log Service | Requires event monitoring, which is enabled by default when you enable the alerting function. | |
| Managed Service for Prometheus | Follow the instructions in Access and configure Alibaba Cloud Prometheus monitoring for your cluster. | |
| Cloud Monitor | Enable Cloud Monitor for your Container Service for Kubernetes cluster. | |
Enable alert management
After you enable alert management, you can set metric-based alerts for specific resources in your cluster and automatically receive alert notifications when exceptions occur. This helps you efficiently manage and maintain your cluster and ensure service stability. For more information about alerts for related resources, see Default alert rule templates.
ACK managed cluster
Enable alert configuration for an existing cluster or when you create a new one.
Existing cluster
- Log on to the ACK console. In the left navigation pane, click Clusters.
- On the Clusters page, click the name of your cluster. In the left navigation pane, click .
- On the Alerts page, follow the on-screen instructions to install or upgrade the component.
- After the installation or upgrade is complete, go to the Alerts page to configure alerts.
The Alerts page contains the following tabs.
Alert Rules
- Status: Enable or disable a target alert rule set.
- Edit notification object: Set the contact group for alert notifications. Before you proceed, you must create contacts and contact groups, and add the contacts to the groups. Notification objects support only contact groups. To send notifications to an individual, create a group that contains only that contact and select the group.
Alert History
View the 100 most recent alert records sent within the last 24 hours.
- Click a link in the Alert Rule column to go to the corresponding monitoring system and view the detailed rule configuration.
- Click Details to navigate to the page of the affected resource, such as a resource with an abnormal event or metric anomalies.
- Click Intelligent Analytics to use the AI assistant to analyze the issue and receive resolution guidance.
Alert Contacts
Create, edit, or delete contacts. Contact methods:
- Phone/SMS: After you set a mobile number for a contact, they can receive alert notifications by phone call or text message. Only verified mobile numbers can receive phone call notifications. To verify a mobile number, see Verify a mobile number.
- Email: After you set an email address for a contact, they can receive alert notifications by email.
- Chatbot: DingTalk chatbot, WeCom chatbot, and Lark chatbot. For a DingTalk chatbot, you must add the following security keywords: Alert, Assign.
Before you configure email and chatbot settings, you can verify them in the Cloud Monitor console to ensure that you can receive alert notifications.
Alert Contact Groups
Create, edit, or delete contact groups. Only contact groups can be selected when you configure the Edit notification object setting.
If no contact group exists, the console automatically creates a default contact group based on your Alibaba Cloud account information.
New cluster
On the Component Configurations page of the cluster creation wizard, next to Alerts, select Use Default Alert Rule Template, and then select an Alert notification contact group. For more information, see Create an ACK managed cluster.
If you enable alert configuration when you create a cluster, the system enables the default alert rules and sends alert notifications to the default contact group. You can also modify the alert contacts or contact groups.
ACK dedicated cluster
For an ACK dedicated cluster, you must first grant permissions to the Worker RAM role and then enable the default alert rules.
Authorize the Worker RAM role
- Log on to the ACK console. In the left navigation pane, click Clusters.
- On the Clusters page, click the name of your cluster. In the left navigation pane, click Cluster Information.
- On the Cluster Information page, in the Cluster Resources section, copy the name next to Worker RAM Role. Then, click the link to go to the Resource Access Management (RAM) console to grant permissions to the Worker RAM role.
  - Create a custom policy with the following content. For more information, see Create a custom policy by using the script editor.
    {
      "Action": [
        "log:*",
        "arms:*",
        "cms:*",
        "cs:UpdateContactGroup"
      ],
      "Resource": [
        "*"
      ],
      "Effect": "Allow"
    }
  - On the Role page, find the Worker RAM role and attach the custom policy that you created. For more information, see Manage permissions for a RAM role.
  Note: To simplify the procedure, this topic grants broad permissions. In a production environment, we recommend that you follow the principle of least privilege and grant only the required permissions.
- Check the logs to verify that the required permissions are configured.
  - In the left-side navigation pane of the cluster details page, choose .
  - Set the Namespace to kube-system, and then click the Name of alicloud-monitor-controller in the list.
  - Click the Logs tab. The pod logs indicate successful authorization.
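The same log check can be run from a terminal. The sketch below assumes that alicloud-monitor-controller runs as a Deployment in kube-system and that kubectl is already configured for the cluster. The function name and the KUBECTL override variable are hypothetical helpers introduced for this example, not part of ACK.

```shell
# Print recent log lines from the alert controller so you can confirm
# that authorization succeeded.
# Assumptions: alicloud-monitor-controller is a Deployment in kube-system;
# KUBECTL is a hypothetical override for the kubectl binary (useful for dry runs).
check_alert_controller_logs() {
  "${KUBECTL:-kubectl}" logs deployment/alicloud-monitor-controller \
    -n kube-system --tail="${1:-100}"
}
```

For example, `check_alert_controller_logs 200` would show the last 200 lines. What a successful authorization looks like in the output depends on the controller version, so read the lines rather than searching for a fixed string.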
Enable default alert rules
- In the left-side navigation pane of the target cluster page, choose Operations > Alerts.
- On the Alerts page, configure the alert settings.
The Alerts page contains the following tabs.
Alert Rules
- Status: Enable or disable a target alert rule set.
- Edit notification object: Set the contact group for alert notifications. Before you proceed, you must create contacts and contact groups, and add the contacts to the groups. Notification objects support only contact groups. To send notifications to an individual, create a group that contains only that contact and select the group.
Alert History
View the 100 most recent alert records sent within the last 24 hours.
- Click a link in the Alert Rule column to go to the corresponding monitoring system and view the detailed rule configuration.
- Click Details to navigate to the page of the affected resource, such as a resource with an abnormal event or metric anomalies.
- Click Intelligent Analytics to use the AI assistant to analyze the issue and receive resolution guidance.
Alert Contacts
Create, edit, or delete contacts. Contact methods:
- Phone/SMS: After you set a mobile number for a contact, they can receive alert notifications by phone call or text message. Only verified mobile numbers can receive phone call notifications. To verify a mobile number, see Verify a mobile number.
- Email: After you set an email address for a contact, they can receive alert notifications by email.
- Chatbot: DingTalk chatbot, WeCom chatbot, and Lark chatbot. For a DingTalk chatbot, you must add the following security keywords: Alert, Assign.
Before you configure email and chatbot settings, you can verify them in the Cloud Monitor console to ensure that you can receive alert notifications.
Alert Contact Groups
Create, edit, or delete contact groups. Only contact groups can be selected when you configure the Edit notification object setting.
If no contact group exists, the console automatically creates a default contact group based on your Alibaba Cloud account information.
Configure alert rules
After you enable the alerting feature, the system automatically creates an AckAlertRule resource named default in the kube-system namespace. This resource contains the default alert rule template. You can modify this resource to customize the alert rules for Container Service for Kubernetes (ACK).
Console
- Log on to the ACK console. In the left navigation pane, click Clusters.
- On the Clusters page, click the name of your cluster. In the left navigation pane, click .
- On the Alert Rules tab, click Configure Alert Rule in the upper-right corner. Then, for the target rule, click YAML in the Actions column to view the AckAlertRule resource configuration for the current cluster.
- Modify the YAML file as needed. For parameter details, see the default alert rule template.
Use rules.thresholds to customize alert thresholds. The parameters are described in the following table. For example, if CMS_ESCALATIONS_CRITICAL_Threshold is 85 percent, CMS_ESCALATIONS_CRITICAL_Times is 3, and CMS_RULE_SILENCE_SEC is 900 for the node CPU utilization rule, an alert notification is triggered when the cluster node CPU utilization exceeds 85%, the threshold is exceeded three consecutive times, and more than 900 seconds have passed since the last alert.
| Parameter | Required | Description | Default |
|---|---|---|---|
| CMS_ESCALATIONS_CRITICAL_Threshold | Yes | The threshold for the alert rule. If you do not configure this parameter, the rule fails to synchronize and becomes disabled. unit: the unit of the threshold (valid values include percent, count, and qps). value: the threshold value. | Depends on the default alert rule template. |
| CMS_ESCALATIONS_CRITICAL_Times | No | The number of consecutive times the threshold must be exceeded to trigger an alert. If you do not configure this parameter, the system uses the default value. | 3 |
| CMS_RULE_SILENCE_SEC | No | The silence period in seconds after an alert is first sent. This prevents frequent notifications for a persistent issue. If you do not configure this parameter, the system uses the default value. | 900 |
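How the three parameters interact can be modeled in a few lines of shell. This is only an illustration of the behavior described in the table, not how Cloud Monitor actually evaluates rules. The function name and argument order are made up for this example.

```shell
# should_alert VALUE THRESHOLD BREACHES SECS_SINCE_LAST [TIMES] [SILENCE_SEC]
# Prints "alert" when the value exceeds the threshold, the threshold has been
# exceeded at least TIMES consecutive times, and more than SILENCE_SEC seconds
# have passed since the last alert. Prints "quiet" otherwise.
# TIMES and SILENCE_SEC default to 3 and 900, matching the table defaults.
should_alert() {
  value=$1; threshold=$2; breaches=$3; since_last=$4
  times=${5:-3}; silence=${6:-900}
  if [ "$value" -gt "$threshold" ] &&
     [ "$breaches" -ge "$times" ] &&
     [ "$since_last" -gt "$silence" ]; then
    echo alert
  else
    echo quiet
  fi
}
```

With the defaults, `should_alert 90 85 3 1000` reports an alert, while the same reading only 600 seconds after the previous alert stays quiet because it falls inside the silence period.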
CLI
- Run the following command to edit the alert rule YAML file.
  kubectl edit ackalertrules default -n kube-system
- Modify the YAML file as needed. For parameter details, see the default alert rule template. After you finish editing, save the file and exit.
Use rules.thresholds to customize alert thresholds. The parameters are described in the following table. For example, if CMS_ESCALATIONS_CRITICAL_Threshold is 85 percent, CMS_ESCALATIONS_CRITICAL_Times is 3, and CMS_RULE_SILENCE_SEC is 900 for the node CPU utilization rule, an alert notification is triggered when the CPU utilization of a cluster node exceeds 85%, the threshold is exceeded three consecutive times, and more than 900 seconds have passed since the last alert.
| Parameter | Required | Description | Default |
|---|---|---|---|
| CMS_ESCALATIONS_CRITICAL_Threshold | Yes | The threshold for the alert rule. If you do not configure this parameter, the rule fails to synchronize and becomes disabled. unit: the unit of the threshold (valid values include percent, count, and qps). value: the threshold value. | Depends on the default alert rule template. |
| CMS_ESCALATIONS_CRITICAL_Times | No | The number of consecutive times the threshold must be exceeded to trigger an alert. If you do not configure this parameter, the system uses the default value. | 3 |
| CMS_RULE_SILENCE_SEC | No | The silence period in seconds after an alert is first sent. This prevents frequent notifications for a persistent issue. If you do not configure this parameter, the system uses the default value. | 900 |
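As a rough illustration, the thresholds for one rule might look like the fragment that the following helper prints. The key, unit, and value field layout is an assumption based on the parameter table, and the function name is hypothetical. Compare the output with the YAML that kubectl edit opens before copying anything into the resource.

```shell
# print_thresholds [THRESHOLD] [TIMES] [SILENCE_SEC]
# Prints a thresholds fragment for a single rule.
# Assumption: each entry under rules.thresholds is a mapping with key, unit,
# and value fields; the real schema is whatever the AckAlertRule resource in
# your cluster shows.
print_thresholds() {
  cat <<EOF
thresholds:
  - key: CMS_ESCALATIONS_CRITICAL_Threshold
    unit: percent
    value: '${1:-85}'
  - key: CMS_ESCALATIONS_CRITICAL_Times
    value: '${2:-3}'
  - key: CMS_RULE_SILENCE_SEC
    value: '${3:-900}'
EOF
}
```

Running `print_thresholds 90 5 600` prints a stricter variant. The indentation is relative: in the real resource the block sits under the rule it belongs to.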
Default alert rules
Alerts are synchronized from Log Service, Managed Service for Prometheus, and Cloud Monitor. On the Alerts page, under the Alerts column for each alert, you can view its rule configuration in Advanced Settings.
Troubleshooting alerts
Pod eviction triggered by node disk pressure
Alert message
(combined from similar events): Failed to garbage collect required amount of images. Attempted to free XXXX bytes, but only found 0 bytes eligible to free
Symptoms
The Pod's status is Evicted, and the node reports a disk pressure condition: The node had condition: [DiskPressure].
Cause
When disk usage on a node reaches the eviction threshold (85% by default), the kubelet initiates pressure-based eviction. The kubelet attempts to free space, primarily through image garbage collection. If this fails to reclaim enough space, the kubelet evicts the Pod to relieve the disk pressure. You can log on to the target node and run the df -h command to check the disk usage.
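To see which file systems are at or above the threshold, the df output can be filtered. This is a minimal sketch, assuming the POSIX df -P column order (use percentage in column 5, mount point in column 6) and mount points without spaces. The function name is hypothetical.

```shell
# over_threshold [LIMIT]
# Reads `df -P` output on stdin and prints the mount point and use percentage
# of every file system whose usage is at or above LIMIT (default 85).
over_threshold() {
  awk -v limit="${1:-85}" '
    NR > 1 {                      # skip the header line
      use = $5
      sub(/%/, "", use)           # "91%" -> "91"
      if (use + 0 >= limit + 0) print $6, use "%"
    }'
}
```

On the node, run `df -P | over_threshold 85`. An empty result means that no file system has reached the limit.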
Solution
- Log on to the target node (in a containerd runtime environment) and run the following command to prune unused container images and free up disk space.
  crictl rmi --prune
- Clean up logs or scale up the node's disk.
  - Create a snapshot of the node's data disk or cloud disk. After creating the snapshot, delete any unnecessary files or directories. For more information, see Resolve insufficient disk space on a Linux instance.
  - Scale up the system disk or data disk of the target node online. For more information, see Scale up the system disk or data disk of a node.
- Adjust the relevant thresholds.
  - Adjust the kubelet's image garbage collection thresholds to reduce Pod evictions caused by high disk usage. For more information, see Customize kubelet configurations for a node pool.
  - You will receive an alert when the node disk usage reaches or exceeds 85%. Based on your business requirements, you can modify the alert threshold for the node_disk_util_high rule in the YAML configuration. For more information, see Configure alert rules.
Recommendations and Precautions
- For nodes that frequently experience this issue, assess your application's storage needs and plan your resource requests and node disk capacity accordingly.
- Regularly monitor storage usage to promptly identify and mitigate potential risks. For more information, see the node storage monitoring dashboard.
Pod OOMKilling
Alert message
pod was OOM killed. node:xxx pod:xxx namespace:xxx uuid:xxx
Symptoms
The Pod status is abnormal, and PodOOMKilling appears in the event content.
Solution
An Out of Memory (OOM) error can be triggered at the node level or the container cgroup level.
- Cause:
  - Container cgroup-level OOM: A container's actual memory usage exceeds its configured memory limits, so the kernel terminates a process in that container's cgroup.
  - Node-level OOM: This occurs when a node's memory is exhausted. Common causes include running too many Pods without resource limits (requests/limits) or high memory consumption by non-Kubernetes processes.
- To determine if an OOM event has occurred, log on to the target node and run the dmesg -T | grep -i "memory" command. An OOM event has occurred if the output contains a message such as out_of_memory. The event is a container cgroup-level OOM if the output also contains Memory cgroup, or a node-level OOM otherwise.
- Resolution:
  - Container cgroup-level OOM:
    - Increase the Pod's memory limits. As a best practice, ensure actual usage remains below 80% of the new limit. For more information, see Manage Pods and Scale up or scale down node resources.
    - Enable resource profiling to get recommendations for container requests and limits.
  - Node-level OOM:
    - Scale up the node's memory resources or distribute the workload across more nodes. For more information, see Scale up or scale down node resources and Schedule applications to specific nodes.
    - Identify Pods with high memory usage on the node and set appropriate memory limits for them.
For more information about OOM errors, see Causes and solutions for OOM Killer.
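The dmesg check described above can be sketched as a small classifier. It applies exactly the rule stated in this section (out_of_memory means an OOM event, Memory cgroup means the cgroup level) and nothing more. Kernel message wording varies between versions, so treat the result as a hint. The function name is hypothetical.

```shell
# classify_oom
# Reads kernel log text on stdin and prints one of:
#   none   - no out_of_memory message found
#   cgroup - OOM message found together with "Memory cgroup"
#   node   - OOM message found without "Memory cgroup"
classify_oom() {
  log=$(cat)
  case $log in
    *out_of_memory*) ;;                 # an OOM event occurred, keep going
    *) echo none; return ;;
  esac
  case $log in
    *"Memory cgroup"*) echo cgroup ;;
    *) echo node ;;
  esac
}
```

On the node, run `dmesg -T | grep -i "memory" | classify_oom`.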
Pod status is CrashLoopBackOff
When a process inside a Pod exits unexpectedly, ACK attempts to restart the Pod. If the Pod repeatedly fails to reach a ready state after multiple restarts, its status becomes CrashLoopBackOff. Follow these steps to troubleshoot the issue:
- Log on to the ACK console. In the left navigation pane, click Clusters.
- On the Clusters page, click the name of your cluster. In the left navigation pane, click .
- In the list of Pods, find the abnormal Pod and click Details in the Actions column.
- Go to the Events tab and analyze the details of any abnormal events.
- Check the Logs tab for information about why the process failed.
  Note: If the Pod has been restarted, select Show the log of the last container exit to view logs from the failed attempt.
You can view only the 500 most recent log entries in the console. To view more historical logs, we recommend that you use a log persistence solution for unified collection and storage.
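The same information is available from kubectl. The helpers below are a sketch: the function names and the KUBECTL override variable are hypothetical, and treating --previous as the counterpart of the console's Show the log of the last container exit option is an assumption.

```shell
# pod_events POD [NAMESPACE]
# Shows the Pod description, which ends with its recent events.
pod_events() {
  "${KUBECTL:-kubectl}" describe pod "$1" -n "${2:-default}"
}

# previous_logs POD [NAMESPACE] [LINES]
# Shows the last LINES log lines (default 500) from the previous,
# terminated container instance of the Pod.
previous_logs() {
  "${KUBECTL:-kubectl}" logs "$1" -n "${2:-default}" --previous --tail="${3:-500}"
}
```

For example, `previous_logs my-app-7d9c prod 2000` would fetch more lines than the console displays, where my-app-7d9c and prod are placeholder names. Logs from earlier restarts are not retained this way, which is why a log persistence solution is still recommended.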