Use cluster diagnostics to troubleshoot cluster issues

Artificial Intelligence for IT Operations (AIOps) provides one-click diagnostics, including node diagnosis, pod diagnosis, Service diagnosis, Ingress diagnosis, memory diagnosis, network diagnostics, and AI Profiling. This topic describes how to use the cluster diagnostics feature in an ACK managed cluster.

Prerequisites

You have an ACK managed cluster. For more information, see Create an ACK managed cluster.
Make sure your Kubernetes cluster is in the Running state.
Note
Log on to the Container Service for Kubernetes console. On the Clusters page, check the Cluster Status column to confirm that the cluster is Running.

Diagnostic features

AIOps provides the diagnostic features described in the following table.

Diagnostic item	Description
Node diagnosis	Diagnose node-related issues, such as a node in the NotReady state.
Pod diagnosis	Diagnose issues related to an abnormal pod status, such as pod startup failures or frequent pod restarts.
Service diagnosis	Diagnose Service-related issues, such as Service configurations, resource quotas, and anomalous events.
Ingress diagnosis	Diagnose Ingress-related issues, such as traffic configurations.
Memory diagnosis	Diagnose node memory issues, such as memory leaks, cgroup leaks, and out-of-memory (OOM) errors. The diagnostic results provide visualizations of overall memory usage.
Network diagnostics	Diagnose common network issues, such as connectivity problems between pods, between the cluster and the Internet, or from the Internet to a load balancer.
AI Profiling	Collect real-time data from online GPU containers, including CPU calls, Python processes, system calls, and CUDA kernel functions. You can analyze the data on a visual interface.
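
Before you choose a diagnostic item, it can help to list the nodes and pods that show the symptoms named in the table. The following is a minimal sketch and is not part of the ACK console or API. It assumes that you have already loaded the output of `kubectl get nodes -o json` and `kubectl get pods -A -o json` into Python dictionaries. The function names and the restart threshold are illustrative assumptions.

```python
from typing import Any, Dict, List, Tuple


def not_ready_nodes(node_list: Dict[str, Any]) -> List[str]:
    """Return the names of nodes whose Ready condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(node["metadata"]["name"])
    return names


def frequently_restarting_pods(
    pod_list: Dict[str, Any], threshold: int = 5
) -> List[Tuple[str, str, int]]:
    """Return (namespace, name, restarts) for pods at or above the threshold.

    The threshold of 5 is an arbitrary example value, not an ACK default.
    """
    result = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        restarts = sum(s.get("restartCount", 0) for s in statuses)
        if restarts >= threshold:
            meta = pod["metadata"]
            result.append((meta["namespace"], meta["name"], restarts))
    return result
```

Nodes returned by the first function are candidates for node diagnosis, and pods returned by the second are candidates for pod diagnosis.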

Configure diagnostics

Important

The diagnostics feature runs a data collection program on your cluster nodes to collect check results. The collected information includes the system version, load, Docker and kubelet running status, and critical error messages from system logs. The data collection program does not collect your business information or sensitive data.

The procedure for node diagnosis is similar to the procedures for pod, Service, and Ingress diagnosis. This section uses node diagnosis as an example.

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Inspections and Diagnostics > Diagnostics.
On the Diagnostics page, click Node diagnosis. On the Node diagnosis page, click Diagnosis in the upper-left corner.
In the Select Node panel, select a Node Name, read the notes, select I know and agree, and then click Create diagnosis.
You can track the diagnostic progress on the page. After the diagnosis is complete, the page displays the results and a list of diagnostic items. You can then review the results to identify and resolve any issues.
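
The console tracks the diagnostic progress for you. If you script the same wait-and-review flow yourself, it could look like the sketch below. `fetch_status` is a hypothetical callable that you would implement on your own. Its return shape (a dictionary with a `finished` flag) and the timeout values are assumptions for illustration, not documented ACK behavior.

```python
import time
from typing import Any, Callable, Dict


def wait_for_diagnosis(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_status() until the diagnosis finishes or the timeout expires.

    fetch_status is hypothetical: it is assumed to return a dict that
    contains a boolean "finished" key once the diagnosis is complete.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        report = fetch_status()
        if report.get("finished"):
            return report
        if time.monotonic() >= deadline:
            raise TimeoutError("diagnosis did not finish before the timeout")
        sleep(interval_s)
```

The `sleep` parameter is injectable so that the loop can be exercised in tests without real delays.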

View diagnostic results

On the Diagnostics page, find the diagnostic report in the list and click Diagnosis details in the Operation column.

Note

The diagnostic items may vary based on the cluster configuration. Refer to the diagnostic page for the items that apply to your cluster.

Diagnostic item	Check item status	Description
Node diagnosis	Normal: No action is required. Warning: Review is required. You should address these issues to prevent potential cluster anomalies. Abnormal: Address the issue as soon as possible to ensure that your cluster runs as expected. Unknown: The check failed to complete or the result is unknown.	Node diagnosis includes checks for Node, NodeComponent, ClusterComponent, ECSControllerManager, and GPUNode. It determines the cause of a node anomaly by analyzing the node status, node component status, cluster component status, and ECS status. On the diagnosis details page, you can view the diagnostic results, recommended solutions, and a list of specific check items. Hover over the icon next to a check item to view its description. Items with an Abnormal or Warning status are displayed on the Troubleshoot tab. If the status of a check item is Abnormal, you can view anomaly details by hovering over Details in the Status column for that item.
Pod diagnosis	Same as node diagnosis.	Pod diagnosis includes checks for Pod, ClusterComponent, Node, NodeComponent, and ECSControllerManager. It determines the cause of a pod anomaly by analyzing the pod status, cluster component status, node status, node component status, and ECS status. On the diagnosis details page, you can view the pod diagnostic results, recommended solutions, and a list of specific check items. Hover over the icon next to a check item to view its description. Items with an Abnormal or Warning status are displayed on the Troubleshoot tab. If the status of a check item is Abnormal, you can view anomaly details by hovering over Details in the Status column for that item.
Service diagnosis	Same as node diagnosis.	Service diagnosis includes checks for Service and resource quotas. It determines the cause of a Service anomaly by checking items such as the SLB billing type, certificates, quotas, and anomalous events. Hover over the icon next to a check item to view its description. Items with an Abnormal or Warning status are displayed on the Troubleshoot tab. If the status of a check item is Abnormal, you can view anomaly details by hovering over Details in the Status column for that item.
Ingress diagnosis	Same as node diagnosis.	Ingress diagnosis includes checks for Ingress, Addon, and SLB. It determines the cause of an Ingress anomaly by analyzing the Ingress status, Ingress plug-in status, and SLB status. Hover over the icon next to a check item to view its description. Items with an Abnormal or Warning status are displayed on the Troubleshoot tab. If the status of a check item is Abnormal, you can view anomaly details by hovering over Details in the Status column for that item.
Memory diagnosis	None.	On the diagnosis details page, you can view the Memory Overview, Memory Analysis, and OOM Analysis tabs, which provide information such as memory leak status, memory utilization, and the memory consumed by each process.
Network diagnostics	Normal: No action is required. Abnormal: Address the issue as soon as possible.	On the Diagnosis result page, you can view the network diagnostic results. The Packet paths section displays a complete map of the diagnosed access path. Abnormal nodes are highlighted.
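
The status values in the table suggest a review order: Abnormal first, then Warning. As a rough sketch, assume that you have copied the check items into a list of dictionaries with hypothetical `name` and `status` keys (the console does not necessarily expose data in this shape). The following code orders the items by urgency and selects the ones that would appear on the Troubleshoot tab.

```python
from typing import Dict, List

# Review order based on the status descriptions in the table.
# Placing Unknown ahead of Normal is an assumption: an incomplete
# check is treated as worth a second look.
SEVERITY = {"Abnormal": 0, "Warning": 1, "Unknown": 2, "Normal": 3}


def sort_by_urgency(items: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return check items ordered from most to least urgent.

    Unrecognized status strings sort last. The sort is stable, so items
    with the same status keep their original relative order.
    """
    return sorted(items, key=lambda i: SEVERITY.get(i["status"], len(SEVERITY)))


def troubleshoot_items(items: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only Abnormal and Warning items, most urgent first."""
    return [
        i for i in sort_by_urgency(items)
        if i["status"] in ("Abnormal", "Warning")
    ]
```

Memory diagnosis has no check item statuses, so this ordering applies only to the diagnostic items that report them.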