Artificial Intelligence for IT Operations (AIOps) provides one-click diagnostics, including node diagnosis, pod diagnosis, Service diagnosis, Ingress diagnosis, memory diagnosis, network diagnostics, and AI Profiling. This topic describes how to use the cluster diagnostics feature in an ACK managed cluster.
Prerequisites
You have an ACK managed cluster. For more information, see Create an ACK managed cluster.
Make sure your Kubernetes cluster is in the Running state.
Note: Log on to the Container Service for Kubernetes console. On the Clusters page, check the Cluster Status column to confirm that the cluster is Running.
Diagnostic features
AIOps provides the diagnostic features described in the following table.
Diagnostic item | Description |
Node diagnosis | Diagnoses node-related issues, such as a node in the NotReady state. |
Pod diagnosis | Diagnoses issues related to an abnormal pod status, such as pod startup failures or frequent pod restarts. |
Service diagnosis | Diagnoses Service-related issues, such as Service configurations, resource quotas, and anomalous events. |
Ingress diagnosis | Diagnoses Ingress-related issues, such as traffic configurations. |
Memory diagnosis | Diagnoses node memory issues, such as memory leaks, cgroup leaks, and out-of-memory (OOM) errors. The diagnostic results provide visualizations of overall memory usage. |
Network diagnostics | Diagnoses common network issues, such as connectivity problems between pods, between the cluster and the Internet, or from the Internet to a load balancer. |
AI Profiling | Collects real-time data from online GPU containers, including CPU calls, Python processes, system calls, and CUDA kernel functions. You can analyze the data on a visual interface. |
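The node and pod symptoms named in the table can also be recognized in the Kubernetes API objects themselves. The following Python sketch is an illustration only and is not part of the AIOps feature: it assumes you have already exported node and pod lists as JSON (for example, with kubectl get nodes -o json and kubectl get pods -o json) and loaded them into dictionaries. The function names, the sample data, and the restart threshold of 5 are hypothetical.

```python
# Illustrative sketch: flag NotReady nodes and frequently restarting pods
# from Kubernetes list objects that were already loaded as dictionaries.
# The sample data and the threshold below are assumptions for demonstration.

def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition is treated the same as NotReady.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


def restarting_pods(pod_list, threshold=5):
    """Return (pod name, total restarts, oom_killed) for suspicious pods."""
    flagged = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        restarts = sum(s.get("restartCount", 0) for s in statuses)
        # A container whose last termination reason is OOMKilled ran out of memory.
        oom_killed = any(
            s.get("lastState", {}).get("terminated", {}).get("reason") == "OOMKilled"
            for s in statuses
        )
        if restarts >= threshold or oom_killed:
            flagged.append((pod["metadata"]["name"], restarts, oom_killed))
    return flagged


# Hypothetical sample data shaped like "kubectl get ... -o json" output.
SAMPLE_NODES = {"items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
]}
SAMPLE_PODS = {"items": [
    {"metadata": {"name": "web-1"},
     "status": {"containerStatuses": [{"restartCount": 0}]}},
    {"metadata": {"name": "worker-1"},
     "status": {"containerStatuses": [
         {"restartCount": 7,
          "lastState": {"terminated": {"reason": "OOMKilled"}}}]}},
]}

if __name__ == "__main__":
    print(not_ready_nodes(SAMPLE_NODES))
    print(restarting_pods(SAMPLE_PODS))
```

A quick check like this can help you decide whether to run node diagnosis, pod diagnosis, or memory diagnosis first. It does not replace the checks that the diagnostics feature performs.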
Configure diagnostics
The diagnostics feature runs a data collection program on your cluster nodes to collect check results. The collected information includes the system version, load, Docker and kubelet running status, and critical error messages from system logs. The data collection program does not collect your business information or sensitive data.
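To make the kinds of host-level information listed above more concrete, the following Python sketch gathers a comparable summary on a Linux node. It is not the data collection program that ACK runs, and its output format is an assumption. The function names are hypothetical, and detecting dockerd and kubelet by scanning /proc is only one possible approach.

```python
# Illustrative sketch only: NOT the ACK data collection program.
# Gathers the system version, load average, and whether dockerd and
# kubelet processes are running, using the Python standard library.
import os
import platform
from pathlib import Path


def running_process_names():
    """Return the set of process names found under /proc (Linux only)."""
    names = set()
    for comm in Path("/proc").glob("[0-9]*/comm"):
        try:
            names.add(comm.read_text().strip())
        except OSError:
            continue  # The process exited while we were scanning.
    return names


def collect_host_summary():
    """Collect a small, non-sensitive summary of the host state."""
    procs = running_process_names()
    return {
        "system_version": platform.platform(),
        "load_average": os.getloadavg(),  # 1, 5, and 15 minute averages
        "dockerd_running": "dockerd" in procs,
        "kubelet_running": "kubelet" in procs,
    }


if __name__ == "__main__":
    for key, value in collect_host_summary().items():
        print(f"{key}: {value}")
```

The sketch reads only system metadata and process names, which mirrors the statement that business information and sensitive data are not collected.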
Configuring diagnostics for nodes is similar to configuring them for pods, Services, and Ingresses. This section uses node diagnosis as an example.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose .
On the Diagnostics page, click Node diagnosis. On the Node diagnosis page, click Diagnosis in the upper-left corner.
In the Select Node panel, select a Node Name, read the notes, select I know and agree, and then click Create diagnosis.
You can track the diagnostic progress on the page. After the diagnosis is complete, the page displays the results and a list of diagnostic items. You can then review the results to identify and resolve any issues.
View diagnostic results
On the Diagnostics page, find the diagnostic report in the list and click Diagnosis details in the Operation column.
The diagnostic items may vary based on the cluster configuration. The items shown on the diagnosis page take precedence over the descriptions in this topic.
Diagnostic item | Check item status | Description |
Node diagnosis | | Node diagnosis includes checks for Node, NodeComponent, ClusterComponent, ECSControllerManager, and GPUNode. It determines the cause of a node anomaly by analyzing the node status, node component status, cluster component status, and ECS status. On the diagnosis details page, you can view the diagnostic results, recommended solutions, and a list of specific check items. Items with an Abnormal or Warning status are displayed on the Troubleshoot tab. If the status of a check item is Abnormal, you can view anomaly details by hovering over Details in the Status column for that item. |
Pod diagnosis | | Pod diagnosis includes checks for Pod, ClusterComponent, Node, NodeComponent, and ECSControllerManager. It determines the cause of a pod anomaly by analyzing the pod status, cluster component status, node status, node component status, and ECS status. On the diagnosis details page, you can view the pod diagnostic results, recommended solutions, and a list of specific check items. Items with an Abnormal or Warning status are displayed on the Troubleshoot tab. If the status of a check item is Abnormal, you can view anomaly details by hovering over Details in the Status column for that item. |
Service diagnosis | | Service diagnosis includes checks for Service and resource quotas. It determines the cause of a Service anomaly by checking items such as the SLB billing type, certificates, quotas, and anomalous events. Items with an Abnormal or Warning status are displayed on the Troubleshoot tab. If the status of a check item is Abnormal, you can view anomaly details by hovering over Details in the Status column for that item. |
Ingress diagnosis | | Ingress diagnosis includes checks for Ingress, Addon, and SLB. It determines the cause of an Ingress anomaly by analyzing the Ingress status, Ingress plug-in status, and SLB status. Items with an Abnormal or Warning status are displayed on the Troubleshoot tab. If the status of a check item is Abnormal, you can view anomaly details by hovering over Details in the Status column for that item. |
Memory diagnosis | None. | On the diagnosis details page, you can view the Memory Overview, Memory Analysis, and OOM Analysis tabs, which provide information such as memory leak status, memory utilization, and the memory consumed by each process. |
Network diagnostics | | On the Diagnosis result page, you can view the network diagnostic results. The Packet paths section displays a complete map of the diagnosed access path. Abnormal nodes are highlighted. |
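The way check items are surfaced on the Troubleshoot tab can be sketched as a simple filter over item statuses. The following Python example is a hypothetical model, not the console's implementation or an ACK API. The CheckItem class, the item names, and the detail strings are invented for illustration, and the Normal status is an assumption, because the table names only the Abnormal and Warning statuses. The group names come from the node diagnosis row above.

```python
# Hypothetical model of a diagnostic report: not an ACK API.
# Shows how items with an Abnormal or Warning status could be selected
# for a "Troubleshoot" view, with Abnormal items listed first.
from dataclasses import dataclass


@dataclass
class CheckItem:
    group: str         # for example Node, NodeComponent, ClusterComponent
    name: str          # hypothetical check item name
    status: str        # "Normal" (assumed), "Warning", or "Abnormal"
    details: str = ""  # anomaly details, shown for Abnormal items


def troubleshoot_items(items):
    """Return Abnormal and Warning items, Abnormal first, order otherwise kept."""
    priority = {"Abnormal": 0, "Warning": 1}
    flagged = [item for item in items if item.status in priority]
    # sorted() is stable, so items with the same status keep their order.
    return sorted(flagged, key=lambda item: priority[item.status])


# Hypothetical report contents for a node diagnosis.
REPORT = [
    CheckItem("Node", "node-ready-condition", "Normal"),
    CheckItem("NodeComponent", "kubelet-status", "Warning"),
    CheckItem("ClusterComponent", "component-health", "Normal"),
    CheckItem("ECSControllerManager", "ecs-instance-status", "Abnormal",
              details="Example detail text for an abnormal item."),
]

if __name__ == "__main__":
    for item in troubleshoot_items(REPORT):
        print(item.status, item.group, item.name, item.details)
```

Sorting Abnormal items ahead of Warning items is a design choice of this sketch. The console may order items differently.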