Use the Node diagnosis feature of Inspections and Diagnostics to troubleshoot GPU-accelerated node issues - Container Service for Kubernetes (ACK) - Alibaba Cloud Help Center

Prerequisites

An ACK Pro cluster is created.
The cluster is in the Running state. Log on to the ACK console to check the cluster status on the Clusters page.

Run a node diagnosis

Select a GPU-accelerated node, run Node diagnosis, and use the report to identify and resolve issues.

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Inspections and Diagnostics > Diagnostics.
On the Diagnostics page, click Node diagnosis. On the Node diagnosis page, click Diagnosis in the upper-left corner.
In the Select Node panel, select a Node Name, read the notes, select I know and agree, and then click Create diagnosis.
You can track the diagnostic progress on the page. After the diagnosis is complete, the page displays the results and a list of diagnostic items. You can then review the results to identify and resolve any issues.

If the diagnosed node is a GPU-accelerated node, the diagnostic report includes GPU-related items after the diagnosis completes. Use the report together with the sections "Troubleshoot with nvidia-smi status codes" and "Troubleshoot with XID errors" to resolve issues.

Troubleshoot with nvidia-smi status codes

nvidia-smi monitors NVIDIA GPU performance and health. Find the NVIDIASMIStatusCode result in the diagnostic report. The table below lists each status code and the recommended action.

nvidia-smi status code	Description	Actions
0	Command succeeded. nvidia-smi is working as expected.	Not applicable.
3	Operation unavailable on the target device. The node may not support nvidia-smi, or a driver issue exists.	Check `/var/log/nvidia-installer.log` for driver logs and run `dmesg | grep -i nv` for kernel errors.
6	Failed to query GPU devices. Indicates a driver issue.	Check `/var/log/nvidia-installer.log` for driver logs and run `dmesg | grep -i nv` for kernel errors.
8	GPU device power cable not connected correctly. Indicates a hardware issue.	Submit a ticket to contact ECS technical support.
9	NVIDIA driver not loaded. Indicates a driver issue.	Check `/var/log/nvidia-installer.log` for driver logs and run `dmesg | grep -i nv` for kernel errors.
10	The NVIDIA kernel detected an interrupt issue.	Check `/var/log/nvidia-installer.log` for driver logs, run `dmesg | grep -i nv` for kernel errors, or review XID diagnostic results.
12	The NVML shared library was not found or could not be loaded.	Check `/var/log/nvidia-installer.log` for driver logs, run `dmesg | grep -i nv` for kernel errors, or review XID diagnostic results.
13	The local NVML version does not match the driver version.	Check `/var/log/nvidia-installer.log` for driver logs, run `dmesg | grep -i nv` for kernel errors, or review XID diagnostic results.
14	infoROM corrupted. Indicates a hardware issue.	Submit a ticket to contact ECS technical support.
15	GPU has fallen off the bus. Indicates a hardware issue.	Submit a ticket to contact ECS technical support.
255	Unknown driver error. Indicates a driver issue.	Check `/var/log/nvidia-installer.log` for driver logs, run `dmesg | grep -i nv` for kernel errors, or review XID diagnostic results.
-1	The nvidia-smi command timed out.	Check `/var/log/nvidia-installer.log` for driver logs, run `dmesg | grep -i nv` for kernel errors, or review XID diagnostic results.
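
For scripted triage, the table above can be expressed as a small lookup. The following is a minimal sketch only: the codes and actions are transcribed from the table, while the names `ACTIONS` and `recommend_action` and the fallback message for unlisted codes are hypothetical and not part of ACK or nvidia-smi.

```python
# Lookup sketch. Codes and actions are transcribed from the table above;
# all constant and function names here are hypothetical.
DRIVER_CHECK = (
    "Check /var/log/nvidia-installer.log for driver logs and "
    "run `dmesg | grep -i nv` for kernel errors."
)
DRIVER_OR_XID = DRIVER_CHECK + " Alternatively, review the XID diagnostic results."
HARDWARE_TICKET = "Submit a ticket to contact ECS technical support."

ACTIONS = {
    0: "Not applicable.",
    3: DRIVER_CHECK,
    6: DRIVER_CHECK,
    8: HARDWARE_TICKET,
    9: DRIVER_CHECK,
    10: DRIVER_OR_XID,
    12: DRIVER_OR_XID,
    13: DRIVER_OR_XID,
    14: HARDWARE_TICKET,
    15: HARDWARE_TICKET,
    255: DRIVER_OR_XID,
    -1: DRIVER_OR_XID,  # -1 is how the diagnostic report marks a timeout
}


def recommend_action(status_code: int) -> str:
    """Return the recommended action for an NVIDIASMIStatusCode value."""
    # The fallback text is an assumption for codes the table does not list.
    return ACTIONS.get(status_code, "Status code not listed in the table.")


print(recommend_action(15))
```

Calling `recommend_action` with the NVIDIASMIStatusCode value from the report returns the matching action text, which keeps the table as the single source of truth for any automation built around it.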

Troubleshoot with XID errors

XID messages are GPU error reports logged by the NVIDIA driver to the OS kernel log. They identify the error type, location, and code related to GPU hardware, NVIDIA software, or your application.
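
Because XID messages land in the kernel log, they can also be extracted from `dmesg` output directly. The sketch below assumes the common `NVRM: Xid (PCI:<address>): <code>, ...` line shape written by the NVIDIA driver; the function name `extract_xids` and the sample log text are illustrative and were not captured from a real node.

```python
import re

# Matches lines such as "NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, ...".
# The "PCI:" prefix is treated as optional because some driver versions omit it.
XID_PATTERN = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)")


def extract_xids(kernel_log: str) -> list[tuple[str, int]]:
    """Return (pci_address, xid_code) pairs found in kernel log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_PATTERN.finditer(kernel_log)]


# Illustrative kernel log text (hypothetical values).
sample = """\
[ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.
[ 1300.000001] nvidia-nvlink: Nvlink Core is being initialized
[ 1400.123456] NVRM: Xid (PCI:0000:af:00): 31, pid=5678, Ch 00000008
"""

for pci_address, xid in extract_xids(sample):
    print(f"GPU {pci_address}: XID {xid}")
```

On a node, the input text could come from the output of `dmesg`. The function only parses text, so it can be run against saved logs as well.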

In the diagnostic report, check the "XID exceptions on GPU-accelerated node" item. If the item is empty, no XID errors were found. Otherwise, use the tables below to troubleshoot the issue yourself or to decide whether to submit a ticket.

Self-troubleshooting

If you encounter any of the following XID errors, try these steps to resolve them:

Resubmit the workload and check whether the XID error disappears.
If the error persists, review your code and logs to determine whether your code caused it.
If your code is not the cause, submit a ticket for technical support.

XID	Description
13	Graphics Engine Exception. Usually caused by out-of-bounds array access or illegal instruction. Rarely a hardware issue.
31	GPU memory page fault. Usually caused by illegal memory access from the application. Rarely a driver or hardware issue.
43	GPU stopped processing. Usually an application error, not a hardware issue.
45	Preemptive cleanup due to previous errors. Most likely seen when multiple CUDA applications are running and one of them hits a Double Bit ECC Error (DBE). Indicates that a GPU application exited because of a manual stop, a hardware issue, or a resource limit. Determining the specific cause requires further log analysis.
68	NVDEC0 Exception. Usually a hardware or driver issue.

Ticket-based troubleshooting

For the following XID errors, submit a ticket to technical support and include the full GPU node diagnostic output.

XID	Description
32	Invalid or corrupted push buffer stream. Reported by the DMA controller on the PCIe bus. Usually caused by a PCI quality issue, not the application.
38	Driver firmware error, not a hardware issue.
48	Double Bit ECC Error (DBE). Reported when the GPU encounters an uncorrectable error, also reported to your application. Usually requires a GPU reset or node restart to clear.
61	Internal micro-controller breakpoint/warning. The GPU internal engine has stopped, affecting your workloads.
62	Internal micro-controller halt. Similar to XID 61.
63	ECC page retirement or row remapping recording event. When a GPU memory hardware error occurs, the NVIDIA self-correction mechanism retires or remaps the faulty memory area and records it in the infoROM. Volta: ECC page retirement event successfully recorded to infoROM. Ampere: Row remapping event successfully recorded to infoROM.
64	ECC page retirement or row remapper recording failure. Similar to XID 63, but the infoROM recording failed.
74	NVLINK Error. Indicates a critical NVLink hardware failure. The GPU must be taken offline for maintenance.
79	GPU has fallen off the bus and can no longer be detected. Critical hardware failure. The GPU must be taken offline for maintenance.
92	High single-bit ECC error rate. Indicates a hardware or driver failure.
94	Contained ECC error. The NVIDIA error containment mechanism isolated an uncorrectable ECC error to the affected application, preventing impact on other applications on the node.
95	Uncontained ECC error. Similar to XID 94, but containment failed. All applications on the GPU are affected.
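
The split between the two XID tables can be captured in a short triage helper. This is a sketch under the assumption that the two tables above are the complete policy; the set names, the function `triage_xid`, and the returned labels are hypothetical.

```python
# XID codes transcribed from the "Self-troubleshooting" table.
SELF_TROUBLESHOOT = {13, 31, 43, 45, 68}
# XID codes transcribed from the "Ticket-based troubleshooting" table.
TICKET = {32, 38, 48, 61, 62, 63, 64, 74, 79, 92, 94, 95}


def triage_xid(xid: int) -> str:
    """Classify an XID code according to the two tables in this topic."""
    if xid in SELF_TROUBLESHOOT:
        return "self-troubleshoot"
    if xid in TICKET:
        return "submit-ticket"
    # Codes outside both tables are not covered by this topic.
    return "unlisted"


for code in (31, 79, 7):
    print(code, triage_xid(code))
```

Codes that return "unlisted" are not covered by this topic, so the helper deliberately does not guess an action for them.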
