Resolve common GPU issues in ACK clusters, including driver setup, NVML errors, and node management.
| Problem categorization | Description | Link |
|---|---|---|
| GPU errors and troubleshooting | GPU driver issues, monitoring tools such as DCGM and Prometheus, and runtime errors such as NVML initialization failures and XID errors. | |
| cGPU (containerized GPU) issues | cGPU configuration, startup, runtime errors, and kernel module permission issues. | |
| GPU node and cluster management | Cluster-level operations including GPU card usage detection, virtualization support, node maintenance such as kernel upgrades, and faulty card isolation. | |
Why are the GPU ECC configurations in my cluster inconsistent?
Error-Correcting Code (ECC) mode detects and corrects GPU memory errors, improving stability and reliability at the cost of slightly reduced available GPU memory. ACK does not enforce uniform ECC settings, so configurations can differ between nodes.
When to enable or disable ECC:
| Recommendation | Workload type |
|---|---|
| Disable ECC | Cost-sensitive workloads and low-latency inference, such as online real-time inference |
| Enable ECC | Workloads requiring data consistency and integrity, such as database servers, financial systems, scientific computing, and high-performance computing (HPC) |
Set the ECC mode for a GPU node:
- Check the current ECC status.
    nvidia-smi
  Expected output:
    Fri Jun  6 11:49:05 2025
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla T4                       On  | 00000000:00:08.0 Off |                    0 |
    | N/A   31C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+
  In the Volatile Uncorr. ECC column, 0 means ECC is enabled with no errors; Off means ECC is disabled.
- Enable or disable ECC as needed.
  - Enable ECC for all GPUs on the node:
      nvidia-smi -e 1
  - Disable ECC for all GPUs on the node:
      nvidia-smi -e 0
- Restart the operating system for the change to take effect.
  Important: Save all necessary data before restarting the node.
- Confirm the new ECC status with nvidia-smi. The following output shows ECC disabled:
    Fri Jun  6 11:52:15 2025
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla T4                       On  | 00000000:00:08.0 Off |                  Off |
    | N/A   31C    P8               9W /  70W |      0MiB / 16384MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+
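The status check can also be scripted. The following is a minimal sketch rather than part of the official procedure: it assumes the installed nvidia-smi supports the ecc.mode.current and ecc.mode.pending query fields, and the helper name ecc_pending_gpus is hypothetical. It lists the GPUs whose pending ECC mode differs from the current mode, which are the GPUs that still need a restart.

```shell
# Hypothetical helper: reads "index, current, pending" CSV rows on stdin
# and prints the index of every GPU whose pending ECC mode differs from
# its current mode (a restart is still required for those GPUs).
ecc_pending_gpus() {
  awk -F', *' '$2 != $3 { print $1 }'
}

# Usage on a GPU node (assumes nvidia-smi is on PATH):
#   nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending \
#     --format=csv,noheader | ecc_pending_gpus
```

An empty result suggests that no ECC change is waiting for a restart.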
Does ACK support vGPU-accelerated instances?
vGPU-accelerated instances require a GRID License from NVIDIA, and you must build your own license server to use it.
Alibaba Cloud does not provide license servers, so vGPU-accelerated instances cannot be used directly, even in a vGPU-accelerated cluster. For this reason, the ACK console no longer supports selecting vGPU-accelerated instances as cluster nodes.
The affected instance types have the prefixes ecs.vgn5i, ecs.vgn6i, ecs.vgn7i, and ecs.sgn7i. To use these instances, purchase a GRID License from NVIDIA, then purchase an Elastic Compute Service (ECS) instance and follow the official NVIDIA tutorial to build a license server on it. See NVIDIA.
The license server is also required to update the NVIDIA driver license for vGPU-accelerated instances.
If you have a license server, follow these steps to add vGPU-accelerated instances to an ACK cluster:
- Go to Privilege Quota and request the custom OS image feature.
- Create a custom OS image based on CentOS 7.x or Alibaba Cloud Linux 2 with the NVIDIA GRID driver and GRID License configured. See Create a custom image from an instance and Install a GRID driver on a vGPU-accelerated instance (Linux).
- Create a node pool. See Create and manage a node pool.
- Add the vGPU-accelerated instances to the node pool. See Add existing nodes.
What's next: See Update the NVIDIA driver license for vGPU-accelerated (vGPU) instances in an ACK cluster.
How to manually upgrade the kernel on a GPU node in an existing cluster
After upgrading the kernel, reinstall the NVIDIA driver to restore GPU functionality.
Upgrade the kernel only if its version is earlier than 3.10.0-957.21.3.
This procedure does not cover the kernel upgrade itself. It only describes the NVIDIA driver reinstall required after the kernel is upgraded.
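To decide whether a node falls under the version condition above, the running kernel release can be compared with the threshold. The following is a minimal sketch: the helper name kernel_older_than is hypothetical, and it assumes GNU sort with the -V (version sort) option is available on the node.

```shell
# Hypothetical helper: succeeds (exit 0) when kernel release $1 sorts
# strictly before release $2 in GNU version order.
kernel_older_than() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage on the node:
#   if kernel_older_than "$(uname -r)" 3.10.0-957.21.3; then
#     echo "kernel is older than 3.10.0-957.21.3"
#   fi
```

Distribution suffixes such as .el7.x86_64 take part in the comparison, so treat the result with care when the numeric part of the release equals the threshold.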
- Cordon the GPU node to mark it as unschedulable. This example uses the node cn-beijing.i-2ze19qyi8votgjz12345.
    kubectl cordon cn-beijing.i-2ze19qyi8votgjz12345
  Expected output:
    node/cn-beijing.i-2ze19qyi8votgjz12345 already cordoned
- Drain the GPU node.
    kubectl drain cn-beijing.i-2ze19qyi8votgjz12345 --grace-period=120 --ignore-daemonsets=true
  Expected output:
    node/cn-beijing.i-2ze19qyi8votgjz12345 cordoned
    WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
    pod/nginx-ingress-controller-78d847fb96-5fkkw evicted
- Uninstall the current NVIDIA driver.
  Note: This example uses driver version 384.111. For a different version, download the matching package from NVIDIA and replace the version number.
  - Log on to the GPU node and check the driver version using nvidia-smi.
      sudo nvidia-smi -a | grep 'Driver Version'
    Expected output:
      Driver Version : 384.111
  - Download the NVIDIA driver installation package.
      cd /tmp/ && sudo curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
    Note: You must use the installation package to uninstall the NVIDIA driver.
  - Uninstall the driver.
      sudo chmod u+x NVIDIA-Linux-x86_64-384.111.run
      sudo sh ./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
Upgrade the kernel.
-
Restart the GPU instance.
sudo reboot -
Log on to the GPU node again and install the kernel-devel package.
sudo yum install -y kernel-devel-$(uname -r) -
Download and install the required NVIDIA driver from the NVIDIA website. This example uses version 410.79.
cd /tmp/ sudo curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run sudo chmod u+x NVIDIA-Linux-x86_64-410.79.run sudo sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q # warm up GPU sudo nvidia-smi -pm 1 || true sudo nvidia-smi -acp 0 || true sudo nvidia-smi --auto-boost-default=0 || true sudo nvidia-smi --auto-boost-permission=0 || true sudo nvidia-modprobe -u -c=0 -m || true -
Verify
/etc/rc.d/rc.localcontains the following configuration. Add it if missing.sudo nvidia-smi -pm 1 || true sudo nvidia-smi -acp 0 || true sudo nvidia-smi --auto-boost-default=0 || true sudo nvidia-smi --auto-boost-permission=0 || true sudo nvidia-modprobe -u -c=0 -m || true -
Restart kubelet and Docker.
sudo service kubelet stop sudo service docker restart sudo service kubelet start -
Set the GPU node back to schedulable.
kubectl uncordon cn-beijing.i-2ze19qyi8votgjz12345 node/cn-beijing.i-2ze19qyi8votgjz12345 already uncordoned -
Verify the driver version in the device plugin pod on the GPU node.
If
docker psshows no containers running on the GPU node, see Fix container startup issues on GPU nodes.kubectl exec -n kube-system -t nvidia-device-plugin-cn-beijing.i-2ze19qyi8votgjz12345 nvidia-smi Thu Jan 17 00:33:27 2019 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: N/A | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |===============================+======================+======================| | 0 Tesla P100-PCIE... On | 00000000:00:09.0 Off | 0 | | N/A 27C P0 28W / 250W | 0MiB / 16280MiB | 0% Default | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+
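The rc.local check in the procedure above ("add it if missing") can be made repeatable. The following is a minimal sketch under stated assumptions: the helper names ensure_line and ensure_gpu_warmup are hypothetical, and the lines written are copied from the procedure. Each line is appended only when the exact line is not already in the file.

```shell
# Hypothetical helper: append a line to a file only if the exact line is
# not already present, so repeated runs do not create duplicates.
ensure_line() {
  grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Hypothetical helper: make sure the GPU warm-up settings from the
# procedure above are present in the target file ($1).
ensure_gpu_warmup() {
  while IFS= read -r line; do
    ensure_line "$1" "$line"
  done <<'EOF'
sudo nvidia-smi -pm 1 || true
sudo nvidia-smi -acp 0 || true
sudo nvidia-smi --auto-boost-default=0 || true
sudo nvidia-smi --auto-boost-permission=0 || true
sudo nvidia-modprobe -u -c=0 -m || true
EOF
}

# Usage (run as root on the GPU node):
#   ensure_gpu_warmup /etc/rc.d/rc.local
```

Matching whole lines with grep -x keeps the helper from treating a commented-out or partial entry as already configured.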
Fix container startup issues on GPU nodes
Symptom: After restarting kubelet and Docker on a GPU node, no containers start.
sudo service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
sudo service docker stop
Redirecting to /bin/systemctl stop docker.service
sudo service docker start
Redirecting to /bin/systemctl start docker.service
sudo service kubelet start
Redirecting to /bin/systemctl start kubelet.service
sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
Cause: A Cgroup driver mismatch between Docker and kubelet. Check the Cgroup driver for Docker:
sudo docker info | grep -i cgroup
Cgroup Driver: cgroupfs
Fix: If the output shows cgroupfs, follow these steps.
- Back up /etc/docker/daemon.json, then update it with the following configuration. The file is written through sudo tee because a plain shell redirection does not run with root privileges.
sudo tee /etc/docker/daemon.json >/dev/null <<EOF
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "10"
    },
    "oom-score-adjust": -1000,
    "storage-driver": "overlay2",
    "storage-opts": ["overlay2.override_kernel_check=true"],
    "live-restore": true
}
EOF
- Restart Docker and kubelet.
    sudo service kubelet stop
    Redirecting to /bin/systemctl stop kubelet.service
    sudo service docker restart
    Redirecting to /bin/systemctl restart docker.service
    sudo service kubelet start
    Redirecting to /bin/systemctl start kubelet.service
- Confirm the Cgroup driver is now systemd.
    sudo docker info | grep -i cgroup
    Cgroup Driver: systemd
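The backup and the driver check from this fix can be wrapped in two small helpers. This is a sketch only: the function names backup_file and cgroup_driver_from_info are hypothetical, and the parser assumes that docker info prints a line of the form "Cgroup Driver: <value>", as in the output shown above.

```shell
# Hypothetical helper: copy a file to a timestamped backup next to it
# and print the backup path.
backup_file() {
  backup="$1.bak.$(date +%Y%m%d%H%M%S)"
  cp -p "$1" "$backup" && printf '%s\n' "$backup"
}

# Hypothetical helper: read `docker info` output on stdin and print the
# value of the "Cgroup Driver" field.
cgroup_driver_from_info() {
  awk -F': *' '/Cgroup Driver/ { print $2; exit }'
}

# Usage on the node (backup_file needs root to write under /etc/docker):
#   backup_file /etc/docker/daemon.json
#   sudo docker info 2>/dev/null | cgroup_driver_from_info
```

If the second command prints cgroupfs, the mismatch described in this section is the likely cause.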
What do I do if adding an ECS Bare Metal Instance node fails?
Symptom: Adding an ecs.ebmgn7 ECS Bare Metal Instance node to a cluster fails.
Cause: ECS Bare Metal Instances (ecs.ebmgn7) support multi-instance GPU (MIG). ACK resets existing MIG settings when adding these nodes to prevent conflicts. If the reset times out, node addition fails.
Diagnose: Check the ACK deployment log on the node host.
sudo cat /var/log/ack-deploy.log
If the log shows this error, the MIG reset timed out:
command timeout: timeout 300 nvidia-smi --gpu-reset
Fix: Add the node again. See Add existing nodes.
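The log check can be reduced to a single test. A minimal sketch, assuming the error text appears in the log exactly as quoted above; the helper name mig_reset_timed_out is hypothetical.

```shell
# Hypothetical helper: succeeds when the given log file contains the
# MIG reset timeout message quoted above.
mig_reset_timed_out() {
  grep -qF 'command timeout: timeout 300 nvidia-smi --gpu-reset' "$1"
}

# Usage on the node host (run as root if the log is not world-readable):
#   if mig_reset_timed_out /var/log/ack-deploy.log; then
#     echo "MIG reset timed out; add the node again"
#   fi
```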
What do I do if Failed to initialize NVML: Unknown Error occurs when running a GPU container on Alibaba Cloud Linux 3?
Symptom: Running nvidia-smi in a GPU container returns:
sudo nvidia-smi
Failed to initialize NVML: Unknown Error
Cause: Running systemctl daemon-reload or systemctl daemon-reexec on Alibaba Cloud Linux 3 updates cgroup configurations, which interferes with NVIDIA Management Library (NVML) access in containers. See community issues #1671 and #48.
Fix: Apply one of the following solutions based on your setup.
- Using NVIDIA_VISIBLE_DEVICES=all: Add privileged: true to the container's securityContext.
    apiVersion: v1
    kind: Pod
    metadata:
      name: test-gpu-pod
    spec:
      containers:
        - name: test-gpu-pod
          image: centos:7
          command:
          - sh
          - -c
          - sleep 1d
          securityContext:
            # Add privileged permissions to the container.
            privileged: true
- Using shared GPU scheduling: Switch to Alibaba Cloud Linux 2 or CentOS 7.
- Quick workaround: Recreate the application pod. This is a temporary fix and the issue may recur. Assess the business impact before proceeding.
- If none of the above apply: Evaluate whether your workload can run on a different operating system, such as Alibaba Cloud Linux 2 or CentOS 7.
What do I do if a GPU card becomes unavailable due to XID 119 or XID 120 errors?
Symptom: A GPU card fails to initialize. After you run sh nvidia-bug-report.sh, the generated log shows XID 119 or XID 120 errors.
[Figure: example XID 119 error in the log]
For other XID errors, see NVIDIA Common XID Errors.
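XID codes can also be pulled from the kernel log without the full bug report. The following is a minimal sketch, assuming the driver logs XID events as kernel messages of the form "NVRM: Xid (PCI:<address>): <code>, ..."; the helper name xid_codes is hypothetical.

```shell
# Hypothetical helper: read kernel log text on stdin and print each
# distinct XID code found in "NVRM: Xid (...): <code>," messages,
# sorted numerically.
xid_codes() {
  sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' | sort -n | uniq
}

# Usage on the GPU node:
#   sudo dmesg | xid_codes
```

A result that includes 119 or 120 points to the GSP issue described in this section.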
Cause: An exception in the GPU System Processor (GSP) component.
Fix: First, update the NVIDIA driver to the latest version. If the issue persists, disable GSP. NVIDIA introduced GSP in driver version 510. See Chapter 42. GSP firmware.
Disable GSP based on your scenario:
Scaling out new nodes
Create a node pool or edit an existing one. In the advanced configuration, add the label ack.aliyun.com/disable-nvidia-gsp=true. ACK automatically disables GSP on new nodes added to this pool. See Create and manage a node pool.
Note: Disabling GSP may increase node scale-out time.
Adding existing nodes
- Create a node pool or edit an existing one. Add the label ack.aliyun.com/disable-nvidia-gsp=true to the node pool's advanced configuration. ACK automatically disables GSP when existing nodes are added. See Create and manage a node pool.
  Note: Disabling GSP may increase the time to add nodes.
- Add existing nodes to the node pool. See Add existing nodes.
Managing existing nodes in a cluster
Option 1: Use a node pool label
- Add the label ack.aliyun.com/disable-nvidia-gsp=true to the node's node pool. See Edit a node pool.
- Remove the node from the cluster without releasing the ECS instance. See Remove a node from a cluster or node pool.
- Re-add the node to the cluster as an existing node. See Add existing nodes.
Option 2: Manually disable GSP on the node
If you cannot remove and re-add the node, log on to the node and manually disable GSP. See FAQ.
When upgrading from driver 470 to 525, disable GSP for 525. Version 470 lacks GSP, but 525 may trigger a GSP bug. After upgrading, follow the FAQ steps to manually disable GSP.
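After GSP has been disabled by any of the options above, the result can be checked on the node. This sketch rests on an assumption that is not stated in this document: that nvidia-smi -q on driver 510 or later prints a "GSP Firmware Version" field, and that the field reads N/A when GSP is not in use. The helper name gsp_firmware_version is hypothetical.

```shell
# Hypothetical helper: read `nvidia-smi -q` output on stdin and print the
# value of the first "GSP Firmware Version" field, if the driver reports one.
gsp_firmware_version() {
  awk -F': *' '/GSP Firmware Version/ { print $2; exit }'
}

# Usage on the GPU node:
#   nvidia-smi -q | gsp_firmware_version
# Assumption: a version number means GSP is active; "N/A" means it is not.
```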
How to manually isolate a faulty GPU card in a cluster
Under shared GPU scheduling, a faulty GPU can cause repeated job failures. Mark the GPU as unhealthy to exclude it from scheduling.
Prerequisites:
- For clusters running Kubernetes 1.24 or later: scheduler version 1.xx.x-aliyun-6.4.3.xxx or later.
- For clusters running Kubernetes 1.22: scheduler version 1.22.15-aliyun-6.2.4.xxx or later.
- Shared GPU scheduling is enabled.
Submit the following ConfigMap. Replace <node-name> with the actual node name, and set deviceId to the GPU index from nvidia-smi.
apiVersion: v1
kind: ConfigMap
metadata:
  name: <node-name>-device-status   # Replace <node-name> with the actual node name.
  namespace: kube-system
data:
  devices: |
    - deviceId: 0          # Run nvidia-smi to get the GPU index.
      deviceType: gpu
      healthy: false
The ConfigMap must be in the kube-system namespace with the name format <node-name>-device-status. In the data field, deviceId is the GPU index from nvidia-smi, deviceType is gpu, and healthy is false. Once submitted, the scheduler stops allocating work to that GPU.
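To avoid editing the manifest by hand for each node, the ConfigMap can be generated from the node name and GPU index. A minimal sketch: the helper name device_status_configmap is hypothetical, and the manifest fields are those described above.

```shell
# Hypothetical helper: print the device-status ConfigMap for node $1,
# marking the GPU with index $2 as unhealthy.
device_status_configmap() {
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: $1-device-status
  namespace: kube-system
data:
  devices: |
    - deviceId: $2
      deviceType: gpu
      healthy: false
EOF
}

# Usage (node name taken from the examples in this document):
#   device_status_configmap cn-beijing.i-2ze19qyi8votgjz12345 0 | kubectl apply -f -
```

Printing the manifest first, without the kubectl apply step, is a simple way to review it before submitting.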
Resolve the "Failed to initialize NVML: Unknown Error" message in GPU containers
Symptom: Running nvidia-smi in a GPU container returns:
sudo nvidia-smi
Failed to initialize NVML: Unknown Error
This issue affects nodes running Ubuntu 22.04 or Red Hat Enterprise Linux (RHEL) 9.3 64-bit.
Cause: Running systemctl daemon-reload or systemctl daemon-reexec on the node updates cgroup configurations, which cuts off NVML access for affected containers.
Affected pods:
- Pods that specify aliyun.com/gpu-mem in resources.limits
- Pods that set NVIDIA_VISIBLE_DEVICES as a container environment variable
- Pods using a container image that has NVIDIA_VISIBLE_DEVICES set by default
Pods requesting GPU resources via nvidia.com/gpu in resources.limits are not affected.
The NVIDIA Device Plugin and ack-gpu-exporter both set NVIDIA_VISIBLE_DEVICES=all by default.
Fix:
- Quick workaround: Recreate the application pod. This is a temporary fix and the issue may recur. Assess the business impact before proceeding.
- If the pod uses NVIDIA_VISIBLE_DEVICES=all: Add privileged: true to the container's securityContext.
  Important: Granting privileged permissions introduces security risks. Prefer recreating the pod when possible.
    apiVersion: v1
    kind: Pod
    metadata:
      name: test-gpu-pod
    spec:
      containers:
        - name: test-gpu-pod
          image: centos:7
          command:
          - sh
          - -c
          - sleep 1d
          securityContext:
            # Add privileged permissions to the container.
            privileged: true
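To find which pods currently show the symptom, nvidia-smi can be run in each candidate pod and the output checked for the error text. This is a sketch under stated assumptions: the helper name pods_with_nvml_error is hypothetical, kubectl must be configured for the cluster, and each pod's default container must include nvidia-smi.

```shell
# Hypothetical helper: print the names of the given pods (in namespace $1)
# whose nvidia-smi call reports the NVML initialization error.
pods_with_nvml_error() {
  ns=$1
  shift
  for pod in "$@"; do
    if kubectl exec -n "$ns" "$pod" -- nvidia-smi 2>&1 |
        grep -qF 'Failed to initialize NVML'; then
      printf '%s\n' "$pod"
    fi
  done
}

# Usage (namespace and pod names are placeholders):
#   pods_with_nvml_error <namespace> <pod-1> <pod-2>
```

The pods it prints are candidates for the fixes listed above.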
How to prevent the /run/containerd/io.containerd.runtime.v2.task/k8s.io/<container ID>/log.json file from growing continuously on GPU nodes
Symptom: The /run/containerd/io.containerd.runtime.v2.task/k8s.io/<container ID>/log.json file grows continuously and consumes disk space.
Affected environment: Nodes where the nvidia-container-toolkit version is earlier than 1.16.2.
Cause: Frequent exec calls to a container — for example, from an exec probe — cause the NVIDIA container runtime to write an informational log entry per call.
Fix: Log on to the node. Change the log level from info to error and clear existing log content.
#!/bin/bash
set -e
export CONFIG=/etc/nvidia-container-runtime/config.toml
export CONTAINER_ROOT_PATH="/run/containerd/io.containerd.runtime.v2.task/k8s.io"
if [ -f "$CONFIG" ]; then
    # Change the log level in the nvidia-container-runtime configuration from "info" to "error".
    sed -i 's@^log-level = "info"@log-level = "error"@g' "$CONFIG"
    # Clear the content of each container's log.json file.
    find "$CONTAINER_ROOT_PATH" -mindepth 2 -maxdepth 2 -name log.json -type f -exec sh -c ': > "$1"' _ {} \;
fi
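Before applying the script, it can help to confirm that the node falls in the affected version range. A minimal sketch, assuming GNU sort with the -V option; the helper name toolkit_affected is hypothetical, and the rpm query in the usage comment assumes an RPM-based node where the package is named nvidia-container-toolkit.

```shell
# Hypothetical helper: succeeds (exit 0) when version $1 is earlier than
# 1.16.2, the nvidia-container-toolkit version named above.
toolkit_affected() {
  [ "$1" != "1.16.2" ] &&
    [ "$(printf '%s\n%s\n' "$1" 1.16.2 | sort -V | head -n 1)" = "$1" ]
}

# Usage on an RPM-based node (package name is an assumption):
#   toolkit_affected "$(rpm -q --qf '%{VERSION}' nvidia-container-toolkit)" \
#     && echo "affected: lower the runtime log level"
```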