GPU FAQ

Resolve common GPU issues in ACK clusters, including driver setup, NVML errors, and node management.

Problem categories:

  • GPU errors and troubleshooting: GPU driver issues, monitoring tools such as DCGM and Prometheus, and runtime errors such as NVML initialization failures and XID errors.

  • cGPU (containerized GPU) issues: cGPU configuration, startup, runtime errors, and kernel module permission issues.

  • GPU node and cluster management: Cluster-level operations, including GPU card usage detection, virtualization support, node maintenance such as kernel upgrades, and faulty card isolation.

Why are the GPU ECC configurations in my cluster inconsistent?

Error-Correcting Code (ECC) mode detects and corrects GPU memory errors, improving stability and reliability at the cost of slightly reduced available GPU memory. ACK does not enforce uniform ECC settings, so configurations can differ between nodes.

When to enable or disable ECC:

  • Disable ECC: Cost-sensitive workloads and low-latency inference, such as online real-time inference.

  • Enable ECC: Workloads requiring data consistency and integrity, such as database servers, financial systems, scientific computing, and high-performance computing (HPC).

Set the ECC mode for a GPU node:

  1. Check the current ECC status.

    nvidia-smi

    Expected output:

    Fri Jun  6 11:49:05 2025
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla T4                       On  | 00000000:00:08.0 Off |                    0 |
    | N/A   31C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    
    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+

    In the Volatile Uncorr. ECC column, a number such as 0 is the count of uncorrected errors and indicates that ECC is enabled; Off indicates that ECC is disabled.

  2. Enable or disable ECC as needed.

    • Enable ECC for all GPUs on the node: nvidia-smi -e 1

    • Disable ECC for all GPUs on the node: nvidia-smi -e 0

  3. Restart the operating system for the change to take effect.

    Important

    Save all necessary data before restarting the node.

  4. Confirm the new ECC status with nvidia-smi. The following output shows ECC disabled:

    Fri Jun  6 11:52:15 2025
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla T4                       On  | 00000000:00:08.0 Off |                  Off |
    | N/A   31C    P8               9W /  70W |      0MiB / 16384MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    
    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+
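
The table output above shows the ECC state only indirectly. As a hedged alternative, the following sketch reads the current and pending ECC modes through the nvidia-smi CSV query interface, assuming the installed driver exposes the ecc.mode.current and ecc.mode.pending query fields. The helper name ecc_report is illustrative and is not part of nvidia-smi or ACK.

```shell
#!/bin/sh
# ecc_report is a hypothetical helper, not an nvidia-smi or ACK command.
# It reads CSV lines of the form "index, name, current, pending" on stdin.
ecc_report() {
    while IFS=, read -r idx name current pending; do
        # nvidia-smi puts one space after each comma; strip it.
        name=${name# }
        current=${current# }
        pending=${pending# }
        if [ "$current" = "$pending" ]; then
            echo "GPU $idx ($name): ECC $current"
        else
            echo "GPU $idx ($name): ECC $current, $pending after reboot"
        fi
    done
}

# Run the query only where nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.mode.pending \
        --format=csv,noheader | ecc_report
fi
```

A pending mode that differs from the current mode suggests that the change made in step 2 is still waiting for the restart in step 3.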

Does ACK support vGPU-accelerated instances?

vGPU-accelerated instances require a GRID License from NVIDIA and a license server. Alibaba Cloud does not provide license servers, so vGPU-accelerated instances cannot be used directly, even in a vGPU-accelerated cluster. The ACK console no longer supports selecting vGPU-accelerated instances as cluster nodes.

Unsupported instance prefixes include ecs.vgn5i, ecs.vgn6i, ecs.vgn7i, and ecs.sgn7i. To use these instances, purchase a GRID License from NVIDIA and build your own license server on an Elastic Compute Service (ECS) instance by following the official NVIDIA tutorial. See NVIDIA.

The license server is also required to update the NVIDIA driver license for vGPU-accelerated instances.

If you have a license server, add vGPU-accelerated instances to an ACK cluster as follows:

  1. Go to Privilege Quota and request the custom OS image feature.

  2. Create a custom OS image based on CentOS 7.x or Alibaba Cloud Linux 2 with the NVIDIA GRID driver and GRID License configured. See Create a custom image from an instance and Install a GRID driver on a vGPU-accelerated instance (Linux).

  3. Create a node pool. See Create and manage a node pool.

  4. Add the vGPU-accelerated instances to the node pool. See Add existing nodes.

What's next: See Update the NVIDIA driver license for vGPU-accelerated (vGPU) instances in an ACK cluster.

How to manually upgrade the kernel on a GPU node in an existing cluster

After upgrading the kernel, reinstall the NVIDIA driver to restore GPU functionality.

Note

Upgrade the kernel only if its version is earlier than 3.10.0-957.21.3.

This procedure does not cover the kernel upgrade itself. It only describes the NVIDIA driver reinstall required after the kernel is upgraded.
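
The version threshold in the note can be checked on the node before starting. The following is a minimal sketch, assuming GNU sort with the -V option is available and that the node reports a CentOS-style release string such as 3.10.0-957.21.3.el7.x86_64. The helper name kernel_older_than is hypothetical.

```shell
#!/bin/sh
# kernel_older_than is a hypothetical helper. It succeeds when version $1
# sorts strictly before version $2 under "sort -V" ordering.
kernel_older_than() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Drop a distribution suffix such as ".el7.x86_64" before comparing
# (assumption: the node uses CentOS-style release strings).
current=$(uname -r | sed 's/\.el[0-9].*//')

if kernel_older_than "$current" "3.10.0-957.21.3"; then
    echo "kernel $current is older than 3.10.0-957.21.3: upgrade applies"
else
    echo "kernel $current does not need this upgrade"
fi
```

Other distributions use different suffixes, so the sed expression would need adjusting there.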

  1. Cordon the GPU node to mark it as unschedulable. This example uses the node cn-beijing.i-2ze19qyi8votgjz12345.

    kubectl cordon cn-beijing.i-2ze19qyi8votgjz12345
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 already cordoned
  2. Drain the GPU node.

    kubectl drain cn-beijing.i-2ze19qyi8votgjz12345 --grace-period=120 --ignore-daemonsets=true
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 cordoned
    WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
    pod/nginx-ingress-controller-78d847fb96-5fkkw evicted
  3. Uninstall the current NVIDIA driver.

    Note

    This example uses driver version 384.111. For a different version, download the matching package from NVIDIA and replace the version number.

    1. Log on to the GPU node and check the driver version using nvidia-smi.

      sudo nvidia-smi -a | grep 'Driver Version'
      Driver Version                      : 384.111
    2. Download the NVIDIA driver installation package.

      cd /tmp/ && sudo curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run

      Note

      You must use the installation package to uninstall the NVIDIA driver.

    3. Uninstall the driver.

      sudo chmod u+x NVIDIA-Linux-x86_64-384.111.run
      sudo sh ./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
  4. Upgrade the kernel.

  5. Restart the GPU instance.

    sudo reboot
  6. Log on to the GPU node again and install the kernel-devel package.

    sudo yum install -y kernel-devel-$(uname -r)
  7. Download and install the required NVIDIA driver from the NVIDIA website. This example uses version 410.79.

    cd /tmp/
    sudo curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run
    sudo chmod u+x NVIDIA-Linux-x86_64-410.79.run
    sudo sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q
    
    # warm up GPU
    sudo nvidia-smi -pm 1 || true
    sudo nvidia-smi -acp 0 || true
    sudo nvidia-smi --auto-boost-default=0 || true
    sudo nvidia-smi --auto-boost-permission=0 || true
    sudo nvidia-modprobe -u -c=0 -m || true
  8. Verify /etc/rc.d/rc.local contains the following configuration. Add it if missing.

    sudo nvidia-smi -pm 1 || true
    sudo nvidia-smi -acp 0 || true
    sudo nvidia-smi --auto-boost-default=0 || true
    sudo nvidia-smi --auto-boost-permission=0 || true
    sudo nvidia-modprobe -u -c=0 -m || true
  9. Restart kubelet and Docker.

    sudo service kubelet stop
    sudo service docker restart
    sudo service kubelet start
  10. Set the GPU node back to schedulable.

     kubectl uncordon cn-beijing.i-2ze19qyi8votgjz12345
    
     node/cn-beijing.i-2ze19qyi8votgjz12345 already uncordoned
  11. Verify the driver version in the device plugin pod on the GPU node.

    If docker ps shows no containers running on the GPU node, see Fix container startup issues on GPU nodes.

     kubectl exec -n kube-system -t nvidia-device-plugin-cn-beijing.i-2ze19qyi8votgjz12345 -- nvidia-smi
     Thu Jan 17 00:33:27 2019
     +-----------------------------------------------------------------------------+
     | NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |===============================+======================+======================|
     |   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
     | N/A   27C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
     +-------------------------------+----------------------+----------------------+
    
     +-----------------------------------------------------------------------------+
     | Processes:                                                       GPU Memory |
     |  GPU       PID   Type   Process name                             Usage      |
     |=============================================================================|
     |  No running processes found                                                 |
     +-----------------------------------------------------------------------------+
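
Step 8 asks for the configuration to be added to /etc/rc.d/rc.local only if it is missing. One way to make that step repeatable is sketched below. The helper names ensure_line and ensure_gpu_warmup are hypothetical, and the sketch assumes that an exact whole-line match is a sufficient test for "already present".

```shell
#!/bin/sh
# ensure_line is a hypothetical helper: append line "$2" to file "$1"
# unless an identical line is already there.
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# ensure_gpu_warmup (also hypothetical) applies ensure_line to each of the
# five lines listed in step 8.
ensure_gpu_warmup() {
    while IFS= read -r line; do
        ensure_line "$1" "$line"
    done <<'EOF'
sudo nvidia-smi -pm 1 || true
sudo nvidia-smi -acp 0 || true
sudo nvidia-smi --auto-boost-default=0 || true
sudo nvidia-smi --auto-boost-permission=0 || true
sudo nvidia-modprobe -u -c=0 -m || true
EOF
}

# On the GPU node, as root:
#   ensure_gpu_warmup /etc/rc.d/rc.local
```

Running the function a second time leaves the file unchanged, so it is safe to repeat after later driver reinstalls.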

Fix container startup issues on GPU nodes

Symptom: After restarting kubelet and Docker on a GPU node, no containers start.

sudo service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
sudo service docker stop
Redirecting to /bin/systemctl stop docker.service
sudo service docker start
Redirecting to /bin/systemctl start docker.service
sudo service kubelet start
Redirecting to /bin/systemctl start kubelet.service

sudo docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Cause: The cgroup driver used by Docker does not match the one used by kubelet. Check the cgroup driver for Docker:

sudo docker info | grep -i cgroup
Cgroup Driver: cgroupfs

Fix: If the output shows cgroupfs, follow these steps.

  1. Back up /etc/docker/daemon.json, then update it with the following configuration.

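    # Keep a copy of the current file, then write the new one with root privileges.
    sudo cp -a /etc/docker/daemon.json /etc/docker/daemon.json.bak
    sudo tee /etc/docker/daemon.json >/dev/null <<-EOF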
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
            "max-size": "100m",
            "max-file": "10"
        },
        "oom-score-adjust": -1000,
        "storage-driver": "overlay2",
        "storage-opts": ["overlay2.override_kernel_check=true"],
        "live-restore": true
    }
    EOF
  2. Restart Docker and kubelet.

    sudo service kubelet stop
    Redirecting to /bin/systemctl stop kubelet.service
    sudo service docker restart
    Redirecting to /bin/systemctl restart docker.service
    sudo service kubelet start
    Redirecting to /bin/systemctl start kubelet.service
  3. Confirm the Cgroup driver is now systemd.

    sudo docker info | grep -i cgroup
    Cgroup Driver: systemd
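
The check in step 3 can be scripted. The following sketch parses the "Cgroup Driver:" line of the docker info output shown above. The helper name docker_cgroup_driver is hypothetical, and the expectation that the driver should be systemd is taken from this fix rather than from any general rule.

```shell
#!/bin/sh
# docker_cgroup_driver is a hypothetical helper that extracts the value of
# the "Cgroup Driver:" line from "docker info" text on stdin.
docker_cgroup_driver() {
    sed -n 's/^ *Cgroup Driver: *//p' | head -n 1
}

# Run the check only where the docker CLI is installed.
if command -v docker >/dev/null 2>&1; then
    driver=$(docker info 2>/dev/null | docker_cgroup_driver)
    if [ "$driver" = "systemd" ]; then
        echo "Docker cgroup driver is systemd"
    else
        echo "Docker cgroup driver is '${driver:-unknown}', expected systemd"
    fi
fi
```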

What do I do if adding an ECS Bare Metal Instance node fails?

Symptom: Adding an ecs.ebmgn7 ECS Bare Metal Instance node to a cluster fails.

Cause: ECS Bare Metal Instances (ecs.ebmgn7) support multi-instance GPU (MIG). ACK resets existing MIG settings when adding these nodes to prevent conflicts. If the reset times out, node addition fails.

Diagnose: Check the ACK deployment log on the node host.

sudo cat /var/log/ack-deploy.log

If the log shows this error, the MIG reset timed out:

command timeout: timeout 300 nvidia-smi --gpu-reset

Fix: Add the node again. See Add existing nodes.
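
The diagnosis above amounts to a search for one fixed string in the deployment log. A minimal sketch follows, assuming the log path and message are exactly as quoted in this section. The helper name mig_reset_timed_out is hypothetical.

```shell
#!/bin/sh
# mig_reset_timed_out is a hypothetical helper. It succeeds when the given
# log file contains the timeout message quoted above.
mig_reset_timed_out() {
    grep -qF 'command timeout: timeout 300 nvidia-smi --gpu-reset' "$1" 2>/dev/null
}

LOG=/var/log/ack-deploy.log
if mig_reset_timed_out "$LOG"; then
    echo "MIG reset timed out: add the node again"
else
    echo "no MIG reset timeout found in $LOG"
fi
```

Reading the log may require root privileges on the node, as in the sudo cat command above.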

What do I do if "Failed to initialize NVML: Unknown Error" occurs when running a GPU container on Alibaba Cloud Linux 3?

Symptom: Running nvidia-smi in a GPU container returns:

sudo nvidia-smi

Failed to initialize NVML: Unknown Error

Cause: Running systemctl daemon-reload or systemctl daemon-reexec on Alibaba Cloud Linux 3 updates cgroup configurations, which interferes with NVIDIA Management Library (NVML) access in containers. See community issues #1671 and #48.

Fix: Apply one of the following solutions based on your setup.

  • Using NVIDIA_VISIBLE_DEVICES=all: Add privileged: true to the container's securityContext.

    apiVersion: v1
    kind: Pod
    metadata:
      name: test-gpu-pod
    spec:
      containers:
        - name: test-gpu-pod
          image: centos:7
          command:
          - sh
          - -c
          - sleep 1d
          securityContext: # Add privileged permissions to the container.
            privileged: true
  • Using shared GPU scheduling: Switch to Alibaba Cloud Linux 2 or CentOS 7.

  • Quick workaround: Recreate the application pod. This is a temporary fix—the issue may recur. Assess the business impact before proceeding.

  • If none of the above apply: Evaluate whether your workload can run on a different operating system, such as Alibaba Cloud Linux 2 or CentOS 7.

What do I do if a GPU card becomes unavailable due to XID 119 or XID 120 errors?

Symptom: A GPU card fails to initialize. After you run sh nvidia-bug-report.sh, the generated log shows XID 119 or XID 120 errors.

For other XID errors, see NVIDIA Common XID Errors.

Cause: An exception in the GPU System Processor (GSP) component.

Fix: First, update the NVIDIA driver to the latest version. If the issue persists, disable GSP. NVIDIA introduced GSP in driver version 510. See Chapter 42. GSP firmware.
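
Before changing the driver or GSP settings, it can help to confirm that the node's kernel log really contains XID 119 or XID 120. The sketch below is one way to do that, assuming the driver writes XID events in the usual "NVRM: Xid (PCI:...): <code>, ..." shape. The helper names xid_codes and has_gsp_xid are hypothetical.

```shell
#!/bin/sh
# xid_codes is a hypothetical helper. It prints the numeric code from each
# kernel log line shaped like "NVRM: Xid (PCI:...): <code>, ...".
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# has_gsp_xid (also hypothetical) succeeds when stdin contains at least one
# XID 119 or XID 120 line.
has_gsp_xid() {
    xid_codes | grep -qxE '119|120'
}

# On the GPU node (reading the kernel log may require root):
#   dmesg | has_gsp_xid && echo "XID 119/120 found"
```

If the log format on a given driver version differs, the sed expression is the only part that needs adjusting.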

Disable GSP based on your scenario:

Scaling out new nodes

Create a node pool or edit an existing one. In the advanced configuration, add the label ack.aliyun.com/disable-nvidia-gsp=true. ACK automatically disables GSP on new nodes added to this pool.

See Create and manage a node pool.

Disabling GSP may increase node scale-out time.

Adding existing nodes

  1. Create a node pool or edit an existing one. Add the label ack.aliyun.com/disable-nvidia-gsp=true to the node pool's advanced configuration. ACK automatically disables GSP when existing nodes are added. See Create and manage a node pool.

    Disabling GSP may increase the time to add nodes.

  2. Add existing nodes to the node pool. See Add existing nodes.

Managing existing nodes in a cluster

Option 1: Use a node pool label

  1. Add the label ack.aliyun.com/disable-nvidia-gsp=true to the node's node pool. See Edit a node pool.

  2. Remove the node from the cluster without releasing the ECS instance. See Remove a node from a cluster or node pool.

  3. Re-add the node to the cluster as an existing node. See Add existing nodes.

Option 2: Manually disable GSP on the node

If you cannot remove and re-add the node, log on to the node and manually disable GSP. See FAQ.

When upgrading from driver 470 to 525, disable GSP for 525. Version 470 lacks GSP, but 525 may trigger a GSP bug. After upgrading, follow the FAQ steps to manually disable GSP.

How to manually isolate a faulty GPU card in a cluster

Under shared GPU scheduling, a faulty GPU can cause repeated job failures. Mark the GPU as unhealthy to exclude it from scheduling.

Prerequisites:

  • For clusters running Kubernetes 1.24 or later: scheduler version 1.xx.x-aliyun-6.4.3.xxx or later.

  • For clusters running Kubernetes 1.22: scheduler version 1.22.15-aliyun-6.2.4.xxx or later.

  • Shared GPU scheduling is enabled.

Submit the following ConfigMap. Replace <node-name> with the actual node name, and set deviceId to the GPU index from nvidia-smi.

apiVersion: v1
kind: ConfigMap
metadata:
  name: <node-name>-device-status   # Replace <node-name> with the actual node name.
  namespace: kube-system
data:
  devices: |
    - deviceId: 0          # Run nvidia-smi to get the GPU index.
      deviceType: gpu
      healthy: false

The ConfigMap must be in the kube-system namespace with the name format <node-name>-device-status. In the data field, deviceId is the GPU index from nvidia-smi, deviceType is gpu, and healthy is false. Once submitted, the scheduler stops allocating work to that GPU.
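
When several nodes or cards need isolating, the ConfigMap can be generated from the node name and GPU index. The following sketch prints the same manifest as above. The helper name device_status_configmap is hypothetical, and piping the result to kubectl apply is an assumption about how you submit manifests.

```shell
#!/bin/sh
# device_status_configmap is a hypothetical helper that prints the ConfigMap
# for node "$1" and GPU index "$2" in the format described above.
device_status_configmap() {
    cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: $1-device-status
  namespace: kube-system
data:
  devices: |
    - deviceId: $2
      deviceType: gpu
      healthy: false
EOF
}

# Example (node name reused from the kernel upgrade procedure in this FAQ):
#   device_status_configmap cn-beijing.i-2ze19qyi8votgjz12345 0 | kubectl apply -f -
```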

Resolve the "Failed to initialize NVML: Unknown Error" message in GPU containers

Symptom: Running nvidia-smi in a GPU container returns:

sudo nvidia-smi

Failed to initialize NVML: Unknown Error

This issue affects nodes running Ubuntu 22.04 or Red Hat Enterprise Linux (RHEL) 9.3 64-bit.

Cause: Running systemctl daemon-reload or systemctl daemon-reexec on the node updates cgroup configurations, which cuts off NVML access for affected containers.

Affected pods:

  • Pods that specify aliyun.com/gpu-mem in resources.limits

  • Pods that set NVIDIA_VISIBLE_DEVICES as a container environment variable

  • Pods using a container image that has NVIDIA_VISIBLE_DEVICES set by default

Pods requesting GPU resources via nvidia.com/gpu in resources.limits are not affected.
The NVIDIA Device Plugin and ack-gpu-exporter both set NVIDIA_VISIBLE_DEVICES=all by default.

Fix:

  • Quick workaround: Recreate the application pod. This is a temporary fix—the issue may recur. Assess the business impact before proceeding.

  • If the pod uses NVIDIA_VISIBLE_DEVICES=all: Add privileged: true to the container's securityContext.

    Important

    Granting privileged permissions introduces security risks. Prefer recreating the pod when possible.

    apiVersion: v1
    kind: Pod
    metadata:
      name: test-gpu-pod
    spec:
      containers:
        - name: test-gpu-pod
          image: centos:7
          command:
          - sh
          - -c
          - sleep 1d
          securityContext: # Add privileged permissions to the container.
            privileged: true
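
To decide which pods need recreating, the error string can be detected from outside the container. A minimal sketch follows, assuming nvidia-smi is available inside the container image. The helper name has_nvml_error is hypothetical.

```shell
#!/bin/sh
# has_nvml_error is a hypothetical helper. It succeeds when stdin contains
# the NVML initialization error quoted above.
has_nvml_error() {
    grep -qF 'Failed to initialize NVML: Unknown Error'
}

# Example against a running pod (test-gpu-pod is the sample pod name above):
#   kubectl exec test-gpu-pod -- nvidia-smi 2>&1 | has_nvml_error \
#       && echo "NVML access lost: recreate the pod"
```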

How to prevent the /run/containerd/io.containerd.runtime.v2.task/k8s.io/<container ID>/log.json file from growing continuously on GPU nodes

Symptom: The /run/containerd/io.containerd.runtime.v2.task/k8s.io/<container ID>/log.json file grows continuously and consumes disk space.

Affected environment: Nodes where the nvidia-container-toolkit version is earlier than 1.16.2.

Cause: Frequent exec calls to a container — for example, from an exec probe — cause the NVIDIA container runtime to write an informational log entry per call.

Fix: Log on to the node. Change the log level from info to error and clear existing log content.

#!/bin/bash
set -e

export CONFIG=/etc/nvidia-container-runtime/config.toml
export CONTAINER_ROOT_PATH="/run/containerd/io.containerd.runtime.v2.task/k8s.io"

if [ -f "$CONFIG" ]; then
    # Change the log level in the nvidia-container-runtime configuration from "info" to "error".
    sed -i 's@^log-level = "info"@log-level = "error"@g' "$CONFIG"
    # Clear the content of each container's log.json file.
    # The path is passed as an argument so that unusual characters in it are handled safely.
    find "$CONTAINER_ROOT_PATH" -mindepth 2 -maxdepth 2 -name log.json -type f -exec sh -c 'echo "" > "$1"' _ {} \;
fi
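
Before clearing anything, it can be useful to see which containers are affected. The sketch below lists every per-container log.json with its size, using the same directory layout as the script above. The helper name list_runtime_logs is hypothetical.

```shell
#!/bin/sh
# list_runtime_logs is a hypothetical helper. It prints the size in KiB and
# the path of every per-container log.json under the directory "$1".
list_runtime_logs() {
    find "$1" -mindepth 2 -maxdepth 2 -name log.json -type f -exec du -k {} +
}

ROOT=/run/containerd/io.containerd.runtime.v2.task/k8s.io
if [ -d "$ROOT" ]; then
    # Largest files last.
    list_runtime_logs "$ROOT" | sort -n
fi
```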