Configure and manage the ack-nvidia-device-plugin add-on

The NVIDIA Device Plugin is a GPU device plugin for Kubernetes clusters that manages GPUs on each node, allowing Kubernetes to use GPU resources more efficiently. This topic describes how to upgrade and restart the NVIDIA Device Plugin, isolate GPU devices, and check its version on Alibaba Cloud Container Service for Kubernetes (ACK) nodes in an exclusive GPU scheduling scenario.

Usage notes

When you deploy the NVIDIA Device Plugin as a DaemonSet, note the following:

  • The add-on is automatically installed when you create a cluster.

  • If you uninstall this add-on, GPU-accelerated nodes that are later added by a scale-out cannot report their GPU resources correctly.

  • Upgrading a cluster from an earlier version to 1.32 also converts the NVIDIA Device Plugin from a static pod to an ACK add-on.

  • The DaemonSet uses a NodeSelector (ack.node.gpu.schedule=default). When a GPU-accelerated node is added to the cluster, the ACK script that adds the node automatically attaches this label to the GPU-accelerated node. The DaemonSet then deploys its pod on the GPU-accelerated node.

Important
  • If a node runs Ubuntu 22.04 or Red Hat Enterprise Linux (RHEL) 9.3 64-bit, the NVIDIA Device Plugin may not work correctly after you run the systemctl daemon-reload or systemctl daemon-reexec command. This issue occurs because the ack-nvidia-device-plugin component sets the NVIDIA_VISIBLE_DEVICES=all environment variable for pods by default, which can make the GPU device inaccessible. For more information, see What do I do if the "Failed to initialize NVML: Unknown Error" error occurs when I run a GPU container?.

  • If you upgrade a cluster from a version earlier than 1.32 to version 1.32 before May 1, 2025, the cluster may contain NVIDIA Device Plugins deployed as both static pods and DaemonSets. You can run the following script to find nodes where the NVIDIA Device Plugin is deployed as a static pod.

    #!/bin/bash
    # Loop through all pods with the component=nvidia-device-plugin label in the kube-system namespace.
    for i in $(kubectl get po -n kube-system -l component=nvidia-device-plugin | grep -v NAME | awk '{print $1}'); do
        # Check if the pod's configuration source is 'file', which indicates a static pod.
        if kubectl get po "$i" -o yaml -n kube-system | grep 'kubernetes.io/config.source: file' &> /dev/null; then
            # Print the node name if it is a static pod.
            kubectl get pod "$i" -n kube-system -o jsonpath='{.spec.nodeName}{"\n"}'
        fi
    done

    Expected output:

    cn-beijing.10.12.XXX.XX
    cn-beijing.10.13.XXX.XX

    The output shows that some nodes still have the NVIDIA Device Plugin deployed as a static pod. You can use the following command to migrate the NVIDIA Device Plugin from a static pod to a DaemonSet.

    kubectl label nodes <NODE_NAME> ack.node.gpu.schedule=default
  • When you upgrade a node pool in a cluster of version 1.31 or earlier, the process also upgrades the NVIDIA Device Plugin and resets any of its non-standard configurations.

Version differences

The implementation and management strategies for the ack-nvidia-device-plugin component vary depending on the cluster version, as described in the following table.

Feature                      Version 1.32 and later             Versions 1.20 to 1.31
Deployment method            DaemonSet                          Static pod
Management method            Add-ons page in the ACK console    Manual
Node label requirement       ack.node.gpu.schedule=default      None
Node pool upgrade strategy   Manual upgrade                     Automatic upgrade

If your cluster version is earlier than 1.20, we recommend that you manually upgrade the cluster.

Prerequisites

Check the NVIDIA device plugin version

Version 1.32 and later

For add-ons deployed as a DaemonSet, go to the Add-ons page in the ACK console, find the ack-nvidia-device-plugin add-on, and view its version on the add-on card.

Versions 1.20 to 1.31

For add-ons deployed as a static pod, run the following command to check the add-on version.

kubectl get pods -n kube-system -l component=nvidia-device-plugin \
  -o jsonpath='{range .items[*]}{.spec.containers[0].image}{"\t"}{.spec.nodeName}{"\n"}{end}' \
  | awk -F'[:/]' '{split($NF, a, "-"); print a[1] "\t" $0}' \
  | sort -k1,1V \
  | cut -f2- \
  | awk -F'\t' '{n = split($1, img, ":"); print img[n] "\t" $2}'

Upgrade the NVIDIA device plugin

  1. Upgrade the ack-nvidia-device-plugin add-on.

    Version 1.32 and later

    1. Log on to the ACK console. In the left navigation pane, click Clusters.

    2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Add-ons.

    3. On the Add-ons page, find the ack-nvidia-device-plugin card and click Upgrade.

    4. In the dialog box that appears, click OK.

    Versions 1.20 to 1.31

    1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Nodes.

    2. Select the GPU-accelerated nodes that you want to manage, click Batch Operations at the bottom of the node list, select Run Shell Scripts in the Batch Operations dialog box, and then click OK.

      Important

      We recommend that you first upgrade the GPU device plugin on a small number of GPU-accelerated nodes. After you verify that the plugin works as expected, perform the upgrade on the remaining nodes.

    3. You are redirected to the CloudOps Orchestration Service (OOS) console. Set Execution Mode to Paused Upon Failure and click Next: Parameter Settings.

    4. On the parameter settings page, select Run Shell Script and paste the following sample script.

      Note
      • Set the RUN_PKG_VERSION parameter in the script to the cluster version in major.minor form (for example, 1.30). Do not include the patch version (for example, 1.30.1). Otherwise, the script reports an error.

      • In the script, change the REGION_ID parameter to the region ID of the current cluster, for example, cn-beijing.

      #!/bin/bash
      set -xe
      
      # Set the Kubernetes version in major.minor form (for example, 1.30).
      RUN_PKG_VERSION=1.30
      # Set the region ID.
      REGION_ID=cn-beijing
      
      function update_device_plugin() {
      	base_dir=/tmp/update_device_plugin
      	rm -rf $base_dir
      	mkdir -p $base_dir
      	cd $base_dir
      	region_id=$REGION_ID
      	PKG_URL=https://aliacs-k8s-${region_id}.oss-${region_id}-internal.aliyuncs.com/public/pkg/run/run-${RUN_PKG_VERSION}.tar.gz
      	curl -sSL --retry 3 --retry-delay 2 -o run.tar.gz $PKG_URL
      	tar -xf run.tar.gz
      
      	local dir=pkg/run/$RUN_PKG_VERSION/module
      	# Replace the image registry address.
      	sed -i "s@registry.cn-hangzhou.aliyuncs.com/acs@registry-${region_id}-vpc.ack.aliyuncs.com/acs@g" $dir/nvidia-device-plugin.yml
      	mkdir -p /etc/kubernetes/device-plugin-backup
      	mkdir -p /etc/kubernetes/manifests
      	# Back up the old configuration file.
      	mv  /etc/kubernetes/manifests/nvidia-device-plugin.yml /etc/kubernetes/device-plugin-backup/nvidia-device-plugin.yml.$(date +%s)
      	sleep 5
      	# Copy the new configuration file.
      	cp -a $dir/nvidia-device-plugin.yml /etc/kubernetes/manifests
      	echo "succeeded to update device plugin"
      }
      
      # Check if the nvidia-device-plugin.yml file exists.
      if [ -f /etc/kubernetes/manifests/nvidia-device-plugin.yml ]; then
      	update_device_plugin
      else
      	echo "skip to update device plugin"
      fi
    5. Click Next: Confirm. After you confirm the information, click Create.

      After the task is created, you are automatically redirected to the Task Execution Management page, where you can view the task's execution status. The update is successful if the Output contains the message succeeded to update device plugin.
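The RUN_PKG_VERSION value used in the script above can be derived from a full version string instead of being typed by hand. The following is a minimal sketch, not part of the ACK tooling: the helper name to_run_pkg_version is hypothetical, and using the kubelet version of a node as the input is an assumption (it can differ from the cluster version while an upgrade is in progress).

```shell
#!/bin/bash
# Hypothetical helper: reduce a full Kubernetes version string such as
# "v1.30.1-aliyun.1" to the "1.30" form that RUN_PKG_VERSION expects.
to_run_pkg_version() {
    local version="${1#v}"          # drop a leading "v" if present
    local major="${version%%.*}"    # text before the first dot
    local rest="${version#*.}"      # text after the first dot
    local minor="${rest%%[!0-9]*}"  # leading digits of the remainder

    # Reject input without a dot, or with a non-numeric major or minor part.
    case "$major" in ''|*[!0-9]*) major="" ;; esac
    if [ -z "$major" ] || [ -z "$minor" ] || [ "$rest" = "$version" ]; then
        echo "cannot parse version: $1" >&2
        return 1
    fi
    echo "${major}.${minor}"
}

# Possible usage against a live cluster (assumption: the node's kubelet
# version matches the cluster version):
# to_run_pkg_version "$(kubectl get node <NODE_NAME> -o jsonpath='{.status.nodeInfo.kubeletVersion}')"
```

The helper only reformats the string. Confirm that the result matches the cluster version shown in the ACK console before you run the upgrade script.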

  2. Check whether the add-on runs as expected.

    Run the following commands to check if the GPU device plugin is working correctly on the GPU-accelerated node.

    1. Use kubectl to connect to the cluster in Workbench or CloudShell.

    2. Run the following command to check if the NVIDIA Device Plugin has restarted:

      kubectl get po -n kube-system -l component=nvidia-device-plugin 

      The AGE column in the sample output shows whether the pod has restarted.

      NAME                             READY   STATUS    RESTARTS      AGE
      nvidia-device-plugin-xxxx        1/1     Running   1             1m
    3. After all pods restart, run the following script to check if the nodes are reporting GPU resources:

      #!/bin/bash
      
      # Get all NVIDIA Device Plugin pods and their corresponding nodes.
      PODS=$(kubectl get po -n kube-system -l component=nvidia-device-plugin -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}')
      
      # Iterate through each pod's node.
      echo "$PODS" | while IFS=$'\t' read -r pod_name node_name; do
          # Get the allocatable nvidia.com/gpu resource value for the node.
          gpu_allocatable=$(kubectl get node "$node_name" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}' 2>/dev/null)
      
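          # Skip empty input lines, then check if the resource value is 0 or is not reported at all.
          [ -z "$node_name" ] && continue
          if [ -z "$gpu_allocatable" ] || [ "$gpu_allocatable" == "0" ]; then
              echo "Error: node=$node_name, pod=$pod_name, resource(nvidia.com/gpu) is 0 or not reported"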
          fi
      done

      If a node reports zero resources, see Restart the NVIDIA Device Plugin.
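The same check can be done with a single kubectl call that lists every node together with its allocatable GPU count. The following is a minimal sketch under stated assumptions: the helper name nodes_without_gpus is hypothetical, and the label selector in the usage comment applies only to clusters of version 1.32 or later (omit it on earlier versions, where the label is not required).

```shell
#!/bin/bash
# Hypothetical helper: reads "NODE GPU_COUNT" lines on stdin and prints the
# nodes whose allocatable nvidia.com/gpu value is 0 or missing.
# kubectl prints <none> in a custom column when the field does not exist.
nodes_without_gpus() {
    awk 'NF >= 2 && ($2 == "0" || $2 == "<none>") { print $1 }'
}

# Possible usage against a live cluster:
# kubectl get nodes -l ack.node.gpu.schedule=default --no-headers \
#   -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#   | nodes_without_gpus
```

An empty result means that every listed node reports at least one GPU.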

Restart the NVIDIA device plugin

In exclusive GPU scheduling scenarios in ACK, the device plugin that reports GPU devices on a node is deployed as a pod by default. Therefore, you must perform the restart on the target node.

Version 1.32 and later

  1. Run the following command to find the device plugin pod on the corresponding node.

    kubectl get pod -n kube-system -l component=nvidia-device-plugin -o wide | grep <NODE>
  2. Run the following command to restart the corresponding device plugin pod.

    kubectl delete po <DEVICE_PLUGIN_POD> -n kube-system 
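Filtering with grep matches substrings, so a node name such as cn-beijing.10.0.0.1 also matches cn-beijing.10.0.0.10. The following sketch shows one way to match the node name exactly. The helper name device_plugin_pods_on_node is hypothetical and is not part of ACK.

```shell
#!/bin/bash
# Hypothetical helper: reads "POD NODE" lines on stdin and prints the pods
# that run on exactly the node passed as the first argument.
device_plugin_pods_on_node() {
    # Appending "" forces a string comparison in awk.
    awk -v node="$1" 'NF >= 2 && ($2 "") == (node "") { print $1 }'
}

# Possible usage against a live cluster:
# kubectl get pods -n kube-system -l component=nvidia-device-plugin --no-headers \
#   -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName' \
#   | device_plugin_pods_on_node <NODE>
```

The printed pod name can then be passed to the kubectl delete po command from step 2.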

Versions 1.20 to 1.31

  1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Node Pools.

  2. On the Node Pools page, click the node pool name to go to the node management page, and then log on to the target GPU-accelerated node.

    If the operating system is ContainerOS, direct user login is disabled by default to reduce potential security risks, and SSH login is not provided. If you still need to log on to an instance for O&M, see O&M for ContainerOS nodes.
  3. Select the GPU-accelerated nodes that you want to manage, click Batch Operations at the bottom of the node list, select Execute Shell Command in the Batch Operations dialog box, and then click OK.

    Important

    We recommend that you first restart the device plugin on a small number of GPU-accelerated nodes. After you verify that the plugin works as expected, perform the restart on the remaining nodes.

  4. You are redirected to the OOS console. Set Execution Mode to Paused Upon Failure and click Next: Parameter Settings.

  5. On the parameter settings page, select Run Shell Script and paste the following sample script.

    #!/bin/bash
    set -e
    
    # Check if the nvidia-device-plugin.yml file exists.
    if [ -f /etc/kubernetes/manifests/nvidia-device-plugin.yml ]; then
    	# Move the file to restart the static pod.
    	cp -a /etc/kubernetes/manifests/nvidia-device-plugin.yml /etc/kubernetes
    	rm -rf /etc/kubernetes/manifests/nvidia-device-plugin.yml
    	sleep 5
    	mv /etc/kubernetes/nvidia-device-plugin.yml /etc/kubernetes/manifests
    	echo "the nvidia device plugin is restarted"
    else
    	echo "no need to restart nvidia device plugin"
    fi
  6. Click Next: Confirm. After you confirm the information, click Create. You are redirected to the Task Execution Management page where you can view the task status.

  7. Run the following command to check if the GPU device plugin is working correctly on the GPU-accelerated node.

    kubectl get nodes <NODE_NAME> -o jsonpath='{.metadata.name} ==> nvidia.com/gpu: {.status.allocatable.nvidia\.com/gpu}'

    Expected output:

    cn-hangzhou.172.16.XXX.XX ==> nvidia.com/gpu: 1

    A non-zero value for the nvidia.com/gpu extended resource on a GPU-accelerated node indicates that the Device Plugin is working properly.

Modify the device plugin checkpoint key

When allocating a device to a pod, the device plugin creates a checkpoint file on the node that records which devices are allocated to which pods. By default, the NVIDIA Device Plugin uses the GPU's UUID as the unique key for each GPU device in the checkpoint file. You can change this key to the device index to resolve issues such as UUID loss after a VM cold migration.

Version 1.32 and later

  1. Run the following command to edit the NVIDIA Device Plugin DaemonSet.

    kubectl edit ds -n kube-system ack-nvidia-device-plugin
  2. Add the CHECKPOINT_DEVICE_ID_STRATEGY environment variable.

        env:
          - name: CHECKPOINT_DEVICE_ID_STRATEGY
            value: index
  3. Restart the NVIDIA Device Plugin to apply the changes.

Versions 1.20 to 1.31

  1. Check the image tag in the /etc/kubernetes/manifests/nvidia-device-plugin.yml file on the target node. The tag indicates the version of the device plugin. If the version is 0.9.3 or later, you do not need to change the image. Otherwise, update the image tag to v0.9.3-0dd4d5f5-aliyun.

  2. Modify the environment variables of the static pod in the /etc/kubernetes/manifests/nvidia-device-plugin.yml file. See the following code to add the CHECKPOINT_DEVICE_ID_STRATEGY environment variable.

        env:
          - name: CHECKPOINT_DEVICE_ID_STRATEGY
            value: index
  3. Restart the NVIDIA Device Plugin to apply the changes.
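Before you restart the plugin, you can confirm that the manifest contains the new environment variable. The following is a minimal sketch. The helper name has_index_checkpoint_key is hypothetical, and the check is a plain text match rather than a YAML parse: it assumes that the value line directly follows the name line, as in the snippets above.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when the manifest file passed as $1 contains
# an env entry named CHECKPOINT_DEVICE_ID_STRATEGY whose next non-blank line
# sets the value to index.
has_index_checkpoint_key() {
    awk '
        /name:[ \t]*CHECKPOINT_DEVICE_ID_STRATEGY[ \t]*$/ { pending = 1; next }
        pending && NF {
            if ($0 ~ /value:[ \t]*"?index"?[ \t]*$/) found = 1
            pending = 0
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Possible usage on a node that runs the plugin as a static pod:
# has_index_checkpoint_key /etc/kubernetes/manifests/nvidia-device-plugin.yml && echo "index key configured"
#
# Possible usage for the DaemonSet on version 1.32 and later:
# has_index_checkpoint_key <(kubectl get ds ack-nvidia-device-plugin -n kube-system -o yaml) && echo "index key configured"
```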

Enable GPU device isolation

Important

GPU device isolation is supported only on nvidia-device-plugin v0.9.1 or later. For more information, see Check the NVIDIA Device Plugin Version.
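A version requirement such as this one can be checked in a script by comparing the image tag with the minimum version. The following is a minimal sketch, assuming that tags follow the v0.9.3-0dd4d5f5-aliyun pattern shown elsewhere in this topic and that GNU sort (with the -V option) is available on the machine. The helper name tag_at_least is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when the image tag in $1 is at the version in
# $2 or later. A leading "v" and any "-suffix" (for example
# "-0dd4d5f5-aliyun") are ignored.
tag_at_least() {
    local have="${1#v}" want="${2#v}"
    have="${have%%-*}"
    want="${want%%-*}"
    # sort -V orders version strings. If the smaller of the two values is the
    # required version, the tag meets the requirement.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

# Example:
# tag_at_least v0.9.3-0dd4d5f5-aliyun 0.9.1 && echo "GPU device isolation is supported"
```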

In exclusive GPU scheduling scenarios in ACK, you may need to isolate a GPU device on a node due to a fault or other reasons. ACK provides a mechanism to manually isolate a device to prevent the scheduler from assigning new GPU application pods to that device. Follow these steps:

On the target node, manage the unhealthyDevices.json file in the /etc/nvidia-device-plugin/ directory. If this file does not exist, create it. The unhealthyDevices.json file must be in the following JSON format. The array can contain multiple device checkpoint keys. Add keys as needed.

{
  "index": ["x", "x"],
  "uuid": ["xxx", "xxx"]
}

You can enter the index or uuid of the target device for isolation in the JSON file. You only need to enter one for each device. The changes take effect automatically after you save the file.

After the configuration is complete, you can check the number of nvidia.com/gpu resources reported by the Kubernetes node to verify the isolation effect.
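The file can also be generated from a list of devices. The following is a minimal sketch. The helper names json_string_array and render_unhealthy_devices are hypothetical, and the sketch assumes that the plugin accepts an empty array for a key that you do not use, which this topic does not confirm.

```shell
#!/bin/bash
# Hypothetical helper: prints the space-separated words in $1 as a JSON array
# of strings. The input is split on whitespace, so items must not contain
# spaces, quotes, or glob characters.
json_string_array() {
    local out="" item
    for item in $1; do
        out="${out:+$out, }\"$item\""
    done
    printf '[%s]' "$out"
}

# Hypothetical helper: prints unhealthyDevices.json content.
# $1: space-separated device indexes, $2: space-separated GPU UUIDs.
render_unhealthy_devices() {
    printf '{\n  "index": %s,\n  "uuid": %s\n}\n' \
        "$(json_string_array "$1")" "$(json_string_array "$2")"
}

# Possible usage on the target node (overwrites any existing file):
# mkdir -p /etc/nvidia-device-plugin
# render_unhealthy_devices "0 3" "" > /etc/nvidia-device-plugin/unhealthyDevices.json
```

Because the redirect replaces the whole file, include every device that must stay isolated each time you regenerate it.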

Mount GDR devices and enable GPU memory copy acceleration

Requirements

Starting from v0.5.0, ack-nvidia-device-plugin supports mounting GDR devices into GPU containers. This feature requires the following:

  • The cluster version is 1.32 or later.

  • The node is a Lingjun node.

  • GDR software is installed on the node.

    You can log on to the node and run the lsmod | grep gdrdrv command to check whether the GDR kernel module is loaded. Sample output:
    gdrdrv                196608  0
    nvidia              14680064  88 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset
  • NVIDIA Container Toolkit 1.15.0 or later is installed on the node.

    To check the version, log on to the node and run the nvidia-container-cli --version command. Sample output:
    cli-version: 1.17.8
    lib-version: 1.17.8

Enable GDR acceleration

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Add-ons.

  3. On the Add-ons page, find the ack-nvidia-device-plugin card and click Configuration.

  4. In the dialog box that appears, select Enable GDRCopy to accelerate GPU memory copy (effective on Lingjun nodes with GDR software installed and nvidia-container-toolkit version >= 1.15.0).

Verify the result

  1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Deployments.

  2. On the Deployments page, click Create from YAML. Use the following sample code to create an application.

    apiVersion: v1
    kind: Pod
    metadata:
      name: tensorflow-mnist
      namespace: default
    spec:
      containers:
      - image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
        name: tensorflow-mnist
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            nvidia.com/gpu: 1  # Request one GPU card for this container.
        workingDir: /root
      restartPolicy: Always
  3. Run the following command to check whether the GDR device is mounted in the pod:

    kubectl exec -ti tensorflow-mnist -- ls /dev/gdrdrv

    Expected output:

    /dev/gdrdrv

    The output indicates that the gdrdrv device was mounted successfully.

FAQ

Disable the native GPU isolation feature

Background

When a GPU on a node becomes faulty, ACK automatically isolates the faulty GPU by using the NVIDIA Device Plugin to prevent the scheduler from assigning tasks to it. However, this automatic isolation does not perform an automatic repair. You still need to manually restart or repair the node. We recommend that you configure alerts for GPU exceptions to handle them promptly. For more information about the native GPU isolation feature of the NVIDIA Device Plugin, see k8s-device-plugin.

  • After isolation, if the node has insufficient GPUs for a task's requirements (for example, a task needs eight cards but only seven are available), the scheduler cannot place the task on that node, leaving GPU resources idle.

  • When the GPU status returns to normal, the system automatically removes the isolation from the device.

  • If you need to disable automatic isolation so that the node still reports resources for faulty GPUs, refer to the following solution.

Solution

  1. Confirm whether the native GPU isolation feature is enabled for the add-on.

    If it is not enabled, you do not need to perform the subsequent steps to disable it.

    Version 1.32 and later

    Log on to the ACK console and check the ack-nvidia-device-plugin version on the Add-ons page of the target cluster. If the version is v0.1.0, the GPU isolation feature is enabled.

    Versions 1.20 to 1.31

    1. Use kubectl to connect to the cluster in Workbench or CloudShell.

    2. Check if the NVIDIA Device Plugin has GPU isolation enabled.

      check_gpu_isolation_enabled() {
          echo "Checking nodes with NVIDIA Device Plugin deployed as static pods that have GPU isolation enabled (i.e., DP_DISABLE_HEALTHCHECKS is not set to 'all')..."
      
          # Get pod names with the expected label.
          pods=$(kubectl get pods -n kube-system -l component=nvidia-device-plugin --no-headers -o custom-columns=":metadata.name" 2>/dev/null)
      
          if [ -z "$pods" ]; then
              echo "No pods found with label 'component=nvidia-device-plugin'."
              return 0
          fi
      
          found=0
          while IFS= read -r pod; do
              [ -z "$pod" ] && continue
      
              # Check if it's a static pod: annotation kubernetes.io/config.source must be "file".
              config_source=$(kubectl get pod "$pod" -n kube-system -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.source}' 2>/dev/null)
      
              if [ "$config_source" != "file" ]; then
                  # Not a static pod (e.g., managed by a DaemonSet), skip.
                  continue
              fi
      
              # Get node name.
              node=$(kubectl get pod "$pod" -n kube-system -o jsonpath='{.spec.nodeName}' 2>/dev/null)
              [ -z "$node" ] && continue
      
              # Check if DP_DISABLE_HEALTHCHECKS is set to "all".
              disabled_value=$(kubectl get pod "$pod" -n kube-system -o jsonpath='{range .spec.containers[*]}{range .env[?(@.name=="DP_DISABLE_HEALTHCHECKS")]}{.value}{"\n"}{end}{end}' 2>/dev/null | head -n1)
      
              # If not set or not equal to "all", health checks are active, which means isolation is enabled.
              if [ -z "$disabled_value" ] || [ "$disabled_value" != "all" ]; then
                  echo "Node: $node, Static Pod: $pod"
                  found=1
              fi
          done <<< "$pods"
      
          if [ "$found" -eq 0 ]; then
              echo "No static pods found with GPU isolation enabled."
          else
              echo "Note: The above nodes will automatically isolate faulty GPUs via the NVIDIA Device Plugin."
          fi
      }
      
      check_gpu_isolation_enabled

      Expected output:

      Checking nodes with NVIDIA Device Plugin deployed as static pods that have GPU isolation enabled (i.e., DP_DISABLE_HEALTHCHECKS is not set to 'all')...
      Node: cn-beijing.192.168.XXX.XXX, Static Pod: nvidia-device-plugin-cn-beijing.192.168.XXX.XXX
      Note: The above nodes will automatically isolate faulty GPUs via the NVIDIA Device Plugin.

      If the output contains the line Note: The above nodes will automatically isolate faulty GPUs via the NVIDIA Device Plugin., the GPU isolation feature is enabled on the listed nodes.

  2. Disable the GPU isolation feature by upgrading the add-on.

    The native GPU isolation feature is disabled by default in the latest version of the NVIDIA Device Plugin add-on. Upgrade the NVIDIA Device Plugin to the latest version to apply this change.

Related topics