Use DRA to schedule GPUs

Deploy the NVIDIA DRA driver to dynamically allocate and share GPUs across pods in your ACK cluster.

How it works

Dynamic Resource Allocation (DRA) generalizes the PersistentVolume API to generic resources such as GPUs. Just as a PersistentVolumeClaim requests storage from a StorageClass, a ResourceClaim requests devices from a DeviceClass.

DRA enables more flexible, fine-grained resource allocation than device plugins:

Flexible device filtering: Use the Common Expression Language (CEL) to filter devices by attributes.
Device sharing: Share a GPU across containers or pods by referencing the same ResourceClaim.
Simplified pod requests: Specify resource requirements declaratively without per-container device counts.
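
As a sketch of the CEL filtering described above, the following writes a ResourceClaimTemplate that matches only GPUs with a given product name. The file name cel-claim-template.yaml, the template name a10-gpu, and the productName attribute are assumptions for illustration; check which attributes your driver actually publishes in its ResourceSlice before relying on them.

```shell
# Write a ResourceClaimTemplate that filters GPUs with a CEL expression.
# Assumption: the driver publishes a "productName" attribute under the
# gpu.nvidia.com domain; verify this in your ResourceSlice output.
cat > cel-claim-template.yaml <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: a10-gpu   # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          allocationMode: ExactCount
          count: 1
          selectors:
          - cel:
              expression: 'device.attributes["gpu.nvidia.com"].productName == "NVIDIA A10"'
EOF
```

Once the driver is installed, the file can be applied with kubectl apply -f cel-claim-template.yaml and referenced from a pod through resourceClaimTemplateName, in the same way as a template without a selector.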

NVIDIA DRA Driver for GPUs implements the DRA API for Kubernetes workloads, supporting controlled GPU sharing and dynamic reconfiguration.

Prerequisites

Before you begin, make sure that you have:

An ACK managed cluster running Kubernetes 1.34 or later
kubectl installed and configured with your cluster's kubeconfig

Set up the DRA GPU scheduling environment

Step 1: Create a GPU node pool

Create a GPU node pool whose nodes carry a label that disables the default GPU device plugin's resource reporting. This prevents the same GPU from being allocated twice.

Log on to the Container Service console. In the left navigation pane, choose Clusters. Click the cluster name, then choose Node management > Node Pools.
Click Create Node Pool. Select a GPU instance type (see GPU instance types supported by ACK) and keep the other settings at their defaults:
  1. Click Specify Instance Type. Enter an instance type name, such as ecs.gn7i-c8g1.2xlarge. Set Expected Nodes to 1.
  2. Click Advanced. Under Node Labels, add the following label:
```
ack.node.gpu.schedule: disabled
```
  This disables exclusive GPU scheduling and device plugin resource reporting on the node.
  
  Important: Running both the device plugin and DRA on the same node causes duplicate GPU allocation. Always add this label to nodes where DRA is enabled.
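
After the node joins the cluster, a quick check along the following lines can confirm that the label is present and that the device plugin no longer reports GPUs. This is a minimal sketch: the function name check_dra_nodes is hypothetical, and it assumes the device plugin's extended resource is named nvidia.com/gpu.

```shell
# List nodes that carry the DRA label and show whether they still report
# the device plugin's extended resource (assumed to be nvidia.com/gpu).
check_dra_nodes() {
  kubectl get nodes -l ack.node.gpu.schedule=disabled \
    -o 'custom-columns=NAME:.metadata.name,DEVICE_PLUGIN_GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Run check_dra_nodes after defining it. Each labeled node should be listed, and a value of <none> in the DEVICE_PLUGIN_GPUS column would suggest that the device plugin is not reporting GPUs on that node.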

Step 2: Install the NVIDIA DRA driver

The NVIDIA DRA GPU driver implements the DRA API for Kubernetes GPU workloads.

Install the Helm CLI.

curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Add the NVIDIA Helm repository and update it.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update

Install NVIDIA DRA GPU driver version 25.3.2.

Important: --set controller.affinity=null removes node affinity from the controller, allowing it to run on any node. Evaluate this before production use, as it may affect stability.

helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.2" --create-namespace --namespace nvidia-dra-driver-gpu \
    --set gpuResourcesEnabledOverride=true \
    --set controller.affinity=null \
    --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key=ack.node.gpu.schedule" \
    --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].operator=In" \
    --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].values[0]=disabled"

Expected output:

NAME: nvidia-dra-driver-gpu
LAST DEPLOYED: Tue Oct 14 20:42:13 2025
NAMESPACE: nvidia-dra-driver-gpu
STATUS: deployed
REVISION: 1
TEST SUITE: None
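
The long --set flags in the install command can also be kept in a values file, which is easier to review and reuse. The following sketch writes overrides that should be equivalent to the flags above; the file name dra-values.yaml is an arbitrary choice.

```shell
# Write the chart overrides to a values file instead of repeating --set flags.
# Assumption: the file name dra-values.yaml is arbitrary.
cat > dra-values.yaml <<'EOF'
gpuResourcesEnabledOverride: true
controller:
  affinity: null   # drop the chart's default controller affinity
kubeletPlugin:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: ack.node.gpu.schedule
            operator: In
            values:
            - disabled
EOF

# Install with the same release name, namespace, and version as above:
# helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
#   --version="25.3.2" --create-namespace --namespace nvidia-dra-driver-gpu \
#   -f dra-values.yaml
```

Use either the --set flags or the values file for a given release, not both, so that the effective configuration stays easy to reason about.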

Step 3: Verify the environment

Verify the NVIDIA DRA driver is running and GPU resources appear in the cluster.

Check that the DRA GPU driver pods are running.
```
kubectl get pod -n nvidia-dra-driver-gpu
```
All pods should be in the Running state. If a pod is Pending or in CrashLoopBackOff, confirm that the GPU node carries the label ack.node.gpu.schedule: disabled from Step 1.
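
Instead of polling the pod list by hand, you can block until the driver pods are ready. The helper below is a sketch; the function name wait_for_dra_driver and the default timeout of 180s are assumptions.

```shell
# Block until every pod in the driver namespace reports Ready,
# or fail once the timeout (first argument, default 180s) expires.
wait_for_dra_driver() {
  kubectl wait pod --all -n nvidia-dra-driver-gpu \
    --for=condition=Ready --timeout="${1:-180s}"
}
```

For example, wait_for_dra_driver 300s waits up to five minutes and returns a non-zero exit code if any pod is still not ready.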

Check that DRA-related resources are created.

kubectl get deviceclass,resourceslice

Expected output:

NAME                                                                    AGE
deviceclass.resource.k8s.io/compute-domain-daemon.nvidia.com            60s
deviceclass.resource.k8s.io/compute-domain-default-channel.nvidia.com   60s
deviceclass.resource.k8s.io/gpu.nvidia.com                              60s
deviceclass.resource.k8s.io/mig.nvidia.com                              60s

NAME                                                                                   NODE                      DRIVER                      POOL                      AGE
resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-compute-domain.nvidia.com-htjqn   cn-beijing.10.11.34.156   compute-domain.nvidia.com   cn-beijing.10.11.34.156   57s
resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj              cn-beijing.10.11.34.156   gpu.nvidia.com              cn-beijing.10.11.34.156   57s

If no deviceclass resources appear, confirm your cluster runs Kubernetes 1.34 or later. If no resourceslice resources appear, recheck the driver installation in Step 2.

View GPU resource details from a ResourceSlice.

Replace cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj with your ResourceSlice name from the previous step.
```
kubectl get resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj -o yaml
```
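
The full YAML can be long. For a compact overview of which devices each ResourceSlice advertises, a custom-columns query such as the following may help. The function name list_dra_devices is hypothetical, and the field paths assume the resource.k8s.io/v1 ResourceSlice schema.

```shell
# Print one row per ResourceSlice with its driver, node, and device names.
# Field paths assume the resource.k8s.io/v1 ResourceSlice schema.
list_dra_devices() {
  kubectl get resourceslice \
    -o 'custom-columns=SLICE:.metadata.name,DRIVER:.spec.driver,NODE:.spec.nodeName,DEVICES:.spec.devices[*].name'
}
```

Run list_dra_devices to see the device names that ResourceClaims can be allocated from; the device attributes themselves are only visible in the full YAML output.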

Deploy a workload with DRA GPU

These steps use a ResourceClaimTemplate to create a ResourceClaim per pod, giving each pod independent GPU access.

Create a file named resource-claim-template.yaml.

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - exactly:
          allocationMode: ExactCount
          deviceClassName: gpu.nvidia.com
          count: 1
        name: gpu

Apply the template.

kubectl apply -f resource-claim-template.yaml

Create a file named resource-claim-template-pod.yaml.

apiVersion: v1
kind: Pod
metadata:
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu

Deploy the pod.

kubectl apply -f resource-claim-template-pod.yaml

List the ResourceClaims created for the pod.
```
kubectl get resourceclaim
```
The output lists a ResourceClaim such as pod1-gpu-wstqm. To inspect it, run the following command. Replace pod1-gpu-wstqm with your ResourceClaim name.
```
kubectl describe resourceclaim pod1-gpu-wstqm
```
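
To see at a glance which device a claim was allocated and which pod reserved it, the allocation status can be projected into columns. This is a sketch: the function name show_claim_allocation is hypothetical, and the field paths assume the resource.k8s.io/v1 ResourceClaim status schema.

```shell
# Show the allocated device and the consuming pod for ResourceClaims.
# Pass a claim name to limit the output, or no argument to list all claims
# in the current namespace.
show_claim_allocation() {
  kubectl get resourceclaim "$@" \
    -o 'custom-columns=CLAIM:.metadata.name,DRIVER:.status.allocation.devices.results[*].driver,DEVICE:.status.allocation.devices.results[*].device,RESERVED_FOR:.status.reservedFor[*].name'
}
```

For example, show_claim_allocation pod1-gpu-wstqm prints a single row for that claim. Empty DEVICE or RESERVED_FOR values would indicate that the claim has not been allocated or reserved yet.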
Check the pod log to verify that the pod can access the GPU.
```
kubectl logs pod1
```
The expected output lists the allocated GPU, for example GPU 0: NVIDIA A10.
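
The steps above give each pod its own claim. To illustrate the device sharing mentioned in How it works, the following sketch writes a standalone ResourceClaim and two pods that reference it by name, so that both pods see the same GPU. The file name shared-gpu.yaml and the names shared-gpu, shared-pod1, and shared-pod2 are assumptions; the image and command are reused from pod1.

```shell
# Write one ResourceClaim plus two pods that reference the same claim.
# All object names here are hypothetical.
cat > shared-gpu.yaml <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: shared-pod1
spec:
  containers:
  - name: ctr
    image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: shared-gpu   # reference the claim, not a template
---
apiVersion: v1
kind: Pod
metadata:
  name: shared-pod2
spec:
  containers:
  - name: ctr
    image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: shared-gpu
EOF
```

Apply the file with kubectl apply -f shared-gpu.yaml and remove it with kubectl delete -f shared-gpu.yaml. If pod1 still holds the only GPU in the cluster, delete pod1 first; otherwise the shared claim cannot be allocated and both pods stay Pending. Pods that share a claim access the same device, so this pattern suits workloads that tolerate sharing a GPU.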

(Optional) Clean up the environment

After testing, delete unused resources to avoid unnecessary charges.

Delete the pod and ResourceClaimTemplate.

kubectl delete pod pod1
kubectl delete resourceclaimtemplate single-gpu

Uninstall the NVIDIA DRA GPU driver.

helm uninstall nvidia-dra-driver-gpu -n nvidia-dra-driver-gpu

Remove the GPU nodes or delete the node pool created in Step 1 if you no longer need them.