Deploy the NVIDIA DRA driver to dynamically allocate and share GPUs across pods in your ACK cluster.
How it works
Dynamic Resource Allocation (DRA) generalizes the persistent volume API to generic resources. Just as a PersistentVolumeClaim requests storage from a StorageClass, a ResourceClaim requests devices such as GPUs from a DeviceClass.
DRA enables more flexible, fine-grained resource allocation than device plugins:
- Flexible device filtering: Use the Common Expression Language (CEL) to filter devices by attributes.
- Device sharing: Share a GPU across containers or pods by referencing the same ResourceClaim.
- Simplified pod requests: Specify resource requirements declaratively without per-container device counts.
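To make the device-filtering point concrete, the following is a minimal sketch of a ResourceClaimTemplate whose request carries a CEL selector. The template name a10-gpu, the file name cel-claim-template.yaml, and the productName attribute are assumptions for illustration only; check which attributes your driver actually publishes in its ResourceSlice before relying on any of them.

```shell
# Sketch only: write a ResourceClaimTemplate that filters GPUs with a CEL selector.
# Assumptions (not from this topic): the name "a10-gpu", the output file name,
# and the "productName" attribute. Confirm attribute names in your ResourceSlice.
cat > cel-claim-template.yaml <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: a10-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          allocationMode: ExactCount
          count: 1
          selectors:
          - cel:
              expression: device.attributes["gpu.nvidia.com"].productName == "NVIDIA A10"
EOF
```

The file would be applied with kubectl apply -f cel-claim-template.yaml once the driver is installed. A request whose selector matches no device stays unallocated, so the pod that references it remains Pending.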
The NVIDIA DRA Driver for GPUs implements the DRA API for Kubernetes workloads and supports controlled GPU sharing and dynamic reconfiguration.
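As an illustration of sharing, the sketch below writes one standalone ResourceClaim and two pods that reference it by name, so both pods see the same GPU. The names shared-gpu, sharer-a, sharer-b, and the file name are hypothetical; the image and DeviceClass are the ones used elsewhere in this topic.

```shell
# Sketch only: one ResourceClaim referenced by two pods (GPU sharing).
# Assumptions: the names "shared-gpu", "sharer-a", "sharer-b" and the file name
# are illustrative and do not appear in the original procedure.
cat > shared-gpu.yaml <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com
        allocationMode: ExactCount
        count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: sharer-a
spec:
  containers:
  - name: ctr
    image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; sleep 9999"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: shared-gpu   # existing claim, not a template
---
apiVersion: v1
kind: Pod
metadata:
  name: sharer-b
spec:
  containers:
  - name: ctr
    image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; sleep 9999"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: shared-gpu   # same claim as sharer-a
EOF
```

The difference from a ResourceClaimTemplate is the resourceClaimName field: every pod that names the same claim receives the same allocated device. Because the claim is allocated once, both pods are expected to run on the node that owns that GPU.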
Prerequisites
Before you begin:
- An ACK managed cluster running Kubernetes 1.34 or later
- kubectl installed and configured with your cluster's kubeconfig
Set up the DRA GPU scheduling environment
Step 1: Create a GPU node pool
Add a node label to disable default GPU device plugin resource reporting and prevent duplicate allocation.
- Log on to the Container Service console. In the left navigation pane, choose Clusters. Click the cluster name, then choose Node management > Node Pools.
- Click Create Node Pool. Select a GPU instance type from GPU instance types supported by ACK. Keep other settings at defaults.
- Click Specify Instance Type. Enter an instance type name, such as ecs.gn7i-c8g1.2xlarge. Set Expected Nodes to 1.
- Click Advanced. Under Node Labels, add the following label:
      ack.node.gpu.schedule: disabled
  This disables exclusive GPU scheduling and device plugin resource reporting on the node.
  Important: Running both the device plugin and DRA on the same node causes duplicate GPU allocation. Always add this label to nodes where DRA is enabled.
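Once the node pool is up, the label can be confirmed from the command line. The helpers below are a small sketch; the function names are hypothetical, while the label selector and the -o name output format are standard kubectl features.

```shell
# Sketch only: confirm that at least one node carries the label from Step 1.
# "dra_labeled_nodes" and "dra_label_present" are hypothetical helper names.

# Prints one "node/<name>" line per node that has the label.
dra_labeled_nodes() {
  kubectl get nodes -l ack.node.gpu.schedule=disabled -o name
}

# Returns 0 if at least one labeled node exists, non-zero otherwise.
dra_label_present() {
  [ -n "$(dra_labeled_nodes)" ]
}
```

For example, running dra_label_present || echo "no DRA-labeled GPU nodes found" before installing the driver gives an early warning that the kubelet plugin would have nowhere to schedule.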
Step 2: Install the NVIDIA DRA driver
The NVIDIA DRA GPU driver implements the DRA API for Kubernetes GPU workloads.
- Install the Helm CLI.
      curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
- Add the NVIDIA Helm repository and update it.
      helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
        && helm repo update
- Install NVIDIA DRA GPU driver version 25.3.2.
  Important: --set controller.affinity=null removes node affinity from the controller, allowing it to run on any node. Evaluate this before production use, as it may affect stability.
      helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.2" --create-namespace --namespace nvidia-dra-driver-gpu \
        --set gpuResourcesEnabledOverride=true \
        --set controller.affinity=null \
        --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key=ack.node.gpu.schedule" \
        --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].operator=In" \
        --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].values[0]=disabled"
  Expected output:
      NAME: nvidia-dra-driver-gpu
      LAST DEPLOYED: Tue Oct 14 20:42:13 2025
      NAMESPACE: nvidia-dra-driver-gpu
      STATUS: deployed
      REVISION: 1
      TEST SUITE: None
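The long --set flags can be hard to review. As a sketch, the same settings can be written to a Helm values file, which should be equivalent as far as Helm's value merging is concerned. The file name dra-values.yaml is an assumption; the keys are taken directly from the flags in this step.

```shell
# Sketch only: the --set flags from this step expressed as a values file.
# Assumption: the file name "dra-values.yaml" is illustrative.
cat > dra-values.yaml <<'EOF'
gpuResourcesEnabledOverride: true
controller:
  affinity: null          # removes the chart's default controller affinity
kubeletPlugin:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: ack.node.gpu.schedule
            operator: In
            values:
            - disabled
EOF

# The install command would then become (shown as a comment, not run here):
#   helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
#     --version="25.3.2" --create-namespace --namespace nvidia-dra-driver-gpu \
#     -f dra-values.yaml
```

A values file can be kept in version control and diffed between upgrades, which is the main reason to prefer it over inline flags. The same production caveat about controller.affinity applies.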
Step 3: Verify the environment
Verify the NVIDIA DRA driver is running and GPU resources appear in the cluster.
- Check that the DRA GPU driver pods are running.
      kubectl get pod -n nvidia-dra-driver-gpu
  All pods should be Running. If any pod is Pending or CrashLoopBackOff, check the node label ack.node.gpu.schedule: disabled from Step 1.
- Check that DRA-related resources are created.
      kubectl get deviceclass,resourceslice
  Expected output:
      NAME                                                                    AGE
      deviceclass.resource.k8s.io/compute-domain-daemon.nvidia.com            60s
      deviceclass.resource.k8s.io/compute-domain-default-channel.nvidia.com   60s
      deviceclass.resource.k8s.io/gpu.nvidia.com                              60s
      deviceclass.resource.k8s.io/mig.nvidia.com                              60s
      NAME                                                                                   NODE                      DRIVER                      POOL                      AGE
      resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-compute-domain.nvidia.com-htjqn   cn-beijing.10.11.34.156   compute-domain.nvidia.com   cn-beijing.10.11.34.156   57s
      resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj              cn-beijing.10.11.34.156   gpu.nvidia.com              cn-beijing.10.11.34.156   57s
  If no deviceclass resources appear, confirm your cluster runs Kubernetes 1.34 or later. If no resourceslice resources appear, recheck the driver installation in Step 2.
- View GPU resource details from a ResourceSlice. Replace cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj with your ResourceSlice name from the previous step.
      kubectl get resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj -o yaml
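The checks in this step can also be scripted, for example in a CI job. The sketch below assumes the column layout shown in the expected output above (STATUS in the third column of the pod list, DRIVER in the third column of the ResourceSlice list); the function names are hypothetical.

```shell
# Sketch only: scripted versions of the Step 3 checks.
# "dra_pods_all_running" and "dra_gpu_slice_count" are hypothetical helper names.

# Returns 0 only if the driver namespace has pods and all of them are Running.
dra_pods_all_running() {
  out=$(kubectl get pod -n nvidia-dra-driver-gpu --no-headers) || return 1
  [ -n "$out" ] || return 1
  # Column 3 of "kubectl get pod" is STATUS.
  printf '%s\n' "$out" | awk '$3 != "Running" { bad = 1 } END { exit bad }'
}

# Prints how many ResourceSlices were published by the gpu.nvidia.com driver.
dra_gpu_slice_count() {
  # Column 3 of "kubectl get resourceslice" is DRIVER.
  kubectl get resourceslice --no-headers 2>/dev/null |
    awk '$3 == "gpu.nvidia.com" { n++ } END { print n + 0 }'
}
```

A count of 0 from dra_gpu_slice_count points to the same causes listed above: recheck the driver installation in Step 2 and the node label from Step 1.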
Deploy a workload with DRA GPU
These steps use a ResourceClaimTemplate to create a ResourceClaim per pod, giving each pod independent GPU access.
- Create a file named resource-claim-template.yaml.
      apiVersion: resource.k8s.io/v1
      kind: ResourceClaimTemplate
      metadata:
        name: single-gpu
      spec:
        spec:
          devices:
            requests:
            - exactly:
                allocationMode: ExactCount
                deviceClassName: gpu.nvidia.com
                count: 1
              name: gpu
  Apply the template.
      kubectl apply -f resource-claim-template.yaml
- Create a file named resource-claim-template-pod.yaml.
      apiVersion: v1
      kind: Pod
      metadata:
        name: pod1
        labels:
          app: pod
      spec:
        containers:
        - name: ctr
          image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
          command: ["bash", "-c"]
          args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
          resources:
            claims:
            - name: gpu
        resourceClaims:
        - name: gpu
          resourceClaimTemplateName: single-gpu
  Deploy the pod.
      kubectl apply -f resource-claim-template-pod.yaml
- List the ResourceClaims created for the pod.
      kubectl get resourceclaim
  The output lists a ResourceClaim such as pod1-gpu-wstqm. To inspect it, run the following command, replacing pod1-gpu-wstqm with your ResourceClaim name:
      kubectl describe resourceclaim pod1-gpu-wstqm
- Verify the pod uses the GPU.
      kubectl logs pod1
  Expected output:
      GPU 0: NVIDIA A10
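Because the ResourceClaim name is generated, it can be convenient to look it up from the pod instead of copying it by hand. The sketch below reads the pod's status.resourceClaimStatuses field and then the claim's status.allocation field; the helper names are hypothetical, and the jsonpath expressions assume the pod has a single claim, as in this example.

```shell
# Sketch only: resolve a pod's generated ResourceClaim and list its devices.
# "pod_claim_name" and "claim_devices" are hypothetical helper names.

# $1 = pod name. Prints the ResourceClaim generated for the pod's first claim.
pod_claim_name() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.resourceClaimStatuses[0].resourceClaimName}'
}

# $1 = ResourceClaim name. Prints the allocated device names, space-separated.
claim_devices() {
  kubectl get resourceclaim "$1" \
    -o jsonpath='{.status.allocation.devices.results[*].device}'
}
```

With the pod from this section, claim_devices "$(pod_claim_name pod1)" prints the device that the scheduler allocated. An empty result suggests the claim has not been allocated yet, in which case kubectl describe on the claim shows the pending state.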
(Optional) Clean up the environment
After testing, delete unused resources to avoid unnecessary charges.
- Delete the pod and ResourceClaimTemplate.
      kubectl delete pod pod1
      kubectl delete resourceclaimtemplate single-gpu
- Uninstall the NVIDIA DRA GPU driver.
      helm uninstall nvidia-dra-driver-gpu -n nvidia-dra-driver-gpu
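If the cleanup may run more than once, for example at the end of an automated test, a rerunnable variant can help. The sketch below wraps the same commands in a hypothetical dra_cleanup function, using kubectl's --ignore-not-found flag and a helm status check so that missing resources do not cause an error.

```shell
# Sketch only: a rerunnable version of the cleanup steps above.
# "dra_cleanup" is a hypothetical helper name.
dra_cleanup() {
  # --ignore-not-found makes the delete succeed even if the object is gone.
  kubectl delete pod pod1 --ignore-not-found
  kubectl delete resourceclaimtemplate single-gpu --ignore-not-found
  # Uninstall the release only if Helm still knows about it.
  if helm status nvidia-dra-driver-gpu -n nvidia-dra-driver-gpu >/dev/null 2>&1; then
    helm uninstall nvidia-dra-driver-gpu -n nvidia-dra-driver-gpu
  fi
}
```

The pod is deleted before the template so that its generated ResourceClaim is released first, matching the order of the manual steps.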