In an ACK Auto Mode cluster, an intelligent managed node pool provides elastic provisioning of GPU instances. It ensures just-in-time GPU resource supply for training jobs and automatically releases the GPU nodes after jobs are finished, enabling on-demand usage with zero idle costs.
Prerequisites
The node pool requires ContainerOS 3.7 or later. To upgrade the ContainerOS version, see Change the operating system.
The following instance types are not supported: ecs.gn5, ecs.gn5i, ecs.gn6v, ecs.gn6e, ecs.gn8v-tee, ebmgn9ge, and ebmgn9gc.
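As a quick pre-check before you choose instance types in Step 1, the following shell sketch tests an instance type name against the unsupported families listed above. The function name `is_unsupported_gpu_type` is hypothetical, and the matching rule (an optional `ecs.` prefix, then the family name followed by `-` or `.`) is an assumption about how instance type names are formed, not an official ACK check.

```shell
# Hypothetical helper: returns 0 (true) if the instance type belongs to one of
# the unsupported families listed above, and 1 otherwise.
is_unsupported_gpu_type() {
  # Strip an optional "ecs." prefix so "ecs.gn5-..." and "ebmgn9ge..." both match.
  name="${1#ecs.}"
  for family in gn5 gn5i gn6v gn6e gn8v-tee ebmgn9ge ebmgn9gc; do
    case "$name" in
      # Match the bare family, or the family followed by "-" or ".".
      "$family"|"$family"-*|"$family".*) return 0 ;;
    esac
  done
  return 1
}
```

For example, `is_unsupported_gpu_type ecs.gn7i-c8g1.2xlarge` returns a non-zero status, so that instance type passes the pre-check.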
Step 1: Create a GPU intelligent managed node pool
On the ACK Cluster List page, click the name of your target cluster. In the left-side navigation pane, go to the Node Pools page.
Click Create Node Pool and configure the following key parameters:
Parameter | Description
Managed type | Select intelligent hosting.
vSwitch | During scaling events, the node pool adds or removes nodes in the availability zones of the selected vSwitches according to the Scaling Policy. For high availability, we recommend that you select vSwitches in two or more different availability zones.
Instance configurations | Set Instance configuration method to Specify instance types. Architecture: GPU-accelerated cloud server. Instance types: Select instance types that meet your needs, such as ecs.gn7i-c8g1.2xlarge. To ensure successful scaling, we recommend selecting multiple instance types.
For more information about the parameters, see Create and manage node pools.
Step 2: Create and deploy a GPU training job
Create a file named pytorch-examples-job.yaml with the following content:
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-examples
  labels:
    app: pytorch-examples
spec:
  parallelism: 1
  backoffLimit: 0
  ttlSecondsAfterFinished: 3600
  template:
    metadata:
      labels:
        app: pytorch-examples
    spec:
      restartPolicy: Never
      containers:
      - name: py
        image: registry-<region-id>-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1
        command:
        - python
        - main.py
        - --epochs
        - "1"
        - --batch_size
        - "8"
        - --accel
        env:
        - name: NVIDIA_DISABLE_REQUIRE
          value: "1"
        - name: PYTHONUNBUFFERED
          value: "1"
        resources:
          limits:
            nvidia.com/gpu: 1
        workingDir: /root/examples/word_language_model
Note: Replace <region-id> with the ID of the region where your cluster is located, such as cn-hangzhou, cn-beijing, or cn-zhangjiakou.
The resources.limits field in the YAML file specifies that the container requires one GPU. The Kubernetes scheduler uses this information to assign the Pod to a suitable node.
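The region placeholder can also be filled in from the command line rather than by hand. The sketch below assumes the manifest still contains the literal text `<region-id>`; the function name `render_manifest` and the output file name `job.rendered.yaml` are hypothetical.

```shell
# Hypothetical helper: reads a manifest on stdin and writes it to stdout with
# every <region-id> placeholder replaced by the given region ID.
render_manifest() {
  region="$1"
  sed "s/<region-id>/${region}/g"
}

# Usage sketch (not run here): render the manifest for cn-beijing, then let
# kubectl validate it locally without creating anything in the cluster.
#   render_manifest cn-beijing < pytorch-examples-job.yaml > job.rendered.yaml
#   kubectl apply --dry-run=client -f job.rendered.yaml
```

If you render to a separate file this way, pass that file to `kubectl apply -f` in the deployment step instead of the template.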
Deploy the application.
kubectl apply -f pytorch-examples-job.yaml
Step 3: Verify the results
Verify that the Pod is scheduled to a GPU node.
After deployment, if GPU resources are insufficient, the Pod enters the Pending state until the node pool provisions a GPU node.
Check the Pod status.
kubectl get pod -l app=pytorch-examples
Check the Pod events.
kubectl describe pod <YOUR-POD-NAME>
In the Events section, the expected output is similar to the following:
Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       15s   default-scheduler  Successfully assigned default/pytorch-examples-***** to cn-beijing.10.61.65.169
  Normal  AllocIPSucceed  15s   terway-daemon      Alloc IP 10.61.65.172/16 took 39.976688ms
  Normal  Pulled          15s   kubelet            Container image "registry-cn-beijing-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1" already present on machine and can be accessed by the pod
  Normal  Created         15s   kubelet            Container created
  Normal  Started         15s   kubelet            Container started
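The status checks above can be wrapped in small helpers so that you do not need to copy the Pod name by hand. This is a minimal sketch built on standard kubectl subcommands; the function names are hypothetical, the 600s default timeout is an assumed value, and the helpers assume the Job has created exactly one Pod that is still running.

```shell
# Print the name of the first Pod created by the Job (selected by its app label).
job_pod_name() {
  kubectl get pod -l app=pytorch-examples \
    -o jsonpath='{.items[0].metadata.name}'
}

# Print the node the Pod was scheduled to; empty while the Pod is still Pending.
job_pod_node() {
  kubectl get pod "$(job_pod_name)" -o jsonpath='{.spec.nodeName}'
}

# Block until the Pod is Ready or the timeout expires.
# The default of 600s is an assumption; adjust it to how long node provisioning takes.
wait_for_job_pod() {
  kubectl wait --for=condition=Ready pod -l app=pytorch-examples \
    --timeout="${1:-600s}"
}
```

A typical sequence is `wait_for_job_pod` followed by `job_pod_node`, which should print a node name once the Pod has left the Pending state.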
Check the logs to verify that PyTorch is using the GPU.
kubectl logs <YOUR-POD-NAME> -f
Expected output:
Using device: cuda
| epoch   1 |   200/ 7459 batches | lr 20.00 | ms/batch  5.58 | loss  7.71 | ppl  2241.62
| epoch   1 |   400/ 7459 batches | lr 20.00 | ms/batch  4.58 | loss  6.85 | ppl   946.49
| epoch   1 |   600/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.54 | ppl   692.70
| epoch   1 |   800/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.35 | ppl   574.73
| epoch   1 |  1000/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.20 | ppl   494.75
| epoch   1 |  1200/ 7459 batches | lr 20.00 | ms/batch  4.58 | loss  6.18 | ppl   482.76
| epoch   1 |  1400/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  6.18 | ppl   483.99
| epoch   1 |  1600/ 7459 batches | lr 20.00 | ms/batch  4.59 | loss  5.97 | ppl   390.43
| epoch   1 |  1800/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.97 | ppl   390.31
| epoch   1 |  2000/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.89 | ppl   361.39
| epoch   1 |  2200/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.84 | ppl   344.48
| epoch   1 |  2400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.80 | ppl   330.62
| epoch   1 |  2600/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.86 | ppl   351.71
| epoch   1 |  2800/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.79 | ppl   327.62
| epoch   1 |  3000/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.66 | ppl   286.30
| epoch   1 |  3200/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.75 | ppl   313.55
| epoch   1 |  3400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.73 | ppl   306.97
| epoch   1 |  3600/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.62 | ppl   274.96
| epoch   1 |  3800/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.63 | ppl   279.61
| epoch   1 |  4000/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.62 | ppl   274.83
| epoch   1 |  4200/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.52 | ppl   248.50
| epoch   1 |  4400/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.55 | ppl   256.37
| epoch   1 |  4600/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.69 | ppl   297.25
| epoch   1 |  4800/ 7459 batches | lr 20.00 | ms/batch  4.65 | loss  5.62 | ppl   275.78
| epoch   1 |  5000/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.58 | ppl   265.67
| epoch   1 |  5200/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  5.26 | ppl   191.98
| epoch   1 |  5400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.50 | ppl   245.12
| epoch   1 |  5600/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.57 | ppl   261.86
| epoch   1 |  5800/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.45 | ppl   233.85
| epoch   1 |  6000/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.43 | ppl   228.95
| epoch   1 |  6200/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.39 | ppl   219.26
| epoch   1 |  6400/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.47 | ppl   236.98
| epoch   1 |  6600/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.48 | ppl   240.89
| epoch   1 |  6800/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.43 | ppl   228.92
| epoch   1 |  7000/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.39 | ppl   219.44
| epoch   1 |  7200/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  5.30 | ppl   199.58
| epoch   1 |  7400/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.40 | ppl   221.76
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 35.56s | valid loss  5.47 | valid ppl   237.61
-----------------------------------------------------------------------------------------
=========================================================================================
| End of training | test loss  5.41 | test ppl   224.45
=========================================================================================
The log output begins with Using device: cuda, confirming that the PyTorch job is running on the GPU.
Note: The image may not support all GPU models. If the Job fails, try changing the image tag in pytorch-examples-job.yaml from 2.6.0-cuda12.2-1 to 2.9.0-cuda13.0-2.
Verify GPU utilization.
While the PyTorch job is running, run the following command to open a shell in the Pod:
kubectl exec -it <YOUR-POD-NAME> -- bash
Then, run the nvidia-smi command. The expected output is similar to the following:
Thu Jun 11 06:08:11 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA XXX                     On  |   00000000:00:08.0 Off |                    0 |
|  0%   52C    P0            147W /  150W |     513MiB /  23028MiB |     91%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A               1      C   python                                  506MiB |
+-----------------------------------------------------------------------------------------+
This output confirms that the Pod was scheduled to a GPU node, the driver and CUDA are running correctly, PyTorch is using the GPU, and utilization has reached 91%.
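The log and utilization checks above can also be run non-interactively, which is convenient in scripts. The sketch below is one possible approach: the function names are hypothetical, and it assumes that `nvidia-smi` is on the container's PATH, as the interactive check above suggests.

```shell
# Hypothetical helper: reads training logs on stdin and succeeds only if a
# line reports that the CUDA device was selected.
logs_show_cuda() {
  grep -q '^Using device: cuda'
}

# Hypothetical helper: print the GPU utilization in percent (one line per GPU)
# as seen from inside the given Pod, using nvidia-smi's CSV query mode.
pod_gpu_utilization() {
  kubectl exec "$1" -- nvidia-smi \
    --query-gpu=utilization.gpu --format=csv,noheader,nounits
}

# Usage sketch (not run here):
#   kubectl logs job/pytorch-examples | logs_show_cuda && echo "running on GPU"
#   pod_gpu_utilization <YOUR-POD-NAME>
```

Because the training run is short, call `pod_gpu_utilization` while the Job is still in progress; after the container exits, `kubectl exec` can no longer reach it.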
Clean up resources.
kubectl delete job pytorch-examples
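After cleanup, you may want to watch for the node pool releasing GPU nodes that no longer have work. The helper below is a hypothetical sketch that lists each node with its allocatable GPU count by using kubectl custom columns. How long the release takes depends on the node pool's scaling behavior, which this document does not specify.

```shell
# Hypothetical helper: list every node with its allocatable GPU count.
# Nodes without GPUs show <none> in the GPU column.
list_gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Running `list_gpu_capacity` again some time after the Job is deleted should show fewer GPU nodes if the node pool has scaled in.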