ContainerOS is an Alibaba Cloud operating system optimized for containerized environments. It includes built-in GPU drivers and a container runtime, providing out-of-the-box support for GPU workloads. This topic shows you how to create a node pool of GPU instances using ContainerOS and deploy a sample workload to verify that GPU scheduling and execution work correctly.
Usage notes
Nodes must use ContainerOS 3.7 or later. To upgrade the ContainerOS version, see Change an operating system.
The following instance families are not supported: ecs.gn5, ecs.gn5i, ecs.gn6v, ecs.gn6e, and ecs.gn8v-tee.
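As a rough pre-check before you configure a node pool, the sketch below tests an instance type name against the unsupported entries listed above. The function name is hypothetical, and the matching rule (each entry is treated as a family name that is followed by "-" or "." in the instance type name) is an assumption about ECS naming, not a documented API.

```shell
# Hypothetical helper: returns 0 if the instance type is NOT in an unsupported family.
# Assumption: an instance type name is its family name followed by "-" or ".",
# for example ecs.gn5-c4g1.xlarge would belong to family ecs.gn5.
is_supported_instance_type() {
  instance_type="$1"
  for family in ecs.gn5 ecs.gn5i ecs.gn6v ecs.gn6e ecs.gn8v-tee; do
    case "$instance_type" in
      "$family"-*|"$family".*) return 1 ;;
    esac
  done
  return 0
}

# The instance type used as an example in this topic is not in the unsupported list.
if is_supported_instance_type "ecs.gn7i-c8g1.2xlarge"; then
  echo "supported"
else
  echo "not supported"
fi
```

The check is purely textual. It does not confirm that an instance type exists or is available in your region.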
Step 1: Create a GPU node pool
In the Container Service for Kubernetes console, go to the Clusters page and click the name of your cluster. In the left navigation pane, choose .
Click Create Node Pool and configure the following key parameters:
Parameter | Description
Instance configuration | Set Instance Configuration Method to Specify Instance Types. Architecture: GPU instance. Instance Type: Select one or more instance types that meet your business requirements, such as ecs.gn7i-c8g1.2xlarge. We recommend that you select multiple instance types to increase the success rate of scaling.
Operating system | Select ContainerOS GPU 3.7.2.
For more information, see Create and manage node pools.
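After the node pool is created, one way to confirm that the new nodes advertise GPUs to Kubernetes is to list each node's allocatable nvidia.com/gpu count. The following is a minimal sketch, assuming kubectl is already configured for the cluster. The function names are hypothetical.

```shell
# List each node with its allocatable nvidia.com/gpu count.
# Nodes that do not advertise GPUs show <none> in the GPU column.
list_node_gpus() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Print only the names of nodes that advertise at least one GPU.
list_gpu_nodes() {
  list_node_gpus | awk '$2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Usage, once the node pool is ready:
#   list_gpu_nodes
```

If list_gpu_nodes prints nothing, the nodes may still be initializing.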
Step 2: Deploy and verify a GPU workload
Create and deploy the workload.
Create a file named pytorch-examples-job.yaml with the following content:
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-examples
  labels:
    app: pytorch-examples
spec:
  parallelism: 1
  backoffLimit: 0
  ttlSecondsAfterFinished: 3600
  template:
    metadata:
      labels:
        app: pytorch-examples
    spec:
      restartPolicy: Never
      containers:
      - name: py
        image: registry-<region-id>-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1
        command:
        - python
        - main.py
        - --epochs
        - "1"
        - --batch_size
        - "8"
        - --accel
        env:
        - name: NVIDIA_DISABLE_REQUIRE
          value: "1"
        - name: PYTHONUNBUFFERED
          value: "1"
        resources:
          limits:
            nvidia.com/gpu: 1
        workingDir: /root/examples/word_language_model
Note: Replace <region-id> with the region ID of your cluster, such as cn-hangzhou, cn-beijing, or cn-zhangjiakou.
The resources.limits field in the YAML file specifies that the container requires one GPU. The Kubernetes scheduler uses this information to schedule the Pod to a node with available GPU resources.
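The <region-id> substitution can also be scripted instead of edited by hand. The following is a minimal sketch with a hypothetical helper name. It prints the rendered manifest to standard output and leaves the file unchanged.

```shell
# Hypothetical helper: print the manifest with <region-id> replaced.
# $1 = region ID such as cn-hangzhou, $2 = path to the manifest file.
render_job() {
  sed "s/<region-id>/$1/g" "$2"
}

# Usage (assumes kubectl is configured for the cluster):
#   Validate the rendered manifest on the client without creating anything:
#     render_job cn-hangzhou pytorch-examples-job.yaml | kubectl apply --dry-run=client -f -
```

Because the helper does not modify the file, you still need to replace <region-id> in pytorch-examples-job.yaml itself, or pipe the rendered output to kubectl apply -f -, when you deploy.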
Deploy the application.
kubectl apply -f pytorch-examples-job.yaml
Verify that the Pod is scheduled to a GPU node.
If GPU resources are insufficient after deployment, the Pod remains in the Pending state.
Check the Pod status.
kubectl get pod -l app=pytorch-examples
Check the Pod events.
kubectl describe pod -l app=pytorch-examples
The output under Events: should be similar to the following:
Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       15s   default-scheduler  Successfully assigned default/pytorch-examples-m2qs7 to cn-hangzhou.10.61.65.169
  Normal  AllocIPSucceed  15s   terway-daemon      Alloc IP 10.61.65.172/16 took 39.976688ms
  Normal  Pulled          15s   kubelet            Container image "registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1" already present on machine and can be accessed by the pod
  Normal  Created         15s   kubelet            Container created
  Normal  Started         15s   kubelet            Container started
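The scheduling check can also be reduced to a single pass/fail test by reading the node assignment of the Job's Pods. The following is a minimal sketch, assuming kubectl is configured for the cluster and the Job uses the app=pytorch-examples label from the manifest. The function names are hypothetical.

```shell
# Print "<pod-name> <phase> <node>" for each Pod created by the Job.
# The node column shows <none> while a Pod is still unscheduled.
job_pod_status() {
  kubectl get pod -l app=pytorch-examples --no-headers \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,NODE:.spec.nodeName'
}

# Return 0 only if at least one Pod is listed and every listed Pod has a node.
job_pods_scheduled() {
  job_pod_status | awk '
    $3 == "<none>" || $3 == "" { pending = 1 }
    END { exit (pending || NR == 0) ? 1 : 0 }'
}

# Usage:
#   job_pod_status
#   if job_pods_scheduled; then echo "scheduled"; else echo "still pending"; fi
```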
Check the logs to verify that PyTorch is using the GPU.
kubectl logs <YOUR-POD-NAME> -f
Expected output:
Using device: cuda
| epoch 1 | 200/ 7459 batches | lr 20.00 | ms/batch 5.58 | loss 7.71 | ppl 2241.62
| epoch 1 | 400/ 7459 batches | lr 20.00 | ms/batch 4.58 | loss 6.85 | ppl 946.49
| epoch 1 | 600/ 7459 batches | lr 20.00 | ms/batch 4.57 | loss 6.54 | ppl 692.70
| epoch 1 | 800/ 7459 batches | lr 20.00 | ms/batch 4.57 | loss 6.35 | ppl 574.73
| epoch 1 | 1000/ 7459 batches | lr 20.00 | ms/batch 4.57 | loss 6.20 | ppl 494.75
| epoch 1 | 1200/ 7459 batches | lr 20.00 | ms/batch 4.58 | loss 6.18 | ppl 482.76
| epoch 1 | 1400/ 7459 batches | lr 20.00 | ms/batch 4.60 | loss 6.18 | ppl 483.99
| epoch 1 | 1600/ 7459 batches | lr 20.00 | ms/batch 4.59 | loss 5.97 | ppl 390.43
| epoch 1 | 1800/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.97 | ppl 390.31
| epoch 1 | 2000/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.89 | ppl 361.39
| epoch 1 | 2200/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.84 | ppl 344.48
| epoch 1 | 2400/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.80 | ppl 330.62
| epoch 1 | 2600/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.86 | ppl 351.71
| epoch 1 | 2800/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.79 | ppl 327.62
| epoch 1 | 3000/ 7459 batches | lr 20.00 | ms/batch 4.63 | loss 5.66 | ppl 286.30
| epoch 1 | 3200/ 7459 batches | lr 20.00 | ms/batch 4.64 | loss 5.75 | ppl 313.55
| epoch 1 | 3400/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.73 | ppl 306.97
| epoch 1 | 3600/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.62 | ppl 274.96
| epoch 1 | 3800/ 7459 batches | lr 20.00 | ms/batch 4.63 | loss 5.63 | ppl 279.61
| epoch 1 | 4000/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.62 | ppl 274.83
| epoch 1 | 4200/ 7459 batches | lr 20.00 | ms/batch 4.63 | loss 5.52 | ppl 248.50
| epoch 1 | 4400/ 7459 batches | lr 20.00 | ms/batch 4.63 | loss 5.55 | ppl 256.37
| epoch 1 | 4600/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.69 | ppl 297.25
| epoch 1 | 4800/ 7459 batches | lr 20.00 | ms/batch 4.65 | loss 5.62 | ppl 275.78
| epoch 1 | 5000/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.58 | ppl 265.67
| epoch 1 | 5200/ 7459 batches | lr 20.00 | ms/batch 4.60 | loss 5.26 | ppl 191.98
| epoch 1 | 5400/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.50 | ppl 245.12
| epoch 1 | 5600/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.57 | ppl 261.86
| epoch 1 | 5800/ 7459 batches | lr 20.00 | ms/batch 4.63 | loss 5.45 | ppl 233.85
| epoch 1 | 6000/ 7459 batches | lr 20.00 | ms/batch 4.64 | loss 5.43 | ppl 228.95
| epoch 1 | 6200/ 7459 batches | lr 20.00 | ms/batch 4.64 | loss 5.39 | ppl 219.26
| epoch 1 | 6400/ 7459 batches | lr 20.00 | ms/batch 4.64 | loss 5.47 | ppl 236.98
| epoch 1 | 6600/ 7459 batches | lr 20.00 | ms/batch 4.64 | loss 5.48 | ppl 240.89
| epoch 1 | 6800/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.43 | ppl 228.92
| epoch 1 | 7000/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.39 | ppl 219.44
| epoch 1 | 7200/ 7459 batches | lr 20.00 | ms/batch 4.60 | loss 5.30 | ppl 199.58
| epoch 1 | 7400/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.40 | ppl 221.76
-----------------------------------------------------------------------------------------
| end of epoch 1 | time: 35.56s | valid loss 5.47 | valid ppl 237.61
-----------------------------------------------------------------------------------------
=========================================================================================
| End of training | test loss 5.41 | test ppl 224.45
=========================================================================================
The log output starts with Using device: cuda, which indicates that the PyTorch example is running on the GPU.
Note: The current image may not support all GPU models. If the job fails, try changing the image tag in pytorch-examples-job.yaml from 2.6.0-cuda12.2-1 to 2.9.0-cuda13.0-2.
Verify GPU usage.
While the PyTorch example is running, access the Pod to check GPU usage:
kubectl exec -it <YOUR-POD-NAME> -- bash
Then, run the nvidia-smi command. The expected output is similar to the following:
root@pytorch-examples-xlqg6:~/examples/word_language_model# nvidia-smi
Thu Jun 4 03:42:55 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA XXX                     On  |   00000000:00:08.0 Off |                    0 |
|  0%   52C    P0            147W /  150W |     513MiB /  23028MiB |     91%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A               1      C   python                                  506MiB |
+-----------------------------------------------------------------------------------------+
This output confirms that the GPU was scheduled successfully, the driver and CUDA are functioning correctly, and PyTorch is using the GPU for computation. In this example, the GPU utilization is 91%.
If you have node access permissions, you can also run nvidia-smi directly on the GPU node. For instructions, see Log on to a node.
Delete the job.
kubectl delete job pytorch-examples
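For repeated runs, the log and GPU checks from Step 2 can be scripted. The following is a minimal sketch with hypothetical helper names. It assumes kubectl is configured for the cluster, that the Job and its Pod still exist (run it before you delete the job), and that nvidia-smi is available inside the container, as in the steps above.

```shell
# Resolve the name of the first Pod created by the Job.
job_pod_name() {
  kubectl get pod -l app=pytorch-examples \
    -o jsonpath='{.items[0].metadata.name}'
}

# Return 0 if the Pod log reports that PyTorch selected the CUDA device.
# $1 = Pod name.
uses_cuda() {
  kubectl logs "$1" | grep -q 'Using device: cuda'
}

# Print GPU utilization (%) and used memory (MiB) as one CSV line per GPU,
# as seen from inside the Pod. $1 = Pod name.
gpu_usage() {
  kubectl exec "$1" -- nvidia-smi \
    --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits
}

# Usage, while the Job is still running:
#   pod=$(job_pod_name)
#   uses_cuda "$pod" && echo "PyTorch is using the GPU"
#   gpu_usage "$pod"
```

The gpu_usage helper only works while the container is running, because kubectl exec cannot attach to a completed Pod.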