How to run GPU workloads on ContainerOS

ContainerOS is an Alibaba Cloud operating system optimized for containerized environments. It includes built-in GPU drivers and a container runtime, providing out-of-the-box support for GPU workloads. This topic shows you how to create a node pool of GPU instances using ContainerOS and deploy a sample workload to verify that GPU scheduling and execution work correctly.

Usage notes

Nodes must use ContainerOS 3.7 or later. To upgrade the ContainerOS version, see Change an operating system.
The following instance types are not supported: ecs.gn5, ecs.gn5i, ecs.gn6v, ecs.gn6e, and ecs.gn8v-tee.

Step 1: Create a GPU node pool

In the Container Service for Kubernetes console, go to the Clusters page and click the name of your cluster. In the left navigation pane, choose Node Management > Node Pools.

Click Create Node Pool and configure the following key parameters:

| Parameter | Description |
| --- | --- |
| Instance configuration | Set Instance Configuration Method to Specify Instance Types. Architecture: GPU instance. Instance Type: select instance types that meet your business requirements, such as ecs.gn7i-c8g1.2xlarge. We recommend selecting multiple instance types to increase the success rate of scaling. |
| Operating system | Select ContainerOS GPU 3.7.2. |

For more information, see Create and manage node pools.

Step 2: Deploy and verify a GPU workload

Create and deploy the workload.

Create a file named pytorch-examples-job.yaml with the following content:

```
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-examples
  labels:
    app: pytorch-examples
spec:
  parallelism: 1                 # Run one Pod at a time.
  backoffLimit: 0                # Do not retry the Job if the Pod fails.
  ttlSecondsAfterFinished: 3600  # Delete the Job one hour after it finishes.
  template:
    metadata:
      labels:
        app: pytorch-examples
    spec:
      restartPolicy: Never       # Do not restart the container after it exits.
      containers:
      - name: py
        image: registry-<region-id>-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1
        command:
        - python
        - main.py
        - --epochs
        - "1"
        - --batch_size
        - "8"
        - --accel
        env:
        - name: NVIDIA_DISABLE_REQUIRE
          value: "1"
        - name: PYTHONUNBUFFERED
          value: "1"
        resources:
          limits:
            nvidia.com/gpu: 1    # Request one GPU for this container.
        workingDir: /root/examples/word_language_model
```

Note

Replace <region-id> with the region ID of your cluster, such as cn-hangzhou, cn-beijing, or cn-zhangjiakou.
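The placeholder substitution described in this note can also be scripted. The following is a minimal sketch, assuming bash and sed are available. The function name render_manifest is hypothetical, and cn-hangzhou is only an example region.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace every <region-id> placeholder read from stdin
# with the region ID passed as the first argument.
render_manifest() {
  local region_id="$1"
  sed "s/<region-id>/${region_id}/g"
}

# Example: render one image reference for the cn-hangzhou region.
echo 'registry-<region-id>-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1' \
  | render_manifest cn-hangzhou
```

To render a whole manifest instead of a single line, one option is to redirect the file into the function, for example render_manifest cn-hangzhou < pytorch-examples-job.yaml, and review the result before applying it.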

The resources.limits field in the YAML file specifies that the container requires one GPU. The Kubernetes scheduler uses this information to schedule the Pod to a node with available GPU resources.

Deploy the Job.

```
kubectl apply -f pytorch-examples-job.yaml
```

Verify that the Pod is scheduled to a GPU node.

If GPU resources are insufficient after deployment, the Pod remains in the Pending state.

Check the Pod status.
```
kubectl get pod -l app=pytorch-examples
```

Check the Pod events.

```
kubectl describe pod -l app=pytorch-examples
```

The output under Events: should be similar to the following:

```
Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       15s   default-scheduler  Successfully assigned default/pytorch-examples-m2qs7 to cn-hangzhou.10.61.65.169
  Normal  AllocIPSucceed  15s   terway-daemon      Alloc IP 10.61.65.172/16 took 39.976688ms
  Normal  Pulled          15s   kubelet            Container image "registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1" already present on machine and can be accessed by the pod
  Normal  Created         15s   kubelet            Container created
  Normal  Started         15s   kubelet            Container started
```
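
If the Pod stays in the Pending state instead of producing events like these, it can help to check how many GPUs each node advertises to the scheduler. The following is a minimal sketch, assuming kubectl is already configured for the cluster. The function name list_node_gpus is hypothetical, and the dot in nvidia\.com/gpu is escaped because the resource name itself contains a dot.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list each node with its allocatable nvidia.com/gpu count.
# Nodes that expose no GPUs show <none> in the GPU column.
list_node_gpus() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Usage (requires cluster access):
#   list_node_gpus
```

A node in the new node pool that reports no allocatable GPUs suggests a problem with the node itself rather than with the Job manifest.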

Check the logs to verify that PyTorch is using the GPU.

```
kubectl logs <YOUR-POD-NAME> -f
```

Expected output:

```
Using device: cuda
| epoch   1 |   200/ 7459 batches | lr 20.00 | ms/batch  5.58 | loss  7.71 | ppl  2241.62
| epoch   1 |   400/ 7459 batches | lr 20.00 | ms/batch  4.58 | loss  6.85 | ppl   946.49
| epoch   1 |   600/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.54 | ppl   692.70
| epoch   1 |   800/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.35 | ppl   574.73
| epoch   1 |  1000/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.20 | ppl   494.75
| epoch   1 |  1200/ 7459 batches | lr 20.00 | ms/batch  4.58 | loss  6.18 | ppl   482.76
| epoch   1 |  1400/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  6.18 | ppl   483.99
| epoch   1 |  1600/ 7459 batches | lr 20.00 | ms/batch  4.59 | loss  5.97 | ppl   390.43
| epoch   1 |  1800/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.97 | ppl   390.31
| epoch   1 |  2000/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.89 | ppl   361.39
| epoch   1 |  2200/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.84 | ppl   344.48
| epoch   1 |  2400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.80 | ppl   330.62
| epoch   1 |  2600/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.86 | ppl   351.71
| epoch   1 |  2800/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.79 | ppl   327.62
| epoch   1 |  3000/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.66 | ppl   286.30
| epoch   1 |  3200/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.75 | ppl   313.55
| epoch   1 |  3400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.73 | ppl   306.97
| epoch   1 |  3600/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.62 | ppl   274.96
| epoch   1 |  3800/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.63 | ppl   279.61
| epoch   1 |  4000/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.62 | ppl   274.83
| epoch   1 |  4200/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.52 | ppl   248.50
| epoch   1 |  4400/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.55 | ppl   256.37
| epoch   1 |  4600/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.69 | ppl   297.25
| epoch   1 |  4800/ 7459 batches | lr 20.00 | ms/batch  4.65 | loss  5.62 | ppl   275.78
| epoch   1 |  5000/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.58 | ppl   265.67
| epoch   1 |  5200/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  5.26 | ppl   191.98
| epoch   1 |  5400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.50 | ppl   245.12
| epoch   1 |  5600/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.57 | ppl   261.86
| epoch   1 |  5800/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.45 | ppl   233.85
| epoch   1 |  6000/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.43 | ppl   228.95
| epoch   1 |  6200/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.39 | ppl   219.26
| epoch   1 |  6400/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.47 | ppl   236.98
| epoch   1 |  6600/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.48 | ppl   240.89
| epoch   1 |  6800/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.43 | ppl   228.92
| epoch   1 |  7000/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.39 | ppl   219.44
| epoch   1 |  7200/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  5.30 | ppl   199.58
| epoch   1 |  7400/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.40 | ppl   221.76
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 35.56s | valid loss  5.47 | valid ppl   237.61
-----------------------------------------------------------------------------------------
=========================================================================================
| End of training | test loss  5.41 | test ppl   224.45
=========================================================================================
```

The log output starts with Using device: cuda, which indicates that PyTorch selected the GPU and is running the example on it.
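
The same check can be scripted rather than read by eye. The following is a minimal sketch, assuming the log contains the line Using device: cuda exactly as in the expected output. The function name uses_cuda is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the log read from stdin contains the
# line that the example prints when it selects the GPU.
uses_cuda() {
  grep -q '^Using device: cuda'
}

# Example with a short captured log excerpt.
if printf 'Using device: cuda\n| epoch   1 |   200/ 7459 batches |\n' | uses_cuda; then
  echo "GPU in use"
else
  echo "GPU not in use"
fi
```

Against the live Job, the input could come from kubectl logs job/pytorch-examples, which reads the logs of a Pod that belongs to the Job and avoids looking up the Pod name.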

Note

The current image may not support all GPU models. If the job fails, try changing the image tag in pytorch-examples-job.yaml from 2.6.0-cuda12.2-1 to 2.9.0-cuda13.0-2.

Verify GPU usage.

While the PyTorch example is running, access the Pod to check GPU usage:

```
kubectl exec -it <YOUR-POD-NAME> -- bash
```

Then, run the nvidia-smi command. The expected output is similar to the following:

```
root@pytorch-examples-xlqg6:~/examples/word_language_model# nvidia-smi
Thu Jun  4 03:42:55 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA XXX                     On  |   00000000:00:08.0 Off |                    0 |
|  0%   52C    P0            147W /  150W |     513MiB /  23028MiB |     91%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A               1      C   python                                  506MiB |
+-----------------------------------------------------------------------------------------+
```

This output confirms that the GPU was allocated to the Pod, that the driver and CUDA are working, and that PyTorch is using the GPU for computation: the python process holds 506 MiB of GPU memory, and GPU utilization is 91%.

If you have node access permissions, you can also run nvidia-smi directly on the GPU node. For instructions, see Log on to a node.
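
For a scriptable reading of the utilization and memory figures, nvidia-smi can also emit selected fields as CSV. The following is a minimal sketch, assuming the query fields utilization.gpu, memory.used, and memory.total. The function name summarize_gpu is hypothetical, and the sample line reuses the values from the nvidia-smi output shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper: format one CSV line produced by
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
# into a short summary. Input fields are separated by ", ".
summarize_gpu() {
  awk -F', ' '{ printf "util=%s%% mem=%s/%sMiB\n", $1, $2, $3 }'
}

# Example using the values from the sample nvidia-smi output.
echo '91, 513, 23028' | summarize_gpu
```

While the Job is running, the input could come from kubectl exec <YOUR-POD-NAME> -- nvidia-smi with the query and format options listed in the comment, piped into summarize_gpu.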

Delete the job.
```
kubectl delete job pytorch-examples
```