Run GPU workloads on ContainerOS

ContainerOS is an Alibaba Cloud operating system optimized for containerized environments. It includes built-in GPU drivers and a container runtime, providing out-of-the-box support for GPU workloads. This topic shows you how to create a node pool of GPU instances using ContainerOS and deploy a sample workload to verify that GPU scheduling and execution work correctly.

Usage notes

  • Nodes must use ContainerOS 3.7 or later. To upgrade the ContainerOS version, see Change an operating system.

  • The following instance types are not supported: ecs.gn5, ecs.gn5i, ecs.gn6v, ecs.gn6e, and ecs.gn8v-tee.

Step 1: Create a GPU node pool

  1. In the Container Service for Kubernetes console, go to the Clusters page and click the name of your cluster. In the left navigation pane, choose Node Management > Node Pools.

  2. Click Create Node Pool and configure the following key parameters:

    Parameter                Description
    Instance configuration   Set Instance Configuration Method to Specify Instance Types.
                             • Architecture: GPU instance.
                             • Instance Type: Select instance types that meet your business requirements, such as ecs.gn7i-c8g1.2xlarge. We recommend that you select multiple instance types to improve the success rate of scale-out.
    Operating system         Select ContainerOS GPU 3.7.2.

    For more information, see Create and manage node pools.

Step 2: Deploy and verify a GPU workload

  1. Create and deploy the workload.

    1. Create a file named pytorch-examples-job.yaml with the following content:

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: pytorch-examples
        labels:
          app: pytorch-examples
      spec:
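        parallelism: 1                 # Run one Pod at a time.
        backoffLimit: 0                # Do not retry if the Pod fails.
        ttlSecondsAfterFinished: 3600  # Delete the Job 1 hour after it finishes.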
        template:
          metadata:
            labels:
              app: pytorch-examples
          spec:
            restartPolicy: Never 
            containers:
            - name: py 
              image: registry-<region-id>-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1
              command:
              - python
              - main.py
              - --epochs
              - "1"
              - --batch_size
              - "8"
              - --accel
              env:
              - name: NVIDIA_DISABLE_REQUIRE
                value: "1"
              - name: PYTHONUNBUFFERED
                value: "1"
              resources:
                limits:
                  nvidia.com/gpu: 1
              workingDir: /root/examples/word_language_model
      
      Note

      • Replace <region-id> with the region ID of your cluster, such as cn-hangzhou, cn-beijing, or cn-zhangjiakou.

      • The nvidia.com/gpu: 1 entry under resources.limits requests one GPU for the container. The Kubernetes scheduler uses this request to place the Pod on a node that has an available GPU.

    2. Deploy the Job.

      kubectl apply -f pytorch-examples-job.yaml
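The placeholder replacement and the deployment described above can also be scripted. The following is a minimal sketch, not part of the official procedure: the function name fill_region is hypothetical, and it assumes the manifest still contains the literal <region-id> placeholder.

```shell
#!/bin/sh
# fill_region REGION_ID
# Hypothetical helper: reads a manifest on stdin and writes it to stdout
# with every literal "<region-id>" replaced by REGION_ID.
fill_region() {
  region="$1"
  # Accept only lowercase letters, digits, and hyphens, so that a typo
  # cannot silently produce a broken image URL.
  case "$region" in
    ''|*[!a-z0-9-]*) echo "invalid region ID: '$region'" >&2; return 1 ;;
  esac
  sed "s/<region-id>/$region/g"
}

# Example usage (requires kubectl access to the cluster):
#   fill_region cn-hangzhou < pytorch-examples-job.yaml | kubectl apply -f -
```

Piping the result to kubectl apply -f - leaves the template file untouched, so the same file can be reused for clusters in other regions.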
  2. Verify that the Pod is scheduled to a GPU node.

    If GPU resources are insufficient after deployment, the Pod remains in the Pending state.

    1. Check the Pod status.

      kubectl get pod -l app=pytorch-examples
    2. Check the Pod events.

      kubectl describe pod -l app=pytorch-examples

      The output under Events: should be similar to the following:

      Events:
        Type    Reason          Age   From               Message
        ----    ------          ----  ----               -------
        Normal  Scheduled       15s   default-scheduler  Successfully assigned default/pytorch-examples-m2qs7 to cn-hangzhou.10.61.65.169
        Normal  AllocIPSucceed  15s   terway-daemon      Alloc IP 10.61.65.172/16 took 39.976688ms
        Normal  Pulled          15s   kubelet            Container image "registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1" already present on machine and can be accessed by the pod
        Normal  Created         15s   kubelet            Container created
        Normal  Started         15s   kubelet            Container started
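To pull the node name out of the Scheduled event without reading the table by eye, something like the following could be used. This is a sketch that assumes the event message keeps the form "Successfully assigned <namespace>/<pod> to <node>" shown in the sample output; the function name scheduled_node is hypothetical.

```shell
#!/bin/sh
# scheduled_node
# Hypothetical helper: reads `kubectl describe pod` output on stdin and prints
# the node named in the first "Scheduled" event. Returns non-zero if no such
# event is found (for example, while the Pod is still Pending).
scheduled_node() {
  awk '$1 == "Normal" && $2 == "Scheduled" && $(NF-1) == "to" { print $NF; found = 1; exit }
       END { exit !found }'
}

# Example usage (requires kubectl access to the cluster):
#   node=$(kubectl describe pod -l app=pytorch-examples | scheduled_node)
#   kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```

The second command in the usage comment prints the number of GPUs that the node advertises to the scheduler, which helps when a Pod stays in the Pending state.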
  3. Check the logs to verify that PyTorch is using the GPU.

    kubectl logs <YOUR-POD-NAME> -f

    Expected output:

    Using device: cuda
    | epoch   1 |   200/ 7459 batches | lr 20.00 | ms/batch  5.58 | loss  7.71 | ppl  2241.62
    | epoch   1 |   400/ 7459 batches | lr 20.00 | ms/batch  4.58 | loss  6.85 | ppl   946.49
    | epoch   1 |   600/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.54 | ppl   692.70
    | epoch   1 |   800/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.35 | ppl   574.73
    | epoch   1 |  1000/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.20 | ppl   494.75
    | epoch   1 |  1200/ 7459 batches | lr 20.00 | ms/batch  4.58 | loss  6.18 | ppl   482.76
    | epoch   1 |  1400/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  6.18 | ppl   483.99
    | epoch   1 |  1600/ 7459 batches | lr 20.00 | ms/batch  4.59 | loss  5.97 | ppl   390.43
    | epoch   1 |  1800/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.97 | ppl   390.31
    | epoch   1 |  2000/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.89 | ppl   361.39
    | epoch   1 |  2200/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.84 | ppl   344.48
    | epoch   1 |  2400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.80 | ppl   330.62
    | epoch   1 |  2600/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.86 | ppl   351.71
    | epoch   1 |  2800/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.79 | ppl   327.62
    | epoch   1 |  3000/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.66 | ppl   286.30
    | epoch   1 |  3200/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.75 | ppl   313.55
    | epoch   1 |  3400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.73 | ppl   306.97
    | epoch   1 |  3600/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.62 | ppl   274.96
    | epoch   1 |  3800/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.63 | ppl   279.61
    | epoch   1 |  4000/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.62 | ppl   274.83
    | epoch   1 |  4200/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.52 | ppl   248.50
    | epoch   1 |  4400/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.55 | ppl   256.37
    | epoch   1 |  4600/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.69 | ppl   297.25
    | epoch   1 |  4800/ 7459 batches | lr 20.00 | ms/batch  4.65 | loss  5.62 | ppl   275.78
    | epoch   1 |  5000/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.58 | ppl   265.67
    | epoch   1 |  5200/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  5.26 | ppl   191.98
    | epoch   1 |  5400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.50 | ppl   245.12
    | epoch   1 |  5600/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.57 | ppl   261.86
    | epoch   1 |  5800/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.45 | ppl   233.85
    | epoch   1 |  6000/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.43 | ppl   228.95
    | epoch   1 |  6200/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.39 | ppl   219.26
    | epoch   1 |  6400/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.47 | ppl   236.98
    | epoch   1 |  6600/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.48 | ppl   240.89
    | epoch   1 |  6800/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.43 | ppl   228.92
    | epoch   1 |  7000/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.39 | ppl   219.44
    | epoch   1 |  7200/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  5.30 | ppl   199.58
    | epoch   1 |  7400/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.40 | ppl   221.76
    -----------------------------------------------------------------------------------------
    | end of epoch   1 | time: 35.56s | valid loss  5.47 | valid ppl   237.61
    -----------------------------------------------------------------------------------------
    =========================================================================================
    | End of training | test loss  5.41 | test ppl   224.45
    =========================================================================================

    The log output starts with Using device: cuda, which indicates that PyTorch is running the example on the GPU rather than on the CPU.

    Note

    The current image may not support all GPU models. If the job fails, try changing the image tag in pytorch-examples-job.yaml from 2.6.0-cuda12.2-1 to 2.9.0-cuda13.0-2.
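The log check in this step can be automated. The sketch below assumes the example prints the line "Using device: cuda" exactly as in the expected output; the function name uses_cuda is hypothetical. Reading logs through job/pytorch-examples avoids looking up the generated Pod name.

```shell
#!/bin/sh
# uses_cuda
# Hypothetical helper: reads the Job log on stdin and succeeds only if some
# line reads exactly "Using device: cuda".
uses_cuda() {
  grep -qx 'Using device: cuda'
}

# Example usage (requires kubectl access to the cluster):
#   kubectl logs job/pytorch-examples | uses_cuda && echo "running on GPU"
```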

  4. Verify GPU usage.

    While the PyTorch example is running, open a shell in the Pod to check GPU usage:

    kubectl exec -it <YOUR-POD-NAME> -- bash

    Then, run the nvidia-smi command. The expected output is similar to the following:

    root@pytorch-examples-xlqg6:~/examples/word_language_model# nvidia-smi
    Thu Jun  4 03:42:55 2026       
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA XXX                     On  |   00000000:00:08.0 Off |                    0 |
    |  0%   52C    P0            147W /  150W |     513MiB /  23028MiB |     91%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
                                                                                             
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A               1      C   python                                  506MiB |
    +-----------------------------------------------------------------------------------------+

    This output confirms that the GPU is visible inside the Pod, that the driver and CUDA are working, and that PyTorch is using the GPU for computation: the python process holds 506 MiB of GPU memory, and GPU utilization is 91%.

    If you have node access permissions, you can also run nvidia-smi directly on the GPU node. For instructions, see Log on to a node.
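For a scripted check, nvidia-smi can print selected fields as CSV instead of the full table. The following sketch assumes that nvidia-smi is on the PATH inside the container, as in the session above; the function name gpu_is_busy and the threshold value are hypothetical.

```shell
#!/bin/sh
# gpu_is_busy [MIN_UTIL]
# Hypothetical helper: reads lines of "utilization, memory.used, memory.total"
# (CSV without header or units) on stdin, prints one summary line per GPU, and
# succeeds if any GPU reports a utilization of at least MIN_UTIL percent
# (default: 1).
gpu_is_busy() {
  awk -F', *' -v min="${1:-1}" '
    { printf "GPU %d: util=%s%% mem=%s/%s MiB\n", NR - 1, $1, $2, $3 }
    $1 + 0 >= min + 0 { busy = 1 }
    END { exit !busy }'
}

# Example usage (requires kubectl access; replace <YOUR-POD-NAME>):
#   kubectl exec <YOUR-POD-NAME> -- nvidia-smi \
#     --query-gpu=utilization.gpu,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_is_busy 50
```

The GPU number in the summary is the line position in the CSV output, which is assumed here to follow the GPU index order.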

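  5. Delete the Job. If you skip this step, the Job is removed automatically one hour after it finishes because ttlSecondsAfterFinished is set to 3600.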

    kubectl delete job pytorch-examples