Deploy and run GPU workloads on ContainerOS


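ContainerOS is Alibaba Cloud's official operating system optimized for container scenarios. It ships with a GPU driver and a container runtime, so it supports GPU workloads out of the box. This topic describes how to create a GPU node pool whose nodes run ContainerOS and deploy a sample workload to verify that GPU scheduling and execution work as expected.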

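Applicability

  • The node operating system is ContainerOS 3.7 or later. To upgrade ContainerOS, see Change the operating system.

  • The following instance families are not currently supported: ecs.gn5, ecs.gn5i, ecs.gn6v, ecs.gn6e, and ecs.gn8v-tee.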

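Step 1: Create a GPU node pool

  1. On the ACK Clusters page, click the name of the target cluster. In the left-side navigation pane of the cluster details page, choose Nodes > Node Pools.

  2. Click Create Node Pool and complete the configuration as prompted. The key settings are as follows: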

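    Configuration item: Instance-related configuration
    Description: Set Instance Configuration Method to Specify Instance Type.

      • Architecture: GPU-accelerated instances.

      • Instance type: Select instance types that fit your workload, for example ecs.gn7i-c8g1.2xlarge. To improve the scale-out success rate, select multiple instance types.

    Configuration item: Operating system
    Description: Select ContainerOS GPU 3.7.2.

    For details about all configuration items, see Create and manage node pools.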
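Once the nodes in the pool are Ready, it can help to confirm that they advertise GPU resources before deploying anything. The following is a minimal sketch, assuming kubectl is already configured for the cluster; the function name list_gpu_nodes is hypothetical and only used for illustration.

```shell
# list_gpu_nodes: print each node name with its allocatable nvidia.com/gpu count.
# Nodes that do not expose the GPU resource show <none> in the GPU column.
# The function name is illustrative and not part of any official tooling.
list_gpu_nodes() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Example (requires kubectl access to the cluster):
#   list_gpu_nodes
```

A node from the new pool that shows 1 or more in the GPU column can accept a Pod that requests nvidia.com/gpu.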

Step 2: Deploy a GPU workload and verify that it runs

  1. Create and deploy the workload.

    1. Create a file named pytorch-examples-job.yaml with the following content:

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: pytorch-examples
        labels:
          app: pytorch-examples
      spec:
        parallelism: 1                 # run one Pod at a time
        backoffLimit: 0                # do not retry if the Pod fails
        ttlSecondsAfterFinished: 3600  # delete the Job 1 hour after it finishes

        template:
          metadata:
            labels:
              app: pytorch-examples
          spec:
            restartPolicy: Never 
            containers:
            - name: py 
              image: registry-<region-id>-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1
              command:
              - python
              - main.py
              - --epochs
              - "1"
              - --batch_size
              - "8"
              - --accel
              env:
              - name: NVIDIA_DISABLE_REQUIRE
                value: "1"
              - name: PYTHONUNBUFFERED
                value: "1"
              resources:
                limits:
                  nvidia.com/gpu: 1
              workingDir: /root/examples/word_language_model
      
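      Note

      • Replace <region-id> with the ID of the supported region where the cluster resides, for example cn-hangzhou, cn-beijing, or cn-zhangjiakou.

      • The resources.limits field in the YAML declares that the container needs 1 GPU. The Kubernetes scheduler uses this value to place the Pod on a node that has an idle GPU.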
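The placeholder can also be filled in at apply time rather than by editing the file. The sketch below assumes the manifest still contains the literal <region-id> placeholder; the function name render_manifest is hypothetical.

```shell
# render_manifest REGION_ID FILE
# Write FILE to stdout with every <region-id> placeholder replaced by REGION_ID.
# The file on disk is left unchanged. The function name is illustrative only.
render_manifest() {
  local region_id="$1" manifest="$2"
  sed "s/<region-id>/${region_id}/g" "$manifest"
}

# Example: preview the rendered image line, then apply the result from stdin.
#   render_manifest cn-hangzhou pytorch-examples-job.yaml | grep 'image:'
#   render_manifest cn-hangzhou pytorch-examples-job.yaml | kubectl apply -f -
```

Keeping the placeholder in the file lets the same manifest be reused across regions.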

    2. Deploy the application.

      kubectl apply -f pytorch-examples-job.yaml
  2. Verify that the Pod is scheduled to a GPU node.

    If GPU resources are insufficient after deployment, the Pod stays in the Pending state.

    1. Check the Pod status.

      kubectl get pod -l app=pytorch-examples
    2. Check the Pod events.

      kubectl describe pod -l app=pytorch-examples

      The Events section is expected to look like this:

      Events:
        Type    Reason          Age   From               Message
        ----    ------          ----  ----               -------
        Normal  Scheduled       15s   default-scheduler  Successfully assigned default/pytorch-examples-m2qs7 to cn-hangzhou.10.61.65.169
        Normal  AllocIPSucceed  15s   terway-daemon      Alloc IP 10.61.65.172/16 took 39.976688ms
        Normal  Pulled          15s   kubelet            Container image "registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1" already present on machine and can be accessed by the pod
        Normal  Created         15s   kubelet            Container created
        Normal  Started         15s   kubelet            Container started
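The same check can be scripted by reading the Pod phase instead of scanning the events by eye. This is a rough sketch, assuming the Job's Pod carries the app=pytorch-examples label from the manifest above; the function names pod_phase and check_scheduling are hypothetical.

```shell
# pod_phase: print the phase (Pending, Running, Succeeded, ...) of the first
# Pod that matches the label used in the Job template.
pod_phase() {
  kubectl get pod -l app=pytorch-examples \
    -o jsonpath='{.items[0].status.phase}'
}

# check_scheduling: return 0 if the Pod has been scheduled and started,
# and 1 otherwise. A Pending Pod may still be pulling its image or may be
# waiting for a free GPU, so the events remain the place to look for the cause.
check_scheduling() {
  local phase
  phase="$(pod_phase)"
  case "$phase" in
    Running|Succeeded)
      echo "Pod was scheduled (phase: $phase)"
      ;;
    Pending)
      echo "Pod is Pending: check Events for image pull progress or insufficient nvidia.com/gpu"
      return 1
      ;;
    *)
      echo "Unexpected phase: $phase"
      return 1
      ;;
  esac
}

# Example (requires kubectl access to the cluster):
#   check_scheduling
```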
  3. Check the logs to verify that PyTorch is using the GPU.

    kubectl logs <YOUR-POD-NAME> -f

    Expected output:

    Using device: cuda
    | epoch   1 |   200/ 7459 batches | lr 20.00 | ms/batch  5.58 | loss  7.71 | ppl  2241.62
    | epoch   1 |   400/ 7459 batches | lr 20.00 | ms/batch  4.58 | loss  6.85 | ppl   946.49
    | epoch   1 |   600/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.54 | ppl   692.70
    | epoch   1 |   800/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.35 | ppl   574.73
    | epoch   1 |  1000/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.20 | ppl   494.75
    | epoch   1 |  1200/ 7459 batches | lr 20.00 | ms/batch  4.58 | loss  6.18 | ppl   482.76
    | epoch   1 |  1400/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  6.18 | ppl   483.99
    | epoch   1 |  1600/ 7459 batches | lr 20.00 | ms/batch  4.59 | loss  5.97 | ppl   390.43
    | epoch   1 |  1800/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.97 | ppl   390.31
    | epoch   1 |  2000/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.89 | ppl   361.39
    | epoch   1 |  2200/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.84 | ppl   344.48
    | epoch   1 |  2400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.80 | ppl   330.62
    | epoch   1 |  2600/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.86 | ppl   351.71
    | epoch   1 |  2800/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.79 | ppl   327.62
    | epoch   1 |  3000/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.66 | ppl   286.30
    | epoch   1 |  3200/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.75 | ppl   313.55
    | epoch   1 |  3400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.73 | ppl   306.97
    | epoch   1 |  3600/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.62 | ppl   274.96
    | epoch   1 |  3800/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.63 | ppl   279.61
    | epoch   1 |  4000/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.62 | ppl   274.83
    | epoch   1 |  4200/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.52 | ppl   248.50
    | epoch   1 |  4400/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.55 | ppl   256.37
    | epoch   1 |  4600/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.69 | ppl   297.25
    | epoch   1 |  4800/ 7459 batches | lr 20.00 | ms/batch  4.65 | loss  5.62 | ppl   275.78
    | epoch   1 |  5000/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.58 | ppl   265.67
    | epoch   1 |  5200/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  5.26 | ppl   191.98
    | epoch   1 |  5400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.50 | ppl   245.12
    | epoch   1 |  5600/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.57 | ppl   261.86
    | epoch   1 |  5800/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.45 | ppl   233.85
    | epoch   1 |  6000/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.43 | ppl   228.95
    | epoch   1 |  6200/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.39 | ppl   219.26
    | epoch   1 |  6400/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.47 | ppl   236.98
    | epoch   1 |  6600/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.48 | ppl   240.89
    | epoch   1 |  6800/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.43 | ppl   228.92
    | epoch   1 |  7000/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.39 | ppl   219.44
    | epoch   1 |  7200/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  5.30 | ppl   199.58
    | epoch   1 |  7400/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.40 | ppl   221.76
    -----------------------------------------------------------------------------------------
    | end of epoch   1 | time: 35.56s | valid loss  5.47 | valid ppl   237.61
    -----------------------------------------------------------------------------------------
    =========================================================================================
    | End of training | test loss  5.41 | test ppl   224.45
    =========================================================================================

    The line Using device: cuda at the start of the log indicates that the PyTorch GPU example ran on the GPU.
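For unattended runs, the log check can be reduced to two small filters. The sketch below assumes the log format shown in the expected output; the function names uses_cuda and final_test_ppl are hypothetical.

```shell
# uses_cuda: read training logs on stdin and succeed only if the example
# reported that it selected the CUDA device.
uses_cuda() {
  grep -q 'Using device: cuda'
}

# final_test_ppl: read training logs on stdin and print the test perplexity
# from the "End of training" summary line (its last field).
final_test_ppl() {
  awk '/End of training/ { print $NF }'
}

# Example: read the logs through the Job so the Pod name is not needed.
#   kubectl logs job/pytorch-examples | uses_cuda && echo "GPU in use"
#   kubectl logs job/pytorch-examples | final_test_ppl
```

Reading logs through job/pytorch-examples avoids looking up the generated Pod name first.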

    Note

    The current image does not cover every GPU model. If the Job fails to run, try changing the image tag in pytorch-examples-job.yaml from 2.6.0-cuda12.2-1 to 2.9.0-cuda13.0-2.
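Because the Pod template of an existing Job is immutable, switching the tag means deleting the Job and creating it again. The following sketch assumes the manifest already contains the real region ID; the function name switch_image_tag is hypothetical, and the tags are treated as sed patterns, which is adequate for these two values.

```shell
# switch_image_tag FILE OLD_TAG NEW_TAG
# Write FILE to stdout with the pytorch-examples image tag changed from
# OLD_TAG to NEW_TAG. The file on disk is left unchanged.
switch_image_tag() {
  local manifest="$1" old_tag="$2" new_tag="$3"
  sed "s/pytorch-examples:${old_tag}/pytorch-examples:${new_tag}/" "$manifest"
}

# Example: remove the failed Job, then recreate it with the other image tag.
#   kubectl delete job pytorch-examples
#   switch_image_tag pytorch-examples-job.yaml 2.6.0-cuda12.2-1 2.9.0-cuda13.0-2 \
#     | kubectl apply -f -
```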

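  4. Verify GPU usage.

    While the PyTorch GPU example is running, you can open a shell in the Pod to check GPU usage:

    kubectl exec -it <YOUR-POD-NAME> -- bash

    Then run the nvidia-smi command. The expected output is as follows: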

    root@pytorch-examples-xlqg6:~/examples/word_language_model# nvidia-smi
    Thu Jun  4 03:42:55 2026       
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA XXX                     On  |   00000000:00:08.0 Off |                    0 |
    |  0%   52C    P0            147W /  150W |     513MiB /  23028MiB |     91%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
                                                                                             
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A               1      C   python                                  506MiB |
    +-----------------------------------------------------------------------------------------+

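    The output above shows that the GPU was scheduled successfully, the driver and CUDA are working, and PyTorch is computing on the GPU, with GPU utilization at 91%.

    If you have access to the node, you can also run nvidia-smi directly on the corresponding GPU node. For login steps, see Log on to a node.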
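When only the utilization figure is needed, nvidia-smi can print it in a machine-readable form without an interactive shell. This is a minimal sketch, assuming the Pod is still running; the function names gpu_utilization and gpu_busy are hypothetical.

```shell
# gpu_utilization POD
# Print the GPU utilization in percent, one line per GPU visible in POD.
gpu_utilization() {
  kubectl exec "$1" -- nvidia-smi \
    --query-gpu=utilization.gpu --format=csv,noheader,nounits
}

# gpu_busy POD
# Succeed if at least one GPU in POD reports a utilization above 0.
# Non-numeric values are treated as 0 by the "+ 0" conversion.
gpu_busy() {
  gpu_utilization "$1" | awk '$1 + 0 > 0 { busy = 1 } END { exit !busy }'
}

# Example (replace the placeholder with the real Pod name):
#   gpu_busy <YOUR-POD-NAME> && echo "GPU is doing work"
```

A single sample can read 0 between batches, so polling a few times gives a more reliable picture than one call.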

  5. Clean up resources.

    kubectl delete job pytorch-examples