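ContainerOS is Alibaba Cloud's official operating system optimized for container workloads. It ships with a built-in GPU driver and container runtime, so GPU workloads run without extra setup. This topic describes how to verify that GPU scheduling and execution work correctly on nodes that run ContainerOS, by creating a node pool with GPU instance types and deploying a sample workload.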
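Scope
The node operating system is ContainerOS 3.7 or later. To upgrade ContainerOS, see Change the operating system.
The following instance families are not yet supported: ecs.gn5, ecs.gn5i, ecs.gn6v, ecs.gn6e, and ecs.gn8v-tee.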
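The two scope conditions above can be checked before a node pool is created. The following is a minimal sketch, not an official tool. It assumes that an instance type is written as `<family>.<size>` or `<family>-<variant>.<size>` and that the ContainerOS version is a dotted numeric string. The function names and the instance types used in the usage note are illustrative only.

```python
# Instance families listed as unsupported in the scope section.
UNSUPPORTED_FAMILIES = ("ecs.gn5", "ecs.gn5i", "ecs.gn6v", "ecs.gn6e", "ecs.gn8v-tee")
# Minimum ContainerOS version stated in the scope section.
MIN_CONTAINEROS_VERSION = (3, 7)


def is_unsupported_instance_type(instance_type: str) -> bool:
    """Return True if the instance type belongs to an unsupported family.

    Assumption: the part before the last dot is either the family itself
    or the family followed by "-<variant>".
    """
    prefix = instance_type.rsplit(".", 1)[0]  # drop the size suffix
    return any(
        prefix == family or prefix.startswith(family + "-")
        for family in UNSUPPORTED_FAMILIES
    )


def meets_min_version(version: str) -> bool:
    """Compare the major.minor part of a dotted version against 3.7."""
    major_minor = tuple(int(part) for part in version.split(".")[:2])
    return major_minor >= MIN_CONTAINEROS_VERSION
```

For example, `is_unsupported_instance_type("ecs.gn5-c4g1.xlarge")` returns `True`, while a type from a family that is not on the list returns `False`. Comparing versions as integer tuples avoids the string-comparison pitfall where "3.10" would sort before "3.7".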
Step 1: Create a GPU node pool
Step 2: Deploy a GPU workload and verify that it runs
Create and deploy the workload.
Create a file named pytorch-examples-job.yaml with the following content:

apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-examples
  labels:
    app: pytorch-examples
spec:
  parallelism: 1
  backoffLimit: 0
  ttlSecondsAfterFinished: 3600
  template:
    metadata:
      labels:
        app: pytorch-examples
    spec:
      restartPolicy: Never
      containers:
        - name: py
          # Replace <region-id> with the region ID of your cluster.
          image: registry-<region-id>-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1
          command:
            - python
            - main.py
            - --epochs
            - "1"
            - --batch_size
            - "8"
            - --accel
          env:
            - name: NVIDIA_DISABLE_REQUIRE
              value: "1"
            - name: PYTHONUNBUFFERED
              value: "1"
          resources:
            limits:
              # Request one GPU for this container.
              nvidia.com/gpu: 1
          workingDir: /root/examples/word_language_model

Note: Replace <region-id> with the ID of the region where your cluster is deployed, for example cn-hangzhou, cn-beijing, or cn-zhangjiakou.
The resources.limits field in the YAML declares that the container needs one GPU. The Kubernetes scheduler uses this declaration to place the Pod on a node that has a free GPU.
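The `<region-id>` substitution described in the note can be scripted instead of edited by hand. The sketch below is one possible way to do it, assuming the manifest still contains the literal placeholder `<region-id>`. The function name `render_manifest` is hypothetical and not part of any ACK tooling.

```python
PLACEHOLDER = "<region-id>"
# Image reference as it appears in pytorch-examples-job.yaml.
IMAGE_TEMPLATE = (
    "registry-<region-id>-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1"
)


def render_manifest(template_text: str, region_id: str) -> str:
    """Replace every <region-id> placeholder with the given region ID."""
    if not region_id or "<" in region_id or ">" in region_id:
        raise ValueError(f"not a usable region ID: {region_id!r}")
    if PLACEHOLDER not in template_text:
        raise ValueError("template does not contain the <region-id> placeholder")
    return template_text.replace(PLACEHOLDER, region_id)


# Possible usage (file names follow the step above):
#   from pathlib import Path
#   path = Path("pytorch-examples-job.yaml")
#   path.write_text(render_manifest(path.read_text(), "cn-hangzhou"))
```

Raising an error when the placeholder is missing makes a second, accidental run on an already rendered file fail loudly instead of silently doing nothing.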
Deploy the application.
kubectl apply -f pytorch-examples-job.yaml
Verify that the Pod is scheduled to a GPU node.
If the cluster does not have enough free GPUs after you deploy the Job, the Pod stays in the Pending state.
Check the Pod status.
kubectl get pod -l app=pytorch-examples
View the Pod events.
kubectl describe pod -l app=pytorch-examples
The Events section is expected to look like this:

Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       15s   default-scheduler  Successfully assigned default/pytorch-examples-m2qs7 to cn-hangzhou.10.61.65.169
  Normal  AllocIPSucceed  15s   terway-daemon      Alloc IP 10.61.65.172/16 took 39.976688ms
  Normal  Pulled          15s   kubelet            Container image "registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1" already present on machine and can be accessed by the pod
  Normal  Created         15s   kubelet            Container created
  Normal  Started         15s   kubelet            Container started
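The scheduling check above can also be done programmatically by reading the Pod phase and assigned node from `kubectl get pod -o json`. This is a rough sketch under the assumption that `kubectl` is on the PATH and configured for the cluster. The helper names `summarize_pods` and `fetch_pods` are made up for this example.

```python
import json
import subprocess


def summarize_pods(pod_list: dict) -> list:
    """Return (name, phase, node) for each Pod in a `kubectl get pod -o json` result.

    The node is an empty string while the Pod has not been scheduled yet,
    which is what a Pending Pod waiting for a GPU looks like.
    """
    rows = []
    for item in pod_list.get("items", []):
        name = item["metadata"]["name"]
        phase = item.get("status", {}).get("phase", "Unknown")
        node = item.get("spec", {}).get("nodeName", "")
        rows.append((name, phase, node))
    return rows


def fetch_pods(selector: str = "app=pytorch-examples") -> dict:
    """Run kubectl with the same label selector as the step above."""
    result = subprocess.run(
        ["kubectl", "get", "pod", "-l", selector, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `summarize_pods(fetch_pods())` against a live cluster yields one tuple per Pod. A Pod that reports phase `Pending` with an empty node name has not been placed, and `kubectl describe pod` remains the place to look for the reason.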
View the logs to verify that PyTorch is using the GPU.
kubectl logs <YOUR-POD-NAME> -f
Expected output:

Using device: cuda
| epoch 1 | 200/ 7459 batches | lr 20.00 | ms/batch 5.58 | loss 7.71 | ppl 2241.62
| epoch 1 | 400/ 7459 batches | lr 20.00 | ms/batch 4.58 | loss 6.85 | ppl 946.49
| epoch 1 | 600/ 7459 batches | lr 20.00 | ms/batch 4.57 | loss 6.54 | ppl 692.70
| epoch 1 | 800/ 7459 batches | lr 20.00 | ms/batch 4.57 | loss 6.35 | ppl 574.73
| epoch 1 | 1000/ 7459 batches | lr 20.00 | ms/batch 4.57 | loss 6.20 | ppl 494.75
| epoch 1 | 1200/ 7459 batches | lr 20.00 | ms/batch 4.58 | loss 6.18 | ppl 482.76
| epoch 1 | 1400/ 7459 batches | lr 20.00 | ms/batch 4.60 | loss 6.18 | ppl 483.99
| epoch 1 | 1600/ 7459 batches | lr 20.00 | ms/batch 4.59 | loss 5.97 | ppl 390.43
| epoch 1 | 1800/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.97 | ppl 390.31
| epoch 1 | 2000/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.89 | ppl 361.39
| epoch 1 | 2200/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.84 | ppl 344.48
| epoch 1 | 2400/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.80 | ppl 330.62
| epoch 1 | 2600/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.86 | ppl 351.71
| epoch 1 | 2800/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.79 | ppl 327.62
| epoch 1 | 3000/ 7459 batches | lr 20.00 | ms/batch 4.63 | loss 5.66 | ppl 286.30
| epoch 1 | 3200/ 7459 batches | lr 20.00 | ms/batch 4.64 | loss 5.75 | ppl 313.55
| epoch 1 | 3400/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.73 | ppl 306.97
| epoch 1 | 3600/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.62 | ppl 274.96
| epoch 1 | 3800/ 7459 batches | lr 20.00 | ms/batch 4.63 | loss 5.63 | ppl 279.61
| epoch 1 | 4000/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.62 | ppl 274.83
| epoch 1 | 4200/ 7459 batches | lr 20.00 | ms/batch 4.63 | loss 5.52 | ppl 248.50
| epoch 1 | 4400/ 7459 batches | lr 20.00 | ms/batch 4.63 | loss 5.55 | ppl 256.37
| epoch 1 | 4600/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.69 | ppl 297.25
| epoch 1 | 4800/ 7459 batches | lr 20.00 | ms/batch 4.65 | loss 5.62 | ppl 275.78
| epoch 1 | 5000/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.58 | ppl 265.67
| epoch 1 | 5200/ 7459 batches | lr 20.00 | ms/batch 4.60 | loss 5.26 | ppl 191.98
| epoch 1 | 5400/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.50 | ppl 245.12
| epoch 1 | 5600/ 7459 batches | lr 20.00 | ms/batch 4.62 | loss 5.57 | ppl 261.86
| epoch 1 | 5800/ 7459 batches | lr 20.00 | ms/batch 4.63 | loss 5.45 | ppl 233.85
| epoch 1 | 6000/ 7459 batches | lr 20.00 | ms/batch 4.64 | loss 5.43 | ppl 228.95
| epoch 1 | 6200/ 7459 batches | lr 20.00 | ms/batch 4.64 | loss 5.39 | ppl 219.26
| epoch 1 | 6400/ 7459 batches | lr 20.00 | ms/batch 4.64 | loss 5.47 | ppl 236.98
| epoch 1 | 6600/ 7459 batches | lr 20.00 | ms/batch 4.64 | loss 5.48 | ppl 240.89
| epoch 1 | 6800/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.43 | ppl 228.92
| epoch 1 | 7000/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.39 | ppl 219.44
| epoch 1 | 7200/ 7459 batches | lr 20.00 | ms/batch 4.60 | loss 5.30 | ppl 199.58
| epoch 1 | 7400/ 7459 batches | lr 20.00 | ms/batch 4.61 | loss 5.40 | ppl 221.76
-----------------------------------------------------------------------------------------
| end of epoch 1 | time: 35.56s | valid loss 5.47 | valid ppl 237.61
-----------------------------------------------------------------------------------------
=========================================================================================
| End of training | test loss 5.41 | test ppl 224.45
=========================================================================================

The first log line, Using device: cuda, shows that the PyTorch example is actually running on the GPU.
Note: The current image does not cover every GPU model. If the Job fails, try changing the image tag in pytorch-examples-job.yaml from 2.6.0-cuda12.2-1 to 2.9.0-cuda13.0-2.
Verify GPU usage.
While the PyTorch GPU example is running, you can open a shell in the Pod to check GPU usage:
kubectl exec -it <YOUR-POD-NAME> -- bash
Then run the nvidia-smi command. Expected output:

root@pytorch-examples-xlqg6:~/examples/word_language_model# nvidia-smi
Thu Jun 4 03:42:55 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA XXX                     On  |   00000000:00:08.0 Off |                    0 |
|  0%   52C    P0            147W /  150W |     513MiB /  23028MiB |     91%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A               1      C   python                                  506MiB |
+-----------------------------------------------------------------------------------------+

This output shows that the GPU was scheduled successfully, that the driver and CUDA are working, and that PyTorch is computing on the GPU: the python process holds 506 MiB of GPU memory and GPU utilization is 91%.
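If the utilization figure needs to be read by a script rather than by eye, nvidia-smi can print selected fields as CSV through its `--query-gpu` and `--format` options. The sketch below parses that CSV. It assumes the command is run as `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`, for example through `kubectl exec <YOUR-POD-NAME> -- ...`, and the function name `parse_gpu_csv` is hypothetical.

```python
def parse_gpu_csv(csv_text: str) -> list:
    """Parse `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total
    --format=csv,noheader,nounits` output into one dict per GPU.

    Assumption: every field is numeric. A field reported as "[N/A]" would
    raise ValueError here and needs separate handling.
    """
    gpus = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return gpus


def any_gpu_busy(gpus: list, min_util_pct: int = 1) -> bool:
    """Return True if at least one GPU reports utilization at or above the threshold."""
    return any(gpu["util_pct"] >= min_util_pct for gpu in gpus)
```

For the values shown in the expected output above, the CSV line would be `91, 513, 23028`, and `any_gpu_busy` returns `True`. The threshold default of 1% is an arbitrary choice for this sketch and may need tuning for short or bursty workloads.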
If you have access to the node, you can also run nvidia-smi directly on the GPU node. For login steps, see Node login.
Clean up resources.
kubectl delete job pytorch-examples