Run GPU training jobs on demand with an intelligent managed node pool

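In an ACK Auto Mode cluster, an intelligent managed node pool provisions GPU instances elastically: GPU resources are supplied as soon as a training job needs them, and GPU nodes are reclaimed automatically after the job ends, so you use resources on demand and pay nothing for idle nodes.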

Scope

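  • The node operating system must be ContainerOS 3.7 or later. To upgrade the ContainerOS version, see "Change the operating system".

  • The following instance families are not yet supported: ecs.gn5, ecs.gn5i, ecs.gn6v, ecs.gn6e, ecs.gn8v-tee, ebmgn9ge, ebmgn9gc.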
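
Before adding instance types to the node pool, it can help to check them against the unsupported list above. The following is a minimal sketch, not an official tool: the function name is_unsupported_gpu_type is hypothetical, and it assumes that instance type names follow the form ecs.<family>.<size> or ecs.<family>-<spec>.<size>, and that the ebmgn9ge and ebmgn9gc families carry the ecs. prefix.

```shell
# Hypothetical helper: returns 0 (true) when the given instance type belongs
# to one of the families listed above as unsupported, 1 (false) otherwise.
# Assumption: names look like ecs.<family>.<size> or ecs.<family>-<spec>.<size>.
is_unsupported_gpu_type() {
  case "$1" in
    ecs.gn5.*|ecs.gn5-*|ecs.gn5i.*|ecs.gn5i-*) return 0 ;;
    ecs.gn6v.*|ecs.gn6v-*|ecs.gn6e.*|ecs.gn6e-*) return 0 ;;
    ecs.gn8v-tee.*|ecs.gn8v-tee-*) return 0 ;;
    ecs.ebmgn9ge.*|ecs.ebmgn9gc.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   if is_unsupported_gpu_type ecs.gn7i-c8g1.2xlarge; then
#     echo "not supported by intelligent managed node pools"
#   fi
```

The patterns match on the family followed by "." or "-", so ecs.gn5i-... is not mistaken for ecs.gn5-... and vice versa.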

Step 1: Create a GPU intelligent managed node pool

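  1. On the ACK cluster list page, click the name of the target cluster. In the left navigation pane of the cluster details page, choose Node Management > Node Pools.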

  2. Click Create Node Pool and complete the configuration as prompted. The key settings are as follows:

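    Configuration item      Description

    Managed configuration   Select Intelligent managed.

    vSwitch                 When the node pool scales, nodes are added or removed in the zones of the selected vSwitches according to the scaling policy. For high availability, select vSwitches in two or more different zones.

    Instance settings       Set the instance configuration method to Specify Instance Type.
                            • Architecture: GPU cloud server
                            • Instance type: select instance types that suit your workload, for example ecs.gn7i-c8g1.2xlarge. To improve the scale-out success rate, select multiple instance types.

    For a description of all configuration items, see "Create and manage node pools".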

Step 2: Create and deploy the GPU training job

  1. Create a file named pytorch-examples-job.yaml with the following content:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: pytorch-examples
      labels:
        app: pytorch-examples
    spec:
      parallelism: 1
      backoffLimit: 0
      ttlSecondsAfterFinished: 3600
    
      template:
        metadata:
          labels:
            app: pytorch-examples
        spec:
          restartPolicy: Never
          containers:
          - name: py
            image: registry-<region-id>-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1
            command:
            - python
            - main.py
            - --epochs
            - "1"
            - --batch_size
            - "8"
            - --accel
            env:
            - name: NVIDIA_DISABLE_REQUIRE
              value: "1"
            - name: PYTHONUNBUFFERED
              value: "1"
            resources:
              limits:
                nvidia.com/gpu: 1
            workingDir: /root/examples/word_language_model
    
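    Note

    • Replace <region-id> with the ID of the region where the cluster resides (it must be a region in which the service is available), for example cn-hangzhou, cn-beijing, or cn-zhangjiakou.

    • The resources.limits field in the YAML declares that the container needs 1 GPU. The Kubernetes scheduler uses this declaration to place the Pod on a node with a free GPU.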

  2. Deploy the Job.

    kubectl apply -f pytorch-examples-job.yaml
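
The <region-id> substitution described in the note can also be done at apply time instead of by editing the file. The following is a minimal sketch under that assumption; render_manifest is a hypothetical helper name, and cn-beijing is only an example value.

```shell
# Hypothetical helper: reads a manifest on stdin and replaces every literal
# "<region-id>" placeholder with the region ID passed as the first argument.
render_manifest() {
  local region_id="$1"
  sed "s/<region-id>/${region_id}/g"
}

# Usage (requires kubectl configured for the target cluster):
#   render_manifest cn-beijing < pytorch-examples-job.yaml | kubectl apply -f -
```

Piping to kubectl apply -f - keeps the file on disk as a region-neutral template, which may be convenient when the same Job is deployed to clusters in several regions.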

Step 3: Verify the result

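  1. Verify that the Pod was scheduled onto a GPU node.

    After deployment, if no GPU is available in the cluster, the Pod stays in the Pending state for lack of GPU resources.

    1. Check the Pod status.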

      kubectl get pod -l app=pytorch-examples
    2. Check the Pod events.

      kubectl describe pod <YOUR-POD-NAME>

      The Events section is expected to look like this:

      Events:
        Type    Reason          Age   From               Message
        ----    ------          ----  ----               -------
        Normal  Scheduled       15s   default-scheduler  Successfully assigned default/pytorch-examples-***** to cn-beijing.10.61.65.169
        Normal  AllocIPSucceed  15s   terway-daemon      Alloc IP 10.61.65.172/16 took 39.976688ms
        Normal  Pulled          15s   kubelet            Container image "registry-cn-beijing-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1" already present on machine and can be accessed by the pod
        Normal  Created         15s   kubelet            Container created
        Normal  Started         15s   kubelet            Container started
  2. View the logs to verify that PyTorch is using the GPU.

    kubectl logs <YOUR-POD-NAME> -f

    Expected result:

    Using device: cuda
    | epoch   1 |   200/ 7459 batches | lr 20.00 | ms/batch  5.58 | loss  7.71 | ppl  2241.62
    | epoch   1 |   400/ 7459 batches | lr 20.00 | ms/batch  4.58 | loss  6.85 | ppl   946.49
    | epoch   1 |   600/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.54 | ppl   692.70
    | epoch   1 |   800/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.35 | ppl   574.73
    | epoch   1 |  1000/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.20 | ppl   494.75
    | epoch   1 |  1200/ 7459 batches | lr 20.00 | ms/batch  4.58 | loss  6.18 | ppl   482.76
    | epoch   1 |  1400/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  6.18 | ppl   483.99
    | epoch   1 |  1600/ 7459 batches | lr 20.00 | ms/batch  4.59 | loss  5.97 | ppl   390.43
    | epoch   1 |  1800/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.97 | ppl   390.31
    | epoch   1 |  2000/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.89 | ppl   361.39
    | epoch   1 |  2200/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.84 | ppl   344.48
    | epoch   1 |  2400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.80 | ppl   330.62
    | epoch   1 |  2600/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.86 | ppl   351.71
    | epoch   1 |  2800/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.79 | ppl   327.62
    | epoch   1 |  3000/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.66 | ppl   286.30
    | epoch   1 |  3200/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.75 | ppl   313.55
    | epoch   1 |  3400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.73 | ppl   306.97
    | epoch   1 |  3600/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.62 | ppl   274.96
    | epoch   1 |  3800/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.63 | ppl   279.61
    | epoch   1 |  4000/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.62 | ppl   274.83
    | epoch   1 |  4200/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.52 | ppl   248.50
    | epoch   1 |  4400/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.55 | ppl   256.37
    | epoch   1 |  4600/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.69 | ppl   297.25
    | epoch   1 |  4800/ 7459 batches | lr 20.00 | ms/batch  4.65 | loss  5.62 | ppl   275.78
    | epoch   1 |  5000/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.58 | ppl   265.67
    | epoch   1 |  5200/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  5.26 | ppl   191.98
    | epoch   1 |  5400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.50 | ppl   245.12
    | epoch   1 |  5600/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.57 | ppl   261.86
    | epoch   1 |  5800/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.45 | ppl   233.85
    | epoch   1 |  6000/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.43 | ppl   228.95
    | epoch   1 |  6200/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.39 | ppl   219.26
    | epoch   1 |  6400/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.47 | ppl   236.98
    | epoch   1 |  6600/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.48 | ppl   240.89
    | epoch   1 |  6800/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.43 | ppl   228.92
    | epoch   1 |  7000/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.39 | ppl   219.44
    | epoch   1 |  7200/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  5.30 | ppl   199.58
    | epoch   1 |  7400/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.40 | ppl   221.76
    -----------------------------------------------------------------------------------------
    | end of epoch   1 | time: 35.56s | valid loss  5.47 | valid ppl   237.61
    -----------------------------------------------------------------------------------------
    =========================================================================================
    | End of training | test loss  5.41 | test ppl   224.45
    =========================================================================================

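    The line Using device: cuda at the start of the log shows that the PyTorch example actually ran on the GPU.

    Note

    The existing images do not cover every GPU model. If the Job fails, try changing the image tag in pytorch-examples-job.yaml from 2.6.0-cuda12.2-1 to 2.9.0-cuda13.0-2.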

  3. Check GPU usage.

    While the PyTorch GPU example is running, you can open a shell in the Pod to check GPU usage:

    kubectl exec -it <YOUR-POD-NAME> -- bash

    Then run the nvidia-smi command. The expected result looks like this:

    Thu Jun 11 06:08:11 2026
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA XXX                     On  |   00000000:00:08.0 Off |                    0 |
    |  0%   52C    P0            147W /  150W |     513MiB /  23028MiB |     91%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A               1      C   python                                  506MiB |
    +-----------------------------------------------------------------------------------------+

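    The output above shows that the Pod was scheduled onto a GPU node, that the driver and CUDA are working, and that PyTorch is using the GPU for computation, with GPU utilization at 91%.

  4. Clean up resources.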

    kubectl delete job pytorch-examples
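
The checks in Step 3 can also be scripted, to be run before the cleanup command. The following is a minimal sketch: the function names and the 30m default timeout are assumptions chosen for illustration, while the label app=pytorch-examples and the Job name pytorch-examples come from the manifest in Step 2.

```shell
# Hypothetical helpers wrapping the kubectl commands used in Step 3.

job_pod_name() {
  # Print the name of the first Pod that carries the Job's label,
  # which can stand in for <YOUR-POD-NAME>.
  kubectl get pod -l app=pytorch-examples -o jsonpath='{.items[0].metadata.name}'
}

wait_for_job() {
  # Block until the Job reports the Complete condition or the timeout expires.
  # If the Job fails instead, this call only returns once the timeout is reached.
  kubectl wait --for=condition=complete job/pytorch-examples --timeout="${1:-30m}"
}

gpu_snapshot() {
  # Run nvidia-smi in the Pod without opening an interactive shell.
  # This works only while the training container is still running.
  kubectl exec "$(job_pod_name)" -- nvidia-smi
}

# Usage:
#   kubectl logs "$(job_pod_name)" -f   # follow the training log
#   gpu_snapshot                        # check GPU utilization during training
#   wait_for_job 45m                    # wait for completion before cleanup
```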