Run on-demand GPU training jobs


In an ACK Auto Mode cluster, an intelligent managed node pool provides elastic provisioning of GPU instances. It ensures just-in-time GPU resource supply for training jobs and automatically releases the GPU nodes after jobs are finished, enabling on-demand usage with zero idle costs.

Prerequisites

  • The node pool requires ContainerOS 3.7 or later. To upgrade the ContainerOS version, see Change the operating system.

  • The following instance types are not supported: ecs.gn5, ecs.gn5i, ecs.gn6v, ecs.gn6e, ecs.gn8v-tee, ebmgn9ge, and ebmgn9gc.

Step 1: Create a GPU intelligent managed node pool

  1. On the ACK Cluster List page, click the name of your target cluster. In the left-side navigation pane, choose Nodes > Node Pools.

  2. Click Create Node Pool and configure the following key parameters:

    • Managed type: Select intelligent hosting.

    • vSwitch: During scaling events, the node pool adds or removes nodes in the availability zones of the selected vSwitches according to the Scaling Policy. For high availability, we recommend that you select vSwitches in two or more different availability zones.

    • Instance configurations: Set Instance configuration method to Specify instance types.

      • Architecture: GPU-accelerated cloud server.

      • Instance types: Select instance types that meet your needs, such as ecs.gn7i-c8g1.2xlarge. To ensure successful scaling, we recommend selecting multiple instance types.

    For more information about the parameters, see Create and manage node pools.

Step 2: Create and deploy a GPU training job

  1. Create a file named pytorch-examples-job.yaml with the following content:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: pytorch-examples
      labels:
        app: pytorch-examples
    spec:
      parallelism: 1
      backoffLimit: 0
      ttlSecondsAfterFinished: 3600
    
      template:
        metadata:
          labels:
            app: pytorch-examples
        spec:
          restartPolicy: Never
          containers:
          - name: py
            image: registry-<region-id>-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1
            command:
            - python
            - main.py
            - --epochs
            - "1"
            - --batch_size
            - "8"
            - --accel
            env:
            - name: NVIDIA_DISABLE_REQUIRE
              value: "1"
            - name: PYTHONUNBUFFERED
              value: "1"
            resources:
              limits:
                nvidia.com/gpu: 1
            workingDir: /root/examples/word_language_model
    
    Note

    • Replace <region-id> with the ID of the region where your cluster is located, such as cn-hangzhou, cn-beijing, or cn-zhangjiakou.

    • The resources.limits field in the YAML file specifies that the container requires one GPU (nvidia.com/gpu: 1). The Kubernetes scheduler uses this information to assign the Pod to a suitable node.
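
The <region-id> substitution described in the note can also be scripted. The following is a minimal sketch, assuming a POSIX shell with sed available. The helper name fill_region is hypothetical, and cn-hangzhou is only a sample value.

```shell
# REGION_ID is a sample value; set it to the region of your own cluster.
REGION_ID="cn-hangzhou"

# Replace every <region-id> placeholder read from stdin and print the result.
# fill_region is a hypothetical helper name used only for this illustration.
fill_region() {
  sed "s/<region-id>/$1/g"
}

# Demonstration on the image line from the manifest.
echo 'image: registry-<region-id>-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1' \
  | fill_region "$REGION_ID"
```

To render the whole manifest, one option is to run fill_region cn-hangzhou < pytorch-examples-job.yaml > pytorch-examples-job.rendered.yaml and apply the rendered file instead. The output filename is illustrative.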

  2. Deploy the Job.

    kubectl apply -f pytorch-examples-job.yaml

Step 3: Verify the results

  1. Verify that the Pod is scheduled to a GPU node.

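    After deployment, if no node in the cluster has a free GPU, the Pod stays in the Pending state while the node pool provisions a GPU node. The Pod is scheduled once the new node is ready.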

    1. Check the Pod status.

      kubectl get pod -l app=pytorch-examples

    2. Check the Pod events.

      kubectl describe pod <YOUR-POD-NAME>

      In the Events section, the expected output is similar to the following:

      Events:
        Type    Reason          Age   From               Message
        ----    ------          ----  ----               -------
        Normal  Scheduled       15s   default-scheduler  Successfully assigned default/pytorch-examples-***** to cn-beijing.10.61.65.169
        Normal  AllocIPSucceed  15s   terway-daemon      Alloc IP 10.61.65.172/16 took 39.976688ms
        Normal  Pulled          15s   kubelet            Container image "registry-cn-beijing-vpc.ack.aliyuncs.com/acs/pytorch-examples:2.6.0-cuda12.2-1" already present on machine and can be accessed by the pod
        Normal  Created         15s   kubelet            Container created
        Normal  Started         15s   kubelet            Container started
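
Because the Pod can remain Pending while a GPU node is being provisioned, it may be convenient to poll for the Pod phase instead of re-running the status command by hand. The sketch below assumes a POSIX shell. wait_for_output and POLL_INTERVAL are hypothetical names introduced for this example and are not part of kubectl.

```shell
# Re-run a command until its output equals the wanted value.
# Usage: wait_for_output <wanted> <max-attempts> <command> [args...]
# wait_for_output and POLL_INTERVAL are hypothetical names for this sketch.
wait_for_output() {
  want=$1
  attempts=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Errors from the polled command are ignored and treated as "no match yet".
    if [ "$("$@" 2>/dev/null)" = "$want" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  return 1
}

# Example (needs cluster access, so it is left commented out):
# wait_for_output Running 120 \
#   kubectl get pod <YOUR-POD-NAME> -o jsonpath='{.status.phase}'
```

With the default 5-second interval, 120 attempts wait for about 10 minutes. A Pod that starts and finishes between two polls would be missed, so treat this as a convenience rather than a guarantee.
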
  2. Check the logs to verify that PyTorch is using the GPU.

    kubectl logs <YOUR-POD-NAME> -f

    Expected output:

    Using device: cuda
    | epoch   1 |   200/ 7459 batches | lr 20.00 | ms/batch  5.58 | loss  7.71 | ppl  2241.62
    | epoch   1 |   400/ 7459 batches | lr 20.00 | ms/batch  4.58 | loss  6.85 | ppl   946.49
    | epoch   1 |   600/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.54 | ppl   692.70
    | epoch   1 |   800/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.35 | ppl   574.73
    | epoch   1 |  1000/ 7459 batches | lr 20.00 | ms/batch  4.57 | loss  6.20 | ppl   494.75
    | epoch   1 |  1200/ 7459 batches | lr 20.00 | ms/batch  4.58 | loss  6.18 | ppl   482.76
    | epoch   1 |  1400/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  6.18 | ppl   483.99
    | epoch   1 |  1600/ 7459 batches | lr 20.00 | ms/batch  4.59 | loss  5.97 | ppl   390.43
    | epoch   1 |  1800/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.97 | ppl   390.31
    | epoch   1 |  2000/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.89 | ppl   361.39
    | epoch   1 |  2200/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.84 | ppl   344.48
    | epoch   1 |  2400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.80 | ppl   330.62
    | epoch   1 |  2600/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.86 | ppl   351.71
    | epoch   1 |  2800/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.79 | ppl   327.62
    | epoch   1 |  3000/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.66 | ppl   286.30
    | epoch   1 |  3200/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.75 | ppl   313.55
    | epoch   1 |  3400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.73 | ppl   306.97
    | epoch   1 |  3600/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.62 | ppl   274.96
    | epoch   1 |  3800/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.63 | ppl   279.61
    | epoch   1 |  4000/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.62 | ppl   274.83
    | epoch   1 |  4200/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.52 | ppl   248.50
    | epoch   1 |  4400/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.55 | ppl   256.37
    | epoch   1 |  4600/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.69 | ppl   297.25
    | epoch   1 |  4800/ 7459 batches | lr 20.00 | ms/batch  4.65 | loss  5.62 | ppl   275.78
    | epoch   1 |  5000/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.58 | ppl   265.67
    | epoch   1 |  5200/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  5.26 | ppl   191.98
    | epoch   1 |  5400/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.50 | ppl   245.12
    | epoch   1 |  5600/ 7459 batches | lr 20.00 | ms/batch  4.62 | loss  5.57 | ppl   261.86
    | epoch   1 |  5800/ 7459 batches | lr 20.00 | ms/batch  4.63 | loss  5.45 | ppl   233.85
    | epoch   1 |  6000/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.43 | ppl   228.95
    | epoch   1 |  6200/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.39 | ppl   219.26
    | epoch   1 |  6400/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.47 | ppl   236.98
    | epoch   1 |  6600/ 7459 batches | lr 20.00 | ms/batch  4.64 | loss  5.48 | ppl   240.89
    | epoch   1 |  6800/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.43 | ppl   228.92
    | epoch   1 |  7000/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.39 | ppl   219.44
    | epoch   1 |  7200/ 7459 batches | lr 20.00 | ms/batch  4.60 | loss  5.30 | ppl   199.58
    | epoch   1 |  7400/ 7459 batches | lr 20.00 | ms/batch  4.61 | loss  5.40 | ppl   221.76
    -----------------------------------------------------------------------------------------
    | end of epoch   1 | time: 35.56s | valid loss  5.47 | valid ppl   237.61
    -----------------------------------------------------------------------------------------
    =========================================================================================
    | End of training | test loss  5.41 | test ppl   224.45
    =========================================================================================

    The log output begins with Using device: cuda, confirming that the PyTorch job is running on the GPU.

    Note

    The image may not support all GPU models. If the Job fails, try changing the image tag in pytorch-examples-job.yaml from 2.6.0-cuda12.2-1 to 2.9.0-cuda13.0-2.
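
The log check in this step can be automated with standard text tools. The following is a rough sketch, assuming the log format shown in the expected output. uses_cuda and final_test_ppl are hypothetical helper names.

```shell
# Succeed only if the log contains the line that reports the CUDA device.
uses_cuda() {
  grep -q '^Using device: cuda'
}

# Print the final test perplexity, the last field of the "End of training" line.
final_test_ppl() {
  awk '/End of training/ { print $NF }'
}

# Demonstration on two lines taken from the expected output.
sample='Using device: cuda
| End of training | test loss  5.41 | test ppl   224.45'
printf '%s\n' "$sample" | uses_cuda && echo "GPU in use"
printf '%s\n' "$sample" | final_test_ppl
```

Against a real Pod, the same helpers could read from kubectl logs <YOUR-POD-NAME> instead of the sample text.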

  3. Verify GPU utilization.

    While the PyTorch job is running, open a shell in the Pod:

    kubectl exec -it <YOUR-POD-NAME> -- bash

    Then, run the nvidia-smi command. The expected output is similar to the following:

    Thu Jun 11 06:08:11 2026
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA XXX                     On  |   00000000:00:08.0 Off |                    0 |
    |  0%   52C    P0            147W /  150W |     513MiB /  23028MiB |     91%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A               1      C   python                                  506MiB |
    +-----------------------------------------------------------------------------------------+

    This output confirms that the Pod was scheduled to a GPU node, the driver and CUDA are running correctly, PyTorch is using the GPU, and utilization has reached 91%.
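
For scripted checks, nvidia-smi can print selected fields as CSV through its --query-gpu and --format options, which is easier to parse than the table. The following is a minimal sketch. summarize_gpu is a hypothetical helper name, and the sample line is built from the values in the nvidia-smi output in this step.

```shell
# Summarize one CSV line of "utilization.gpu, memory.used, memory.total"
# (percent, MiB, MiB) read from stdin.
# summarize_gpu is a hypothetical helper name used only for this illustration.
summarize_gpu() {
  awk -F', *' '{ printf "util=%d%% mem=%d/%dMiB\n", $1, $2, $3 }'
}

# Sample line built from the values in the nvidia-smi table.
echo '91, 513, 23028' | summarize_gpu

# Against a running Pod (needs cluster access, so it is left commented out):
# kubectl exec <YOUR-POD-NAME> -- nvidia-smi \
#   --query-gpu=utilization.gpu,memory.used,memory.total \
#   --format=csv,noheader,nounits | summarize_gpu
```

The noheader and nounits options drop the column titles and the "%" and "MiB" suffixes, so each line contains only numbers.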

  4. Clean up resources. Delete the Job. The node pool automatically releases the GPU node once it is no longer in use.

    kubectl delete job pytorch-examples