Deploy and run GPU workloads

After you enable Auto Mode for a cluster, you can use an Auto Mode node pool to scale GPU resources dynamically. For GPU workloads with fluctuating demand, such as online inference services, this approach can significantly reduce costs because GPU nodes are created on demand and released when they become idle.

Key advantages

When you use an Auto Mode node pool to provision GPU-accelerated instances, the nodes run a GPU-optimized version of ContainerOS by default. This accelerates GPU node startup. The key advantages are as follows:

  • Pre-installed NVIDIA drivers for faster startup

    The GPU-optimized image comes with pre-installed NVIDIA drivers and the required runtime environment. This eliminates additional installation and configuration after node startup.

  • Automated node management to reduce operational complexity

    An Auto Mode node pool automates node pool management, system upgrades, component maintenance, and security patching, so GPU resources work out of the box with less maintenance effort.

  • Streamlined base software stack

    Nodes use a more lightweight and secure ContainerOS, which accelerates node startup.

Prerequisites

Step 1: Create a GPU Auto Mode node pool

We recommend that you create a separate node pool for GPU workloads. When you submit a GPU workload, the system automatically creates GPU nodes based on resource requests. When the nodes become idle and meet the scale-in conditions, they are automatically released. This ensures you are charged only for the resources you use.

  1. On the ACK Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Node Pools.

  2. Click Create Node Pool and configure the parameters as prompted.

    The following table describes the key parameters. For more information about the parameters, see Create a node pool.

    Parameter

    Description

    Configure Managed Node Pool

    Select Auto Mode.

    vSwitch

    During scaling, nodes scale in or out in the availability zones of the selected vSwitches based on the scaling policy. For high availability, select vSwitches in two or more different availability zones.

    Instance Settings

    Set Instance Configuration Mode to Specify Instance Type.

    • Architecture: GPU-accelerated.

    • Instance Type: Select suitable instance types based on your business requirements, such as ecs.gn7i-c8g1.2xlarge (NVIDIA A10). To increase the success rate of scale-out events, we recommend that you select multiple instance types.

    Taints

    To prevent non-GPU workloads from being scheduled onto expensive GPU nodes, we recommend that you add a taint to the node pool for logical isolation.

    • Key: nvidia.com/gpu

    • Value: true

    • Effect: NoSchedule

Step 2: Configure requests and tolerations

To ensure that your application can be scheduled to the node pool and trigger the automatic creation of GPU nodes, you must declare the GPU resource request and a toleration for the node taint in the YAML configuration.

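  • Configure the GPU resource request: In the container's resources field, declare the required number of GPUs under limits. For an extended resource such as nvidia.com/gpu, Kubernetes uses the limit as the request when no request is specified.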

    # ...
    spec:
      containers:
      - name: gpu-automode
        resources:
          limits:
            nvidia.com/gpu: 1   # Request one GPU.
    # ...
    
  • Configure the toleration: Add the tolerations field to match the taint of the node pool. This allows the pod to be scheduled onto nodes with that taint.

    The following toleration is an example. Adjust the key, value, and effect to match the taint that you set on your node pool.

    # ...
    spec:
      tolerations:
        - key: "nvidia.com/gpu"  # Matches the taint key set on the node pool.
          operator: "Equal"
          value: "true"          # Matches the taint value set on the node pool.
          effect: "NoSchedule"   # Matches the taint effect set on the node pool.
    # ...
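Before you deploy, a quick text check can catch a manifest that is missing either declaration. The following is a minimal sketch, not an ACK tool: check_gpu_manifest is a hypothetical helper name, and it only searches the file for an nvidia.com/gpu limit and a matching toleration key. It does not validate the YAML structure or compare the toleration value and effect with the node pool taint.

```shell
# Hypothetical helper (not part of ACK or kubectl): crude text check of a manifest.
# Usage: check_gpu_manifest <path-to-manifest>
check_gpu_manifest() {
  manifest="$1"

  # Look for a line such as "nvidia.com/gpu: 1" (the GPU resource limit).
  if ! grep -Eq '^[[:space:]]*nvidia\.com/gpu:[[:space:]]*[0-9]+' "$manifest"; then
    echo "missing nvidia.com/gpu resource limit"
    return 1
  fi

  # Look for a toleration entry such as: - key: "nvidia.com/gpu"
  if ! grep -Eq 'key:[[:space:]]*"?nvidia\.com/gpu"?' "$manifest"; then
    echo "missing toleration for nvidia.com/gpu"
    return 1
  fi

  echo "ok"
}
```

For example, check_gpu_manifest stable-diffusion.yaml prints ok when both declarations are found, and otherwise prints which one is missing and returns a non-zero status.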

Step 3: Deploy a GPU workload and verify autoscaling

This example uses a Stable Diffusion Web UI application to demonstrate the end-to-end deployment process and verify autoscaling.

  1. Create and deploy the workload.

    Example

    1. Create a file named stable-diffusion.yaml.

      This YAML file defines two resources:

      • Deployment: Defines the Stable Diffusion workload. The pod requests one NVIDIA GPU and is configured with the corresponding toleration.

      • Service: Creates a Service of the LoadBalancer type to expose the workload through a public IP address. The Service forwards requests to port 7860 of the container.

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        labels:
          app: stable-diffusion
        name: stable-diffusion
        namespace: default
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: stable-diffusion
        template:
          metadata:
            labels:
              app: stable-diffusion
          spec:
            containers:
            - args:
              - --listen
              command:
              - python3
              - launch.py
              image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu
              imagePullPolicy: IfNotPresent
              name: stable-diffusion
              ports:
              - containerPort: 7860
                protocol: TCP
              readinessProbe:
                tcpSocket:
                  port: 7860
              resources:
                limits:
                  # Request one GPU.
                  nvidia.com/gpu: 1
                requests:
                  cpu: "6"
                  memory: 12Gi  
            # Declare the toleration to ensure that the pod can be scheduled to the corresponding node pool.
            tolerations:
            - key: "nvidia.com/gpu"
              operator: "Equal"
              value: "true"
              effect: "NoSchedule"   
      ---
      apiVersion: v1
      kind: Service
      metadata:
        annotations:
          # Specify that the Service is exposed over the internet.
          service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: internet
          service.beta.kubernetes.io/alibaba-cloud-loadbalancer-instance-charge-type: PayByCLCU
        name: stable-diffusion-svc
        namespace: default
      spec:
        externalTrafficPolicy: Local
        ports:
        - port: 7860
          protocol: TCP
          targetPort: 7860
        # Select pods with the app=stable-diffusion label.
        selector:
          app: stable-diffusion
        # Create a Service of the LoadBalancer type.
        type: LoadBalancer
    2. Deploy the workload.

      kubectl apply -f stable-diffusion.yaml
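If you want to validate the manifest without creating anything, kubectl supports a server-side dry run of the same apply command. The sketch below wraps it in a function; validate_manifest is a hypothetical name used only for illustration. Because a dry run persists nothing, no pending pod is created, so it should not trigger a GPU scale-out.

```shell
# Hypothetical wrapper: ask the API server to validate the manifest
# without persisting any resource (no pod is created).
# Usage: validate_manifest <path-to-manifest>
validate_manifest() {
  kubectl apply --dry-run=server -f "$1"
}
```

For example, run validate_manifest stable-diffusion.yaml before the actual kubectl apply.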
  2. Verify automatic scale-out.

    Right after deployment, the pod is in the Pending state because no GPU node is available yet. The pending pod is what triggers the scale-out.

    1. Check the pod status.

      kubectl get pod -l app=stable-diffusion
    2. Check the pod events.

      kubectl describe pod -l app=stable-diffusion

      In the Events section, you should see a FailedScheduling event, followed by a ProvisionNode event. This indicates that a scale-out has been triggered.

      ......
      Events:
        Type     Reason            Age                From               Message
        ----     ------            ----               ----               -------
        Warning  FailedScheduling  15m                default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 Insufficient cpu, 2 Insufficient memory, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling., ,
        Normal   ProvisionNode     16m                GOATScaler         Provision node asa-2ze2h0f4m5ctpd8kn4f1 in Zone: cn-beijing-k with InstanceType: ecs.gn7i-c8g1.2xlarge, Triggered time 2025-11-19 02:58:01.096
        Normal   AllocIPSucceed    12m                terway-daemon      Alloc IP 10.XX.XX.141/16 took 4.764400743s
        Normal   Pulling           12m                kubelet            Pulling image "yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu"
        Normal   Pulled            3m48s              kubelet            Successfully pulled image "yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu" in 8m47.675s (8m47.675s including waiting). Image size: 11421866941 bytes.
        Normal   Created           3m42s              kubelet            Created container: stable-diffusion
        Normal   Started           3m24s              kubelet            Started container stable-diffusion
    3. Get the name of the node where the pod is running.

      # Store the name of the node where the pod is running in the NODE_NAME variable.
      NODE_NAME=$(kubectl get pod -l app=stable-diffusion -o jsonpath='{.items[0].spec.nodeName}')
      
      # Print the node name.
      echo "Stable Diffusion is running on node: $NODE_NAME"
      
      # View the details of the node to confirm that it is in the Ready state.
      kubectl get node "$NODE_NAME"
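The checks above can also be combined into one helper that waits for the pod and then reports how many GPUs its node advertises. This is a sketch under assumptions: show_gpu_node is a hypothetical name, the 30-minute timeout is an assumed value chosen to leave room for node provisioning and the large image pull, and the label is the one used in the example manifest.

```shell
# Hypothetical helper: wait for the pod, then report its node and GPU count.
# Assumes the app=stable-diffusion label from the example manifest.
show_gpu_node() {
  # Block until the pod is Ready (assumed timeout: 30 minutes).
  kubectl wait --for=condition=Ready pod -l app=stable-diffusion --timeout=30m || return 1

  # Name of the node that the pod was scheduled to.
  node=$(kubectl get pod -l app=stable-diffusion -o jsonpath='{.items[0].spec.nodeName}')

  # Allocatable GPUs advertised by the node. The dot in the resource
  # name is escaped so that JSONPath treats it as part of the key.
  gpus=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')

  echo "node=$node allocatable_gpus=$gpus"
}
```

An empty allocatable_gpus value would suggest that the node does not expose the nvidia.com/gpu resource.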
  3. Access Stable Diffusion.

    Wait a few minutes for the new node to join the cluster and the pod to start. Then, access the application over the internet.

    1. Run the following command to obtain the public IP address (EXTERNAL-IP) of the Service.

      kubectl get svc stable-diffusion-svc

      In the output, find the value in the EXTERNAL-IP column.

      NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
      stable-diffusion-svc   LoadBalancer   192.XXX.XX.196   8.XXX.XX.68   7860:31302/TCP   18m
    2. In your browser, enter http://<EXTERNAL-IP>:7860.

      If the Stable Diffusion Web UI loads successfully, the workload is running on the GPU node.
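The same check can be scripted instead of using a browser. The following sketch assumes that the load balancer reports an IP address (not a hostname) in the Service status and that curl is installed. get_external_ip and check_web_ui are hypothetical helper names.

```shell
# Hypothetical helper: print the Service's public IP address,
# or fail if the load balancer has not assigned one yet.
get_external_ip() {
  ip=$(kubectl get svc stable-diffusion-svc \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  if [ -z "$ip" ]; then
    echo "EXTERNAL-IP not assigned yet" >&2
    return 1
  fi
  echo "$ip"
}

# Hypothetical helper: print the HTTP status code returned on port 7860.
# Usage: check_web_ui <ip-address>
check_web_ui() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "http://$1:7860"
}
```

For example, ip=$(get_external_ip) && check_web_ui "$ip" prints the status code of the Web UI endpoint. A 200 response would suggest that the application is reachable.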

  4. Verify automatic scale-in (manual trigger).

    To verify automatic scale-in, delete the Deployment so that the GPU node becomes idle.

    1. Delete the Deployment and Service that you created.

      # Delete the Deployment.
      kubectl delete deployment stable-diffusion
      
      # Delete the Service.
      kubectl delete service stable-diffusion-svc
    2. Observe the node scale-in.

      After the Scale-in Trigger Delay elapses (3 minutes by default in Auto Mode), the scaling component automatically removes the node from the cluster to reduce costs. Use the node name you obtained earlier to query the node again.

      kubectl get node "$NODE_NAME"

      The expected output indicates that the node cannot be found. This confirms that the node was automatically removed as expected.

      Error from server (NotFound): nodes "<nodeName>" not found
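Instead of querying the node repeatedly, you can block until it is deleted. This is a minimal sketch: wait_for_scale_in is a hypothetical helper name, and the 10-minute default timeout is an assumed value chosen to exceed the 3-minute trigger delay plus the time needed to remove the node. If the node is already gone when the command starts, the result may differ between kubectl versions.

```shell
# Hypothetical helper: block until the node object is deleted or the timeout expires.
# Usage: wait_for_scale_in <node-name> [timeout]
wait_for_scale_in() {
  node="$1"
  timeout="${2:-10m}"   # Assumed default; adjust to your scale-in settings.

  if kubectl wait --for=delete "node/$node" --timeout="$timeout"; then
    echo "node $node removed"
  else
    echo "node $node still present after $timeout"
    return 1
  fi
}
```

For example, wait_for_scale_in "$NODE_NAME" returns as soon as the scaling component removes the node.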