Achieve on-demand autoscaling and cost optimization for GPU workloads in ACK Auto Mode clusters - Container Service for Kubernetes (ACK) - Alibaba Cloud Help Center

After you enable Auto Mode for a cluster, you can use an Auto Mode node pool to dynamically scale GPU resources. Because GPU nodes are created only when workloads request them and are released when idle, this approach reduces costs for GPU workloads with fluctuating demand, such as online inference services.

Key advantages

When you use an Auto Mode node pool to provision GPU-accelerated instances, the nodes run a GPU-optimized version of ContainerOS by default. This accelerates GPU node startup. The key advantages are as follows:

Pre-installed NVIDIA drivers for faster startup
The GPU-optimized image comes with pre-installed NVIDIA drivers and the required runtime environment. This eliminates additional installation and configuration after node startup.
Automated node management to reduce operational complexity
An Auto Mode node pool automates node pool management, system upgrades, component maintenance, and security patching, so GPU resources work out of the box with less maintenance effort.
Streamlined base software stack
Nodes use a more lightweight and secure ContainerOS, which accelerates node startup.

Prerequisites

An Auto Mode cluster has been created.
Nodes run ContainerOS 3.6 or later. To upgrade the ContainerOS version, see Change the operating system.

Step 1: Create a GPU Auto Mode node pool

We recommend that you create a separate node pool for GPU workloads. When you submit a GPU workload, the system automatically creates GPU nodes based on resource requests. When the nodes become idle and meet the scale-in conditions, they are automatically released. This ensures you are charged only for the resources you use.

On the ACK Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Node Pools.

Click Create Node Pool and configure the parameters as prompted.

The following table describes the key parameters. For more information about the parameters, see Create a node pool.

Parameter	Description
Configure Managed Node Pool	Select Auto Mode.
vSwitch	During scaling, nodes scale in or out in the availability zones of the selected vSwitches based on the scaling policy. For high availability, select vSwitches in two or more different availability zones.
Instance Settings	Set Instance Configuration Mode to Specify Instance Type. Set Architecture to GPU-accelerated. For Instance Type, select suitable instance types based on your business requirements, such as ecs.gn7i-c8g1.2xlarge (NVIDIA A10). To increase the success rate of scale-out events, we recommend selecting multiple instance types.
Taints	To prevent non-GPU workloads from being scheduled onto expensive GPU nodes, we recommend that you add a taint to the node pool for logical isolation. Set Key to nvidia.com/gpu, Value to true, and Effect to NoSchedule.

Step 2: Configure requests and tolerations

To ensure that your application can be scheduled to the node pool and trigger the automatic creation of GPU nodes, you must declare the GPU resource request and a toleration for the node taint in the YAML configuration.

Configure the GPU resource request: In the container's resources field, declare the required GPU resources.

# ...
spec:
  containers:
  - name: gpu-automode
    resources:
      limits:
        nvidia.com/gpu: 1   # Request one GPU.
# ...

Configure the toleration: Add the tolerations field to match the taint of the node pool. This allows the pod to be scheduled onto nodes with that taint.

The following toleration configuration is for reference only.

# ...
spec:
  tolerations:
    - key: "nvidia.com/gpu"  # Matches the taint key set on the node pool.
      operator: "Equal"
      value: "true"          # Matches the taint value set on the node pool.
      effect: "NoSchedule"   # Matches the taint effect set on the node pool.
# ...

Step 3: Deploy a GPU workload and verify autoscaling

This example uses a Stable Diffusion Web UI application to demonstrate the end-to-end deployment process and verify autoscaling.

Create and deploy the workload.

Create a file named stable-diffusion.yaml.

This YAML file defines two resources:

Deployment: Defines the Stable Diffusion workload. The pod requests one NVIDIA GPU and is configured with the corresponding toleration.
Service: Creates a Service of the LoadBalancer type to expose the workload through a public IP address. The Service forwards requests to port 7860 of the container.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: stable-diffusion
  name: stable-diffusion
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stable-diffusion
  template:
    metadata:
      labels:
        app: stable-diffusion
    spec:
      containers:
      - args:
        - --listen
        command:
        - python3
        - launch.py
        image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu
        imagePullPolicy: IfNotPresent
        name: stable-diffusion
        ports:
        - containerPort: 7860
          protocol: TCP
        readinessProbe:
          tcpSocket:
            port: 7860
        resources:
          limits:
            # Request one GPU.
            nvidia.com/gpu: 1
          requests:
            cpu: "6"
            memory: 12Gi  
      # Declare the toleration to ensure that the pod can be scheduled to the corresponding node pool.
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"   
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    # Specify that the Service is exposed over the internet.
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: internet
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-instance-charge-type: PayByCLCU
  name: stable-diffusion-svc
  namespace: default
spec:
  externalTrafficPolicy: Local
  ports:
  - port: 7860
    protocol: TCP
    targetPort: 7860
  # Select pods with the app=stable-diffusion label.
  selector:
    app: stable-diffusion
  # Create a Service of the LoadBalancer type.
  type: LoadBalancer

Deploy the workload.
```
kubectl apply -f stable-diffusion.yaml
```

Verify automatic scale-out.

After deployment, the pod enters the Pending state because no GPU resources are available.

Check the pod status.
```
kubectl get pod -l app=stable-diffusion
```

Check the pod events.

kubectl describe pod -l app=stable-diffusion

In the Events section, you should see a FailedScheduling event, followed by a ProvisionNode event. This indicates that a scale-out has been triggered.

......
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  15m                default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 Insufficient cpu, 2 Insufficient memory, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling., ,
  Normal   ProvisionNode     16m                GOATScaler         Provision node asa-2ze2h0f4m5ctpd8kn4f1 in Zone: cn-beijing-k with InstanceType: ecs.gn7i-c8g1.2xlarge, Triggered time 2025-11-19 02:58:01.096
  Normal   AllocIPSucceed    12m                terway-daemon      Alloc IP 10.XX.XX.141/16 took 4.764400743s
  Normal   Pulling           12m                kubelet            Pulling image "yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu"
  Normal   Pulled            3m48s              kubelet            Successfully pulled image "yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu" in 8m47.675s (8m47.675s including waiting). Image size: 11421866941 bytes.
  Normal   Created           3m42s              kubelet            Created container: stable-diffusion
  Normal   Started           3m24s              kubelet            Started container stable-diffusion
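
In the sample events above, node provisioning and the image pull together take several minutes, so it can be convenient to block until the pod is ready instead of re-running the describe command. The following is a minimal sketch and not part of the original procedure: the function name wait_for_gpu_pod and the 20-minute default timeout are assumptions, while kubectl wait is a standard kubectl subcommand.

```shell
# Hypothetical helper: block until pods matching a label selector are Ready.
# The default selector matches the sample Deployment. The 20m timeout is an
# assumed value; it should exceed node provisioning plus image pull time.
wait_for_gpu_pod() {
  local selector="${1:-app=stable-diffusion}"
  local timeout="${2:-20m}"
  kubectl wait --for=condition=Ready pod -l "$selector" --timeout="$timeout"
}
```

Calling wait_for_gpu_pod with no arguments after kubectl apply uses the defaults. A non-zero exit status means that the pod did not become Ready within the timeout.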

Get the name of the node where the pod is running.

# Store the name of the node where the pod is running in the NODE_NAME variable.
NODE_NAME=$(kubectl get pod -l app=stable-diffusion -o jsonpath='{.items[0].spec.nodeName}')

# Print the node name.
echo "Stable Diffusion is running on node: $NODE_NAME"

# View the details of the node to confirm that it is in the Ready state.
kubectl get node "$NODE_NAME"
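
Beyond the Ready state, it may also help to confirm that the new node exposes a GPU to the scheduler. The sketch below reads the node's allocatable nvidia.com/gpu count. The helper name node_gpu_count is hypothetical, and the query assumes that the node advertises the nvidia.com/gpu extended resource, which is the resource name used in the pod spec.

```shell
# Hypothetical helper: print the allocatable nvidia.com/gpu count of a node.
node_gpu_count() {
  local node="$1"
  # The dots in the resource name are escaped so that JSONPath treats
  # "nvidia.com/gpu" as a single key instead of nested fields.
  kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

For example, node_gpu_count "$NODE_NAME" prints the number of GPUs that the scheduler can allocate on that node. An empty result suggests that the GPU resource has not been registered on the node.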

Access Stable Diffusion.

Wait a few minutes for the new node to join the cluster and the pod to start. Then, access the application over the internet.

1. Run the following command to obtain the public IP address (EXTERNAL-IP) of the Service.
```
kubectl get svc stable-diffusion-svc
```
  In the output, find the EXTERNAL-IP.
```
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
stable-diffusion-svc   LoadBalancer   192.XXX.XX.196   8.XXX.XX.68   7860:31302/TCP   18m
```
2. In your browser, enter http://<EXTERNAL-IP>:7860.
  If the Stable Diffusion Web UI loads successfully, the workload is running on the GPU node.
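
The same check can be scripted from a terminal. The following sketch builds the URL from the Service and probes it with curl. The helper names sd_url and sd_http_status are hypothetical, and the sketch assumes that the load balancer publishes an IP address (not a hostname) in the Service status, as in the sample output above.

```shell
# Hypothetical helper: build the Web UI URL from the Service's external IP.
sd_url() {
  local svc="${1:-stable-diffusion-svc}"
  local ip
  ip=$(kubectl get svc "$svc" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  if [ -z "$ip" ]; then
    echo "no external IP assigned to $svc yet" >&2
    return 1
  fi
  echo "http://${ip}:7860"
}

# Hypothetical helper: print the HTTP status code returned by the Web UI.
sd_http_status() {
  local url
  url=$(sd_url "$@") || return 1
  # -s silences progress output, -o discards the body, -w prints the code.
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url"
}
```

Running sd_http_status prints the HTTP status code of the Web UI. A successful code indicates that the Service is reachable and forwards traffic to the pod.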

Verify automatic scale-in (manual trigger).

To verify automatic scale-in, delete the Deployment so that the GPU node becomes idle.

1. Delete the Deployment and Service that you created.
```
# Delete the Deployment.
kubectl delete deployment stable-diffusion

# Delete the Service.
kubectl delete service stable-diffusion-svc
```
2. Observe the node scale-in.
  After the Scale-in Trigger Delay elapses (3 minutes by default in Auto Mode), the scaling component automatically removes the node from the cluster to reduce costs. Use the node name you obtained earlier to query the node again.
```
kubectl get node "$NODE_NAME"
```
  The expected output reports that the node cannot be found, which confirms that the node was automatically removed.
```
Error from server (NotFound): nodes "<nodeName>" not found
```
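
Because scale-in happens some time after the node becomes idle, re-running the query by hand can be tedious. The following sketch polls until the node disappears. The function name wait_for_node_removal, the 600-second timeout, and the 15-second polling interval are assumptions chosen for illustration and are not values from the product.

```shell
# Hypothetical helper: poll until a node no longer exists in the cluster.
# Usage: wait_for_node_removal <node> [timeout_seconds] [interval_seconds]
# Limitation: any failure of "kubectl get node" (including an unreachable
# API server) is treated as "node removed".
wait_for_node_removal() {
  local node="$1"
  local timeout="${2:-600}"
  local interval="${3:-15}"
  local elapsed=0
  while kubectl get node "$node" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "node $node still present after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "node $node removed after about ${elapsed}s"
}
```

For example, wait_for_node_removal "$NODE_NAME" returns once the node is gone, or exits with a non-zero status if the node is still present when the timeout is reached.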