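Use Auto Mode node pools to dynamically scale GPU workloads - Container Service for Kubernetes (ACK) - Alibaba Cloud

After Auto Mode (智能托管模式) is enabled for a cluster, Auto Mode node pools can scale GPU resources dynamically. For GPU workloads with pronounced peaks and troughs, such as online inference, this can significantly reduce costs.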

Applicability

ACK managed clusters (Auto Mode).
The node operating system must be ContainerOS 3.6 or later. To upgrade ContainerOS, see Change the operating system.

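Step 1: Create an Auto Mode node pool with GPU instance types

We recommend creating a dedicated node pool for GPU workloads. When a workload that requires GPU resources is submitted, the system automatically creates GPU nodes to match the resource requests. When a node is idle and meets the scale-in conditions, it is automatically released, so you are charged only while the nodes are actually in use.

On the ACK Clusters page, click the name of the target cluster. In the left-side navigation pane of the cluster details page, choose Nodes > Node Pools.

Click Create Node Pool and complete the configuration as prompted.

The key settings are listed below. For details about all configuration items, see Create a node pool.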

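Configuration item	Description
Managed configuration	Select Auto Mode.
vSwitch	When the node pool scales, nodes are added or removed in the zones of the selected vSwitches according to the scaling policy. For high availability, select vSwitches in two or more different zones.
Instance configuration	Set the instance configuration mode to Specify Instance Type. Architecture: GPU-accelerated. Instance type: choose an instance family that fits your workload, such as ecs.gn7i-c8g1.2xlarge (NVIDIA A10). To improve the scale-out success rate, select multiple instance types.
Taints	To keep non-GPU workloads off the more expensive GPU nodes, use a taint for logical isolation. Key: nvidia.com/gpu. Value: true. Effect: NoSchedule.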

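Step 2: Configure the resource request and taint toleration for the GPU workload

For the application to be scheduled to the node pool and to trigger automatic creation of GPU nodes, declare both the GPU resource request and a toleration for the node taint in the workload YAML.

Configure the GPU resource request: declare the required GPU resources in the resources field of the container.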

# ...
spec:
  containers:
  - name: gpu-automode
    resources:
      limits:
        nvidia.com/gpu: 1   # Request 1 GPU
# ...

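Configure the taint toleration: add a tolerations field that matches the taint of the node pool, so that the Pod can be scheduled to nodes that carry this taint.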

# ...
spec:
  tolerations:
  - key: "nvidia.com/gpu"  # Matches the taint key set on the node pool
    operator: "Equal"
    value: "true"          # Matches the taint value set on the node pool
    effect: "NoSchedule"   # Matches the taint effect set on the node pool
# ...
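
The two fragments above belong to the same Pod spec. As a minimal sketch, the shell function below prints a complete Pod manifest that combines them. The function name render_gpu_pod, the Pod name argument, and the image argument are hypothetical and are not part of the original example; only the container name, the GPU limit, and the toleration come from the fragments above.

```shell
# Print a minimal Pod manifest that combines the GPU limit and the toleration.
# Arguments (both hypothetical, chosen for illustration):
#   $1 = Pod name
#   $2 = container image
render_gpu_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  containers:
  - name: gpu-automode
    image: $2
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
EOF
}
```

One way to check the result without creating anything is a client-side dry run, for example: render_gpu_pod gpu-demo <your-image> | kubectl apply --dry-run=client -f -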

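Step 3: Deploy the GPU workload and verify elastic scaling

The following example uses a Stable Diffusion Web UI application to walk through the full deployment and the verification of elastic scaling.

Create and deploy the workload.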

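Create stable-diffusion.yaml.

This YAML file contains two parts:

Deployment: defines the Stable Diffusion workload. Its Pod requests one NVIDIA GPU and includes the matching taint toleration.
Service: creates a LoadBalancer Service that exposes the workload through a public IP address and forwards requests to port 7860 of the container.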

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: stable-diffusion
  name: stable-diffusion
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stable-diffusion
  template:
    metadata:
      labels:
        app: stable-diffusion
    spec:
      containers:
      - args:
        - --listen
        command:
        - python3
        - launch.py
        image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu
        imagePullPolicy: IfNotPresent
        name: stable-diffusion
        ports:
        - containerPort: 7860
          protocol: TCP
        readinessProbe:
          tcpSocket:
            port: 7860
        resources:
          limits:
            # Request 1 GPU
            nvidia.com/gpu: 1
          requests:
            cpu: "6"
            memory: 12Gi
      # Declare the taint toleration so that the Pod can be scheduled to the GPU node pool
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    # Use a public (Internet-facing) address for the load balancer
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: internet
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-instance-charge-type: PayByCLCU
  name: stable-diffusion-svc
  namespace: default
spec:
  externalTrafficPolicy: Local
  ports:
  - port: 7860
    protocol: TCP
    targetPort: 7860
  # Select Pods labeled app=stable-diffusion
  selector:
    app: stable-diffusion
  # Expose the Service through a load balancer
  type: LoadBalancer

Deploy the workload.
```
kubectl apply -f stable-diffusion.yaml
```

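Verify automatic node scale-out.

After deployment, the Pod stays in the Pending state because no GPU resources are available yet.

Check the Pod status.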
```
kubectl get pod -l app=stable-diffusion
```
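
Provisioning a GPU node and pulling the image can take several minutes, so repeating the status check by hand is tedious. As a sketch, the helper below blocks until the Pod reports Ready or a timeout expires. The function name wait_for_gpu_pod and the 20-minute default are assumptions chosen for illustration; kubectl wait itself is a standard kubectl subcommand.

```shell
# Block until every Pod matching the label selector is Ready, or fail on timeout.
# Arguments:
#   $1 = label selector, for example app=stable-diffusion
#   $2 = timeout accepted by kubectl wait (assumed default: 20m)
wait_for_gpu_pod() {
  kubectl wait pod -l "$1" --for=condition=Ready --timeout="${2:-20m}"
}
```

For example, wait_for_gpu_pod app=stable-diffusion can run in one terminal while the Pod events are inspected in another.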

View the Pod events.

kubectl describe pod -l app=stable-diffusion

In Events, expect a FailedScheduling event followed by a ProvisionNode event, which indicates that scale-out has been triggered.

......
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  15m                default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 Insufficient cpu, 2 Insufficient memory, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling., ,
  Normal   ProvisionNode     16m                GOATScaler         Provision node asa-2ze2h0f4m5ctpd8kn4f1 in Zone: cn-beijing-k with InstanceType: ecs.gn7i-c8g1.2xlarge, Triggered time 2025-11-19 02:58:01.096
  Normal   AllocIPSucceed    12m                terway-daemon      Alloc IP 10.XX.XX.141/16 took 4.764400743s
  Normal   Pulling           12m                kubelet            Pulling image "yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu"
  Normal   Pulled            3m48s              kubelet            Successfully pulled image "yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu" in 8m47.675s (8m47.675s including waiting). Image size: 11421866941 bytes.
  Normal   Created           3m42s              kubelet            Created container: stable-diffusion
  Normal   Started           3m24s              kubelet            Started container stable-diffusion
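
To see at a glance which node, zone, and instance type the scaler chose, the ProvisionNode message can be parsed. The sketch below assumes the message keeps the exact wording shown in the sample output above ("Provision node ... in Zone: ... with InstanceType: ...,"); that format is not a documented contract and may change. The function name parse_provision_events is hypothetical.

```shell
# Read event lines on stdin and print "node zone instance-type"
# for every line that contains a ProvisionNode message.
# Lines that do not match the assumed message format are ignored.
parse_provision_events() {
  sed -n 's/.*Provision node \([^ ]*\) in Zone: \([^ ]*\) with InstanceType: \([^ ,]*\),.*/\1 \2 \3/p'
}
```

A possible invocation is: kubectl describe pod -l app=stable-diffusion | parse_provision_events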

Get the name of the node that hosts the Pod.

# Store the name of the node that hosts the Pod in NODE_NAME
NODE_NAME=$(kubectl get pod -l app=stable-diffusion -o jsonpath='{.items[0].spec.nodeName}')

# Print the node name
echo "Stable Diffusion is running on node: $NODE_NAME"

# Inspect the node and confirm that it is in the Ready state
kubectl get node "$NODE_NAME"
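
Beyond the Ready state, it can be useful to confirm that the new node actually advertises GPUs and carries the taint configured in step 1. The helpers below are a sketch; the function names node_gpu_count and node_taints are hypothetical, while the JSONPath expressions only read standard Node fields (status.allocatable and spec.taints).

```shell
# Print the number of allocatable GPUs reported by a node.
# Argument: $1 = node name, for example "$NODE_NAME"
node_gpu_count() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}{"\n"}'
}

# Print each taint on a node as key=value:effect, one per line.
# Argument: $1 = node name
node_taints() {
  kubectl get node "$1" -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
}
```

If the node pool was configured as described in step 1, node_taints "$NODE_NAME" should list nvidia.com/gpu=true:NoSchedule, and node_gpu_count "$NODE_NAME" should print a non-empty value.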

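Access Stable Diffusion.
Wait a few minutes until the new node has joined the cluster and the Pod has started. The application is then reachable over the Internet.
1. Run the following command to get the public IP address (EXTERNAL-IP) of the Service.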
```
kubectl get svc stable-diffusion-svc
```
  In the output, note the EXTERNAL-IP value.
```
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
stable-diffusion-svc   LoadBalancer   192.XXX.XX.196   8.XXX.XX.68   7860:31302/TCP   18m
```
2. Open http://<EXTERNAL-IP>:7860 in a browser.
  If the Stable Diffusion Web UI loads, the workload is running on the GPU node.
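
The same check can be scripted instead of using a browser. The sketch below reads the load balancer IP from the Service status and probes port 7860 with curl. The function names svc_external_ip and probe_webui and the 10-second timeout are assumptions chosen for illustration.

```shell
# Print the first load balancer IP address of a Service.
# Argument: $1 = Service name, for example stable-diffusion-svc
svc_external_ip() {
  kubectl get svc "$1" -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Print the HTTP status code returned by the Web UI on port 7860.
# Argument: $1 = IP address or host name
probe_webui() {
  curl -sS -o /dev/null -m 10 -w '%{http_code}\n' "http://$1:7860"
}
```

For example: probe_webui "$(svc_external_ip stable-diffusion-svc)". A 200 response suggests that the Web UI is being served; the output is empty or an error if the load balancer address has not been assigned yet.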
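Verify automatic node scale-in (manually triggered).
To verify automatic scale-in, delete the Deployment manually so that the node becomes idle.
1. Delete the Deployment and Service created earlier.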
```
# Delete the Deployment
kubectl delete deployment stable-diffusion

# Delete the Service
kubectl delete service stable-diffusion-svc
```
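2. Observe the node scale-in.
  Once the scale-in trigger delay has elapsed (3 minutes by default in Auto Mode), the node scaling component automatically removes the idle node from the cluster to save costs. Query the node again with the node name obtained earlier.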
```
kubectl get node $NODE_NAME
```
  The expected output reports that the node cannot be found, which indicates that it has been scaled in and released as expected.
```
Error from server (NotFound): nodes "<nodeName>" not found
```
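
Because scale-in only starts after the trigger delay, the query above may need to be repeated. As a sketch, the loop below polls until the node is gone or a timeout is reached. The function name wait_for_node_removal, the 600-second default timeout, and the 15-second default interval are assumptions chosen for illustration.

```shell
# Poll until the node no longer exists, or fail after a timeout.
# Arguments:
#   $1 = node name, for example "$NODE_NAME"
#   $2 = timeout in seconds (assumed default: 600)
#   $3 = poll interval in seconds (assumed default: 15)
# Note: any failure of "kubectl get node" is treated as "node removed",
# so an unreachable API server would also end the loop.
wait_for_node_removal() {
  node=$1
  timeout=${2:-600}
  interval=${3:-15}
  waited=0
  while kubectl get node "$node" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "node $node still present after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "node $node has been removed"
}
```

For example, wait_for_node_removal "$NODE_NAME" returns once the node has been released, and exits with a non-zero status if the node is still present after the timeout.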