Use the default Kubernetes GPU scheduling

Container Service for Kubernetes (ACK) supports GPU scheduling and operations. Its default GPU usage mode aligns with the upstream Kubernetes community pattern. This topic walks through deploying a sample TensorFlow workload to validate GPU scheduling.

Prerequisites

Ensure you have:

An ACK cluster with at least one GPU node
kubectl configured to connect to the cluster
Access to the ACK console

Avoid bypassing standard GPU resource requests

For ACK-managed GPU nodes, request GPU resources only through the standard Kubernetes extended resource mechanism (nvidia.com/gpu in resources.limits). The following actions bypass this mechanism and pose security risks:

Running GPU applications directly on nodes
Using docker, podman, or nerdctl to create containers or request GPU resources (for example, docker run --gpus all or docker run -e NVIDIA_VISIBLE_DEVICES=all)
Setting NVIDIA_VISIBLE_DEVICES=all or NVIDIA_VISIBLE_DEVICES=<GPU ID> in the env section of a pod's YAML file, so that the environment variable, rather than a resource limit, requests GPU resources for the pod
Building a container image that sets NVIDIA_VISIBLE_DEVICES=all by default, then running it in a pod whose YAML does not set the variable
Setting privileged: true in the pod's securityContext and running a GPU program

Why it matters: Non-standard GPU requests are invisible to the scheduler's resource tracking. This mismatch can cause the scheduler to over-allocate GPUs on a node, leading to multiple pods contending for the same GPU card (for example, competing for GPU memory) and causing workload failures. These methods may also trigger known errors reported by the NVIDIA community.
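As a rough aid for reviewing manifests against the list above, the following sketch searches a manifest file for two of the patterns: the NVIDIA_VISIBLE_DEVICES variable and privileged: true. The function name find_nonstandard_gpu_requests is hypothetical and not part of ACK or kubectl. The check is a plain text search, so a match means only that the line deserves a closer look, and it cannot detect variables baked into a container image.

```shell
# Hypothetical helper (not an ACK or kubectl feature): print manifest lines,
# with line numbers, that mention NVIDIA_VISIBLE_DEVICES or privileged: true.
# Exit status 0 means at least one line matched and should be reviewed;
# exit status 1 means nothing matched.
find_nonstandard_gpu_requests() {
  manifest="$1"
  grep -nE 'NVIDIA_VISIBLE_DEVICES|privileged:[[:space:]]*true' "$manifest"
}
```

For example, find_nonstandard_gpu_requests pod.yaml (where pod.yaml is a placeholder file name) prints each suspicious line prefixed with its line number.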

Verify GPU availability

Before deploying a workload, confirm that your GPU node exposes GPU capacity to the Kubernetes scheduler.

List nodes in the cluster:
```
kubectl get nodes
```
Describe a GPU node to check its capacity:
```
kubectl describe node <gpu-node-name>
```
In the Capacity section, nvidia.com/gpu must show a non-zero value:
```
Capacity:
 nvidia.com/gpu: 1
```
If nvidia.com/gpu is missing or shows 0, the NVIDIA device plugin may not be running on the node. Check the plugin DaemonSet deployment, GPU drivers, and node configuration before proceeding.
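To check every node at once instead of describing them one by one, a custom-columns query can print the GPU capacity and allocatable count side by side. This is a minimal sketch; the function name list_node_gpus and the column headers are illustrative choices. The backslash escapes the dot inside the nvidia.com/gpu resource name so that it is not read as a path separator.

```shell
# Sketch: print one row per node with its GPU capacity and allocatable count.
# Nodes that do not expose nvidia.com/gpu show <none> in the GPU columns.
list_node_gpus() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU_CAPACITY:.status.capacity.nvidia\.com/gpu,GPU_ALLOCATABLE:.status.allocatable.nvidia\.com/gpu'
}
```

Run list_node_gpus after kubectl is configured for the cluster. A GPU node that shows <none> is a candidate for the device plugin and driver checks described above.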

Deploy a GPU application

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the target cluster. In the left navigation pane, choose Workloads > Deployments.

On the Deployments page, click Create from YAML and paste this manifest:

```
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-mnist
  namespace: default
spec:
  containers:
  - image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
    name: tensorflow-mnist
    command:
    - python
    - tensorflow-sample-code/tfjob/docker/mnist/main.py
    - --max_steps=100000
    - --data_dir=tensorflow-sample-code/data
    resources:
      limits:
        nvidia.com/gpu: 1  # Request one GPU card for this container.
    workingDir: /root
  restartPolicy: Always
```

For extended resources such as nvidia.com/gpu, you can specify limits without requests, because Kubernetes uses the limits value as the request by default. You can also specify both, but the two values must be equal. Specifying requests without limits is not allowed.
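One way to see how the limit is treated as the request is to read back the resources stanza that the API server stored for the pod. The sketch below assumes the sample pod from this topic is already created. The function name show_pod_resources is hypothetical, and its two optional arguments (pod name and namespace) default to the values in the sample manifest.

```shell
# Sketch: print the stored resources stanza of the first container in a pod.
# Arguments (both optional): pod name, namespace.
show_pod_resources() {
  kubectl get pod "${1:-tensorflow-mnist}" -n "${2:-default}" \
    -o jsonpath='{.spec.containers[0].resources}'
}
```

With the usual API server defaulting, the output is expected to list nvidia.com/gpu under both limits and requests even though the manifest sets only limits.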

In the left navigation pane, choose Workloads > Pods. Find and click the pod.
Click the Logs tab. Image pull and pod startup may take a few minutes. Once the pod is running, its log output confirms that GPU scheduling works correctly.
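The same check can be done from the command line instead of the console. The following sketch waits for the sample pod to become ready and then prints the end of its log. The function name watch_sample_pod, the 300-second timeout, and the 20-line tail are arbitrary choices for illustration, not values from ACK.

```shell
# Sketch: wait until the sample pod reports Ready, then show recent log lines.
# The logs command runs only if the wait succeeds within the timeout.
watch_sample_pod() {
  kubectl wait --for=condition=Ready pod/tensorflow-mnist -n default --timeout=300s &&
    kubectl logs tensorflow-mnist -n default --tail=20
}
```

If the wait times out, kubectl describe pod tensorflow-mnist -n default shows the scheduling and image pull events for the pod.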

Next steps

To target specific GPU types in a heterogeneous cluster, use node labels and selectors. Label GPU nodes by accelerator type (for example, kubectl label nodes <node-name> accelerator=<gpu-model>), then add a nodeSelector to your pod spec.
For advanced GPU scheduling options such as GPU sharing and isolation, see the ACK GPU scheduling documentation.
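The node label and selector approach mentioned above can be sketched as follows. The function labels one node with its accelerator type, then creates a copy of the sample pod that may only be scheduled onto nodes carrying that label. The function name run_on_gpu_model and the pod name gpu-selector-demo are hypothetical, and the GPU model string must be a valid Kubernetes label value.

```shell
# Sketch: label a node by accelerator type, then create a pod pinned to it.
# Arguments: node name, GPU model (used as the accelerator label value).
run_on_gpu_model() {
  node_name="$1"
  gpu_model="$2"
  kubectl label nodes "$node_name" "accelerator=$gpu_model"
  # The unquoted heredoc lets the shell substitute $gpu_model into the manifest.
  kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-selector-demo
  namespace: default
spec:
  nodeSelector:
    accelerator: "$gpu_model"
  containers:
  - image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
    name: tensorflow-mnist
    command:
    - python
    - tensorflow-sample-code/tfjob/docker/mnist/main.py
    - --max_steps=100000
    - --data_dir=tensorflow-sample-code/data
    resources:
      limits:
        nvidia.com/gpu: 1
    workingDir: /root
  restartPolicy: Always
EOF
}
```

A call such as run_on_gpu_model <node-name> <gpu-model> combines both steps. The pod still requests its GPU through nvidia.com/gpu in limits; the nodeSelector only narrows which nodes are eligible.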