Enable the scheduling feature


When you deploy GPU compute jobs in an ACK managed cluster Pro, you can optimize resource utilization and schedule workloads with precision by assigning scheduling attribute labels (such as exclusive, shared, and topology-aware) and GPU model labels (for card model scheduling) to GPU nodes.

Scheduling label overview

GPU scheduling labels identify GPU models and resource allocation policies, enabling fine-grained resource management and efficient scheduling.

Exclusive scheduling (default)

  Label value:

    ack.node.gpu.schedule: default

  Use cases: For performance-critical tasks that require exclusive access to an entire GPU, such as model training and high-performance computing (HPC).

Shared scheduling

  Label value (basic shared scheduling):

    ack.node.gpu.schedule: cgpu
    ack.node.gpu.schedule: core_mem
    ack.node.gpu.schedule: share
    ack.node.gpu.schedule: mps

  Use cases: Improves GPU utilization and is ideal for scenarios with multiple concurrent lightweight tasks, such as multitenancy and inference.

    • cgpu: Shared computing power with isolated GPU memory, based on Alibaba Cloud cGPU technology.

    • core_mem: Isolated computing power and GPU memory.

    • share: Shared computing power and GPU memory with no isolation.

    • mps: Shared computing power with isolated GPU memory, based on NVIDIA Multi-Process Service (MPS) isolation technology combined with Alibaba Cloud cGPU technology.

  Label value (multi-card shared scheduling):

    ack.node.gpu.placement: binpack
    ack.node.gpu.placement: spread

  Use cases: Optimizes the resource allocation strategy on multi-GPU nodes when cgpu, core_mem, share, or mps shared scheduling is enabled.

    • binpack: (Default) Compact multi-card scheduling. Fills one GPU with Pods before allocating them to the next GPU. This reduces resource fragmentation and is ideal for scenarios that prioritize resource utilization or energy savings.

    • spread: Distributed multi-card scheduling. Spreads Pods across different GPUs to reduce the impact of a single-card failure. This is suitable for high-availability workloads.

Topology-aware scheduling

  Label value:

    ack.node.gpu.schedule: topology

  Use cases: Automatically assigns the optimal combination of GPUs to a Pod based on the physical GPU topology within a single node. This is ideal for tasks that are sensitive to GPU-to-GPU communication latency.

Card model scheduling

  Label value:

    aliyun.accelerator/nvidia_name: <GPU card name>

    Use the following labels together with card model scheduling to set the GPU memory capacity and total GPU card count for GPU jobs:

    aliyun.accelerator/nvidia_mem: <GPU memory per card>
    aliyun.accelerator/nvidia_count: <total number of GPU cards>

  Use cases: Schedules jobs to nodes with a specific GPU model, or keeps jobs off nodes with a specific model.
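
The label values listed in this overview can be checked before a label is applied to a node. The following is a minimal sketch: the function name is_valid_gpu_label is hypothetical (it is not part of ACK or kubectl), and it accepts only the keys and values listed in this overview.

```shell
# Hypothetical helper: succeeds only if the KEY=VALUE argument is one of the
# GPU scheduling labels listed in the overview.
is_valid_gpu_label() {
  key="${1%%=*}"     # text before the first "="
  value="${1#*=}"    # text after the first "="
  case "$key" in
    ack.node.gpu.schedule)
      case "$value" in
        default|cgpu|core_mem|share|mps|topology) return 0 ;;
      esac ;;
    ack.node.gpu.placement)
      case "$value" in
        binpack|spread) return 0 ;;
      esac ;;
  esac
  return 1
}
```

For example, is_valid_gpu_label ack.node.gpu.schedule=cgpu succeeds, while is_valid_gpu_label ack.node.gpu.schedule=binpack fails because binpack belongs to the ack.node.gpu.placement key.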

Enable scheduling features

A node can have only one GPU scheduling mode (exclusive, shared, or topology-aware) enabled at a time. After you enable a mode, the extended resources reported by other scheduling modes are automatically set to 0.

Exclusive scheduling

If a node has no GPU scheduling labels, exclusive scheduling is the default mode. In this mode, a single GPU card is the smallest allocation unit for Pods.

If you have enabled another GPU scheduling mode, deleting the label alone does not restore exclusive scheduling. You must manually change the label value to ack.node.gpu.schedule: default to do so.

Shared scheduling

Shared scheduling is available only for ACK managed cluster Pro. For more information, see Limitations.

  1. Install the ack-ai-installer component

    1. Log on to the ACK console. In the left navigation pane, click Clusters.

    2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Cloud-native AI Suite.

    3. On the Cloud-native AI Suite page, click Deploy. Then select Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling).

      To learn how to configure the compute scheduling policy supported by the cGPU service, see Install and use the cGPU component.
    4. On the Cloud-native AI Suite page, click Deploy Cloud-native AI Suite.

      In the component list on the Cloud-native AI Suite page, verify that the ack-ai-installer component is installed.

  2. Enable shared scheduling

    1. On the Clusters page, click the name of your target cluster. In the left-side navigation pane, choose Nodes > Node Pools.

    2. On the Node Pools page, click Create Node Pool, configure the node labels, and then click Confirm.

      You can keep the default values for the other settings. For information about the use cases of node labels, see Scheduling label overview.
      • Configure basic shared scheduling.

        Click the Add icon for Node Labels, set the Key to ack.node.gpu.schedule, and select one of the following label values: cgpu, core_mem, share, or mps (mps requires installing the MPS Control Daemon component).

      • Configure multi-card shared scheduling.

        If a node has multiple GPU cards and you want to optimize resource allocation, you can further configure multi-card shared scheduling in addition to basic shared scheduling.

        Click the Add icon for Node Labels, set the Key to ack.node.gpu.placement, and select one of the following label values: binpack or spread.
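
After the node pool is created, the labels configured above can be read back from the command line. The following sketch only reads labels and does not change them. The function name and the KUBECTL override variable are hypothetical conveniences; the -L (label columns) flag is standard kubectl.

```shell
# Lists all nodes with the two GPU scheduling labels shown as extra columns.
# KUBECTL is a hypothetical override for the kubectl binary; it defaults to kubectl.
show_gpu_scheduling_labels() {
  "${KUBECTL:-kubectl}" get nodes -L ack.node.gpu.schedule,ack.node.gpu.placement
}
```

Run show_gpu_scheduling_labels and check that the nodes in the new node pool show the values you selected in the two added label columns.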

  3. Verify shared scheduling

    cgpu/share/mps

    Replace <NODE_NAME> with the name of your target node and run the following command to verify that cgpu, share, or mps shared scheduling is enabled.

    kubectl get nodes <NODE_NAME> -o yaml | grep "aliyun.com/gpu-mem"

    Expected output:

    aliyun.com/gpu-mem: "60"

    A non-zero value for the aliyun.com/gpu-mem field indicates that cgpu, share, or mps shared scheduling is enabled.

    core_mem

    Replace <NODE_NAME> with the name of your target node and run the following command to verify that core_mem shared scheduling is enabled.

    kubectl get nodes <NODE_NAME> -o yaml | grep -E 'aliyun\.com/gpu-core\.percentage|aliyun\.com/gpu-mem'

    Expected output:

    aliyun.com/gpu-core.percentage: "80"
    aliyun.com/gpu-mem: "6"

    If the aliyun.com/gpu-core.percentage and aliyun.com/gpu-mem fields are both non-zero, core_mem shared scheduling is enabled.
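
The non-zero checks described above can also be scripted. The following is a minimal sketch, assuming the node YAML contains lines in the form shown in the expected output (resource name, a colon, and a quoted number). The function name resource_is_nonzero is hypothetical.

```shell
# Reads node YAML on stdin and succeeds if the named extended resource
# appears with a value greater than 0.
resource_is_nonzero() {
  resource="$1"
  awk -v key="${resource}:" '
    # Match lines such as:   aliyun.com/gpu-mem: "60"
    $1 == key {
      gsub(/"/, "", $2)           # strip the quotes around the value
      if ($2 + 0 > 0) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}
```

For example, the following prints a message only when the node reports GPU memory for shared scheduling:

kubectl get nodes <NODE_NAME> -o yaml | resource_is_nonzero aliyun.com/gpu-mem && echo "shared scheduling is enabled"

For core_mem, run the same check a second time with aliyun.com/gpu-core.percentage as the argument.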

    binpack

    Use the shared GPU scheduling resource query tool to check the GPU resource allocation on the node:

    kubectl inspect cgpu

    Expected output:

    NAME                     IPADDRESS    GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU Memory(GiB)
    cn-shanghai.192.0.2.109  192.0.2.109  15/15                  9/15                   0/15                   0/15                   24/60
    --------------------------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    24/60 (40%)

    The output shows that GPU0 is fully allocated (15/15) while GPU1 is partially allocated (9/15). This confirms that the binpack strategy, which fills one GPU completely before allocating resources on the next, is active.

    spread

    Use the shared GPU scheduling resource query tool to check the GPU resource allocation on the node:

    kubectl inspect cgpu

    Expected output:

    NAME                     IPADDRESS    GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU Memory(GiB)
    cn-shanghai.192.0.2.109  192.0.2.109  4/15                   4/15                   0/15                   4/15                   12/60
    --------------------------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    12/60 (20%)

    The output shows 4/15 allocated on each of GPU0, GPU1, and GPU3. Pods are distributed across different GPUs instead of being packed onto one, which confirms that the spread policy is in effect.
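
To compare binpack and spread results at a glance, the per-GPU columns of the kubectl inspect cgpu output can be summarized. The following is a sketch that assumes the output has exactly the column layout shown in the examples above (node name, IP address, one Allocated/Total column per GPU, then the node total). The function name gpus_in_use is hypothetical.

```shell
# Reads `kubectl inspect cgpu` output on stdin and prints, for each node,
# the node name and the number of GPUs that have any memory allocated.
gpus_in_use() {
  awk '
    NR == 1          { next }   # skip the header row
    /^[ \t]*-+[ \t]*$/ { exit } # stop at the separator before the cluster summary
    NF < 4           { next }   # skip blank or unexpected lines
    {
      used = 0
      # Fields 3 to NF-1 are the per-GPU "Allocated/Total" columns.
      for (i = 3; i < NF; i++) {
        split($i, pair, "/")
        if (pair[1] + 0 > 0) used++
      }
      print $1, used
    }
  '
}
```

With the sample outputs above, kubectl inspect cgpu | gpus_in_use reports 2 GPUs in use for the binpack example and 3 for the spread example, even though the spread example has less memory allocated in total.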

Topology-aware scheduling

Topology-aware scheduling is available only for ACK managed cluster Pro. For more information, see System component version requirements.

  1. Install the ack-ai-installer component.

  2. Enable topology-aware scheduling

    Replace <NODE_NAME> with the name of your target node and run the following command to add a label to the node and explicitly enable topology-aware GPU scheduling.

    kubectl label node <NODE_NAME> ack.node.gpu.schedule=topology

    After you enable topology-aware GPU scheduling on a node, the node no longer supports non-topology-aware GPU workloads. To restore exclusive scheduling, run the kubectl label node <NODE_NAME> ack.node.gpu.schedule=default --overwrite command to change the label.

  3. Verify topology-aware scheduling

    Replace <NODE_NAME> with the name of your target node and run the following command to verify that topology-aware scheduling is enabled on the node.

    kubectl get nodes <NODE_NAME> -o yaml | grep -F "aliyun.com/gpu:"

    Expected output:

    aliyun.com/gpu: "2"

    If the aliyun.com/gpu field is not 0, topology-aware scheduling is enabled.

Card model scheduling

Schedule jobs to nodes with a specified GPU model, or avoid specific models.

  1. View the GPU card model

    Run the following command to query the GPU card model of the nodes in your cluster. The NVIDIA_NAME column in the output shows the GPU card model.

    kubectl get nodes -L aliyun.accelerator/nvidia_name

    The expected output is similar to the following:

    NAME                        STATUS   ROLES    AGE   VERSION            NVIDIA_NAME
    cn-shanghai.192.XX.XX.176   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
    cn-shanghai.192.XX.XX.177   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB

    Alternative check methods

    On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Workloads > Pods. For a running Pod (for example, tensorflow-mnist-multigpu-***), click Terminal in the Actions column. Select the target container from the drop-down list and run the following commands.

    • Query the card model: nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g'

    • Query the GPU memory of each card: nvidia-smi --id=0 --query-gpu=memory.total --format=csv,noheader | sed -e 's/ //g'

    • Query the total number of GPU cards on the node: nvidia-smi -L | wc -l
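
The sed expressions in the commands above turn the raw nvidia-smi output into space-free values, matching the form of the NVIDIA_NAME value shown earlier. The following sketch isolates those two transformations so they can be tried without a GPU. The function names are hypothetical, and the sample inputs in the usage note are illustrative.

```shell
# Replaces spaces with hyphens, as in the card model query above.
gpu_name_to_label() {
  sed -e 's/ /-/g'
}

# Removes spaces, as in the GPU memory query above.
gpu_mem_to_label() {
  sed -e 's/ //g'
}
```

For example, echo "Tesla V100-SXM2-32GB" | gpu_name_to_label prints Tesla-V100-SXM2-32GB, and echo "32768 MiB" | gpu_mem_to_label prints 32768MiB.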

  2. Enable card model scheduling

    1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Jobs.

    2. On the Jobs page, click Create from YAML. Use the following examples to create an application and enable card model scheduling.

      Specify card model

      Use the GPU card model scheduling label to ensure your application runs on nodes with a specific card model.

      In the code aliyun.accelerator/nvidia_name: "Tesla-V100-SXM2-32GB", replace Tesla-V100-SXM2-32GB with the actual card model of your node.

      YAML details

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: tensorflow-mnist
      spec:
        parallelism: 1
        template:
          metadata:
            labels:
              app: tensorflow-mnist
          spec:
            nodeSelector:
              aliyun.accelerator/nvidia_name: "Tesla-V100-SXM2-32GB" # Runs the application on a Tesla V100-SXM2-32GB GPU.
            containers:
            - name: tensorflow-mnist
              image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
              command:
              - python
              - tensorflow-sample-code/tfjob/docker/mnist/main.py
              - --max_steps=1000
              - --data_dir=tensorflow-sample-code/data
              resources:
                limits:
                  nvidia.com/gpu: 1
              workingDir: /root
            restartPolicy: Never

      After the job is created, choose Workloads > Pods from the left-side navigation pane. The Pod list shows an example Pod successfully scheduled to a matching node, confirming that scheduling based on the GPU card model label is functioning correctly.
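
The same check can be made from the command line. This sketch relies on the app: tensorflow-mnist label that the example Job sets on its Pod template. The function name and the KUBECTL override variable are hypothetical.

```shell
# Lists the Pods created by the example Job, including the node each runs on.
# KUBECTL is a hypothetical override for the kubectl binary; it defaults to kubectl.
show_job_pod_nodes() {
  "${KUBECTL:-kubectl}" get pods -l app=tensorflow-mnist -o wide
}
```

The NODE column of the output should name a node whose NVIDIA_NAME value matches the model in the nodeSelector.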

      Exclude card model

      Use the GPU card model scheduling label with node affinity and anti-affinity to prevent your application from running on certain card models.

      In values: - "Tesla-V100-SXM2-32GB", replace Tesla-V100-SXM2-32GB with the actual card model of your node.

      YAML details

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: tensorflow-mnist
      spec:
        parallelism: 1
        template:
          metadata:
            labels:
              app: tensorflow-mnist
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                  - matchExpressions:
                    - key: aliyun.accelerator/nvidia_name  # Card model scheduling label
                      operator: NotIn
                      values:
                      - "Tesla-V100-SXM2-32GB"            # Prevents the Pod from being scheduled to a node with a Tesla-V100-SXM2-32GB card.
            containers:
            - name: tensorflow-mnist
              image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
              command:
              - python
              - tensorflow-sample-code/tfjob/docker/mnist/main.py
              - --max_steps=1000
              - --data_dir=tensorflow-sample-code/data
              resources:
                limits:
                  nvidia.com/gpu: 1
              workingDir: /root
            restartPolicy: Never

      After the job is created, the application will not be scheduled to nodes with the label key aliyun.accelerator/nvidia_name and value Tesla-V100-SXM2-32GB, but it can be scheduled to GPU nodes with other card models.
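
To preview which nodes remain eligible under the NotIn rule, the same exclusion can be written as a label selector. This is a sketch: the function name and the KUBECTL override variable are hypothetical, while the != selector and the -L flag are standard kubectl.

```shell
# Lists nodes whose card model label differs from the given model.
# KUBECTL is a hypothetical override for the kubectl binary; it defaults to kubectl.
list_nodes_without_model() {
  model="$1"
  "${KUBECTL:-kubectl}" get nodes \
    -l "aliyun.accelerator/nvidia_name!=${model}" \
    -L aliyun.accelerator/nvidia_name
}
```

For example, run list_nodes_without_model Tesla-V100-SXM2-32GB. Like NotIn, the != selector also matches nodes that do not have the label at all, so non-GPU nodes can appear in the list. In the Job, the nvidia.com/gpu resource limit is what restricts the Pod to GPU nodes.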