GPU-HPN topology-aware scheduling

更新时间:
复制 MD 格式

In an ACS cluster with GPU devices, you can schedule multiple GPU pods to the same GPU-HPN node. The pods can then exchange data between GPUs by using technologies such as NVLink. To ensure communication efficiency and fairness between GPU devices, ACS adheres to the partition constraints of different node models when scheduling devices. This topic describes the GPU partition scheduling mechanism in ACS and provides use case examples.

Prerequisites

This feature is supported only for pods of the gpu-hpn compute class and their corresponding node types.

Background

On a node, GPU devices interconnect through one or more communication channels. ACS allows pods with different GPU requirements to run on the same GPU-HPN node. To ensure efficient and fair GPU communication and prevent interference between pods, ACS schedules pods based on the GPU device topology. ACS creates partitions based on the number of GPUs requested by each pod to find the optimal allocation.

The following figure shows a node with eight GPUs. The GPUs are arranged in two groups of four. Within each group, the GPUs are directly interconnected. The two groups are connected through PCIe.

image

The following table describes how ACS creates partitions for pods with different GPU requests.

GPUs requested

Allocation results

8

[0,1,2,3,4,5,6,7]

4

[0,1,2,3], [4,5,6,7]

2

[0,1], [2,3], [4,5], [6,7]

1

[0], [1], [2], [3], [4], [5], [6], [7]

Continuously creating and deleting pods on a node can cause partition fragmentation on the GPU devices. This can prevent new pods from scheduling, leaving them in the Pending state. You can review the scheduling results of existing pods and, based on your business priorities, evict some pods to meet the resource demands of the pending pods.

GPU-HPN node partitions

The partition configurations and GPU model vary across different GPU-HPN node types in ACS.

gpu.p16en-16XL

This node type contains 16 GPUs of the P16EN model. The following table shows the possible allocation results for pods with different GPU requests.

GPUs requested

Allocation results

16

[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

8

[0,1,2,3,4,5,6,7], [8,9,10,11,12,13,14,15]

4

[0,1,2,3], [4,5,6,7], [8,9,10,11], [12,13,14,15]

2

[0,3], [1,2], [4,7], [5,6], [8,11], [9,10], [12,15], [13,14]

1

[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]

Pod scheduling results

Device allocation result

For a GPU-HPN pod, you can check its device allocation result in the pod annotation.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    alibabacloud.com/device-allocation: '{"gpus": {"minor": [0,1,2,3]}}'

Scheduling failure due to partition fragmentation

When a pod cannot be scheduled, it enters the Pending state. Running the kubectl describe pod command displays a message similar to 0/5 nodes are available: xxx. The message Insufficient Partitioned GPU Devices indicates a scheduling failure due to resource fragmentation on the node. The following is an example.

kubectl describe pod pod-demo

Example output (other content is omitted):

...
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  26m    default-scheduler  0/5 nodes are available: 2 Node(s) Insufficient Partitioned GPU Devices, 1 Node(s) xxx, 2 Node(s) xxx.

FAQ

Avoiding partition fragmentation

  • Use different node labels to manage resources based on the GPU requirements of your pods. For example, you can assign pods that require 8 GPUs and pods that require 1 GPU to different nodes.

  • If pods are in the Pending state due to fragmentation, you can use descheduling to evict lower-priority pods. This frees up resources for the pending pods.

  • If your cluster is small or you cannot use labels to manage nodes, and your applications require various GPU specifications, consider using GPU pod capacity reservation to satisfy your application's resource requirements.

Selecting pods for eviction

  • Identify the resource requirements of the pending pod, for example, 8 GPUs.

  • Check the pod annotations on the target node and find the device allocation result in the alibabacloud.com/device-allocation annotation.

  • Based on the allocation result, determine which pods to evict. For example, a pod that requests 8 GPUs of the P16EN model requires an unallocated block of device IDs, such as [0,1,2,3,4,5,6,7] or [8,9,10,11,12,13,14,15].

  • Evict the pods by using a command such as evict or delete.

Partitions and custom schedulers

After a custom scheduler assigns a pod to a node, ACS handles device allocation for that pod. During the device allocation phase, ACS attempts to place devices as compactly as possible to avoid fragmentation.

The custom scheduler only needs to consider the total GPU capacity of a node. We recommend that you use a bin-packing scheduling policy for GPU resources, such as MostAllocated. This can effectively reduce partition fragmentation.

Topology awareness in schedulers

Scheduler type

Conditions

Description

ACS default scheduler

All of the following conditions are met:

  • The cluster is an ACS cluster.

  • The schedulerName of the pod is default-scheduler.

Any of the following conditions is met:

  • The scheduler version is v1.32.0-apsara.6.11.8.507bee55 or later, v1.31.0-aliyun-1.5.0 or later, v1.30.3-aliyun-1.6.0 or later, or any version from 1.33 onward.

  • The scheduler is an earlier version and the Enable custom labels and schedulers for GPU-HPN nodes option is not selected.

For more information, see kube-scheduler.

The scheduler is aware of the partition allocation status on the current node and excludes nodes that do not meet partition requirements. The pod scheduling failure event includes the Insufficient Partitioned GPU Devices message.

ACK default scheduler

All of the following conditions are met:

  • The cluster is an ACK managed cluster, an ACK One registered cluster, or an ACK One cluster for distributed Argo workflows.

  • The schedulerName of the pod is default-scheduler.

  • The scheduler version is v1.30.3-apsara.6.11.8.* or later, v1.32.0-apsara.6.11.8.* or later, v1.33.0-apsara.6.11.8.* or later, v1.34.0-apsara.6.11.8.* or later, or any version from 1.35 onward.

For more information, see kube-scheduler.

Other scheduler types, conditions, or configurations not listed above

The scheduler is unaware of the partition topology. The GPU-HPN node attempts to allocate devices in a consolidated manner. If the partition requirements are not met, the pod remains in the Pending state on the node until its requirements are satisfied. The event message includes Insufficient Partitioned GPU Devices. For more information, see How do I plan node resources and policies to avoid partition fragmentation?.