By default, shared GPU scheduling allocates GPU memory in 1 GiB units. This topic describes how to change the minimum memory allocation unit to 128 MiB for finer-grained allocations.
Prerequisites
You have an ACK managed Pro cluster of version 1.18.8 or later. For more information, see Create an ACK managed cluster and Upgrade a cluster.
Limitations
- If your cluster contains pods that request shared GPU resources by using aliyun.com/gpu-mem, you must delete these pods before you change the memory allocation unit. Otherwise, the scheduler may track resources incorrectly.
- This feature supports only nodes where GPU sharing is enabled without isolation (nodes with the ack.node.gpu.schedule=share label). For nodes that use both sharing and isolation (nodes with the ack.node.gpu.schedule=cgpu label), the isolation module limits each GPU card to a maximum of 16 pods, even if each pod requests only 128 MiB of GPU memory.
- When nodes report GPU memory resources in 128 MiB units, autoscaling is not supported. For example, if a pod requests 32 units of the aliyun.com/gpu-mem resource and no node in the cluster has enough GPU memory to meet this request, the pod remains in the Pending state. Even with autoscaling configured, the cluster will not add a new node for the pod.
- For clusters created before October 20, 2021, you must submit a ticket and ask technical support to restart the scheduler. The configuration takes effect only after the scheduler is restarted.
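To make the autoscaling limitation concrete, the following is a minimal sketch of a pod that requests 32 units, which is 4 GiB at 128 MiB per unit (32 × 128 MiB). The pod and container names are hypothetical, and the image and command are reused from the example at the end of this topic. If no existing node has 4 GiB of free GPU memory, a pod like this stays in the Pending state and no node is added for it.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-mem-32-units   # hypothetical name, for illustration only
spec:
  containers:
  - name: app              # hypothetical name, for illustration only
    image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
    command:
    - bash
    - gpushare/run.sh
    resources:
      limits:
        # Each unit is 128 MiB, so 32 units = 4 GiB.
        aliyun.com/gpu-mem: 32
```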
Change the memory allocation unit
ack-ai-installer not installed
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click .
- At the bottom of the page, click Quick Deployment, select Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling), and then click Advanced.
- Click Advanced Settings, add the code gpuMemoryUnit: 128MiB, and then click OK.

    gpushare:
      enabled: true
      image: acs/gpushare-device-plugin
      tag: v3.4.3-b04ef87-aliyun
      imagePullPolicy: IfNotPresent
      mpsImage: acs/mps
      mpsTag: latest
      gpuMemoryUnit: 128MiB
      loglevel: 5

- Click Deploy Cloud-native AI Suite.
The component is successfully deployed when the Status of ack-ai-installer changes from Deploying to Deployed.
ack-ai-installer installed
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click .
- In the component list, find ack-ai-installer and click Uninstall in its row. Then, click OK.
- After the uninstallation is complete, click Deployments in the row for ack-ai-installer. Add the line gpuMemoryUnit: 128MiB to the configuration.

    image: acs/gpushare-device-plugin
    tag: v3.4.3-b04ef87-aliyun
    imagePullPolicy: IfNotPresent
    mpsImage: acs/mps
    mpsTag: latest
    gpuMemoryUnit: 128MiB
    loglevel: 5

- Click OK.
The ack-ai-installer component is successfully redeployed when its Status changes from Deploying to Deployed.
Example: Request GPU memory
The following YAML shows a StatefulSet whose pod requests 16 units of GPU memory by using the aliyun.com/gpu-mem resource. Because each unit represents 128 MiB, the pod requests a total of 2 GiB of GPU memory (16 × 128 MiB).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: binpack
  labels:
    app: binpack
spec:
  replicas: 1
  serviceName: "binpack-1"
  podManagementPolicy: "Parallel"
  selector: # Defines how the StatefulSet finds the pods it manages.
    matchLabels:
      app: binpack-1
  template: # Defines the pod template.
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        command:
        - bash
        - gpushare/run.sh
        resources:
          limits:
            # Each unit is 128 MiB.
            aliyun.com/gpu-mem: 16 # Requests 16 units, for a total of 2 GiB.
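The practical gain from the 128 MiB unit is that a request no longer has to be a whole number of GiB. As a minimal sketch under that assumption, a standalone pod could request 3 units, or 384 MiB (3 × 128 MiB). The pod and container names below are hypothetical, and the image and command are the same sample used above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-mem-384mib     # hypothetical name, for illustration only
spec:
  containers:
  - name: app              # hypothetical name, for illustration only
    image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
    command:
    - bash
    - gpushare/run.sh
    resources:
      limits:
        # Each unit is 128 MiB, so 3 units = 384 MiB.
        aliyun.com/gpu-mem: 3
```

To size a request, divide the memory the workload needs in MiB by 128 and round up to the next whole unit.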