ACK managed Pro clusters support GPU sharing, which enables scheduling of shared GPUs and GPU memory isolation on Kubernetes. This topic describes how to configure a multi-GPU sharing policy.
Prerequisites
Multi-GPU sharing
Currently, multi-GPU sharing supports only memory isolation with shared compute resources. Memory isolation with dedicated compute resources is not supported.
During model development, a workload may require multiple GPUs but not their full capacity. Assigning entire GPUs to the development environment can waste resources. The multi-GPU sharing feature helps you avoid this issue.
A multi-GPU sharing policy allows an application to request N GiB of GPU memory distributed across M GPUs. Each GPU provides N/M GiB of memory. The value of N/M must be an integer, and all M GPUs must be on the same Kubernetes node. For example, if you request 8 GiB of GPU memory and specify two GPUs, each GPU allocates 4 GiB.
- Single-GPU sharing: A pod requests a fraction of the resources from a single GPU.
- Multi-GPU sharing: A pod requests resources from multiple GPUs, with each GPU contributing an equal amount.
Configure multi-GPU sharing
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, navigate to the Jobs page.
On the Jobs page, click Create from YAML. Copy the following YAML content into the Template editor, and then click Create.
The key configurations in the YAML file are as follows:
- The YAML file defines a Job that runs a TensorFlow MNIST sample. The Job requests a total of 8 GiB of GPU memory from two GPUs, with each GPU providing 4 GiB.
- To request two GPUs, add the aliyun.com/gpu-count: "2" label to the pod metadata.
- To request 8 GiB of GPU memory, specify aliyun.com/gpu-mem: 8 in the resources.limits section of the container specification.
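The Job manifest referred to in this step is not reproduced in this topic. As a minimal sketch, assuming only the key configurations listed for this step, it might look like the following. The Job name is inferred from the pod name used in the verification steps. The image value is a hypothetical placeholder, and the sketch assumes that the image's default command starts the MNIST training.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  # Name inferred from the pod name tensorflow-mnist-multigpu-*** used in this topic.
  name: tensorflow-mnist-multigpu
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-mnist-multigpu
        # Request two GPUs for the pod. Label values must be strings, so keep the quotes.
        aliyun.com/gpu-count: "2"
    spec:
      containers:
      - name: tensorflow-mnist-multigpu
        # Hypothetical placeholder: replace with your own TensorFlow MNIST sample image.
        image: "<your-tensorflow-mnist-image>"
        resources:
          limits:
            # Total GPU memory in GiB, split evenly across the GPUs: 8 / 2 = 4 GiB each.
            aliyun.com/gpu-mem: 8
      # A Job pod template must use Never or OnFailure.
      restartPolicy: Never
```

When you adapt the sketch, keep the aliyun.com/gpu-mem value evenly divisible by the aliyun.com/gpu-count value, because each GPU must contribute a whole number of GiB.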
Verify multi-GPU sharing
On the Clusters page, click the name of your cluster. In the left navigation pane, navigate to the Pods page.
Find the pod you created, such as tensorflow-mnist-multigpu-***. In the Actions column, click Terminal and run the following command.
nvidia-smi
Expected output:
Wed Jun 14 03:24:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   38C    P0    61W / 300W |    569MiB /  4309MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   36C    P0    61W / 300W |    381MiB /  4309MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
The output shows that the pod can access two GPUs. The total memory reported for each GPU is 4,309 MiB, which corresponds to the requested 4 GiB of GPU memory per GPU, not the card's total physical memory of 16,160 MiB.
Find the pod you created, such as tensorflow-mnist-multigpu-***. In the Actions column, click Logs to view the container logs. You should see the following key information:
totalMemory: 4.21GiB freeMemory: 3.91GiB
totalMemory: 4.21GiB freeMemory: 3.91GiB
This output confirms that the application can access approximately 4 GiB of total memory on each GPU, not the actual 16,160 MiB of physical memory. This shows that memory isolation is working correctly.