How to configure the GPU selection policy for GPU sharing - Container Service for Kubernetes (ACK) - Alibaba Cloud Help Center

By default, the scheduler allocates GPU resources by filling one GPU on a node before moving to another. This approach helps prevent GPU memory fragmentation. However, in some scenarios you may want to distribute Pods across multiple GPUs so that the failure of a single GPU affects fewer workloads. This topic describes how to configure the GPU selection policy for GPU sharing.

Prerequisites

How it works

When you use GPU sharing on a node with multiple GPUs, you can choose one of two policies for allocating GPUs to Pods:

Binpack: (Default) The scheduler fills one GPU on a node before allocating resources from another. This approach helps prevent GPU memory fragmentation.
Spread: The scheduler distributes Pods as evenly as possible across all available GPUs on a node. This approach minimizes the number of workloads affected if a single GPU fails.

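For example, consider a node with two GPUs, each with 15 GiB of GPU memory. Pod1 requests 2 GiB of GPU memory, and Pod2 requests 3 GiB. Under Binpack, both Pods are allocated to the same GPU, which then has 5 GiB of its 15 GiB in use while the other GPU stays empty. Under Spread, Pod1 and Pod2 are allocated to different GPUs.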
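To make the difference concrete, the following sketch replays the two-GPU example with a deliberately simplified model of both policies. It is an assumption for illustration only, not the ACK scheduler's actual algorithm: Binpack is modeled as "pick the GPU with the least free memory that still fits", Spread as "pick the GPU with the most free memory", and ties go to the lowest GPU index. The function name `place` is hypothetical.

```python
# Simplified, illustrative model of the Binpack and Spread GPU selection policies.
# Assumption: this is NOT the ACK scheduler's implementation; it only mirrors
# the behavior described in this topic. Ties are broken by the lowest GPU index.

def place(requests_gib, capacities_gib, policy):
    """Return the index of the GPU chosen for each request, in request order."""
    free = list(capacities_gib)
    placement = []
    for req in requests_gib:
        # Only GPUs with enough free memory can host the request.
        candidates = [i for i, f in enumerate(free) if f >= req]
        if not candidates:
            raise ValueError(f"no GPU has {req} GiB free")
        if policy == "binpack":
            # Fill the fullest GPU that still fits before touching an emptier one.
            gpu = min(candidates, key=lambda i: (free[i], i))
        elif policy == "spread":
            # Prefer the GPU with the most free memory to spread Pods out.
            gpu = max(candidates, key=lambda i: (free[i], -i))
        else:
            raise ValueError(f"unknown policy: {policy}")
        free[gpu] -= req
        placement.append(gpu)
    return placement


if __name__ == "__main__":
    # Two GPUs with 15 GiB each; Pod1 requests 2 GiB and Pod2 requests 3 GiB.
    print(place([2, 3], [15, 15], "binpack"))  # [0, 0]
    print(place([2, 3], [15, 15], "spread"))   # [0, 1]
```

In this model Binpack puts both Pods on GPU0, and Spread puts Pod2 on GPU1 because GPU1 has more free memory after Pod1 is placed. The real scheduler may pick a different GPU among equally free ones.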

Procedure

By default, the GPU selection policy for a node is Binpack. To use the Spread policy for GPU sharing, perform the following steps.

Step 1: Create a node pool

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left navigation pane, choose Nodes > Node Pools.

On the Node Pools page, click Create Node Pool in the upper-right corner.

On the Create Node Pool page, set the parameters and click Confirm. The following table describes the key parameters. For information about other parameters, see Create and manage a node pool.

Parameter	Description
Instance type	For Architecture, select GPU-accelerated instance and select an instance type. The `Spread` policy is effective only on nodes with multiple GPUs. Select an instance type with multiple GPUs.
Expected nodes	Specify the initial number of nodes in the node pool. Set this to 0 if you do not want to create nodes immediately.
Node label	Add the following two labels: Set Key to `ack.node.gpu.schedule` and Value to `cgpu` to enable GPU sharing and GPU memory isolation. Set Key to `ack.node.gpu.placement` and Value to `spread` to enable the `Spread` policy on the node.
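The two labels in the table above determine which policy a node uses. As a rough check, the following sketch maps a node's label dictionary to the policy it requests. The label keys and values come from the table; the function name and the decision logic (for example, treating any non-`spread` placement value as Binpack) are assumptions for illustration, not an ACK API. The label dictionary can be taken from the `.metadata.labels` field of `kubectl get node <NODE_NAME> -o json`.

```python
import json

# Label keys and values from the node pool table. The decision logic below
# is an illustrative assumption, not an official ACK API.
SCHEDULE_LABEL = "ack.node.gpu.schedule"
PLACEMENT_LABEL = "ack.node.gpu.placement"


def gpu_sharing_policy(labels):
    """Return "spread", "binpack", or None for a node's label dict.

    None means the node lacks the ack.node.gpu.schedule=cgpu label used in this topic.
    """
    if labels.get(SCHEDULE_LABEL) != "cgpu":
        return None
    if labels.get(PLACEMENT_LABEL) == "spread":
        return "spread"
    # Assumption: a missing or different placement value means the default Binpack policy.
    return "binpack"


if __name__ == "__main__":
    # Trimmed stand-in for `kubectl get node <NODE_NAME> -o json` output.
    node_json = (
        '{"metadata": {"labels": {"ack.node.gpu.schedule": "cgpu", '
        '"ack.node.gpu.placement": "spread"}}}'
    )
    node = json.loads(node_json)
    print(gpu_sharing_policy(node["metadata"]["labels"]))  # spread
```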

Step 2: Submit a job

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, choose Workloads > Jobs.

In the upper-right corner of the page, click Create from YAML. Copy the following YAML content into the Template editor, modify the content as described in the comments, and click Create.


apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-mnist-spread
spec:
  parallelism: 3
  template:
    metadata:
      labels:
        app: tensorflow-mnist-spread
    spec:
      nodeSelector:
        kubernetes.io/hostname: <NODE_NAME> # Replace <NODE_NAME> with the name of a node from the node pool created in Step 1, for example, cn-shanghai.192.0.2.109. Pinning all Pods to one node makes the effect easier to observe.
      containers:
      - name: tensorflow-mnist-spread
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            aliyun.com/gpu-mem: 4 # Requests 4 GiB of GPU memory.
        workingDir: /root
      restartPolicy: Never

YAML file description:

This YAML file defines a Job that runs the TensorFlow MNIST sample. With `parallelism: 3`, the Job runs three Pods at the same time, and each Pod requests 4 GiB of GPU memory.
A Pod requests 4 GiB of GPU memory by setting `aliyun.com/gpu-mem: 4` in the `resources.limits` section of the container specification.
To observe the effect on a single node, the YAML file adds the nodeSelector `kubernetes.io/hostname: <NODE_NAME>`, which schedules all Pods to the specified node.
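If you generate manifests from code rather than pasting YAML, the following sketch builds the same Job as a Python dictionary and computes how much GPU memory the parallel Pods request in total. The helper names `build_spread_job` and `total_gpu_mem_gib` are hypothetical; the field values are copied from the YAML above. `kubectl apply -f` accepts JSON as well as YAML, so the printed JSON could be saved to a file and applied.

```python
import json


def build_spread_job(node_name, gpu_mem_gib=4, parallelism=3):
    """Build the tensorflow-mnist-spread Job from this topic as a dict (hypothetical helper)."""
    name = "tensorflow-mnist-spread"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": parallelism,  # number of Pods that run at the same time
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # Pin every Pod to one node so the GPU placement is easy to inspect.
                    "nodeSelector": {"kubernetes.io/hostname": node_name},
                    "containers": [{
                        "name": name,
                        "image": "registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5",
                        "command": [
                            "python",
                            "tensorflow-sample-code/tfjob/docker/mnist/main.py",
                            "--max_steps=100000",
                            "--data_dir=tensorflow-sample-code/data",
                        ],
                        # GPU memory in GiB, requested through the shared-GPU extended resource.
                        "resources": {"limits": {"aliyun.com/gpu-mem": gpu_mem_gib}},
                        "workingDir": "/root",
                    }],
                    "restartPolicy": "Never",
                },
            },
        },
    }


def total_gpu_mem_gib(job):
    """Total GPU memory (GiB) requested when all parallel Pods run at once."""
    spec = job["spec"]
    containers = spec["template"]["spec"]["containers"]
    per_pod = sum(c["resources"]["limits"].get("aliyun.com/gpu-mem", 0) for c in containers)
    return spec["parallelism"] * per_pod


if __name__ == "__main__":
    job = build_spread_job("cn-shanghai.192.0.2.109")
    print(json.dumps(job, indent=2))
    print(total_gpu_mem_gib(job))  # 12
```

Three Pods at 4 GiB each come to 12 GiB, which is the node total you would expect to see once all three Pods are running.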

Step 3: Verify the Spread policy

Use the GPU inspection tool to query GPU resource allocation on the node:

kubectl inspect cgpu

NAME                     IPADDRESS    GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU Memory(GiB)
cn-shanghai.192.0.2.109  192.0.2.109  4/15                   4/15                   0/15                   4/15                   12/60
--------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
12/60 (20%)

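The output shows that GPU0, GPU1, and GPU3 each have 4 GiB of GPU memory allocated and GPU2 has none. Each of the three Pods was therefore placed on a different GPU instead of being packed onto one, which confirms that the Spread policy is in effect.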
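To check the result from a script instead of by eye, the following sketch parses output in the format shown above and counts the GPUs that have memory allocated. It assumes the column layout of the sample (node name, IP address, one `Allocated/Total` column per GPU, then a node total); the output format may differ between tool versions, and the function names are hypothetical. In practice the text could be captured with `subprocess.run(["kubectl", "inspect", "cgpu"], capture_output=True, text=True).stdout`.

```python
import re

# Sample output copied from this topic; real output may differ between tool versions.
SAMPLE = """\
NAME                     IPADDRESS    GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU Memory(GiB)
cn-shanghai.192.0.2.109  192.0.2.109  4/15                   4/15                   0/15                   4/15                   12/60
--------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
12/60 (20%)
"""

IPV4 = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def parse_gpu_allocations(output):
    """Map each node name to a list of (allocated, total) GiB pairs, one per GPU."""
    result = {}
    for line in output.splitlines():
        cells = line.split()
        # Data rows look like: <node name> <IP address> <a/t> ... <node total a/t>.
        if len(cells) < 3 or not IPV4.fullmatch(cells[1]):
            continue  # skip the header, separator, and cluster summary lines
        pairs = [tuple(int(n) for n in cell.split("/")) for cell in cells[2:]]
        # The last column is the node-wide GPU memory total, not a single GPU.
        result[cells[0]] = pairs[:-1]
    return result


def gpus_in_use(allocations):
    """Count GPUs that have any GPU memory allocated."""
    return sum(1 for allocated, _total in allocations if allocated > 0)


if __name__ == "__main__":
    for node, allocs in parse_gpu_allocations(SAMPLE).items():
        print(node, allocs, "GPUs in use:", gpus_in_use(allocs))
```

With the three 4 GiB Pods from Step 2, Spread should leave three GPUs in use, while Binpack would be expected to place all three Pods on a single GPU, since 12 GiB fits within one 15 GiB GPU.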