Use GPU Sharing on GPU-HPN Nodes

Alibaba Cloud Container Compute Service (ACS) supports GPU sharing on GPU-HPN nodes, which allows multiple pods to run on a single GPU device. Instead of dedicating an entire GPU to each workload, pods can request fractional GPU resources—such as a percentage of computing power and memory. GPU sharing also supports flexible resource requests and limits to accommodate various isolation and sharing needs.

Important

GPU sharing is available only in ACS clusters. This feature is in public preview in the Ulanqab and Shanghai Finance Cloud regions. To use it in other regions, submit a ticket.

Limitations

Note the following constraints:

GPU sharing provides fine-grained resource allocation within a single GPU. It does not support aggregated requests across multiple GPUs—for example, requesting 0.5 of the computing power from two different GPUs simultaneously.
The GPU sharing module manages the driver version for all pods that use GPU sharing. A driver version cannot be specified for an individual pod.
Each pod can have at most one container that uses GPU shared resources (typically the main container). Sidecar containers can only request CPU and memory.
A container cannot request both exclusive GPU resources (nvidia.com/gpu) and GPU shared resources (alibabacloud.com/gpu-core.percentage, alibabacloud.com/gpu-memory.percentage).
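The two container-level constraints above can be checked before a manifest is submitted. The following Python sketch is illustrative only: the function name `check_gpu_share_limitations` and the plain-dict container format are assumptions made for this example and are not part of any ACS tooling.

```python
# Hypothetical pre-submit check for the container-level limitations listed above.
EXCLUSIVE_GPU = "nvidia.com/gpu"
SHARED_GPU = (
    "alibabacloud.com/gpu-core.percentage",
    "alibabacloud.com/gpu-memory.percentage",
)


def _resource_names(container):
    """Collect every resource name that appears in requests or limits."""
    resources = container.get("resources", {})
    names = set(resources.get("requests", {}))
    names.update(resources.get("limits", {}))
    return names


def check_gpu_share_limitations(containers):
    """Return a list of human-readable violations (empty if none)."""
    problems = []
    shared_users = []
    for container in containers:
        names = _resource_names(container)
        uses_shared = any(name in names for name in SHARED_GPU)
        if uses_shared:
            shared_users.append(container["name"])
        # A container must not mix exclusive and shared GPU resources.
        if uses_shared and EXCLUSIVE_GPU in names:
            problems.append(
                f"{container['name']}: mixes exclusive and shared GPU resources"
            )
    # At most one container per pod may use GPU shared resources.
    if len(shared_users) > 1:
        problems.append(
            "more than one container uses GPU shared resources: "
            + ", ".join(shared_users)
        )
    return problems
```

The input is the `spec.containers` list of a pod manifest after it has been loaded into Python dictionaries.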

How it works

Pods do not access a GPU device directly. Instead, they interact with it through the GPU sharing module, which consists of two components:

Proxy module: Integrated into the pod by default. Intercepts API calls related to the GPU device and forwards them to the resource management module.
Resource management module: Runs GPU instructions on the actual GPU device and enforces resource limits based on the pod's resource description.

When GPU sharing is enabled, the resource management module automatically reserves some CPU and memory on the node. For details on reserved amounts, see Node configuration.

Pod states and QoS

Similar to OS process management, the GPU sharing module assigns each pod one of three states:

Hibernation: The pod has no GPU demand (initial state when a pod starts).
Ready: The pod is waiting for GPU resources to be allocated.
Running: The pod is actively using GPU resources.

When multiple pods compete for GPU resources simultaneously, the GPU sharing module uses Quality of Service (QoS) policies—queuing and preemption—to manage allocation.

Queuing policy (share-pool model only)

Pods in the ready state are queued using First In, First Out (FIFO). The pod that entered the ready state first receives resources first. If resources are insufficient, the preemption policy is triggered.
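As a rough mental model, the three pod states and the FIFO ready queue can be sketched in Python as follows. The class names `PodState` and `ReadyQueue` are hypothetical, and the sketch ignores capacity checks and preemption; it only shows the ordering guarantee.

```python
from collections import deque
from enum import Enum


class PodState(Enum):
    HIBERNATION = "hibernation"  # no GPU demand
    READY = "ready"              # waiting for GPU resources
    RUNNING = "running"          # actively using GPU resources


class ReadyQueue:
    """Toy FIFO queue of pod names that are waiting for GPU resources."""

    def __init__(self):
        self._queue = deque()
        self.states = {}

    def request_gpu(self, pod):
        """A pod with new GPU demand moves to READY and joins the tail."""
        self.states[pod] = PodState.READY
        self._queue.append(pod)

    def grant_next(self):
        """Give resources to the pod that has waited longest, if any."""
        if not self._queue:
            return None
        pod = self._queue.popleft()
        self.states[pod] = PodState.RUNNING
        return pod
```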

Preemption policy (share-pool model only)

When a pod in the ready queue cannot get resources, the GPU sharing module attempts to preempt a running pod using the following criteria:

Policy	Description
Filter	A running pod is eligible for preemption only if it has continuously occupied GPU resources for longer than `podMaxDurationMinutes` (default: 2 hours).
Scoring	Among eligible pods, those that have held GPU resources longer are preempted first.

If no running pod meets the filter condition, the queued pod continues to wait.
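The filter and scoring steps can be expressed as a short selection function. This is a minimal sketch under stated assumptions: the `RunningPod` type and `pick_preemption_victim` name are invented for illustration, and the real module may apply additional rules that are not described here.

```python
from dataclasses import dataclass


@dataclass
class RunningPod:
    name: str
    minutes_on_gpu: float  # continuous time holding GPU resources


def pick_preemption_victim(running, pod_max_duration_minutes=120):
    """Return the pod to preempt, or None if no pod passes the filter."""
    # Filter: only pods that have held the GPU longer than the threshold.
    eligible = [p for p in running if p.minutes_on_gpu > pod_max_duration_minutes]
    if not eligible:
        return None  # the queued pod keeps waiting
    # Scoring: the pod that has held the GPU longest is preempted first.
    return max(eligible, key=lambda p: p.minutes_on_gpu)
```

With the default threshold of 120 minutes, a pod that has run for 30 minutes is never selected, while of two pods that have run for 150 and 400 minutes, the 400-minute pod is chosen.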

Choose a sharing model

ACS supports two sharing models. The main difference is whether a pod is assigned a fixed GPU device or can use any available GPU on the node:

Use share-pool when workloads are bursty or intermittent (such as Notebook sessions). Pods share resources from a common GPU pool, with QoS mechanisms (queuing and preemption) managing contention.
Use static when workloads need guaranteed, uninterrupted GPU access without queuing. Each pod is fixed to a specific GPU device.

Model	GPU assignment	`requests`/`limits`	Queuing	Preemption	Best for
share-pool	Any GPU with idle resources on the node	`requests` <= `limits`	FIFO	Configurable	Notebook development, off-peak multi-user workloads
static	Fixed GPU device, does not change at runtime	`requests` == `limits`	Not supported	Not supported	Small-scale AI apps that need guaranteed GPU access without queuing

Warning

For the static model, always set requests == limits for GPU computing power and GPU memory. If requests < limits, resource contention occurs between pods sharing the same GPU and can cause pods to be killed by an out-of-memory (OOM) error.

Example: off-peak resource sharing in Notebook scenarios

In Notebook development, workloads typically do not hold GPU resources continuously. With the share-pool model, pods use GPU resources only when they need them, and the QoS mechanism manages access when multiple pods request resources simultaneously.

Consider four pods on a node with two GPUs:

Pods A and B: requests=0.5, limits=0.5
Pods C and D: requests=0.5, limits=1

The requests add up to 4 × 0.5 = 2 GPUs, which matches the node's two GPUs, so all four pods can be scheduled on the node.

Time T1: Pods A and C are running (Pod A on GPU 0, Pod C on GPU 1). Pods B and D are in the ready queue, with Pod D first. The GPU sharing module tries to allocate resources to Pod D. GPU 0 has 0.5 GPU idle, which satisfies Pod D's requests=0.5, but Pod D's limits=1 would cause resource contention with Pod A on the same GPU, so Pod D stays in the queue.

Time T2 – Phase 1: Pod C finishes and enters hibernation. GPU 1 becomes free and its resources are allocated to Pod D.

Time T2 – Phase 2: Pod B is allocated resources on GPU 0. Pod B has limits=0.5, so it can share GPU 0 with Pod A without resource contention.
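The walkthrough can be replayed with a small Python sketch. It assumes a simplified admission rule inferred from this example: a pod is placed on a GPU only if the sum of the limits of the pods active on that GPU stays within 100%. The actual allocation logic of the GPU sharing module is not specified in this topic and may differ. Values are percentages of one GPU, so 0.5 GPU is written as 50.

```python
def find_gpu(active_limits_per_gpu, pod_limit):
    """Return the index of the first GPU that can host the pod, else None.

    active_limits_per_gpu: one list per GPU with the limits (in percent)
    of the pods currently using that GPU.
    """
    for index, limits in enumerate(active_limits_per_gpu):
        if sum(limits) + pod_limit <= 100:
            return index
    return None


# Time T1: Pod A (limits=50) on GPU 0, Pod C (limits=100) on GPU 1.
t1 = [[50], [100]]
pod_d_gpu_at_t1 = find_gpu(t1, 100)  # Pod D: limits=100, no GPU fits

# Time T2, phase 1: Pod C hibernates, so GPU 1 is free for Pod D.
t2_phase1 = [[50], []]
pod_d_gpu_at_t2 = find_gpu(t2_phase1, 100)

# Time T2, phase 2: Pod D holds GPU 1; Pod B (limits=50) joins Pod A on GPU 0.
t2_phase2 = [[50], [100]]
pod_b_gpu_at_t2 = find_gpu(t2_phase2, 50)
```

Under this rule Pod B would already fit on GPU 0 at T1, but it is behind Pod D in the FIFO queue, which is consistent with Pod B being served only after Pod D in the walkthrough.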

Enable GPU sharing on a node

The following steps show how to enable the share-pool model, deploy a pod with fractional GPU resources, verify the configuration, and optionally disable the feature.

Prerequisites

Before you begin, ensure that you have:

An ACS cluster with GPU-HPN nodes.
kubectl configured to connect to the cluster.
Deleted any pods on the target node that request exclusive GPU resources. Pods that use only CPU and memory do not need to be deleted.

Step 1: Label the GPU-HPN node

List the GPU-HPN nodes in the cluster:

kubectl get node -l alibabacloud.com/node-type=reserved

Expected output:

NAME                     STATUS   ROLES   AGE   VERSION
cn-wulanchabu-c.cr-xxx   Ready    agent   59d   v1.28.3-aliyun

Add the alibabacloud.com/gpu-share-policy=share-pool label to enable GPU sharing:

kubectl label node cn-wulanchabu-c.cr-xxx alibabacloud.com/gpu-share-policy=share-pool

Step 2: Verify the node status

After applying the label, check that the feature is active on the node:

kubectl get node cn-wulanchabu-c.cr-xxx -o yaml

Expected output (truncated):

# The actual output may vary.
apiVersion: v1
kind: Node
spec:
  # ...
status:
  allocatable:
    # GPU shared resource description
    alibabacloud.com/gpu-core.percentage: "1600"
    alibabacloud.com/gpu-memory.percentage: "1600"
    # CPU, memory, and ephemeral storage that remain after the GPU sharing module's reservation is deducted
    cpu: "144"
    memory: 1640Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 4608Gi
  capacity:
    # GPU shared resource description
    alibabacloud.com/gpu-core.percentage: "1600"
    alibabacloud.com/gpu-memory.percentage: "1600"
    cpu: "176"
    memory: 1800Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 6Ti
  conditions:
  # Indicates whether the GPU share policy configuration is valid
  - lastHeartbeatTime: "2025-01-07T04:13:04Z"
    lastTransitionTime: "2025-01-07T04:13:04Z"
    message: gpu share policy is valid.
    reason: Valid
    status: "True"
    type: GPUSharePolicyValid
  # Indicates the GPU share policy in effect on this node
  - lastHeartbeatTime: "2025-01-07T04:13:04Z"
    lastTransitionTime: "2025-01-07T04:13:04Z"
    message: gpu share policy is share-pool.
    reason: share-pool
    status: "True"
    type: GPUSharePolicy

Confirm the following in the output to verify that the feature is active:

allocatable and capacity include alibabacloud.com/gpu-core.percentage and alibabacloud.com/gpu-memory.percentage.
The GPUSharePolicyValid condition has status: "True".
The GPUSharePolicy condition has reason: share-pool.

If the node resources do not update as described, the configuration failed. Check the GPUSharePolicyValid condition's reason and message fields for details. See Node conditions for reason values.
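The three checks can also be automated. The sketch below assumes the node object has been loaded into a Python dictionary, for example by parsing the output of `kubectl get node <node-name> -o json` with `json.loads`. The function name `gpu_sharing_active` is hypothetical.

```python
SHARED_FIELDS = (
    "alibabacloud.com/gpu-core.percentage",
    "alibabacloud.com/gpu-memory.percentage",
)


def gpu_sharing_active(node, expected_policy="share-pool"):
    """Return True if the node dict passes the three checks listed above."""
    status = node.get("status", {})
    # Check 1: shared resource fields exist in both allocatable and capacity.
    for section in ("allocatable", "capacity"):
        if not all(field in status.get(section, {}) for field in SHARED_FIELDS):
            return False
    conditions = {c["type"]: c for c in status.get("conditions", [])}
    valid = conditions.get("GPUSharePolicyValid", {})
    policy = conditions.get("GPUSharePolicy", {})
    # Check 2: the configuration is valid. Check 3: the expected policy is in effect.
    return valid.get("status") == "True" and policy.get("reason") == expected_policy
```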

Step 3: Deploy a pod with shared GPU resources

Create a file named gpu-share-demo.yaml. Set the GPU sharing model to share-pool, matching the node configuration:

apiVersion: v1
kind: Pod
metadata:
  labels:
    alibabacloud.com/compute-class: "gpu-hpn"
    # Set the GPU sharing model to share-pool, matching the node configuration
    alibabacloud.com/gpu-share-policy: "share-pool"
  name: gpu-share-demo
  namespace: default
spec:
  containers:
  - name: demo
    image: registry-cn-wulanchabu-vpc.ack.aliyuncs.com/acs/stress:v1.0.4
    args:
      - '1000h'
    command:
      - sleep
    resources:
      limits:
        cpu: '5'
        memory: 50Gi
        alibabacloud.com/gpu-core.percentage: 100   # Upper limit of computing power usage
        alibabacloud.com/gpu-memory.percentage: 100  # Upper limit of GPU memory usage; exceeding this causes a CUDA OOM error
      requests:
        cpu: '5'
        memory: 50Gi
        alibabacloud.com/gpu-core.percentage: 10    # Controls how many pods can be scheduled on the node
        alibabacloud.com/gpu-memory.percentage: 10   # Controls how many pods can be scheduled on the node

Deploy the pod:

kubectl apply -f gpu-share-demo.yaml

Step 4: Check GPU resource usage

Open a shell in the pod's container:

kubectl exec -it gpu-share-demo -- /bin/bash

Inside the container, run nvidia-smi to view GPU resource allocation and usage. nvidia-smi applies to NVIDIA GPU devices only. For other card types, submit a ticket for assistance.

Note

For share-pool pods, the BusID field in nvidia-smi output shows Pending when the pod is not actively using GPU resources. This is expected behavior, not an error.

Step 5 (optional): Disable GPU sharing on the node

Important

Before disabling GPU sharing, delete all pods on the node that use GPU shared resources. Pods that use only CPU and memory do not need to be deleted.

Delete the pod:
```
kubectl delete pod gpu-share-demo
```

Set the GPU sharing policy to none:

kubectl label node cn-wulanchabu-c.cr-xxx alibabacloud.com/gpu-share-policy=none

Verify the node status:

kubectl get node cn-wulanchabu-c.cr-xxx -o yaml

Confirm the following in the output:

allocatable and capacity no longer include alibabacloud.com/gpu-core.percentage or alibabacloud.com/gpu-memory.percentage.
The GPUSharePolicy condition has status: "False" and reason: none.
CPU and memory in allocatable are restored to their original values.

Expected output (truncated):

apiVersion: v1
kind: Node
spec:
  # ...
status:
  allocatable:
    # Reserved CPU and memory are restored after the feature is disabled
    cpu: "176"
    memory: 1800Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 4608Gi
  capacity:
    cpu: "176"
    memory: 1800Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 6Ti
  conditions:
  - lastHeartbeatTime: "2025-01-07T04:13:04Z"
    lastTransitionTime: "2025-01-07T04:13:04Z"
    message: gpu share policy config is valid.
    reason: Valid
    status: "True"
    type: GPUSharePolicyValid
  - lastHeartbeatTime: "2025-01-07T04:13:04Z"
    lastTransitionTime: "2025-01-07T04:13:04Z"
    message: gpu share policy is none.
    reason: none
    status: "False"
    type: GPUSharePolicy

Node configuration

Enablement label

Set the alibabacloud.com/gpu-share-policy label on a node to enable or disable GPU sharing.

apiVersion: v1
kind: Node
metadata:
  labels:
    alibabacloud.com/gpu-share-policy: share-pool  # or: static, none

Value	Description
`none`	Disables GPU sharing on the node.
`share-pool`	Treats all GPUs on the node as a shared pool. Pods are not fixed to a specific GPU device.
`static`	GPU slicing mode. Each pod is assigned a fixed GPU device that does not change at runtime. The scheduler prioritizes placing pods on the same GPU to minimize fragmentation.

Important

If pods that use exclusive GPUs exist on the node, delete them before enabling the sharing policy.
If pods that use GPU shared resources exist on the node, delete them before modifying or disabling the sharing policy.
Pods that use only CPU and memory do not need to be deleted.

QoS configuration

Configure Quality of Service (QoS) parameters for GPU sharing using the alibabacloud.com/gpu-share-qos-config node annotation. These parameters apply only to the share-pool model.

apiVersion: v1
kind: Node
metadata:
  annotations:
    alibabacloud.com/gpu-share-qos-config: '{"preemptEnabled": true, "podMaxDurationMinutes": 120, "reservedEphemeralStorage": "1.5Ti"}'

Parameter	Type	Default	Description
`preemptEnabled`	Boolean	`true`	Whether to enable preemption.
`podMaxDurationMinutes`	Int	`120` (2 hours)	A pod can be preempted only after it has continuously occupied a GPU for longer than this duration. Must be greater than 0. Unit: minutes.
`reservedEphemeralStorage`	resource.Quantity	`1.5Ti`	Reserved local temporary storage per node. Must be greater than or equal to 0. Uses Kubernetes quantity format, such as `500Gi`.
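Because the annotation value is a JSON string, it can be convenient to generate it rather than type it by hand. The helper below is a sketch: `build_qos_annotation` is a hypothetical name, and the only validation it performs is the "greater than 0" rule for `podMaxDurationMinutes` from the table above.

```python
import json


def build_qos_annotation(preempt_enabled=True, pod_max_duration_minutes=120,
                         reserved_ephemeral_storage="1.5Ti"):
    """Return the JSON string for alibabacloud.com/gpu-share-qos-config."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(pod_max_duration_minutes, bool) or not isinstance(
        pod_max_duration_minutes, int
    ):
        raise TypeError("podMaxDurationMinutes must be an integer")
    if pod_max_duration_minutes <= 0:
        raise ValueError("podMaxDurationMinutes must be greater than 0")
    return json.dumps({
        "preemptEnabled": bool(preempt_enabled),
        "podMaxDurationMinutes": pod_max_duration_minutes,
        "reservedEphemeralStorage": reserved_ephemeral_storage,
    })
```

The resulting string can then be applied to a node, for example with `kubectl annotate node <node-name> alibabacloud.com/gpu-share-qos-config='<json>' --overwrite`.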

Shared resource fields

When GPU sharing is enabled, the following fields are added to the node's allocatable and capacity. They are removed when the feature is disabled.

Field	Description	Calculation
`alibabacloud.com/gpu-core.percentage`	GPU computing power as a percentage.	`number of GPU devices × 100` (e.g., 16 GPUs → 1600)
`alibabacloud.com/gpu-memory.percentage`	GPU memory as a percentage.	`number of GPU devices × 100` (e.g., 16 GPUs → 1600)
`cpu`	CPU cores reserved for the GPU sharing module, deducted from `allocatable`.	`number of GPU devices × 2` (e.g., 16 GPUs → 32 cores reserved)
`memory`	Memory reserved for the GPU sharing module.	`number of GPU devices × 10 GB` (e.g., 16 GPUs → 160 GB reserved)
`ephemeral-storage`	Disk space reserved per node.	1.5 TB per node
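The formulas in the table can be checked against the sample node from Step 2. The following sketch uses a hypothetical helper, `shared_node_resources`, and assumes that the reserved memory is expressed in the same unit as the node's memory quantity (Gi in the sample output).

```python
def shared_node_resources(gpu_count, capacity_cpu, capacity_memory_gi):
    """Apply the per-GPU formulas from the table to a node's capacity."""
    return {
        "alibabacloud.com/gpu-core.percentage": gpu_count * 100,
        "alibabacloud.com/gpu-memory.percentage": gpu_count * 100,
        # 2 CPU cores and 10 units of memory are reserved per GPU device.
        "allocatable_cpu": capacity_cpu - gpu_count * 2,
        "allocatable_memory_gi": capacity_memory_gi - gpu_count * 10,
    }


# Sample node: 16 GPUs, capacity of 176 CPU cores and 1800Gi of memory.
example = shared_node_resources(gpu_count=16, capacity_cpu=176, capacity_memory_gi=1800)
```

For this node the helper yields 1600 for both percentage fields, 144 allocatable CPU cores, and 1640 allocatable memory, which matches the allocatable values in the Step 2 output.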

Node conditions

The node conditions field reports two GPU sharing condition types.

GPUSharePolicyValid — whether the GPU sharing configuration is valid:

Field	Values	Description
`status`	`"True"`, `"False"`	`True`: configuration is valid. `False`: configuration is invalid; check `reason`.
`reason`	`Valid`, `InvalidParameters`, `InvalidExistingPods`, `ResourceNotEnough`	`Valid`: policy is valid. `InvalidParameters`: syntax error in the configuration. `InvalidExistingPods`: incompatible GPU pods exist on the node; the feature cannot be enabled or disabled. `ResourceNotEnough`: insufficient node resources for the GPU sharing module's basic overhead; delete some pods first.
`message`	—	Human-readable message.
`lastTransitionTime`, `lastHeartbeatTime`	UTC	Time when the condition was last updated.

GPUSharePolicy — the currently active GPU sharing policy:

Field	Values	Description
`status`	`"True"`, `"False"`	`True`: GPU sharing is enabled. `False`: GPU sharing is not enabled.
`reason`	`none`, `share-pool`, `static`	The policy currently in effect.
`message`	—	Human-readable message.
`lastTransitionTime`, `lastHeartbeatTime`	UTC	Time when the condition was last updated.

Pod configuration

To use GPU sharing, configure the following labels and resource requests on the pod.

apiVersion: v1
kind: Pod
metadata:
  labels:
    alibabacloud.com/compute-class: "gpu-hpn"         # Only gpu-hpn is supported
    alibabacloud.com/gpu-share-policy: "share-pool"   # Must match the node's sharing model
  name: gpu-share-demo
  namespace: default
spec:
  containers:
  - name: demo
    image: registry-cn-wulanchabu-vpc.ack.aliyuncs.com/acs/stress:v1.0.4
    args:
      - '1000h'
    command:
      - sleep
    resources:
      limits:
        cpu: '5'
        memory: 50Gi
        alibabacloud.com/gpu-core.percentage: 100
        alibabacloud.com/gpu-memory.percentage: 100
      requests:
        cpu: '5'
        memory: 50Gi
        alibabacloud.com/gpu-core.percentage: 10
        alibabacloud.com/gpu-memory.percentage: 10

Compute class

Label	Value	Description
`metadata.labels.alibabacloud.com/compute-class`	`gpu-hpn`	Only the `gpu-hpn` compute class is supported.

GPU sharing policy

Label	Type	Valid values	Description
`metadata.labels.alibabacloud.com/gpu-share-policy`	String	`none`, `share-pool`, `static`	Specifies the GPU sharing model for the pod. Only nodes that use the same model are considered for scheduling.

Resource requests

Specify GPU shared resources in the container's resources field using percentages of a single GPU's computing power and memory.

Field	Resource	Type	Valid values	Description
`requests`	`alibabacloud.com/gpu-core.percentage`	Int	share-pool: [10, 100]; static: [10, 100)	The percentage of a single GPU's computing power to request. Minimum: 10%. Controls how many pods can be scheduled on a node.
`requests`	`alibabacloud.com/gpu-memory.percentage`	Int	share-pool: [10, 100]; static: [10, 100)	The percentage of a single GPU's memory to request. Minimum: 10%.
`limits`	`alibabacloud.com/gpu-core.percentage`	Int	—	The upper limit of computing power usage at runtime.
`limits`	`alibabacloud.com/gpu-memory.percentage`	Int	—	The upper limit of GPU memory usage at runtime. Exceeding this causes a CUDA OOM error.

Both alibabacloud.com/gpu-core.percentage and alibabacloud.com/gpu-memory.percentage must be specified in both requests and limits.

The number of pods that can be scheduled on a node is also constrained by CPU, memory, and the node's maximum pod count.
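The rules in this section can be combined into one validation routine. The sketch below is not an official validator: `validate_gpu_share_resources` is a hypothetical name, and it encodes only the value ranges from the table above plus the `requests` and `limits` relationship described for each sharing model.

```python
CORE = "alibabacloud.com/gpu-core.percentage"
MEMORY = "alibabacloud.com/gpu-memory.percentage"


def validate_gpu_share_resources(policy, requests, limits):
    """Return a list of violations for one container (empty if valid)."""
    if policy not in ("share-pool", "static"):
        return [f"unsupported policy for shared resources: {policy}"]
    errors = []
    for field in (CORE, MEMORY):
        # Both fields must appear in both requests and limits.
        if field not in requests or field not in limits:
            errors.append(f"{field} must be set in both requests and limits")
            continue
        req, lim = requests[field], limits[field]
        if policy == "share-pool":
            in_range = 10 <= req <= 100   # [10, 100]
            relation_ok = req <= lim      # requests <= limits
        else:
            in_range = 10 <= req < 100    # [10, 100)
            relation_ok = req == lim      # requests == limits
        if not in_range:
            errors.append(f"{field}: request {req} is out of range for {policy}")
        if not relation_ok:
            errors.append(f"{field}: request {req} and limit {lim} violate the {policy} rule")
    return errors
```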

FAQ

What happens to a pod waiting in the ready queue?

The pod periodically logs its waiting status:

You have been waiting for ${1} seconds. Approximate position: ${2}

${1} is the number of seconds the pod has been waiting. ${2} is its current position in the ready queue.
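To track the wait from a log pipeline, the message can be parsed with a regular expression. This sketch assumes that both placeholders are rendered as plain non-negative integers; the function name `parse_wait_line` is hypothetical.

```python
import re

WAIT_LINE = re.compile(
    r"You have been waiting for (\d+) seconds\. Approximate position: (\d+)"
)


def parse_wait_line(line):
    """Return (seconds_waited, queue_position), or None if the line does not match."""
    match = WAIT_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```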

What monitoring metrics are available for GPU sharing pods?

The following metrics are available for share-pool pods:

Metric	Description	Example
`DCGM_FI_POOLING_STATUS`	Pod status in GPU sharing mode. Values: `0` = Hibernation (no GPU demand); `1` = Ready (waiting for resources); `2` = Normal (using GPU, duration < `podMaxDurationMinutes`); `3` = Preemptible (using GPU, duration > `podMaxDurationMinutes`, but no pods are queued).	`DCGM_FI_POOLING_STATUS{NodeName="cn-wulanchabu-c.cr-xxx",pod="gpu-share-demo",namespace="default"} 1`
`DCGM_FI_POOLING_POSITION`	Pod's position in the ready queue, starting from 1. Only appears when `DCGM_FI_POOLING_STATUS=1`.	`DCGM_FI_POOLING_POSITION{NodeName="cn-wulanchabu-c.cr-xxx",pod="gpu-share-demo",namespace="default"} 1`
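The status codes can be mapped back to state names when processing scraped samples. The sketch below parses a single sample line in the form shown in the Example column. The regular expressions are deliberately simple and assume label values without escaped quotes; `parse_sample` and `describe_status` are hypothetical names.

```python
import re

POOLING_STATUS = {
    0: "Hibernation",
    1: "Ready",
    2: "Normal",
    3: "Preemptible",
}

SAMPLE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split one labelled sample into (metric name, labels, value)."""
    match = SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labelled sample: {line!r}")
    name, raw_labels, value = match.groups()
    return name, dict(LABEL.findall(raw_labels)), float(value)


def describe_status(line):
    """Return (pod, state name) for a DCGM_FI_POOLING_STATUS sample."""
    name, labels, value = parse_sample(line)
    if name != "DCGM_FI_POOLING_STATUS":
        raise ValueError(f"unexpected metric: {name}")
    return labels["pod"], POOLING_STATUS[int(value)]
```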

How do GPU utilization metrics differ for shared GPU pods?

GPU utilization metrics for shared GPU pods are similar to those for exclusive GPU pods, with the following differences:

ACS pod monitoring: GPU computing power utilization and GPU memory usage are absolute values based on the entire GPU card—the same as in exclusive GPU scenarios.
In-container view (e.g., nvidia-smi): GPU memory usage is an absolute value, but computing power utilization is a relative value where the denominator is the pod's limit.
Device IDs: The device ID in metrics corresponds to the actual ID on the node and does not always start from 0.
share-pool model: The device number in metrics may change because the pod can use different GPU devices from the pool over time.
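To compare the in-container view with ACS pod monitoring, the relative computing power utilization can be rescaled by the pod's limit. The conversion below is an assumption derived from the statement that the denominator is the pod's limit; it treats the relationship as linear and is meant only as a rough cross-check.

```python
def absolute_core_utilization(in_container_percent, core_limit_percent):
    """Convert in-container utilization to a share of the whole GPU card.

    in_container_percent: utilization reported inside the container (0-100),
        relative to the pod's limit.
    core_limit_percent: the pod's alibabacloud.com/gpu-core.percentage limit.
    """
    return in_container_percent * core_limit_percent / 100


# A pod limited to 50% of a GPU that reports 80% inside the container.
whole_card = absolute_core_utilization(80, 50)
```

In this example the pod uses about 40% of the whole card.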

How do I prevent scheduling conflicts when GPU sharing is enabled on only some nodes?

The default ACS scheduler automatically matches pod and node types to avoid conflicts.

With a custom scheduler, an exclusive GPU pod might be scheduled onto a GPU sharing node because the node exposes both nvidia.com/gpu and GPU shared resources in its capacity. Use one of these approaches:

Scheduler plugin: Write a plugin that reads ACS node labels and Condition fields to filter out nodes with a mismatched GPU sharing policy. See Scheduling Framework.
Labels or taints: Add a label or taint to GPU sharing nodes, then configure affinity or toleration policies on your pods.
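The filtering logic of such a plugin can be prototyped before it is implemented against the Scheduling Framework. The Python predicate below is only a sketch of the decision, not a scheduler plugin. It assumes that a missing `alibabacloud.com/gpu-share-policy` label is equivalent to `none`, which this topic does not state explicitly.

```python
POLICY_LABEL = "alibabacloud.com/gpu-share-policy"


def node_fits_pod(node_labels, node_conditions, pod_labels):
    """Return True if the node's GPU sharing policy matches the pod's."""
    # Assumption: a missing label is treated the same as "none".
    pod_policy = pod_labels.get(POLICY_LABEL, "none")
    node_policy = node_labels.get(POLICY_LABEL, "none")
    if pod_policy != node_policy:
        return False
    if node_policy == "none":
        return True
    # For sharing nodes, also require the policy to be reported as valid.
    return any(
        c.get("type") == "GPUSharePolicyValid" and c.get("status") == "True"
        for c in node_conditions
    )
```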

What information is available when a GPU sharing pod is preempted?

For share-pool pods, preemption generates both an Event and a Condition on the pod.

Events:

# This pod's GPU resources were preempted by <new-pod-name>
Warning  GPUSharePreempted  5m15s  gpushare   GPU is preempted by <new-pod-name>.
# This pod preempted GPU resources from <old-pod-name>
Warning  GPUSharePreempt    3m47s  gpushare   GPU is preempted from <old-pod-name>.

Condition:

- type: Interruption.GPUShareReclaim   # Condition type for GPU sharing preemption events
  status: "True"                        # True: this pod preempted another pod or was itself preempted
  reason: GPUSharePreempt               # GPUSharePreempt: this pod preempted another pod; GPUSharePreempted: this pod was preempted
  message: GPU is preempted from <old-pod-name>.
  lastTransitionTime: "2025-04-22T08:12:09Z"
  lastProbeTime: "2025-04-22T08:12:09Z"

How do I maximize pod density in a Notebook scenario?

For GPU sharing pods, CPU and memory requests can also be set lower than limits to increase pod density on a node. When the total limits across pods on a node exceed the node's allocatable resources, pods compete for CPU and memory.

CPU: Competition shows up as CPU Steal Time in the pod's metrics.
Memory: Competition can trigger a node-level out-of-memory (OOM) error, causing some pods to be killed.

Plan pod priorities and resource specifications based on each application's characteristics. For node-level resource utilization data, see ACS GPU-HPN node-level monitoring metrics.
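A quick way to see whether a planned set of pods can run into contention is to compare the sum of their limits with the node's allocatable amount. The helper name `overcommit_ratio` and the pod counts below are made up for illustration; 144 cores is the allocatable CPU of the sample node in Step 2.

```python
def overcommit_ratio(pod_limits, allocatable):
    """Sum of pod limits divided by the node's allocatable amount."""
    return sum(pod_limits) / allocatable


# 40 pods with a CPU limit of 5 cores each on a node with 144 allocatable cores.
cpu_ratio = overcommit_ratio([5] * 40, 144)
contended = cpu_ratio > 1  # a ratio above 1 means pods can compete for CPU
```

A ratio at or below 1 means the limits fit within the allocatable amount; a ratio above 1 means contention is possible when pods peak at the same time.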