GPU sharing on GPU-HPN nodes lets multiple pods run on a single GPU device, so you can request fractional GPU resources instead of dedicating an entire GPU to each workload. This reduces idle capacity and helps you run more workloads per node—particularly useful for Notebook development sessions and small AI inference services.
GPU sharing is available only in ACS clusters. This feature is in public preview in the Ulanqab and Shanghai Finance Cloud regions. To use it in other regions, submit a ticket.
Limitations
Before using GPU sharing, be aware of the following constraints:
GPU sharing provides fine-grained resource allocation within a single GPU. It does not support aggregated requests across multiple GPUs—you cannot request 0.5 of the computing power from two different GPUs simultaneously.
The GPU sharing module manages the driver version for all pods that use GPU sharing. You cannot specify a driver version for an individual pod.
Each pod can have at most one container that uses GPU shared resources (typically the main container). Sidecar containers can only request CPU and memory.
A container cannot request both exclusive GPU resources (nvidia.com/gpu) and GPU shared resources (alibabacloud.com/gpu-core.percentage, alibabacloud.com/gpu-memory.percentage).
How it works
Pods do not access a GPU device directly. Instead, they interact with it through the GPU sharing module, which consists of two components:
Proxy module: Integrated into the pod by default. Intercepts API calls related to the GPU device and forwards them to the resource management module.
Resource management module: Runs GPU instructions on the actual GPU device and enforces resource limits based on the pod's resource description.
When the GPU sharing feature is enabled, the resource management module automatically reserves some CPU and memory on the node. For details on reserved amounts, see Node configuration.
Pod states and Quality of Service
Similar to OS process management, the GPU sharing module assigns each pod one of three states:
Hibernation: The pod has no GPU demand (initial state when a pod starts).
Ready: The pod is waiting for GPU resources to be allocated.
Running: The pod is actively using GPU resources.
When multiple pods compete for GPU resources simultaneously, the GPU sharing module applies Quality of Service (QoS) policies to manage allocation fairly.
Queuing policy (share-pool model only)
Pods in the ready state are queued using First In, First Out (FIFO). The pod that entered the ready state first receives resources first. If resources are insufficient, the preemption policy is triggered.
Preemption policy (share-pool model only)
When a pod in the ready queue cannot get resources, the GPU sharing module attempts to preempt a running pod using the following criteria:
| Policy | Description |
|---|---|
| Filter | A running pod is eligible for preemption only if it has continuously occupied GPU resources for longer than podMaxDurationMinutes (default: 2 hours). |
| Scoring | Among eligible pods, those that have held GPU resources longer are preempted first. |
If no running pod meets the filter condition, the queued pod continues to wait.
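
The filter and scoring rules above can be sketched in a few lines of Python. This is an illustrative model only, not the GPU sharing module's implementation: the `RunningPod` type, the `pick_victim` function, and the sample durations are hypothetical names and values introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningPod:
    name: str
    minutes_on_gpu: float  # how long the pod has continuously held GPU resources


def pick_victim(running: List[RunningPod],
                pod_max_duration_minutes: int = 120) -> Optional[RunningPod]:
    """Return the pod to preempt, or None if the queued pod must keep waiting."""
    # Filter: only pods that have held the GPU longer than the threshold are eligible.
    eligible = [p for p in running if p.minutes_on_gpu > pod_max_duration_minutes]
    if not eligible:
        return None
    # Scoring: the pod that has held GPU resources the longest is preempted first.
    return max(eligible, key=lambda p: p.minutes_on_gpu)


running = [RunningPod("a", 45), RunningPod("b", 180), RunningPod("c", 130)]
victim = pick_victim(running)
print(victim.name if victim else "no eligible pod; keep waiting")
```

With the sample data, pods b and c pass the filter and b is chosen because it has held the GPU longest.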
Choose a sharing model
ACS supports two sharing models. The main difference is whether a pod is assigned a fixed GPU device or can use any available GPU on the node:
Use share-pool when workloads are bursty or intermittent (such as Notebook sessions). Pods share resources from a common GPU pool, with QoS mechanisms (queuing and preemption) managing contention.
Use static when workloads need guaranteed, uninterrupted GPU access without queuing. Each pod is fixed to a specific GPU device.
| Model | GPU assignment | requests/limits | Queuing | Preemption | Best for |
|---|---|---|---|---|---|
| share-pool | Any GPU with idle resources on the node | requests <= limits | FIFO | Configurable | Notebook development, off-peak multi-user workloads |
| static | Fixed GPU device, does not change at runtime | requests == limits | Not supported | Not supported | Small-scale AI apps that need guaranteed GPU access without queuing |
For the static model, always set requests == limits for GPU computing power and GPU memory. If requests < limits, resource competition occurs between pods sharing the same GPU and can cause pods to be killed by an out-of-memory (OOM) error.
Example: off-peak resource sharing in Notebook scenarios
In Notebook development, workloads typically do not hold GPU resources continuously. With the share-pool model, pods use GPU resources only when they need them, and the QoS mechanism manages access when multiple pods request resources simultaneously.
Consider four pods on a node with two GPUs:
Pods A and B: requests=0.5, limits=0.5
Pods C and D: requests=0.5, limits=1
Based on requests, all four pods fit on the node.
Time T1: Pods A and C are running. Pods B and D are in the ready queue. The GPU sharing module tries to allocate resources to Pod D (first in queue). GPU 0 has 0.5 GPU idle, which satisfies Pod D's requests=0.5, but Pod D's limit=1 would cause resource competition with Pod A on the same GPU. So Pod D stays in the queue.
Time T2 – Phase 1: Pod C finishes and enters hibernation. GPU 1 becomes free and its resources are allocated to Pod D.
Time T2 – Phase 2: Pod B is allocated resources on GPU 0. Because Pod B's limit=0.5, it can share GPU 0 with Pod A without resource competition.
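
The timeline above can be replayed with a small simulation. It rests on two assumptions inferred from the example rather than stated by the product: a pod is admitted to a GPU only if the limits of all pods running on that GPU add up to at most one full GPU, and the ready queue is served strictly in FIFO order. The function names `find_gpu` and `drain` are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Optional

# Limits as a fraction of one GPU, taken from the example above.
LIMITS: Dict[str, float] = {"A": 0.5, "B": 0.5, "C": 1.0, "D": 1.0}


def find_gpu(gpus: List[List[str]], pod: str) -> Optional[int]:
    """Return the index of the first GPU whose running pods leave room for `pod`."""
    for index, pods in enumerate(gpus):
        used = sum(LIMITS[p] for p in pods)
        # Assumption: the sum of limits on one GPU must not exceed 1.0.
        if used + LIMITS[pod] <= 1.0:
            return index
    return None


def drain(gpus: List[List[str]], queue: Deque[str]) -> None:
    """Admit pods strictly in FIFO order; stop at the first pod that does not fit."""
    while queue:
        index = find_gpu(gpus, queue[0])
        if index is None:
            break
        gpus[index].append(queue.popleft())


# T1: A runs on GPU 0, C runs on GPU 1; D is ahead of B in the ready queue.
gpus = [["A"], ["C"]]
queue = deque(["D", "B"])
drain(gpus, queue)
print("T1:", gpus, list(queue))

# T2: C enters hibernation and releases GPU 1.
gpus[1].remove("C")
drain(gpus, queue)
print("T2:", gpus, list(queue))
```

At T1 nothing is admitted because Pod D blocks the head of the queue. At T2 Pod D lands on GPU 1 and Pod B then joins Pod A on GPU 0.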
Enable GPU sharing on a node
The following steps show how to enable the share-pool model, deploy a pod with fractional GPU resources, verify the configuration, and optionally disable the feature.
Prerequisites
Before you begin, ensure that you have:
An ACS cluster with GPU-HPN nodes.
kubectl configured to connect to the cluster.
Deleted any pods on the target node that request exclusive GPU resources. Pods that use only CPU and memory do not need to be deleted.
Step 1: Label the GPU-HPN node
List the GPU-HPN nodes in the cluster:
kubectl get node -l alibabacloud.com/node-type=reserved
Expected output:
NAME                     STATUS   ROLES   AGE   VERSION
cn-wulanchabu-c.cr-xxx   Ready    agent   59d   v1.28.3-aliyun
Add the alibabacloud.com/gpu-share-policy=share-pool label to enable GPU sharing:
kubectl label node cn-wulanchabu-c.cr-xxx alibabacloud.com/gpu-share-policy=share-pool
Step 2: Verify the node status
After applying the label, check that the feature is active on the node:
kubectl get node cn-wulanchabu-c.cr-xxx -o yaml
Expected output (truncated):
# The actual output may vary.
apiVersion: v1
kind: Node
spec:
  # ...
status:
  allocatable:
    # GPU shared resource description
    alibabacloud.com/gpu-core.percentage: "1600"
    alibabacloud.com/gpu-memory.percentage: "1600"
    # CPU, memory, and storage after deducting the amounts reserved for the GPU sharing module
    cpu: "144"
    memory: 1640Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 4608Gi
  capacity:
    # GPU shared resource description
    alibabacloud.com/gpu-core.percentage: "1600"
    alibabacloud.com/gpu-memory.percentage: "1600"
    cpu: "176"
    memory: 1800Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 6Ti
  conditions:
  # Indicates whether the GPU share policy configuration is valid
  - lastHeartbeatTime: "2025-01-07T04:13:04Z"
    lastTransitionTime: "2025-01-07T04:13:04Z"
    message: gpu share policy is valid.
    reason: Valid
    status: "True"
    type: GPUSharePolicyValid
  # Indicates the GPU share policy in effect on this node
  - lastHeartbeatTime: "2025-01-07T04:13:04Z"
    lastTransitionTime: "2025-01-07T04:13:04Z"
    message: gpu share policy is share-pool.
    reason: share-pool
    status: "True"
    type: GPUSharePolicy
Confirm the following in the output to verify that the feature is active:
allocatable and capacity include alibabacloud.com/gpu-core.percentage and alibabacloud.com/gpu-memory.percentage.
The GPUSharePolicyValid condition has status: "True".
The GPUSharePolicy condition has reason: share-pool.
If the node resources do not update as described, the configuration failed. Check the GPUSharePolicyValid condition's reason and message fields for details. See Node conditions for reason values.
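
The manual check above could also be scripted. The following is a minimal sketch that reads the two condition types from a Node object already loaded as a Python dictionary (for example, from the JSON form of the same `kubectl get node` command). The helper name `gpu_share_status` and the trimmed sample are hypothetical.

```python
import json
from typing import Dict, Tuple


def gpu_share_status(node: Dict) -> Tuple[bool, str, str]:
    """Return (config_valid, active_policy, detail) from a Node object."""
    conditions = {c["type"]: c for c in node.get("status", {}).get("conditions", [])}
    valid = conditions.get("GPUSharePolicyValid", {})
    policy = conditions.get("GPUSharePolicy", {})
    config_valid = valid.get("status") == "True"
    active_policy = policy.get("reason", "unknown")
    # The reason and message explain why a configuration was rejected.
    detail = f'{valid.get("reason", "")}: {valid.get("message", "")}'
    return config_valid, active_policy, detail


# Trimmed sample shaped like the expected output above.
sample = json.loads("""
{"status": {"conditions": [
  {"type": "GPUSharePolicyValid", "status": "True",
   "reason": "Valid", "message": "gpu share policy is valid."},
  {"type": "GPUSharePolicy", "status": "True",
   "reason": "share-pool", "message": "gpu share policy is share-pool."}
]}}
""")
print(gpu_share_status(sample))
```

If the first element is False, the third element carries the reason and message to look up under Node conditions.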
Step 3: Deploy a pod with shared GPU resources
Create a file named gpu-share-demo.yaml. Set the GPU sharing model to share-pool, matching the node configuration:
apiVersion: v1
kind: Pod
metadata:
  labels:
    alibabacloud.com/compute-class: "gpu-hpn"
    # Set the GPU sharing model to share-pool, matching the node configuration
    alibabacloud.com/gpu-share-policy: "share-pool"
  name: gpu-share-demo
  namespace: default
spec:
  containers:
  - name: demo
    image: registry-cn-wulanchabu-vpc.ack.aliyuncs.com/acs/stress:v1.0.4
    args:
    - '1000h'
    command:
    - sleep
    resources:
      limits:
        cpu: '5'
        memory: 50Gi
        alibabacloud.com/gpu-core.percentage: 100 # Upper limit of computing power usage
        alibabacloud.com/gpu-memory.percentage: 100 # Upper limit of GPU memory usage; exceeding this causes a CUDA OOM error
      requests:
        cpu: '5'
        memory: 50Gi
        alibabacloud.com/gpu-core.percentage: 10 # Controls how many pods can be scheduled on the node
        alibabacloud.com/gpu-memory.percentage: 10 # Controls how many pods can be scheduled on the node
Deploy the pod:
kubectl apply -f gpu-share-demo.yaml
Step 4: Check GPU resource usage
Log in to the container to verify GPU resource usage:
kubectl exec -it gpu-share-demo -- /bin/bash
Inside the container, use nvidia-smi to view GPU resource allocation and usage. The command to use depends on your GPU card type: nvidia-smi applies to NVIDIA GPU devices. For other card types, submit a ticket for assistance.
For share-pool pods, the BusID field in nvidia-smi output shows Pending when the pod is not actively using GPU resources. This is expected behavior, not an error.
Step 5 (optional): Disable GPU sharing on the node
Before disabling GPU sharing, delete all pods on the node that use GPU shared resources. Pods that use only CPU and memory do not need to be deleted.
Delete the pod:
kubectl delete pod gpu-share-demo
Set the GPU sharing policy to none:
kubectl label node cn-wulanchabu-c.cr-xxx alibabacloud.com/gpu-share-policy=none
Verify the node status:
kubectl get node cn-wulanchabu-c.cr-xxx -o yaml
Expected output (truncated):
apiVersion: v1
kind: Node
spec:
  # ...
status:
  allocatable:
    # Reserved CPU and memory are restored after the feature is disabled
    cpu: "176"
    memory: 1800Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 4608Gi
  capacity:
    cpu: "176"
    memory: 1800Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 6Ti
  conditions:
  - lastHeartbeatTime: "2025-01-07T04:13:04Z"
    lastTransitionTime: "2025-01-07T04:13:04Z"
    message: gpu share policy config is valid.
    reason: Valid
    status: "True"
    type: GPUSharePolicyValid
  - lastHeartbeatTime: "2025-01-07T04:13:04Z"
    lastTransitionTime: "2025-01-07T04:13:04Z"
    message: gpu share policy is none.
    reason: none
    status: "False"
    type: GPUSharePolicy
Confirm the following in the output:
allocatable and capacity no longer include alibabacloud.com/gpu-core.percentage or alibabacloud.com/gpu-memory.percentage.
The GPUSharePolicy condition has status: "False" and reason: none.
CPU and memory in allocatable are restored to their original values.
Node configuration
Enablement label
Set the alibabacloud.com/gpu-share-policy label on a node to enable or disable GPU sharing.
apiVersion: v1
kind: Node
metadata:
  labels:
    alibabacloud.com/gpu-share-policy: share-pool # or: static, none
| Value | Description |
|---|---|
| none | Disables GPU sharing on the node. |
| share-pool | Treats all GPUs on the node as a shared pool. Pods are not fixed to a specific GPU device. |
| static | GPU slicing mode. Each pod is assigned a fixed GPU device that does not change at runtime. The scheduler prioritizes placing pods on the same GPU to minimize fragmentation. |
If pods that use exclusive GPUs exist on the node, delete them before enabling the sharing policy.
If pods that use GPU shared resources exist on the node, delete them before modifying or disabling the sharing policy.
Pods that use only CPU and memory do not need to be deleted.
QoS configuration
Configure Quality of Service (QoS) parameters for GPU sharing using the alibabacloud.com/gpu-share-qos-config node annotation. These parameters apply only to the share-pool model.
apiVersion: v1
kind: Node
metadata:
  annotations:
    alibabacloud.com/gpu-share-qos-config: '{"preemptEnabled": true, "podMaxDurationMinutes": 120, "reservedEphemeralStorage": "1.5Ti"}'
| Parameter | Type | Default | Description |
|---|---|---|---|
| preemptEnabled | Boolean | true | Whether to enable preemption. |
| podMaxDurationMinutes | Int | 120 (2 hours) | A pod can be preempted only after it has continuously occupied a GPU for longer than this duration. Must be greater than 0. Unit: minutes. |
| reservedEphemeralStorage | resource.Quantity | 1.5Ti | Reserved local temporary storage per node. Must be greater than or equal to 0. Uses Kubernetes quantity format, such as 500Gi. |
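
Because the annotation value is a JSON string, it can be convenient to generate it rather than type it by hand. The sketch below builds the value from the three documented parameters and enforces only the documented rule that podMaxDurationMinutes is greater than 0. The helper name `build_qos_annotation` is hypothetical, and the storage quantity is passed through without validation.

```python
import json


def build_qos_annotation(preempt_enabled: bool = True,
                         pod_max_duration_minutes: int = 120,
                         reserved_ephemeral_storage: str = "1.5Ti") -> str:
    """Serialize QoS settings into the JSON string used as the annotation value."""
    if pod_max_duration_minutes <= 0:
        raise ValueError("podMaxDurationMinutes must be greater than 0")
    return json.dumps({
        "preemptEnabled": preempt_enabled,
        "podMaxDurationMinutes": pod_max_duration_minutes,
        # Kubernetes quantity string; not validated in this sketch.
        "reservedEphemeralStorage": reserved_ephemeral_storage,
    })


value = build_qos_annotation(pod_max_duration_minutes=60)
print(f"alibabacloud.com/gpu-share-qos-config='{value}'")
```

The printed key and value can then be applied to the node as an annotation, for example with kubectl annotate.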
Shared resource fields
When GPU sharing is enabled, the following fields are added to the node's allocatable and capacity. They are removed when the feature is disabled.
| Field | Description | Calculation |
|---|---|---|
| alibabacloud.com/gpu-core.percentage | GPU computing power as a percentage. | number of GPU devices × 100 (e.g., 16 GPUs → 1600) |
| alibabacloud.com/gpu-memory.percentage | GPU memory as a percentage. | number of GPU devices × 100 (e.g., 16 GPUs → 1600) |
| cpu | CPU cores reserved for the GPU sharing module, deducted from allocatable. | number of GPU devices × 2 (e.g., 16 GPUs → 32 cores reserved) |
| memory | Memory reserved for the GPU sharing module. | number of GPU devices × 10 GB (e.g., 16 GPUs → 160 GB reserved) |
| ephemeral-storage | Disk space reserved per node. | 1.5 TB per node |
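
The calculations in the table can be written out as a short worked example. The function name `gpu_share_reservation` and the dictionary keys are hypothetical. The 176-core capacity is the value from the sample node output earlier in this topic.

```python
def gpu_share_reservation(gpu_count: int) -> dict:
    """Apply the per-GPU formulas from the table above."""
    return {
        "gpu_core_percentage": gpu_count * 100,
        "gpu_memory_percentage": gpu_count * 100,
        "reserved_cpu_cores": gpu_count * 2,
        "reserved_memory_gb": gpu_count * 10,
    }


reservation = gpu_share_reservation(16)
print(reservation)

# Allocatable CPU on a node whose capacity is 176 cores.
print(176 - reservation["reserved_cpu_cores"])
```

For 16 GPUs this gives 1600 for both percentage resources, 32 reserved cores, and 160 GB of reserved memory, which leaves 144 allocatable cores on the sample node.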
Node conditions
The node conditions field reports two GPU sharing condition types.
GPUSharePolicyValid — whether the GPU sharing configuration is valid:
| Field | Values | Description |
|---|---|---|
| status | "True", "False" | True: configuration is valid. False: configuration is invalid; check reason. |
| reason | Valid, InvalidParameters, InvalidExistingPods, ResourceNotEnough | Valid: policy is valid. InvalidParameters: syntax error in the configuration. InvalidExistingPods: incompatible GPU pods exist on the node; the feature cannot be enabled or disabled. ResourceNotEnough: insufficient node resources for the GPU sharing module's basic overhead; delete some pods first. |
| message | - | Human-readable message. |
| lastTransitionTime, lastHeartbeatTime | UTC | Time when the condition was last updated. |
GPUSharePolicy — the currently active GPU sharing policy:
| Field | Values | Description |
|---|---|---|
| status | "True", "False" | True: GPU sharing is enabled. False: GPU sharing is not enabled. |
| reason | none, share-pool, static | The policy currently in effect. |
| message | - | Human-readable message. |
| lastTransitionTime, lastHeartbeatTime | UTC | Time when the condition was last updated. |
Pod configuration
To use GPU sharing, configure the following labels and resource requests on the pod.
apiVersion: v1
kind: Pod
metadata:
  labels:
    alibabacloud.com/compute-class: "gpu-hpn" # Only gpu-hpn is supported
    alibabacloud.com/gpu-share-policy: "share-pool" # Must match the node's sharing model
  name: gpu-share-demo
  namespace: default
spec:
  containers:
  - name: demo
    image: registry-cn-wulanchabu-vpc.ack.aliyuncs.com/acs/stress:v1.0.4
    args:
    - '1000h'
    command:
    - sleep
    resources:
      limits:
        cpu: '5'
        memory: 50Gi
        alibabacloud.com/gpu-core.percentage: 100
        alibabacloud.com/gpu-memory.percentage: 100
      requests:
        cpu: '5'
        memory: 50Gi
        alibabacloud.com/gpu-core.percentage: 10
        alibabacloud.com/gpu-memory.percentage: 10
Compute class
| Label | Value | Description |
|---|---|---|
| metadata.labels.alibabacloud.com/compute-class | gpu-hpn | Only the gpu-hpn compute class is supported. |
GPU sharing policy
| Label | Type | Valid values | Description |
|---|---|---|---|
| metadata.labels.alibabacloud.com/gpu-share-policy | String | none, share-pool, static | Specifies the GPU sharing model for the pod. Only nodes that use the same model are considered for scheduling. |
Resource requests
Specify GPU shared resources in the container's resources field using percentages of a single GPU's computing power and memory.
| Field | Resource | Type | Valid values | Description |
|---|---|---|---|---|
| requests | alibabacloud.com/gpu-core.percentage | Int | share-pool: [10, 100]; static: [10, 100) | The percentage of a single GPU's computing power to request. Minimum: 10%. Controls how many pods can be scheduled on a node. |
| requests | alibabacloud.com/gpu-memory.percentage | Int | share-pool: [10, 100]; static: [10, 100) | The percentage of a single GPU's memory to request. Minimum: 10%. |
| limits | alibabacloud.com/gpu-core.percentage | Int | - | The upper limit of computing power usage at runtime. |
| limits | alibabacloud.com/gpu-memory.percentage | Int | - | The upper limit of GPU memory usage at runtime. Exceeding this causes a CUDA OOM error. |
Both alibabacloud.com/gpu-core.percentage and alibabacloud.com/gpu-memory.percentage must be specified in both requests and limits.
The number of pods that can be scheduled on a node is also constrained by CPU, memory, and the node's maximum pod count.
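
The rules in this section lend themselves to a pre-submission check. The following sketch validates the two GPU shared resources against the documented ranges and the requests/limits relationship for each model. It is a hypothetical client-side helper (`validate_gpu_share` is not part of any ACS tooling) and does not replace the checks that the cluster performs.

```python
from typing import Dict, List

CORE = "alibabacloud.com/gpu-core.percentage"
MEMORY = "alibabacloud.com/gpu-memory.percentage"


def validate_gpu_share(policy: str,
                       requests: Dict[str, int],
                       limits: Dict[str, int]) -> List[str]:
    """Return a list of problems; an empty list means the values look consistent."""
    problems = []
    for name in (CORE, MEMORY):
        if name not in requests or name not in limits:
            problems.append(f"{name} must be set in both requests and limits")
            continue
        req, lim = requests[name], limits[name]
        # share-pool allows [10, 100]; static allows [10, 100).
        upper_ok = req <= 100 if policy == "share-pool" else req < 100
        if req < 10 or not upper_ok:
            problems.append(f"{name}: request {req} is out of range for {policy}")
        if policy == "share-pool" and req > lim:
            problems.append(f"{name}: requests must be <= limits for share-pool")
        if policy == "static" and req != lim:
            problems.append(f"{name}: requests must equal limits for static")
    return problems


print(validate_gpu_share("share-pool",
                         {CORE: 10, MEMORY: 10},
                         {CORE: 100, MEMORY: 100}))
```

The values from the sample pod above pass the check, so the call prints an empty list.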
FAQ
What happens to a pod waiting in the ready queue?
The pod periodically logs its waiting status:
You have been waiting for ${1} seconds. Approximate position: ${2}
${1} is the number of seconds the pod has been waiting. ${2} is its current position in the ready queue.
What monitoring metrics are available for GPU sharing pods?
The following metrics are available for share-pool pods:
| Metric | Description | Example |
|---|---|---|
| DCGM_FI_POOLING_STATUS | Pod status in GPU sharing mode. Values: 0 = Hibernation (no GPU demand); 1 = Ready (waiting for resources); 2 = Normal (using GPU, duration < podMaxDurationMinutes); 3 = Preemptible (using GPU, duration > podMaxDurationMinutes, but no pods are queued). | DCGM_FI_POOLING_STATUS{NodeName="cn-wulanchabu-c.cr-xxx",pod="gpu-share-demo",namespace="default"} 1 |
| DCGM_FI_POOLING_POSITION | Pod's position in the ready queue, starting from 1. Only appears when DCGM_FI_POOLING_STATUS=1. | DCGM_FI_POOLING_POSITION{NodeName="cn-wulanchabu-c.cr-xxx",pod="gpu-share-demo",namespace="default"} 1 |
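
When these metrics are consumed in a script, the numeric status is easier to read once mapped back to its name. The sketch below parses one sample line in the format shown in the Example column and decodes the status. The regular expression handles only this simple form, and the names `POOLING_STATUS` and `parse_sample` are hypothetical.

```python
import re
from typing import Dict, Tuple

# Status values as listed in the table above.
POOLING_STATUS = {0: "Hibernation", 1: "Ready", 2: "Normal", 3: "Preemptible"}

SAMPLE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split a sample line into metric name, label dictionary, and numeric value."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sample: {line!r}")
    labels = dict(re.findall(r'(\w+)="([^"]*)"', match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


line = ('DCGM_FI_POOLING_STATUS{NodeName="cn-wulanchabu-c.cr-xxx",'
        'pod="gpu-share-demo",namespace="default"} 1')
name, labels, value = parse_sample(line)
print(labels["pod"], POOLING_STATUS[int(value)])
```

For the sample line, the pod gpu-share-demo is reported as Ready, meaning it is waiting in the queue.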
How do GPU utilization metrics differ for shared GPU pods?
GPU utilization metrics work the same way as for exclusive GPU pods, with a few differences for shared GPU pods:
ACS pod monitoring: GPU computing power utilization and GPU memory usage are absolute values based on the entire GPU card—the same as in exclusive GPU scenarios.
In-container view (e.g., nvidia-smi): GPU memory usage is an absolute value, but computing power utilization is a relative value where the denominator is the pod's limit.
Device IDs: The device ID in metrics corresponds to the actual ID on the node and does not always start from 0.
share-pool model: The device number in metrics may change because the pod can use different GPU devices from the pool over time.
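
To compare the in-container utilization figure with ACS pod monitoring, the relative value can be rescaled by the pod's limit. The conversion below is an assumption based on the description above (in-container utilization = usage divided by the pod's computing power limit); the function name is hypothetical.

```python
def to_whole_gpu_utilization(in_container_percent: float,
                             core_limit_percent: int) -> float:
    """Convert utilization relative to the pod's limit into a share of the whole GPU."""
    return in_container_percent * core_limit_percent / 100.0


# A pod limited to 50% of a GPU that reports 80% inside the container.
print(to_whole_gpu_utilization(80.0, 50))
```

Under this assumption, 80% inside a container limited to 50% corresponds to about 40% of the entire GPU card.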
How do I prevent scheduling conflicts when GPU sharing is enabled on only some nodes?
The default ACS scheduler automatically matches pod and node types, avoiding conflicts.
If you use a custom scheduler, an exclusive GPU pod might be scheduled onto a GPU sharing node because the node exposes both nvidia.com/gpu and GPU shared resources in its capacity. Use one of these approaches:
Scheduler plugin: Write a plugin that reads ACS node labels and Condition fields to filter out nodes with a mismatched GPU sharing policy. See Scheduling Framework.
Labels or taints: Add a label or taint to GPU sharing nodes, then configure affinity or toleration policies on your pods.
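
The matching rule that a custom filter would enforce can be expressed as a simple predicate. This sketch shows only the comparison logic in Python. A real scheduler plugin is written against the Kubernetes Scheduling Framework, and should also consult the node's Condition fields. Treating a missing label as none is an assumption made here.

```python
from typing import Dict

POLICY_LABEL = "alibabacloud.com/gpu-share-policy"


def node_matches_pod(node_labels: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """Return True when the pod's sharing policy matches the node's policy."""
    # Assumption: an absent label means GPU sharing is not in use ("none").
    node_policy = node_labels.get(POLICY_LABEL, "none")
    pod_policy = pod_labels.get(POLICY_LABEL, "none")
    return node_policy == pod_policy


sharing_node = {POLICY_LABEL: "share-pool"}
exclusive_pod: Dict[str, str] = {}
print(node_matches_pod(sharing_node, exclusive_pod))
```

The call prints False: an exclusive GPU pod without the policy label is kept off a node labeled share-pool.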
What information is available when a GPU sharing pod is preempted?
For share-pool pods, preemption generates both an Event and a Condition on the pod.
Events:
# This pod's GPU resources were preempted by <new-pod-name>
Warning GPUSharePreempted 5m15s gpushare GPU is preempted by <new-pod-name>.
# This pod preempted GPU resources from <old-pod-name>
Warning GPUSharePreempt 3m47s gpushare GPU is preempted from <old-pod-name>.
Condition:
- type: Interruption.GPUShareReclaim # Condition type for GPU sharing preemption events
  status: "True" # True: this pod preempted another pod or was itself preempted
  reason: GPUSharePreempt # GPUSharePreempt: this pod preempted another pod; GPUSharePreempted: this pod was preempted
  message: GPU is preempted from <old-pod-name>.
  lastTransitionTime: "2025-04-22T08:12:09Z"
  lastProbeTime: "2025-04-22T08:12:09Z"
How do I maximize pod density in a Notebook scenario?
For GPU sharing pods, you can also set CPU and memory requests lower than limits to increase pod density on a node. When the total limits across pods on a node exceed the node's allocatable resources, pods compete for CPU and memory.
CPU: Competition shows up as CPU Steal Time in the pod's metrics.
Memory: Competition can trigger a node-level out-of-memory (OOM) error, causing some pods to be killed.
Plan pod priorities and resource specifications based on each application's characteristics. For node-level resource utilization data, see ACS GPU-HPN node-level monitoring metrics.
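
A quick way to see how far a node is overcommitted is to compare the sum of pod limits with the node's allocatable amount. The figures below are illustrative only: 144 cores is the allocatable CPU from the sample node output, and 40 pods with a CPU limit of 5 each is a made-up workload. The function name `overcommit_ratio` is hypothetical.

```python
from typing import List


def overcommit_ratio(allocatable: float, pod_limits: List[float]) -> float:
    """Ratio of total pod limits to allocatable; above 1.0 means pods can compete."""
    return sum(pod_limits) / allocatable


ratio = overcommit_ratio(144, [5] * 40)
print(f"{ratio:.2f}")
```

A ratio above 1.0 indicates that pods can contend for the resource if they all approach their limits at the same time.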