Use GPU sharing on GPU-HPN nodes

GPU sharing on GPU-HPN nodes lets multiple pods run on a single GPU device, so you can request fractional GPU resources instead of dedicating an entire GPU to each workload. This reduces idle capacity and helps you run more workloads per node—particularly useful for Notebook development sessions and small AI inference services.

Important

GPU sharing is available only in ACS clusters. This feature is in public preview in the Ulanqab and Shanghai Finance Cloud regions. To use it in other regions, submit a ticket.

Limitations

Before using GPU sharing, be aware of the following constraints:

  • GPU sharing provides fine-grained resource allocation within a single GPU. It does not support aggregated requests across multiple GPUs—you cannot request 0.5 of the computing power from two different GPUs simultaneously.

  • The GPU sharing module manages the driver version for all pods that use GPU sharing. You cannot specify a driver version for an individual pod.

  • Each pod can have at most one container that uses GPU shared resources (typically the main container). Sidecar containers can only request CPU and memory.

  • A container cannot request both exclusive GPU resources (nvidia.com/gpu) and GPU shared resources (alibabacloud.com/gpu-core.percentage, alibabacloud.com/gpu-memory.percentage).
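The two container-level constraints above can be checked before a manifest is applied. The following is a minimal sketch, assuming the pod is available as a plain dictionary (for example, loaded from YAML); the function name check_gpu_share_limitations is hypothetical and not part of any ACS tooling.

```python
EXCLUSIVE = "nvidia.com/gpu"
SHARED = (
    "alibabacloud.com/gpu-core.percentage",
    "alibabacloud.com/gpu-memory.percentage",
)


def check_gpu_share_limitations(pod):
    """Return a list of violations of the container-level limits (empty if none)."""
    problems = []
    shared_users = []
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources", {})
        # Collect every resource name that appears in requests or limits.
        names = set(resources.get("requests", {})) | set(resources.get("limits", {}))
        uses_shared = any(name in names for name in SHARED)
        if uses_shared:
            shared_users.append(container["name"])
        if uses_shared and EXCLUSIVE in names:
            problems.append(container["name"] + ": mixes exclusive and shared GPU resources")
    if len(shared_users) > 1:
        problems.append("more than one container uses GPU shared resources")
    return problems
```

An empty list means the pod passes both checks; the driver-version and cross-GPU constraints are not covered by this sketch.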

How it works

Pods do not access a GPU device directly. Instead, they interact with it through the GPU sharing module, which consists of two components:

  • Proxy module: Integrated into the pod by default. Intercepts API calls related to the GPU device and forwards them to the resource management module.

  • Resource management module: Runs GPU instructions on the actual GPU device and enforces resource limits based on the pod's resource description.

When the GPU sharing feature is enabled, the resource management module automatically reserves some CPU and memory on the node. For details on reserved amounts, see Node configuration.

Pod states and Quality of Service

Similar to OS process management, the GPU sharing module assigns each pod one of three states:

  • Hibernation: The pod has no GPU demand (initial state when a pod starts).

  • Ready: The pod is waiting for GPU resources to be allocated.

  • Running: The pod is actively using GPU resources.

When multiple pods compete for GPU resources simultaneously, the GPU sharing module applies Quality of Service (QoS) policies to manage allocation fairly.

Queuing policy (share-pool model only)

Pods in the ready state are queued using First In, First Out (FIFO). The pod that entered the ready state first receives resources first. If resources are insufficient, the preemption policy is triggered.

Preemption policy (share-pool model only)

When a pod in the ready queue cannot get resources, the GPU sharing module attempts to preempt a running pod using the following criteria:

| Policy | Description |
| --- | --- |
| Filter | A running pod is eligible for preemption only if it has continuously occupied GPU resources for longer than podMaxDurationMinutes (default: 120 minutes, that is, 2 hours). |
| Scoring | Among eligible pods, those that have held GPU resources longer are preempted first. |

If no running pod meets the filter condition, the queued pod continues to wait.
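The filter and scoring rules can be illustrated with a short model. This is a sketch of the described behavior only, not the GPU sharing module's actual code; the RunningPod type and pick_preemption_victim function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RunningPod:
    name: str
    held_minutes: float  # how long the pod has continuously occupied GPU resources


def pick_preemption_victim(running, pod_max_duration_minutes=120):
    """Return the pod to preempt, or None if the queued pod must keep waiting."""
    # Filter: only pods that have held the GPU longer than the threshold are eligible.
    eligible = [p for p in running if p.held_minutes > pod_max_duration_minutes]
    if not eligible:
        return None
    # Scoring: the pod that has held GPU resources the longest is preempted first.
    return max(eligible, key=lambda p: p.held_minutes)
```

With pods that have held a GPU for 30, 150, and 200 minutes, the 200-minute pod is selected; if every pod is under 120 minutes, the function returns None and the queued pod keeps waiting.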

Choose a sharing model

ACS supports two sharing models. The main difference is whether a pod is assigned a fixed GPU device or can use any available GPU on the node:

  • Use share-pool when workloads are bursty or intermittent (such as Notebook sessions). Pods share resources from a common GPU pool, with QoS mechanisms (queuing and preemption) managing contention.

  • Use static when workloads need guaranteed, uninterrupted GPU access without queuing. Each pod is fixed to a specific GPU device.

| Model | GPU assignment | requests/limits | Queuing | Preemption | Best for |
| --- | --- | --- | --- | --- | --- |
| share-pool | Any GPU with idle resources on the node | requests <= limits | FIFO | Configurable | Notebook development, off-peak multi-user workloads |
| static | Fixed GPU device, does not change at runtime | requests == limits | Not supported | Not supported | Small-scale AI apps that need guaranteed GPU access without queuing |

Warning

For the static model, always set requests == limits for GPU computing power and GPU memory. If requests < limits, resource competition occurs between pods sharing the same GPU and can cause pods to be killed by an out-of-memory (OOM) error.

Example: off-peak resource sharing in Notebook scenarios

In Notebook development, workloads typically do not hold GPU resources continuously. With the share-pool model, pods use GPU resources only when they need them, and the QoS mechanism manages access when multiple pods request resources simultaneously.

Consider four pods on a node with two GPUs:

  • Pods A and B: requests=0.5, limits=0.5

  • Pods C and D: requests=0.5, limits=1

Based on requests, all four pods fit on the node.

Time T1: Pods A and C are running (Pod A on GPU 0, Pod C on GPU 1). Pods B and D are in the ready queue, with Pod D first. The GPU sharing module tries to allocate resources to Pod D. GPU 0 has 0.5 GPU idle, which satisfies Pod D's requests=0.5, but Pod D's limits=1 would cause resource competition with Pod A on the same GPU. So Pod D stays in the queue.

Time T2 – Phase 1: Pod C finishes and enters hibernation. GPU 1 becomes free and its resources are allocated to Pod D.

Time T2 – Phase 2: Pod B is allocated resources on GPU 0. Because Pod B's limits=0.5, it can share GPU 0 with Pod A without resource competition.
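The timeline can be replayed with a small model. It rests on an assumption inferred from this example rather than stated as a rule: a ready pod is placed on a GPU only when the limits of all pods on that GPU, including the new one, add up to at most one full GPU. Values are expressed in percent, and try_place is a hypothetical helper.

```python
def try_place(gpu_limits_in_use, pod_limit):
    """Return the index of the first GPU with room for pod_limit, else None."""
    for index, used in enumerate(gpu_limits_in_use):
        if used + pod_limit <= 100:
            return index
    return None


# T1: Pod A (limits=50) runs on GPU 0, Pod C (limits=100) runs on GPU 1.
gpus = [50, 100]
queue = [("D", 100), ("B", 50)]  # FIFO order: Pod D is first in the queue

# Pod D does not fit anywhere, so it stays at the head of the queue.
t1_result = try_place(gpus, queue[0][1])

# T2, phase 1: Pod C hibernates and frees GPU 1, which goes to Pod D.
gpus[1] = 0
d_gpu = try_place(gpus, queue[0][1])
gpus[d_gpu] += queue.pop(0)[1]

# T2, phase 2: Pod B is next in the queue and shares GPU 0 with Pod A.
b_gpu = try_place(gpus, queue[0][1])
gpus[b_gpu] += queue.pop(0)[1]
```

In this model Pod D lands on GPU 1 and Pod B on GPU 0, matching the description above.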

Enable GPU sharing on a node

The following steps show how to enable the share-pool model, deploy a pod with fractional GPU resources, verify the configuration, and optionally disable the feature.

Prerequisites

Before you begin, ensure that you have:

  • An ACS cluster with GPU-HPN nodes.

  • kubectl configured to connect to the cluster.

  • Deleted any pods on the target node that request exclusive GPU resources. Pods that use only CPU and memory do not need to be deleted.

Step 1: Label the GPU-HPN node

List the GPU-HPN nodes in the cluster:

kubectl get node -l alibabacloud.com/node-type=reserved

Expected output:

NAME                     STATUS   ROLES   AGE   VERSION
cn-wulanchabu-c.cr-xxx   Ready    agent   59d   v1.28.3-aliyun

Add the alibabacloud.com/gpu-share-policy=share-pool label to enable GPU sharing:

kubectl label node cn-wulanchabu-c.cr-xxx alibabacloud.com/gpu-share-policy=share-pool

Step 2: Verify the node status

After applying the label, check that the feature is active on the node:

kubectl get node cn-wulanchabu-c.cr-xxx -o yaml

Expected output (truncated):

# The actual output may vary.
apiVersion: v1
kind: Node
spec:
  # ...
status:
  allocatable:
    # GPU shared resource description
    alibabacloud.com/gpu-core.percentage: "1600"
    alibabacloud.com/gpu-memory.percentage: "1600"
    # cpu, memory, and ephemeral-storage are capacity minus the amounts reserved for the GPU sharing module
    cpu: "144"
    memory: 1640Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 4608Gi
  capacity:
    # GPU shared resource description
    alibabacloud.com/gpu-core.percentage: "1600"
    alibabacloud.com/gpu-memory.percentage: "1600"
    cpu: "176"
    memory: 1800Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 6Ti
  conditions:
  # Indicates whether the GPU share policy configuration is valid
  - lastHeartbeatTime: "2025-01-07T04:13:04Z"
    lastTransitionTime: "2025-01-07T04:13:04Z"
    message: gpu share policy is valid.
    reason: Valid
    status: "True"
    type: GPUSharePolicyValid
  # Indicates the GPU share policy in effect on this node
  - lastHeartbeatTime: "2025-01-07T04:13:04Z"
    lastTransitionTime: "2025-01-07T04:13:04Z"
    message: gpu share policy is share-pool.
    reason: share-pool
    status: "True"
    type: GPUSharePolicy

Confirm the following in the output to verify that the feature is active:

  • allocatable and capacity include alibabacloud.com/gpu-core.percentage and alibabacloud.com/gpu-memory.percentage.

  • The GPUSharePolicyValid condition has status: "True".

  • The GPUSharePolicy condition has reason: share-pool.

If the node resources do not update as described, the configuration failed. Check the GPUSharePolicyValid condition's reason and message fields for details. See Node conditions for reason values.
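The three checks can also be scripted. The sketch below assumes the node object is read with kubectl get node <name> -o json; the function names gpu_share_active and fetch_node are hypothetical, and fetch_node needs kubectl access to the cluster.

```python
import json
import subprocess

SHARED_FIELDS = (
    "alibabacloud.com/gpu-core.percentage",
    "alibabacloud.com/gpu-memory.percentage",
)


def gpu_share_active(node, expected_policy="share-pool"):
    """Apply the three checks from this step to a node object (as a dict)."""
    status = node.get("status", {})
    has_fields = all(
        name in status.get(section, {})
        for section in ("allocatable", "capacity")
        for name in SHARED_FIELDS
    )
    conditions = {c["type"]: c for c in status.get("conditions", [])}
    valid = conditions.get("GPUSharePolicyValid", {}).get("status") == "True"
    policy = conditions.get("GPUSharePolicy", {}).get("reason") == expected_policy
    return has_fields and valid and policy


def fetch_node(name):
    """Read the node object through kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "node", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

For example, gpu_share_active(fetch_node("cn-wulanchabu-c.cr-xxx")) returns True only when all three conditions hold.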

Step 3: Deploy a pod with shared GPU resources

Create a file named gpu-share-demo.yaml. Set the GPU sharing model to share-pool, matching the node configuration:

apiVersion: v1
kind: Pod
metadata:
  labels:
    alibabacloud.com/compute-class: "gpu-hpn"
    # Set the GPU sharing model to share-pool, matching the node configuration
    alibabacloud.com/gpu-share-policy: "share-pool"
  name: gpu-share-demo
  namespace: default
spec:
  containers:
  - name: demo
    image: registry-cn-wulanchabu-vpc.ack.aliyuncs.com/acs/stress:v1.0.4
    args:
      - '1000h'
    command:
      - sleep
    resources:
      limits:
        cpu: '5'
        memory: 50Gi
        alibabacloud.com/gpu-core.percentage: 100   # Upper limit of computing power usage
        alibabacloud.com/gpu-memory.percentage: 100  # Upper limit of GPU memory usage; exceeding this causes a CUDA OOM error
      requests:
        cpu: '5'
        memory: 50Gi
        alibabacloud.com/gpu-core.percentage: 10    # Controls how many pods can be scheduled on the node
        alibabacloud.com/gpu-memory.percentage: 10   # Controls how many pods can be scheduled on the node

Deploy the pod:

kubectl apply -f gpu-share-demo.yaml

Step 4: Check GPU resource usage

Log in to the container to verify GPU resource usage:

kubectl exec -it gpu-share-demo -- /bin/bash

Inside the container, run nvidia-smi to view GPU resource allocation and usage. The command depends on your GPU card type: nvidia-smi applies to NVIDIA GPU devices. For other card types, submit a ticket for assistance.

Note

For share-pool pods, the BusID field in nvidia-smi output shows Pending when the pod is not actively using GPU resources. This is expected behavior, not an error.

Step 5 (optional): Disable GPU sharing on the node

Important

Before disabling GPU sharing, delete all pods on the node that use GPU shared resources. Pods that use only CPU and memory do not need to be deleted.

  1. Delete the pod:

    kubectl delete pod gpu-share-demo
  2. Set the GPU sharing policy to none:

    kubectl label node cn-wulanchabu-c.cr-xxx alibabacloud.com/gpu-share-policy=none
  3. Verify the node status:

    kubectl get node cn-wulanchabu-c.cr-xxx -o yaml

    Expected output (truncated):

    apiVersion: v1
    kind: Node
    spec:
      # ...
    status:
      allocatable:
        # Reserved CPU and memory are restored after the feature is disabled
        cpu: "176"
        memory: 1800Gi
        nvidia.com/gpu: "16"
        ephemeral-storage: 4608Gi
      capacity:
        cpu: "176"
        memory: 1800Gi
        nvidia.com/gpu: "16"
        ephemeral-storage: 6Ti
      conditions:
      - lastHeartbeatTime: "2025-01-07T04:13:04Z"
        lastTransitionTime: "2025-01-07T04:13:04Z"
        message: gpu share policy config is valid.
        reason: Valid
        status: "True"
        type: GPUSharePolicyValid
      - lastHeartbeatTime: "2025-01-07T04:13:04Z"
        lastTransitionTime: "2025-01-07T04:13:04Z"
        message: gpu share policy is none.
        reason: none
        status: "False"
        type: GPUSharePolicy

    Confirm the following in the output:

    • allocatable and capacity no longer include alibabacloud.com/gpu-core.percentage or alibabacloud.com/gpu-memory.percentage.

    • The GPUSharePolicy condition has status: "False" and reason: none.

    • CPU and memory in allocatable are restored to their original values.

Node configuration

Enablement label

Set the alibabacloud.com/gpu-share-policy label on a node to enable or disable GPU sharing.

apiVersion: v1
kind: Node
metadata:
  labels:
    alibabacloud.com/gpu-share-policy: share-pool  # or: static, none

| Value | Description |
| --- | --- |
| none | Disables GPU sharing on the node. |
| share-pool | Treats all GPUs on the node as a shared pool. Pods are not fixed to a specific GPU device. |
| static | GPU slicing mode. Each pod is assigned a fixed GPU device that does not change at runtime. The scheduler prioritizes placing pods on the same GPU to minimize fragmentation. |

Important

  • If pods that use exclusive GPUs exist on the node, delete them before enabling the sharing policy.

  • If pods that use GPU shared resources exist on the node, delete them before modifying or disabling the sharing policy.

  • Pods that use only CPU and memory do not need to be deleted.

QoS configuration

Configure Quality of Service (QoS) parameters for GPU sharing using the alibabacloud.com/gpu-share-qos-config node annotation. These parameters apply only to the share-pool model.

apiVersion: v1
kind: Node
metadata:
  annotations:
    alibabacloud.com/gpu-share-qos-config: '{"preemptEnabled": true, "podMaxDurationMinutes": 120, "reservedEphemeralStorage": "1.5Ti"}'

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| preemptEnabled | Boolean | true | Whether to enable preemption. |
| podMaxDurationMinutes | Int | 120 (2 hours) | A pod can be preempted only after it has continuously occupied a GPU for longer than this duration. Must be greater than 0. Unit: minutes. |
| reservedEphemeralStorage | resource.Quantity | 1.5Ti | Reserved local temporary storage per node. Must be greater than or equal to 0. Uses Kubernetes quantity format, such as 500Gi. |
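
Because the annotation value is a JSON string, generating it programmatically avoids quoting mistakes. The following is a minimal sketch; build_qos_annotation is a hypothetical helper, and it validates only the podMaxDurationMinutes constraint from the parameter list.

```python
import json

QOS_ANNOTATION = "alibabacloud.com/gpu-share-qos-config"


def build_qos_annotation(preempt_enabled=True,
                         pod_max_duration_minutes=120,
                         reserved_ephemeral_storage="1.5Ti"):
    """Return the JSON string to store in the QoS node annotation."""
    if pod_max_duration_minutes <= 0:
        raise ValueError("podMaxDurationMinutes must be greater than 0")
    return json.dumps({
        "preemptEnabled": preempt_enabled,
        "podMaxDurationMinutes": pod_max_duration_minutes,
        "reservedEphemeralStorage": reserved_ephemeral_storage,
    })
```

The resulting string can then be set on the node, for example with kubectl annotate node <node-name> followed by the annotation key, an equals sign, the quoted value, and --overwrite.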

Shared resource fields

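When GPU sharing is enabled, the two GPU percentage fields are added to the node's allocatable and capacity, and the CPU, memory, and ephemeral storage reserved for the GPU sharing module are deducted from allocatable. The percentage fields are removed and the reserved CPU and memory are restored when the feature is disabled.

| Field | Description | Calculation |
| --- | --- | --- |
| alibabacloud.com/gpu-core.percentage | GPU computing power as a percentage. | number of GPU devices × 100 (e.g., 16 GPUs → 1600) |
| alibabacloud.com/gpu-memory.percentage | GPU memory as a percentage. | number of GPU devices × 100 (e.g., 16 GPUs → 1600) |
| cpu | CPU cores reserved for the GPU sharing module, deducted from allocatable. | number of GPU devices × 2 (e.g., 16 GPUs → 32 cores reserved) |
| memory | Memory reserved for the GPU sharing module, deducted from allocatable. | number of GPU devices × 10 GiB (e.g., 16 GPUs → 160 GiB reserved) |
| ephemeral-storage | Disk space reserved per node, deducted from allocatable. | 1.5 TiB per node by default (set by reservedEphemeralStorage) |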
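
As a worked check of these formulas, the sketch below recomputes the values shown in the Step 2 example output (16 GPUs, 176 CPU cores, 1800Gi memory). The function name shared_node_resources is hypothetical, and memory is treated in the Gi units that the example output uses.

```python
def shared_node_resources(gpu_count, capacity_cpu, capacity_memory_gi):
    """Compute the shared GPU fields and the allocatable CPU and memory."""
    return {
        "alibabacloud.com/gpu-core.percentage": gpu_count * 100,
        "alibabacloud.com/gpu-memory.percentage": gpu_count * 100,
        # 2 CPU cores and 10 Gi of memory are reserved per GPU device.
        "allocatable_cpu": capacity_cpu - gpu_count * 2,
        "allocatable_memory_gi": capacity_memory_gi - gpu_count * 10,
    }


result = shared_node_resources(gpu_count=16, capacity_cpu=176, capacity_memory_gi=1800)
```

Here result holds 1600 for both percentage fields, 144 allocatable CPU cores, and 1640 Gi of allocatable memory, which matches the example node.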

Node conditions

The node conditions field reports two GPU sharing condition types.

GPUSharePolicyValid — whether the GPU sharing configuration is valid:

| Field | Values | Description |
| --- | --- | --- |
| status | "True", "False" | True: configuration is valid. False: configuration is invalid; check reason. |
| reason | Valid, InvalidParameters, InvalidExistingPods, ResourceNotEnough | Valid: policy is valid. InvalidParameters: syntax error in the configuration. InvalidExistingPods: incompatible GPU pods exist on the node; the feature cannot be enabled or disabled. ResourceNotEnough: insufficient node resources for the GPU sharing module's basic overhead; delete some pods first. |
| message | | Human-readable message. |
| lastTransitionTime, lastHeartbeatTime | UTC | Time when the condition was last updated. |

GPUSharePolicy — the currently active GPU sharing policy:

| Field | Values | Description |
| --- | --- | --- |
| status | "True", "False" | True: GPU sharing is enabled. False: GPU sharing is not enabled. |
| reason | none, share-pool, static | The policy currently in effect. |
| message | | Human-readable message. |
| lastTransitionTime, lastHeartbeatTime | UTC | Time when the condition was last updated. |

Pod configuration

To use GPU sharing, configure the following labels and resource requests on the pod.

apiVersion: v1
kind: Pod
metadata:
  labels:
    alibabacloud.com/compute-class: "gpu-hpn"         # Only gpu-hpn is supported
    alibabacloud.com/gpu-share-policy: "share-pool"   # Must match the node's sharing model
  name: gpu-share-demo
  namespace: default
spec:
  containers:
  - name: demo
    image: registry-cn-wulanchabu-vpc.ack.aliyuncs.com/acs/stress:v1.0.4
    args:
      - '1000h'
    command:
      - sleep
    resources:
      limits:
        cpu: '5'
        memory: 50Gi
        alibabacloud.com/gpu-core.percentage: 100
        alibabacloud.com/gpu-memory.percentage: 100
      requests:
        cpu: '5'
        memory: 50Gi
        alibabacloud.com/gpu-core.percentage: 10
        alibabacloud.com/gpu-memory.percentage: 10

Compute class

| Label | Value | Description |
| --- | --- | --- |
| metadata.labels.alibabacloud.com/compute-class | gpu-hpn | Only the gpu-hpn compute class is supported. |

GPU sharing policy

| Label | Type | Valid values | Description |
| --- | --- | --- | --- |
| metadata.labels.alibabacloud.com/gpu-share-policy | String | none, share-pool, static | Specifies the GPU sharing model for the pod. Only nodes that use the same model are considered for scheduling. |

Resource requests

Specify GPU shared resources in the container's resources field using percentages of a single GPU's computing power and memory.

| Field | Resource | Type | Valid values | Description |
| --- | --- | --- | --- | --- |
| requests | alibabacloud.com/gpu-core.percentage | Int | share-pool: [10, 100]; static: [10, 100) | The percentage of a single GPU's computing power to request. Minimum: 10%. Controls how many pods can be scheduled on a node. |
| requests | alibabacloud.com/gpu-memory.percentage | Int | share-pool: [10, 100]; static: [10, 100) | The percentage of a single GPU's memory to request. Minimum: 10%. |
| limits | alibabacloud.com/gpu-core.percentage | Int | | The upper limit of computing power usage at runtime. |
| limits | alibabacloud.com/gpu-memory.percentage | Int | | The upper limit of GPU memory usage at runtime. Exceeding this causes a CUDA OOM error. |

Both alibabacloud.com/gpu-core.percentage and alibabacloud.com/gpu-memory.percentage must be specified in both requests and limits.

The number of pods that can be scheduled on a node is also constrained by CPU, memory, and the node's maximum pod count.
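
The request ranges and the requests/limits relationship for each model can be checked before deployment. This is a minimal sketch based only on the rules stated in this document; validate_gpu_share_resources is a hypothetical name, and the checks that ACS itself performs may differ.

```python
CORE = "alibabacloud.com/gpu-core.percentage"
MEMORY = "alibabacloud.com/gpu-memory.percentage"


def validate_gpu_share_resources(policy, requests, limits):
    """Return a list of rule violations for one container (empty if none)."""
    errors = []
    for name in (CORE, MEMORY):
        if name not in requests or name not in limits:
            errors.append(name + ": must be set in both requests and limits")
            continue
        req, lim = requests[name], limits[name]
        # share-pool accepts requests in [10, 100]; static accepts [10, 100).
        upper_ok = req <= 100 if policy == "share-pool" else req < 100
        if req < 10 or not upper_ok:
            errors.append(name + ": request is out of range for " + policy)
        if policy == "static" and req != lim:
            errors.append(name + ": static requires requests == limits")
        elif policy == "share-pool" and req > lim:
            errors.append(name + ": requests must not exceed limits")
    return errors
```

The example pod above (requests of 10, limits of 100, share-pool) passes with an empty list, while the same values under static produce one error per resource.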

FAQ

What happens to a pod waiting in the ready queue?

The pod periodically logs its waiting status:

You have been waiting for ${1} seconds. Approximate position: ${2}

${1} is the number of seconds the pod has been waiting. ${2} is its current position in the ready queue.
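
To track queue progress from pod logs, the message can be parsed with a regular expression. The sketch assumes both placeholders are rendered as plain integers; parse_wait_status is a hypothetical helper.

```python
import re

WAIT_LINE = re.compile(
    r"You have been waiting for (\d+) seconds\. Approximate position: (\d+)"
)


def parse_wait_status(line):
    """Return (seconds_waited, queue_position), or None if the line does not match."""
    match = WAIT_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

For instance, a line reporting 90 seconds at position 2 yields the tuple (90, 2).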

What monitoring metrics are available for GPU sharing pods?

The following metrics are available for share-pool pods:

| Metric | Description | Example |
| --- | --- | --- |
| DCGM_FI_POOLING_STATUS | Pod status in GPU sharing mode. Values: 0 = Hibernation (no GPU demand); 1 = Ready (waiting for resources); 2 = Normal (using GPU, duration < podMaxDurationMinutes); 3 = Preemptible (using GPU, duration > podMaxDurationMinutes, but no pods are queued). | DCGM_FI_POOLING_STATUS{NodeName="cn-wulanchabu-c.cr-xxx",pod="gpu-share-demo",namespace="default"} 1 |
| DCGM_FI_POOLING_POSITION | Pod's position in the ready queue, starting from 1. Only appears when DCGM_FI_POOLING_STATUS=1. | DCGM_FI_POOLING_POSITION{NodeName="cn-wulanchabu-c.cr-xxx",pod="gpu-share-demo",namespace="default"} 1 |
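
A metric sample in the form shown in the Example column can be decoded as follows. This is a sketch that handles only that single-line text form; parse_sample and POOLING_STATUS are hypothetical names.

```python
import re

POOLING_STATUS = {0: "Hibernation", 1: "Ready", 2: "Normal", 3: "Preemptible"}

SAMPLE = re.compile(r"^(\w+)\{(.*)\}\s+(\S+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split a sample line into (metric_name, labels, value)."""
    match = SAMPLE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized sample line")
    name, raw_labels, value = match.groups()
    return name, dict(LABEL.findall(raw_labels)), float(value)


name, labels, value = parse_sample(
    'DCGM_FI_POOLING_STATUS{NodeName="cn-wulanchabu-c.cr-xxx",'
    'pod="gpu-share-demo",namespace="default"} 1'
)
state = POOLING_STATUS[int(value)]
```

For the example sample, state is Ready, meaning the pod is waiting in the queue.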

How do GPU utilization metrics differ for shared GPU pods?

GPU utilization metrics work the same way as for exclusive GPU pods, with a few differences for shared GPU pods:

  • ACS pod monitoring: GPU computing power utilization and GPU memory usage are absolute values based on the entire GPU card—the same as in exclusive GPU scenarios.

  • In-container view (e.g., nvidia-smi): GPU memory usage is an absolute value, but computing power utilization is a relative value where the denominator is the pod's limit.

  • Device IDs: The device ID in metrics corresponds to the actual ID on the node and does not always start from 0.

  • share-pool model: The device number in metrics may change because the pod can use different GPU devices from the pool over time.
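
To compare the in-container view with ACS pod monitoring, the relative figure can be rescaled by the pod's limit. The sketch assumes a simple linear relationship, which follows from the limit being the denominator; to_absolute_utilization is a hypothetical helper.

```python
def to_absolute_utilization(in_container_percent, core_limit_percent):
    """Convert utilization relative to the pod's limit into a share of the whole GPU."""
    return in_container_percent * core_limit_percent / 100


# A pod with gpu-core.percentage limit 50 that sees 80% inside the container
# is using about 40% of the physical GPU under this assumption.
absolute = to_absolute_utilization(80, 50)
```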

How do I prevent scheduling conflicts when GPU sharing is enabled on only some nodes?

The default ACS scheduler automatically matches pod and node types, avoiding conflicts.

If you use a custom scheduler, an exclusive GPU pod might be scheduled onto a GPU sharing node because the node exposes both nvidia.com/gpu and GPU shared resources in its capacity. Use one of these approaches:

  • Scheduler plugin: Write a plugin that reads ACS node labels and Condition fields to filter out nodes with a mismatched GPU sharing policy. See Scheduling Framework.

  • Labels or taints: Add a label or taint to GPU sharing nodes, then configure affinity or toleration policies on your pods.
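
The matching rule that such a filter needs is small. The sketch below expresses it in Python for illustration only (real scheduler plugins are typically written in Go), and it assumes that a node or pod without the alibabacloud.com/gpu-share-policy label is treated as none; node_matches_pod is a hypothetical name.

```python
POLICY_LABEL = "alibabacloud.com/gpu-share-policy"


def node_matches_pod(node, pod):
    """Return True when the node's sharing policy matches the pod's."""
    node_policy = node.get("metadata", {}).get("labels", {}).get(POLICY_LABEL, "none")
    pod_policy = pod.get("metadata", {}).get("labels", {}).get(POLICY_LABEL, "none")
    return node_policy == pod_policy
```

Under this rule, an exclusive GPU pod (no label) is kept off a share-pool node, and a share-pool pod is kept off nodes where sharing is disabled.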

What information is available when a GPU sharing pod is preempted?

For share-pool pods, preemption generates both an Event and a Condition on the pod.

Events:

# This pod's GPU resources were preempted by <new-pod-name>
Warning  GPUSharePreempted  5m15s  gpushare   GPU is preempted by <new-pod-name>.
# This pod preempted GPU resources from <old-pod-name>
Warning  GPUSharePreempt    3m47s  gpushare   GPU is preempted from <old-pod-name>.

Condition:

- type: Interruption.GPUShareReclaim   # Condition type for GPU sharing preemption events
  status: "True"                        # True: this pod preempted another pod or was itself preempted
  reason: GPUSharePreempt               # GPUSharePreempt: this pod preempted another pod; GPUSharePreempted: this pod was preempted
  message: GPU is preempted from <old-pod-name>.
  lastTransitionTime: "2025-04-22T08:12:09Z"
  lastProbeTime: "2025-04-22T08:12:09Z"
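
A controller or script can read this condition to tell which side of a preemption a pod was on. The sketch assumes the pod object is available as a dictionary (for example, from kubectl get pod -o json); preemption_role is a hypothetical helper.

```python
def preemption_role(pod):
    """Return 'preempted', 'preemptor', or None based on the pod's conditions."""
    for condition in pod.get("status", {}).get("conditions", []):
        if condition.get("type") != "Interruption.GPUShareReclaim":
            continue
        if condition.get("status") != "True":
            continue
        if condition.get("reason") == "GPUSharePreempted":
            return "preempted"   # this pod lost its GPU resources
        if condition.get("reason") == "GPUSharePreempt":
            return "preemptor"   # this pod took GPU resources from another pod
    return None
```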

How do I maximize pod density in a Notebook scenario?

For GPU sharing pods, you can also set CPU and memory requests lower than limits to increase pod density on a node. When the sum of limits across the pods on a node exceeds the node's allocatable resources, pods compete for CPU and memory.

  • CPU: Competition shows up as CPU Steal Time in the pod's metrics.

  • Memory: Competition can trigger a node-level out-of-memory (OOM) error, causing some pods to be killed.

Plan pod priorities and resource specifications based on each application's characteristics. For node-level resource utilization data, see ACS GPU-HPN node-level monitoring metrics.
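
A quick way to gauge how much contention to expect is to compare the sum of limits with the node's allocatable amount. The sketch below uses the 144 allocatable CPU cores from the example node and a hypothetical set of 40 pods with a CPU limit of 5 each; cpu_overcommit is a hypothetical helper.

```python
def cpu_overcommit(pod_cpu_limits, allocatable_cpu):
    """Return (total_limits, ratio); a ratio above 1.0 means limits are overcommitted."""
    total = sum(pod_cpu_limits)
    return total, total / allocatable_cpu


total, ratio = cpu_overcommit([5] * 40, allocatable_cpu=144)
overcommitted = ratio > 1.0
```

Here the limits total 200 cores against 144 allocatable, so the pods can contend for CPU whenever they are busy at the same time. The same comparison applies to memory, where contention is more costly because it can end in OOM kills.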