Use capacity scheduling to improve the resource utilization of ACK clusters

Prerequisites

Ensure that you have:

An ACK managed Pro cluster running Kubernetes 1.20 or later

Key concepts

ElasticQuotaTree is a CustomResourceDefinition (CRD) that defines a hierarchy of elastic quota groups. Each node in the tree (a quota node) represents a quota boundary. Leaf quota nodes map to one or more namespaces, and pods in those namespaces are scheduled within the quota limits of their leaf node.

The two core fields in each quota node are:

Field	Meaning
`min`	Guaranteed resources. The scheduler ensures this amount is available, reclaiming borrowed resources if needed.
`max`	Maximum resources the quota node can use, including idle resources borrowed from other quota nodes.

Resource borrowing and reclaiming work as follows:

A pod is scheduled if its requested resources, added to the quota node's current usage, stay within max.
If the quota node's usage exceeds min, the excess is borrowed from idle capacity elsewhere in the tree.
When another quota node needs its min resources back, the scheduler selects pods in the borrowing quota node to evict, based on factors such as job priority, availability, and creation time.
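The first two rules above can be sketched as a small model. This is an illustrative simplification, not the ACK scheduler's implementation: the `QuotaNode` class and the `fits` and `borrowed` helpers are hypothetical names, only CPU is tracked, and parent quota limits and physical cluster capacity are ignored.

```python
from dataclasses import dataclass


@dataclass
class QuotaNode:
    """Hypothetical CPU-only model of one quota node."""
    name: str
    min_cpu: int       # guaranteed amount (the `min` field)
    max_cpu: int       # upper bound, including borrowed resources (the `max` field)
    used_cpu: int = 0  # sum of requests of pods already admitted


def fits(node: QuotaNode, request: int) -> bool:
    """A pod is admitted only if usage plus its request stays within max."""
    return node.used_cpu + request <= node.max_cpu


def borrowed(node: QuotaNode) -> int:
    """Any usage above min counts as borrowed from idle capacity."""
    return max(0, node.used_cpu - node.min_cpu)


leaf = QuotaNode(name="root.a.1", min_cpu=10, max_cpu=20, used_cpu=15)
print(fits(leaf, 5))    # True: 15 + 5 <= 20
print(fits(leaf, 10))   # False: 15 + 10 > 20
print(borrowed(leaf))   # 5
```

Borrowed usage is the part that can be reclaimed later; usage at or below min is never at risk in this model.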

Features

Hierarchical quotas: Configure elastic quotas at multiple levels (for example, matching your organization structure). Each leaf node can map to multiple namespaces, but each namespace belongs to only one leaf node.
Resource borrowing and reclaiming: Idle min resources can be borrowed by other quota nodes. Borrowed resources are reclaimed automatically when the original owner needs them.
Extended resource support: Beyond CPU and memory, capacity scheduling supports GPU (nvidia.com/gpu) and other Kubernetes-supported resource types.
Node affinity with ResourceFlavor: Attach a ResourceFlavor to a quota node to restrict pods in that quota to specific nodes. See Configure ResourceFlavor for node affinity.
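The namespace rule in the first feature (a leaf can hold several namespaces, but a namespace belongs to one leaf only) can be checked before applying a tree. The sketch below assumes the tree has already been loaded into nested Python dicts with the same `name`, `namespaces`, and `children` keys as the CRD; the function name `namespace_owners` and the namespace names are hypothetical.

```python
from typing import Dict, Optional


def namespace_owners(node: dict, owners: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Map each namespace to its quota node; raise if a namespace appears twice."""
    if owners is None:
        owners = {}
    for ns in node.get("namespaces", []):
        if ns in owners:
            raise ValueError(f"{ns} is mapped to both {owners[ns]} and {node['name']}")
        owners[ns] = node["name"]
    for child in node.get("children", []):
        namespace_owners(child, owners)
    return owners


tree = {
    "name": "root",
    "children": [
        {"name": "root.a", "namespaces": ["team-a", "team-a-dev"]},
        {"name": "root.b", "namespaces": ["team-b"]},
    ],
}
print(namespace_owners(tree))
# {'team-a': 'root.a', 'team-a-dev': 'root.a', 'team-b': 'root.b'}
```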

Configure capacity scheduling

This example uses a cluster with a single ecs.sn2.13xlarge node (56 vCPUs and 224 GiB of memory).

Step 1: Create namespaces

kubectl create ns namespace1
kubectl create ns namespace2
kubectl create ns namespace3
kubectl create ns namespace4

Step 2: Create an ElasticQuotaTree

Create the ElasticQuotaTree in the kube-system namespace. This example uses a two-level hierarchy with four leaf quota nodes.

An ElasticQuotaTree takes effect only if it is created in the kube-system namespace.

apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system
spec:
  root:
    name: root
    max:
      cpu: 40
      memory: 40Gi
      nvidia.com/gpu: 4
    min:
      cpu: 40
      memory: 40Gi
      nvidia.com/gpu: 4
    children:
      - name: root.a
        max:
          cpu: 40
          memory: 40Gi
          nvidia.com/gpu: 4
        min:
          cpu: 20
          memory: 20Gi
          nvidia.com/gpu: 2
        children:
          - name: root.a.1
            namespaces:
              - namespace1
            max:
              cpu: 20
              memory: 20Gi
              nvidia.com/gpu: 2
            min:
              cpu: 10
              memory: 10Gi
              nvidia.com/gpu: 1
          - name: root.a.2
            namespaces:
              - namespace2
            max:
              cpu: 20
              memory: 40Gi
              nvidia.com/gpu: 2
            min:
              cpu: 10
              memory: 10Gi
              nvidia.com/gpu: 1
      - name: root.b
        max:
          cpu: 40
          memory: 40Gi
          nvidia.com/gpu: 4
        min:
          cpu: 20
          memory: 20Gi
          nvidia.com/gpu: 2
        children:
          - name: root.b.1
            namespaces:
              - namespace3
            max:
              cpu: 20
              memory: 20Gi
              nvidia.com/gpu: 2
            min:
              cpu: 10
              memory: 10Gi
              nvidia.com/gpu: 1
          - name: root.b.2
            namespaces:
              - namespace4
            max:
              cpu: 20
              memory: 20Gi
              nvidia.com/gpu: 2
            min:
              cpu: 10
              memory: 10Gi
              nvidia.com/gpu: 1

Important

The ElasticQuotaTree must satisfy these constraints:

Within each quota node: min ≤ max
For each parent node: sum of children's min values ≤ parent's min value
For the root node: min = max ≤ total cluster resources
Each namespace belongs to exactly one leaf node; a leaf node can contain multiple namespaces
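The first three constraints can be checked offline before the tree is applied. The following is a minimal sketch, not a tool shipped with ACK: `validate_tree`, `validate`, and the helper `q` are hypothetical names, quantities are plain numbers (strings such as `40Gi` would need parsing first), and only CPU is modeled, using the values from the example tree and the 56-vCPU node.

```python
from typing import Dict, List


def validate(node: dict, errors: List[str]) -> None:
    """Recursively check min <= max and sum(children's min) <= parent's min."""
    name = node["name"]
    children = node.get("children", [])
    for res, lo in node["min"].items():
        if lo > node["max"].get(res, 0):
            errors.append(f"{name}: min {res} exceeds max")
        child_sum = sum(c["min"].get(res, 0) for c in children)
        if child_sum > lo:
            errors.append(f"{name}: children's min {res} ({child_sum}) exceeds {lo}")
    for child in children:
        validate(child, errors)


def validate_tree(root: dict, cluster_total: Dict[str, float]) -> List[str]:
    """Return a list of constraint violations; an empty list means the tree passes."""
    errors: List[str] = []
    if root["min"] != root["max"]:
        errors.append("root: min must equal max")
    for res, hi in root["max"].items():
        if hi > cluster_total.get(res, 0):
            errors.append(f"root: max {res} exceeds cluster total")
    validate(root, errors)
    return errors


def q(name, lo, hi, children=()):
    """Build a CPU-only quota node for the example."""
    return {"name": name, "min": {"cpu": lo}, "max": {"cpu": hi}, "children": list(children)}


tree = q("root", 40, 40, [
    q("root.a", 20, 40, [q("root.a.1", 10, 20), q("root.a.2", 10, 20)]),
    q("root.b", 20, 40, [q("root.b.1", 10, 20), q("root.b.2", 10, 20)]),
])
print(validate_tree(tree, {"cpu": 56}))  # []
```

Raising the min of root.a.1 to 15 in this model would make the children of root.a sum to 25, above the parent's min of 20, and the function would report it.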

Step 3: Verify the ElasticQuotaTree

kubectl get ElasticQuotaTree -n kube-system

Expected output:

NAME               AGE
elasticquotatree   68s

Observe resource borrowing and reclaiming

These scenarios show how borrowing and reclaiming work as workloads are deployed across the four namespaces.

Borrow idle resources

Deploy a workload in namespace1. This Deployment requests 5 replicas, each using 5 vCPUs (25 vCPUs total).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx1
  namespace: namespace1
  labels:
    app: nginx1
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx1
  template:
    metadata:
      name: nginx1
      labels:
        app: nginx1
    spec:
      containers:
      - name: nginx1
        image: nginx
        resources:
          limits:
            cpu: 5
          requests:
            cpu: 5

Check pod status in namespace1.

kubectl get pods -n namespace1

Expected output:

NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-52dbg   1/1     Running   0          70s
nginx1-744b889544-6l4s9   1/1     Running   0          70s
nginx1-744b889544-cgzlr   1/1     Running   0          70s
nginx1-744b889544-w2gr7   1/1     Running   0          70s
nginx1-744b889544-zr5xz   0/1     Pending   0          70s

root.a.1 (namespace1) has min=10 CPU and max=20 CPU. The 5 pods request 25 vCPUs in total, which exceeds max=20. The first 4 pods (20 vCPUs) run: 10 vCPUs come from the guaranteed min and 10 are borrowed from idle capacity. The 5th pod stays Pending because admitting it would raise usage to 25 vCPUs, above max.
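The arithmetic behind this result can be reproduced in a few lines. The function name `admitted_replicas` is hypothetical, and the sketch considers only the quota's CPU max, not node capacity or other resources.

```python
def admitted_replicas(replicas: int, cpu_per_pod: int, used_cpu: int, max_cpu: int) -> int:
    """Number of replicas that fit under the quota node's max (CPU only)."""
    headroom = max(0, max_cpu - used_cpu)
    return min(replicas, headroom // cpu_per_pod)


# Values for root.a.1 from the example: min=10, max=20, 5 replicas of 5 vCPUs each.
running = admitted_replicas(replicas=5, cpu_per_pod=5, used_cpu=0, max_cpu=20)
used = running * 5
print(running)            # 4 pods Running
print(5 - running)        # 1 pod Pending
print(max(0, used - 10))  # 10 vCPUs borrowed above min
```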

Deploy a workload in namespace2. This Deployment requests 5 replicas, each using 5 vCPUs.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx2
  namespace: namespace2
  labels:
    app: nginx2
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx2
  template:
    metadata:
      name: nginx2
      labels:
        app: nginx2
    spec:
      containers:
      - name: nginx2
        image: nginx
        resources:
          limits:
            cpu: 5
          requests:
            cpu: 5

Check pod status in both namespaces.

kubectl get pods -n namespace1

Expected output:

NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-52dbg   1/1     Running   0          111s
nginx1-744b889544-6l4s9   1/1     Running   0          111s
nginx1-744b889544-cgzlr   1/1     Running   0          111s
nginx1-744b889544-w2gr7   1/1     Running   0          111s
nginx1-744b889544-zr5xz   0/1     Pending   0          111s

kubectl get pods -n namespace2

Expected output:

NAME                      READY   STATUS    RESTARTS   AGE
nginx2-556f95449f-4gl8s   1/1     Running   0          111s
nginx2-556f95449f-crwk4   1/1     Running   0          111s
nginx2-556f95449f-gg6q2   0/1     Pending   0          111s
nginx2-556f95449f-pnz5k   1/1     Running   0          111s
nginx2-556f95449f-vjpmq   1/1     Running   0          111s

The same borrowing logic applies to namespace2. root.a.2 has min=10 and max=20, so 4 pods run and 1 stays Pending. Now, namespace1 and namespace2 together consume all 40 vCPUs allocated to root (root.max.cpu=40).

Return borrowed resources

Deploy a workload in namespace3. This Deployment requests 5 replicas, each using 5 vCPUs.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx3
  namespace: namespace3
  labels:
    app: nginx3
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx3
  template:
    metadata:
      name: nginx3
      labels:
        app: nginx3
    spec:
      containers:
      - name: nginx3
        image: nginx
        resources:
          limits:
            cpu: 5
          requests:
            cpu: 5

Check pod status across all three namespaces.

kubectl get pods -n namespace1

Expected output:

NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-52dbg   1/1     Running   0          6m17s
nginx1-744b889544-cgzlr   1/1     Running   0          6m17s
nginx1-744b889544-nknns   0/1     Pending   0          3m45s
nginx1-744b889544-w2gr7   1/1     Running   0          6m17s
nginx1-744b889544-zr5xz   0/1     Pending   0          6m17s

kubectl get pods -n namespace2

Expected output:

NAME                      READY   STATUS    RESTARTS   AGE
nginx2-556f95449f-crwk4   1/1     Running   0          4m22s
nginx2-556f95449f-ft42z   1/1     Running   0          4m22s
nginx2-556f95449f-gg6q2   0/1     Pending   0          4m22s
nginx2-556f95449f-hfr2g   1/1     Running   0          3m29s
nginx2-556f95449f-pvgrl   0/1     Pending   0          3m29s

kubectl get pods -n namespace3

Expected output:

NAME                     READY   STATUS    RESTARTS   AGE
nginx3-578877666-msd7f   1/1     Running   0          4m
nginx3-578877666-nfdwv   0/1     Pending   0          4m10s
nginx3-578877666-psszr   0/1     Pending   0          4m11s
nginx3-578877666-xfsss   1/1     Running   0          4m22s
nginx3-578877666-xpl2p   0/1     Pending   0          4m10s

root.b.1 (namespace3) has a guaranteed min=10 CPU. To honor this guarantee, the scheduler reclaims the 10 vCPUs that root.a had borrowed from root.b. It selects pods under root.a to evict based on factors such as job priority, availability, and creation time. In the output above, one 5-vCPU pod was evicted in each of namespace1 and namespace2, leaving 3 pods running in each. As a result, nginx3 gets its 10-vCPU minimum: 2 pods run, and 3 stay Pending.
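The amount to reclaim follows from the quota values alone. The sketch below is a simplified model, not the scheduler's algorithm: `reclaim_plan` is a hypothetical name, only CPU is tracked, only the guaranteed part of the claimant's demand is considered, and the choice of which pods to evict (priority, availability, creation time) is deliberately left out.

```python
from typing import Dict, Tuple


def reclaim_plan(used: Dict[str, int], guaranteed: Dict[str, int], root_max: int,
                 claimant: str, demand: int) -> Tuple[int, Dict[str, int]]:
    """Return (CPU to reclaim, borrowed CPU per quota node that could be reclaimed)."""
    idle = root_max - sum(used.values())
    # The claimant is only entitled to reclaim up to its own min.
    entitled = min(demand, guaranteed[claimant]) - used.get(claimant, 0)
    to_reclaim = max(0, entitled - idle)
    borrowers = {name: cpu - guaranteed[name]
                 for name, cpu in used.items()
                 if name != claimant and cpu > guaranteed[name]}
    return to_reclaim, borrowers


guaranteed = {"root.a.1": 10, "root.a.2": 10, "root.b.1": 10, "root.b.2": 10}
used = {"root.a.1": 20, "root.a.2": 20}  # state before nginx3 is deployed

amount, borrowers = reclaim_plan(used, guaranteed, root_max=40,
                                 claimant="root.b.1", demand=25)
print(amount)     # 10
print(borrowers)  # {'root.a.1': 10, 'root.a.2': 10}
```

The model reports that 10 vCPUs must come back and that both leaves under root.a hold borrowed capacity, which matches the observed outcome of one evicted pod per namespace.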

Deploy a workload in namespace4. This Deployment requests 5 replicas, each using 5 vCPUs.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx4
  namespace: namespace4
  labels:
    app: nginx4
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx4
  template:
    metadata:
      name: nginx4
      labels:
        app: nginx4
    spec:
      containers:
      - name: nginx4
        image: nginx
        resources:
          limits:
            cpu: 5
          requests:
            cpu: 5

Check pod status across all four namespaces.

kubectl get pods -n namespace1

Expected output:

NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-cgzlr   1/1     Running   0          8m20s
nginx1-744b889544-cwx8l   0/1     Pending   0          55s
nginx1-744b889544-gjkx2   0/1     Pending   0          55s
nginx1-744b889544-nknns   0/1     Pending   0          5m48s
nginx1-744b889544-zr5xz   1/1     Running   0          8m20s

kubectl get pods -n namespace2

Expected output:

NAME                      READY   STATUS    RESTARTS   AGE
nginx2-556f95449f-cglpv   0/1     Pending   0          3m45s
nginx2-556f95449f-crwk4   1/1     Running   0          9m31s
nginx2-556f95449f-gg6q2   1/1     Running   0          9m31s
nginx2-556f95449f-pvgrl   0/1     Pending   0          8m38s
nginx2-556f95449f-zv8wn   0/1     Pending   0          3m45s

kubectl get pods -n namespace3

Expected output:

NAME                     READY   STATUS    RESTARTS   AGE
nginx3-578877666-msd7f   1/1     Running   0          8m46s
nginx3-578877666-nfdwv   0/1     Pending   0          8m56s
nginx3-578877666-psszr   0/1     Pending   0          8m57s
nginx3-578877666-xfsss   1/1     Running   0          9m8s
nginx3-578877666-xpl2p   0/1     Pending   0          8m56s

kubectl get pods -n namespace4

Expected output:

NAME                      READY   STATUS    RESTARTS   AGE
nginx4-754b767f45-g9954   1/1     Running   0          4m32s
nginx4-754b767f45-j4v7v   0/1     Pending   0          4m32s
nginx4-754b767f45-jk2t7   0/1     Pending   0          4m32s
nginx4-754b767f45-nhzpf   0/1     Pending   0          4m32s
nginx4-754b767f45-tv5jj   1/1     Running   0          4m32s

The same reclaim logic applies to root.b.2 (namespace4): the scheduler reclaims the remaining 10 vCPUs borrowed by root.a, and nginx4 gets its 10-vCPU minimum, so 2 pods run and 3 stay Pending. All four leaf quota nodes now run on exactly their guaranteed min resources, and no idle quota is left under root (root.max.cpu=40).

Configure ResourceFlavor for node affinity

ResourceFlavor is a Kueue CRD that binds a quota node to specific nodes by matching node labels.

Prerequisites

Ensure that you have:

The ResourceFlavor CRD applied to the cluster (it is not installed by default)
A kube-scheduler component version later than 6.9.0 (see the kube-scheduler release notes, and upgrade the component if necessary)

Only the nodeLabels field takes effect in ResourceFlavor.

Create a ResourceFlavor

This example creates a ResourceFlavor named spot that targets nodes labeled instance-type: spot.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "spot"
spec:
  nodeLabels:
    instance-type: spot

Associate a ResourceFlavor with an elastic quota

To bind a ResourceFlavor to a quota node, declare it in the ElasticQuotaTree using the attributes.resourceflavors field.

apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system
spec:
  root:
    name: root
    max:
      cpu: 999900
      memory: 400000Gi
      nvidia.com/gpu: 100000
    min:
      cpu: 999900
      memory: 400000Gi
      nvidia.com/gpu: 100000
    children:
    - name: child
      namespaces:
      - default
      attributes:
        resourceflavors: spot
      max:
        cpu: 99
        memory: 40Gi
        nvidia.com/gpu: 10
      min:
        cpu: 99
        memory: 40Gi
        nvidia.com/gpu: 10

With this configuration, pods in the child quota node (namespace default) are scheduled only to nodes with the instance-type: spot label.
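The effect of nodeLabels can be pictured as a label filter over the cluster's nodes. The sketch below assumes that a node is eligible only when it carries every label listed in the ResourceFlavor with the same value; the function name `matches_flavor` and the node names and labels are hypothetical, and the real filtering is done by kube-scheduler.

```python
from typing import Dict


def matches_flavor(node_labels: Dict[str, str], flavor_node_labels: Dict[str, str]) -> bool:
    """True if the node has every label required by the flavor, with equal values."""
    return all(node_labels.get(key) == value
               for key, value in flavor_node_labels.items())


# Hypothetical nodes and their labels.
nodes = {
    "node-1": {"instance-type": "spot", "zone": "a"},
    "node-2": {"instance-type": "on-demand", "zone": "a"},
    "node-3": {"zone": "b"},
}
flavor = {"instance-type": "spot"}  # spec.nodeLabels of the "spot" ResourceFlavor

eligible = [name for name, labels in nodes.items() if matches_flavor(labels, flavor)]
print(eligible)  # ['node-1']
```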

Next steps

See kube-scheduler release notes.
kube-scheduler also supports gang scheduling, which schedules all pods in a group together: if any pod in the group cannot be scheduled, none are. Gang scheduling is suited to big data workloads such as Spark and Hadoop. See Work with gang scheduling.