Use load-aware scheduling

By default, the ACK scheduler filters nodes based only on their resource requests. We recommend that you enable the load-aware scheduling feature in ACK Pro clusters. This feature analyzes the actual load of each node and preferentially schedules Pods to nodes with lower loads. This improves load balancing across the cluster and reduces the risk of node failures.

Prerequisites

  • The ack-koordinator component of version 1.1.1-ack.1 or later is installed. For more information, see ack-koordinator (ack-slo-manager).

  • The ACK kube-scheduler in your cluster must use a version that supports load-aware scheduling. See the table below for details.

    | ACK version | Supported ACK scheduler version |
    | --- | --- |
    | 1.26 or later | All versions are supported. |
    | 1.24 | v1.24.6-ack-4.0 or later |
    | 1.22 | v1.22.15-ack-4.0 or later |
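The version requirements above can be checked programmatically. The following is a minimal sketch, not an official tool: the function names are hypothetical, and the parser assumes that scheduler versions for ACK 1.24 and 1.22 follow the `v<kubernetes version>-ack-<major>.<minor>` pattern seen in the table.

```python
import re

# Minimum scheduler versions taken from the prerequisites table,
# keyed by (Kubernetes major, Kubernetes minor).
MIN_SCHEDULER = {
    (1, 24): "v1.24.6-ack-4.0",
    (1, 22): "v1.22.15-ack-4.0",
}


def kubernetes_minor(version):
    """Extract (major, minor) from a string such as 'v1.24.6-ack-4.0'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def parse_ack_version(version):
    """Turn 'v1.24.6-ack-4.0' into (1, 24, 6, 4, 0) for tuple comparison.

    Assumption: the '-ack-<major>.<minor>' suffix format is inferred from
    the version strings in the table, not from an official specification.
    """
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)-ack-(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def supports_load_aware(scheduler_version):
    """Return True if the scheduler version meets the documented minimum."""
    minor = kubernetes_minor(scheduler_version)
    if minor >= (1, 26):
        return True  # all scheduler versions are supported on 1.26 or later
    minimum = MIN_SCHEDULER.get(minor)
    if minimum is None:
        return False  # this ACK version is not listed in the table
    return parse_ack_version(scheduler_version) >= parse_ack_version(minimum)
```

For example, `supports_load_aware("v1.24.6-ack-3.1")` returns `False` because the version is below the v1.24.6-ack-4.0 minimum.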

Billing

The ack-koordinator component is free to install and use. However, additional fees may be incurred in the following scenarios:

  • ack-koordinator is a self-managed component and consumes worker node resources after installation. You can configure the resource requests for each module when you install the component.

  • By default, ack-koordinator exposes monitoring metrics for features such as resource profiling and fine-grained scheduling in Prometheus format. If you select the Enable Prometheus Monitoring for ACK-Koordinator option when you configure the component and use the Alibaba Cloud Prometheus service, these metrics are considered custom metrics and incur fees. The fees depend on factors such as your cluster size and the number of applications. Before you enable this feature, carefully read the Billing of Prometheus instances documentation for Alibaba Cloud Prometheus to understand the free quota and billing policies for custom metrics. You can monitor and manage your resource usage by querying usage data.

Limitations

This feature is available only in ACK Pro clusters. For more information, see Create an ACK Pro cluster.

How load-aware scheduling works

Load-aware scheduling is a plugin for the ACK kube-scheduler and is implemented based on the Kubernetes Scheduling Framework. Unlike the native Kubernetes scheduler, which primarily makes scheduling decisions based on resource allocation, the ACK scheduler understands the actual resource load on each node. By analyzing historical load statistics and predicting the needs of new Pods, the scheduler places Pods on nodes with lower loads. This achieves better load balancing and prevents application or node failures caused by overloaded nodes.

As shown in the following figure, Requested represents the amount of resources that are requested, and Usage represents the amount of resources that are actually used. Only the resources that are actually used are counted as the actual load. Given two identical nodes, the ACK scheduler assigns a newly created Pod to Node B, which has a lower load.

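[Figure: Requested versus Usage on two identical nodes; the new Pod is scheduled to Node B, which has the lower usage]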

To account for dynamic changes in node utilization over time due to the cluster environment and workload traffic, the ack-koordinator component provides a descheduling feature to prevent extreme load imbalance from re-emerging in the cluster after Pods are scheduled. You can achieve optimal cluster load balancing by combining load-aware scheduling with hotspot descheduling. For more information about hotspot descheduling, see Use hotspot descheduling.

How it works

The load-aware scheduling feature is implemented by the kube-scheduler and ack-koordinator components working together. The ack-koordinator component is responsible for collecting and reporting node resource utilization. The ACK scheduler uses the utilization data to score and rank nodes, and prioritizes nodes with lower loads for scheduling. For more information about the component architecture, see ack-koordinator component architecture.

Scheduling policies

The scheduler applies the following policies.

Node filtering

When node filtering is enabled, the scheduler filters nodes based on their actual load. If a node's actual load exceeds the configured load threshold, the scheduler will not schedule Pods to that node. This feature is disabled by default. You can enable it by modifying the loadAwareThreshold parameter in the component configuration. For more information, see Kube Scheduler Parameter Configuration.

Important

If node auto scaling is enabled for a cluster, configuring load-aware threshold filtering may cause unexpected node scaling. This is because auto scaling scales out nodes based on pending Pods, whereas it scales in nodes based on the cluster's allocation rate. If you need to enable both node auto scaling and load-aware node filtering, you must adjust the configuration based on your cluster's capacity and utilization. For more information, see Enable node auto scaling.
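The filtering rule can be illustrated with a short sketch. This is not the scheduler's code; the function name, the node names, and the utilization numbers are hypothetical, and utilization is assumed to be expressed as a percentage from 0 to 100, like the threshold.

```python
def passes_threshold(utilization, thresholds):
    """Return True if no resource exceeds its configured threshold.

    utilization: percent used per resource, e.g. {"cpu": 53, "memory": 28}
    thresholds:  percent limit per resource, e.g. {"cpu": 80}; an empty
                 mapping disables filtering, mirroring the default.
    """
    return all(
        utilization.get(resource, 0) <= limit
        for resource, limit in thresholds.items()
    )


# Hypothetical nodes: node-a is above the 80% CPU threshold, node-b is not.
nodes = {
    "node-a": {"cpu": 85, "memory": 40},
    "node-b": {"cpu": 53, "memory": 28},
}
schedulable = [
    name for name, used in nodes.items()
    if passes_threshold(used, {"cpu": 80})
]
```

With a CPU threshold of 80, only `node-b` remains a candidate. With an empty threshold mapping, every node passes.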

Node sorting

Load-aware scheduling considers both CPU and memory dimensions. The scheduler uses a weighted formula to score nodes and prioritizes those with higher scores. When you enable the feature by selecting Specifies whether to enable load-aware node scoring during pod scheduling, you can further customize the weights for CPU and memory. For more information, see the loadAwareResourceWeight parameter in Kube Scheduler Parameters.

Formula: ((1 - CPU utilization) * CPU weight + (1 - memory utilization) * memory weight) / (CPU weight + memory weight). CPU and memory utilization are percentages, so a utilization of 53% enters the formula as 0.53.
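As a worked example, the formula can be written as a small function. This is a sketch of the arithmetic only; the function name is hypothetical and the input values are illustrative, not measurements from a real cluster.

```python
def load_aware_score(cpu_util, mem_util, cpu_weight=1, mem_weight=1):
    """Weighted score from the formula; utilizations are fractions in [0, 1]."""
    weighted = (1 - cpu_util) * cpu_weight + (1 - mem_util) * mem_weight
    return weighted / (cpu_weight + mem_weight)


# A busy node (53% CPU, 28% memory) and a nearly idle node (2% CPU, 9% memory).
busy = load_aware_score(0.53, 0.28)
idle = load_aware_score(0.02, 0.09)

# Raising the CPU weight makes CPU utilization count more in the score.
busy_cpu_heavy = load_aware_score(0.53, 0.28, cpu_weight=3, mem_weight=1)
```

With equal weights, the busy node scores about 0.595 and the idle node about 0.945, so the idle node ranks higher. With a CPU weight of 3, the busy node's score drops to about 0.5325 because its CPU utilization is its weaker dimension.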

Resource utilization calculation algorithm

The calculation of resource utilization supports multiple configurations, such as average and percentile values. The default is the average value over the last 5 minutes. For more information, see Kube Scheduler Parameter Configuration. Additionally, memory usage data excludes the Page Cache because it can be reclaimed by the operating system. Note that the utilization value returned by the kubectl top node command includes the Page Cache. To view the actual memory usage information, see Connect to and Configure Alibaba Cloud Prometheus Monitoring.
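The difference between the average and percentile aggregation types can be seen in a short sketch. The source does not describe how ack-koordinator computes percentiles, so the nearest-rank method used here is an assumption, and the sample values are made up.

```python
import math
import statistics


def aggregate(samples, kind="avg"):
    """Aggregate utilization samples as 'avg' or a percentile such as 'p90'."""
    if not samples:
        raise ValueError("no samples to aggregate")
    if kind == "avg":
        return statistics.fmean(samples)
    if kind in {"p50", "p90", "p95", "p99"}:
        percent = int(kind[1:])
        ordered = sorted(samples)
        # Nearest-rank percentile (an assumption): the smallest sample with
        # at least `percent` percent of all samples at or below it.
        rank = math.ceil(percent * len(ordered) / 100)
        return ordered[rank - 1]
    raise ValueError(f"unsupported aggregation type: {kind!r}")


# Hypothetical CPU utilization samples (percent) with one short spike.
cpu_samples = [20, 22, 21, 23, 20, 95, 21, 22, 20, 24]

average = aggregate(cpu_samples, "avg")
median = aggregate(cpu_samples, "p50")
p90 = aggregate(cpu_samples, "p90")
p99 = aggregate(cpu_samples, "p99")
```

In this sample, the spike pulls the average up to 28.8, the median stays at 21, and p99 reports the spike itself (95). A higher percentile therefore makes the scheduler more conservative about nodes with bursty load.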

Step 1: Enable load-aware scheduling

Important

Ensure the ack-koordinator component is version 1.1.1-ack.1 or later. Otherwise, load-aware scheduling will not take effect.

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Add-ons.

  3. On the Add-ons page, locate Kube Scheduler, and then on the Kube Scheduler card, click Configuration.

  4. In the dialog box, configure the parameters based on the following table and then click OK.

    The following table describes the main parameters for load-aware scheduling. For a detailed description of all parameters and their component version dependencies, see kube-scheduler and Custom scheduler parameters.

    | Parameter | Type | Description | Value | Example |
    | --- | --- | --- | --- | --- |
    | loadAwareThreshold | A list of resource name (resourceName) and threshold (threshold) pairs. | The load threshold for each resource type, used by the node filtering policy. | resourceName: cpu or memory. threshold: an integer from 0 to 100. The default value is empty, which means the filtering feature is disabled. | resourceName: cpu, threshold: 80 |
    | loadAwareResourceWeight | A list of resource name (resourceName) and weight (resourceWeight) pairs. | The scoring weight of each resource type, used by the node sorting policy. This setting takes effect only if you select Enable load-aware scoring for Pod scheduling. | resourceName: cpu or memory. resourceWeight: an integer from 1 to 100. Default: cpu=1, memory=1. | resourceName: cpu, resourceWeight: 1 |
    | loadAwareAggregatedUsageAggregationType | enum | The aggregation type for load statistics. avg: the average value. p50: the 50th percentile (median). p90, p95, p99: the 90th, 95th, and 99th percentiles, respectively. | avg, p50, p90, p95, or p99. The default value is avg. | p90 |

    In the navigation pane on the left, click Cluster Information. On the Basic Information tab, wait for the cluster status to become Running, which indicates that the feature is enabled.
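Before you enter values in the console, you can sanity-check them against the ranges in the table. The sketch below is illustrative: the dictionary layout and the function names are assumptions made for this example, since the console accepts these parameters as form fields rather than as a single document.

```python
ALLOWED_RESOURCES = {"cpu", "memory"}
ALLOWED_AGGREGATIONS = {"avg", "p50", "p90", "p95", "p99"}


def _check_entries(entries, value_key, low, high, problems):
    """Validate a list of {resourceName, <value_key>} entries in place."""
    for entry in entries:
        name = entry.get("resourceName")
        value = entry.get(value_key)
        if name not in ALLOWED_RESOURCES:
            problems.append(f"unsupported resourceName: {name!r}")
        if not isinstance(value, int) or isinstance(value, bool) or not low <= value <= high:
            problems.append(f"{value_key} for {name!r} must be an integer from {low} to {high}")


def validate_load_aware_config(config):
    """Return a list of problems; an empty list means the values are in range."""
    problems = []
    _check_entries(config.get("loadAwareThreshold", []), "threshold", 0, 100, problems)
    _check_entries(config.get("loadAwareResourceWeight", []), "resourceWeight", 1, 100, problems)
    aggregation = config.get("loadAwareAggregatedUsageAggregationType", "avg")
    if aggregation not in ALLOWED_AGGREGATIONS:
        problems.append(f"unsupported aggregation type: {aggregation!r}")
    return problems


# The example values from the table, in a hypothetical dictionary layout.
example = {
    "loadAwareThreshold": [{"resourceName": "cpu", "threshold": 80}],
    "loadAwareResourceWeight": [
        {"resourceName": "cpu", "resourceWeight": 1},
        {"resourceName": "memory", "resourceWeight": 1},
    ],
    "loadAwareAggregatedUsageAggregationType": "p90",
}
problems = validate_load_aware_config(example)
```

The example values pass with no problems reported. A threshold of 120 or a weight of 0 would each produce one entry in the returned list.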

Step 2: Verify load-aware scheduling

The following example uses a cluster with three 4-core, 16 GB nodes.

  1. Create a file named stress-demo.yaml with the following YAML content.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: stress-demo
      namespace: default
      labels:
        app: stress-demo
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: stress-demo
      template:
        metadata:
          name: stress-demo
          labels:
            app: stress-demo
        spec:
          containers:
            - args:
                - '--vm'
                - '2'
                - '--vm-bytes'
                - '1600M'
                - '-c'
                - '2'
                - '--vm-hang'
                - '2'
              command:
                - stress
              image: polinux/stress
              imagePullPolicy: Always
              name: stress
              resources:
                limits:
                  cpu: '2'
                  memory: 4Gi
                requests:
                  cpu: '2'
                  memory: 4Gi
          restartPolicy: Always
  2. Run the following command to create a Pod. This will increase the load on one of the nodes.

    kubectl create -f stress-demo.yaml
    # Expected output
    deployment.apps/stress-demo created
  3. Run the following command to check the status of the Pod until it is running.

    kubectl get pod -o wide

    Expected output:

    NAME                           READY   STATUS    RESTARTS   AGE   IP             NODE                      NOMINATED NODE   READINESS GATES
    stress-demo-7fdd89cc6b-g****   1/1     Running   0          82s   10.XX.XX.112   cn-beijing.10.XX.XX.112   <none>           <none>

    The output shows that the Pod stress-demo-7fdd89cc6b-g**** is scheduled to the node cn-beijing.10.XX.XX.112.

    Wait about 3 minutes for the Pod to initialize and for the node's load to increase.

  4. Run the following command to check the load on each node.

    kubectl top node

    Expected output:

    NAME                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    cn-beijing.10.XX.XX.110   92m          2%     1158Mi          9%
    cn-beijing.10.XX.XX.111   77m          1%     1162Mi          9%
    cn-beijing.10.XX.XX.112   2105m        53%    3594Mi          28%

    The output shows that the node cn-beijing.10.XX.XX.111 has the lowest CPU load, while the node cn-beijing.10.XX.XX.112 has the highest CPU and memory load. The load is unevenly distributed across the cluster.

  5. Create a file named nginx-with-loadaware.yaml with the following YAML content.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-with-loadaware
      namespace: default
      labels:
        app: nginx
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          name: nginx
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: nginx
            resources:
              limits:
                cpu: 500m
              requests:
                cpu: 500m
  6. Run the following command to create the Pods.

    kubectl create -f nginx-with-loadaware.yaml
    # Expected output
    deployment.apps/nginx-with-loadaware created
  7. Run the following command to view the Pod scheduling details.

    kubectl get pods -l app=nginx -o wide

    Expected output:

    NAME                                    READY   STATUS    RESTARTS   AGE   IP             NODE                       NOMINATED NODE   READINESS GATES
    nginx-with-loadaware-5646666d56-2****   1/1     Running   0          18s   10.XX.XX.118   cn-beijing.10.XX.XX.110    <none>           <none>
    nginx-with-loadaware-5646666d56-7****   1/1     Running   0          18s   10.XX.XX.115   cn-beijing.10.XX.XX.110    <none>           <none>
    nginx-with-loadaware-5646666d56-k****   1/1     Running   0          18s   10.XX.XX.119   cn-beijing.10.XX.XX.110    <none>           <none>
    nginx-with-loadaware-5646666d56-q****   1/1     Running   0          18s   10.XX.XX.113   cn-beijing.10.XX.XX.111    <none>           <none>
    nginx-with-loadaware-5646666d56-s****   1/1     Running   0          18s   10.XX.XX.120   cn-beijing.10.XX.XX.111    <none>           <none>
    nginx-with-loadaware-5646666d56-z****   1/1     Running   0          18s   10.XX.XX.116   cn-beijing.10.XX.XX.111    <none>           <none>

    The output shows that with load-aware scheduling enabled, the scheduler avoids the high-load node cn-beijing.10.XX.XX.112 and schedules the Pods to other, less-loaded nodes.
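To check the result without reading the table by eye, you can count Pods per node from the command output. The following sketch parses the plain-text output shown in step 7; the function name is hypothetical, and the whitespace-based parsing assumes that no column value contains spaces, which holds for this output but not for every `kubectl` output.

```python
from collections import Counter


def pods_per_node(kubectl_output):
    """Count Pods per node from `kubectl get pods -o wide` text output."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    header = lines[0].split()
    # The first "NODE" token is the NODE column; "NOMINATED NODE" comes later.
    node_index = header.index("NODE")
    return Counter(line.split()[node_index] for line in lines[1:])


output = """
NAME                                    READY   STATUS    RESTARTS   AGE   IP             NODE                       NOMINATED NODE   READINESS GATES
nginx-with-loadaware-5646666d56-2****   1/1     Running   0          18s   10.XX.XX.118   cn-beijing.10.XX.XX.110    <none>           <none>
nginx-with-loadaware-5646666d56-7****   1/1     Running   0          18s   10.XX.XX.115   cn-beijing.10.XX.XX.110    <none>           <none>
nginx-with-loadaware-5646666d56-k****   1/1     Running   0          18s   10.XX.XX.119   cn-beijing.10.XX.XX.110    <none>           <none>
nginx-with-loadaware-5646666d56-q****   1/1     Running   0          18s   10.XX.XX.113   cn-beijing.10.XX.XX.111    <none>           <none>
nginx-with-loadaware-5646666d56-s****   1/1     Running   0          18s   10.XX.XX.120   cn-beijing.10.XX.XX.111    <none>           <none>
nginx-with-loadaware-5646666d56-z****   1/1     Running   0          18s   10.XX.XX.116   cn-beijing.10.XX.XX.111    <none>           <none>
"""

counts = pods_per_node(output)
```

For the expected output above, `counts` reports three Pods on each of the nodes ending in .110 and .111 and none on the high-load node ending in .112.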

Related operations

Modify the load-aware scheduling configuration

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Add-ons.

  3. On the Add-ons page, find Kube Scheduler, and then in the Kube Scheduler card, click Configuration.

  4. In the Kube Scheduler Parameters dialog box, modify the configuration parameters for load-aware scheduling and click OK.

    In the navigation pane on the left, click Cluster Information. On the Basic Information tab, wait for the cluster status to become Running. This indicates that the update is complete.

Disable load-aware scheduling

In the Kube Scheduler Parameters dialog box, deselect Specifies whether to enable load-aware node scoring during pod scheduling, delete the loadAwareResourceWeight and loadAwareThreshold parameters, and then click OK.

In the navigation pane on the left, click Cluster Information. On the Basic Information tab, wait for the cluster status to become Running. This indicates that the update is complete.

FAQ

Why not always the lowest-load node?

If the scheduler placed a batch of new Pods onto the single node with the lowest load, that node could quickly become overloaded and create a new hotspot.

To prevent this, the scheduler adjusts a node's score as soon as a new Pod is scheduled to it, which compensates for the delay before the node's reported utilization reflects the new Pod.
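The effect of this adjustment can be shown with a small simulation. This is a sketch under stated assumptions, not the scheduler's algorithm: it considers CPU only, uses each Pod's CPU request as the estimate of its future usage, and reuses the node usage figures from Step 2. The function name is hypothetical.

```python
def schedule_batch(node_usage, capacity, pod_cpu, count, compensate=True):
    """Place `count` Pods one at a time on the node with the lowest CPU load.

    node_usage: current CPU usage per node, in cores
    capacity:   CPU cores per node (all nodes are assumed identical)
    pod_cpu:    estimated CPU usage of each new Pod, in cores (assumption:
                the Pod's CPU request is used as the estimate)
    compensate: if True, charge each placed Pod to its node immediately
                instead of waiting for the next utilization report
    """
    estimated = dict(node_usage)
    placements = []
    for _ in range(count):
        target = min(estimated, key=lambda name: estimated[name] / capacity)
        placements.append(target)
        if compensate:
            estimated[target] += pod_cpu
    return placements


usage = {
    "cn-beijing.10.XX.XX.110": 0.092,
    "cn-beijing.10.XX.XX.111": 0.077,
    "cn-beijing.10.XX.XX.112": 2.105,
}

with_adjustment = schedule_batch(usage, capacity=4, pod_cpu=0.5, count=6)
without_adjustment = schedule_batch(usage, capacity=4, pod_cpu=0.5, count=6, compensate=False)
```

With the adjustment, the six Pods alternate between the two lightly loaded nodes, three on each, which matches the distribution in Step 2. Without it, all six Pods land on the node ending in .111, because its reported usage does not change while the batch is being scheduled.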

What else affects scheduling?

The Kubernetes scheduler consists of multiple plugins. During scheduling, many plugins, such as the affinity and topology spread plugins, contribute scores for node sorting. The final node order is determined by the combined scores of all these plugins, so the lowest-load node does not always rank first. You can adjust the scoring weight of each plugin as needed.
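In the Kubernetes Scheduling Framework, each scoring plugin produces a normalized score per node, and the scheduler sums these scores after multiplying each by the plugin's weight. The sketch below illustrates that combination; the plugin labels, scores, and node names are hypothetical.

```python
def combined_score(plugin_scores, plugin_weights):
    """Sum each plugin's score multiplied by that plugin's weight (default 1)."""
    return sum(
        score * plugin_weights.get(name, 1)
        for name, score in plugin_scores.items()
    )


# Hypothetical per-plugin scores on a 0-100 scale for two candidate nodes.
node_a = {"load_aware": 90, "topology_spread": 20}   # low load, poor spread
node_b = {"load_aware": 60, "topology_spread": 100}  # higher load, good spread

equal = {"load_aware": 1, "topology_spread": 1}
load_heavy = {"load_aware": 3, "topology_spread": 1}

a_equal, b_equal = combined_score(node_a, equal), combined_score(node_b, equal)
a_heavy, b_heavy = combined_score(node_a, load_heavy), combined_score(node_b, load_heavy)
```

With equal weights, node B wins (160 versus 110) even though node A has the lower load. Tripling the load-aware weight reverses the order (290 versus 280).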

Is the old protocol supported after an upgrade?

Under the old protocol, load-aware scheduling is enabled per Pod. To use it, you must add the annotation alibabacloud.com/loadAwareScheduleEnabled: "true" to your Pods.
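For workloads managed by a Deployment, the annotation belongs on the Pod template, not on the Deployment itself. The following sketch adds it to a manifest that has been loaded into a Python dictionary; the helper name and the trimmed example manifest are hypothetical.

```python
import copy

ANNOTATION = "alibabacloud.com/loadAwareScheduleEnabled"


def enable_load_aware(deployment):
    """Return a copy of a Deployment manifest whose Pod template carries the annotation."""
    patched = copy.deepcopy(deployment)
    metadata = patched["spec"]["template"].setdefault("metadata", {})
    # Annotation values must be strings, so "true" is quoted, not a boolean.
    metadata.setdefault("annotations", {})[ANNOTATION] = "true"
    return patched


# A trimmed-down manifest in the shape of nginx-with-loadaware.yaml.
manifest = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "nginx-with-loadaware"},
    "spec": {
        "template": {
            "metadata": {"labels": {"app": "nginx"}},
            "spec": {"containers": [{"name": "nginx", "image": "nginx"}]},
        }
    },
}

patched = enable_load_aware(manifest)
```

The helper leaves the original dictionary untouched and returns a patched copy, so the same manifest can be reused with and without the annotation.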

The ACK scheduler is backward compatible, allowing you to upgrade seamlessly. After upgrading, we recommend that you enable the global load-aware scheduling policy for the scheduler by following the steps in Step 1: Enable load-aware scheduling. This eliminates the need to configure each Pod individually.

Important

The ACK scheduler for Kubernetes 1.22 maintains compatibility with the old protocol. For Kubernetes 1.24, support for the old protocol ended on August 30, 2023. We recommend upgrading your cluster and using the new configuration method. For information on upgrading your cluster, see Manually upgrade an ACK cluster.

The following tables describe the protocol support and component version requirements.

1.26 and later

| ACK scheduler version | ack-koordinator version | Pod annotation protocol | Console parameter |
| --- | --- | --- | --- |
| All ACK scheduler versions | ≥1.1.1-ack.1 | No | Yes |

1.24

| ACK scheduler version | ack-koordinator version | Pod annotation protocol | Console parameter |
| --- | --- | --- | --- |
| ≥v1.24.6-ack-4.0 | ≥1.1.1-ack.1 | Yes | Yes |
| ≥v1.24.6-ack-3.1 and <v1.24.6-ack-4.0 | ≥0.8.0 | Yes | No |

1.22 and earlier

| ACK scheduler version | ack-koordinator version | Pod annotation protocol | Console parameter |
| --- | --- | --- | --- |
| ≥1.22.15-ack-4.0 | ≥1.1.1-ack.1 | Yes | Yes |
| ≥1.22.15-ack-2.0 and <1.22.15-ack-4.0 | ≥0.8.0 | Yes | No |
| ≥v1.20.4-ack-4.0 and ≤v1.20.4-ack-8.0; v1.18-ack-4.0 | ≥0.3.0 and <0.8.0 | Yes | No |