Scheduling overview

In a Kubernetes cluster, scheduling is the process by which the scheduler component (kube-scheduler) assigns Pods to the most suitable nodes based on resource planning, which improves application availability and cluster resource utilization. Container Service for Kubernetes (ACK) provides more flexible and comprehensive scheduling policies for different workloads, including job scheduling, topology-aware scheduling, QoS-aware scheduling, and descheduling.

Before you begin

  • This topic describes cluster scheduling solutions for cluster O&M engineers (including cluster resource administrators) and application developers. You can select an appropriate scheduling policy based on your business scenario and role.

    • Cluster O&M engineers: Focus on managing cluster costs, maximizing resource utilization, ensuring high availability, and maintaining load balancing across nodes to prevent single points of failure.

    • Application developers: Focus on easily deploying and managing applications and obtaining the necessary resources (such as CPU, GPU, and memory) based on performance requirements.

  • To make the most of the scheduling policies in ACK, you should first understand fundamental Kubernetes concepts by reviewing the official documentation for the Kubernetes Scheduler, node labels, eviction, and pod topology spread constraints.

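    Additionally, the default scheduling policy of the ACK scheduler is consistent with the upstream Kubernetes scheduler and involves two stages: Filter, which removes nodes that cannot run the Pod, and Score, which ranks the remaining nodes to select the best one.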
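
The two stages can be illustrated with a minimal Python sketch. This is not the kube-scheduler implementation: the Node type, its fields, and the scoring rule (prefer the node with the most headroom left) are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical node model used only for this illustration.
    name: str
    free_cpu: float      # unallocated CPU cores
    free_mem_gib: float  # unallocated memory in GiB


def schedule(pod_cpu, pod_mem_gib, nodes):
    """Return the name of the chosen node, or None if no node fits."""
    # Filter: keep only nodes with enough unallocated resources for the Pod.
    feasible = [
        n for n in nodes
        if n.free_cpu >= pod_cpu and n.free_mem_gib >= pod_mem_gib
    ]
    if not feasible:
        return None

    # Score: a toy rule that prefers the node with the most headroom
    # left after the Pod is placed.
    def score(n):
        return (n.free_cpu - pod_cpu) + (n.free_mem_gib - pod_mem_gib)

    return max(feasible, key=score).name


nodes = [Node("a", 1, 2), Node("b", 4, 8), Node("c", 8, 16)]
print(schedule(2, 4, nodes))
```

In the real scheduler, both stages are made up of multiple plugins, and the final score is a weighted combination of the plugin scores.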

Kubernetes-native scheduling policies

Kubernetes-native scheduling policies can be divided into node scheduling policies and inter-Pod scheduling policies.

  • Node scheduling policies focus on node characteristics and resource availability to ensure Pods are scheduled to nodes that meet their requirements.

  • Inter-Pod scheduling policies focus on controlling the placement and distribution of Pods to optimize their overall layout and ensure high application availability.

nodeSelector

  • Description: Labels nodes with key-value pairs. A nodeSelector in the Pod configuration then schedules the Pod only to nodes with matching labels. For example, you can use nodeSelector to schedule Pods to specific nodes or to a specific node pool.

  • Use case: Basic node selection. It does not support more complex scheduling rules, such as preferences (soft rules).

nodeAffinity

  • Description: A more flexible and fine-grained Pod scheduling strategy than nodeSelector. It supports hard scheduling rules (requiredDuringSchedulingIgnoredDuringExecution) and soft scheduling rules (preferredDuringSchedulingIgnoredDuringExecution).

  • Use case: Places Pods based on node characteristics such as region, instance type, or hardware configuration. Operators such as NotIn and DoesNotExist provide node anti-affinity, which keeps Pods away from nodes with certain labels.

Taints and tolerations

  • Description: A taint consists of a key, a value, and an effect. The effects are NoSchedule, PreferNoSchedule, and NoExecute. After a node is tainted with NoSchedule or NoExecute, only Pods with tolerations that match the node's taint can be scheduled on it. PreferNoSchedule is a soft preference that the scheduler tries to honor.

  • Use cases:

    • Reserving dedicated node resources for specific applications, such as reserving GPU nodes for AI/ML workloads. ACK also allows you to add taints or labels to a node pool, which enables specific applications to be scheduled to that pool. For more information, see Create and manage node pools.

    • Evicting Pods based on taints and tolerations, such as adding a NoExecute taint to an unhealthy node.

podAffinity and podAntiAffinity

  • Description: Uses the labels of Pods that are already running to decide whether a new Pod should be scheduled to the same node or topology domain. Supports hard scheduling rules (requiredDuringSchedulingIgnoredDuringExecution) and soft scheduling rules (preferredDuringSchedulingIgnoredDuringExecution).

  • Use cases:

    • Co-locating related Pods on the same or nearby nodes to reduce network latency and improve communication efficiency.

    • Spreading critical applications across different nodes or fault domains.
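
The matching between taints and tolerations can be sketched in Python as follows. This is a simplified model of the upstream Kubernetes matching rules, not scheduler code, and the taint key "gpu" is a hypothetical example.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # A toleration with an empty effect matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if operator == "Exists":
        # An empty key combined with Exists matches every taint key.
        return not key or key == taint["key"]
    # Equal: both the key and the value must match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_tolerations, node_taints) -> bool:
    """A Pod fits if every NoSchedule or NoExecute taint is tolerated."""
    # PreferNoSchedule is only a soft preference, so it never blocks placement.
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in blocking
    )


gpu_taint = {"key": "gpu", "value": "true", "effect": "NoSchedule"}
gpu_toleration = {"key": "gpu", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
print(can_schedule([], [gpu_taint]))                # Pod without tolerations
print(can_schedule([gpu_toleration], [gpu_taint]))  # Pod that tolerates the taint
```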

Scheduling policies provided by ACK

If native Kubernetes scheduling policies cannot meet your more complex business requirements, such as sequential scale-out and reverse-order scale-in for different instance resource types or load-aware scheduling based on actual node resource usage, you can use the advanced scheduling policies provided by ACK.

Scheduling resource priority

  • Intended role: cluster O&M engineer

  • Description: If your cluster contains different types of instance resources, such as Elastic Compute Service (ECS) and Elastic Container Instance (ECI), with different billing methods such as subscription, pay-as-you-go, and preemptible instances, consider configuring scheduling resource priority. This policy allows you to specify the order in which Pods are scheduled to different types of node resources and enables reverse-order scale-in.

Custom priority scheduling for elastic resources

  • Description: You can customize the ResourcePolicy during application deployment or scaling to set the order in which application Pods are scheduled to different types of node resources, for example, first to subscription ECS, then to pay-as-you-go ECS, and finally to ECI.

    During scale-in, it also supports reverse-order scale-in. For example, it prioritizes deleting ECI Pods, followed by Pods on pay-as-you-go ECS instances, and finally Pods on subscription ECS instances.

  • Typical use cases:

    • Specifying preferred or avoided nodes to balance resource utilization across the cluster.

    • For applications with high performance requirements, prioritizing scheduling Pods to high-performance nodes.

    • For applications without high performance requirements, prioritizing scheduling Pods to preemptible instances or nodes with available computing resources to reduce costs.

  • References: Custom priority scheduling for elastic resources
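
The ordering behavior described above can be sketched in Python. This models only the priority logic, not the ResourcePolicy resource itself. The resource type names in PRIORITY and the capacity numbers are hypothetical.

```python
# Hypothetical resource types, listed from highest to lowest scheduling priority.
PRIORITY = ["ecs-subscription", "ecs-pay-as-you-go", "eci"]


def scale_out(replicas, capacity):
    """Place replicas on resource types in priority order."""
    placements = []
    for kind in PRIORITY:
        # Fill the current type before falling through to the next one.
        take = min(replicas - len(placements), capacity.get(kind, 0))
        placements.extend([kind] * take)
    return placements


def scale_in(placements, count):
    """Remove `count` Pods, starting with the lowest-priority resource type."""
    lowest_first = sorted(placements, key=PRIORITY.index, reverse=True)
    survivors = lowest_first[count:]
    return sorted(survivors, key=PRIORITY.index)


capacity = {"ecs-subscription": 2, "ecs-pay-as-you-go": 1, "eci": 10}
pods = scale_out(5, capacity)
print(pods)
print(scale_in(pods, 3))
```

With these numbers, scale-out fills the two subscription slots and the pay-as-you-go slot before using ECI, and a scale-in of three Pods removes the two ECI Pods and then the pay-as-you-go Pod.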

Job scheduling

  • Intended role: cluster O&M engineer

  • Description: The default scheduler places Pods based on predefined rules but is not optimized for co-scheduling Pods in batch processing tasks. To address this, ACK supports Gang Scheduling and Capacity Scheduling for batch computing jobs.

Gang Scheduling

  • Description: Ensures that a group of related Pods is either scheduled in full or not scheduled at all, which prevents the deadlocks that occur when only some of the required Pods can be scheduled.

  • Typical use cases:

    • Batch jobs: The job contains multiple interdependent task groups that must be processed simultaneously.

    • Distributed computing: For example, machine learning training tasks or other distributed applications that require tightly coordinated execution.

    • High-performance computing: The job may require a full set of resources to be available at the same time before execution can begin.

  • References: Use Gang scheduling

Capacity Scheduling

  • Description: Reserves a certain amount of resource capacity for specific namespaces or user groups and improves overall resource utilization through resource sharing when cluster resources are scarce.

  • Typical use case: Ideal for multi-tenant scenarios where different tenants have varying resource usage cycles and patterns, leading to low overall cluster utilization. This policy allows resources to be borrowed and reclaimed on top of fixed allocations.

  • References: Use Capacity Scheduling
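
The all-or-nothing behavior of Gang Scheduling can be sketched in Python as follows. The placement heuristic (largest Pod first, first node that fits) and the CPU-only resource model are assumptions for illustration and do not describe how the ACK scheduler places Pods.

```python
def gang_schedule(pod_cpus, node_free_cpu):
    """Place every Pod in the group or none of them.

    Returns {pod_index: node_name}, or None if the whole group does not fit.
    """
    # Work on a copy so that a failed attempt reserves nothing.
    free = dict(node_free_cpu)
    assignment = {}
    # Simple heuristic: largest Pods first, first node with enough free CPU.
    order = sorted(range(len(pod_cpus)), key=lambda i: pod_cpus[i], reverse=True)
    for i in order:
        node = next((n for n, cpu in free.items() if cpu >= pod_cpus[i]), None)
        if node is None:
            return None  # one Pod cannot be placed, so no Pod is scheduled
        free[node] -= pod_cpus[i]
        assignment[i] = node
    return assignment


nodes = {"n1": 4, "n2": 2}
print(gang_schedule([2, 2, 2], nodes))     # the whole group fits
print(gang_schedule([2, 2, 2, 2], nodes))  # one Pod too many: nothing is placed
```

Because the heuristic is greedy, it can reject a group that a smarter packing would fit. The point of the sketch is only that a partial placement is never returned.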

Topology-aware scheduling

  • Intended role: cluster O&M engineer

  • Description: In machine learning and big data analytics jobs, Pods often have high network communication demands. By default, the scheduler spreads Pods evenly across the cluster, which can increase job completion time. Native node and Pod affinity cannot retry scheduling across multiple topology domains, and nodes typically carry only zone-level topology labels. Topology-aware scheduling addresses these limitations.

  • Policy description: A gang scheduling identifier is added to the job, which requires that all Pods acquire their necessary resources simultaneously. Combined with topology-aware scheduling, this allows the scheduler to find a topology domain that can satisfy the entire job.

    You can also use the deployment set feature of node pools to schedule Pods to ECS instances within the same low-latency deployment set, which further improves job performance.

  • Typical use case: Ideal for machine learning or big data analytics jobs with significant network communication between Pods. The goal is to allow the job to retry scheduling across multiple topology domains until one with sufficient resources is found, which reduces the job execution time.

Load-aware scheduling

  • Intended roles: cluster O&M engineer, application developer

  • Description: With native Kubernetes scheduling, the scheduler makes decisions primarily based on resource allocation by comparing a Pod's resource requests with the unallocated resources on a node. However, a node's actual utilization changes dynamically with time, the cluster environment, and workload traffic. The native scheduler is unaware of this real-time load.

  • Policy description: By analyzing historical node load statistics and predicting the needs of newly scheduled Pods, the ACK scheduler is aware of the actual resource usage on nodes. It prioritizes scheduling Pods to nodes with lower loads to achieve load balancing, which prevents application or node failures caused by overloaded nodes.

  • Typical use case: Ideal for latency-sensitive applications that have specific requirements for metrics like request pressure or access latency and are sensitive to resource quality.

  • References: Use load-aware scheduling

    Use this feature with load-aware hotspot descheduling to prevent severe load imbalances from reoccurring after Pods are scheduled.
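
The difference between request-based and load-aware placement can be sketched in Python. The node fields, the 0.8 usage threshold, and the use of the Pod's request as its predicted usage are all assumptions for illustration, not ACK defaults.

```python
def pick_node(nodes, pod_request_cpu, usage_threshold=0.8):
    """Pick the node with the lowest predicted utilization, or None."""
    candidates = []
    for n in nodes:
        # Request-based check, as in native scheduling.
        if n["requested"] + pod_request_cpu > n["capacity"]:
            continue
        # Load-aware check: estimate real utilization after placement,
        # using the Pod's request as a stand-in for its future usage.
        predicted = (n["used"] + pod_request_cpu) / n["capacity"]
        if predicted > usage_threshold:
            continue
        candidates.append((predicted, n["name"]))
    if not candidates:
        return None
    return min(candidates)[1]


nodes = [
    # Few requests but heavily used: looks attractive to a request-based scheduler.
    {"name": "a", "capacity": 8, "requested": 2, "used": 6.5},
    # Many requests but mostly idle in practice.
    {"name": "b", "capacity": 8, "requested": 5, "used": 2.0},
]
print(pick_node(nodes, 2))
```

Here node "a" has more unallocated CPU, but its actual usage is already high, so the sketch selects node "b".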

QoS-aware scheduling

  • Intended roles: cluster O&M engineer, application developer

  • Description: Kubernetes assigns each Pod a Quality of Service (QoS) class, which is Guaranteed, Burstable, or BestEffort, based on the resource requests and limits that you configure. When node resources are insufficient, the kubelet can determine the eviction order of Pods based on their QoS class. ACK provides differentiated service level objective (SLO) features to improve the performance and service quality of latency-sensitive applications while ensuring resource availability for lower-priority tasks.

CPU Burst

  • Description: Because the CPU limit mechanism restricts resource usage over a fixed time period, containers can be throttled. The CPU Burst feature allows a container to accumulate CPU time slices during idle periods to meet bursty resource demands. This improves container performance, reduces latency, and enhances the application's service quality.

  • Typical use cases:

    • Containerized applications that consume high CPU resources during startup and loading but have relatively normal CPU usage during regular operation.

    • Web services and applications, such as e-commerce or online games, where CPU demand can spike suddenly and must be handled quickly to manage traffic surges.

  • References: Enable the CPU Burst performance optimization policy

CPU topology-aware scheduling

  • Description: For performance-sensitive applications, this policy pins a Pod to specific CPU cores on a node, which mitigates performance degradation caused by CPU context switching and cross-NUMA memory access.

  • Typical use cases:

    • Applications not yet adapted for cloud-native environments, for example, applications that set the number of threads based on the physical cores of the entire machine instead of container specifications, which leads to performance issues.

    • Applications running on multi-core machines like ECS Bare Metal Instances (Intel, AMD) that experience significant performance degradation from cross-NUMA memory access.

    • Applications that are highly sensitive to CPU context switching and cannot tolerate the resulting performance jitter.

  • References: Enable CPU topology-aware scheduling

GPU topology-aware scheduling

  • Description: When multiple GPU cards are deployed in a cluster and multiple GPU-intensive Pods run simultaneously, they may compete for the node's GPU resources. This can cause Pods to switch frequently between different GPUs or even NUMA nodes, which impacts performance. GPU topology-aware scheduling intelligently assigns workloads to different GPU cards, which reduces cross-NUMA node memory access and improves application performance and responsiveness.

  • Typical use cases:

    • High-performance computing that requires efficient data transfer and processing in large-scale distributed computations.

    • Machine learning and deep learning, which require large amounts of GPU resources for training and need to distribute training tasks reasonably across GPUs.

    • Graphics rendering and game development, which require rendering tasks to be efficiently distributed among different GPUs.

Dynamic resource overcommitment

  • Description: Quantifies allocated but unused resources in the cluster and makes them available to low-priority tasks, which enables resource overcommitment. This must be used with the following single-node QoS policies to prevent performance interference between applications.

    • Elastic resource restriction: Controls the amount of CPU resources that low-priority Pods can use when the node's overall resource usage is within a safe threshold, which ensures container stability.

    • CPU QoS for containers: Prioritizes CPU performance for high-priority applications based on the container's QoS class.

    • Memory QoS for containers: Prioritizes memory performance for high-priority application Pods based on the container's QoS class, which delays the triggering of whole-machine memory reclamation.

    • L3 cache and memory bandwidth isolation for containers: Prioritizes the use of resources like L3 cache and memory bandwidth allocation (MBA) for high-priority applications based on the container's QoS class.

  • Typical use case: Used to improve cluster resource utilization through colocation. Typical colocation scenarios include machine learning training and inference, big data batch jobs and data analysis, and running online services alongside offline backup services.

Dynamically modify Pod resource parameters

  • Description: In Kubernetes 1.27 and earlier versions, modifying a container's parameters while a Pod is running requires updating the PodSpec and resubmitting it, which triggers a Pod restart. ACK allows you to modify single-node isolation parameters like CPU, memory, and disk I/O without restarting the Pod.

  • Typical use case: Suitable only for temporary adjustments to Pod resources (CPU and memory).

  • References: Dynamically modify the resource parameters of a Pod
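
The effect of CPU Burst on throttling can be illustrated with a simplified Python simulation. It is a toy model of per-period CPU quota accounting, not the kernel or ack-koordinator implementation: the period demands, the 50 ms quota, and the 40 ms burst budget are hypothetical, and throttled work is not carried over to the next period.

```python
def simulate(demand_ms, quota_ms, burst_ms=0):
    """Return the CPU time (ms) that is throttled in each period."""
    saved = 0.0       # unused quota carried over from earlier periods
    throttled = []
    for demand in demand_ms:
        available = quota_ms + saved
        used = min(demand, available)
        throttled.append(demand - used)
        # Unused time is kept for later periods, capped at the burst budget.
        saved = min(burst_ms, available - used)
    return throttled


demand = [10, 10, 90, 10]  # CPU time the container wants in each period
print(simulate(demand, quota_ms=50))               # without burst
print(simulate(demand, quota_ms=50, burst_ms=40))  # with a 40 ms burst budget
```

Without a burst budget, the third period exceeds the quota and part of its demand is throttled. With the budget, the time saved during the two idle periods absorbs the spike.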

Descheduling

  • Intended roles: cluster O&M engineer, application developer

  • Description: A cluster's state is constantly changing. For various reasons, you may need to move running Pods to other nodes, a process known as descheduling.

Descheduling

  • Description: In scenarios such as uneven cluster utilization creating hotspot nodes or changes in node properties making existing Pod scheduling rules no longer optimal, you may need to deschedule poorly placed Pods to another node. This ensures Pods run on the best possible nodes, which safeguards cluster high availability and workload efficiency.

  • Typical use cases:

    • Uneven workload distribution in the cluster causes some nodes to become overloaded, such as when different applications are scheduled to the same node in a colocation scenario.

    • Overall cluster resource utilization is low, and you want to decommission some nodes to save costs.

    • The cluster has significant resource fragmentation, where total resources are sufficient but individual nodes lack enough resources.

    • Taints or labels have been added to or removed from a node.

Load-aware hotspot descheduling

  • Description: Combining load-aware scheduling with hotspot descheduling not only monitors node load changes in real time but also automatically rebalances nodes that exceed a safe load threshold by descheduling Pods, which prevents extreme load imbalances.

  • References: Use load-aware hotspot descheduling
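
The idea of a hotspot threshold can be sketched in Python as follows. The data layout, the 0.8 watermark, and the rule of evicting the heaviest Pods first are assumptions for illustration. An actual descheduler also weighs factors such as Pod priority and disruption limits.

```python
def plan_evictions(nodes, high_watermark=0.8):
    """Return {node_name: [pods to evict]} for nodes above the watermark."""
    plan = {}
    for name, node in nodes.items():
        usage = sum(node["pods"].values())
        if usage / node["capacity"] <= high_watermark:
            continue  # the node is not a hotspot
        evict = []
        # Evict the heaviest Pods first until the node is back under the watermark.
        heaviest_first = sorted(node["pods"].items(), key=lambda kv: kv[1], reverse=True)
        for pod, cpu in heaviest_first:
            if usage / node["capacity"] <= high_watermark:
                break
            evict.append(pod)
            usage -= cpu
        plan[name] = evict
    return plan


nodes = {
    # Per-Pod values are measured CPU usage in cores (hypothetical numbers).
    "hot": {"capacity": 10, "pods": {"a": 5, "b": 3, "c": 1.5}},
    "cool": {"capacity": 10, "pods": {"d": 2}},
}
print(plan_evictions(nodes))
```

The evicted Pods are then recreated by their workload controllers and placed again by the scheduler, which is why pairing this with load-aware scheduling keeps them from landing on another hotspot.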

Billing

When you use the scheduling features provided by ACK, you may incur fees for the scheduling components, in addition to cluster management fees and costs for related cloud resources as described in Billing.

  • The default ACK scheduler is provided by the kube-scheduler component. As a control plane component, it is free to install and use.

  • The ack-koordinator component provides the resource scheduling optimization and descheduling capabilities described above. The component itself is free to install and use, but additional fees may apply in certain scenarios. For more information, see ack-koordinator (ack-slo-manager).

FAQ

If you encounter issues when you use scheduling features, see Scheduling FAQ for troubleshooting.

References