Auto Scaling

Auto Scaling adjusts compute, storage, and network resources in response to real-time demand — adding capacity during traffic spikes and releasing it during quiet periods — without manual intervention.

Variable load is a fundamental challenge for cloud workloads. A fixed resource allocation either wastes money during off-peak periods or fails to meet demand during spikes. Auto Scaling resolves this by matching capacity to actual load, keeping performance consistent while controlling cost.

How it works

An effective Auto Scaling strategy has three core characteristics:

  1. Automation: The system executes scaling actions — adding or removing instances, pods, or containers — without manual intervention or changes to application code, based on preset rules and policies.

  2. Real-time response: The system evaluates live metrics against configured thresholds and schedules to determine when and how much to scale, adjusting capacity as workload conditions change.

  3. Flexibility: Scaling can be triggered by a range of metrics — CPU utilization, memory utilization, network bandwidth, queries per second (QPS), response time (RT), or custom business metrics — and refined continuously as expectations evolve.

Scaling methods

Alibaba Cloud's Elastic Scaling Service (ESS) lets you configure scaling modes that match your business scenario, so that capacity follows workload changes automatically and user experience stays consistent. Its stated advantages are automation, cost reduction, high availability, flexibility, intelligence, and ease of auditing.

Several scaling approaches are available. Each addresses a different problem, and combining them usually gives the most robust results.

Fixed instance count

Run a fixed number of instances regardless of load. This is simple to configure, but it wastes resources during off-peak periods and risks under-provisioning during unexpected spikes.

HPA (Horizontal Pod Autoscaler)

Horizontal Pod Autoscaler (HPA) is Kubernetes' standard reactive scaling mechanism. It periodically compares current metric values against a target and adjusts the replica count accordingly. Because HPA reacts to observed load, it starts scaling only after demand has already increased, and this lag can matter for latency-sensitive services.
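
The ratio rule documented for the Kubernetes HPA can be sketched in a few lines. The function name and the example numbers are illustrative. The 10% tolerance mirrors the documented Kubernetes default, which cluster operators can change. This is a simplified sketch that ignores pod readiness, missing metrics, and multiple metrics.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Replica count suggested by the HPA ratio rule (simplified)."""
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left unchanged,
    # which avoids reacting to small metric fluctuations.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Otherwise scale proportionally and round up.
    return math.ceil(current_replicas * ratio)


# Example: 4 replicas averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90.0, 60.0))
```

Note that the calculation only runs after the metric has already moved, which is the reactive lag described above.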

Configure scale-out and scale-in thresholds separately with a meaningful gap to prevent flapping (rapid oscillation where new instances are added and immediately removed).
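
A minimal sketch of the threshold gap, assuming CPU utilization as the metric. The 70% and 40% values and the function name are hypothetical and only illustrate the shape of the rule.

```python
def scaling_decision(cpu_percent: float, scale_out_at: float = 70.0,
                     scale_in_at: float = 40.0) -> str:
    """Return "scale_out", "scale_in", or "hold" for one metric sample."""
    if scale_in_at >= scale_out_at:
        raise ValueError("scale_in_at must be below scale_out_at")
    if cpu_percent > scale_out_at:
        return "scale_out"
    if cpu_percent < scale_in_at:
        return "scale_in"
    # Between the two thresholds nothing changes. This dead band is
    # what keeps a freshly added instance from being removed at once.
    return "hold"


for sample in (85.0, 55.0, 30.0):
    print(sample, scaling_decision(sample))
```

If both thresholds were set to the same value, a metric hovering near it would alternate between the two actions on consecutive samples.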

CronHPA

Extends HPA with schedule-based rules. Use CronHPA when load patterns are predictable — for example, pre-warming capacity before a known peak or scaling down overnight.
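
The effect of schedule-based rules can be sketched as a lookup from time of day to replica count. This is not the CronHPA configuration format, which uses cron expressions. The schedule, the replica counts, and the function name are all hypothetical.

```python
from datetime import time

# Hypothetical schedule: (window start, window end, replicas to hold).
SCHEDULE = [
    (time(8, 30), time(20, 0), 10),      # pre-warm ahead of the daytime peak
    (time(20, 0), time(23, 59, 59), 4),  # evening tail
]
DEFAULT_REPLICAS = 2  # overnight baseline


def scheduled_replicas(now: time) -> int:
    """Replica count the schedule asks for at the given time of day."""
    for start, end, replicas in SCHEDULE:
        if start <= now < end:
            return replicas
    return DEFAULT_REPLICAS


print(scheduled_replicas(time(9, 0)))
print(scheduled_replicas(time(3, 0)))
```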

AHPA (Adaptive Horizontal Pod Autoscaler)

Adaptive Horizontal Pod Autoscaler (AHPA) is Alibaba Cloud's predictive scaling solution. It analyzes historical metric patterns to forecast demand and provisions capacity before load arrives, which avoids much of the startup lag inherent in reactive approaches.

Key characteristics:

  • Fast: Millisecond-level prediction with second-level scaling response

  • Stable: Combines active (predictive) and passive (reactive) signals to suppress flapping

  • Precise: Supports minute-level boundary protection to cap scale events
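
The idea of combining a predictive signal with a reactive one, then clamping the result to configured bounds, can be sketched as follows. This is not AHPA's actual algorithm. The averaging forecast, the per-replica capacity, and all names are assumptions made for illustration.

```python
import math
from statistics import mean
from typing import Sequence


def replicas_for(load: float, per_replica_capacity: float) -> int:
    """Replicas needed to serve a load, with at least one replica."""
    return max(1, math.ceil(load / per_replica_capacity))


def combined_replicas(history: Sequence[float], current_load: float,
                      per_replica_capacity: float,
                      min_replicas: int, max_replicas: int) -> int:
    # Predictive signal: naive forecast from the load seen in the same
    # time slot on earlier days (a stand-in for a real forecasting model).
    predictive = replicas_for(mean(history), per_replica_capacity)
    # Reactive signal: what the load observed right now requires.
    reactive = replicas_for(current_load, per_replica_capacity)
    # Take the larger of the two, then clamp to the configured bounds.
    wanted = max(predictive, reactive)
    return min(max(wanted, min_replicas), max_replicas)


# Forecast of about 1000 QPS while only 400 QPS is observed so far.
print(combined_replicas([900, 1100, 1000], 400, 100, 2, 20))
```

Taking the larger signal means a missed forecast still falls back to reactive behavior, while the bounds limit how far a bad forecast can push the replica count.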

Serverless containers (ECI)

Elastic Container Instance (ECI) provides on-demand container capacity with no cluster management or capacity planning. You pay only for actual usage. ECI is well-suited for batch jobs, CI/CD pipelines, Apache Spark workloads, and burst-heavy online applications.

Combining scaling methods

No single method covers every scenario. A robust approach combines scheduled and reactive scaling:

  • Use CronHPA to pre-warm instances before predictable peaks (for example, an e-commerce flash sale or a nightly batch window).

  • Use HPA or AHPA to handle unpredictable demand fluctuations within each time window.

  • Use ECI as a fast burst buffer when on-cluster capacity is exhausted.

This layered strategy reduces the cold-start delay that reactive-only scaling faces during sudden load spikes, because capacity for predictable peaks is already in place before the load arrives.
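
The three layers can be sketched as one planning function. The names and numbers are hypothetical, and "burst" stands for replicas placed on serverless capacity such as ECI when the cluster itself is full.

```python
from typing import Tuple


def plan_capacity(scheduled_floor: int, reactive_target: int,
                  cluster_capacity: int) -> Tuple[int, int]:
    """Return (on_cluster_replicas, burst_replicas)."""
    # The schedule sets a floor; the reactive scaler can only raise it.
    desired = max(scheduled_floor, reactive_target)
    # Fill on-cluster capacity first.
    on_cluster = min(desired, cluster_capacity)
    # Anything that does not fit overflows to the burst buffer.
    burst = desired - on_cluster
    return on_cluster, burst


print(plan_capacity(10, 6, 20))   # quiet period inside a pre-warmed window
print(plan_capacity(10, 28, 20))  # spike beyond what the cluster can hold
```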

Choosing a scaling trigger

System metrics are easy to instrument, but business metrics often make better scaling signals. Consider two queues, each holding 500 messages, where the oldest message has been waiting 500 ms:

  • If the queue feeds an email notification service, the business stakeholder may accept the current latency and not want the cost of scaling out.

  • If the queue feeds the critical path of a real-time online game with a 100 ms service level agreement (SLA), the 500 ms wait already breaches the SLA and immediate scale-out is required.

Same system metric, opposite business decisions. Where possible, define scaling thresholds against metrics that reflect business outcomes, such as QPS, response time, or queue critical time (how long a message takes from being sent to being processed), rather than infrastructure utilization alone.
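
The queue example can be expressed as a rule that compares the same measurement against a per-service latency budget. The function name is illustrative, and the 60-second budget for the email service is an assumed value. Only the 100 ms SLA comes from the example above.

```python
def should_scale_out(oldest_message_age_ms: float,
                     latency_budget_ms: float) -> bool:
    """Scale out only when the oldest message has exceeded its budget."""
    return oldest_message_age_ms > latency_budget_ms


OLDEST_AGE_MS = 500  # identical measurement on both queues

# Email notifications: assumed to tolerate up to 60 s of delay.
print("email:", should_scale_out(OLDEST_AGE_MS, latency_budget_ms=60_000))
# Real-time game: 100 ms SLA on the critical path.
print("game:", should_scale_out(OLDEST_AGE_MS, latency_budget_ms=100))
```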

Implementation steps

  1. Instrument your system: Identify the metrics that best reflect application health and user experience. Set up monitoring and confirm metrics are available at the frequency your scaling policy requires.

  2. Define rules and thresholds: Set scale-out and scale-in thresholds with a sufficient gap. Configure stabilization windows to suppress flapping.

  3. Configure scaling mechanisms: Implement your rules and policies using your cloud provider's auto scaling features, automation scripts, or a container orchestration tool.

  4. Monitor and tune: Review scaling events regularly. Adjust thresholds and policies based on observed behavior, targeting stable, cost-efficient operation under both typical and peak load.
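
The stabilization window mentioned in step 2 can be sketched as a small filter over recent replica recommendations. It follows the same idea as the Kubernetes HPA scale-down stabilization window, which acts on the highest recent recommendation. Here the window is counted in evaluation cycles rather than seconds, and the class name is hypothetical.

```python
from collections import deque


class ScaleInStabilizer:
    """Delay scale-in until the lower value has held for a full window."""

    def __init__(self, window_size: int) -> None:
        # Only the most recent `window_size` recommendations are kept.
        self._recent = deque(maxlen=window_size)

    def stabilize(self, recommendation: int) -> int:
        self._recent.append(recommendation)
        # Acting on the maximum lets scale-out through immediately but
        # holds capacity until every sample in the window agrees on less.
        return max(self._recent)


stabilizer = ScaleInStabilizer(window_size=3)
print([stabilizer.stabilize(r) for r in [10, 4, 4, 4, 4]])
```

A longer window trades slower cost recovery for fewer scaling events, which is one of the parameters worth revisiting in step 4.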

What's next

  • [Elastic Scaling Service (ESS) overview]()

  • [Configure HPA for Kubernetes workloads]()

  • [Get started with AHPA]()

  • [Elastic Container Instance (ECI) overview]()