Instance scaling limits and elastic policies

After you set the minimum number of instances for a function to one or more, you can configure elastic policies for the minimum instance count. These policies scale the number of instances in or out based on your business needs and scale-out speed limits. Scaling can occur during specific time periods or when a metric reaches a set threshold. This approach ensures performance and improves instance utilization.

Instance scaling behavior

When the minimum number of instances is one or more, the system first assigns requests to these instances. If the current load exceeds their capacity, the system automatically scales out more elastic instances.

As the number of invocation requests increases, Function Compute continuously creates new instances until there are enough to handle the requests or the configured instance limit is reached. The scale-out speed is rate-limited. For more information, see Instance scale-out speed limits by region.

The instance scaling behavior depends on whether the minimum number of instances is 0, or is 1 or more. The following sections describe the behavior in each case as function invocation requests increase.

Minimum number of instances is 0

If the total number of instances or the instance scale-out speed exceeds the limit, Function Compute returns a throttling error with HTTP status code 429. The following figure shows the throttling behavior of Function Compute when the number of invocations grows rapidly.

image
  • In section ① of the figure: Before the number of burst instances is reached, Function Compute immediately creates instances. This process involves cold starts but does not generate throttling errors.

  • In section ② of the figure: After the number of burst instances is reached, the growth in the number of instances is limited by speed. Some requests receive throttling errors.

  • In section ③ of the figure: After the number of instances exceeds the quota limit, some requests receive throttling errors.

Minimum number of instances is 1 or more

A large burst of invocations can cause request failures due to throttling limits on instance creation. Cold starts also increase request latency. To prevent these issues, you can set the minimum number of instances to one or more to pre-allocate resources.

Under the same load as in the scenario where the minimum number of instances is 0, the behavior with a minimum of 1 or more instances is as follows.

image
  • In section ① of the figure: Before the minimum instances are fully utilized, requests are executed immediately. This process has no cold starts and no throttling errors.

  • In section ② of the figure: After the minimum instances are fully utilized and before the number of elastic instances reaches the burst limit, Function Compute immediately creates new instances. This process involves cold starts but does not generate throttling errors.

Instance scale-out speed limits by region

| Region | Burst instances | Instance growth rate |
| --- | --- | --- |
| China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen) | 300 | 300 per minute |
| Other regions | 100 | 100 per minute |

Note

If you require a higher scale-out speed, join the DingTalk user group (Group ID 64970014484) to submit a request.
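As a rough way to reason about these limits, the following sketch models the largest instance count reachable after a given number of minutes of sustained demand. It assumes that growth is strictly linear at the per-minute rate once the burst allowance is used up. The source does not specify how the rate is enforced within a minute, and the function names and the quota value of 1000 are hypothetical.

```python
def instance_ceiling(minutes_elapsed: int, burst: int,
                     growth_per_minute: int, quota: int) -> int:
    """Largest instance count reachable after `minutes_elapsed` minutes.

    Assumed model (not confirmed by the source): the burst allowance is
    available immediately, then capacity grows linearly at
    `growth_per_minute`, and the total never exceeds `quota`.
    """
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must be non-negative")
    return min(quota, burst + growth_per_minute * minutes_elapsed)


def throttled_demand(demand: int, minutes_elapsed: int, burst: int,
                     growth_per_minute: int, quota: int) -> int:
    """Number of requested instances that exceed the ceiling at this point."""
    ceiling = instance_ceiling(minutes_elapsed, burst, growth_per_minute, quota)
    return max(0, demand - ceiling)


if __name__ == "__main__":
    # Limits for China (Hangzhou) from the table; the quota is hypothetical.
    for minute in range(4):
        print(minute, instance_ceiling(minute, 300, 300, 1000))
```

Under this model, demand below the burst value is never throttled, demand between the burst value and the quota is throttled only while the ceiling is still catching up, and demand above the quota is always partly throttled.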

Elastic policies for minimum instances

A fixed minimum number of instances ensures performance but can waste resources during off-peak periods. Function Compute offers dynamic elastic policies to automatically adjust the minimum instance count based on time or metrics, which improves resource utilization.

Important
  • When an elastic policy is active, it overrides the function's initial Minimum Number Of Instances configuration. When no elastic policy is active, the system reverts to the initial Minimum Number Of Instances configuration.

  • If multiple elastic policies are configured, the system calculates the Minimum Number Of Instances for each triggered policy. The highest of these values from all currently active policies becomes the actual Minimum Number Of Instances.

For more information, see How is the current minimum number of instances calculated?.
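The two rules above can be sketched as a small function. This is only an illustration of the stated rules, not the service's implementation; the function and parameter names are hypothetical.

```python
from typing import Iterable


def effective_minimum(default_target: int,
                      active_policy_targets: Iterable[int]) -> int:
    """Return the minimum instance count that currently applies.

    `active_policy_targets` holds the value computed by each policy that is
    active right now. If at least one policy is active, the highest value
    wins and the initial configuration is ignored. Otherwise the initial
    configuration (`default_target`) applies.
    """
    targets = list(active_policy_targets)
    return max(targets) if targets else default_target


if __name__ == "__main__":
    print(effective_minimum(5, []))        # no active policy
    print(effective_minimum(5, [20, 10]))  # two active policies
```

Note that an active policy replaces the initial value even when its target is lower: with an initial value of 50 and one active policy targeting 10, the function returns 10.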

Scheduled scaling

Scenarios

You can use this policy when your function has predictable traffic patterns or periodic peaks. If the invocation concurrency exceeds the capacity of the minimum instances, the system automatically scales out additional elastic instances.

Configuration example

You can configure two scheduled actions. The first action scales out the minimum number of instances before traffic increases. The second action scales in the minimum number of instances after traffic decreases. The following figure illustrates this process.

image

The following example shows how to use the PutProvisionConfig API to configure scheduled scaling parameters. This policy is for a function named function_1. The time zone is set to Asia/Shanghai (UTC+8). The policy is active from 2024-08-01 10:00:00 to 2024-08-30 10:00:00 (UTC+8). Every day at 20:00 (UTC+8), the minimum number of instances scales out to 50. At 22:00 (UTC+8), it scales in to 10.

"scheduledActions": [
    {
      "name": "scale_up_action",
      "startTime": "2024-08-01T10:00:00",
      "endTime": "2024-08-30T10:00:00",
      "target": 50,
      "scheduleExpression": "cron(0 0 20 * * *)",
      "timeZone": "Asia/Shanghai"
    },
    {
      "name": "scale_down_action",
      "startTime": "2024-08-01T10:00:00",
      "endTime": "2024-08-30T10:00:00",
      "target": 10,
      "scheduleExpression": "cron(0 0 22 * * *)",
      "timeZone": "Asia/Shanghai"
    }
  ]

The parameters are described as follows.

| Parameter | Description |
| --- | --- |
| name | The name of the scheduled task. |
| startTime | The time when the configuration becomes effective. If no time zone is set, UTC is used by default. |
| endTime | The time when the configuration stops being effective. If no time zone is set, UTC is used by default. |
| target | The target minimum number of instances. |
| scheduleExpression | The schedule expression, in one of the two formats listed after this table. If no time zone is set, the schedule is evaluated in UTC, which is 8 hours behind UTC+8. |
| timeZone | The time zone in which startTime, endTime, and scheduleExpression are interpreted. |

scheduleExpression supports the following two formats:

  • At expressions - "at(yyyy-mm-ddThh:mm:ss)": Schedules the task to run only once.

    For example, to schedule a task at 20:00 on April 1, 2024 (UTC+8), set the time zone to Asia/Shanghai and configure this parameter as at(2024-04-01T20:00:00).

  • Cron expressions - "cron(0 0 4 * * *)": Schedules the task to run repeatedly, using the six-field cron format described in Cron expression description.

    For example, to schedule a task every day at 20:00 (UTC+8), set the time zone to Asia/Shanghai and configure this parameter as cron(0 0 20 * * *).
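Before submitting a configuration, it can help to check locally that a scheduleExpression has one of the two documented shapes. The following is a minimal sketch: it only checks the outer at(...) or cron(...) form and the number of cron fields, not whether the service would accept every field value. The function name is hypothetical.

```python
import re
from datetime import datetime

_AT_PATTERN = re.compile(r"^at\((\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\)$")
_CRON_PATTERN = re.compile(r"^cron\((.+)\)$")


def classify_schedule_expression(expr: str) -> str:
    """Return "at" or "cron" for a well-formed expression, else raise ValueError."""
    at_match = _AT_PATTERN.match(expr)
    if at_match:
        # strptime raises ValueError for impossible dates such as 2024-02-30.
        datetime.strptime(at_match.group(1), "%Y-%m-%dT%H:%M:%S")
        return "at"
    cron_match = _CRON_PATTERN.match(expr)
    if cron_match and len(cron_match.group(1).split()) == 6:
        # Six fields: Seconds Minutes Hours Day-of-month Month Day-of-week.
        return "cron"
    raise ValueError(f"unsupported schedule expression: {expr!r}")


if __name__ == "__main__":
    print(classify_schedule_expression("at(2024-04-01T20:00:00)"))
    print(classify_schedule_expression("cron(0 0 20 * * *)"))
```

A five-field crontab line such as cron(0 20 * * *) is rejected by this check because the Seconds field is missing.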

Cron expression description

The fields in a cron(Seconds Minutes Hours Day-of-month Month Day-of-week) expression are described as follows.

| Field | Value range | Allowed special characters |
| --- | --- | --- |
| Seconds | 0–59 | None |
| Minutes | 0–59 | , - * / |
| Hours | 0–23 | , - * / |
| Day-of-month | 1–31 | , - * ? / |
| Month | 1–12 or JAN–DEC | , - * / |
| Day-of-week | 1–7 or MON–SUN | , - * ? |

The special characters in a cron expression are described as follows.

| Character | Definition | Example |
| --- | --- | --- |
| * | Indicates any value or every value. | In the Minutes field, * indicates that the task runs every minute. |
| , | Indicates a list of values. | In the Day-of-week field, MON,WED,FRI means Monday, Wednesday, and Friday. |
| - | Indicates a range. | In the Hours field, 10-12 indicates the hours from 10:00 to 12:00 in the configured time zone (UTC if none is set). |
| ? | Indicates no specific value. | If you specify a day of the month, you can use ? in the Day-of-week field because the day of the week is irrelevant. |
| / | Indicates an increment. n/m means starting at n, with an increment of m. | In the Minutes field, 3/5 indicates that the task starts at the 3rd minute and runs every 5 minutes. |
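To make the special characters concrete, the following sketch expands a single numeric cron field into the list of values it matches. It is a simplified illustration, not the parser used by Function Compute: names such as MON or JAN are not handled, ? is treated like * within its own field, and the function name is hypothetical.

```python
from typing import List


def expand_field(field: str, low: int, high: int) -> List[int]:
    """Expand one numeric cron field into the sorted values it matches.

    `low` and `high` are the inclusive value range of the field,
    for example 0 and 59 for Minutes.
    """
    values = set()
    for part in field.split(","):           # "," separates list items
        if part in ("*", "?"):              # any value / no specific value
            values.update(range(low, high + 1))
        elif "/" in part:                   # "n/m": start at n, step by m
            start, step = part.split("/")
            first = low if start == "*" else int(start)
            values.update(range(first, high + 1, int(step)))
        elif "-" in part:                   # "a-b": inclusive range
            begin, end = part.split("-")
            values.update(range(int(begin), int(end) + 1))
        else:                               # a single value
            values.add(int(part))
    if any(v < low or v > high for v in values):
        raise ValueError(f"value out of range {low}-{high} in {field!r}")
    return sorted(values)


if __name__ == "__main__":
    print(expand_field("3/5", 0, 59))    # Minutes field
    print(expand_field("10-12", 0, 23))  # Hours field
```

For the Minutes field, 3/5 expands to minutes 3, 8, 13, and so on up to 58, which matches the "starts at the 3rd minute and runs every 5 minutes" reading.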

Metric-based scaling

Scenarios

Function Compute periodically collects key metrics and automatically scales the minimum number of instances in or out within the configured range for the minimum number of instances. This keeps the instance count aligned with actual resource usage. This policy is well suited to functions with unpredictable traffic patterns because it helps maintain a stable level of resource utilization.

Key metric descriptions

Metric-based scaling policies automatically adjust the minimum number of instances by tracking the following key metrics. Choosing the right metric is crucial for achieving efficient elasticity.

  • Instance concurrency utilization

    • Definition: The ratio of the total number of concurrent requests being processed by all provisioned instances (instances within the minimum instance range) to the maximum total number of concurrent requests that these instances can handle, measured over a collection period.

    • Formula: Total current concurrent requests / (Current minimum number of instances × Concurrency per instance)

    • Scenarios: Suitable for most general-purpose web services, API gateways, and other I/O-intensive or CPU-intensive businesses where the primary bottleneck is request processing capacity.

  • Memory utilization

    • Definition: The memory usage of all provisioned instances during the collection period.

    • Formula: Average memory used / Function memory configuration.

    • Scenarios: Suitable for memory-intensive businesses, such as big data processing, image transformation, and deep learning model pre-processing, where the performance bottleneck is memory consumption rather than request concurrency.

  • GPU resource utilization

    • Definition: For GPU-accelerated instances, you can track more granular GPU resource usage, including GPU utilization and GPU memory utilization.

      • GPU utilization: Reflects how busy the GPU computing cores are.

      • GPU memory utilization: Reflects the usage of GPU memory.

    • Scenarios: Specifically for functions that require GPU acceleration, such as AI inference and scientific computing. You can choose the appropriate metric for scaling based on whether the model depends more on compute resources or GPU memory.
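The two formulas given above for instance concurrency utilization and memory utilization can be written out as follows. This is a sketch of the arithmetic only; the function and parameter names are hypothetical, and the GPU metrics are left out because the text gives no formula for them.

```python
def concurrency_utilization(total_concurrent_requests: int,
                            min_instances: int,
                            concurrency_per_instance: int) -> float:
    """Total current concurrent requests / (minimum instances x concurrency per instance)."""
    capacity = min_instances * concurrency_per_instance
    if capacity <= 0:
        raise ValueError("minimum instances and concurrency must be positive")
    return total_concurrent_requests / capacity


def memory_utilization(average_memory_used_mb: float,
                       configured_memory_mb: float) -> float:
    """Average memory used / function memory configuration."""
    if configured_memory_mb <= 0:
        raise ValueError("configured memory must be positive")
    return average_memory_used_mb / configured_memory_mb


if __name__ == "__main__":
    # 10 minimum instances, 20 concurrent requests each, 120 requests in flight.
    print(concurrency_utilization(120, 10, 20))
    # 256 MB used on average by a function configured with 512 MB.
    print(memory_utilization(256, 512))
```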

Configuration example

This example uses the Instance Concurrency Utilization metric. When traffic increases and triggers the scale-out threshold, the number of instances scales out until it reaches the upper limit of the configured range. Requests that exceed this capacity are allocated to on-demand elastic instances. When traffic decreases and triggers the scale-in threshold, the number of instances begins to scale in. The following figure illustrates this process.

image
Note
  • When you configure a metric-based scaling policy for the minimum number of instances, you must enable instance-level metrics. Otherwise, a 400 InstanceMetricsRequired error is reported. For more information about how to enable instance-level metrics, see Configure instance-level metrics.

  • Instance concurrency utilization counts only the concurrency of instances within the minimum instance range. It does not include data from on-demand elastic instances.

  • Instance concurrency utilization is the ratio of the current concurrent requests handled by the Minimum Instances to the maximum concurrent requests supported by the Minimum Instance configuration. The value ranges from 0 to 1.

The following example shows how to use the PutProvisionConfig API to configure a metric-based scaling policy for a function named function_1. The policy is active from 2024-08-01 10:00:00 to 2024-08-30 10:00:00 in the Asia/Shanghai (UTC+8) time zone. The policy tracks the ProvisionedConcurrencyUtilization metric with a target Instance Concurrency Utilization of 60%. The system scales out to a maximum of 100 when utilization exceeds 60% and scales in to a minimum of 10 when utilization falls below 60%.

"targetTrackingPolicies": [
    {
      "name": "action_1",
      "startTime": "2024-08-01T10:00:00",
      "endTime": "2024-08-30T10:00:00",
      "metricType": "ProvisionedConcurrencyUtilization",
      "metricTarget": 0.6,
      "minCapacity": 10,
      "maxCapacity": 100,
      "timeZone": "Asia/Shanghai"
    }
  ]

The parameters are described as follows.

| Parameter | Description |
| --- | --- |
| name | The name of the metric-based task. |
| startTime | The time when the metric-based scaling configuration becomes effective. If no time zone is set, UTC is used by default. |
| endTime | The time when the metric-based scaling configuration stops being effective. If no time zone is set, UTC is used by default. |
| metricType | The name of the metric to track. In this example, it is ProvisionedConcurrencyUtilization. |
| metricTarget | The target value for the metric. For example, if you set this value to 0.6, scaling out begins when Instance Concurrency Utilization exceeds 60%, and scaling in begins when it falls below 60%. |
| minCapacity | The lower limit for the minimum number of instances. |
| maxCapacity | The upper limit for the minimum number of instances. |
| timeZone | The specified time zone. |

Scale-out and scale-in calculation principles

Scaling in is deliberately conservative. It applies a scale-in coefficient, a system parameter in the range from 0 (exclusive) to 1 (inclusive) that slows down the scale-in speed. You do not need to set the coefficient manually. In both directions, the final target value is obtained by rounding up the calculation result. The calculation logic is as follows.

  • Scale-out target = Current minimum number of instances × (Current metric value / Configured utilization threshold)

  • Scale-in target = Current minimum number of instances × Scale-in coefficient × (1 - Current metric value / Configured utilization threshold)

For example, assume that the current metric value is 80%, the configured utilization threshold for Instance Concurrency Utilization is 40%, and the current minimum number of instances is 100. The calculation is 100 × (80% / 40%) = 200, so the minimum number of instances is scaled out to 200, provided that this does not exceed the configured function quota. After scaling out, the utilization rate returns to approximately 40%.
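The scale-out formula and the rounding rule can be sketched as follows. Percentages are passed as integers so that the round-up is exact. The upper_limit parameter stands in for whatever cap applies (maxCapacity or the function quota) and, like the function name, is an assumption made for illustration. Scale-in is omitted because its coefficient is a system parameter that is not exposed.

```python
def scale_out_target(current_min_instances: int, metric_percent: int,
                     threshold_percent: int, upper_limit: int) -> int:
    """Scale-out target = current minimum x (metric / threshold), rounded up.

    The result is capped at `upper_limit`, a hypothetical stand-in for the
    configured upper bound.
    """
    if threshold_percent <= 0:
        raise ValueError("threshold_percent must be positive")
    # Integer ceiling division: -((-a) // b) equals ceil(a / b) for b > 0.
    raw_target = -((-current_min_instances * metric_percent) // threshold_percent)
    return min(raw_target, upper_limit)


if __name__ == "__main__":
    # Worked example from the text: 100 instances, metric 80%, threshold 40%.
    print(scale_out_target(100, 80, 40, upper_limit=1000))
```

With a lower cap, for example upper_limit=150, the same inputs yield 150, and utilization then stays above the threshold until traffic drops.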

How is the current minimum number of instances calculated?

The following example explains how the current minimum number of instances is calculated. It is determined by both the initially configured minimum number of instances and the target minimum number of instances set in the scheduled scaling policies.

Assume the initial minimum number of instances is 5. Two scheduled scaling policies are configured with the time zone set to Asia/Shanghai (UTC+8). The policies are active from 2025-06-09 10:00:00 to 2025-06-11 00:00:00 (UTC+8). Within this active period, the minimum number of instances scales out to 20 every day at 10:00 (UTC+8) and scales in to 10 every day at 22:00 (UTC+8). The policy configuration is as follows:

{
    "defaultTarget": 5,
    "scheduledActions": [
        {
            "name": "scale_up_action",
            "startTime": "2025-06-09T10:00:00",
            "endTime": "2025-06-11T00:00:00",
            "target": 20,
            "scheduleExpression": "cron(0 0 10 * * *)",
            "timeZone": "Asia/Shanghai"
        },
        {
            "name": "scale_down_action",
            "startTime": "2025-06-09T10:00:00",
            "endTime": "2025-06-11T00:00:00",
            "target": 10,
            "scheduleExpression": "cron(0 0 22 * * *)",
            "timeZone": "Asia/Shanghai"
        }
    ]
}

The following figure shows the value of the current minimum number of instances at different times:

image

Maximum responsive concurrency

The calculation for the maximum responsive concurrency of a function instance varies based on the instance concurrency setting:

  • Single concurrency per instance

    Maximum responsive concurrency = Number of function instances

  • Multiple concurrencies per instance

    Maximum responsive concurrency = Number of function instances × Concurrency per instance

For more information about the scenarios, benefits, configuration, and effects of instance concurrency, see Set instance concurrency.
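The two cases above reduce to one multiplication, because single concurrency is the special case where each instance handles one request at a time. A minimal sketch, with a hypothetical function name:

```python
def max_responsive_concurrency(instances: int,
                               concurrency_per_instance: int = 1) -> int:
    """Number of function instances x concurrency per instance.

    With the default of 1 this is the single-concurrency case.
    """
    if instances < 0 or concurrency_per_instance < 1:
        raise ValueError("instances must be >= 0 and concurrency must be >= 1")
    return instances * concurrency_per_instance


if __name__ == "__main__":
    print(max_responsive_concurrency(10))      # single concurrency
    print(max_responsive_concurrency(10, 20))  # 20 concurrent requests per instance
```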

References

To limit the maximum number of instances for a function, see Configure function quotas. After configuration, if the total number of executing instances for this function exceeds the limit, Function Compute returns a throttling error.