After you set the minimum number of instances for a function to one or more, you can configure elastic policies for the minimum instance count. These policies scale the minimum instance count out or in to match your workload, either during specific time periods or when a metric reaches a set threshold, subject to the scale-out speed limits. This approach keeps performance predictable and improves instance utilization.
Instance scaling behavior
When the minimum number of instances is one or more, the system first assigns requests to these instances. If the current load exceeds their capacity, the system automatically scales out more elastic instances.
As the number of invocation requests increases, Function Compute continuously creates new instances until there are enough to handle the requests or the configured instance limit is reached. The speed of this scale-out is limited. For more information, see Instance scale-out speed limits by region.
The instance scaling behavior depends on whether the minimum number of instances is 0, or is 1 or more. The following sections describe the behavior in each case as function invocation requests increase.
Minimum number of instances is 0
If the total number of instances or the instance scale-out speed exceeds the limit, Function Compute returns a throttling error with HTTP status code 429. The following figure shows the throttling behavior of Function Compute in a scenario where the number of invocations grows rapidly.
In section ① of the figure: Before the number of burst instances is reached, Function Compute immediately creates instances. This process involves cold starts but does not generate throttling errors.
In section ② of the figure: After the number of burst instances is reached, the growth in the number of instances is limited by speed. Some requests receive throttling errors.
In section ③ of the figure: After the number of instances exceeds the quota limit, some requests receive throttling errors.
Minimum number of instances is 1 or more
A large burst of invocations can cause request failures due to throttling limits on instance creation. Cold starts also increase request latency. To prevent these issues, you can set the minimum number of instances to one or more to pre-allocate resources.
Under the same load as in the scenario where the minimum number of instances is 0, setting the minimum number of instances to 1 or more changes the throttling behavior as follows.
In section ① of the figure: Before the minimum instances are fully utilized, requests are executed immediately. This process has no cold starts and no throttling errors.
In section ② of the figure: After the minimum instances are fully utilized and before the number of elastic instances reaches the burst limit, Function Compute immediately creates new instances. This process involves cold starts but does not generate throttling errors.
Instance scale-out speed limits by region
Region | Burst instances | Instance growth rate |
China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen) | 300 | 300 per minute |
Other | 100 | 100 per minute |
If you require a higher scale-out speed, join the DingTalk user group (Group ID 64970014484) to submit a request.
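The interaction between the burst limit, the growth rate, and the instance quota can be approximated with a small model. The following sketch is illustrative only: the burst and growth values come from the table above, but the linear formula, the function names, and the `quota` parameter are assumptions made for this example, not a description of how Function Compute schedules instances internally.

```python
import math


def max_instances_available(minutes_elapsed, burst, growth_per_minute, quota):
    """Estimate the upper bound on instances after sustained demand.

    Assumption: instances up to `burst` are available immediately, and the
    bound then grows linearly at `growth_per_minute`, capped by `quota`.
    """
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must not be negative")
    allowed = burst + growth_per_minute * minutes_elapsed
    return min(quota, math.floor(allowed))


def minutes_to_reach(target, burst, growth_per_minute, quota):
    """Estimate the minutes needed before `target` instances can exist.

    Returns None when the target is above the quota and can never be reached.
    """
    if target > quota:
        return None
    if target <= burst:
        return 0.0
    return (target - burst) / growth_per_minute


# Example: a region with a burst of 300 and growth of 300 per minute,
# and a hypothetical function quota of 1000 instances.
print(max_instances_available(2, burst=300, growth_per_minute=300, quota=1000))
print(minutes_to_reach(900, burst=300, growth_per_minute=300, quota=1000))
```

Under this model, requests that need more instances than the returned bound would be the ones that receive throttling errors.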
Elastic policies for minimum instances
A fixed minimum number of instances ensures performance but can waste resources during off-peak periods. Function Compute offers dynamic elastic policies to automatically adjust the minimum instance count based on time or metrics, which improves resource utilization.
When an elastic policy is active, it overrides the function's initial Minimum Number Of Instances configuration. When no elastic policy is active, the system reverts to the initial Minimum Number Of Instances configuration.
If multiple elastic policies are configured, the system calculates the Minimum Number Of Instances for each triggered policy. The highest of these values from all currently active policies becomes the actual Minimum Number Of Instances.
For more information, see How is the current minimum number of instances calculated?
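The selection rule described above can be expressed in a few lines. This is a minimal sketch: the function name and the dictionary of triggered policies are hypothetical, and it assumes that each policy's own target has already been computed elsewhere.

```python
def effective_minimum(default_minimum, triggered_targets):
    """Return the minimum instance count currently in force.

    `triggered_targets` maps a policy name to the minimum instance count
    that policy currently asks for. When no policy is active, the initial
    configuration applies; otherwise the highest target wins.
    """
    if not triggered_targets:
        return default_minimum
    return max(triggered_targets.values())


# No active policy: the initial configuration is used.
print(effective_minimum(5, {}))
# Two active policies: the higher target is used.
print(effective_minimum(5, {"scheduled": 50, "metric": 80}))
```

Note that an active policy replaces the initial value even when its target is lower, which follows from the override behavior described above.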
Scheduled scaling
Scenarios
You can use this policy when your function has predictable traffic patterns or periodic peaks. If the invocation concurrency exceeds the capacity of the minimum instances, the system automatically scales out additional elastic instances.
Configuration example
You can configure two scheduled actions. The first action scales out the minimum number of instances before traffic increases. The second action scales in the minimum number of instances after traffic decreases. The following figure illustrates this process.
The following example shows how to use the PutProvisionConfig API to configure scheduled scaling parameters. This policy is for a function named function_1. The time zone is set to Asia/Shanghai (UTC+8). The policy is active from 2024-08-01 10:00:00 to 2024-08-30 10:00:00 (UTC+8). Every day at 20:00 (UTC+8), the minimum number of instances scales out to 50. At 22:00 (UTC+8), it scales in to 10.
"scheduledActions": [
{
"name": "scale_up_action",
"startTime": "2024-08-01T10:00:00",
"endTime": "2024-08-30T10:00:00",
"target": 50,
"scheduleExpression": "cron(0 0 20 * * *)",
"timeZone": "Asia/Shanghai"
},
{
"name": "scale_down_action",
"startTime": "2024-08-01T10:00:00",
"endTime": "2024-08-30T10:00:00",
"target": 10,
"scheduleExpression": "cron(0 0 22 * * *)",
"timeZone": "Asia/Shanghai"
}
]
Cron expression description
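The scheduleExpression values in the example use a six-field form wrapped in cron(...). Judging from the example, where cron(0 0 20 * * *) fires at 20:00 every day, the fields appear to be ordered as seconds, minutes, hours, day of month, month, and day of week. The following sketch relies on that inferred order and handles only fixed daily times; the helper name is hypothetical and it is not a full cron parser.

```python
import re
from datetime import time

_CRON_RE = re.compile(r"^cron\((.+)\)$")


def daily_fire_time(schedule_expression):
    """Extract the time of day from a simple daily cron expression.

    Assumed field order: second minute hour day-of-month month day-of-week.
    Only expressions with numeric second, minute, and hour fields and '*'
    in the remaining fields are accepted.
    """
    match = _CRON_RE.match(schedule_expression.strip())
    if not match:
        raise ValueError("expected an expression of the form cron(...)")
    fields = match.group(1).split()
    if len(fields) != 6:
        raise ValueError("expected six cron fields")
    second, minute, hour, day, month, weekday = fields
    if (day, month, weekday) != ("*", "*", "*"):
        raise ValueError("only daily schedules are handled in this sketch")
    if not all(field.isdigit() for field in (second, minute, hour)):
        raise ValueError("second, minute, and hour must be plain numbers")
    return time(int(hour), int(minute), int(second))


print(daily_fire_time("cron(0 0 20 * * *)"))
print(daily_fire_time("cron(0 0 22 * * *)"))
```

The returned time is a wall-clock time in whatever time zone the action's timeZone field specifies.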
Metric-based scaling
Scenarios
Function Compute periodically collects key metrics and automatically scales the minimum number of instances up or down based on the configured Range for the Minimum Number of Instances. This helps align the instance count with actual resource usage. This policy is ideal for functions with unpredictable traffic patterns because it helps maintain a stable level of resource utilization.
Key metric descriptions
Metric-based scaling policies automatically adjust the minimum number of instances by tracking the following key metrics. Choosing the right metric is crucial for achieving efficient elasticity.
Instance concurrency utilization
Definition: The ratio of the total number of concurrent requests being processed by all provisioned instances (instances within the minimum instance range) to the maximum total number of concurrent requests that these instances can handle, measured over a collection period.
Formula:
Total current concurrent requests / (Current minimum number of instances × Concurrency per instance)
Scenarios: Suitable for most general-purpose web services, API gateways, and other I/O-intensive or CPU-intensive workloads where the primary bottleneck is request processing capacity.
Memory utilization
Definition: The memory usage of all provisioned instances during the collection period.
Formula:
Average memory used / Function memory configuration
Scenarios: Suitable for memory-intensive workloads, such as big data processing, image transformation, and deep learning model pre-processing, where the performance bottleneck is memory consumption rather than request concurrency.
GPU resource utilization
Definition: For GPU-accelerated instances, you can track more granular GPU resource usage, including GPU utilization and GPU memory utilization.
GPU utilization: Reflects how busy the GPU computing cores are.
GPU memory utilization: Reflects the usage of GPU memory.
Scenarios: Specifically for functions that require GPU acceleration, such as AI inference and scientific computing. You can choose the appropriate metric for scaling based on whether the model depends more on compute resources or GPU memory.
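The concurrency and memory formulas above translate directly into code. The following sketch only restates those two ratios; the function and parameter names are chosen for illustration, and the units for memory (MB here) are an assumption, since any consistent unit gives the same ratio.

```python
def concurrency_utilization(concurrent_requests, minimum_instances,
                            concurrency_per_instance):
    """Concurrent requests divided by the capacity of the minimum instances."""
    capacity = minimum_instances * concurrency_per_instance
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return concurrent_requests / capacity


def memory_utilization(average_memory_used_mb, configured_memory_mb):
    """Average memory used divided by the function's configured memory."""
    if configured_memory_mb <= 0:
        raise ValueError("configured memory must be positive")
    return average_memory_used_mb / configured_memory_mb


# 10 minimum instances, each handling up to 20 concurrent requests,
# currently serving 120 requests in total.
print(concurrency_utilization(120, 10, 20))
# Instances using 384 MB on average out of a 512 MB configuration.
print(memory_utilization(384, 512))
```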
Configuration example
This example uses the Instance Concurrency Utilization metric. When traffic increases and triggers the scale-out threshold, the number of instances scales out until it reaches the upper limit of the configured range. Requests that exceed this capacity are allocated to on-demand elastic instances. When traffic decreases and triggers the scale-in threshold, the number of instances begins to scale in. The following figure illustrates this process.
When you configure a metric-based scaling policy for the minimum number of instances, you must enable instance-level metrics. Otherwise, a 400 InstanceMetricsRequired error is reported. For more information about how to enable instance-level metrics, see Configure instance-level metrics.
Instance concurrency utilization only counts the concurrency of elastic instances within the minimum instance range. It does not include data from on-demand elastic instances.
Instance concurrency utilization is the ratio of the current concurrent requests handled by the Minimum Instances to the maximum concurrent requests supported by the Minimum Instance configuration. The value ranges from 0 to 1.
The following example shows how to use the PutProvisionConfig API to configure a metric-based scaling policy for a function named function_1. The policy is active from 2024-08-01 10:00:00 to 2024-08-30 10:00:00 in the Asia/Shanghai (UTC+8) time zone. The policy tracks the ProvisionedConcurrencyUtilization metric with a target Instance Concurrency Utilization of 60%. The system scales out to a maximum of 100 when utilization exceeds 60% and scales in to a minimum of 10 when utilization falls below 60%.
"targetTrackingPolicies": [
{
"name": "action_1",
"startTime": "2024-08-01T10:00:00",
"endTime": "2024-08-30T10:00:00",
"metricType": "ProvisionedConcurrencyUtilization",
"metricTarget": 0.6,
"minCapacity": 10,
"maxCapacity": 100,
"timeZone": "Asia/Shanghai"
}
]
Scale-out and scale-in calculation principles
Scale-in is deliberately conservative: the system applies a scale-in coefficient, greater than 0 and at most 1, that slows the reduction in instances so that capacity does not drop too quickly. The coefficient is a system parameter, so you do not need to set it manually. The final target value is obtained by rounding up the calculation result. The calculation logic is as follows.
Scale-out target = Current minimum number of instances × (Current metric value / Configured utilization threshold)
Scale-in target = Current minimum number of instances × Scale-in coefficient × (1 - Current metric value / Configured utilization threshold)
For example, assume that the current metric value is 80%, the configured utilization threshold is 40%, and the current minimum number of instances is 100. The calculation is 100 × (80% / 40%) = 200. Based on this result, the minimum number of instances is scaled out to 200, provided that this does not exceed the configured function quota. After scaling out, the utilization rate returns to approximately 40%.
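The scale-out formula and the worked example above can be checked with a short function. This is a sketch under stated assumptions: the function name is hypothetical, the optional `max_capacity` cap is assumed to be applied after rounding up, and returning the current value when utilization is at or below the threshold is a choice made for this example. Scale-in is left out because its coefficient is an internal system parameter.

```python
import math


def scale_out_target(current_minimum, metric_value, threshold,
                     max_capacity=None):
    """Compute the scale-out target, rounded up and optionally capped."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if metric_value <= threshold:
        # Assumption: no scale-out is needed at or below the threshold.
        return current_minimum
    raw = current_minimum * (metric_value / threshold)
    # Rounding to 9 decimals first guards against floating-point noise
    # pushing an exact result such as 200.0 up to 201.
    target = math.ceil(round(raw, 9))
    if max_capacity is not None:
        target = min(target, max_capacity)
    return target


# Worked example from the text: 100 instances, 80% utilization, 40% threshold.
print(scale_out_target(100, 0.8, 0.4))
# The same inputs with a hypothetical upper limit of 150 instances.
print(scale_out_target(100, 0.8, 0.4, max_capacity=150))
```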
How is the current minimum number of instances calculated?
The following example explains how the current minimum number of instances is calculated. It is determined by both the initially configured minimum number of instances and the target minimum number of instances set in the scheduled scaling policies.
Assume the initial minimum number of instances is 5. Two scheduled scaling policies are configured with the time zone set to Asia/Shanghai (UTC+8). The policies are active from 2025-06-09 10:00:00 to 2025-06-11 00:00:00 (UTC+8). Within this active period, the minimum number of instances scales out to 20 every day at 10:00 (UTC+8) and scales in to 10 every day at 22:00 (UTC+8). The policy configuration is as follows:
{
"defaultTarget": 5,
"scheduledActions": [
{
"name": "scale_up_action",
"startTime": "2025-06-09T10:00:00",
"endTime": "2025-06-11T00:00:00",
"target": 20,
"scheduleExpression": "cron(0 0 10 * * *)",
"timeZone": "Asia/Shanghai"
},
{
"name": "scale_down_action",
"startTime": "2025-06-09T10:00:00",
"endTime": "2025-06-11T00:00:00",
"target": 10,
"scheduleExpression": "cron(0 0 22 * * *)",
"timeZone": "Asia/Shanghai"
}
]
}
The following figure shows the value of the current minimum number of instances at different times:
Maximum responsive concurrency
The maximum responsive concurrency of a function depends on the instance concurrency setting:
Single concurrency per instance
Maximum responsive concurrency = Number of function instances
Multiple concurrencies per instance
Maximum responsive concurrency = Number of function instances × Concurrency per instance
For more information about the scenarios, benefits, configuration, and effects of instance concurrency, see Set instance concurrency.
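Both cases of the maximum responsive concurrency calculation reduce to one multiplication, with a per-instance concurrency of 1 in the single-concurrency case. The following sketch shows this; the function name and the default argument are choices made for illustration.

```python
def max_responsive_concurrency(instance_count, concurrency_per_instance=1):
    """Number of requests the function can process at the same time."""
    if instance_count < 0:
        raise ValueError("instance_count must not be negative")
    if concurrency_per_instance < 1:
        raise ValueError("concurrency_per_instance must be at least 1")
    return instance_count * concurrency_per_instance


# Single concurrency per instance: 10 instances handle 10 requests.
print(max_responsive_concurrency(10))
# Multiple concurrency: 10 instances with 20 concurrent requests each.
print(max_responsive_concurrency(10, 20))
```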
References
To limit the maximum number of instances for a function, see Configure function quotas. After configuration, if the total number of executing instances for this function exceeds the limit, Function Compute returns a throttling error.