System protection provides node-level traffic protection against unexpected situations. For example, if an interface without traffic protection rules encounters a sudden traffic surge, system protection acts as a safety net to ensure application stability. Microservices Governance offers several system protection features for both server-side and client-side traffic, including adaptive overload protection, total QPS throttling, total concurrency throttling, abnormal call circuit breaking, and slow call circuit breaking.
For a comparison between system protection and traffic protection, see System protection vs. traffic protection.
Prerequisites
- Connect your application to Microservices Governance. For instructions, see Connect an ACK microservice application to the MSE governance center and Connect an ECS microservice application to the MSE governance center.
Procedure
- Log on to the MSE console, and select a region in the top navigation bar.
- In the left-side navigation pane, choose Microservices Governance > Application Governance.
- On the Application list page, click the resource card of the desired application. In the left-side navigation pane, click Traffic management.
- Click the system protection tab and configure the features.
Adaptive overload protection
Adaptive overload protection requires agent version 3.1.4 or later.
How it works
Adaptive overload protection uses CPU utilization as a metric for system load. It adaptively adjusts the throttling rate for server-side traffic to keep CPU utilization stable and near a configured threshold, even during unexpected traffic surges.
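The feedback loop described above can be pictured with a minimal sketch. This is not the agent's actual algorithm, which this document does not specify. It assumes a simple proportional adjustment, and the class name `AdaptiveThrottle` and the `gain` constant are hypothetical.

```python
import random


class AdaptiveThrottle:
    """Toy CPU-driven load shedder (illustrative only, not the agent's algorithm)."""

    def __init__(self, cpu_threshold: float, gain: float = 0.5):
        self.cpu_threshold = cpu_threshold  # target CPU utilization, 0.0 to 1.0
        self.gain = gain                    # hypothetical tuning constant
        self.reject_probability = 0.0

    def update(self, cpu_utilization: float) -> float:
        """Raise the rejection probability when CPU is above the threshold, lower it when below."""
        error = cpu_utilization - self.cpu_threshold
        adjusted = self.reject_probability + self.gain * error
        self.reject_probability = min(1.0, max(0.0, adjusted))
        return self.reject_probability

    def allow(self, rng=random.random) -> bool:
        """Admit a request unless it falls into the rejected fraction."""
        return rng() >= self.reject_probability


throttle = AdaptiveThrottle(cpu_threshold=0.70)
throttle.update(0.90)  # CPU above the threshold: rejection probability rises
throttle.update(0.60)  # CPU below the threshold: rejection probability falls
```

Each call to `update` would correspond to one sampling of node CPU utilization, and `allow` would be consulted for each incoming server-side request.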
Scope
Adaptive overload protection applies to all server-side interfaces and has a lower priority than traffic protection rules.
Use cases
Adaptive overload protection provides CPU-based baseline protection for server-side interfaces. It is suitable for CPU-bound applications where an unexpected traffic surge on an interface continuously increases system CPU utilization and affects the response time (RT) of core interfaces.
Steady-state CPU utilization varies by application. You can determine the maximum steady-state CPU utilization through stress testing or by analyzing historical data, and then configure the threshold by adding a safety margin to this value.
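As a worked illustration of this calculation, the sketch below takes the peak of a set of steady-state samples and adds a margin. The function name `suggest_threshold`, the sample values, and the 10-point margin are all assumptions chosen for the example, not recommended values.

```python
from typing import Optional, Sequence


def suggest_threshold(samples: Sequence[float], margin: float,
                      ceiling: Optional[float] = 100.0) -> float:
    """Return the observed steady-state peak plus a safety margin.

    `ceiling` caps the result (100.0 suits a CPU percentage). Pass None for
    unbounded metrics such as QPS or concurrency.
    """
    if not samples:
        raise ValueError("need at least one steady-state sample")
    candidate = max(samples) + margin
    return candidate if ceiling is None else min(ceiling, candidate)


# Hypothetical CPU utilization samples (%) from a stress test.
cpu_samples = [42.0, 55.5, 48.0, 61.0]
print(suggest_threshold(cpu_samples, margin=10.0))  # 71.0
```

The same arithmetic applies to the total QPS and total concurrency thresholds when `ceiling=None` is passed.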
Console
The left panel lists adaptive overload protection events, and the right panel shows the average CPU utilization trend for the application's nodes over the last 5 minutes.
Events are reported at the node level based on algorithm-driven state changes. They include throttling start, throttling ongoing, and throttling end events.
At the top of the page, you can find the feature toggle and the Simulated Execution setting. In the CPU utilization chart on the right, a blue dashed line indicates the protection threshold.
Click the View link in the Actions column for an event to query the CPU utilization data for the corresponding node IP. The timeline jumps to the event's report time, so you can observe the node's CPU utilization and throttling probability when the event was triggered.
| Parameter | Description |
| --- | --- |
| Mode | |
| CPU utilization | Defines the expected CPU utilization threshold. Adaptive overload protection uses the system's actual CPU utilization and this configured threshold to adaptively adjust the interface throttling probability. This helps maintain CPU utilization within a small range around the threshold under high load by rejecting a portion of requests. |
| Exception settings | |
Total QPS throttling
How it works
Total QPS throttling tracks the total queries per second (QPS) at the node level, that is, the sum of the QPS of all server-side interfaces on a single node. It throttles requests that exceed the specified threshold.
Total QPS throttling requires agent version 4.2.0 or later.
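A minimal sketch of node-level QPS throttling follows, assuming a fixed one-second window and a single counter shared by every interface. The agent's real counting strategy is not described here, and the class name `TotalQpsLimiter` is hypothetical.

```python
import time


class TotalQpsLimiter:
    """Toy node-wide limiter: one counter shared by all server-side interfaces."""

    def __init__(self, threshold: int, clock=time.monotonic):
        self.threshold = threshold  # maximum admitted requests per second
        self.clock = clock
        self.window = int(clock())  # current one-second window
        self.count = 0              # requests admitted in this window

    def try_acquire(self, interface: str) -> bool:
        """Return True if the request is admitted, False if it is throttled."""
        # `interface` is not used for counting: every interface draws
        # from the same node-level budget.
        now = int(self.clock())
        if now != self.window:
            self.window, self.count = now, 0  # a new second starts a new window
        if self.count >= self.threshold:
            return False
        self.count += 1
        return True
```

Because the budget is shared, a surge on one interface can exhaust it for all others, which is why exception settings and interface-level traffic protection rules complement this feature.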
Scope
Total QPS throttling applies to all server-side interfaces and has a lower priority than traffic protection rules.
Use cases
Not all systems are CPU-bound. Some applications may experience performance degradation at low CPU loads due to memory, network, or other resource constraints. Total QPS throttling provides a traffic-based protection mechanism by limiting the total QPS of a node.
This applies to scenarios where an unexpected traffic surge on an interface leads to resource contention, which in turn affects core interfaces.
You can determine the steady-state total QPS for a node through stress testing or by analyzing historical data, and then configure the threshold by adding a safety margin to this value.
Console
The left panel lists total QPS throttling events, and the right panel shows the trend of the average total QPS for the application's nodes over the last 5 minutes.
Events are reported every 5 minutes at the node and interface level where throttling occurred, covering the preceding 5-minute period.
The request data panel on the right displays trend curves for Total QPS, Passed QPS, and Blocked QPS, along with a threshold reference line to help you observe the throttling effect.
Click the View link for an event to query the total QPS for the corresponding node IP. The timeline jumps to the event's report time, so you can check whether the node's total QPS and throttling behavior meet your expectations. To view more detailed information at the interface and node level, navigate to the interface or node details page. Direct links will be provided in a future release.
| Parameter | Description |
| --- | --- |
| Mode | |
| Total QPS Threshold | The total QPS threshold at the node level. |
| Exception settings | |
Total concurrency throttling
How it works
Total concurrency throttling tracks the total number of concurrent requests at the node level, that is, the sum of the concurrent requests of all server-side interfaces on a single node. It throttles requests that exceed the specified threshold.
Total concurrency throttling requires agent version 4.2.0 or later.
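The idea can be sketched as a node-wide in-flight counter that rejects a request immediately when the threshold is reached. This is an illustration under assumed names (`TotalConcurrencyLimiter`, `RejectedError`), not the agent's implementation.

```python
import threading
from contextlib import contextmanager


class RejectedError(Exception):
    """Hypothetical error raised when the node is at its concurrency limit."""


class TotalConcurrencyLimiter:
    """Toy node-wide limiter on the number of requests being processed."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.in_flight = 0
        self._lock = threading.Lock()

    @contextmanager
    def slot(self):
        """Hold one concurrency slot for the duration of a request."""
        with self._lock:
            if self.in_flight >= self.threshold:
                # Reject right away instead of queuing behind busy requests.
                raise RejectedError("total concurrency threshold reached")
            self.in_flight += 1
        try:
            yield
        finally:
            with self._lock:
                self.in_flight -= 1  # free the slot even if the handler fails
```

A request handler would wrap its work in `with limiter.slot():` and translate `RejectedError` into a throttled response.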
Scope
Total concurrency throttling applies to all server-side interfaces and has a lower priority than traffic protection rules.
Use cases
QPS-based throttling has a significant drawback when response times (RT) are high, typically over 1 second. When system resources such as thread pools, memory, or connection pools are occupied, requests queue up, which further inflates the interface RT. With QPS-based throttling alone, some requests are still admitted every second, but the queued requests cannot be processed within that second, so the queue keeps growing and RT rises for both new and existing requests.
Concurrency throttling instead rejects new requests immediately once a certain number of requests are already being processed. Although some requests are throttled, the system can process new requests with minimal queuing time after it finishes the current ones. This significantly improves the overall request success rate and average RT.
This applies to scenarios where an unexpected traffic surge on an interface leads to resource contention, queue buildup, and increased RT for all requests.
You can determine the steady-state total concurrency for a node through stress testing or by analyzing historical data, and then configure the threshold by adding a safety margin to this value.
Console
The left panel lists total concurrency throttling events, and the right panel shows the trend of the average total concurrent requests for the application's nodes over the last 5 minutes.
Events are reported every 5 minutes at the node and interface level where total concurrency throttling occurred, covering the preceding 5-minute period.
Click the View link for an event to query the total concurrency for the corresponding node IP. The timeline jumps to the event's report time, so you can check if the node's total concurrency and throttling behavior meet your expectations. To view more detailed information at the interface and node level, navigate to the interface or node details page. Direct links will be provided in a future release.
| Parameter | Description |
| --- | --- |
| Mode | |
| Total Concurrency Threshold | The total concurrency threshold at the node level. |
| Exception settings | |
Abnormal call circuit breaking
How it works
Abnormal call circuit breaking tracks the error rate for each client interface. When the error rate exceeds the configured threshold, it trips the circuit for that interface. While the circuit is open, subsequent requests to that interface fail fast. After the configured circuit breaking duration, the system sends probe requests. If a probe request succeeds, the circuit closes and normal traffic resumes.
Abnormal call circuit breaking requires agent version 4.2.0 or later.
Scope
Abnormal call circuit breaking applies to all client-side interfaces, except for those that have interface-specific circuit breaking rules configured.
Use cases
Abnormal call circuit breaking is primarily used in two scenarios.
Timeout scenarios: A high rate of timeout errors on a client interface often indicates a problem with the service provider. This can cause requests to queue in the calling application and affect its other interfaces. With circuit breaking, requests fail fast during the provider's outage, which prevents queue buildup.
Non-timeout scenarios: When a client interface returns a high rate of non-timeout errors, abnormal call circuit breaking throws a throttling exception. You can handle this exception to return a degraded response, which improves the user experience during abnormal conditions.
Console
The left panel lists abnormal call circuit breaking events, and the right panel shows the top 10 interfaces with the highest error rates over the last 5 minutes.
Events are reported every 5 minutes at the node and interface level where abnormal call circuit breaking occurred, covering the preceding 5-minute period.
| Parameter | Description |
| --- | --- |
| Mode | |
| Circuit Breaking Ratio Threshold (%) | The circuit breaking ratio threshold at the interface level. |
| Exception settings | |
| Advanced Settings | |
| Statistics window duration (s) | The length of the statistics window. Valid values: 1 second to 120 minutes. |
| Circuit breaking duration (s) | The duration for which the circuit remains open after it is tripped. During this period, all requests to the resource fail fast. |
| Minimum number of requests | The minimum number of requests required to trip the circuit. If the number of requests in the current statistics window is less than this value, the rule is not triggered even if the circuit breaking condition is met. |
| Circuit breaker recovery strategy | The recovery strategy for the circuit breaker when it enters the recovery phase (half-open state). |
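How these parameters interact can be sketched with a small state machine. The sketch assumes a fixed statistics window and a single-probe recovery strategy. The class name `ErrorRatioBreaker` and its method names are hypothetical, and the agent's actual behavior may differ.

```python
import time


class ErrorRatioBreaker:
    """Toy per-interface breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, ratio_threshold_pct, window_s, open_s, min_requests,
                 clock=time.monotonic):
        self.ratio_threshold_pct = ratio_threshold_pct  # error ratio threshold (%)
        self.window_s = window_s          # statistics window duration (s)
        self.open_s = open_s              # circuit breaking duration (s)
        self.min_requests = min_requests  # minimum number of requests
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = 0.0
        self._reset_window()

    def _reset_window(self):
        self.window_start = self.clock()
        self.total = 0
        self.errors = 0

    def _trip(self):
        self.state = "OPEN"
        self.opened_at = self.clock()

    def allow(self) -> bool:
        """Decide whether a call may proceed."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_s:
                return False          # fail fast while the circuit is open
            self.state = "HALF_OPEN"  # duration elapsed: admit one probe
            return True
        if self.state == "HALF_OPEN":
            return False              # a probe is already in flight
        return True

    def record(self, success: bool) -> None:
        """Report the outcome of a call that was allowed."""
        if self.state == "HALF_OPEN":
            if success:
                self.state = "CLOSED"  # probe succeeded: resume normal traffic
                self._reset_window()
            else:
                self._trip()           # probe failed: open the circuit again
            return
        if self.state == "OPEN":
            return                     # late result from before the trip
        if self.clock() - self.window_start >= self.window_s:
            self._reset_window()
        self.total += 1
        self.errors += 0 if success else 1
        ratio_pct = 100.0 * self.errors / self.total
        if self.total >= self.min_requests and ratio_pct > self.ratio_threshold_pct:
            self._trip()
```

A caller would check `allow()` before each outbound call, return a fallback response when it is `False`, and otherwise pass the outcome of the call to `record()`.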
Slow call circuit breaking
How it works
Slow call circuit breaking tracks the slow call rate for each client interface. When the slow call rate exceeds the configured threshold, the circuit for that interface is tripped. While the circuit is open, subsequent requests to the interface fail fast. After the configured circuit breaking duration, the system sends probe requests. If a probe request succeeds, the circuit closes and normal traffic resumes.
Slow call circuit breaking requires agent version 4.2.0 or later.
Scope
Slow call circuit breaking applies to all client-side interfaces, except for those that have interface-specific circuit breaking rules configured.
Use cases
The use case for slow call circuit breaking is almost identical to the timeout scenario for abnormal call circuit breaking. The key difference is that you can dynamically adjust the RT threshold that defines a slow call, independent of any timeout configurations.
Console
The left panel lists slow call circuit breaking events, and the right panel shows the top 10 interfaces by average RT over the last 5 minutes.
Events are reported every 5 minutes at the node and interface level where slow call circuit breaking occurred, covering the preceding 5-minute period.
| Parameter | Description |
| --- | --- |
| Mode | |
| Slow Call RT (ms) | Requests with response times exceeding this value are considered slow calls. |
| Degradation Threshold (%) | When the percentage of requests with an RT longer than the Slow Call RT exceeds this threshold, circuit breaking is triggered. |
| Exception settings | |
| Advanced Settings | |
| Statistics window duration (s) | The length of the statistics window. Valid values: 1 second to 120 minutes. |
| Circuit breaking duration (s) | The duration for which the circuit remains open after it is tripped. During this period, all requests to the resource fail fast. |
| Minimum number of requests | The minimum number of requests required to trip the circuit. If the number of requests in the current statistics window is less than this value, the rule is not triggered even if the circuit breaking condition is met. |
| Circuit breaker recovery strategy | The recovery strategy for the circuit breaker when it enters the recovery phase (half-open state). |
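The trip condition for one statistics window can be expressed in a few lines. This sketch only evaluates the condition from the parameter descriptions above. The function name `should_trip` and the sample values are assumptions for illustration.

```python
def should_trip(response_times_ms, slow_call_rt_ms,
                degradation_threshold_pct, min_requests):
    """Return True if the slow call ratio in one statistics window trips the circuit."""
    total = len(response_times_ms)
    if total < min_requests:
        return False  # too few requests in the window to judge
    slow = sum(1 for rt in response_times_ms if rt > slow_call_rt_ms)
    return 100.0 * slow / total > degradation_threshold_pct


# Hypothetical window: two of four calls exceed a 1000 ms Slow Call RT (50% > 40%).
print(should_trip([120, 900, 1500, 2000], slow_call_rt_ms=1000,
                  degradation_threshold_pct=40, min_requests=4))  # True
```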
Exceptions
How it works
You can configure exception settings for all system protection features. Interfaces added to the exception list bypass rule checks and are always permitted.
Exception settings require agent version 4.2.0 or later.
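Conceptually, the exception list is consulted before any system protection rule. The sketch below illustrates that ordering with hypothetical names (`admit`, `system_check`). It is not the agent's code.

```python
def admit(interface, exempt_interfaces, system_check):
    """Permit exempt interfaces unconditionally, otherwise defer to the system protection check."""
    if interface in exempt_interfaces:
        return True  # rule checks are bypassed entirely
    return system_check(interface)


# Hypothetical setup: the node is overloaded, so the system check rejects everything.
exempt = {"/health"}
overloaded = lambda interface: False
print(admit("/health", exempt, overloaded))  # True
print(admit("/orders", exempt, overloaded))  # False
```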
Use cases
Typically, you only need to configure exception settings for health check interfaces and critical system interfaces. Exempting health check interfaces prevents throttled health checks from affecting the node's health status. Exempting critical interfaces, which have their own specific throttling limits, lets them operate without being affected by system-level throttling.
Configuration
The panel on the left lists interfaces that you can select based on recent calls. If an interface is not shown, you can enter its name in the input box and press Enter to add it to the selected list.
You can remove a selected interface by clicking the × icon next to it or clear all selections by clicking Remove all. Selected interfaces are not affected by the preceding configurations.
FAQ
System protection vs. traffic protection
- Both system protection and traffic protection ensure application stability, but they cover different scenarios and result in different levels of traffic loss.
- When throttling is triggered, system protection returns a 429 status code. Custom configurations are not currently supported.
- System protection provides traffic protection based on node-level metrics, ensuring the stability of the application itself and covering most scenarios. However, because it treats all interfaces equally from the application's perspective, it does not account for the varying importance and load impact of different interfaces within the same application. Traffic protection allows you to configure different thresholds for different interfaces. This approach covers more scenarios and minimizes traffic loss while providing the same level of protection.
- Overall, both system protection and traffic protection are effective safeguards. However, traffic protection offers better coverage and less traffic loss, while system protection is simpler to configure. Therefore, the best practice is to use a combination of both: use system protection to ensure baseline application stability, and use traffic protection to reduce traffic loss through fine-grained configuration without compromising protection effectiveness.
Related documents
For more information about traffic protection policies, see Traffic protection.