Configure system protection

System protection provides node-level traffic protection against unexpected situations. For example, if an interface without traffic protection rules encounters a sudden traffic surge, system protection acts as a safety net to ensure application stability. Microservices Governance offers several system protection features for both server-side and client-side traffic, including adaptive overload protection, total QPS throttling, total concurrency throttling, abnormal call circuit breaking, and slow call circuit breaking.

Note

For a comparison between system protection and traffic protection, see System protection vs. traffic protection.

Prerequisites

Procedure

  1. Log on to the MSE console, and select a region in the top navigation bar.

  2. In the left-side navigation pane, choose Microservices Governance > Application Governance.

  3. On the Application list page, click the resource card of the desired application. In the left-side navigation pane, click Traffic management.

  4. Click the System protection tab and configure the features described in the following sections.

Adaptive overload protection

Note

Adaptive overload protection requires agent version 3.1.4 or later.

How it works

Adaptive overload protection uses CPU utilization as a metric for system load. It adaptively adjusts the throttling rate for server-side traffic to keep CPU utilization stable and near a configured threshold, even during unexpected traffic surges.
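The feedback loop described above can be pictured with a minimal sketch. The agent's actual algorithm is not documented here, so the proportional update, the `gain` parameter, and the function name below are assumptions made purely for illustration.

```python
def next_reject_probability(current, cpu, threshold, gain=0.5):
    """Return an updated rejection probability in [0, 1].

    Hypothetical proportional controller: the probability rises while CPU
    utilization is above the threshold and falls while it is below.
    cpu and threshold are fractions between 0.0 and 1.0.
    """
    adjusted = current + gain * (cpu - threshold)
    return min(1.0, max(0.0, adjusted))


# Simulated CPU samples: load climbs above a 0.80 threshold, then recovers.
probability = 0.0
history = []
for cpu in [0.60, 0.90, 0.95, 0.85, 0.70, 0.50]:
    probability = next_reject_probability(probability, cpu, threshold=0.80)
    history.append(round(probability, 3))
```

In this sketch the rejection probability stays at zero while the CPU is below the threshold, grows during the surge, and decays back to zero once the load drops.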

Scope

Adaptive overload protection applies to all server-side interfaces and has a lower priority than traffic protection rules.

Use cases

Adaptive overload protection provides CPU-based baseline protection for server-side interfaces. It is suitable for CPU-bound applications where an unexpected traffic surge on an interface continuously increases system CPU utilization and affects the response time (RT) of core interfaces.

Steady-state CPU utilization varies by application. You can determine the maximum steady-state CPU utilization through stress testing or by analyzing historical data, and then configure the threshold by adding a safety margin to this value.
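As a rough illustration of that calculation, the sketch below derives a threshold from historical samples. The sample values, the 10-point margin, and the function name are hypothetical; choose a margin that fits your own application.

```python
def overload_threshold(steady_state_cpu_samples, safety_margin=10.0):
    """Return a CPU threshold (%) as the peak steady-state sample plus a margin.

    The result is capped at 100 because CPU utilization cannot exceed 100%.
    """
    peak = max(steady_state_cpu_samples)
    return min(100.0, peak + safety_margin)


# Hypothetical CPU utilization (%) observed during a stress test.
samples = [55.0, 62.0, 58.0, 64.0]
threshold = overload_threshold(samples)
```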

Console

The left panel lists adaptive overload protection events, and the right panel shows the average CPU utilization trend for the application's nodes over the last 5 minutes.

Events are reported at the node level based on algorithm-driven state changes. They include throttling start, throttling ongoing, and throttling end events.

At the top of the page, you can find the feature toggle and the Simulated Execution setting. In the CPU utilization chart on the right, a blue dashed line indicates the protection threshold.

Click the View link in the Actions column for an event to query the CPU utilization data for the corresponding node IP. The timeline jumps to the event's report time, so you can observe the node's CPU utilization and throttling probability when the event was triggered.

Parameter

Description

Mode

  • Close: Adaptive overload protection is disabled.

  • Simulated Execution: In this mode, when protection is triggered, the system generates events but does not adjust the traffic protection policy.

  • Enabled: In this mode, when triggered, the system adjusts traffic protection policies to throttle a percentage of all ingress traffic.

CPU utilization

Defines the expected CPU utilization threshold. Adaptive overload protection uses the system's actual CPU utilization and this configured threshold to adaptively adjust the interface throttling probability. This helps maintain CPU utilization within a small range around the threshold under high load by rejecting a portion of requests.

Exception settings

See Configure exception settings.

Total QPS throttling

How it works

Total QPS throttling tracks the total queries per second (QPS) at the node level, that is, the sum of the QPS of all server-side interfaces on a single node. It throttles requests that exceed the specified threshold.
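A minimal sketch of node-level QPS counting follows. It assumes a fixed one-second window, which may differ from the windowing the agent actually uses; the class and method names are hypothetical.

```python
class TotalQpsLimiter:
    """One counter shared by every server-side interface on a node."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.window_start = None  # the whole second the current window covers
        self.count = 0            # requests admitted in the current window

    def try_pass(self, interface, now):
        """Return True if the request is admitted, False if it is throttled.

        The interface name is accepted only to show that all interfaces
        draw from the same node-level budget.
        """
        second = int(now)
        if second != self.window_start:
            self.window_start = second
            self.count = 0
        if self.count >= self.threshold:
            return False
        self.count += 1
        return True


limiter = TotalQpsLimiter(threshold=3)
# Five requests to two hypothetical interfaces within the same second.
same_second = [limiter.try_pass(name, 10.0 + i * 0.1)
               for i, name in enumerate(["/a", "/b", "/a", "/b", "/a"])]
next_second = limiter.try_pass("/b", 11.0)
```

The first three requests pass regardless of which interface they target, the remaining two are throttled, and the budget resets in the next second.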

Note

Total QPS throttling requires agent version 4.2.0 or later.

Scope

Total QPS throttling applies to all server-side interfaces and has a lower priority than traffic protection rules.

Use cases

Not all systems are CPU-bound. Some applications may experience performance degradation at low CPU loads due to memory, network, or other resource constraints. Total QPS throttling provides a traffic-based protection mechanism by limiting the total QPS of a node.

This applies to scenarios where an unexpected traffic surge on an interface leads to resource contention, which in turn affects core interfaces.

You can determine the steady-state total QPS for a node through stress testing or by analyzing historical data, and then configure the threshold by adding a safety margin to this value.

Console

The left panel lists total QPS throttling events, and the right panel shows the trend of the average total QPS for the application's nodes over the last 5 minutes.

Events are reported every 5 minutes at the node and interface level where throttling occurred, covering the preceding 5-minute period.

The line chart on the right displays the Total QPS, Passed QPS, and Blocked QPS curves, along with a threshold reference line to help you observe the throttling effect.

Click the View link for an event to query the total QPS for the corresponding node IP. The timeline jumps to the event's report time, so you can check if the node's total QPS and throttling behavior meet your expectations. To view more detailed information at the interface and node level, navigate to the interface or node details page. Direct links will be provided in a future release.

The request data panel on the right displays trend curves for Total QPS, Passed QPS, and Blocked QPS.

Parameter

Description

Mode

  • Disabled: Total QPS throttling is disabled.

  • Enabled: In this mode, requests that exceed the threshold are throttled.

Total QPS Threshold

The total QPS threshold at the node level.

Exception settings

See Configure exception settings.

Total concurrency throttling

How it works

Total concurrency throttling tracks the total number of concurrent requests at the node level, that is, the sum of the concurrent requests of all server-side interfaces on a single node. It throttles requests that exceed the specified threshold.
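The idea can be sketched as a shared in-flight counter. This is an illustrative model only, not the agent's implementation; the class and method names are hypothetical.

```python
import threading


class TotalConcurrencyLimiter:
    """Counts in-flight requests across all interfaces on a node."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_enter(self):
        """Admit the request unless the node is already at its limit."""
        with self._lock:
            if self.in_flight >= self.threshold:
                return False  # rejected immediately, never queued
            self.in_flight += 1
            return True

    def leave(self):
        """Call when an admitted request finishes, successfully or not."""
        with self._lock:
            self.in_flight -= 1


limiter = TotalConcurrencyLimiter(threshold=2)
first = limiter.try_enter()
second = limiter.try_enter()
third = limiter.try_enter()       # rejected: two requests are in flight
limiter.leave()                   # one request completes
fourth = limiter.try_enter()      # admitted again
```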

Note

Total concurrency throttling requires agent version 4.2.0 or later.

Scope

Total concurrency throttling applies to all server-side interfaces and has a lower priority than traffic protection rules.

Use cases

QPS-based throttling has a significant drawback when response times (RT) are high, typically over 1 second. When system resources such as thread pools, memory, or connection pools are exhausted, requests queue up, which further inflates the interface RT. With QPS-based throttling alone, a small number of requests are still admitted every second, but the queued requests cannot be processed within that second. The queue keeps growing, and RT rises for both new and existing requests. Concurrency throttling instead rejects new requests immediately once a specified number of requests are already being processed. Although some requests are rejected, admitted requests spend minimal time queuing because the system accepts new work only as in-flight requests complete. This significantly improves the overall request success rate and average RT.

This applies to scenarios where an unexpected traffic surge on an interface leads to resource contention, queue buildup, and increased RT for all requests.

You can determine the steady-state total concurrency for a node through stress testing or by analyzing historical data, and then configure the threshold by adding a safety margin to this value.

Console

The left panel lists total concurrency throttling events, and the right panel shows the trend of the average total concurrent requests for the application's nodes over the last 5 minutes.

Events are reported every 5 minutes at the node and interface level where total concurrency throttling occurred, covering the preceding 5-minute period.

Click the View link for an event to query the total concurrency for the corresponding node IP. The timeline jumps to the event's report time, so you can check if the node's total concurrency and throttling behavior meet your expectations. To view more detailed information at the interface and node level, navigate to the interface or node details page. Direct links will be provided in a future release.

Parameter

Description

Mode

  • Disabled: Total concurrency throttling is disabled.

  • Enabled: In this mode, requests that exceed the threshold are throttled.

Total Concurrency Threshold

The total concurrency threshold at the node level.

Exception settings

See Configure exception settings.

Abnormal call circuit breaking

How it works

Abnormal call circuit breaking tracks the error rate for each client interface. When the error rate exceeds the configured threshold, it trips the circuit for that interface. While the circuit is open, subsequent requests to that interface fail fast. After the configured circuit breaking duration, the system sends probe requests. If a probe request succeeds, the circuit closes, and normal traffic resumes.
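The state transitions described above can be sketched as follows. This is a simplified model under stated assumptions: it keeps plain counters instead of a rolling statistics window, it sends a single probe, and all class, method, and state names are hypothetical.

```python
class ErrorRateBreaker:
    """Simplified breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN."""

    def __init__(self, threshold_pct, min_requests, open_seconds):
        self.threshold_pct = threshold_pct
        self.min_requests = min_requests
        self.open_seconds = open_seconds
        self.state = "CLOSED"
        self.total = 0
        self.errors = 0
        self.opened_at = 0.0

    def allow(self, now):
        if self.state == "OPEN":
            if now - self.opened_at < self.open_seconds:
                return False           # fail fast while the circuit is open
            self.state = "HALF_OPEN"   # duration elapsed: admit one probe
            return True
        if self.state == "HALF_OPEN":
            return False               # a probe is already in flight
        return True

    def record(self, success, now):
        if self.state == "OPEN":
            return                     # ignore results that arrive late
        if self.state == "HALF_OPEN":
            if success:
                self._reset("CLOSED", now)
            else:
                self._reset("OPEN", now)
            return
        self.total += 1
        if not success:
            self.errors += 1
        error_pct = 100.0 * self.errors / self.total
        if self.total >= self.min_requests and error_pct > self.threshold_pct:
            self._reset("OPEN", now)

    def _reset(self, state, now):
        self.state = state
        self.total = 0
        self.errors = 0
        if state == "OPEN":
            self.opened_at = now


breaker = ErrorRateBreaker(threshold_pct=50, min_requests=4, open_seconds=10)
for outcome in [True, False, False, False]:   # 75% errors over 4 requests
    breaker.allow(0.0)
    breaker.record(outcome, 0.0)
state_after_errors = breaker.state
blocked = breaker.allow(5.0)      # still inside the circuit breaking duration
probe = breaker.allow(10.0)       # duration elapsed, probe admitted
breaker.record(True, 10.2)        # probe succeeds
state_after_probe = breaker.state
```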

Note

Abnormal call circuit breaking requires agent version 4.2.0 or later.

Scope

Abnormal call circuit breaking applies to all client-side interfaces, except for those that have interface-specific circuit breaking rules configured.

Use cases

Abnormal call circuit breaking is primarily used in two scenarios.

Timeout scenarios: A high rate of timeout errors on a client interface often indicates an issue with the service provider. This can lead to request queuing in the calling application, affecting its other interfaces. By using circuit breaking, requests can fail fast during the provider's outage, which prevents queue buildup.

Non-timeout scenarios: When a client interface has a high rate of non-timeout errors, abnormal call circuit breaking can serve as a degradation mechanism. It throws a throttling exception that your code can handle, which improves the user experience during abnormal conditions.

Console

The left panel lists abnormal call circuit breaking events, and the right panel shows the top 10 interfaces with the highest error rates over the last 5 minutes.

Events are reported every 5 minutes at the node and interface level where abnormal call circuit breaking occurred, covering the preceding 5-minute period.

Parameter

Description

Mode

  • Disabled: Abnormal call circuit breaking is disabled.

  • Enabled: The circuit is tripped for any interface whose error rate exceeds the threshold.

Circuit Breaking Ratio Threshold (%)

The error rate threshold, as a percentage, at the interface level. When the error rate of an interface exceeds this value, the circuit is tripped.

Exception settings

See Configure exception settings.

Advanced Settings

Statistics window duration (s)

The length of the statistics window. Valid values: 1 second to 120 minutes (1 to 7,200 seconds).

Circuit breaking duration (s)

The duration for which the circuit remains open after it is tripped. During this period, all requests to the resource fail fast.

Minimum number of requests

The minimum number of requests required to trip the circuit. If the number of requests in the current statistics window is less than this value, the rule is not triggered even if the circuit breaking condition is met.

Circuit breaker recovery strategy

The recovery strategy for the circuit breaker when it enters the recovery phase (half-open state).

  • Single detection recovery: After the circuit breaking duration elapses, the circuit breaker sends one probe request. If the request meets expectations (it is neither a slow call nor an error), the circuit closes. Otherwise, the circuit re-enters the open state.

  • Progressive recovery: You must set the Number of recovery phases and Minimum passes per step.

    • After the circuit breaking duration elapses, the circuit breaker restores traffic in the specified number of recovery phases. When the number of requests in a phase reaches the minimum number of passes per step, a check is triggered. If the request metrics do not exceed the threshold, the percentage of allowed requests increases to that of the next phase, until all requests are restored. If the check fails in any phase, the circuit re-enters the open state.

    • Let T = 100/N, where N is the number of recovery phases. Phase 1 allows T% of requests, phase 2 allows 2T% of requests, and so on until 100%.

    • For example, if the number of recovery phases is 3 and the minimum number of passes per step is 5, requests are allowed at rates of 33%, 67%, and 100% in the three phases. When the number of requests in a phase is greater than or equal to 5, a check is performed. If the request metrics do not exceed the threshold, the next recovery phase begins, until traffic is fully restored.
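The per-phase percentages in progressive recovery follow directly from T = 100/N. The short sketch below computes them; the function name is hypothetical, and rounding to the nearest integer is an assumption chosen to match the 33%, 67%, 100% example.

```python
def recovery_rates(phases):
    """Return the percentage of requests allowed in each recovery phase.

    Phase k allows k * T percent, where T = 100 / phases.
    """
    step = 100 / phases
    return [round(step * k) for k in range(1, phases + 1)]


three_phases = recovery_rates(3)
four_phases = recovery_rates(4)
```

With three phases the result is 33, 67, and 100, and with four phases it is 25, 50, 75, and 100.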

Slow call circuit breaking

How it works

Slow call circuit breaking tracks the slow call rate for each client interface. When the slow call rate exceeds the configured threshold, the circuit for that interface is tripped. While the circuit is open, subsequent requests to the interface fail fast. After the configured circuit breaking duration, the system sends probe requests. If a probe request succeeds, the circuit closes, and normal traffic resumes.
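The trip condition can be sketched as a check over the response times collected in one statistics window. The function and argument names are hypothetical and only mirror the parameters described in this section; the sample RT values are made up.

```python
def slow_call_tripped(rts_ms, slow_rt_ms, threshold_pct, min_requests):
    """Return True if the slow call rate in the window trips the circuit.

    rts_ms: response times (ms) of the requests in the current window.
    A call is slow when its RT is greater than slow_rt_ms.
    """
    if len(rts_ms) < min_requests:
        return False  # too few requests to judge, the rule is not triggered
    slow = sum(1 for rt in rts_ms if rt > slow_rt_ms)
    return 100.0 * slow / len(rts_ms) > threshold_pct


window = [120, 900, 1500, 2000, 80]   # 2 of 5 calls exceed 1000 ms: 40%
trips_at_30 = slow_call_tripped(window, 1000, 30, min_requests=5)
trips_at_50 = slow_call_tripped(window, 1000, 50, min_requests=5)
too_few = slow_call_tripped(window, 1000, 30, min_requests=10)
```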

Note

Slow call circuit breaking requires agent version 4.2.0 or later.

Scope

Slow call circuit breaking applies to all client-side interfaces, except for those that have interface-specific circuit breaking rules configured.

Use cases

The use case for slow call circuit breaking is almost identical to the timeout scenario for abnormal call circuit breaking. The key difference is that you can dynamically adjust the RT threshold that defines a slow call, independent of any timeout configurations.

Console

The left panel lists slow call circuit breaking events, and the right panel shows the top 10 interfaces by average RT over the last 5 minutes.

Events are reported every 5 minutes at the node and interface level where slow call circuit breaking occurred, covering the preceding 5-minute period.

Parameter

Description

Mode

  • Disabled: Slow call circuit breaking is disabled.

  • Enabled: The circuit is tripped for any interface whose slow call rate exceeds the threshold.

Slow Call RT (ms)

Requests with response times exceeding this value are considered slow calls.

Degradation Threshold (%)

When the percentage of requests with an RT longer than the Slow Call RT exceeds this threshold, circuit breaking is triggered.

Exception settings

See Configure exception settings.

Advanced Settings

Statistics window duration (s)

The length of the statistics window. Valid values: 1 second to 120 minutes (1 to 7,200 seconds).

Circuit breaking duration (s)

The duration for which the circuit remains open after it is tripped. During this period, all requests to the resource fail fast.

Minimum number of requests

The minimum number of requests required to trip the circuit. If the number of requests in the current statistics window is less than this value, the rule is not triggered even if the circuit breaking condition is met.

Circuit breaker recovery strategy

The recovery strategy for the circuit breaker when it enters the recovery phase (half-open state).

  • Single detection recovery: After the circuit breaking duration elapses, the circuit breaker sends one probe request. If the request meets expectations (it is neither a slow call nor an error), the circuit closes. Otherwise, the circuit re-enters the open state.

  • Progressive recovery: You must set the Number of recovery phases and Minimum passes per step.

    • After the circuit breaking duration elapses, the circuit breaker restores traffic in the specified number of recovery phases. When the number of requests in a phase reaches the minimum number of passes per step, a check is triggered. If the request metrics do not exceed the threshold, the percentage of allowed requests increases to that of the next phase, until all requests are restored. If the check fails in any phase, the circuit re-enters the open state.

    • Let T = 100/N, where N is the number of recovery phases. Phase 1 allows T% of requests, phase 2 allows 2T% of requests, and so on until 100%.

    • For example, if the number of recovery phases is 3 and the minimum number of passes per step is 5, requests are allowed at rates of 33%, 67%, and 100% in the three phases. When the number of requests in a phase is greater than or equal to 5, a check is performed. If the request metrics do not exceed the threshold, the next recovery phase begins, until traffic is fully restored.

Exceptions

How it works

You can configure exception settings for all system protection features. Interfaces added to the exception list bypass rule checks and are always permitted.
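Conceptually, the exception list is checked before any system protection rule is evaluated. The sketch below illustrates that ordering; the function name, the interface names, and the `rule_check` callback are hypothetical.

```python
def admit(interface, exceptions, rule_check):
    """Return True if the request may proceed.

    Interfaces in the exception list bypass rule checks entirely;
    all other interfaces are subject to rule_check.
    """
    if interface in exceptions:
        return True               # always permitted, no rule is evaluated
    return rule_check(interface)


exceptions = {"/health"}


def deny_all(interface):
    """Stand-in for a rule that is currently throttling every request."""
    return False


health_ok = admit("/health", exceptions, deny_all)
orders_ok = admit("/orders", exceptions, deny_all)
```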

Note

Exception settings require agent version 4.2.0 or later.

Use cases

Typically, you only need to configure exception settings for health check interfaces and critical system interfaces. Exempting health check interfaces prevents throttled health checks from affecting the node's health status. Exempting critical interfaces that have their own throttling limits lets them operate without being affected by system-level throttling.

Configuration

The panel on the left lists interfaces that you can select based on recent calls. If an interface is not shown, you can enter its name in the input box and press Enter to add it to the selected list.

You can remove a selected interface by clicking the × icon next to it or clear all selections by clicking Remove all. Selected interfaces are not affected by the preceding configurations.

FAQ

System protection vs. traffic protection

  • Both system protection and traffic protection ensure application stability, but they cover different scenarios and result in different levels of traffic loss.

  • When throttling is triggered, system protection returns a 429 status code. Custom configurations are not currently supported.

  • System protection provides traffic protection based on node-level metrics, ensuring the stability of the application itself and covering most scenarios. However, because it treats all interfaces equally from the application's perspective, it does not account for the varying importance and load impact of different interfaces within the same application. Traffic protection allows you to configure different thresholds for different interfaces. This approach covers more scenarios and minimizes traffic loss while providing the same level of protection.

  • Overall, both system protection and traffic protection are effective safeguards. However, traffic protection offers better coverage and less traffic loss, while system protection is simpler to configure. Therefore, the best practice is to use a combination of both: use system protection to ensure baseline application stability, and use traffic protection to reduce traffic loss through fine-grained configuration without compromising protection effectiveness.

Related documents

For more information about traffic protection policies, see Traffic protection.