Cloud Monitor 2.0 provides built-in health inspection rules that cover common health issue detection scenarios and work out of the box.
Overview
Severity level definitions
The severity levels of built-in rules align with alert severity levels:
| Level | Name | Meaning |
|---|---|---|
| P1 | Critical | A critical issue that requires immediate attention. |
| P2 | Error | An error that requires prompt handling. |
| P3 | Warning | A warning that requires attention. |
| P4 | Normal (Info) | An informational message for reference. |
Rule type definitions
| Chinese term (translated) | English term | Description |
|---|---|---|
| Fault | Error | An error event that has occurred. |
| Anomaly | Anomaly | An abnormal fluctuation based on historical data. |
| Water level | Saturation | A resource is approaching its capacity limit. |
| Change | Amend | A change in the environment or code. |
| Failure | Failure | A severe or complete degradation. |
| Alert | Custom Alert | A user-defined alert. |
APM domain
Built-in rules in the application performance monitoring (APM) domain apply to application services connected through ARMS agents or open-source agents.
Supported entity types
| Entity type | Description |
|---|---|
| Application service (apm.service) | The entire microservice application. |
| Interface (apm.operation) | An API interface exposed by the service. |
| Instance (apm.instance) | A specific running instance of the service. |
Summary of APM domain rules
Error & Anomaly rules (7)
| Alert rule name | Rule type | Rule name | Detection logic | Default threshold | Level |
|---|---|---|---|---|---|
| [Health Rule] error_ratio_threshold_critical | Error | Error rate exceeds threshold | Error rate > Threshold and QPS > 0.1 | 10% | P1 |
| [Health Rule] error_ratio_compare | Anomaly | Abnormal day-over-day error rate | Day-over-day error rate increase > Threshold and QPS > 0.1 | 100% | P1 |
| [Health Rule] latency_avg_threshold_critical | Error | Average latency exceeds threshold | Average latency > Threshold and QPS > 0.1 | 3 seconds | P1 |
| [Health Rule] latency_avg_compare | Anomaly | Abnormal day-over-day average latency | Day-over-day average latency increase > Threshold and QPS > 0.1 | 50% | P1 |
| [Health Rule] request_rate_compare | Anomaly | Abnormal day-over-day request rate | Day-over-day change in request rate > Threshold | 50% | P1 |
| [Health Rule] http_5xx_threshold_and_compare_critical | Error | HTTP 5xx count exceeds threshold and increases day-over-day | 5xx count > Threshold and day-over-day increase > 50% | 100 per 5 minutes | P1 |
| [Health Rule] exception_count_threshold_and_compare_critical | Error | Exception count exceeds threshold | Exception count > Threshold and day-over-day increase > 50% | 100 per 5 minutes | P1 |
Saturation rules (5)
| Alert rule name | Rule type | Rule name | Detection logic | Default threshold | Level |
|---|---|---|---|---|---|
| [Health Rule] jvm_fullgc_count_threshold | Saturation | Full GC count exceeds threshold | Full GC count > Threshold | 3 per 5 minutes | P1 |
| [Health Rule] jvm_gc_total_duration | Saturation | Total GC duration exceeds threshold | Total GC duration > Threshold | 10 seconds per 5 minutes | P1 |
| [Health Rule] jvm_abnormal_thread_count_threshold_and_compare_critical | Saturation | Abnormal JVM thread count exceeds threshold | Abnormal thread count > Threshold and day-over-day increase > 100% | 5 | P1 |
| [Health Rule] cpu_usage_threshold_critical | Saturation | CPU usage exceeds threshold | CPU usage > Threshold | 70% | P1 |
| [Health Rule] memory_usage_threshold_critical | Saturation | Memory usage exceeds threshold | Memory usage > Threshold | 85% | P1 |
Details of Error & Anomaly rules
Error & Anomaly rules detect service health during request processing. They apply to APM and XTrace applications.
Error rate exceeds threshold
| Property | Value |
|---|---|
| Rule ID | |
| Rule name | Error rate exceeds threshold |
| Rule description | Monitors the application error rate to check service availability. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | Error rate > Threshold and QPS > 0.1 |
| Default threshold | Error rate: 10% |
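The detection logic above can be sketched as a small function. This is a minimal illustration, not the product's implementation: the function name and parameters are hypothetical, and it assumes that QPS is the request count averaged over the 5-minute data window, which the rule definition does not state explicitly.

```python
def error_rate_exceeds_threshold(total_requests, error_requests,
                                 window_seconds=300,
                                 threshold=0.10, min_qps=0.1):
    """Return True when error rate > threshold and QPS > min_qps.

    Hypothetical helper. Assumes QPS is the average request rate
    over the data window (300 seconds by default).
    """
    if total_requests <= 0:
        return False  # no traffic, nothing to evaluate
    qps = total_requests / window_seconds
    error_rate = error_requests / total_requests
    return error_rate > threshold and qps > min_qps


# 600 requests in 5 minutes (2 QPS) with 90 errors (15% error rate)
print(error_rate_exceeds_threshold(600, 90))
```

Both comparisons are strict, so an error rate of exactly 10% does not satisfy the condition in this sketch.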
Abnormal day-over-day error rate
| Property | Value |
|---|---|
| Rule ID | |
| Rule name | Abnormal day-over-day error rate |
| Rule description | Detects abnormal error rate fluctuations based on historical data to identify potential faults. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | Day-over-day error rate increase > Threshold and QPS > 0.1 |
| Default threshold | Day-over-day increase: 100% (doubled) |
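A relative increase against the previous day's value could be computed as in the sketch below. The helper names are hypothetical, and the handling of a zero baseline (treating any positive current value as unbounded growth) is an assumption, since the rule definition does not say how that case is evaluated.

```python
def day_over_day_increase(current, previous):
    """Relative increase of `current` over `previous` (1.0 means +100%)."""
    if previous > 0:
        return (current - previous) / previous
    # Assumption: with a zero baseline, any positive value counts as
    # unbounded growth; zero against zero counts as no change.
    return float("inf") if current > 0 else 0.0


def error_rate_compare_fires(current_rate, previous_rate, qps,
                             threshold=1.0, min_qps=0.1):
    """Day-over-day error rate increase > threshold and QPS > min_qps."""
    increase = day_over_day_increase(current_rate, previous_rate)
    return increase > threshold and qps > min_qps
```

With the default threshold of 100%, an error rate that rises from 2% to 5% (an increase of 150%) satisfies the condition, while a rise from 2% to 3% (an increase of 50%) does not.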
Average latency exceeds threshold
| Property | Value |
|---|---|
| Rule ID | |
| Rule name | Average latency exceeds threshold |
| Rule description | Monitors the average response time of an application to help ensure a good user experience. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | Average latency > Threshold and QPS > 0.1 |
| Default threshold | Average latency: 3 seconds |
Abnormal day-over-day average latency
| Property | Value |
|---|---|
| Rule ID | |
| Rule name | Abnormal day-over-day average latency |
| Rule description | Detects abnormal response time fluctuations based on historical data to pinpoint performance degradation. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | Day-over-day average latency increase > Threshold and QPS > 0.1 |
| Default threshold | Day-over-day increase: 50% |
Abnormal day-over-day request rate
| Property | Value |
|---|---|
| Rule ID | |
| Rule name | Abnormal day-over-day request rate |
| Rule description | Detects abnormal traffic fluctuations based on historical data to identify capacity risks. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | Day-over-day change in request rate > Threshold (increase or decrease) |
| Default threshold | Change: 50% |
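Unlike the error rate and latency comparisons, this rule reacts to changes in either direction. A minimal sketch of that symmetric check follows; the function names are hypothetical, and treating a zero baseline as unbounded change is an assumption not confirmed by the rule definition.

```python
def request_rate_change(current, previous):
    """Absolute relative change versus the previous day (0.5 means 50%)."""
    if previous > 0:
        return abs(current - previous) / previous
    # Assumption: traffic appearing on a zero baseline is unbounded change.
    return float("inf") if current > 0 else 0.0


def request_rate_compare_fires(current, previous, threshold=0.5):
    """Day-over-day change in request rate > threshold, up or down."""
    return request_rate_change(current, previous) > threshold
```

Under these assumptions, a jump from 100 to 160 requests per second and a drop from 100 to 40 both satisfy the condition, because each is a 60% change.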
HTTP 5xx count exceeds threshold and increases day-over-day
| Property | Value |
|---|---|
| Rule ID | |
| Rule name | HTTP 5xx count exceeds threshold and increases day-over-day |
| Rule description | Detects whether the HTTP 5xx server error count exceeds the threshold and shows an upward trend. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | 5xx error count > Threshold and day-over-day increase > 50% |
| Default threshold | 5xx error count: 100 per 5 minutes |
Exception count exceeds threshold
| Property | Value |
|---|---|
| Rule ID | |
| Rule name | Exception count exceeds threshold |
| Rule description | Detects whether the application exception count exceeds the threshold and shows an upward trend. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | Exception count > Threshold and day-over-day increase > 50% |
| Default threshold | Exception count: 100 per 5 minutes |
| Supported application types | APM applications only |
Details of Saturation rules
Saturation rules detect resource usage during service runtime to help identify resource bottlenecks and potential risks.
Full GC count exceeds threshold
| Property | Value |
|---|---|
| Rule ID | |
| Rule name | Full GC count exceeds threshold |
| Rule description | Detects whether the JVM Full GC frequency is abnormal, which may indicate memory pressure. |
| Applicable entity | Instance (apm.instance) |
| Severity level | P1 (Critical) |
| Detection logic | Full GC count > Threshold |
| Default threshold | 3 per 5 minutes |
| Supported application types | APM Java applications only |
Total GC duration exceeds threshold
| Property | Value |
|---|---|
| Rule ID | |
| Rule name | Total GC duration exceeds threshold |
| Rule description | Detects whether cumulative GC duration is excessive, which may cause GC pauses that affect performance. |
| Applicable entity | Instance (apm.instance) |
| Severity level | P1 (Critical) |
| Detection logic | Total GC duration > Threshold |
| Default threshold | 10 seconds per 5 minutes |
| Supported application types | APM Java applications only |
Abnormal JVM thread count exceeds threshold
| Property | Value |
|---|---|
| Rule ID | |
| Rule name | Abnormal JVM thread count exceeds threshold |
| Rule description | Detects whether the count of deadlocked or blocked JVM threads is abnormal, which may indicate thread resource issues. |
| Applicable entity | Instance (apm.instance) |
| Severity level | P1 (Critical) |
| Detection logic | (Deadlocked threads + Blocked threads) > Threshold and day-over-day increase > 100% |
| Default threshold | Abnormal thread count: 5 |
| Supported application types | APM Java applications only |
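The compound condition for this rule, an absolute threshold combined with a day-over-day increase, might look like the following sketch. The function and parameter names are hypothetical. Treating a zero baseline as satisfying the increase condition is an assumption: healthy instances often have no deadlocked or blocked threads on the previous day, and the rule definition does not describe that case.

```python
def abnormal_thread_rule_fires(deadlocked, blocked, previous_abnormal,
                               threshold=5, min_increase=1.0):
    """(deadlocked + blocked) > threshold and day-over-day increase > 100%."""
    abnormal = deadlocked + blocked
    if abnormal <= threshold:
        return False  # absolute condition not met
    if previous_abnormal <= 0:
        # Assumption: growth from a zero baseline counts as exceeding
        # any relative increase threshold.
        return True
    increase = (abnormal - previous_abnormal) / previous_abnormal
    return increase > min_increase
```

For example, 2 deadlocked and 6 blocked threads against 3 abnormal threads on the previous day gives a count of 8 and an increase of about 167%, so both parts of the condition hold.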
CPU usage exceeds threshold
| Property | Value |
|---|---|
| Rule ID | |
| Rule name | CPU usage exceeds threshold |
| Rule description | Detects whether the CPU usage of an instance is too high, indicating a computing resource bottleneck. |
| Applicable entity | Instance (apm.instance) |
| Severity level | P1 (Critical) |
| Detection logic | CPU usage > Threshold |
| Default threshold | 70% |
| Note | The application must report system metrics. |
Memory usage exceeds threshold
| Property | Value |
|---|---|
| Rule ID | |
| Rule name | Memory usage exceeds threshold |
| Rule description | Detects whether the memory usage of an instance is too high, indicating insufficient memory resources. |
| Applicable entity | Instance (apm.instance) |
| Severity level | P1 (Critical) |
| Detection logic | Memory usage > Threshold |
| Default threshold | 85% |
| Note | The application must report system metrics. |
Detection parameter descriptions
Time window
| Parameter | Value | Description |
|---|---|---|
| Detection cycle | 1 minute | The rule is executed once per minute. |
| Data window | 5 minutes | Data from the past 5 minutes is used for evaluation. |
| Duration | 1 minute | The condition must be met for 1 minute to trigger an event. |
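To illustrate how the three parameters interact, the sketch below evaluates a condition once per minute over a sliding 5-minute window. It is a simplified model, not the actual scheduler: the function name is hypothetical, samples are assumed to arrive as one value per minute, and the mapping of a 1-minute duration to one consecutive matching evaluation (`duration_cycles=1`) is an assumption.

```python
def sliding_window_events(samples, condition, data_window=5, duration_cycles=1):
    """Return the minute indices at which an event would be triggered.

    samples: one metric value per minute (hypothetical input shape).
    condition: callable that receives the values in the data window.
    duration_cycles: consecutive matching evaluations required (assumed).
    """
    events = []
    consecutive = 0
    for i in range(len(samples)):
        if i + 1 < data_window:
            continue  # not enough history to fill the data window yet
        window = samples[i + 1 - data_window:i + 1]
        consecutive = consecutive + 1 if condition(window) else 0
        if consecutive >= duration_cycles:
            events.append(i)
    return events


# Full GC counts per minute; the condition is "Full GC count > 3" in the window.
full_gc_per_minute = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0]
print(sliding_window_events(full_gc_per_minute, lambda w: sum(w) > 3))
# prints [5, 6, 7, 8, 9]
```

In this model a single burst keeps the condition true for as long as it stays inside the 5-minute data window, which is why one spike at minute 5 matches on five consecutive cycles. Whether repeated matches are merged into one event is not covered by this sketch.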
Day-over-day comparison baseline
Day-over-day comparison rules compare data with the same time period from the previous day by default:
- Current time window: The past 5 minutes.
- Comparison time window: The same 5-minute period from the previous day.
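The two windows can be derived from the evaluation time as in this minimal sketch. The function name is hypothetical, and representing each window as a (start, end) pair is an illustrative choice.

```python
from datetime import datetime, timedelta


def comparison_windows(now, data_window=timedelta(minutes=5)):
    """Return ((current_start, current_end), (previous_start, previous_end))."""
    current = (now - data_window, now)
    one_day = timedelta(days=1)
    # The comparison window is the same period shifted back by one day.
    previous = (current[0] - one_day, current[1] - one_day)
    return current, previous


current, previous = comparison_windows(datetime(2024, 1, 2, 10, 5))
print(current[0], "to", current[1])    # 2024-01-02 10:00:00 to 2024-01-02 10:05:00
print(previous[0], "to", previous[1])  # 2024-01-01 10:00:00 to 2024-01-01 10:05:00
```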
QPS filtering
Request-based rules include a QPS > 0.1 filter to prevent false positives on low-traffic interfaces. When request volume is very low, a high error rate may reflect a few isolated failures rather than a statistically significant problem.
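As a rough worked example, assume that QPS is averaged over the 5-minute (300-second) data window, which the parameter descriptions suggest but do not state. The filter then corresponds to a minimum request count per window:

```latex
N_{\min} = 0.1\ \text{requests/s} \times 300\ \text{s} = 30\ \text{requests}
```

Under that assumption, an interface must receive more than 30 requests in the window before a request-based rule can fire. An interface with 1 failure out of 2 requests has a 50% error rate, but its QPS is about 0.0067, so the rule is not evaluated as a match.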