Built-in health rules - Cloud Monitor (CMS) - Alibaba Cloud Help Center

Cloud Monitor 2.0 provides built-in health inspection rules that cover common health issue detection scenarios and work out of the box.

Overview

Severity level definitions

The severity levels of built-in rules align with alert severity levels:

Level	Name	Meaning
P1	Critical	A critical issue that requires immediate attention.
P2	Error	An error that requires prompt handling.
P3	Warning	A warning that requires attention.
P4	Normal (Info)	An informational message for reference.

Rule type definitions

Rule type	English name	Description
Fault	Error	An error event that has occurred.
Anomaly	Anomaly	An abnormal fluctuation based on historical data.
Water level	Saturation	A resource is approaching its capacity limit.
Change	Amend	A change in the environment or code.
Failure	Failure	A severe or complete degradation.
Alert	Custom Alert	A user-defined alert.

APM domain

Built-in rules in the application performance monitoring (APM) domain apply to application services connected through ARMS agents or open-source agents.

Supported entity types

Entity type	Description
Application service (apm.service)	The entire microservice application.
Interface (apm.operation)	An API interface exposed by the service.
Instance (apm.instance)	A specific running instance of the service.

Summary of APM domain rules

Error & Anomaly rules (7)

Alert rule name	Rule type	Rule name	Detection logic	Default threshold	Level
[Health Rule] error_ratio_threshold_critical	Error	Error rate exceeds threshold	Error rate > Threshold and QPS > 0.1	10%	P1
[Health Rule] error_ratio_compare	Anomaly	Abnormal day-over-day error rate	Day-over-day error rate increase > Threshold and QPS > 0.1	100%	P1
[Health Rule] latency_avg_threshold_critical	Error	Average latency exceeds threshold	Average latency > Threshold and QPS > 0.1	3 seconds	P1
[Health Rule] latency_avg_compare	Anomaly	Abnormal day-over-day average latency	Day-over-day average latency increase > Threshold and QPS > 0.1	50%	P1
[Health Rule] request_rate_compare	Anomaly	Abnormal day-over-day request rate	Day-over-day change in request rate > Threshold	50%	P1
[Health Rule] http_5xx_threshold_and_compare_critical	Error	HTTP 5xx count exceeds threshold and increases day-over-day	5xx count > Threshold and day-over-day increase > 50%	100 per 5 minutes	P1
[Health Rule] exception_count_threshold_and_compare_critical	Error	Exception count exceeds threshold	Exception count > Threshold and day-over-day increase > 50%	100 per 5 minutes	P1

Water level rules (5)

Alert rule name	Rule type	Rule name	Detection logic	Default threshold	Level
[Health Rule] jvm_fullgc_count_threshold	Water level	Full GC count exceeds threshold	Full GC count > Threshold	3 per 5 minutes	P1
[Health Rule] jvm_gc_total_duration	Water level	Total GC duration exceeds threshold	Total GC duration > Threshold	10 seconds per 5 minutes	P1
[Health Rule] jvm_abnormal_thread_count_threshold_and_compare_critical	Water level	Abnormal JVM thread count exceeds threshold	Abnormal thread count > Threshold and day-over-day increase > 100%	5	P1
[Health Rule] cpu_usage_threshold_critical	Water level	CPU usage exceeds threshold	CPU usage > Threshold	70%	P1
[Health Rule] memory_usage_threshold_critical	Water level	Memory usage exceeds threshold	Memory usage > Threshold	85%	P1

Details of Error & Anomaly rules

Error & Anomaly rules detect service health during request processing. They apply to APM and XTrace applications.

Error rate exceeds threshold

Property	Value
Rule ID	`error_ratio_threshold_critical`
Rule name	Error rate exceeds threshold
Rule description	Monitors the application error rate to check service availability.
Applicable entity	Interface (apm.operation)
Severity level	P1 (Critical)
Detection logic	Error rate > Threshold and QPS > 0.1
Default threshold	Error rate: 10%
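
The detection logic in this table can be sketched as follows. This is a minimal illustration, not the product's implementation: the function name `error_rate_breached` and its parameters are hypothetical, and it assumes that QPS is the request count averaged over the evaluation window.

```python
def error_rate_breached(error_count, request_count, window_seconds=300,
                        threshold=0.10, min_qps=0.1):
    """Return True when error rate > threshold and QPS > min_qps.

    Assumption: QPS is the request count averaged over the window.
    """
    if request_count <= 0 or window_seconds <= 0:
        return False  # no traffic, nothing to evaluate
    qps = request_count / window_seconds
    error_rate = error_count / request_count
    return error_rate > threshold and qps > min_qps


# 50 errors out of 400 requests in 5 minutes: 12.5% error rate, about 1.33 QPS.
print(error_rate_breached(50, 400))
# 1 error out of 2 requests: 50% error rate, but traffic is below the QPS filter.
print(error_rate_breached(1, 2))
```

Both comparisons are strict, so an error rate of exactly 10% does not match in this sketch.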

Abnormal day-over-day error rate

Property	Value
Rule ID	`error_ratio_compare`
Rule name	Abnormal day-over-day error rate
Rule description	Detects abnormal error rate fluctuations based on historical data to identify potential faults.
Applicable entity	Interface (apm.operation)
Severity level	P1 (Critical)
Detection logic	Day-over-day error rate increase > Threshold and QPS > 0.1
Default threshold	Day-over-day growth: 100% (doubled)
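
A day-over-day increase can be read as relative growth against the previous day's value. The sketch below illustrates that reading. The names `day_over_day_increase` and `error_ratio_compare_breached` are hypothetical, and the handling of a zero baseline (no comparison is made) is an assumption, because the source does not describe that case.

```python
def day_over_day_increase(current, previous):
    """Relative growth of current over previous, or None without a baseline."""
    if previous <= 0:
        return None  # assumption: a zero baseline yields no comparison
    return (current - previous) / previous


def error_ratio_compare_breached(current_rate, previous_rate, qps,
                                 threshold=1.0, min_qps=0.1):
    """True when growth > threshold (1.0 means +100%) and QPS > min_qps."""
    growth = day_over_day_increase(current_rate, previous_rate)
    return growth is not None and growth > threshold and qps > min_qps


# Error rate rose from 2% yesterday to 5% today at 5 QPS: growth of about 150%.
print(error_ratio_compare_breached(0.05, 0.02, qps=5))
```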

Average latency exceeds threshold

Property	Value
Rule ID	`latency_avg_threshold_critical`
Rule name	Average latency exceeds threshold
Rule description	Monitors the average response time of an application to help ensure a good user experience.
Applicable entity	Interface (apm.operation)
Severity level	P1 (Critical)
Detection logic	Average latency > Threshold and QPS > 0.1
Default threshold	Average latency: 3 seconds

Abnormal day-over-day average latency

Property	Value
Rule ID	`latency_avg_compare`
Rule name	Abnormal day-over-day average latency
Rule description	Detects abnormal response time fluctuations based on historical data to pinpoint performance degradation.
Applicable entity	Interface (apm.operation)
Severity level	P1 (Critical)
Detection logic	Day-over-day average latency increase > Threshold and QPS > 0.1
Default threshold	Day-over-day growth: 50%

Abnormal day-over-day request rate

Property	Value
Rule ID	`request_rate_compare`
Rule name	Abnormal day-over-day request rate
Rule description	Detects abnormal traffic fluctuations based on historical data to identify capacity risks.
Applicable entity	Interface (apm.operation)
Severity level	P1 (Critical)
Detection logic	Day-over-day change in request rate > Threshold (increase or decrease)
Default threshold	Change: 50%
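
Unlike the error rate and latency comparisons, this rule matches on a change in either direction. One way to express that is with the absolute relative change, as in the sketch below. The function name `request_rate_changed` is hypothetical, and skipping the check when the previous day's rate is zero is an assumption.

```python
def request_rate_changed(current_rate, previous_rate, threshold=0.5):
    """True when the request rate moved by more than threshold, up or down."""
    if previous_rate <= 0:
        return False  # assumption: no baseline, no comparison
    change = abs(current_rate - previous_rate) / previous_rate
    return change > threshold


print(request_rate_changed(160, 100))  # traffic surge of 60%
print(request_rate_changed(40, 100))   # traffic drop of 60%
print(request_rate_changed(120, 100))  # 20% change, within the threshold
```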

HTTP 5xx count exceeds threshold and increases day-over-day

Property	Value
Rule ID	`http_5xx_threshold_and_compare_critical`
Rule name	HTTP 5xx count exceeds threshold and increases day-over-day
Rule description	Detects whether the HTTP 5xx server error count exceeds the threshold and shows an upward trend.
Applicable entity	Interface (apm.operation)
Severity level	P1 (Critical)
Detection logic	5xx error count > Threshold and day-over-day increase > 50%
Default threshold	5xx error count: 100 per 5 minutes
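
This rule combines an absolute condition with a growth condition, and both must hold. The sketch below shows one possible reading. The function name `count_and_growth_breached` is hypothetical, and treating a zero baseline as "no match" is an assumption that the source does not confirm.

```python
def count_and_growth_breached(current_count, previous_count,
                              count_threshold=100, growth_threshold=0.5):
    """True when count > count_threshold and day-over-day growth > growth_threshold."""
    if current_count <= count_threshold:
        return False
    if previous_count <= 0:
        return False  # assumption: growth is undefined without a baseline
    growth = (current_count - previous_count) / previous_count
    return growth > growth_threshold


# 200 5xx responses in the window versus 100 in the same window yesterday.
print(count_and_growth_breached(200, 100))
# 120 versus 100: above the count threshold, but growth is only 20%.
print(count_and_growth_breached(120, 100))
```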

Exception count exceeds threshold

Property	Value
Rule ID	`exception_count_threshold_and_compare_critical`
Rule name	Exception count exceeds threshold
Rule description	Detects whether the application exception count exceeds the threshold and shows an upward trend.
Applicable entity	Interface (apm.operation)
Severity level	P1 (Critical)
Detection logic	Exception count > Threshold and day-over-day increase > 50%
Default threshold	Exception count: 100 per 5 minutes
Supported application types	APM applications only

Details of Water level rules

Water level (Saturation) rules monitor resource usage while a service is running to help identify resource bottlenecks and potential risks.

Full GC count exceeds threshold

Property	Value
Rule ID	`jvm_fullgc_count_threshold`
Rule name	Full GC count exceeds threshold
Rule description	Detects whether the JVM Full GC frequency is abnormal, which may indicate memory pressure.
Applicable entity	Instance (apm.instance)
Severity level	P1 (Critical)
Detection logic	Full GC count > Threshold
Default threshold	3 per 5 minutes
Supported application types	APM Java applications only

Total GC duration exceeds threshold

Property	Value
Rule ID	`jvm_gc_total_duration`
Rule name	Total GC duration exceeds threshold
Rule description	Detects whether cumulative GC duration is excessive, which may cause GC pauses that affect performance.
Applicable entity	Instance (apm.instance)
Severity level	P1 (Critical)
Detection logic	Total GC duration > Threshold
Default threshold	10 seconds per 5 minutes
Supported application types	APM Java applications only

Abnormal JVM thread count exceeds threshold

Property	Value
Rule ID	`jvm_abnormal_thread_count_threshold_and_compare_critical`
Rule name	Abnormal JVM thread count exceeds threshold
Rule description	Detects whether the count of deadlocked or blocked JVM threads is abnormal, which may indicate thread resource issues.
Applicable entity	Instance (apm.instance)
Severity level	P1 (Critical)
Detection logic	(Deadlocked threads + Blocked threads) > Threshold and day-over-day increase > 100%
Default threshold	Abnormal thread count: 5
Supported application types	APM Java applications only

CPU usage exceeds threshold

Property	Value
Rule ID	`cpu_usage_threshold_critical`
Rule name	CPU usage exceeds threshold
Rule description	Detects whether the CPU usage of an instance is too high, indicating a computing resource bottleneck.
Applicable entity	Instance (apm.instance)
Severity level	P1 (Critical)
Detection logic	CPU usage > Threshold
Default threshold	70%
Note	The application must report system metrics.

Memory usage exceeds threshold

Property	Value
Rule ID	`memory_usage_threshold_critical`
Rule name	Memory usage exceeds threshold
Rule description	Detects whether the memory usage of an instance is too high, indicating insufficient memory resources.
Applicable entity	Instance (apm.instance)
Severity level	P1 (Critical)
Detection logic	Memory usage > Threshold
Default threshold	85%
Note	The application must report system metrics.

Detection parameter descriptions

Time window

Parameter	Value	Description
Detection cycle	1 minute	The rule is executed once per minute.
Data window	5 minutes	Data from the past 5 minutes is used for evaluation.
Duration	1 minute	The condition must be met for 1 minute to trigger an event.
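
The interaction between the detection cycle and the data window can be pictured as a sliding window that advances one minute at a time. The sketch below assumes the metric arrives as one count per minute and that each evaluation sums the trailing five minutes. The function name `sliding_window_sums` is hypothetical, and the Duration parameter is not modeled here.

```python
from collections import deque


def sliding_window_sums(per_minute_counts, window_minutes=5):
    """Return, for each one-minute evaluation, the sum over the trailing window."""
    window = deque(maxlen=window_minutes)  # old minutes fall out automatically
    sums = []
    for count in per_minute_counts:
        window.append(count)
        sums.append(sum(window))
    return sums


# Full GC counts per minute, checked against the default threshold of 3.
full_gc_per_minute = [0, 1, 1, 1, 1, 0, 0]
totals = sliding_window_sums(full_gc_per_minute)
print(totals)
print([total > 3 for total in totals])
```

Because consecutive windows overlap by four minutes, a single burst can keep the condition true for several evaluations in a row.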

Day-over-day comparison baseline

By default, day-over-day comparison rules compare current data with the same time period on the previous day:

Current time window: The past 5 minutes.
Comparison time window: The same 5-minute period from the previous day.
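
The two windows above can be computed with standard date arithmetic. This is a minimal sketch: the function name `comparison_windows` is hypothetical, and whether the product aligns windows to minute boundaries is not stated in the source.

```python
from datetime import datetime, timedelta


def comparison_windows(now, window=timedelta(minutes=5), offset=timedelta(days=1)):
    """Return (current, baseline) windows as (start, end) datetime pairs."""
    current = (now - window, now)
    baseline = (current[0] - offset, current[1] - offset)
    return current, baseline


current, baseline = comparison_windows(datetime(2024, 5, 2, 12, 0))
print(current)   # 11:55 to 12:00 on 2024-05-02
print(baseline)  # 11:55 to 12:00 on 2024-05-01
```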

QPS filtering

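Request-based rules include a QPS > 0.1 filter to prevent false positives on low-traffic interfaces. When request volume is very low, a single failed request can produce a high error rate that is not statistically significant. If QPS is averaged over the 5-minute data window, the filter corresponds to more than 30 requests in the window.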