Built-in health rules

Cloud Monitor 2.0 provides built-in health inspection rules that cover common health issue detection scenarios and work out of the box.

Overview

Severity level definitions

The severity levels of built-in rules align with alert severity levels:

| Level | Name | Meaning |
| --- | --- | --- |
| P1 | Critical | A critical issue that requires immediate attention. |
| P2 | Error | An error that requires prompt handling. |
| P3 | Warning | A warning that requires attention. |
| P4 | Normal (Info) | An informational message for reference. |

Rule type definitions

| Chinese name (translated) | English name | Description |
| --- | --- | --- |
| Fault | Error | An error event that has occurred. |
| Anomaly | Anomaly | An abnormal fluctuation based on historical data. |
| Water level | Saturation | A resource is approaching its capacity limit. |
| Change | Amend | A change in the environment or code. |
| Failure | Failure | A severe or complete degradation. |
| Alert | Custom Alert | A user-defined alert. |

APM domain

Built-in rules in the application performance monitoring (APM) domain apply to application services connected through ARMS agents or open-source agents.

Supported entity types

| Entity type | Description |
| --- | --- |
| Application service (apm.service) | The entire microservice application. |
| Interface (apm.operation) | An API interface exposed by the service. |
| Instance (apm.instance) | A specific running instance of the service. |

Summary of APM domain rules

Error & Anomaly rules (7)

| Alert rule name | Rule type | Rule name | Detection logic | Default threshold | Level |
| --- | --- | --- | --- | --- | --- |
| [Health Rule] error_ratio_threshold_critical | Error | Error rate exceeds threshold | Error rate > Threshold and QPS > 0.1 | 10% | P1 |
| [Health Rule] error_ratio_compare | Anomaly | Abnormal day-over-day error rate | Day-over-day error rate increase > Threshold and QPS > 0.1 | 100% | P1 |
| [Health Rule] latency_avg_threshold_critical | Error | Average latency exceeds threshold | Average latency > Threshold and QPS > 0.1 | 3 seconds | P1 |
| [Health Rule] latency_avg_compare | Anomaly | Abnormal day-over-day average latency | Day-over-day average latency increase > Threshold and QPS > 0.1 | 50% | P1 |
| [Health Rule] request_rate_compare | Anomaly | Abnormal day-over-day request rate | Day-over-day change in request rate > Threshold | 50% | P1 |
| [Health Rule] http_5xx_threshold_and_compare_critical | Error | HTTP 5xx count exceeds threshold and increases day-over-day | 5xx count > Threshold and day-over-day increase > 50% | 100 per 5 minutes | P1 |
| [Health Rule] exception_count_threshold_and_compare_critical | Error | Exception count exceeds threshold | Exception count > Threshold and day-over-day increase > 50% | 100 per 5 minutes | P1 |

Saturation rules (5)

| Alert rule name | Rule type | Rule name | Detection logic | Default threshold | Level |
| --- | --- | --- | --- | --- | --- |
| [Health Rule] jvm_fullgc_count_threshold | Saturation | Full GC count exceeds threshold | Full GC count > Threshold | 3 per 5 minutes | P1 |
| [Health Rule] jvm_gc_total_duration | Saturation | Total GC duration exceeds threshold | Total GC duration > Threshold | 10 seconds per 5 minutes | P1 |
| [Health Rule] jvm_abnormal_thread_count_threshold_and_compare_critical | Saturation | Abnormal JVM thread count exceeds threshold | Abnormal thread count > Threshold and day-over-day increase > 100% | 5 | P1 |
| [Health Rule] cpu_usage_threshold_critical | Saturation | CPU usage exceeds threshold | CPU usage > Threshold | 70% | P1 |
| [Health Rule] memory_usage_threshold_critical | Saturation | Memory usage exceeds threshold | Memory usage > Threshold | 85% | P1 |

Details of Error & Anomaly rules

Error & Anomaly rules detect service health during request processing. They apply to APM and XTrace applications.

Error rate exceeds threshold

| Property | Value |
| --- | --- |
| Rule ID | error_ratio_threshold_critical |
| Rule name | Error rate exceeds threshold |
| Rule description | Monitors the application error rate to check service availability. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | Error rate > Threshold and QPS > 0.1 |
| Default threshold | Error rate: 10% |
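
The detection logic in this table can be sketched in a few lines. This is only an illustration of the documented condition, not the product's implementation: the function name and parameters are hypothetical, and returning False when there are no requests is an assumption.

```python
def error_rate_exceeded(error_count, request_count, qps,
                        threshold=0.10, min_qps=0.1):
    """Return True when error rate > threshold and QPS > min_qps.

    Hypothetical helper. The defaults mirror the documented values
    (10% error rate, QPS > 0.1).
    """
    if request_count <= 0:
        # Assumption: with no requests there is no error rate to evaluate.
        return False
    error_rate = error_count / request_count
    # Both comparisons are strict, matching the ">" in the detection logic.
    return error_rate > threshold and qps > min_qps
```

Because both comparisons are strict, an error rate of exactly 10% does not match the condition.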

Abnormal day-over-day error rate

| Property | Value |
| --- | --- |
| Rule ID | error_ratio_compare |
| Rule name | Abnormal day-over-day error rate |
| Rule description | Detects abnormal error rate fluctuations based on historical data to identify potential faults. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | Day-over-day error rate increase > Threshold and QPS > 0.1 |
| Default threshold | Day-over-day increase: 100% (doubled) |

Average latency exceeds threshold

| Property | Value |
| --- | --- |
| Rule ID | latency_avg_threshold_critical |
| Rule name | Average latency exceeds threshold |
| Rule description | Monitors the average response time of an application to help ensure a good user experience. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | Average latency > Threshold and QPS > 0.1 |
| Default threshold | Average latency: 3 seconds |

Abnormal day-over-day average latency

| Property | Value |
| --- | --- |
| Rule ID | latency_avg_compare |
| Rule name | Abnormal day-over-day average latency |
| Rule description | Detects abnormal response time fluctuations based on historical data to pinpoint performance degradation. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | Day-over-day average latency increase > Threshold and QPS > 0.1 |
| Default threshold | Day-over-day increase: 50% |

Abnormal day-over-day request rate

| Property | Value |
| --- | --- |
| Rule ID | request_rate_compare |
| Rule name | Abnormal day-over-day request rate |
| Rule description | Detects abnormal traffic fluctuations based on historical data to identify capacity risks. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | Day-over-day change in request rate > Threshold (increase or decrease) |
| Default threshold | Change: 50% |
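
Unlike the other comparison rules, this one matches on a change in either direction. A minimal sketch of that check follows, assuming the change is measured relative to the previous day's value. The function name is hypothetical, and returning False for a zero baseline is an assumption, since the relative change is undefined in that case.

```python
def request_rate_changed(current_rate, previous_rate, threshold=0.5):
    """Return True when the relative day-over-day change exceeds threshold.

    Hypothetical helper. The default threshold mirrors the documented 50%.
    """
    if previous_rate <= 0:
        # Assumption: no baseline traffic means no comparison is possible.
        return False
    # abs() makes the check symmetric: a surge and a drop both count.
    change = abs(current_rate - previous_rate) / previous_rate
    return change > threshold
```

With a baseline of 100 requests per second, both 160 and 40 are a 60% change and satisfy the condition, while 120 does not.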

HTTP 5xx count exceeds threshold and increases day-over-day

| Property | Value |
| --- | --- |
| Rule ID | http_5xx_threshold_and_compare_critical |
| Rule name | HTTP 5xx count exceeds threshold and increases day-over-day |
| Rule description | Detects whether the HTTP 5xx server error count exceeds the threshold and shows an upward trend. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | 5xx error count > Threshold and day-over-day increase > 50% |
| Default threshold | 5xx error count: 100 per 5 minutes |

Exception count exceeds threshold

| Property | Value |
| --- | --- |
| Rule ID | exception_count_threshold_and_compare_critical |
| Rule name | Exception count exceeds threshold |
| Rule description | Detects whether the application exception count exceeds the threshold and shows an upward trend. |
| Applicable entity | Interface (apm.operation) |
| Severity level | P1 (Critical) |
| Detection logic | Exception count > Threshold and day-over-day increase > 50% |
| Default threshold | Exception count: 100 per 5 minutes |
| Supported application types | APM applications only |
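
The HTTP 5xx and exception count rules combine an absolute threshold with a day-over-day increase, so a high but stable count does not match. The sketch below illustrates that combined condition under stated assumptions: the function name is hypothetical, the increase is computed relative to the previous day's count, and a zero baseline is treated as not comparable.

```python
def count_threshold_and_compare(current_count, previous_count,
                                count_threshold=100,
                                increase_threshold=0.5):
    """Return True when count > threshold AND day-over-day increase > 50%.

    Hypothetical helper. The defaults mirror the documented values
    (100 per 5 minutes, 50% increase).
    """
    if current_count <= count_threshold:
        return False
    if previous_count <= 0:
        # Assumption: without a baseline the increase cannot be computed.
        return False
    increase = (current_count - previous_count) / previous_count
    return increase > increase_threshold
```

For example, 200 errors against a baseline of 100 matches the condition, whereas 200 against a baseline of 180 does not, because the increase is only about 11%.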

Details of Saturation rules

Saturation rules detect resource usage during service runtime to help identify resource bottlenecks and potential risks.

Full GC count exceeds threshold

| Property | Value |
| --- | --- |
| Rule ID | jvm_fullgc_count_threshold |
| Rule name | Full GC count exceeds threshold |
| Rule description | Detects whether the JVM Full GC frequency is abnormal, which may indicate memory pressure. |
| Applicable entity | Instance (apm.instance) |
| Severity level | P1 (Critical) |
| Detection logic | Full GC count > Threshold |
| Default threshold | 3 per 5 minutes |
| Supported application types | APM Java applications only |

Total GC duration exceeds threshold

| Property | Value |
| --- | --- |
| Rule ID | jvm_gc_total_duration |
| Rule name | Total GC duration exceeds threshold |
| Rule description | Detects whether cumulative GC duration is excessive, which may cause GC pauses that affect performance. |
| Applicable entity | Instance (apm.instance) |
| Severity level | P1 (Critical) |
| Detection logic | Total GC duration > Threshold |
| Default threshold | 10 seconds per 5 minutes |
| Supported application types | APM Java applications only |

Abnormal JVM thread count exceeds threshold

| Property | Value |
| --- | --- |
| Rule ID | jvm_abnormal_thread_count_threshold_and_compare_critical |
| Rule name | Abnormal JVM thread count exceeds threshold |
| Rule description | Detects whether the count of deadlocked or blocked JVM threads is abnormal, which may indicate thread resource issues. |
| Applicable entity | Instance (apm.instance) |
| Severity level | P1 (Critical) |
| Detection logic | (Deadlocked threads + Blocked threads) > Threshold and day-over-day increase > 100% |
| Default threshold | Abnormal thread count: 5 |
| Supported application types | APM Java applications only |

CPU usage exceeds threshold

| Property | Value |
| --- | --- |
| Rule ID | cpu_usage_threshold_critical |
| Rule name | CPU usage exceeds threshold |
| Rule description | Detects whether the CPU usage of an instance is too high, indicating a computing resource bottleneck. |
| Applicable entity | Instance (apm.instance) |
| Severity level | P1 (Critical) |
| Detection logic | CPU usage > Threshold |
| Default threshold | 70% |
| Note | The application must report system metrics. |

Memory usage exceeds threshold

| Property | Value |
| --- | --- |
| Rule ID | memory_usage_threshold_critical |
| Rule name | Memory usage exceeds threshold |
| Rule description | Detects whether the memory usage of an instance is too high, indicating insufficient memory resources. |
| Applicable entity | Instance (apm.instance) |
| Severity level | P1 (Critical) |
| Detection logic | Memory usage > Threshold |
| Default threshold | 85% |
| Note | The application must report system metrics. |

Detection parameter descriptions

Time window

| Parameter | Value | Description |
| --- | --- | --- |
| Detection cycle | 1 minute | The rule is executed once per minute. |
| Data window | 5 minutes | Data from the past 5 minutes is used for evaluation. |
| Duration | 1 minute | The condition must be met for 1 minute to trigger an event. |
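
One way to picture the interaction between the 1-minute detection cycle and the 5-minute data window is a sliding window that is refreshed every minute. The sketch below assumes metrics arrive as per-minute counts, which this page does not state. The class and method names are hypothetical, and the Duration parameter is not modeled.

```python
from collections import deque


class SlidingWindow:
    """Keeps the most recent per-minute counts (hypothetical helper)."""

    def __init__(self, window_minutes=5):
        # deque with maxlen drops the oldest sample automatically.
        self._samples = deque(maxlen=window_minutes)

    def add_minute(self, count):
        """Record the count observed during the latest minute."""
        self._samples.append(count)

    def total(self):
        """Sum of the counts currently inside the data window."""
        return sum(self._samples)


window = SlidingWindow()
for minute_count in [1, 2, 3, 4, 5, 6, 7]:
    window.add_minute(minute_count)   # one detection cycle per minute
print(window.total())                 # sums only the last five minutes
```

After seven minutes, only the last five samples (3, 4, 5, 6, 7) remain in the window, so each evaluation sees a count "per 5 minutes" as used in the default thresholds.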

Day-over-day comparison baseline

Day-over-day comparison rules compare data with the same time period from the previous day by default:

  • Current time window: The past 5 minutes.

  • Comparison time window: The same 5-minute period from the previous day.
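
The two windows can be derived from the evaluation time as follows. This is a minimal sketch: the function name and its parameters are hypothetical, and it assumes the comparison window is the current window shifted back by exactly 24 hours.

```python
from datetime import datetime, timedelta


def comparison_windows(now, window=timedelta(minutes=5),
                       offset=timedelta(days=1)):
    """Return ((current_start, current_end), (baseline_start, baseline_end)).

    Hypothetical helper: the baseline is the current window moved back by
    `offset` (one day by default).
    """
    current = (now - window, now)
    baseline = (current[0] - offset, current[1] - offset)
    return current, baseline


current, baseline = comparison_windows(datetime(2024, 1, 2, 12, 0))
print(current)   # 11:55 to 12:00 on 2 January
print(baseline)  # 11:55 to 12:00 on 1 January
```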

QPS filtering

Request-based rules include a QPS > 0.1 filter to prevent false positives for low-traffic interfaces. When request volume is very low, a high error rate may be an isolated incident rather than statistically significant.
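
To make the filter concrete, the arithmetic below assumes QPS is the average over the 5-minute data window (requests divided by 300 seconds). That averaging method is an assumption, and the names are hypothetical.

```python
WINDOW_SECONDS = 5 * 60  # 5-minute data window
MIN_QPS = 0.1            # documented QPS filter


def average_qps(request_count, window_seconds=WINDOW_SECONDS):
    """Average requests per second over the window (assumed definition)."""
    return request_count / window_seconds


def passes_qps_filter(request_count):
    """Return True when the average QPS is strictly greater than MIN_QPS."""
    return average_qps(request_count) > MIN_QPS
```

Under this assumption, an interface needs more than 30 requests in the window to be evaluated. An interface with 2 requests, 1 of which failed, shows a 50% error rate but an average QPS below 0.01, so the filter excludes it.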