Application anomaly detection rules

This topic describes application anomaly detection rules to help you configure your application settings effectively and maintain system stability.

CPU anomaly detection rules

Data field descriptions

  • avg_cpu_used: Average CPU usage (in millicores; 1 core = 1000 millicores).

  • p95_cpu_used: P95 percentile of CPU usage (in millicores).

  • p99_cpu_used: P99 percentile of CPU usage (in millicores).

  • cpu_request: CPU request for the Kubernetes pod (resource assurance baseline).

  • cpu_limit: CPU limit for the Kubernetes pod (upper bound of resource usage).

Detection rule details

1. Limit proximity risk

Trigger condition: p99_cpu_used > cpu_limit × 0.8

Anomaly type: Throttling risk

Scoring formula: min(100, 70 + (p99_cpu_used/cpu_limit - 0.8) × 500)

Severity levels:

  • ≥90 → Critical (direct kernel-level throttling risk)

  • 80–89 → High (adjust limit urgently)

  • 70–79 → Medium (monitor recommended)

Rule explanation

Design principle: Based on the Kubernetes CPU throttling mechanism (CFS quota), when a container’s CPU usage consistently approaches its limit, kernel-level throttling occurs. The P99 value reflects resource consumption under extreme conditions. The 80% threshold aligns with the "saturation" metric from Google SRE’s Golden Signals.

Example: When cpu_limit=2000m (2 cores) and p99_cpu_used=1700m:

  • Usage ratio: 1700/2000 = 0.85

  • Score = 70 + (0.85 – 0.8) × 500 = 70 + 25 = 95 → Critical
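
The check can be sketched in a few lines of Python. The function name, the return shape, and the rounding to six decimals (used only to keep floating-point noise away from the tier boundaries) are illustrative assumptions, not part of the product.

```python
def cpu_limit_proximity(p99_cpu_used, cpu_limit):
    """Return (score, severity), or None when the rule does not trigger.

    Both inputs are in millicores. Hypothetical helper for illustration.
    """
    ratio = p99_cpu_used / cpu_limit
    if ratio <= 0.8:  # trigger condition: p99_cpu_used > cpu_limit × 0.8
        return None
    # Rounding is an assumption that avoids float noise at tier boundaries.
    score = round(min(100.0, 70 + (ratio - 0.8) * 500), 6)
    if score >= 90:
        severity = "Critical"
    elif score >= 80:
        severity = "High"
    else:
        severity = "Medium"
    return score, severity


# Inputs taken from the example: cpu_limit=2000m, p99_cpu_used=1700m.
print(cpu_limit_proximity(1700, 2000))
```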

Recommended actions:

  1. Immediately increase the limit by at least 25% (e.g., from 2000m to 2500m).

  2. Check node resource availability using kubectl describe node.

2. Persistent resource shortage

Trigger condition: p95_cpu_used > cpu_request + (cpu_limit - cpu_request) × 0.2

Anomaly type: Resource shortage

Scoring formula: min(100, 80 + (p95_cpu_used - cpu_request) / (cpu_limit - cpu_request) × 20)

Severity levels:

  • ≥90 → Critical

  • 80–89 → High

Rule explanation

Design principle: In Kubernetes scheduling, the request is the guaranteed baseline, and the difference between limit and request provides elastic headroom. A 20% buffer (inspired by AWS ECS safety margins) helps detect when sustained demand exceeds the guaranteed baseline.

Example: When cpu_request=1000m, cpu_limit=3000m, and p95_cpu_used=1500m:

  • Buffer upper bound = 1000 + (3000 – 1000) × 0.2 = 1400m

  • 1500m > 1400m, so the rule triggers.

  • Score = 80 + (1500 – 1000)/(3000 – 1000) × 20 = 80 + 5 = 85 → High
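
As a rough sketch, the rule could be evaluated as follows. It reads the scoring formula literally, with p95_cpu_used minus cpu_request as the numerator. The function name and the decision to skip pods whose limit does not exceed their request (which would otherwise divide by zero) are assumptions made for this illustration.

```python
def cpu_persistent_shortage(p95_cpu_used, cpu_request, cpu_limit):
    """Return (score, severity), or None when the rule does not trigger.

    All inputs are in millicores. Hypothetical helper for illustration.
    """
    headroom = cpu_limit - cpu_request
    if headroom <= 0:
        return None  # assumption: no elastic headroom, so the rule is skipped
    buffer_upper = cpu_request + headroom * 0.2
    if p95_cpu_used <= buffer_upper:
        return None
    # Literal reading of the formula: numerator is p95_cpu_used - cpu_request.
    score = round(min(100.0, 80 + (p95_cpu_used - cpu_request) / headroom * 20), 6)
    severity = "Critical" if score >= 90 else "High"
    return score, severity


print(cpu_persistent_shortage(1500, 1000, 3000))
```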

Recommended actions:

  1. Increase the request to above 1400m.

  2. Verify that the pod’s QoS class is Burstable.

3. Burst load risk

Trigger condition: (p95_cpu_used - avg_cpu_used)/avg_cpu_used > 4

Anomaly type: Load fluctuation

Scoring formula: 60 + 20 × (fluctuation_rate - 4)

Fixed severity: Medium

Rule explanation

Design principle: Empirical data shows that P95 usage exceeding the average by more than 4× the average (that is, P95 above 5× the average) indicates abnormal traffic bursts. This rule uses a fixed Medium severity to flag the workload for attention without requiring immediate action.

Example: When avg_cpu_used=200m and p95_cpu_used=1000m:

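  • Fluctuation rate = (1000 – 200)/200 = 4, which sits exactly on the threshold. Because the condition is strictly greater than 4, the rule triggers only once p95_cpu_used rises above 1000m.

  • At p95_cpu_used=1100m: fluctuation rate = (1100 – 200)/200 = 4.5, and score = 60 + 20 × (4.5 – 4) = 70 → Medium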
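
The strict inequality in the trigger condition is easy to overlook, so a small sketch may help. The function name and the guard for a non-positive average are assumptions added for illustration.

```python
def burst_load(avg_cpu_used, p95_cpu_used):
    """Return (score, severity), or None when the rule does not trigger.

    Inputs are in millicores. Hypothetical helper for illustration.
    """
    if avg_cpu_used <= 0:
        return None  # assumption: no measurable average usage, skip the rule
    fluctuation_rate = (p95_cpu_used - avg_cpu_used) / avg_cpu_used
    if fluctuation_rate <= 4:  # the condition is strictly "> 4"
        return None
    score = 60 + 20 * (fluctuation_rate - 4)
    return score, "Medium"  # severity is fixed for this rule


print(burst_load(200, 1000))  # fluctuation rate is exactly 4
print(burst_load(200, 1100))  # fluctuation rate is 4.5
```

A fluctuation rate of exactly 4 returns None here, which matches the strict comparison in the trigger condition.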

Recommended actions:

Configure Horizontal Pod Autoscaler (HPA) for automatic scaling and add a pre-stop hook to smooth traffic.

4. Resource over-provisioning

Trigger condition: avg_cpu_used < cpu_request × 0.4

Anomaly type: Resource waste

Scoring formula: 70 - (avg_cpu_used/cpu_request) × 10

Fixed severity: Medium

Rule explanation

Design principle: Empirical best practices suggest that average usage below 40% of the request indicates waste. Google Cloud recommends keeping request utilization between 40% and 70%.

Example: When cpu_request=4000m and avg_cpu_used=1500m:

  • Utilization = 1500/4000 = 0.375

  • Score = 70 – 0.375 × 10 = 66.25 → Medium
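
A minimal sketch of the same calculation follows; the function name is hypothetical and the code assumes cpu_request is greater than zero.

```python
def cpu_over_provisioning(avg_cpu_used, cpu_request):
    """Return (score, severity), or None when the rule does not trigger.

    Inputs are in millicores; cpu_request is assumed to be positive.
    Hypothetical helper for illustration.
    """
    utilization = avg_cpu_used / cpu_request
    if utilization >= 0.4:  # trigger condition: avg_cpu_used < cpu_request × 0.4
        return None
    return 70 - utilization * 10, "Medium"  # severity is fixed for this rule


print(cpu_over_provisioning(1500, 4000))
```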

Recommended actions:

  1. Gradually reduce the request in steps (no more than 20% per adjustment).

  2. Switch to Guaranteed QoS class.

5. Unreasonable limit configuration

Trigger condition: cpu_limit/cpu_request > 4

Anomaly type: Configuration risk

Scoring formula: 80 + (cpu_limit/cpu_request - 4) × 5

Severity levels:

  • ≥90 → Critical

  • 80–89 → High

Rule explanation

Design principle: Empirical data shows that a limit-to-request ratio greater than 4:1 (production environments should aim for 2:1) can cause node resource fragmentation and violate security standards such as PCI-DSS.

Example: When cpu_request=500m and cpu_limit=2500m:

  • Ratio = 2500/500 = 5

  • Score = 80 + (5 – 4) × 5 = 85 → High
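
The ratio check can be sketched as below. The function name is hypothetical, cpu_request is assumed to be positive, and the score is left uncapped because the formula as written has no min(100, …) wrapper.

```python
def cpu_limit_ratio_risk(cpu_request, cpu_limit):
    """Return (score, severity), or None when the rule does not trigger.

    Inputs are in millicores; cpu_request is assumed to be positive.
    Hypothetical helper for illustration.
    """
    ratio = cpu_limit / cpu_request
    if ratio <= 4:  # trigger condition: cpu_limit/cpu_request > 4
        return None
    score = 80 + (ratio - 4) * 5  # no upper cap, as in the formula as written
    severity = "Critical" if score >= 90 else "High"
    return score, severity


print(cpu_limit_ratio_risk(500, 2500))
```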

Recommended actions:

  1. Set the limit to no more than 3× the request.

  2. Review namespace-level ResourceQuota limits.

Memory anomaly detection rules

Data field descriptions

  • max_heap_rate_based_request: Maximum JVM heap usage rate based on request (percentage; e.g., 7 means 7%).

  • mem_request: Memory request (MB; Kubernetes guaranteed amount).

  • mem_limit: Memory limit (MB; Kubernetes usage cap).

  • p95_mem_used_rate: P95 memory usage rate based on limit (a fraction of mem_limit; e.g., 0.88 means 88%).

  • p99_mem_used_rate: P99 memory usage rate based on limit (a fraction of mem_limit).

  • p95_mem_used: P95 memory usage (MB).

  • p99_mem_used: P99 memory usage (MB).

Detection rule details

1. Limit proximity risk (highest priority)

Trigger condition: p99_mem_used_rate > 0.85

Anomaly type: Out-of-memory (OOM) risk

Scoring formula: min(100, 80 + [(p99_mem_used_rate - 0.85) × 200])

Severity levels:

  • ≥90 → Critical (immediate OOM risk)

  • 80–89 → High (close to limit but not critical)

Rule explanation

Design principle: Based on Kubernetes OOMKill behavior (per Kubernetes official documentation), containers are terminated immediately when memory exceeds the limit. The 85% threshold accounts for extreme-case usage reflected by P99.

Example: When mem_limit=4096MB and p99_mem_used_rate=0.88:

  • Excess over threshold = 0.88 – 0.85 = 0.03

  • Score = 80 + 0.03 × 200 = 86 → High

  • If usage reaches 90% (rate = 0.9), score = 80 + 0.05 × 200 = 90 → Critical
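
Both cases in the example can be checked with a short sketch. The function name is hypothetical, and rounding the score to six decimals is an assumption made so that a rate of 0.9 lands exactly on the Critical boundary instead of depending on floating-point noise.

```python
def mem_limit_proximity(p99_mem_used_rate):
    """Return (score, severity), or None when the rule does not trigger.

    p99_mem_used_rate is a fraction of mem_limit (0.88 means 88%).
    Hypothetical helper for illustration.
    """
    if p99_mem_used_rate <= 0.85:
        return None
    # Rounding is an assumption that avoids float noise at tier boundaries.
    score = round(min(100.0, 80 + (p99_mem_used_rate - 0.85) * 200), 6)
    severity = "Critical" if score >= 90 else "High"
    return score, severity


for rate in (0.88, 0.90):
    print(rate, mem_limit_proximity(rate))
```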

Recommended actions:

  1. Immediately increase the limit by at least 15%.

  2. Check for memory leaks: jmap -histo <pid>

2. Dynamic resource shortage

Trigger condition: p95_mem_used > mem_request + (mem_limit - mem_request) × 0.2

Anomaly type: Resource shortage

Scoring formula: min(100, 70 + [(p95_mem_used - mem_request)/(1 + mem_limit - mem_request) × 50])

Severity levels:

  • ≥90 → Critical (exceeds buffer by 50%)

  • 80–89 → High (exceeds by 30%–50%)

  • 70–79 → Medium (exceeds by 20%–30%)

Rule explanation

Design principle: Empirical practice sets a 20% buffer in the limit-request range (with +1 to prevent division by zero). Sustained P95 usage beyond this buffer suggests the request is too low (per Kubernetes Resource Configuration Whitepapers).

Example: When mem_request=2048MB, mem_limit=4096MB, and p95_mem_used=2560MB:

  • Buffer upper bound = 2048 + (4096 – 2048) × 0.2 = 2457.6MB

  • Excess over buffer = 2560 – 2457.6 = 102.4MB, so the rule triggers.

  • Score = 70 + [(2560 – 2048)/(1 + 4096 – 2048)] × 50 = 70 + (512/2049) × 50 ≈ 82.5 → High
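
The following sketch evaluates the rule by reading the scoring formula literally, with p95_mem_used minus mem_request as the numerator and the +1 term in the denominator. The function name is hypothetical, and the rounding to six decimals is an assumption.

```python
def mem_dynamic_shortage(p95_mem_used, mem_request, mem_limit):
    """Return (score, severity), or None when the rule does not trigger.

    All inputs are in MB. Hypothetical helper for illustration.
    """
    headroom = mem_limit - mem_request
    buffer_upper = mem_request + headroom * 0.2
    if p95_mem_used <= buffer_upper:
        return None
    # Literal reading of the formula; the +1 guards against division by zero.
    raw = 70 + (p95_mem_used - mem_request) / (1 + headroom) * 50
    score = round(min(100.0, raw), 6)
    if score >= 90:
        severity = "Critical"
    elif score >= 80:
        severity = "High"
    else:
        severity = "Medium"
    return score, severity


print(mem_dynamic_shortage(2560, 2048, 4096))
```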

Recommended actions:

  1. Increase the request to above 2458MB.

  2. Configure VerticalPodAutoscaler.

3. Resource waste (request-based)

Trigger condition: p95_mem_used < mem_request × 0.6

Anomaly type: Resource waste

Scoring formula: min(100, 40 + [(0.6 - p95_mem_used/mem_request) × 150])

Severity levels:

  • ≥70 → High (utilization ≤40%)

  • 60–69 → Medium (utilization above 40%, up to about 47%)

  • 40–59 → Low (utilization from about 47% up to 60%)

Rule explanation

Design principle: Request utilization below 60% is considered wasteful. The 150× coefficient gives 1.5 points per 1% underutilization to emphasize waste.

Example: When mem_request=4096MB and p95_mem_used=1229MB:

  • Utilization = 1229/4096 ≈ 0.3

  • Score = 40 + (0.6 – 0.3) × 150 = 85 → High
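
To see how the score maps onto the tiers, the formula can be sketched as below. The function name is hypothetical and mem_request is assumed to be positive. The sketch uses the unrounded utilization, so its result for the example differs very slightly from a hand calculation that rounds utilization to 0.3.

```python
def mem_request_waste(p95_mem_used, mem_request):
    """Return (score, severity), or None when the rule does not trigger.

    Inputs are in MB; mem_request is assumed to be positive.
    Hypothetical helper for illustration.
    """
    utilization = p95_mem_used / mem_request
    if utilization >= 0.6:  # trigger condition: p95_mem_used < mem_request × 0.6
        return None
    score = round(min(100.0, 40 + (0.6 - utilization) * 150), 6)
    if score >= 70:
        severity = "High"
    elif score >= 60:
        severity = "Medium"
    else:
        severity = "Low"
    return score, severity


print(mem_request_waste(1229, 4096))
```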

Recommended actions:

  1. Reduce the request in 30% increments.

  2. Switch to Burstable QoS class.

4. QoS configuration risk

Trigger condition: mem_limit/mem_request > 4 && p95_mem_used < mem_limit × 0.5 && (p99_mem_used - p95_mem_used) < mem_limit × 0.1

Anomaly type: Misconfiguration

Scoring formula: min(100, 60 + [(mem_limit/mem_request - 4) × 20 - (p95_mem_used/mem_limit × 50)])

Severity levels:

  • ≥80 → Critical (ratio >6 and usage <15%)

  • 70–79 → High (ratio 5–6 and usage <25%)

  • 60–69 → Medium (ratio 4–5 and usage <30%)

Rule explanation

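Design principle: A high limit-to-request ratio combined with low actual usage causes node resource fragmentation. Requiring all three conditions (a high ratio, P95 usage below half the limit, and a small gap between P99 and P95) ensures that real configuration waste exists.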

Example: When mem_request=1024MB, mem_limit=5120MB, and p95_mem_used=768MB (with p99_mem_used less than 512MB above p95_mem_used, so the third condition holds):

  • Ratio = 5, usage = 768/5120 = 15%

  • Score = 60 + (5 – 4) × 20 – 0.15 × 50 = 60 + 20 – 7.5 = 72.5 → High
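
Because this rule combines three conditions, a sketch makes the evaluation order explicit. The function name is hypothetical. The example does not state p99_mem_used, so the value 900 used in the call is an assumed figure chosen to satisfy the third condition. Scores below 60 have no tier in the text, so the sketch returns no severity for them.

```python
def qos_config_risk(mem_request, mem_limit, p95_mem_used, p99_mem_used):
    """Return (score, severity), or None when the rule does not trigger.

    All inputs are in MB. Hypothetical helper for illustration.
    """
    ratio = mem_limit / mem_request
    triggered = (
        ratio > 4
        and p95_mem_used < mem_limit * 0.5
        and (p99_mem_used - p95_mem_used) < mem_limit * 0.1
    )
    if not triggered:
        return None
    raw = 60 + (ratio - 4) * 20 - p95_mem_used / mem_limit * 50
    score = round(min(100.0, raw), 6)
    if score >= 80:
        severity = "Critical"
    elif score >= 70:
        severity = "High"
    elif score >= 60:
        severity = "Medium"
    else:
        severity = None  # the text defines no tier below 60
    return score, severity


# p99_mem_used=900 is an assumed value; the example does not provide one.
print(qos_config_risk(1024, 5120, 768, 900))
```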

Recommended actions:

  1. Adjust the limit-to-request ratio to ≤3:1.

  2. Apply ResourceQuota constraints.

5. Resource waste (JVM heap-based)

Trigger condition: max_heap_rate_based_request < 40

Anomaly type: Resource waste

Scoring formula: min(100, 40 + [(0.4 - max_heap_rate_based_request/100) × 150])

Severity levels:

  • ≥70 → High (heap rate ≤20%)

  • 60–69 → Medium (heap rate above 20%, up to about 27%)

  • 40–59 → Low (heap rate from about 27% up to 40%)

Rule explanation

Design principle: Heap memory utilization consistently below 40% of the request indicates clear resource waste.

Example: When mem_request=2048MB and max_heap_rate_based_request=15 (i.e., 15%):

Score = 40 + (0.4 – 0.15) × 150 = 77.5 → High
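
The heap-based variant works on a percentage rather than a fraction, which the sketch below makes explicit. The function name is hypothetical, and the rounding to six decimals is an assumption.

```python
def jvm_heap_waste(max_heap_rate_based_request):
    """Return (score, severity), or None when the rule does not trigger.

    The input is a percentage (15 means 15%). Hypothetical helper.
    """
    if max_heap_rate_based_request >= 40:
        return None
    fraction = max_heap_rate_based_request / 100  # convert percent to fraction
    score = round(min(100.0, 40 + (0.4 - fraction) * 150), 6)
    if score >= 70:
        severity = "High"
    elif score >= 60:
        severity = "Medium"
    else:
        severity = "Low"
    return score, severity


print(jvm_heap_waste(15))
```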

Recommended actions:

  1. Adjust JVM parameters: -Xmx1536m (75% of 2048MB)

  2. Monitor GC logs: -XX:+PrintGCDetails (JDK 8; on JDK 9 and later, use unified logging such as -Xlog:gc*)

Young GC anomaly detection rules

Data field descriptions

  • max_total_ygc_time: Maximum total Young GC time within 10 minutes (milliseconds).

  • avg_total_ygc_time: Average total Young GC time within 10 minutes (milliseconds).

  • max_total_ygc_count: Maximum number of Young GC events within 10 minutes.

  • avg_total_ygc_count: Average number of Young GC events within 10 minutes.

  • avg_time_per_ygc: Average duration per Young GC event (milliseconds).

  • jvm_configurations: JVM startup parameters used by the application (including memory settings and garbage collector type).

Detection rule details

1. Excessive total Young GC time

Trigger condition: max_total_ygc_time > 10000ms

Anomaly type: Performance bottleneck

Scoring formula: min(100, 60 + [(max_total_ygc_time/10000 - 1) × 20])

Severity levels:

  • ≥90 → Critical (severely blocks main thread)

  • 80–89 → High (long GC time requires optimization)

Rule explanation

Design principle: Empirical data shows that total GC time exceeding 10 seconds in 10 minutes (over 1 second per minute) significantly reduces system throughput. The 20× coefficient adds 20 points per extra second per minute to reflect severity quickly.

Example: When max_total_ygc_time=20000ms:

  • Excess = 20000 – 10000 = 10000ms

  • Score = 60 + (2 – 1) × 20 = 80 → High
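
A small sketch of the calculation is shown below. The function name is hypothetical. The text lists tiers only for scores of 80 and above, so the sketch returns no severity for lower scores instead of guessing one.

```python
def ygc_total_time(max_total_ygc_time):
    """Return (score, severity), or None when the rule does not trigger.

    The input is in milliseconds per 10-minute window. Hypothetical helper.
    """
    if max_total_ygc_time <= 10000:
        return None
    score = round(min(100.0, 60 + (max_total_ygc_time / 10000 - 1) * 20), 6)
    if score >= 90:
        severity = "Critical"
    elif score >= 80:
        severity = "High"
    else:
        severity = None  # the text defines no tier below 80 for this rule
    return score, severity


print(ygc_total_time(20000))
```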

Recommended actions:

  1. Adjust young generation size: e.g., -Xmn512m (2× default)

  2. Check object allocation rate: jstat -gcutil <pid> 1000

2. Abnormal Young GC frequency

Trigger condition: (max_total_ygc_count > 500) AND (avg_time_per_ygc > 20ms)

Anomaly type: Resource waste

Scoring formula: min(100, 70 + [(max_total_ygc_count/500 - 1) × 10])

Severity levels:

  • ≥80 → High (frequent GC impacts throughput)

  • 70–79 → Medium (monitor memory allocation)

Rule explanation

Design principle: Both conditions must be met:

  • More than 500 GC events in 10 minutes (>0.83 per second).

  • Average GC duration >20ms.

The 10× coefficient adds 2 points per extra 100 events to reflect waste accurately.

Example: When max_total_ygc_count=600 and avg_time_per_ygc=25ms:

  • Score = 70 + (600/500 – 1) × 10 = 72 → Medium

  • If the count reaches 1000, score = 70 + (1000/500 – 1) × 10 = 80 → High
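
The two-part trigger can be sketched as follows; only the event count feeds the score, while the average duration acts as a gate. The function name and the rounding to six decimals are assumptions.

```python
def ygc_frequency(max_total_ygc_count, avg_time_per_ygc):
    """Return (score, severity), or None when the rule does not trigger.

    avg_time_per_ygc is in milliseconds. Hypothetical helper.
    """
    if not (max_total_ygc_count > 500 and avg_time_per_ygc > 20):
        return None
    score = round(min(100.0, 70 + (max_total_ygc_count / 500 - 1) * 10), 6)
    severity = "High" if score >= 80 else "Medium"
    return score, severity


print(ygc_frequency(600, 25))
print(ygc_frequency(1000, 25))
```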

Recommended actions:

  1. Optimize short-lived objects using object pooling.

  2. Adjust Survivor space ratio: e.g., -XX:SurvivorRatio=6

3. Inefficient single GC event

Trigger condition: avg_time_per_ygc > 60ms

Anomaly type: Performance bottleneck

Scoring formula: min(100, 70 + (avg_time_per_ygc - 60))

Severity levels:

  • ≥90 → Critical (large objects or memory fragmentation)

  • 80–89 → High (insufficient Survivor space)

  • 70–79 → Medium (Eden space too small)

Rule explanation

Design principle: Healthy JVMs typically complete Young GC in under 60ms.

Example: When avg_time_per_ygc=80ms:

Score = 70 + (80 – 60) = 90 → Critical
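
This rule adds one point per millisecond above the 60ms threshold, which a short sketch shows directly. The function name is hypothetical.

```python
def ygc_single_event(avg_time_per_ygc):
    """Return (score, severity), or None when the rule does not trigger.

    The input is in milliseconds. Hypothetical helper for illustration.
    """
    if avg_time_per_ygc <= 60:
        return None
    score = min(100, 70 + (avg_time_per_ygc - 60))
    if score >= 90:
        severity = "Critical"
    elif score >= 80:
        severity = "High"
    else:
        severity = "Medium"
    return score, severity


for duration_ms in (65, 75, 80):
    print(duration_ms, ygc_single_event(duration_ms))
```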

Recommended actions:

  1. Check large object allocation: jmap -histo:live <pid>

  2. Adjust GC thread count: -XX:ParallelGCThreads=4

  3. Enable GC log analysis: -Xlog:gc*:file=gc.log

Full GC anomaly detection rules

Data field descriptions

  • max_total_fullgc_time: Maximum total Full GC time within 10 minutes (milliseconds).

  • avg_total_fullgc_time: Average total Full GC time within 10 minutes (milliseconds).

  • max_total_fullgc_count: Maximum number of Full GC events within 10 minutes.

  • avg_total_fullgc_count: Average number of Full GC events within 10 minutes.

  • avg_time_per_fullgc: Average duration per Full GC event (milliseconds).

  • jvm_configurations: JVM startup parameters used by the application.

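  • max_old_size: Maximum old generation usage (MB); the rule formulas below refer to it as max_old_used.

  • avg_old_size: Average old generation usage (MB); the rule formulas below refer to it as avg_old_used.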

  • old_gen_max_capacity: Configured old generation capacity (MB).

Detection rule details

1. Abnormal single Full GC duration

Trigger condition: avg_time_per_fullgc > 1000ms

Anomaly type: Performance bottleneck

Scoring formula: min(100, 70 + floor(avg_time_per_fullgc / 100))

Severity levels:

  • ≥90 → Critical (duration ≥2000ms)

  • 80–89 → High (1000ms < duration < 2000ms)

Rule explanation

Design principle: Empirical data shows that Full GC durations over 1 second cause excessive Stop-The-World (STW) pauses, degrading service availability. Linear scoring (1 point per 100ms) clearly reflects severity.

Example: When avg_time_per_fullgc=1850ms:

  • Score = 70 + floor(1850/100) = 70 + 18 = 88 → High

  • If duration reaches 2100ms, score = 70 + 21 = 91 → Critical
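
The floor-based scoring can be sketched as below. The function name is hypothetical, and floor division is used as one way to implement floor(duration / 100).

```python
def fullgc_single_duration(avg_time_per_fullgc):
    """Return (score, severity), or None when the rule does not trigger.

    The input is in milliseconds. Hypothetical helper for illustration.
    """
    if avg_time_per_fullgc <= 1000:
        return None
    # Floor division gives one point per full 100 ms of pause time.
    score = min(100, 70 + int(avg_time_per_fullgc // 100))
    severity = "Critical" if score >= 90 else "High"
    return score, severity


print(fullgc_single_duration(1850))
print(fullgc_single_duration(2100))
```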

Recommended actions:

  1. Increase old generation size: -XX:OldSize=2048m

  2. Upgrade GC algorithm: -XX:+UseG1GC

  3. Inspect large objects: jmap -histo:live <pid>

2. Excessive Full GC frequency

Trigger condition: max_total_fullgc_count > 3

Anomaly type: Memory pressure

Scoring formula: min(100, 60 + ((max_total_fullgc_count - 3) × 5))

Severity levels:

  • ≥90 → Critical (≥9 occurrences)

  • 80–89 → High (7–8 occurrences)

  • 65–79 → Low (4–6 occurrences)

Rule explanation

Design principle: Healthy JVMs should trigger Full GC no more than 3 times per 10 minutes. Each additional occurrence adds 5 points (based on Alibaba Cloud ARMS monitoring experience) to quickly identify memory leak risks.

Example: When max_total_fullgc_count=7:

  • Score = 60 + (7 – 3) × 5 = 80 → High

  • If count reaches 10, score = 60 + (10 – 3) × 5 = 95 → Critical
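
The per-occurrence scoring and its three tiers can be sketched as below. The function name is hypothetical; the tier labels follow the list above, including Low for the 65–79 band.

```python
def fullgc_frequency(max_total_fullgc_count):
    """Return (score, severity), or None when the rule does not trigger.

    The input is a count per 10-minute window. Hypothetical helper.
    """
    if max_total_fullgc_count <= 3:
        return None
    score = min(100, 60 + (max_total_fullgc_count - 3) * 5)
    if score >= 90:
        severity = "Critical"
    elif score >= 80:
        severity = "High"
    else:
        severity = "Low"
    return score, severity


for count in (4, 7, 10):
    print(count, fullgc_frequency(count))
```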

Recommended actions:

  1. Check for memory leaks: jstat -gcutil <pid> 1000

  2. Adjust tenuring threshold: -XX:MaxTenuringThreshold=8

3. Old generation capacity risk

Trigger conditions:

  • max_old_used > old_gen_max_capacity × 0.8

  • avg_old_used > old_gen_max_capacity × 0.6

Anomaly type: Configuration risk

Scoring formulas:

  • min(100, 70 + [(max_old_used/capacity - 0.8) × 200])

  • min(100, 70 + [(avg_old_used/capacity - 0.6) × 150])

Severity levels:

  • ≥90 → Critical (usage ≥90%)

  • 80–89 → High (usage 85%–89%)

  • 70–79 → Medium (usage 80%–84%)

Rule explanation

Design principle: Both thresholds are empirical:

  • Peak usage over 80% risks memory fragmentation.

  • Average usage over 60% signals insufficient capacity.

Example

When old_gen_max_capacity=2048MB and max_old_used=1843MB:

  • Usage = 1843/2048 ≈ 0.9

  • Score = 70 + (0.9 – 0.8) × 200 = 90 → Critical

When old_gen_max_capacity=2048MB and avg_old_used=1600MB:

  • Usage = 1600/2048 ≈ 0.78

  • Score = 70 + (0.78 – 0.6) × 150 = 97 → Critical
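
Both conditions can be evaluated side by side with a sketch like the one below. The function name and the dictionary return shape are assumptions, and the text does not say how the two scores are combined when both conditions trigger, so the sketch reports them separately. Because it uses unrounded ratios, its results can differ slightly from hand calculations that round the ratio first.

```python
def old_gen_capacity_scores(old_gen_max_capacity, max_old_used, avg_old_used):
    """Return the score of each condition that triggers, keyed by name.

    All inputs are in MB. Hypothetical helper for illustration.
    """
    scores = {}
    peak_ratio = max_old_used / old_gen_max_capacity
    avg_ratio = avg_old_used / old_gen_max_capacity
    if peak_ratio > 0.8:
        scores["peak"] = min(100.0, 70 + (peak_ratio - 0.8) * 200)
    if avg_ratio > 0.6:
        scores["average"] = min(100.0, 70 + (avg_ratio - 0.6) * 150)
    return scores


# Combines the two example inputs in one call.
for name, score in old_gen_capacity_scores(2048, 1843, 1600).items():
    print(name, round(score, 2))
```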

Recommended actions:

  1. Adjust old generation ratio: -XX:NewRatio=3

  2. Increase heap size: -Xmx4096m -Xms4096m

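  3. Analyze heap memory, for example by taking a heap dump (jmap -dump:live,format=b,file=heap.hprof <pid>) and inspecting which objects retain old generation space.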