This topic describes application anomaly detection rules to help you configure your application settings effectively and maintain system stability.
CPU anomaly detection rules
Data field descriptions
avg_cpu_used: Average CPU usage (in millicores; 1 core = 1000 millicores).
p95_cpu_used: P95 percentile of CPU usage (in millicores).
p99_cpu_used: P99 percentile of CPU usage (in millicores).
cpu_request: CPU request for the Kubernetes pod (resource assurance baseline).
cpu_limit: CPU limit for the Kubernetes pod (upper bound of resource usage).
Detection rule details
1. Limit proximity risk
Trigger condition: p99_cpu_used > cpu_limit × 0.8
Anomaly type: Throttling risk
Scoring formula: min(100, 70 + (p99_used/limit - 0.8) × 500)
Severity levels:
≥90 → Critical (direct kernel-level throttling risk)
80–89 → High (adjust limit urgently)
70–79 → Medium (monitor recommended)
Rule explanation
Design principle: Based on the Kubernetes CPU throttling mechanism (CFS quota), when a container’s CPU usage consistently approaches its limit, kernel-level throttling occurs. The P99 value reflects resource consumption under extreme conditions. The 80% threshold aligns with the "saturation" metric from Google SRE’s Golden Signals.
Example: When cpu_limit=2000m (2 cores) and p99_cpu_used=1700m:
Usage ratio: 1700/2000 = 0.85
Score = 70 + (0.85 – 0.8) × 500 = 70 + 25 = 95 → Critical
Recommended actions:
Immediately increase the limit by at least 25% (e.g., from 2000m to 2500m).
Check node resource availability:
kubectl describe node
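The scoring and severity mapping for this rule can be sketched in Python. This is a minimal illustration rather than the product's implementation: the function name cpu_limit_proximity is hypothetical, a positive cpu_limit is assumed, and rounding to two decimals is an assumption added only to hide floating-point noise.

```python
def cpu_limit_proximity(p99_cpu_used: float, cpu_limit: float):
    """Return (score, severity) for the limit proximity rule, or None if not triggered.

    Inputs are in millicores. cpu_limit is assumed to be greater than zero.
    """
    # Trigger condition: p99_cpu_used > cpu_limit x 0.8
    if p99_cpu_used <= cpu_limit * 0.8:
        return None
    ratio = p99_cpu_used / cpu_limit
    # Scoring formula: min(100, 70 + (p99_used/limit - 0.8) x 500)
    score = round(min(100.0, 70 + (ratio - 0.8) * 500), 2)
    if score >= 90:
        severity = "Critical"
    elif score >= 80:
        severity = "High"
    else:
        severity = "Medium"
    return score, severity


# Values from the example above: cpu_limit=2000m, p99_cpu_used=1700m
print(cpu_limit_proximity(1700, 2000))  # (95.0, 'Critical')
```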
2. Persistent resource shortage
Trigger condition: p95_cpu_used > cpu_request + (cpu_limit - cpu_request) × 0.2
Anomaly type: Resource shortage
Scoring formula: min(100, 80 + (p95_used - request) / (limit - request) × 20)
Classification:
≥90 → Critical
80–89 → High
Rule explanation
Design principle: In Kubernetes scheduling, the request is the guaranteed baseline, and the difference between limit and request provides elastic headroom. A 20% buffer (inspired by AWS ECS safety margins) helps detect when sustained demand exceeds the guaranteed baseline.
Example: When cpu_request=1000m, cpu_limit=3000m, and p95_cpu_used=1500m:
Buffer upper bound = 1000 + (3000 – 1000) × 0.2 = 1400m
Excess = 1500 – 1400 = 100m
Score = 80 + (100/2000) × 20 = 81 → High
Recommended actions:
Increase the request to above 1400m.
Verify that the pod’s QoS class is Burstable.
3. Burst load risk
Trigger condition: (p95_cpu_used - avg_cpu_used)/avg_cpu_used > 4
Anomaly type: Load fluctuation
Scoring formula: 60 + 20 × (fluctuation_rate - 4)
Fixed severity: Medium
Rule explanation
Design principle: Empirical data shows that peak usage exceeding the average by more than 4× indicates abnormal traffic bursts. This rule uses a fixed Medium severity to flag attention without requiring immediate action.
Example: When avg_cpu_used=200m and p95_cpu_used=1000m:
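Fluctuation rate = (1000 – 200)/200 = 4 → exactly at the threshold; the rule triggers only when the rate exceeds 4
Score at the threshold = 60 + 20 × (4 – 4) = 60 → Medium (the minimum score for this rule)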
Recommended actions:
Configure Horizontal Pod Autoscaler (HPA) for automatic scaling and add a pre-stop hook to smooth traffic.
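As a rough sketch, the fluctuation rate and score can be computed as follows. The function name cpu_burst_risk is hypothetical, and skipping applications with zero average usage is an assumption made here to avoid division by zero. The strict ">" comes from the trigger condition, and the score is left uncapped because the formula above has no min(100, ...) term.

```python
def cpu_burst_risk(avg_cpu_used: float, p95_cpu_used: float):
    """Return (fluctuation_rate, score, severity), or None if the rule does not trigger."""
    if avg_cpu_used <= 0:
        # Assumption: idle applications are skipped (the source does not cover this case).
        return None
    fluctuation_rate = (p95_cpu_used - avg_cpu_used) / avg_cpu_used
    # Trigger condition: fluctuation rate strictly greater than 4
    if fluctuation_rate <= 4:
        return None
    # Scoring formula: 60 + 20 x (fluctuation_rate - 4); severity is fixed at Medium
    score = round(60 + 20 * (fluctuation_rate - 4), 2)
    return fluctuation_rate, score, "Medium"


print(cpu_burst_risk(200, 1100))  # (4.5, 70.0, 'Medium')
```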
4. Resource over-provisioning
Trigger condition: avg_cpu_used < cpu_request × 0.4
Anomaly type: Resource waste
Scoring formula: 70 - (avg_used/request) × 10
Fixed severity: Medium
Rule explanation
Design principle: Empirical best practices suggest that average usage below 40% of the request indicates waste. Google Cloud recommends keeping request utilization between 40% and 70%.
Example: When cpu_request=4000m and avg_cpu_used=1500m:
Utilization = 1500/4000 = 0.375
Score = 70 – 0.375 × 10 = 66.25 → Medium
Recommended actions:
Gradually reduce the request in steps (no more than 20% per adjustment).
Switch to Guaranteed QoS class.
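A minimal Python sketch of this rule, assuming a positive cpu_request; the function name cpu_overprovisioning is hypothetical and not part of any product API.

```python
def cpu_overprovisioning(avg_cpu_used: float, cpu_request: float):
    """Return (score, severity) for the over-provisioning rule, or None if not triggered."""
    # Trigger condition: avg_cpu_used < cpu_request x 0.4
    if avg_cpu_used >= cpu_request * 0.4:
        return None
    utilization = avg_cpu_used / cpu_request
    # Scoring formula: 70 - (avg_used/request) x 10; severity is fixed at Medium
    score = round(70 - utilization * 10, 2)
    return score, "Medium"


# A pod requesting 4000m that averages 1000m (25% utilization)
print(cpu_overprovisioning(1000, 4000))  # (67.5, 'Medium')
```

Because the utilization term is subtracted, lower utilization yields a higher score, so the score stays between 66 and 70 whenever the rule triggers.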
5. Unreasonable limit configuration
Trigger condition: cpu_limit/cpu_request > 4
Anomaly type: Configuration risk
Scoring formula: 80 + (limit/request - 4) × 5
Severity levels:
≥90 → Critical
80–89 → High
Rule explanation
Design principle: Empirical data shows that a limit-to-request ratio greater than 4:1 (production environments should aim for 2:1) can cause node resource fragmentation and violate security standards such as PCI-DSS.
Example: When cpu_request=500m and cpu_limit=2500m:
Ratio = 2500/500 = 5
Score = 80 + (5 – 4) × 5 = 85 → High
Recommended actions:
Set the limit to no more than 3× the request.
Review namespace-level ResourceQuota limits.
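The ratio check can be illustrated with a short Python sketch. The function name cpu_limit_ratio_risk is hypothetical and a positive cpu_request is assumed. The score is computed exactly as the formula is written, which has no upper cap.

```python
def cpu_limit_ratio_risk(cpu_request: float, cpu_limit: float):
    """Return (ratio, score, severity), or None if the limit-to-request ratio is acceptable."""
    ratio = cpu_limit / cpu_request
    # Trigger condition: cpu_limit/cpu_request > 4
    if ratio <= 4:
        return None
    # Scoring formula: 80 + (limit/request - 4) x 5 (no cap in the formula as written)
    score = round(80 + (ratio - 4) * 5, 2)
    severity = "Critical" if score >= 90 else "High"
    return ratio, score, severity


# Values from the example above: cpu_request=500m, cpu_limit=2500m
print(cpu_limit_ratio_risk(500, 2500))  # (5.0, 85.0, 'High')
```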
Memory anomaly detection rules
Data field descriptions
max_heap_rate_based_request: Maximum JVM heap usage rate based on request (percentage; e.g., 7 means 7%).
mem_request: Memory request (MB; Kubernetes guaranteed amount).
mem_limit: Memory limit (MB; Kubernetes usage cap).
p95_mem_used_rate: P95 memory usage rate based on limit.
p99_mem_used_rate: P99 memory usage rate based on limit.
p95_mem_used: P95 memory usage (MB).
p99_mem_used: P99 memory usage (MB).
Detection rule details
1. Limit proximity risk (highest priority)
Trigger condition: p99_mem_used_rate > 0.85
Anomaly type: Out-of-memory (OOM) risk
Scoring formula: min(100, 80 + [(p99_mem_used_rate - 0.85) × 200])
Severity levels:
≥90 → Critical (immediate OOM risk)
80–89 → High (close to limit but not critical)
Rule explanation
Design principle: Based on Kubernetes OOMKill behavior (per Kubernetes official documentation), containers are terminated immediately when memory exceeds the limit. The 85% threshold accounts for extreme-case usage reflected by P99.
Example: When mem_limit=4096MB and p99_mem_used_rate=0.88:
Excess over threshold = 0.88 – 0.85 = 0.03
Score = 80 + 0.03 × 200 = 86 → High
If usage reaches 90% (rate = 0.9), score = 80 + 0.05 × 200 = 90 → Critical
Recommended actions:
Immediately increase the limit by at least 15%.
Check for memory leaks:
jmap -histo <pid>
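The OOM risk score depends only on the P99 usage rate, so it fits in a few lines of Python. This is a sketch under stated assumptions: the function name mem_limit_proximity is hypothetical, the rate is expressed as a fraction of the limit (0.88 means 88%), and the two-decimal rounding is added here to suppress floating-point noise.

```python
def mem_limit_proximity(p99_mem_used_rate: float):
    """Return (score, severity) for the memory limit proximity rule, or None if not triggered."""
    # Trigger condition: p99_mem_used_rate > 0.85
    if p99_mem_used_rate <= 0.85:
        return None
    # Scoring formula: min(100, 80 + (p99_mem_used_rate - 0.85) x 200)
    score = round(min(100.0, 80 + (p99_mem_used_rate - 0.85) * 200), 2)
    severity = "Critical" if score >= 90 else "High"
    return score, severity


print(mem_limit_proximity(0.88))  # (86.0, 'High')
print(mem_limit_proximity(0.90))  # (90.0, 'Critical')
```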
2. Dynamic resource shortage
Trigger condition: p95_mem_used > mem_request + (mem_limit - mem_request) × 0.2
Anomaly type: Resource shortage
Scoring formula: min(100, 70 + [(p95_mem_used - request)/(1 + mem_limit - mem_request) × 50])
Severity levels:
≥90 → Critical (exceeds buffer by 50%)
80–89 → High (exceeds by 30%–50%)
70–79 → Medium (exceeds by 20%–30%)
Rule explanation
Design principle: Empirical practice sets a 20% buffer in the limit-request range (with +1 to prevent division by zero). Sustained P95 usage beyond this buffer suggests the request is too low (per Kubernetes Resource Configuration Whitepapers).
Example: When mem_request=2048MB, mem_limit=4096MB, and p95_mem_used=2560MB:
Buffer upper bound = 2048 + (4096 – 2048) × 0.2 = 2457.6MB
Excess = 2560 – 2457.6 = 102.4MB
Score = 70 + [102.4/(1 + 2048)] × 50 ≈ 72.5 → Medium
Recommended actions:
Increase the request to above 2458MB.
Configure VerticalPodAutoscaler.
3. Resource waste (request-based)
Trigger condition: p95_mem_used < mem_request × 0.6
Anomaly type: Resource waste
Scoring formula: min(100, 40 + [(0.6 - p95_mem_used/mem_request) × 150])
Severity levels:
≥70 → High (utilization ≤40%)
60–69 → Medium (utilization above 40% and up to about 47%)
40–59 → Low (utilization from about 47% to 60%)
Rule explanation
Design principle: Request utilization below 60% is considered wasteful. The 150× coefficient gives 1.5 points per 1% underutilization to emphasize waste.
Example: When mem_request=4096MB and p95_mem_used=1229MB:
Utilization = 1229/4096 ≈ 0.3
Score = 40 + (0.6 – 0.3) × 150 = 85 → High
Recommended actions:
Reduce the request in 30% increments.
Switch to Burstable QoS class.
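To see how the 150× coefficient maps utilization to a score, the rule can be sketched in Python. The function name mem_request_waste is hypothetical and a positive mem_request is assumed. The sketch does not round the utilization before scoring, so the example input of 1229MB against a 4096MB request yields 84.99 rather than the rounded 85.

```python
def mem_request_waste(p95_mem_used: float, mem_request: float):
    """Return (score, severity) for the request-based waste rule, or None if not triggered."""
    # Trigger condition: p95_mem_used < mem_request x 0.6
    if p95_mem_used >= mem_request * 0.6:
        return None
    utilization = p95_mem_used / mem_request
    # Scoring formula: min(100, 40 + (0.6 - utilization) x 150)
    score = round(min(100.0, 40 + (0.6 - utilization) * 150), 2)
    if score >= 70:
        severity = "High"
    elif score >= 60:
        severity = "Medium"
    else:
        severity = "Low"
    return score, severity


print(mem_request_waste(1229, 4096))  # (84.99, 'High')
print(mem_request_waste(2048, 4096))  # (55.0, 'Low')
```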
4. QoS configuration risk
Trigger condition: (mem_limit/mem_request > 4) AND (p95_mem_used < mem_limit × 0.5) AND (p99_mem_used - p95_mem_used < mem_limit × 0.1)
Anomaly type: Misconfiguration
Scoring formula: min(100, 60 + [(mem_limit/mem_request - 4) × 20 - (p95_mem_used/mem_limit × 50)])
Severity levels:
≥80 → Critical (ratio >6 and usage <15%)
70–79 → High (ratio 5–6 and usage <25%)
60–69 → Medium (ratio 4–5 and usage <30%)
Rule explanation
Design principle: A high limit-to-request ratio combined with low actual usage causes node resource fragmentation. Requiring all trigger conditions together (high ratio, low P95 usage, and a small gap between P99 and P95) ensures that real configuration waste exists.
Example: When request=1024MB, limit=5120MB, and p95_used=768MB:
Ratio = 5, usage = 768/5120 = 15%
Score = 60 + (5 – 4) × 20 – 0.15 × 50 = 60 + 20 – 7.5 = 72.5 → High
Recommended actions:
Adjust the limit-to-request ratio to ≤3:1.
Apply ResourceQuota constraints.
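Because this rule combines three trigger conditions with a two-term score, a small Python sketch may help. The function name qos_config_risk and the "Unclassified" label are hypothetical: the severity levels above start at 60, and the source does not say how lower scores are labeled. The p99_mem_used value in the usage line is also an assumption, since the example above does not state one.

```python
def qos_config_risk(mem_request: float, mem_limit: float,
                    p95_mem_used: float, p99_mem_used: float):
    """Return (score, severity) for the QoS configuration rule, or None if not triggered."""
    ratio = mem_limit / mem_request
    triggered = (
        ratio > 4                                          # limit far above request
        and p95_mem_used < mem_limit * 0.5                 # low sustained usage
        and (p99_mem_used - p95_mem_used) < mem_limit * 0.1  # no large spikes
    )
    if not triggered:
        return None
    # Scoring formula: min(100, 60 + (ratio - 4) x 20 - (p95_used/limit) x 50)
    score = round(min(100.0, 60 + (ratio - 4) * 20 - p95_mem_used / mem_limit * 50), 2)
    if score >= 80:
        severity = "Critical"
    elif score >= 70:
        severity = "High"
    elif score >= 60:
        severity = "Medium"
    else:
        severity = "Unclassified"  # assumption: not defined in the source
    return score, severity


# request=1024MB, limit=5120MB, p95=768MB; p99=900MB is an assumed value
print(qos_config_risk(1024, 5120, 768, 900))  # (72.5, 'High')
```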
5. Resource waste (JVM heap-based)
Trigger condition: max_heap_rate_based_request < 40
Anomaly type: Resource waste
Scoring formula: min(100, 40 + [(0.4 - max_heap_rate_based_request/100) × 150])
Severity levels:
≥70 → High (utilization ≤20%)
60–69 → Medium (utilization above 20% and up to about 27%)
40–59 → Low (utilization from about 27% to 40%)
Rule explanation
Design principle: Heap memory utilization consistently below 40% of the request indicates clear resource waste.
Example: When mem_request=2048MB and max_heap_rate=15 (i.e., 15%):
Score = 40 + (0.4 – 0.15) × 150 = 77.5 → High
Recommended actions:
Adjust JVM parameters:
-Xmx1536m (75% of 2048MB)
Monitor GC logs:
-XX:+PrintGCDetails
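A short Python sketch of the heap-based waste score follows. The function name heap_waste is hypothetical. The input is the percentage value of max_heap_rate_based_request (15 means 15%), which the formula converts to a fraction before scoring.

```python
def heap_waste(max_heap_rate_based_request: float):
    """Return (score, severity) for the JVM heap-based waste rule, or None if not triggered."""
    # Trigger condition: max_heap_rate_based_request < 40 (percent)
    if max_heap_rate_based_request >= 40:
        return None
    fraction = max_heap_rate_based_request / 100
    # Scoring formula: min(100, 40 + (0.4 - fraction) x 150)
    score = round(min(100.0, 40 + (0.4 - fraction) * 150), 2)
    if score >= 70:
        severity = "High"
    elif score >= 60:
        severity = "Medium"
    else:
        severity = "Low"
    return score, severity


print(heap_waste(15))  # (77.5, 'High')
print(heap_waste(30))  # (55.0, 'Low')
```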
Young GC anomaly detection rules
Data field descriptions
max_total_ygc_time: Maximum total Young GC time within 10 minutes (milliseconds).
avg_total_ygc_time: Average total Young GC time within 10 minutes (milliseconds).
max_total_ygc_count: Maximum number of Young GC events within 10 minutes.
avg_total_ygc_count: Average number of Young GC events within 10 minutes.
avg_time_per_ygc: Average duration per Young GC event (milliseconds).
jvm_configurations: JVM startup parameters used by the application (including memory settings and garbage collector type).
Detection rule details
1. Excessive total Young GC time
Trigger condition: max_total_ygc_time > 10000ms
Anomaly type: Performance bottleneck
Scoring formula: min(100, 60 + [(max_total_ygc_time/10000 - 1) × 20])
Classification:
≥90 → Critical (severely blocks main thread)
80–89 → High (long GC time requires optimization)
Rule explanation
Design principle: Empirical data shows that total GC time exceeding 10 seconds in 10 minutes (over 1 second per minute) significantly reduces system throughput. The 20× coefficient adds 20 points per extra second per minute to reflect severity quickly.
Example: When max_total_ygc_time=20000ms:
Excess = 20000 – 10000 = 10000ms
Score = 60 + (2 – 1) × 20 = 80 → High
Recommended actions:
Adjust young generation size, e.g.:
-Xmn512m (2× default)
Check object allocation rate:
jstat -gcutil <pid> 1000
2. Abnormal Young GC frequency
Trigger condition: (max_total_ygc_count > 500) AND (avg_time_per_ygc > 20ms)
Anomaly type: Resource waste
Scoring formula: min(100, 70 + [(max_total_ygc_count/500 - 1) × 10])
Severity levels:
≥80 → High (frequent GC impacts throughput)
70–79 → Medium (monitor memory allocation)
Rule explanation
Design principle: Both conditions must be met:
More than 500 GC events in 10 minutes (>0.83 per second).
Average GC duration >20ms.
The 10× coefficient adds 2 points per extra 100 events to reflect waste accurately.
Example: When max_total_ygc_count=600 and avg_time_per_ygc=25ms:
Score = 70 + (600/500 – 1) × 10 = 72 → Medium
When the count reaches 1000: Score = 70 + (1000/500 – 1) × 10 = 80 → High
Recommended actions:
Optimize short-lived objects using object pooling.
Adjust Survivor space ratio: e.g.,
-XX:SurvivorRatio=6
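The two-part trigger and the count-based score can be sketched in Python as follows. The function name ygc_frequency is hypothetical, and the rounding is an assumption added to hide floating-point noise.

```python
def ygc_frequency(max_total_ygc_count: int, avg_time_per_ygc: float):
    """Return (score, severity) for the Young GC frequency rule, or None if not triggered."""
    # Both conditions must hold: more than 500 events in 10 minutes AND more than 20 ms per event
    if not (max_total_ygc_count > 500 and avg_time_per_ygc > 20):
        return None
    # Scoring formula: min(100, 70 + (count/500 - 1) x 10)
    score = round(min(100.0, 70 + (max_total_ygc_count / 500 - 1) * 10), 2)
    severity = "High" if score >= 80 else "Medium"
    return score, severity


print(ygc_frequency(600, 25))   # (72.0, 'Medium')
print(ygc_frequency(1000, 25))  # (80.0, 'High')
print(ygc_frequency(600, 15))   # None: GC events are frequent but short
```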
3. Inefficient single GC event
Trigger condition: avg_time_per_ygc > 60ms
Anomaly type: Performance bottleneck
Scoring formula: min(100, 70 + (avg_time_per_ygc - 60))
Classification:
≥90 → Critical (large objects or memory fragmentation)
80–89 → High (insufficient Survivor space)
70–79 → Medium (Eden space too small)
Rule explanation
Design principle: Healthy JVMs typically complete Young GC in under 60ms.
Example: When avg_time_per_ygc=80ms:
Score = 70 + (80 – 60) = 90 → Critical
Recommended actions:
Check large object allocation:
jmap -histo:live <pid>
Adjust GC thread count:
-XX:ParallelGCThreads=4
Enable GC log analysis:
-Xlog:gc*:file=gc.log
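Since this score grows by one point per millisecond above 60ms, a direct Python sketch is straightforward. The function name single_ygc_duration is hypothetical.

```python
def single_ygc_duration(avg_time_per_ygc: float):
    """Return (score, severity) for the single Young GC rule, or None if not triggered."""
    # Trigger condition: avg_time_per_ygc > 60 (milliseconds)
    if avg_time_per_ygc <= 60:
        return None
    # Scoring formula: min(100, 70 + (avg_time_per_ygc - 60))
    score = min(100, 70 + (avg_time_per_ygc - 60))
    if score >= 90:
        severity = "Critical"
    elif score >= 80:
        severity = "High"
    else:
        severity = "Medium"
    return score, severity


print(single_ygc_duration(75))  # (85, 'High')
```

Under this formula the score reaches the Critical band at 80ms and is capped at 100 from 90ms upward.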
Full GC anomaly detection rules
Data field descriptions
max_total_fullgc_time: Maximum total Full GC time within 10 minutes (milliseconds).
avg_total_fullgc_time: Average total Full GC time within 10 minutes (milliseconds).
max_total_fullgc_count: Maximum number of Full GC events within 10 minutes.
avg_total_fullgc_count: Average number of Full GC events within 10 minutes.
avg_time_per_fullgc: Average duration per Full GC event (milliseconds).
jvm_configurations: JVM startup parameters used by the application.
max_old_size: Maximum old generation size (MB).
avg_old_size: Average old generation size (MB).
old_gen_max_capacity: Configured old generation capacity (MB).
Detection rule details
1. Abnormal single Full GC duration
Trigger condition: avg_time_per_fullgc > 1000ms
Anomaly type: Performance bottleneck
Scoring formula: min(100, 70 + floor(avg_time_per_fullgc / 100))
Classification:
≥90 → Critical (duration ≥2000ms)
80–89 → High (1000ms < duration < 2000ms)
Rule explanation
Design principle: Empirical data shows that Full GC durations over 1 second cause excessive Stop-The-World (STW) pauses, degrading service availability. Linear scoring (1 point per 100ms) clearly reflects severity.
Example: When avg_time_per_fullgc=1850ms:
Score = 70 + floor(1850/100) = 70 + 18 = 88 → High
If duration reaches 2100ms, score = 70 + 21 = 91 → Critical
Recommended actions:
Increase old generation size:
-XX:OldSize=2048m
Upgrade GC algorithm:
-XX:+UseG1GC
Inspect large objects:
jmap -histo:live <pid>
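The floor-based scoring (one point per full 100ms) can be sketched in Python. The function name full_gc_duration is hypothetical, and the input is assumed to be avg_time_per_fullgc in milliseconds.

```python
import math


def full_gc_duration(avg_time_per_fullgc: float):
    """Return (score, severity) for the single Full GC duration rule, or None if not triggered."""
    # Trigger condition: avg_time_per_fullgc > 1000 (milliseconds)
    if avg_time_per_fullgc <= 1000:
        return None
    # Scoring formula: min(100, 70 + floor(duration / 100))
    score = min(100, 70 + math.floor(avg_time_per_fullgc / 100))
    severity = "Critical" if score >= 90 else "High"
    return score, severity


print(full_gc_duration(1850))  # (88, 'High')
print(full_gc_duration(2100))  # (91, 'Critical')
```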
2. Excessive Full GC frequency
Trigger condition: max_total_fullgc_count > 3
Anomaly type: Memory pressure
Scoring formula: min(100, 60 + ((max_total_fullgc_count - 3) × 5))
Classification:
≥90 → Critical (≥9 occurrences)
80–89 → High (7–8 occurrences)
65–79 → Low (4–6 occurrences)
Rule explanation
Design principle: Healthy JVMs should trigger Full GC no more than 3 times per 10 minutes. Each additional occurrence adds 5 points (based on Alibaba Cloud ARMS monitoring experience) to quickly identify memory leak risks.
Example: When max_total_fullgc_count=7:
Score = 60 + (7 – 3) × 5 = 80 → High
If count reaches 10, score = 60 + (10 – 3) × 5 = 95 → Critical
Recommended actions:
Check for memory leaks:
jstat -gcutil <pid> 1000
Adjust tenuring threshold:
-XX:MaxTenuringThreshold=8
3. Old generation capacity risk
Trigger conditions:
max_old_used > old_gen_max_capacity × 0.8
avg_old_used > old_gen_max_capacity × 0.6
Anomaly type: Configuration risk
Scoring formulas:
min(100, 70 + [(max_old_used/capacity - 0.8) × 200])
min(100, 70 + [(avg_old_used/capacity - 0.6) × 150])
Classification levels:
≥90 → Critical (usage ≥90%)
80–89 → High (usage 85%–89%)
70–79 → Medium (usage 80%–84%)
Rule explanation
Design principle: The thresholds are empirical:
Peak usage over 80% risks memory fragmentation.
Average usage over 60% signals insufficient capacity.
Example:
When old_gen_max_capacity=2048MB and max_old_used=1843MB:
Usage = 1843/2048 ≈ 0.9
Score = 70 + (0.9 – 0.8) × 200 = 90 → Critical
When old_gen_max_capacity=2048MB and avg_old_used=1600MB:
Usage = 1600/2048 ≈ 0.78
Score = 70 + (0.78 – 0.6) × 150 ≈ 97 → Critical
Recommended actions:
Adjust old generation ratio:
-XX:NewRatio=3
Increase heap size:
-Xmx4096m -Xms4096m
Analyze heap memory.
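This rule has two independent trigger conditions, each with its own scoring formula. The Python sketch below evaluates both and reports them separately. The source does not say how the two scores are combined, so returning both in a dictionary is an assumption, as are the function name old_gen_capacity_risk and the reuse of the same severity thresholds for both scores.

```python
def old_gen_capacity_risk(max_old_used: float, avg_old_used: float,
                          old_gen_max_capacity: float):
    """Return a dict of triggered findings: {"peak": (score, level), "average": (score, level)}."""

    def level(score: float) -> str:
        if score >= 90:
            return "Critical"
        if score >= 80:
            return "High"
        return "Medium"

    findings = {}
    # Peak condition: max_old_used > capacity x 0.8
    if max_old_used > old_gen_max_capacity * 0.8:
        usage = max_old_used / old_gen_max_capacity
        score = round(min(100.0, 70 + (usage - 0.8) * 200), 2)
        findings["peak"] = (score, level(score))
    # Average condition: avg_old_used > capacity x 0.6
    if avg_old_used > old_gen_max_capacity * 0.6:
        usage = avg_old_used / old_gen_max_capacity
        score = round(min(100.0, 70 + (usage - 0.6) * 150), 2)
        findings["average"] = (score, level(score))
    return findings


# Capacity 2048MB with a peak of 1900MB and an average of 1600MB
print(old_gen_capacity_risk(1900, 1600, 2048))
# {'peak': (95.55, 'Critical'), 'average': (97.19, 'Critical')}
```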