Failure Damage Control and Recovery - Well-Architected Framework (WAF) - Alibaba Cloud Help Center

Learn how to quickly identify failure root causes and execute recovery plans to minimize service impact.

Root Cause Analysis

Aggregate all stability-related data across the enterprise, such as change events and abnormal events in databases and message queues (MQ), and integrate business-specific troubleshooting tools. When responding to fault and risk alerts, use this data to perform root cause analysis and shorten the time needed to identify the initial cause.
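As an illustration only, the correlation step could be sketched as follows: given events aggregated from several sources, list those that touched the alerting service shortly before the alert fired. The `StabilityEvent` type, the `candidate_causes` function, and the 30-minute lookback window are assumptions made for this sketch, not part of any Alibaba Cloud API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class StabilityEvent:
    """Hypothetical record for one aggregated stability event."""
    source: str        # e.g. "change", "database", "mq"
    service: str       # service the event affected
    description: str
    occurred_at: datetime


def candidate_causes(events, alert_service, alert_time,
                     lookback=timedelta(minutes=30)):
    """Return events on the alerting service that occurred within the
    lookback window before the alert, most recent first."""
    window_start = alert_time - lookback
    related = [
        e for e in events
        if e.service == alert_service
        and window_start <= e.occurred_at <= alert_time
    ]
    return sorted(related, key=lambda e: e.occurred_at, reverse=True)


if __name__ == "__main__":
    alert_time = datetime(2024, 1, 1, 12, 0)
    events = [
        StabilityEvent("change", "order", "release deployed",
                       alert_time - timedelta(minutes=5)),
        StabilityEvent("database", "order", "slow SQL spike",
                       alert_time - timedelta(minutes=20)),
        StabilityEvent("mq", "payment", "consumer backlog",
                       alert_time - timedelta(minutes=1)),
    ]
    for event in candidate_causes(events, "order", alert_time):
        print(event.occurred_at, event.source, event.description)
```

Ordering by recency is a simple heuristic chosen here because recent changes are often the first thing responders check; a real system would likely weigh additional signals such as service dependencies.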

Rapid Recovery Plan Recommendation

Common fault recovery methods include restart, rollback, scaling, traffic switching, rate limiting, and degradation. Recovery efficiency depends largely on well-prepared plans and regular drills. We recommend that you surface common rapid recovery capabilities with one-click execution on both PC and mobile platforms in the fault emergency collaboration group. This reduces the time R&D engineers spend locating recovery entry points and removes the dependency on having computer access during emergencies. Rapid recovery capabilities fall into two categories:

Manual mitigation plans: Catalog the degradation plans available for each fault and risk scenario. When a matching scenario is triggered, the system recommends the associated plan. The plan can be executed with one click from within the fault group, or automatically once specified conditions are met.
General-purpose rapid recovery capabilities for specific domains: Integrate common recovery capabilities, such as database-side slow SQL rate limiting, ultra-fast change rollback, and multi-active redundancy traffic switching. The system automatically analyzes fault root causes based on monitoring data, logs, and other data, and then recommends the corresponding recovery method.
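The recommend-then-execute flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `RecoveryPlan`, `recommend`, `respond`, the scenario names, and the `db_cpu` metric are all hypothetical and do not correspond to a real product interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RecoveryPlan:
    """Hypothetical recovery plan registered for one fault scenario."""
    name: str
    scenario: str
    action: Callable[[], str]   # performs the recovery, returns a status
    # Optional predicate over current metrics; when it holds, the plan
    # runs automatically instead of waiting for a one-click trigger.
    auto_condition: Optional[Callable[[dict], bool]] = None


def recommend(plans, scenario):
    """Return the plans associated with the triggered scenario."""
    return [p for p in plans if p.scenario == scenario]


def respond(plans, scenario, metrics):
    """Execute plans whose auto condition is met and return the
    remaining plan names as one-click recommendations."""
    executed, recommended = [], []
    for plan in recommend(plans, scenario):
        if plan.auto_condition is not None and plan.auto_condition(metrics):
            executed.append(plan.action())
        else:
            recommended.append(plan.name)
    return executed, recommended


if __name__ == "__main__":
    plans = [
        RecoveryPlan("limit-slow-sql", "db_slow_sql",
                     lambda: "rate limit applied",
                     lambda m: m.get("db_cpu", 0) > 90),
        RecoveryPlan("rollback-change", "db_slow_sql",
                     lambda: "change rolled back"),
    ]
    executed, recommended = respond(plans, "db_slow_sql", {"db_cpu": 95})
    print("executed:", executed)
    print("recommended for one-click execution:", recommended)
```

Keeping the automatic trigger as an optional predicate on each plan means the same registry can serve both modes: plans without a condition are only ever surfaced for a human to confirm.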