Stability design solutions
Apply stability solutions through architecture design, change management, emergency response, and fault drills.
The following stability solutions apply the design principles described above.

Architecture design principles
Software architecture has evolved from monolithic to distributed to microservices to cloud-native. Core system attributes—storage, compute, and networking—remain constant, but their implementations shift toward large-scale, high-performance, highly reliable, and easily scalable designs. This evolution raises the bar for architecture stability.
Stability risks range from thread-level failures to regional disasters, including software/hardware faults and traffic spikes. Address them through three pillars: disaster recovery, failure tolerance, and capacity.
Disaster recovery
Disaster recovery (DR) keeps operations running and minimizes data loss during outages. Common patterns include geo-redundancy and same-region active-active deployments. Alibaba Cloud regions and availability zones enable cost-effective DR architectures.
A complete DR strategy ensures data integrity and business continuity when the production center fails, enabling the standby center to take over with minimal downtime. Cloud Backup provides backup, DR protection, and policy-based archiving for resources such as ECS instances, databases hosted on ECS, file systems, and NAS.
Failure tolerance
Failure tolerance covers mechanisms in distributed systems that automatically detect, eliminate, or correct errors to maintain normal operation and improve system reliability.
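One common way to detect and isolate errors automatically is a circuit breaker. The following is a minimal sketch in Python, not a pattern prescribed by this guide: the class name `CircuitBreaker`, its parameters, and the default thresholds are all hypothetical. After a number of consecutive failures it stops calling the failing dependency, then allows one trial call once a timeout has passed.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    # Hypothetical defaults; tune them per dependency.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still isolating the dependency: fail fast.
                raise CircuitOpenError("dependency isolated; failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a remote call as `breaker.call(fetch_order, order_id)` (where `fetch_order` is a placeholder) keeps a failing dependency from tying up callers while it recovers.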
Capacity
Capacity is the maximum workload or data volume a system can handle in a given period. It depends on hardware, software, architecture, and network bandwidth. For cloud applications, monitor Alibaba Cloud service quotas per account to prevent quota-related outages. Use Quota Center to query and adjust quotas on demand.
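A quota check can be as simple as comparing usage against the limit and flagging anything close to exhaustion. The sketch below assumes the usage and limit values have already been fetched (for example, from Quota Center). It calls no Alibaba Cloud API, and the `QuotaUsage` type, the quota names, and the 80% warning ratio are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaUsage:
    name: str     # hypothetical quota identifier
    used: float
    limit: float


def quotas_at_risk(usages, warn_ratio=0.8):
    """Return (name, utilization) for quotas at or above warn_ratio, highest first."""
    risky = []
    for usage in usages:
        if usage.limit <= 0:
            raise ValueError(f"quota {usage.name!r} has a non-positive limit")
        ratio = usage.used / usage.limit
        if ratio >= warn_ratio:
            risky.append((usage.name, ratio))
    return sorted(risky, key=lambda item: item[1], reverse=True)
```

Running such a check on a schedule, and raising an alert for every entry it returns, leaves time to request a quota increase before a scale-out fails.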
Change management
A change is any addition, modification, or deletion that directly or indirectly affects services. Failed changes can cause business interruption and customer impact. Reduce change risks by following three principles: grayscale capability, observability, and rollback capability.
Grayscale release
A complete grayscale release mechanism rolls a change out gradually, which limits business impact and protects user experience if the change fails.
The grayscale release mechanism covers release methods, batch size, interval time, and observation criteria. Key considerations:
- Release interval: Set a reasonable interval. Overly long intervals may cause data inconsistency in downstream applications.
- Release method: Segment by user, region, or channel to maintain consistent user experience during rollout.
- Release batches: Start with a small-scale verification and gradually expand scope.
- Observation indicators: Define observable metrics during grayscale to identify issues early and prevent chain reactions.
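The considerations above can be sketched as a batched rollout with an observation gate between batches. This is an illustrative Python sketch, not a description of any Alibaba Cloud release tool: `plan_batches`, `rollout`, the callback parameters, and the default cumulative fractions are all hypothetical. Pass fractions that end in 1.0 so that every target is eventually covered.

```python
def plan_batches(targets, fractions=(0.01, 0.1, 0.5, 1.0)):
    """Split targets into growing batches; fractions are cumulative shares."""
    batches, done = [], 0
    for fraction in fractions:
        # Each step releases at least one more target, capped at the total.
        upto = max(done + 1, round(len(targets) * fraction))
        upto = min(upto, len(targets))
        if upto > done:
            batches.append(targets[done:upto])
            done = upto
    return batches


def rollout(targets, deploy, healthy, rollback,
            fractions=(0.01, 0.1, 0.5, 1.0), wait=lambda: None):
    """Deploy batch by batch; roll everything back if the health gate fails."""
    released = []
    for batch in plan_batches(targets, fractions):
        for target in batch:
            deploy(target)
            released.append(target)
        wait()  # observation interval between batches
        if not healthy(released):
            # Undo in reverse order so the newest changes are reverted first.
            for target in reversed(released):
                rollback(target)
            return False
    return True
```

With 100 targets and the default fractions, the batches contain 1, 9, 40, and 50 targets. The `healthy` callback is where the observation indicators are evaluated, and `wait` carries the release interval.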
Rollback capability
Rollback is the most effective recovery method for failed changes in emergencies.
When problems occur, ensuring business continuity is the top priority. While other solutions may exist, rollback remains the most predictable option in most cases.
Execute frequent, small, reversible changes. Large version differences may block rollback due to dependency conflicts.
Observability
Changes may affect the existing environment and upstream/downstream services. Make business status, call traces, and resources observable to detect issues early. During observation, monitor business metrics (such as order success rate) and application metrics (such as CPU, load, and exception count). Prioritize business metrics, because they reflect system status most directly. When business metrics become abnormal, application metrics often do too.
Prepare a monitoring checklist before each change. Continuously observe metrics during the change. After completion, compare pre-change and post-change metrics—declare success only when all metrics are normal.
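The pre-change and post-change comparison can be automated with a small checklist structure. The Python sketch below is an assumption about how such a check might look: the metric names, the direction labels, and the relative tolerances are illustrative and are not values taken from this guide.

```python
def compare_metrics(before, after, checklist):
    """Return a list of degraded metrics; an empty list means the change passed.

    checklist maps metric name -> (direction, max_relative_change), where
    direction is "higher_is_better" or "lower_is_better".
    """
    failures = []
    for name, (direction, tolerance) in checklist.items():
        if name not in before or name not in after:
            # A metric that cannot be observed is treated as a failure.
            failures.append(f"{name}: missing data")
            continue
        base, current = before[name], after[name]
        if direction == "higher_is_better":
            degraded = current < base * (1 - tolerance)
        else:
            degraded = current > base * (1 + tolerance)
        if degraded:
            failures.append(f"{name}: {base} -> {current}")
    return failures
```

Listing business metrics first in the checklist keeps them at the top of the report, in line with the priority described above.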
Alibaba Cloud best practices
Alibaba Cloud's Cloud Excellence (BizDevOps) services support grayscale release and rollback, which keeps releases controllable. To monitor a change, use services such as CloudMonitor, Application Real-Time Monitoring Service (ARMS), and Log Service (SLS) to observe affected resources, traces, and business status.
Emergency response
The emergency response mechanism defines standard operating procedures after an incident. Alibaba Cloud's 1-5-10 mechanism—detect faults within 1 minute, organize relevant personnel for preliminary investigation within 5 minutes, and carry out the recovery process within 10 minutes—serves as a reference for enterprises designing their own response processes. All parties must be clear about their roles and responsibilities when an incident occurs.
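After an incident, the timeline can be checked against the 1-5-10 targets. The sketch below is illustrative only. It assumes each target is measured from the moment the fault started, which the text above does not state explicitly, and the function name and stage labels are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: every target is measured from the start of the fault.
TARGETS = {
    "detected": timedelta(minutes=1),
    "responded": timedelta(minutes=5),
    "recovered": timedelta(minutes=10),
}


def check_1_5_10(started_at, milestones):
    """Return {stage: (elapsed, met)}; a missing milestone counts as not met."""
    report = {}
    for stage, target in TARGETS.items():
        reached_at = milestones.get(stage)
        if reached_at is None:
            report[stage] = (None, False)
            continue
        elapsed = reached_at - started_at
        report[stage] = (elapsed, elapsed <= target)
    return report
```

Feeding the report into the post-incident review shows which stage (detection, response, or recovery) missed its target.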
Fault detection
The earlier a fault is detected, the faster you can respond. Achieve instant detection through:
- Unified alerting. Notify relevant personnel (system administrators, operations staff) through SMS, email, DingTalk, etc., immediately after a fault is identified.
- Monitoring dashboard. Display system health status graphically in real-time. When a fault occurs, the dashboard shows errors and provides data for troubleshooting.
- Risk prediction. Use data analysis and machine learning to predict risks before faults occur. Predictions help identify root causes quickly during emergency response.
Fault response
After detecting a fault, locate the problem quickly using these practices:
- Coordination. Quickly organize responders, set up a command center, define the emergency process, and assign tasks so everyone knows their roles.
- Alarm correlation analysis. Correlate alerts to determine fault scope, impact, and root cause. Techniques include event correlation analysis and machine learning.
- Knowledge graph. Organize system data into a unified graph to quickly locate and resolve issues. Technologies include natural language processing and graph databases.
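As a very simple form of alarm correlation, alerts that arrive close together in time can be grouped into one candidate incident. The Python sketch below shows only this time-window grouping. Production systems typically also use topology and learned patterns. The `Alert` type and the 60-second window are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some reference point
    resource: str
    message: str


def correlate(alerts, window_seconds=60.0):
    """Group alerts whose gap from the previous alert is within the window."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)  # continues the current burst
        else:
            groups.append([alert])    # starts a new candidate incident
    return groups
```

The resources in a group hint at the scope of the fault, and the earliest alert in a group is often a reasonable place to start looking for the root cause, although it is not proof of one.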
Fault recovery
After identifying the root cause, recover business operations quickly and then conduct a post-incident review.
- Plan execution. Follow the predefined emergency plan, including response processes, role responsibilities, and handling procedures, to standardize fault recovery.
- Fault self-healing. The system detects faults and recovers automatically. For example, container technology can automatically migrate workloads to resolve faults.
- Fault review. Analyze and summarize incidents to prevent recurrence. Record causes, impacts, and handling processes, then formulate improvement measures.
Fault drill practices
Fault drills validate software resilience by actively injecting faults. Benefits include discovering system risks in advance, improving test quality, refining contingency plans, strengthening monitoring and alerting, and accelerating recovery. Fault drills build confidence in production operations and fall into three categories: disaster recovery drills for plan verification, red team exercises (red-blue attack and defense drills) for stability acceptance, and emergency response drills for fault verification.
Disaster recovery drills
Disaster recovery drills simulate instance, data center, or regional-level faults to verify system-level DR capability. They help validate Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targets and surface related issues early. Alibaba Cloud Application High Availability Service (AHAS) supports fault injection for applications to enhance stability.
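A drill's timestamps can be turned into achieved RPO and RTO figures and compared with the targets. The following Python sketch is illustrative: the function name, the parameter names, and the way each timestamp is obtained are assumptions, not part of AHAS or any other Alibaba Cloud service.

```python
from datetime import datetime, timedelta


def evaluate_drill(last_recoverable_point, fault_injected_at, service_restored_at,
                   rpo_target, rto_target):
    """Compare the achieved RPO and RTO of one drill against the targets."""
    # Achieved RPO: how much data, measured in time, would have been lost.
    achieved_rpo = fault_injected_at - last_recoverable_point
    # Achieved RTO: how long the service was unavailable.
    achieved_rto = service_restored_at - fault_injected_at
    return {
        "rpo": achieved_rpo,
        "rto": achieved_rto,
        "rpo_met": achieved_rpo <= rpo_target,
        "rto_met": achieved_rto <= rto_target,
    }
```

Recording these results for each drill makes it easy to see whether DR capability is improving or regressing over time.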
Red team exercises
Red team exercises originate from military training where a defending side (red team) faces an attacking side (blue team). Applied to stability, the blue team discovers vulnerabilities from a third-party perspective and injects faults into software and hardware, continuously verifying business system reliability. The red team responds according to predefined fault response and emergency processes. After each exercise, review three stages (discovery, response, and recovery) and summarize improvement actions.
Emergency response drills
In emergency response drills, the blue team's targets and methods are opaque to the red team, creating full confrontation. This tests the team's ability to detect unexpected faults and respond effectively. The blue team must understand both the system's weaknesses and its business logic. The red team must detect and fix faults rapidly. Compared with planned drills, emergency drills involve more complex personnel, scenarios, and processes. Keep the drill plan confidential, and fully evaluate impact controls in case the response team fails to contain faults promptly.