Disaster recovery - Well-Architected Framework (WAF) - Alibaba Cloud Help Center

Disaster recovery keeps business systems running during a disaster while minimizing data loss. On Alibaba Cloud, this means maintaining data integrity and restoring normal operations as quickly as possible when a production center fails.

Key concepts

Before selecting a disaster recovery strategy, align on the following terms:

Term	Definition
Recovery Time Objective (RTO)	The maximum acceptable time to restore a system after a failure. A shorter RTO requires higher investment in standby infrastructure.
Recovery Point Objective (RPO)	The maximum acceptable amount of data loss measured in time. A shorter RPO requires more frequent data replication.
Production center	The primary deployment that handles live traffic under normal conditions.
Standby center	The secondary deployment that takes over when the production center fails.
Active-active	Both centers handle live traffic simultaneously. If one center fails, the other is already serving traffic, so failover is close to immediate.
Active-passive	Only the production center handles traffic. The standby center activates only during a failover event.
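
To make the link between RPO and replication frequency concrete, the following Python sketch checks whether a periodic replication schedule can meet an RPO target. It assumes a simplified model in which the worst-case data loss equals the replication interval plus the time one copy takes to complete. The function names and the example values are illustrative and are not part of any Alibaba Cloud API.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta,
                         copy_duration: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on lost data, measured in time, for periodic replication.

    Simplified model (an assumption): a failure just before the next copy
    finishes loses everything written since the previous copy started.
    """
    return interval + copy_duration


def meets_rpo(interval: timedelta, rpo: timedelta,
              copy_duration: timedelta = timedelta(0)) -> bool:
    """Return True if the schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(interval, copy_duration) <= rpo


# Hypothetical targets: a 1-hour RPO with two candidate schedules.
rpo = timedelta(hours=1)
print(meets_rpo(timedelta(minutes=15), rpo, timedelta(minutes=5)))  # 20 min worst case
print(meets_rpo(timedelta(hours=6), rpo))                           # 6 h worst case
```

In this model, tightening the RPO leaves only two levers: replicate more often or make each copy finish faster.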

Prioritize by workload criticality

Before setting RTO and RPO targets, categorize your workload by business criticality. Each tier justifies a different level of investment and recovery speed.

Criticality tier	Examples	Recovery approach
Mission-critical	Core transaction systems, payment processing	Aggressive RTO/RPO targets; active-active recommended
Business-critical	Internal platforms, analytics pipelines	Moderate targets; warm standby acceptable
Business operational	Development environments, batch jobs	Relaxed targets; cold standby or backup-restore sufficient

Higher-criticality workloads demand faster recovery and more frequent data replication. Lower-criticality workloads can tolerate longer restoration windows. Define RTO and RPO targets after you know which tier your workload belongs to.

Disaster recovery strategies

Alibaba Cloud supports a spectrum of disaster recovery strategies. Cost and complexity increase as RTO/RPO targets shrink.

Strategy	RTO/RPO	Cost	Description
Backup and restore	Hours	Low	Data is backed up to OSS or another region. Restoration requires deploying infrastructure and restoring data from backup. No standby resources run continuously.
Warm standby	Minutes	Medium	A reduced-capacity environment runs continuously in the standby region. During failover, it scales up to handle full production traffic.
Same-city active-active	Near-zero	High	Two data centers in the same city both handle live traffic. If one fails, the other absorbs all traffic without a failover step. Low latency between centers makes synchronous replication practical.
Cross-region active-active	Near-zero	Highest	Two data centers in different regions both handle live traffic. Protects against region-wide outages. Higher latency between regions means replication is typically asynchronous, which introduces a small RPO window.

Same-city active-active vs. cross-region active-active

Both strategies eliminate idle standby capacity, but they protect against different failure scenarios:

Same-city active-active protects against a single data center failure. The two centers are close enough for synchronous replication, so RPO is effectively zero. Choose this when you need near-zero RTO/RPO but do not need to survive a city-wide or region-wide outage.
Cross-region active-active protects against a full regional outage, such as a natural disaster or large-scale infrastructure failure. Because the regions are geographically distant, network latency is higher and synchronous replication is impractical for most workloads, so a small RPO gap is expected. Choose this when business continuity must survive events that take down an entire Alibaba Cloud region.
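
The latency trade-off described above can be illustrated with a short Python sketch. With synchronous replication, a write is acknowledged only after the remote center confirms it, so each commit takes at least one network round trip longer. The round-trip times below are hypothetical placeholders, not measured Alibaba Cloud figures. Substitute values measured between your own centers.

```python
def sync_commit_latency_ms(local_write_ms: float, round_trip_ms: float) -> float:
    """Minimum commit latency when every write waits for one remote acknowledgment."""
    return local_write_ms + round_trip_ms


def max_serial_commits_per_second(commit_latency_ms: float) -> float:
    """Upper bound on back-to-back commits on a single connection."""
    return 1000.0 / commit_latency_ms


# Hypothetical round-trip times, for illustration only.
SAME_CITY_RTT_MS = 2.0
CROSS_REGION_RTT_MS = 30.0
LOCAL_WRITE_MS = 1.0

for label, rtt in [("same-city", SAME_CITY_RTT_MS),
                   ("cross-region", CROSS_REGION_RTT_MS)]:
    latency = sync_commit_latency_ms(LOCAL_WRITE_MS, rtt)
    rate = max_serial_commits_per_second(latency)
    print(f"{label}: {latency:.1f} ms per commit, about {rate:.0f} serial commits/s")
```

Under these assumed numbers, the per-commit cost of waiting on a distant region is an order of magnitude higher, which is why cross-region deployments usually replicate asynchronously and accept a small RPO window.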

Choose a strategy

Use the following decision path:

1. Determine your criticality tier: Is the workload mission-critical, business-critical, or business operational?
2. Define RTO and RPO targets: How much downtime and data loss can the business tolerate?
3. Assess your failure scope: Do you need to survive a single data center failure, or a full regional outage?
4. Match strategy to targets: Select the strategy whose RTO/RPO characteristics meet your targets at an acceptable cost.
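
The decision path above can be sketched in Python as a filter over the strategy table followed by a cost comparison. This is a minimal illustration, not an official selection tool. The recovery classes and cost ranks are taken from the strategy table, while the `Strategy` type, the `choose_strategy` function, and the assumption that backups are copied to another region are introduced here for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Recovery classes from the strategy table, ordered from fastest to slowest.
RECOVERY_ORDER = {"near-zero": 0, "minutes": 1, "hours": 2}


@dataclass(frozen=True)
class Strategy:
    name: str
    recovery: str                  # RTO/RPO class from the strategy table
    cost_rank: int                 # 1 = low ... 4 = highest
    survives_region_outage: bool


STRATEGIES = [
    # Assumption: backups are copied to another region, not only kept locally.
    Strategy("Backup and restore", "hours", 1, True),
    Strategy("Warm standby", "minutes", 2, True),
    Strategy("Same-city active-active", "near-zero", 3, False),
    # Asynchronous replication leaves a small RPO window for this strategy.
    Strategy("Cross-region active-active", "near-zero", 4, True),
]


def choose_strategy(target: str,
                    must_survive_region_outage: bool) -> Optional[Strategy]:
    """Return the cheapest strategy that meets the target and failure scope."""
    candidates = [
        s for s in STRATEGIES
        if RECOVERY_ORDER[s.recovery] <= RECOVERY_ORDER[target]
        and (s.survives_region_outage or not must_survive_region_outage)
    ]
    return min(candidates, key=lambda s: s.cost_rank, default=None)


print(choose_strategy("near-zero", must_survive_region_outage=False).name)
print(choose_strategy("near-zero", must_survive_region_outage=True).name)
print(choose_strategy("minutes", must_survive_region_outage=True).name)
```

Picking the lowest cost rank among the qualifying strategies reflects step 4. A real decision would also weigh operational complexity and the exact RTO and RPO figures agreed with the business.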

If your workload has multiple components with different criticality levels, document the recovery approach for each component separately. During an actual disaster, ambiguity about which components to recover first increases recovery time.
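
One way to remove that ambiguity is to keep the per-component plan in a machine-readable form and derive the recovery order from it. The Python sketch below assumes the three criticality tiers defined earlier. The component names and the `Component` type are hypothetical examples.

```python
from dataclasses import dataclass

# Lower number means recover first; tiers follow the criticality table.
TIER_PRIORITY = {
    "mission-critical": 0,
    "business-critical": 1,
    "business operational": 2,
}


@dataclass(frozen=True)
class Component:
    name: str
    tier: str
    approach: str


def recovery_order(components: list[Component]) -> list[Component]:
    """Sort components so that higher-criticality tiers are recovered first.

    sorted() is stable, so components in the same tier keep their listed order.
    """
    return sorted(components, key=lambda c: TIER_PRIORITY[c.tier])


# Hypothetical workload with mixed criticality levels.
plan = [
    Component("dev-environment", "business operational", "backup and restore"),
    Component("payment-api", "mission-critical", "active-active"),
    Component("analytics-pipeline", "business-critical", "warm standby"),
]

for step, component in enumerate(recovery_order(plan), start=1):
    print(f"{step}. {component.name} ({component.tier}): {component.approach}")
```

Keeping this list under version control alongside the runbook gives responders a single agreed order to follow during an incident.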