Disaster recovery
Disaster recovery keeps business systems running during a disaster while minimizing data loss. On Alibaba Cloud, this means maintaining data integrity and restoring normal operations as quickly as possible when a production center fails.
Key concepts
Before selecting a disaster recovery strategy, align on the following terms:
| Term | Definition |
| --- | --- |
| Recovery Time Objective (RTO) | The maximum acceptable time to restore a system after a failure. A shorter RTO requires higher investment in standby infrastructure. |
| Recovery Point Objective (RPO) | The maximum acceptable amount of data loss measured in time. A shorter RPO requires more frequent data replication. |
| Production center | The primary deployment that handles live traffic under normal conditions. |
| Standby center | The secondary deployment that takes over when the production center fails. |
| Active-active | Both centers handle live traffic simultaneously. Failover is immediate because no center is idle. |
| Active-passive | Only the production center handles traffic. The standby center activates only during a failover event. |
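To make the RTO and RPO definitions concrete, the following minimal sketch checks a plan against both targets. It assumes periodic replication, where a failure just before the next run loses up to one full interval of data. The function names and sample durations are hypothetical and chosen only for illustration.

```python
from datetime import timedelta


def worst_case_data_loss(replication_interval: timedelta) -> timedelta:
    """Assumption: with periodic replication, a failure just before the
    next run loses up to one full interval of data."""
    return replication_interval


def meets_rpo(replication_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(replication_interval) <= rpo


def meets_rto(estimated_restore_time: timedelta, rto: timedelta) -> bool:
    """True if the estimated restore time stays within the RTO."""
    return estimated_restore_time <= rto


# Illustrative values only: an hourly backup checked against two RPO targets.
hourly_backup = timedelta(hours=1)
print(meets_rpo(hourly_backup, rpo=timedelta(minutes=15)))  # False
print(meets_rpo(hourly_backup, rpo=timedelta(hours=4)))     # True
```

The first check fails because an hourly schedule can lose up to 60 minutes of data, which is why a shorter RPO calls for more frequent replication.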
Prioritize by workload criticality
Before setting RTO and RPO targets, categorize your workload by business criticality. Each tier justifies a different level of investment and recovery speed.
| Criticality tier | Examples | Recovery approach |
| --- | --- | --- |
| Mission-critical | Core transaction systems, payment processing | Aggressive RTO/RPO targets; active-active recommended |
| Business-critical | Internal platforms, analytics pipelines | Moderate targets; warm standby acceptable |
| Business operational | Development environments, batch jobs | Relaxed targets; cold standby or backup-restore sufficient |
Higher-criticality workloads demand faster recovery and more frequent data replication. Lower-criticality workloads can tolerate longer restoration windows. Define RTO and RPO targets after you know which tier your workload belongs to.
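One way to keep tier assignments explicit is to record them as data. The sketch below mirrors the tier table as a lookup; the `Tier` enum, the `INVENTORY` dictionary, and the workload names in it are hypothetical examples, not part of any Alibaba Cloud API.

```python
from enum import Enum


class Tier(Enum):
    MISSION_CRITICAL = "Mission-critical"
    BUSINESS_CRITICAL = "Business-critical"
    BUSINESS_OPERATIONAL = "Business operational"


# Recovery approach per tier, copied from the table above.
RECOVERY_APPROACH = {
    Tier.MISSION_CRITICAL: "Aggressive RTO/RPO targets; active-active recommended",
    Tier.BUSINESS_CRITICAL: "Moderate targets; warm standby acceptable",
    Tier.BUSINESS_OPERATIONAL: "Relaxed targets; cold standby or backup-restore sufficient",
}

# Hypothetical workload inventory, using the table's example workloads.
INVENTORY = {
    "payment processing": Tier.MISSION_CRITICAL,
    "analytics pipelines": Tier.BUSINESS_CRITICAL,
    "batch jobs": Tier.BUSINESS_OPERATIONAL,
}


def recovery_approach(workload: str) -> str:
    """Look up the recovery approach for a workload through its tier."""
    return RECOVERY_APPROACH[INVENTORY[workload]]


for name in INVENTORY:
    print(f"{name}: {recovery_approach(name)}")
```

Looking up an unlisted workload raises `KeyError`, which is deliberate in this sketch: a workload without an assigned tier has no agreed recovery approach yet.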
Disaster recovery strategies
Alibaba Cloud supports a spectrum of disaster recovery strategies. Cost and complexity increase as RTO/RPO targets shrink.
| Strategy | RTO/RPO | Cost | Description |
| --- | --- | --- | --- |
| Backup and restore | Hours | Low | Data is backed up to OSS or another region. Restoration requires deploying infrastructure and restoring data from backup. No standby resources run continuously. |
| Warm standby | Minutes | Medium | A reduced-capacity environment runs continuously in the standby region. During failover, it scales up to handle full production traffic. |
| Same-city active-active | Near-zero | High | Two data centers in the same city both handle live traffic. If one fails, the other absorbs all traffic without a failover step. Low latency between centers makes synchronous replication practical. |
| Cross-region active-active | Near-zero | Highest | Two data centers in different regions both handle live traffic. Protects against region-wide outages. Higher latency between regions means replication is typically asynchronous, which introduces a small RPO window. |
Same-city active-active vs. cross-region active-active
Both strategies eliminate idle standby capacity, but they protect against different failure scenarios:
Same-city active-active protects against a single data center failure. The two centers are close enough for synchronous replication, so RPO is effectively zero. Choose this when you need near-zero RTO/RPO but do not need to survive a city-wide or region-wide outage.
Cross-region active-active protects against a full regional outage, such as a natural disaster or large-scale infrastructure failure. Because the regions are geographically distant, network latency is higher and synchronous replication is impractical for most workloads, so a small RPO gap is expected. Choose this when business continuity must survive events that take down an entire Alibaba Cloud region.
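The comparison above can be sketched as a small lookup that records, for each active-active mode, the largest failure it survives and its typical replication style. The `FailureScope` enum, the `MODES` table, and the function names are illustrative assumptions, not an Alibaba Cloud interface.

```python
from enum import IntEnum


class FailureScope(IntEnum):
    DATA_CENTER = 1  # a single data center fails
    REGION = 2       # an entire region is unavailable


# Listed from lower to higher cost, following the strategy descriptions.
MODES = {
    "same-city active-active": {
        "max_scope": FailureScope.DATA_CENTER,
        "replication": "synchronous",   # RPO effectively zero
    },
    "cross-region active-active": {
        "max_scope": FailureScope.REGION,
        "replication": "asynchronous",  # small RPO gap expected
    },
}


def survives(mode: str, scope: FailureScope) -> bool:
    """True if the mode keeps serving traffic after a failure of this scope."""
    return MODES[mode]["max_scope"] >= scope


def pick_active_active(required_scope: FailureScope) -> str:
    """Return the first (lowest-cost) mode that survives the required scope."""
    for mode in MODES:  # dicts preserve insertion order
        if survives(mode, required_scope):
            return mode
    raise ValueError(f"no mode survives {required_scope.name}")


print(pick_active_active(FailureScope.DATA_CENTER))
print(pick_active_active(FailureScope.REGION))
```

The sketch prefers same-city active-active whenever it is sufficient, because it is the lower-cost option and allows synchronous replication.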
Choose a strategy
Use the following decision path:
1. Determine your criticality tier: Is the workload mission-critical, business-critical, or business operational?
2. Define RTO and RPO targets: How much downtime and data loss can the business tolerate?
3. Assess your failure scope: Do you need to survive a single data center failure, or a full regional outage?
4. Match strategy to targets: Select the strategy whose RTO/RPO characteristics meet your targets at an acceptable cost.
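The last two steps of the decision path can be sketched as a filter followed by a cost comparison. The recovery classes and cost ranks below follow the strategy table. The `survives_region_outage` flags are assumptions: in particular, backup and restore is treated as region-safe only on the assumption that backups are copied to another region. All names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class RecoveryClass(IntEnum):
    """Coarse RTO/RPO classes from the strategy table; lower is faster."""
    NEAR_ZERO = 0
    MINUTES = 1
    HOURS = 2


@dataclass(frozen=True)
class Strategy:
    name: str
    recovery: RecoveryClass
    cost_rank: int                # 1 = Low ... 4 = Highest
    survives_region_outage: bool  # assumption, see lead-in


STRATEGIES = [
    # Assumes backups are copied to another region.
    Strategy("Backup and restore", RecoveryClass.HOURS, 1, True),
    # The reduced-capacity environment runs in the standby region.
    Strategy("Warm standby", RecoveryClass.MINUTES, 2, True),
    Strategy("Same-city active-active", RecoveryClass.NEAR_ZERO, 3, False),
    Strategy("Cross-region active-active", RecoveryClass.NEAR_ZERO, 4, True),
]


def choose_strategy(target: RecoveryClass, must_survive_region_outage: bool) -> Strategy:
    """Return the lowest-cost strategy that meets the target and failure scope."""
    candidates = [
        s for s in STRATEGIES
        if s.recovery <= target
        and (s.survives_region_outage or not must_survive_region_outage)
    ]
    if not candidates:
        raise ValueError("no strategy meets the stated requirements")
    return min(candidates, key=lambda s: s.cost_rank)


print(choose_strategy(RecoveryClass.NEAR_ZERO, must_survive_region_outage=False).name)
print(choose_strategy(RecoveryClass.HOURS, must_survive_region_outage=True).name)
```

Real targets are durations rather than three coarse classes, so treat this as a way to structure the discussion rather than a sizing tool.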
If your workload has multiple components with different criticality levels, document the recovery approach for each component separately. During an actual disaster, ambiguity about which components to recover first increases recovery time.
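As a minimal illustration of documenting recovery order per component, the sketch below sorts components by criticality tier so the sequence is decided before an incident. The component names and the `TIER_PRIORITY` mapping are hypothetical.

```python
# Lower number = recover first. Tier names follow the criticality table.
TIER_PRIORITY = {
    "mission-critical": 0,
    "business-critical": 1,
    "business operational": 2,
}


def recovery_order(components: dict[str, str]) -> list[str]:
    """Order component names by tier priority, then alphabetically for stability."""
    return sorted(components, key=lambda name: (TIER_PRIORITY[components[name]], name))


# Hypothetical application with components in different tiers.
components = {
    "reporting": "business-critical",
    "checkout": "mission-critical",
    "nightly-batch": "business operational",
}

for step, name in enumerate(recovery_order(components), start=1):
    print(f"{step}. {name} ({components[name]})")
```

Keeping this ordering in version control next to the runbook means that the order of recovery does not have to be debated during an outage.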