Disaster recovery

Disaster recovery keeps business systems running during a disaster while minimizing data loss. On Alibaba Cloud, this means maintaining data integrity and restoring normal operations as quickly as possible when a production center fails.

Key concepts

Before selecting a disaster recovery strategy, align on the following terms:

| Term | Definition |
| --- | --- |
| Recovery Time Objective (RTO) | The maximum acceptable time to restore a system after a failure. A shorter RTO requires higher investment in standby infrastructure. |
| Recovery Point Objective (RPO) | The maximum acceptable amount of data loss, measured in time. A shorter RPO requires more frequent data replication. |
| Production center | The primary deployment that handles live traffic under normal conditions. |
| Standby center | The secondary deployment that takes over when the production center fails. |
| Active-active | Both centers handle live traffic simultaneously. If one center fails, the other is already serving traffic, so nothing has to be started before it takes over. |
| Active-passive | Only the production center handles traffic. The standby center activates only during a failover event. |
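
To make the RPO definition concrete, the following sketch estimates worst-case data loss for periodic replication and compares it with a target. It assumes that each copy captures the state at the moment it starts, so a failure just before a copy finishes loses one full interval plus the copy duration. The function names and the sample figures are illustrative only; they are not Alibaba Cloud APIs or recommended values.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta,
                         copy_duration: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on lost data, measured in time.

    Assumption: a copy reflects the state at the moment it started.
    """
    return interval + copy_duration


def meets_targets(rto: timedelta, rpo: timedelta,
                  measured_recovery: timedelta,
                  interval: timedelta,
                  copy_duration: timedelta = timedelta(0)) -> bool:
    """True when a measured recovery drill and the replication schedule
    both fit inside the agreed RTO and RPO."""
    return (measured_recovery <= rto
            and worst_case_data_loss(interval, copy_duration) <= rpo)


# Hypothetical figures: hourly backups that take 10 minutes to complete.
loss = worst_case_data_loss(timedelta(hours=1), timedelta(minutes=10))
print(loss)  # 1:10:00
```

Under these assumptions, an hourly schedule cannot satisfy a one-hour RPO, which is why a shorter RPO pushes toward more frequent or continuous replication.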

Prioritize by workload criticality

Before setting RTO and RPO targets, categorize your workload by business criticality. Each tier justifies a different level of investment and recovery speed.

| Criticality tier | Examples | Recovery approach |
| --- | --- | --- |
| Mission-critical | Core transaction systems, payment processing | Aggressive RTO/RPO targets; active-active recommended |
| Business-critical | Internal platforms, analytics pipelines | Moderate targets; warm standby acceptable |
| Business operational | Development environments, batch jobs | Relaxed targets; cold standby or backup-restore sufficient |

Higher-criticality workloads demand faster recovery and more frequent data replication. Lower-criticality workloads can tolerate longer restoration windows. Define RTO and RPO targets after you know which tier your workload belongs to.
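
One way to keep the tiering explicit is to record it as data next to the workload inventory. The sketch below encodes the three tiers and their recovery approaches from this section; the workload names in `WORKLOAD_TIERS` are hypothetical placeholders, and the structure is an assumption rather than a prescribed format.

```python
from enum import Enum


class Tier(Enum):
    MISSION_CRITICAL = "mission-critical"
    BUSINESS_CRITICAL = "business-critical"
    BUSINESS_OPERATIONAL = "business operational"


# Recovery approach per tier, as described in this section.
RECOVERY_APPROACH = {
    Tier.MISSION_CRITICAL: "aggressive RTO/RPO targets; active-active recommended",
    Tier.BUSINESS_CRITICAL: "moderate targets; warm standby acceptable",
    Tier.BUSINESS_OPERATIONAL: "relaxed targets; cold standby or backup-restore sufficient",
}

# Hypothetical workload inventory; replace with your own systems.
WORKLOAD_TIERS = {
    "payment-processing": Tier.MISSION_CRITICAL,
    "analytics-pipeline": Tier.BUSINESS_CRITICAL,
    "dev-environment": Tier.BUSINESS_OPERATIONAL,
}


def recovery_approach(workload: str) -> str:
    """Look up the recovery approach implied by a workload's tier."""
    return RECOVERY_APPROACH[WORKLOAD_TIERS[workload]]


print(recovery_approach("analytics-pipeline"))
```

A workload that is missing from the inventory raises `KeyError`, which surfaces unclassified systems early instead of silently assigning them a default tier.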

Disaster recovery strategies

Alibaba Cloud supports a spectrum of disaster recovery strategies. Cost and complexity increase as RTO/RPO targets shrink.

| Strategy | RTO/RPO | Cost | Description |
| --- | --- | --- | --- |
| Backup and restore | Hours | Low | Data is backed up to OSS or another region. Restoration requires deploying infrastructure and restoring data from backup. No standby resources run continuously. |
| Warm standby | Minutes | Medium | A reduced-capacity environment runs continuously in the standby region. During failover, it scales up to handle full production traffic. |
| Same-city active-active | Near-zero | High | Two data centers in the same city both handle live traffic. If one fails, the other absorbs all traffic without a failover step. Low latency between centers makes synchronous replication practical. |
| Cross-region active-active | Near-zero | Highest | Two data centers in different regions both handle live traffic. Protects against region-wide outages. Higher latency between regions means replication is typically asynchronous, which introduces a small RPO window. |

Same-city active-active vs. cross-region active-active

Both strategies eliminate idle standby capacity, but they protect against different failure scenarios:

  • Same-city active-active protects against a single data center failure. The two centers are close enough for synchronous replication, so RPO is effectively zero. Choose this when you need near-zero RTO/RPO but do not need to survive a city-wide or region-wide outage.

  • Cross-region active-active protects against a full regional outage, such as a natural disaster or large-scale infrastructure failure. Because the regions are geographically distant, network latency is higher and synchronous replication is impractical for most workloads, so a small RPO gap is expected. Choose this when business continuity must survive events that take down an entire Alibaba Cloud region.

Choose a strategy

Use the following decision path:

  1. Determine your criticality tier — Is the workload mission-critical, business-critical, or business operational?

  2. Define RTO and RPO targets — How much downtime and data loss can the business tolerate?

  3. Assess your failure scope — Do you need to survive a single data center failure, or a full regional outage?

  4. Match strategy to targets — Select the strategy whose RTO/RPO characteristics meet your targets at an acceptable cost.
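
The decision path above can be sketched as a small selection function. This is a minimal illustration, not an official sizing tool: it encodes the four strategies with the coarse RTO/RPO classes and relative cost ranks from this page, and returns the cheapest one that meets the target and the required failure scope. The `Strategy` fields and `choose_strategy` are hypothetical names, and marking backup and restore as surviving a regional outage assumes the backups are stored in another region.

```python
from dataclasses import dataclass
from enum import IntEnum


class RecoveryClass(IntEnum):
    """Coarse RTO/RPO classes; a lower value means faster recovery."""
    NEAR_ZERO = 0
    MINUTES = 1
    HOURS = 2


@dataclass(frozen=True)
class Strategy:
    name: str
    recovery: RecoveryClass
    cost_rank: int                   # 1 = low ... 4 = highest
    survives_regional_outage: bool


STRATEGIES = [
    # Assumption: backups are kept in another region.
    Strategy("Backup and restore", RecoveryClass.HOURS, 1, True),
    Strategy("Warm standby", RecoveryClass.MINUTES, 2, True),
    Strategy("Same-city active-active", RecoveryClass.NEAR_ZERO, 3, False),
    Strategy("Cross-region active-active", RecoveryClass.NEAR_ZERO, 4, True),
]


def choose_strategy(target: RecoveryClass,
                    must_survive_regional_outage: bool) -> Strategy:
    """Return the lowest-cost strategy that meets the target class and scope."""
    candidates = [
        s for s in STRATEGIES
        if s.recovery <= target
        and (s.survives_regional_outage or not must_survive_regional_outage)
    ]
    return min(candidates, key=lambda s: s.cost_rank)


print(choose_strategy(RecoveryClass.NEAR_ZERO, False).name)
```

Because the function always picks the cheapest candidate, tightening either input (a faster recovery class or a wider failure scope) can only keep the result the same or move it to a more expensive strategy.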

If your workload has multiple components with different criticality levels, document the recovery approach for each component separately. During an actual disaster, ambiguity about which components to recover first increases recovery time.
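
A per-component plan can be as simple as a list that is sorted by tier before a drill or an incident. The sketch below assumes the three criticality tiers from this page; the component names and the `Component` structure are hypothetical examples.

```python
from dataclasses import dataclass

# Lower number = recover first.
TIER_PRIORITY = {
    "mission-critical": 0,
    "business-critical": 1,
    "business operational": 2,
}


@dataclass(frozen=True)
class Component:
    name: str
    tier: str
    recovery_approach: str


def recovery_order(components: list[Component]) -> list[Component]:
    """Sort components so that higher-criticality tiers are recovered first.

    sorted() is stable, so components in the same tier keep their listed order.
    """
    return sorted(components, key=lambda c: TIER_PRIORITY[c.tier])


# Hypothetical components of a single workload.
plan = [
    Component("nightly-batch", "business operational", "backup and restore"),
    Component("payment-api", "mission-critical", "active-active"),
    Component("reporting", "business-critical", "warm standby"),
]

for component in recovery_order(plan):
    print(f"{component.name}: {component.recovery_approach}")
```

Keeping this list in version control alongside the runbook gives responders one agreed order to follow during a failover.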