Cross-region disaster recovery

Cross-region disaster recovery (DR) replicates ECS instances to a secondary region. If a regional outage occurs, workloads fail over to the DR site with an RPO as low as 1 minute and an RTO of approximately 15 minutes.

Preparations

Before you implement cross-region DR, select a destination region that differs from your production region. In that region, create a virtual private cloud (VPC) that contains a replication vSwitch and a recovery vSwitch.
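
The prerequisites above can be checked on paper before you open the console. The following is a minimal sketch, not a Cloud Backup API: the VSwitch class, the check_dr_network function, and all IDs, region names, and zone names are hypothetical placeholders for values you would copy from your own VPC plan. The same-zone check reflects the RTO recommendation given for the Replication Network and Recovery Network parameters.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class VSwitch:
    """Hypothetical record of a planned vSwitch (not an SDK type)."""
    vswitch_id: str
    vpc_id: str
    zone: str


def check_dr_network(production_region: str, dr_region: str,
                     replication: VSwitch, recovery: VSwitch) -> list[str]:
    """Return a list of findings; an empty list means the plan looks consistent."""
    findings = []
    if production_region == dr_region:
        findings.append("DR region must differ from the production region")
    if replication.vpc_id != recovery.vpc_id:
        findings.append("replication and recovery vSwitches should be in the same VPC")
    if replication.zone != recovery.zone:
        # Advisory only: the same zone is recommended for optimal RTO.
        findings.append("use the same zone for both vSwitches for optimal RTO")
    return findings


if __name__ == "__main__":
    # Placeholder values for illustration only.
    rep = VSwitch("vsw-replication", "vpc-dr", "zone-a")
    rec = VSwitch("vsw-recovery", "vpc-dr", "zone-a")
    print(check_dr_network("region-prod", "region-dr", rep, rec))
```

A single vSwitch can be passed for both arguments, because the replication and recovery networks can share the same vSwitch.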

Step 1: Create a DR site pair

To create a cross-region DR site pair for your ECS instances:

  1. Log on to the Cloud Backup console.

  2. Select Disaster Recovery > ECS Disaster Recovery, then click Switch to CDR in the upper-left corner of the page.

  3. Click Add, select Cross-region Disaster Recovery as the type, and enter the Production Site Information and DR Site Information.

  4. Click Create.

Step 2: Add the servers to be protected

After the DR site pair is created, add the ECS instances to protect:

  1. Click the Protected Server tab and confirm the disaster recovery site pair information in the upper-right corner.

  2. Click Add, and select the ECS instances to protect.

  3. Click OK. The server status changes from Agent Installing to Initialized.

    Note

    If the server status does not show Initialized, click More > Server Operation > Restart Server to complete client initialization.

Step 3: Start replication

To start replicating ECS instances to the DR site:

  1. Click the Protected Server tab. In the Actions column for the target server, choose More > Failover > Start Replication.

  2. In the Enable Replication panel, configure the following parameters, then click Start.

    Parameters and their descriptions:

    • Recovery Point Policy: Interval, in hours, at which Cloud Backup creates a recovery point each day.

    • Hard Disk Type: Valid values: Ultra Disk, ESSD, and SSD.

    • Replication Network: vSwitch for replicating DR data to the cloud. For optimal RTO, use the same zone as the Recovery Network. The replication and recovery networks can share the same vSwitch.

    • Recovery Network: vSwitch for recovery operations. During DR drills or failovers, Cloud Backup creates recovered instances in this network. For optimal RTO, use the same zone as the Replication Network. The replication and recovery networks can share the same vSwitch.

    • Automatic restart after replication interruption: Automatically resumes replication after an interruption.

    Replication proceeds through three stages: Enabling Replication, Replicating Full Data, and Replicating.

    1. Enabling Replication: Scans system data and estimates total data volume. Usually takes a few minutes.

    2. Replicating Full Data: Transfers all data from the server to the DR site. Duration depends on data volume and network bandwidth.

    3. Replicating: After full replication, Alibaba Cloud Replication Service (AReS) monitors all disk writes and continuously replicates changes to the DR site in real time.
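
Because the Replicating Full Data stage depends on data volume and network bandwidth, a rough estimate helps with planning. The sketch below is plain arithmetic, not a figure reported by Cloud Backup. The function name and the efficiency factor (the share of nominal bandwidth assumed to be usable) are assumptions made for illustration.

```python
def estimate_full_replication_hours(data_gib: float, bandwidth_mbps: float,
                                    efficiency: float = 0.8) -> float:
    """Estimate the hours needed to transfer data_gib over a link.

    data_gib       -- data to replicate, in GiB (2**30 bytes)
    bandwidth_mbps -- nominal link speed, in megabits per second (10**6 bits/s)
    efficiency     -- assumed usable fraction of the link, 0 < efficiency <= 1
    """
    if data_gib < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid data volume, bandwidth, or efficiency")
    total_bits = data_gib * 2**30 * 8
    usable_bits_per_second = bandwidth_mbps * 1_000_000 * efficiency
    return total_bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    # 500 GiB over a 100 Mbps link at 80% efficiency: roughly 15 hours.
    print(f"{estimate_full_replication_hours(500, 100):.1f} h")
```

The estimate ignores writes that occur during the transfer, so treat it as a lower bound for busy servers.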

(Optional) Perform a DR drill

After an ECS instance enters the Replicating state, it is ready for a DR drill.

A DR drill launches a protected instance at the DR site. Use drills to:

  • Verify application viability: Confirm that applications run correctly on a restored instance.

  • Improve team readiness: Familiarize your team with the failover process for faster response during real disasters.

To perform a DR drill:

  1. On the Protected Server tab, in the Actions column for the target server, click Test Failover.

  2. In the Test Failover panel, select the Recovery Network, IP Address, whether to Use ECS Specification, Hard Disk Type, Recovery Point, Elastic Public Network IP, and Post Script. Then click Start.

    Note
    • Cloud Backup automatically retains 24 recovery points from the last 24 hours for each server.

    • If you do not select Use ECS Specification, also specify the CPU and memory.

    Cloud Backup launches the server from the selected recovery point. Real-time replication continues unaffected during the drill.

    After the drill completes, click the link under Test Failover Information to verify data and applications.

  3. Clean up the drill environment.

    After verification, in the Actions column, click Cleanup Test Environment to delete the recovered ECS instance.

    Note

    Delete the recovered instance promptly after verification to avoid unnecessary costs.
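
A drill is also a chance to measure the recovery objectives you actually achieve. The following sketch assumes you note three timestamps by hand: the selected recovery point, the moment you started the drill, and the moment you finished verifying the application. The function names are hypothetical, and the target values simply reuse the figures quoted at the top of this document; replace them with your own objectives.

```python
from __future__ import annotations

from datetime import datetime, timedelta

# Illustrative targets only; substitute your own business objectives.
TARGET_RPO = timedelta(minutes=1)
TARGET_RTO = timedelta(minutes=15)


def drill_metrics(recovery_point: datetime, drill_started: datetime,
                  app_verified: datetime) -> tuple[timedelta, timedelta]:
    """Return (achieved RPO, achieved RTO) for one drill."""
    if not recovery_point <= drill_started <= app_verified:
        raise ValueError("timestamps must be in chronological order")
    rpo = drill_started - recovery_point   # data that would have been lost
    rto = app_verified - drill_started     # time until the application was usable
    return rpo, rto


def meets_targets(rpo: timedelta, rto: timedelta) -> bool:
    """True when both measured values are within the targets."""
    return rpo <= TARGET_RPO and rto <= TARGET_RTO


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, 0)
    rpo, rto = drill_metrics(start - timedelta(seconds=45), start,
                             start + timedelta(minutes=12))
    print(rpo, rto, meets_targets(rpo, rto))
```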

Step 4: Perform a failover

Regular DR drills confirm that workloads fail over reliably to the DR site.

Warning

Initiate a failover only for servers that have experienced a critical error. This action stops all real-time data replication. To re-establish protection, restart replication manually, which triggers a new full data replication before continuous protection resumes.

To perform a failover:

  1. On the Protected Server tab, in the Actions column for the server, choose More > Failover > Failover.

  2. In the Failover panel, select the Recovery Network, IP Address, whether to Use ECS Specification, Hard Disk Type, Recovery Point, Elastic Public Network IP, and Post Script. Then click Start.

    Important

    Recovery to the latest available state is a one-time operation.

  3. After the failover completes, click the link under Recovered Instance ID/Name to verify data and applications.

    • If the application runs correctly at the selected point in time, choose More > Failover > Commit Failover.

      Note

      After the recovered application takes over business traffic, commit the failover to clean up DR resources and reduce costs.

    • If the application state is unsatisfactory (for example, due to database consistency or data corruption issues), choose More > Failover > Change Recovery Point before you commit the failover.

    Note

    Changing the recovery point follows the same process as failover. Select an earlier recovery point.
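
When you change the recovery point because of corruption, the usual choice is the most recent point created before the problem began. The sketch below illustrates that selection rule on a list of timestamps. It is not a Cloud Backup API: the function name is hypothetical, and the example assumes hourly recovery points over the last 24 hours.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def latest_point_before(points: Sequence[datetime],
                        cutoff: datetime) -> Optional[datetime]:
    """Return the newest recovery point strictly earlier than cutoff, or None."""
    candidates = [p for p in points if p < cutoff]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    # Assumed schedule: one recovery point per hour for the last 24 hours.
    points = [now - timedelta(hours=h) for h in range(24)]
    corruption_started = datetime(2024, 1, 1, 9, 30)
    print(latest_point_before(points, corruption_started))  # 2024-01-01 09:00:00
```

If the function returns None, no retained recovery point predates the problem, and an earlier state cannot be restored from the retained points.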

Step 5: Perform a failback

Failback returns operations to the original production site after a disaster is resolved. Reverse replication synchronizes data changes from the DR site before the final cutover.

To perform a failback:

  1. On the Protected Server tab, in the Actions column for the server, choose More > Failback > Reversed Register, then confirm reverse registration of the protected server.

  2. In the Actions column, choose More > Failback > Initiate Reverse Replication.

  3. In the Initiate Reverse Replication panel, select whether to enable Original Machine Recovery, select the Replication Network and Recovery Network, and then click Start.

    Warning

    Failing back to the original ECS instance is an irreversible action that permanently overwrites all data on the instance. This option is available for both cross-region and cross-zone DR. Proceed with extreme caution.

  4. When the server enters reverse real-time replication, in the Actions column, choose More > Failback > Failback.

  5. In the Failback panel, enter the CPU and Memory information, select the Recovery Network and IP, and edit the script in Execute script after recovery.

  6. After failback completes, in the Actions column, choose More > Failover > Registration to re-register the protected server.
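
The failback actions must run in the order listed above. For a runbook, that order can be captured as a small checklist helper. This is an illustrative sketch only: the step names mirror the console menu items, while the next_step function is hypothetical and does not call any Cloud Backup interface.

```python
from typing import List, Optional

# Console actions in the order given in this section.
FAILBACK_STEPS = [
    "Reversed Register",
    "Initiate Reverse Replication",
    "Failback",
    "Registration",
]


def next_step(completed: List[str]) -> Optional[str]:
    """Return the next action to perform, or None when failback is finished.

    Raises ValueError if the completed actions are out of order.
    """
    if completed != FAILBACK_STEPS[:len(completed)]:
        raise ValueError("completed steps do not match the documented order")
    if len(completed) == len(FAILBACK_STEPS):
        return None
    return FAILBACK_STEPS[len(completed)]


if __name__ == "__main__":
    print(next_step(["Reversed Register"]))  # Initiate Reverse Replication
```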