ECS cross-zone disaster recovery failover

Before you begin

Before you implement cross-zone disaster recovery, make sure that you have a Virtual Private Cloud (VPC) that covers a zone different from the zone of the source ECS instances. In that destination zone, create a replication vSwitch and a recovery vSwitch. For more information, see Set up a VPC.

Step 1: Create a DR site pair

After you complete the prerequisites, follow these steps to protect your source ECS instances with cross-zone disaster recovery:

Log on to the Cloud Backup console.
In the left-side navigation pane, choose Disaster Recovery > ECS Disaster Recovery. In the upper-left corner of the page, click Switch to CDR.
Click Add, set the Type to Cross-zone Disaster Recovery, and then complete the Production Site Information and DR Site Information sections.
Click Create.

Step 2: Add protected servers

After you create the DR site pair, follow these steps to add the servers you want to protect:

Click the Protected Server tab and confirm the DR site pair information in the upper-right corner.
Click Add to the right of Protected Server, and then select the ECS instances you want to protect.
Click Confirmation. After you add the server, the status shows Agent Installing and then changes to Initialized.
Note
If the server status does not change to Initialized, choose More > Server Operation > Restart Server to complete the agent initialization.
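
If you track server status outside the console, the wait-then-restart logic in the note above can be sketched as a simple polling loop. This is an illustration only: `get_status` is a hypothetical callable that you would supply yourself (it is not a Cloud Backup function), and the timeout and polling interval are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_for_initialized(
    get_status: Callable[[], str],          # hypothetical status lookup you provide
    timeout_s: float = 600.0,               # assumed limit, not a documented value
    poll_interval_s: float = 15.0,          # assumed interval, not a documented value
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the status is "Initialized" or the timeout elapses.

    Returns True when the server reaches Initialized. Returns False when the
    timeout is hit, which is the point at which you would use
    More > Server Operation > Restart Server in the console.
    """
    waited = 0.0
    while True:
        if get_status() == "Initialized":
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_interval_s)
        waited += poll_interval_s
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.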

Step 3: Start replication

To enable real-time replication of ECS instances to Alibaba Cloud:

Click the Protected Server tab. In the Actions column for the target server, choose More > Failover > Start Replication.

In the Enable Replication panel, configure the following parameters, then click Start.

Parameter	Description
Recovery Point Policy	Select the interval (in hours) at which Cloud Backup creates a recovery point each day.
Hard Disk Type	Valid values: Ultra Disk, ESSD, and SSD.
Replication Network	Select the vSwitch used to replicate disaster recovery data to the cloud. Cloud Backup lists the available vSwitches in the secondary site VPC. The replication and recovery networks can share the same vSwitch, which speeds up recovery. If the two networks are in different zones, the recovery time objective (RTO) increases, so for best performance choose a vSwitch in the same zone as the Recovery Network.
Recovery Network	Select the vSwitch used for recovery operations. During a DR drill or failover, Cloud Backup creates the recovered ECS instances in this network. Cloud Backup lists the available vSwitches in the secondary site VPC. The replication and recovery networks can share the same vSwitch, which speeds up recovery. If the two networks are in different zones, the recovery time objective (RTO) increases, so for best performance choose a vSwitch in the same zone as the Replication Network.
Automatic restart after replication interruption	Specifies whether to automatically restart replication after an interruption. Select this option to resume the replication task if it stops.
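
The zone guidance for the Replication Network and Recovery Network can be expressed as a small pre-flight check. The sketch below is a minimal illustration, not part of Cloud Backup: the `VSwitch` class, the identifiers, and the zone names are hypothetical, and you would fill them in from your own inventory.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VSwitch:
    vswitch_id: str  # hypothetical identifier, for illustration only
    zone_id: str     # zone in which the vSwitch resides


def check_network_choice(
    source_zone: str, replication: VSwitch, recovery: VSwitch
) -> List[str]:
    """Return warnings for a replication/recovery vSwitch selection."""
    warnings = []
    # Cross-zone DR places both vSwitches in the destination zone.
    if source_zone in (replication.zone_id, recovery.zone_id):
        warnings.append("a vSwitch is in the source zone, not the destination zone")
    # Different zones for the two networks increase RTO.
    if replication.zone_id != recovery.zone_id:
        warnings.append("replication and recovery vSwitches are in different zones; expect a higher RTO")
    return warnings
```

An empty list means the selection follows the recommendations in the table, for example when both networks use the same vSwitch in the destination zone.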

Disaster recovery replication proceeds through three stages: Enabling Replication, Replicating Full Data, and Replicating.

Enabling Replication: The ECS disaster recovery service scans system data and estimates the total data volume. This stage usually takes a few minutes.
Replicating Full Data: The ECS disaster recovery service transfers all valid data from the entire server to Alibaba Cloud. The duration depends on data volume and network bandwidth. The console progress bar shows replication progress.
Replicating: After full replication completes, Alibaba Cloud holds a complete copy of your data. Then, Alibaba Cloud Replication Service (AReS) monitors all disk write operations on the server and continuously replicates them to Alibaba Cloud in real time.
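
Because the Replicating Full Data stage depends on data volume and network bandwidth, a back-of-the-envelope estimate can help with planning. The sketch below is simple arithmetic under stated assumptions: the function name and the `efficiency` factor (the share of nominal bandwidth actually available) are illustrative, and real durations also depend on factors this calculation ignores.

```python
def estimate_full_replication_hours(
    data_gib: float, bandwidth_mbps: float, efficiency: float = 1.0
) -> float:
    """Estimate hours needed to transfer data_gib GiB at bandwidth_mbps Mbit/s.

    efficiency is an assumed fraction (0 < efficiency <= 1) of the nominal
    bandwidth that replication traffic can actually use.
    """
    if data_gib < 0:
        raise ValueError("data_gib must be non-negative")
    if bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    data_bits = data_gib * 1024**3 * 8                      # GiB -> bits
    seconds = data_bits / (bandwidth_mbps * 1_000_000 * efficiency)
    return seconds / 3600


# 500 GiB over a fully used 100 Mbit/s link takes roughly 11.9 hours.
print(round(estimate_full_replication_hours(500, 100), 1))
```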

(Optional) Perform a DR drill

After an ECS instance enters the Replicating state, it is ready for a DR drill.

A DR drill launches a protected ECS instance on the cloud and verifies application functionality. DR drills provide the following benefits:

Verify application viability: Confirms that applications run correctly on a restored instance.
Improve team readiness: Familiarizes the team with the failover process, enabling a smooth and rapid response during a real disaster.

To perform a DR drill:

On the Protected Server tab, in the Actions column for the target server, click Test Failover.
In the Test Failover panel, select the Recovery Network, IP Address, whether to Use ECS Specification, Hard Disk Type, Recovery Point, Elastic Public Network IP, and Post Script. Then click Start.
Note
- Cloud Backup automatically retains 24 recovery points from the last 24 hours for each server.
- If you do not select Use ECS Specification, also specify the CPU and memory.
Cloud Backup launches the server in the background from the selected point in time. Real-time data replication continues unaffected during the drill.

After a few minutes, the drill completes. Click the link under Test Failover Information to verify data and applications.
Clean up the drill environment.

After verification, in the Actions column for the server, click Cleanup Test Environment. This action deletes the recovered ECS instance.

Note
Delete the recovered ECS instance as soon as verification is complete to reduce costs.

Step 4: Perform a failover

Regular DR drills verify that applications run successfully on restored ECS instances, confirming that workloads fail over reliably to the DR site during a critical error.

Warning

Initiate a failover only for servers that have experienced a critical error. This action stops all real-time data replication. To re-establish protection, restart replication manually, which triggers a new full data replication before continuous protection resumes.

To perform a failover:

On the Protected Server tab, in the Actions column for the server, choose More > Failover > Failover.
In the Failover panel, select the Recovery Network, IP Address, whether to Use ECS Specification, Hard Disk Type, Recovery Point, Elastic Public Network IP, and Post Script. Then click Start.

Important
Recovery to the latest available state is a one-time operation.
After the failover completes, click the link under Recovered Instance ID/Name to verify data and applications.
- If the application runs correctly at the selected point in time, choose More > Failover > Commit Failover.
  
  Note
  After completing failover or switching recovery points, and confirming that the recovered application has taken over business traffic, commit the failover to clean up disaster recovery resources and reduce costs.
- If the application state is unsatisfactory (for example, because of database consistency issues or because corrupted source data was already synchronized to the DR zone), choose More > Failover > Change Recovery Point before you commit the failover.
Note
Changing the recovery point follows the same process as failover. Select an earlier recovery point.
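
When you know roughly when the source data became corrupted, choosing an earlier recovery point amounts to picking the most recent point before that moment. The sketch below illustrates the selection with hypothetical timestamps. In practice, you read the available recovery points from the console, and the function name is an assumption made for this example.

```python
from datetime import datetime
from typing import Optional, Sequence


def pick_recovery_point(
    points: Sequence[datetime], corrupted_at: datetime
) -> Optional[datetime]:
    """Return the latest recovery point strictly before corrupted_at.

    Returns None when every available point is at or after the corruption time.
    """
    earlier = [p for p in points if p < corrupted_at]
    return max(earlier) if earlier else None


# Hypothetical hourly recovery points on a single day.
points = [datetime(2024, 1, 1, hour) for hour in range(0, 6)]
chosen = pick_recovery_point(points, corrupted_at=datetime(2024, 1, 1, 3, 30))
print(chosen)  # 2024-01-01 03:00:00
```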

Step 5: Perform a failback

After you fail over an application on a protected server from one zone (for example, Zone A) to another (for example, Zone B), you can perform a failback, which replicates data from Zone B back to Zone A.

Follow these steps to perform a failback:

On the Protected Server tab, in the Actions column for the server, choose More > Failback > Reversed Register, then confirm reverse registration of the protected server.
In the Actions column, choose More > Failback > Initiate Reverse Replication.
In the Initiate Reverse Replication panel, select whether to enable Original Machine Recovery, then select the Replication Network and Recovery Network. Then click Start.

Warning
Failing back to the original ECS instance is an irreversible action that permanently overwrites all data on the instance. This option is available for both cross-region and cross-zone DR. Proceed with extreme caution.
When the server enters reverse real-time replication, in the Actions column, choose More > Failback > Failback.
In the Failback panel, enter the CPU and Memory information, select the Recovery Network and IP, and edit the script in Execute script after recovery.
After failback completes, in the Actions column, choose More > Failover > Registration to re-register the protected server.
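
The failback steps above must run in a fixed order, which a runbook script can enforce. The following sketch is a minimal illustration: the step names mirror the console menu items in this section, while the helper itself is hypothetical and does not call any Cloud Backup API.

```python
from typing import Optional, Sequence

# Console actions in the order described in this section.
FAILBACK_STEPS = (
    "Reversed Register",
    "Initiate Reverse Replication",
    "Failback",
    "Registration",
)


def next_failback_step(completed: Sequence[str]) -> Optional[str]:
    """Return the next console action, or None when failback is finished.

    Raises ValueError if the completed steps are not a prefix of the
    documented order (a step was skipped or run out of order).
    """
    if tuple(completed) != FAILBACK_STEPS[: len(completed)]:
        raise ValueError("failback steps were skipped or run out of order")
    if len(completed) == len(FAILBACK_STEPS):
        return None
    return FAILBACK_STEPS[len(completed)]
```

For example, `next_failback_step(["Reversed Register"])` returns `"Initiate Reverse Replication"`.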