Primary/secondary switchover

ApsaraDB for MongoDB uses a high-availability architecture with automatic primary/secondary switchover to keep your instance running when failures occur. This topic explains what triggers a switchover, how it affects your application, and what you can do to minimize disruption.

Why a switchover occurs

A primary/secondary switchover is triggered by one of four causes:

  • Manual switchover: You or an authorized Alibaba Cloud technical expert manually initiates the switchover.

  • Hidden risks: ApsaraDB for MongoDB detects potential risks that could affect your instance. It schedules Operations and Maintenance (O&M) tasks to resolve the issue and performs the switchover within your configured maintenance window. You can view completed O&M tasks in historical events and manage pending tasks in scheduled events.

  • Host offline: When the host running one of your instance nodes has an exception, ApsaraDB for MongoDB treats the host as offline and triggers a switchover to replace the affected node.

  • Instance exceptions: When ApsaraDB for MongoDB detects that the instance is faulty and cannot function normally, it immediately triggers a switchover to recover the instance and minimize downtime.

For host offline and instance exception events, you receive a notification by internal message or email:

[Alibaba Cloud] Dear ****: Your ApsaraDB for MongoDB instance dds-bp**** (name: ****) has an exception. The high-availability system has triggered switchover to ensure the stable running of your instance. We recommend that you check whether your application is still connected to your instance and configure your application to automatically reconnect to your instance.

What happens during a switchover

During a switchover, the primary and secondary nodes swap roles. This causes a transient connection interruption that lasts about 30 seconds.

Write operations routed directly to the old primary node fail during this period because that node becomes a secondary. Read operations may also be interrupted depending on your connection configuration.

Trigger a manual switchover

Manual switchover is useful for:

  • Running disaster recovery drills

  • Verifying your application's exception-handling behavior

  • In multi-zone deployments, connecting your application to the nearest node

Manual switchover support varies by instance type:

Instance type             Supported
Replica set instance      Yes. See Configure primary/secondary switchover for a replica set instance
Sharded cluster instance  Yes. See Configure primary/secondary failover for a sharded cluster instance
Standalone instance       Not supported

Keep your application resilient during switchovers

To prevent write failures and connection drops from affecting your application, follow these practices:

Use an SRV connection string URI or connection string URI in production. These connection types automatically detect the new primary after a switchover, so read and write operations continue without manual intervention. See Connect to a replica set instance or Connect to a sharded cluster instance for connection string details.

Configure automatic reconnection. Implement retry logic in your application so it reconnects to the instance after a transient disconnection during switchover.

Handle write exceptions. Catch and handle write errors that may surface during the role-change period.
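
The reconnection and exception-handling practices above can be sketched as a small retry wrapper. This is a minimal illustration, not a driver API: `TransientWriteError` and `run_with_retry` are hypothetical names, and in a real application you would catch the exception types that your MongoDB driver raises. The defaults are an assumption chosen so that the total wait (1 + 2 + 4 + 8 + 10 + 10 = 35 seconds) covers the roughly 30-second interruption described earlier.

```python
import time


class TransientWriteError(Exception):
    """Hypothetical stand-in for a driver error raised during a switchover."""


def run_with_retry(operation, attempts=7, base_delay=1.0, max_delay=10.0,
                   sleep=time.sleep):
    """Run `operation`, retrying on TransientWriteError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientWriteError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            # Wait 1s, 2s, 4s, ... capped at max_delay, before the next try.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
```

Retrying is only safe when the wrapped write is idempotent, for example an upsert keyed on a unique identifier. Otherwise a write that succeeded just before the connection dropped may be applied twice.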

Troubleshoot write errors after a switchover

Issue

After a switchover, write operations to a replica set instance fail with one of these errors:

"errmsg": "not master", "code": 10107, "codeName": "NotMaster"
"errmsg": "not master", "code": 10107, "codeName": "NotWritablePrimary"
Time out after 30000ms while waiting for a server that matches writableServerSelector.
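
To decide whether a failed write is worth retrying, an application can check for the errors listed above. The following is a hedged sketch: `is_switchover_write_error` is a hypothetical helper, and it assumes the error reaches your code either as a dictionary shaped like the server error document or as a plain message string. Real drivers wrap these errors in their own exception classes, so adapt the extraction step accordingly.

```python
def is_switchover_write_error(error):
    """Return True if `error` matches one of the switchover-related failures.

    `error` is either a dict shaped like a server error document or a
    plain message string (both shapes are assumptions for this sketch).
    """
    if isinstance(error, dict):
        return (
            error.get("code") == 10107
            or error.get("codeName") in {"NotMaster", "NotWritablePrimary"}
        )
    message = str(error).lower()
    # Covers the "not master" message and the server-selection timeout.
    return "not master" in message or "writableserverselector" in message
```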

Cause

During a switchover, the primary node becomes a secondary. If your application connects directly to the old primary's address, write operations fail because that node no longer accepts writes.

Solutions

If your application fails to handle the switchover, check the following:

  • Connection string format: Confirm you are using an SRV connection string URI or connection string URI. These strings automatically route writes to the current primary, so a switchover does not interrupt your reads or writes.

  • Retry logic: Confirm your application retries connection attempts after a disconnection and handles write exceptions to protect business continuity.
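
For the connection string check, the sketch below shows the general shape of a replica set connection string URI in the standard MongoDB URI format: every node is listed and the replica set is named, which lets the driver discover the current primary instead of being pinned to one address. The function name, host names, port, and replica set name are placeholders for illustration only. Copy the actual connection string from the console rather than assembling it by hand.

```python
from urllib.parse import quote_plus


def build_replica_set_uri(user, password, hosts, replica_set, database="admin"):
    """Build a connection string URI that lists every node and names the replica set."""
    # Percent-encode credentials so characters like '@' or '/' stay valid in a URI.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    host_list = ",".join(hosts)
    return f"mongodb://{credentials}@{host_list}/{database}?replicaSet={replica_set}"


# Hypothetical values, for illustration only.
uri = build_replica_set_uri(
    "root",
    "p@ss/word",
    ["node1.example.com:3717", "node2.example.com:3717"],
    "mgset-example",
)
```

A string that contains only a single node address and no `replicaSet` option is a sign that the application is connecting directly to one node, which is the failure case described under Cause.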

If you need to restore the original node assignment immediately, switch node roles manually.