Cross-AZ deployment and operations


A cross-AZ deployment distributes the nodes of an Elasticsearch instance across multiple physically isolated availability zones within the same region. If one zone fails, the cluster stays operational using nodes and data replicas in the remaining zones, providing data center-level disaster recovery.

How it works

A cross-AZ deployment uses the built-in shard allocation awareness mechanism in Elasticsearch.

When you create a cross-AZ instance, the system automatically adds a zone_id attribute to the nodes in each zone and sets cluster.routing.allocation.awareness.attributes to zone_id, so that Elasticsearch takes zone placement into account during shard allocation.

As a result, the copies of each shard (the primary and its replicas) are placed in different zones, provided the index has at least one replica. If a zone fails, data remains accessible from the copies in the other zones.
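One way to confirm that shard copies are actually spread across zones is to compare shard placement with node attributes. The following is a minimal sketch, assuming you have saved the JSON output of GET _cat/shards?format=json and GET _cat/nodeattrs?format=json. The function names, node names, and zone values are hypothetical and used only for illustration.

```python
from collections import defaultdict


def zone_by_node(nodeattrs, attr="zone_id"):
    """Map node name -> zone, from rows of GET _cat/nodeattrs?format=json."""
    return {row["node"]: row["value"] for row in nodeattrs if row["attr"] == attr}


def shards_confined_to_one_zone(shards, node_zone):
    """Return (index, shard) pairs whose started copies span fewer than two zones.

    `shards` holds rows of GET _cat/shards?format=json. Shards without
    replicas are reported too, because a single copy cannot span zones.
    """
    zones = defaultdict(set)
    for row in shards:
        if row.get("state") != "STARTED":
            continue  # skip unassigned, initializing, and relocating copies
        zones[(row["index"], int(row["shard"]))].add(node_zone[row["node"]])
    return sorted(key for key, seen in zones.items() if len(seen) < 2)


# Hypothetical sample data: es-a and es-c share a zone.
nodeattrs = [
    {"node": "es-a", "attr": "zone_id", "value": "zone-1"},
    {"node": "es-b", "attr": "zone_id", "value": "zone-2"},
    {"node": "es-c", "attr": "zone_id", "value": "zone-1"},
]
shards = [
    {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED", "node": "es-a"},
    {"index": "logs", "shard": "0", "prirep": "r", "state": "STARTED", "node": "es-b"},
    {"index": "logs", "shard": "1", "prirep": "p", "state": "STARTED", "node": "es-a"},
    {"index": "logs", "shard": "1", "prirep": "r", "state": "STARTED", "node": "es-c"},
]
at_risk = shards_confined_to_one_zone(shards, zone_by_node(nodeattrs))
print(at_risk)
```

In the sample data, shard 1 of logs has both copies in zone-1, so it is the only shard reported. An empty result suggests every shard has copies in at least two zones.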

Deployment modes

Choose the deployment mode based on your availability requirements and budget.

Deployment mode | Architecture | Disaster recovery | Use cases
--- | --- | --- | ---
Single availability zone | All nodes in a single zone. | A zone failure causes a complete service outage. | Development, testing, or other non-critical services.
Two availability zones | Nodes distributed across two zones. | Survives a single zone failure. | Production environments with high availability requirements.
Three availability zones | Nodes distributed across three zones. | Survives a single zone failure. | Core production services with stringent high availability requirements.
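Although two-zone and three-zone deployments both survive a single zone failure, they differ in how much capacity remains afterward. The arithmetic below is a rough sketch, assuming nodes and load are spread evenly across zones and ignoring any temporary nodes the system may provision during a failover. The function names are illustrative.

```python
from fractions import Fraction


def capacity_after_zone_loss(zones: int) -> Fraction:
    """Fraction of nodes still available after one zone is lost."""
    if zones < 1:
        raise ValueError("zones must be at least 1")
    return Fraction(zones - 1, zones)


def load_multiplier(zones: int):
    """Factor by which per-node load grows on the surviving nodes.

    Returns None for a single-zone deployment, where no nodes survive.
    """
    remaining = capacity_after_zone_loss(zones)
    return None if remaining == 0 else 1 / remaining


for zones in (1, 2, 3):
    print(zones, capacity_after_zone_loss(zones), load_multiplier(zones))
```

Under these assumptions, a two-zone cluster keeps half of its nodes and each surviving node carries twice the load, while a three-zone cluster keeps two thirds of its nodes and per-node load grows by a factor of 1.5.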

Create a cross-AZ instance

  1. Go to the Create Alibaba Cloud Elasticsearch Instance page.

  2. In the Deployment Mode section, select two or three availability zones.

    • Node count: The number of data nodes, cold data nodes, or coordinating nodes must be a multiple of the number of selected zones.

    • Dedicated master nodes: Three dedicated master nodes are required for multi-AZ stability.

    The zone you select in the console serves as the cluster's primary access point. The system distributes nodes evenly across the selected zones based on real-time resource availability.
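The node count rule can be checked before you place an order. The sketch below is illustrative only: nodes_per_zone is a hypothetical helper, not a console or API feature, and it assumes that a count of 0 stands for an optional node type that is not enabled.

```python
def nodes_per_zone(node_count: int, zone_count: int) -> int:
    """Return how many nodes of one type land in each zone.

    Raises ValueError when the count cannot be split evenly, which is
    the condition the multiple-of-zones rule is meant to rule out.
    """
    if zone_count not in (2, 3):
        raise ValueError("a cross-AZ deployment uses two or three zones")
    if node_count < 0 or node_count % zone_count != 0:
        raise ValueError(
            f"{node_count} nodes cannot be split evenly across {zone_count} zones"
        )
    return node_count // zone_count


print(nodes_per_zone(6, 3))  # 2 data nodes in each of three zones
print(nodes_per_zone(4, 2))  # 2 data nodes in each of two zones
```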

Upgrade to multi-AZ (V3 clusters only)

  1. Before you upgrade, verify the following:

    • Run GET _cluster/health and confirm that the cluster status is GREEN. If it is not, see "Cluster change error: cluster status is unhealthy" for solutions.

    • Optimize client connection distribution to prevent long-lived connections from concentrating in a single zone. Methods include setting connection timeouts, restarting clients in batches, or using separate coordinating nodes. For details, see "Analysis and solutions for uneven cluster load".

    • Run GET _cluster/settings and confirm that the output includes "cluster.routing.allocation.enable": "all", which allows Elasticsearch to automatically allocate shards. If the setting has a different value, run the following command to enable automatic shard allocation:

      PUT _cluster/settings  
      {  
        "transient": {  
          "cluster.routing.allocation.enable": "all"  
        }  
      }  
  2. On the Instance List page, click Upgrade Configuration.

    Alternatively, go to the Basic Information page and click Configuration Update > Upgrade.

  3. On the upgrade page, in the Deployment Mode section, select two or three availability zones and complete the payment.

    • During the upgrade, the system automatically enables dedicated master nodes (if not already enabled) and may add data nodes so that they can be distributed evenly across zones. Additional nodes incur extra costs, which appear on your bill.

    • For example, upgrading a single-AZ instance with two data nodes to three AZs adds one data node, bringing the total to three (one per zone).
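To estimate the extra cost ahead of time, you can compute how many data nodes an upgrade would add. The sketch below assumes the system rounds the data node count up to the next multiple of the zone count. That rule is inferred from the example above and is not a documented guarantee, and data_nodes_to_add is a hypothetical helper name.

```python
def data_nodes_to_add(current_nodes: int, target_zones: int) -> int:
    """Nodes needed to reach the next multiple of target_zones.

    Assumption: the upgrade pads the node count up to an even split
    and never removes nodes.
    """
    if current_nodes < 1 or target_zones < 1:
        raise ValueError("node and zone counts must be positive")
    return (-current_nodes) % target_zones


print(data_nodes_to_add(2, 3))  # the example above: 2 nodes -> 3 zones adds 1
print(data_nodes_to_add(4, 3))  # 4 nodes -> 3 zones would add 2
print(data_nodes_to_add(4, 2))  # already even, adds 0
```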

Migrate availability zones

If the current availability zone has insufficient resources for an upgrade, migrate nodes to a zone with available resources first.

Important

Migrating an availability zone triggers a cluster restart. The cluster remains available but may experience temporary instability. Perform this operation during off-peak hours.

  1. Before you migrate, verify the following:

    • Run GET _cluster/health and confirm that the cluster status is GREEN. If it is not, see "Cluster change error: cluster status is unhealthy" for solutions.

    • Run GET /_cat/indices?v to check for closed indexes. If any exist, open them with POST /<index_name>/_open. Closed indexes prevent GREEN status and may cause the migration to fail.

    • Run GET _cluster/settings and confirm that the output includes "cluster.routing.allocation.enable": "all", which allows Elasticsearch to automatically allocate shards. If the setting has a different value, run the following command to enable automatic shard allocation:

      PUT _cluster/settings  
      {  
        "transient": {  
          "cluster.routing.allocation.enable": "all"  
        }  
      }  
  2. Perform the migration:

    1. Go to the Basic Information page of the target instance. In the node visualization area, hover over the availability zone you want to migrate and click Migrate.

    2. In the dialog box, select the target availability zone and vSwitch. You can migrate only one zone at a time.

    3. Select the Data Migration Service Agreement checkbox and click OK.

      • After you confirm, the cluster restarts with possible brief performance fluctuations. The system provisions new master nodes in the target zone first, so old and new zones coexist temporarily.

      • After migration completes, the cluster returns to normal. The console may temporarily display the old zone due to a display lag, but this does not affect operations. Node IP addresses will change.
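The pre-migration checks can be combined into a single script. The following is a minimal sketch, assuming you pass in the parsed JSON responses of GET _cluster/health, GET _cat/indices?format=json, and GET _cluster/settings?flat_settings=true. The function name change_blockers and the sample responses are hypothetical. The sketch also treats an unset cluster.routing.allocation.enable as "all", which is the Elasticsearch default, whereas the checklist above asks you to confirm the value explicitly.

```python
ALLOCATION_KEY = "cluster.routing.allocation.enable"


def change_blockers(health, indices, settings):
    """Return a list of reasons the cluster is not ready for a change."""
    blockers = []

    # The health API reports the status in lowercase.
    status = health.get("status", "unknown")
    if status != "green":
        blockers.append(f"cluster status is {status}, expected green")

    # _cat/indices reports closed indexes with status "close".
    closed = sorted(row["index"] for row in indices if row.get("status") == "close")
    if closed:
        blockers.append("closed indexes: " + ", ".join(closed))

    # Transient settings take precedence over persistent ones.
    transient = settings.get("transient", {})
    persistent = settings.get("persistent", {})
    enable = transient.get(ALLOCATION_KEY) or persistent.get(ALLOCATION_KEY) or "all"
    if enable != "all":
        blockers.append(f"{ALLOCATION_KEY} is {enable}, expected all")

    return blockers


# Hypothetical responses that fail all three checks.
problems = change_blockers(
    {"status": "yellow"},
    [{"index": "logs-2023", "status": "close"}, {"index": "logs-2024", "status": "open"}],
    {"persistent": {}, "transient": {ALLOCATION_KEY: "primaries"}},
)
for line in problems:
    print(line)
```

An empty list means none of the three checks found a problem. It does not guarantee that the migration will succeed.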

Availability zone failover and restore

When a zone fails, perform a failover to redirect traffic to the remaining healthy zones. After the failed zone recovers, restore it to bring it back into the cluster.

Failover

  1. In the node visualization area of the instance, hover over the availability zone you want to isolate and click Switch Over.

  2. In the dialog box that appears, click OK.

    Important

    A failover isolates all nodes in the specified zone. Service requests are then handled by nodes in the remaining zones. The system attempts to provision a corresponding number of resources in those zones, but success is not guaranteed due to resource inventory and scheduling constraints. Implement traffic-limiting measures based on the cluster's load.

    If the cluster status becomes YELLOW after failover even though replicas are configured, run the following command in Kibana to override the shard allocation policy and force reallocation to the remaining zones:

    PUT /_cluster/settings
    {
        "persistent" : {
            "cluster.routing.allocation.awareness.force.zone_id.values" : {"0": null, "1": null, "2": null}
        }
    }

    After the shards are reallocated, the cluster health status will return to GREEN.
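Reallocation can take a while, so it may help to poll the cluster health instead of checking by hand. The sketch below is one possible approach: wait_for_green is a hypothetical helper, and fetch_health stands for any callable you supply that returns the parsed response of GET _cluster/health. The timeout and interval defaults are arbitrary.

```python
import time


def wait_for_green(fetch_health, timeout_s=600.0, interval_s=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the cluster reports green or the timeout elapses.

    Returns True when green is observed, False on timeout. The sleep and
    clock parameters exist so the function can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if fetch_health().get("status") == "green":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Simulated responses standing in for real GET _cluster/health calls.
responses = iter([{"status": "yellow"}, {"status": "yellow"}, {"status": "green"}])
recovered = wait_for_green(lambda: next(responses), sleep=lambda seconds: None)
print(recovered)
```

The cluster health API also accepts the wait_for_status and timeout query parameters, which let Elasticsearch do the waiting on the server side for short periods.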

Restore

  1. After you confirm that the failed availability zone is back to normal, in the node visualization area of the instance, hover over the offline availability zone and click Switch Back.

  2. In the dialog box that appears, click OK. The cluster restarts. After restoration, temporarily added failover nodes are removed and the cluster returns to its original architecture.