Migrate data using the Reindex API

The Reindex API copies documents from a source index to a destination index. You can copy all documents or only those that match a specific query. This process can occur within the same cluster or across different clusters. This topic shows you how to use the Reindex API to migrate data from one cluster to another.

Limitations

  • Both clusters must be in the same region and availability zone.

  • Management and deployment mode: You can migrate data from a v2 cluster to a v3 cluster, between two v2 clusters, or between two v3 clusters.

    Clusters have two management and deployment modes: Cloud-native New Management (v3) and Basic Management (v2). You can view the management and deployment mode of your cluster on its information page in the console.

  • Cluster version: Data migration is supported between clusters of the same major version, for example, from a lower to a higher minor version. Migrating data across major versions, such as from 7.7.1 to 8.15.1, is not recommended.
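
The cluster version rule above can be checked before you start a migration. The following Python sketch is illustrative only: `major_version` and `same_major_version` are hypothetical helper names, not part of any Elasticsearch client, and the version strings are assumed to be in the dotted form shown on the cluster information page.

```python
def major_version(version: str) -> int:
    """Return the major component of a dotted version string such as '7.10.0'."""
    return int(version.split(".")[0])


def same_major_version(source_version: str, dest_version: str) -> bool:
    """Return True when both clusters share a major version (the supported case above)."""
    return major_version(source_version) == major_version(dest_version)


if __name__ == "__main__":
    # Same major version, lower to higher minor version.
    print(same_major_version("7.7.1", "7.10.0"))
    # Crosses a major version, which the limitation above advises against.
    print(same_major_version("7.7.1", "8.15.1"))
```

A False result corresponds to a cross-major migration, which is not recommended.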

Prerequisites

This example uses the Reindex API to migrate data from ES_2 (the source cluster) to ES_1 (the destination cluster). Before you begin, complete the following preparations.

Prepare test data

  • In ES_2, create an index and insert test data:

    PUT /product_info
    {
      "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1
      },
      "mappings": {
          "properties": {
            "productName": {
              "type": "text",
              "analyzer": "ik_smart"
            },
            "annual_rate":{
              "type":"keyword"
            },
            "describe": {
              "type": "text",
              "analyzer": "ik_smart"
            }
        }
      }
    }

    This command creates an index named product_info that contains the productName, annual_rate, and describe fields. If successful, the request returns the following result.

    {
      "acknowledged" : true,
      "shards_acknowledged" : true,
      "index" : "product_info"
    }

    Insert six test documents:

    POST /product_info/_bulk
    {"index":{}}
    {"productName":"Financial Product A","annual_rate":"3.2200%","describe":"A 180-day fixed-term product with a minimum investment of 20,000. Stable returns with optional message notifications."}
    {"index":{}}
    {"productName":"Financial Product B","annual_rate":"3.1100%","describe":"A 90-day regular investment product with a minimum investment of 10,000. Daily profit notifications are sent."}
    {"index":{}}
    {"productName":"Financial Product C","annual_rate":"3.3500%","describe":"A 270-day regular investment product with a minimum investment of 40,000. Daily profit notifications are sent."}
    {"index":{}}
    {"productName":"Financial Product D","annual_rate":"3.1200%","describe":"A 90-day regular investment product with a minimum investment of 12,000. Daily profit notifications are sent."}
    {"index":{}}
    {"productName":"Financial Product E","annual_rate":"3.0100%","describe":"A recommended 30-day regular investment product with a minimum investment of 8,000. Daily profit notifications are sent."}
    {"index":{}}
    {"productName":"Financial Product F","annual_rate":"2.7500%","describe":"A popular 3-day short-term product with no fees and a minimum investment of 500. Profit notifications are sent via SMS."}
  • In ES_1, create an index to store the migrated data from ES_2:

    PUT dest
    {
      "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1
      }
    }
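
If you prefer to generate the test data programmatically, the body of the _bulk request above can be assembled from a list of documents. The following Python sketch assumes the same action line as the example ({"index":{}}, which lets Elasticsearch generate document IDs). The function name `build_bulk_body` is hypothetical, and sending the body to ES_2 is left to your HTTP client of choice.

```python
import json


def build_bulk_body(docs):
    """Serialize documents into the newline-delimited JSON body used by _bulk.

    Each document is preceded by an {"index": {}} action line, and the body
    ends with a newline, which the _bulk API requires.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        {"productName": "Financial Product A", "annual_rate": "3.2200%"},
        {"productName": "Financial Product B", "annual_rate": "3.1100%"},
    ]
    print(build_bulk_body(sample), end="")
```

When you send this body over HTTP, set the Content-Type header to application/x-ndjson.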

Private connection via NLB and PrivateLink

To enhance cluster security, clusters are network-isolated from each other, whether they are in the same VPC or in different VPCs. To let the clusters communicate, you must use NLB and PrivateLink to establish a private connection (VPC connection) between them.

As the following figure shows, the two ES clusters are deployed in the same VPC. An endpoint service is created in the user's VPC. Then, a private connection is configured in ES_1 to obtain an endpoint. Finally, the endpoint is associated with the endpoint service to establish a private connection between the two clusters.

  • Endpoint service: A service that other VPCs can connect to privately by creating an endpoint. You must manually create the related service resources.

  • Endpoint: A resource that is associated with an endpoint service and provides a private network connection to access external services. When you configure a private connection for an Alibaba Cloud ES instance, an endpoint is automatically created in the network environment where the ES cluster resides.

[Figure: private connection between ES_1 and ES_2 through an endpoint and an endpoint service]

For detailed configuration steps, see Establish a private connection between Alibaba Cloud ES clusters using NLB and PrivateLink. You must complete Step 1, Step 2, and Step 3.

On the Security configuration page of the ES_1 instance, in the Cluster network settings section, click Modify to the right of Configure instance private connection. In the Configure instance private connection panel, you can view the endpoint ID, endpoint service ID, and connection status. When the endpoint connection status is Connected, the ES_1 and ES_2 clusters can communicate through their private IP addresses.

Configure Reindex API whitelist

To ensure secure data migration between clusters, you must add the private connection address and port of the ES_2 cluster to the Reindex API whitelist of ES_1.

  1. Go to the Security page for ES_1 and click Edit next to Configure Private Connection. In the Configure Private Connection side panel, click the target Endpoint ID.

    To add a new connection, click + Add Private Connection at the bottom of the Configure instance private connection side panel.

  2. In the VPC console, on the Endpoint Connections tab, click the expand icon next to the endpoint ID to view its corresponding domain name.

    Important

    You must remove the availability zone identifier from the domain name before adding it to the Reindex API whitelist.

    For example, if the full domain name is "ep-bp1****************-cn-hangzhou-i.epsrv-bp1****************.cn-hangzhou.privatelink.aliyuncs.com", remove the availability zone identifier "-cn-hangzhou-i" to get the final domain name: "ep-bp1bp1****************.epsrv-bp1****************.cn-hangzhou.privatelink.aliyuncs.com".

  3. In the YML file for ES_1, configure the Reindex API whitelist. The whitelist entry must be the endpoint's domain name and port.

    reindex:
      remote:
        whitelist: >-
          ep-bp1bp1****************.epsrv-bp1****************.cn-hangzhou.privatelink.aliyuncs.com:9200

    On the ES cluster configuration page, click Modify configuration to the right of YML configuration. In the Other configure YAML editor in the panel, add the preceding whitelist configuration.
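
The domain name handling in the steps above can be sketched in Python. This is a minimal illustration, assuming the endpoint domain has the layout shown in the example (first label ending in "-region-zone", region as the third label). The helper names `strip_zone_identifier` and `whitelist_entry` and the ID `bp1example` are hypothetical placeholders, not values from a real cluster.

```python
import re


def strip_zone_identifier(domain: str) -> str:
    """Remove the availability zone suffix from the first label of the domain.

    Assumed layout:
    ep-<id>-<region>-<zone>.epsrv-<id>.<region>.privatelink.aliyuncs.com
    """
    labels = domain.split(".")
    if len(labels) < 3:
        raise ValueError(f"unexpected endpoint domain: {domain!r}")
    region = labels[2]
    # Drop "-<region>-<zone>" only when it ends the first label.
    labels[0] = re.sub(rf"-{re.escape(region)}-[a-z0-9]+$", "", labels[0])
    return ".".join(labels)


def whitelist_entry(domain: str, port: int = 9200) -> str:
    """Return the "domain:port" string for reindex.remote.whitelist."""
    return f"{strip_zone_identifier(domain)}:{port}"


if __name__ == "__main__":
    full = ("ep-bp1example-cn-hangzhou-i.epsrv-bp1example"
            ".cn-hangzhou.privatelink.aliyuncs.com")
    print(whitelist_entry(full))
```

The same stripped domain name and port are later used in the host parameter of the Reindex request, so deriving both from one helper keeps them consistent.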

Call the Reindex API

  1. Log on to the Kibana console for ES_1.

  2. In Dev Tools > Console, call the Reindex API to migrate the data.

    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://ep-bp1bp1****************.epsrv-bp1****************.cn-hangzhou.privatelink.aliyuncs.com:9200",
          "username": "elastic",
          "password": "xxx-xxxx123!"
        },
        "index": "product_info",
        "query": {
          "match": {
            "productName": "Financial Product"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }

    The request parameters are described below by category.

    source

      • remote: The remote cluster. In this example, ES_2.

      • host: The access address of the ES_2 cluster. It consists of the following parts:

          • Protocol: HTTP or HTTPS. You can find the protocol on the Basic Information page of the cluster.

            Important

            For security, use the HTTPS protocol to prevent the password from being transmitted in plain text when connecting to the cluster. To enable the HTTPS protocol, see HTTPS protocol.

          • Domain name: The private connection address of the ES_2 cluster. This must be the same domain name configured in the Reindex API whitelist.

          • Port: Fixed at 9200.

      • username: The username used to access the ES_2 cluster. The default username is elastic.

      • password: The password for the specified user. The password was set when you created the cluster. If you have forgotten it, you can reset the password.

      • index: The source index in the remote cluster.

      • query: A query that specifies which documents to migrate. In this example, documents whose productName field matches "Financial Product" are migrated from the product_info index in ES_2 to ES_1.

    dest

      • index: The destination index in the target cluster for the migrated data.

    If successful, the request returns the following result:

    {
      "took": 211,
      "timed_out": false,
      "total": 6,
      "updated": 6,
      "created": 0,
      "deleted": 0,
      "batches": 1,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 0,
      "requests_per_second": -1,
      "throttled_until_millis": 0,
      "failures": []
    }
  3. Call the _search API to view the migration result.

    GET dest/_search

    Expected result:

    {
      "took": 6,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 6,
          "relation": "eq"
        },
        "max_score": 1,
        "hits": [
          {
            "_index": "dest",
            "_id": "n9kyqpcBCRuDZhswJCpH",
            "_score": 1,
            "_source": {
              "productName": "Financial Product D",
              "annual_rate": "3.1200%",
              "describe": "A 90-day regular investment product with a minimum investment of 12,000. Daily profit notifications are sent."
            }
          },
          {
            "_index": "dest",
            "_id": "nNkyqpcBCRuDZhswJCpG",
            "_score": 1,
            "_source": {
              "productName": "Financial Product A",
              "annual_rate": "3.2200%",
              "describe": "A 180-day fixed-term product with a minimum investment of 20,000. Stable returns with optional message notifications."
            }
          },
          {
            "_index": "dest",
            "_id": "ndkyqpcBCRuDZhswJCpG",
            "_score": 1,
            "_source": {
              "productName": "Financial Product B",
              "annual_rate": "3.1100%",
              "describe": "A 90-day regular investment product with a minimum investment of 10,000. Daily profit notifications are sent."
            }
          },
          {
            "_index": "dest",
            "_id": "ntkyqpcBCRuDZhswJCpH",
            "_score": 1,
            "_source": {
              "productName": "Financial Product C",
              "annual_rate": "3.3500%",
              "describe": "A 270-day regular investment product with a minimum investment of 40,000. Daily profit notifications are sent."
            }
          },
          {
            "_index": "dest",
            "_id": "oNkyqpcBCRuDZhswJCpH",
            "_score": 1,
            "_source": {
              "productName": "Financial Product E",
              "annual_rate": "3.0100%",
              "describe": "A recommended 30-day regular investment product with a minimum investment of 8,000. Daily profit notifications are sent."
            }
          },
          {
            "_index": "dest",
            "_id": "odkyqpcBCRuDZhswJCpH",
            "_score": 1,
            "_source": {
              "productName": "Financial Product F",
              "annual_rate": "2.7500%",
              "describe": "A popular 3-day short-term product with no fees and a minimum investment of 500. Profit notifications are sent via SMS."
            }
          }
        ]
      }
    }
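
The two responses above can also be checked programmatically. The following Python sketch is a heuristic under stated assumptions, not an official validation procedure. It treats a migration as complete when the _reindex response reports no timeout, no failures, no version conflicts, and a total equal to the documents it handled. It compares that total with the hit count of the destination index only on the assumption that the destination holds nothing except the migrated documents. The function names are hypothetical.

```python
def reindex_looks_complete(response: dict) -> bool:
    """Heuristic check on a _reindex response such as the one shown above."""
    handled = sum(
        response.get(key, 0)
        for key in ("created", "updated", "deleted", "noops")
    )
    return (
        response.get("timed_out") is False
        and response.get("failures") == []
        and response.get("version_conflicts", 0) == 0
        and handled == response.get("total")
    )


def destination_count_matches(reindex_response: dict, search_response: dict) -> bool:
    """Compare the reindex total with hits.total.value from GET dest/_search.

    Assumes the destination index contains only the migrated documents.
    """
    return search_response["hits"]["total"]["value"] == reindex_response["total"]


if __name__ == "__main__":
    reindex_response = {
        "timed_out": False, "total": 6, "updated": 6, "created": 0,
        "deleted": 0, "version_conflicts": 0, "noops": 0, "failures": [],
    }
    search_response = {"hits": {"total": {"value": 6, "relation": "eq"}}}
    print(reindex_looks_complete(reindex_response))
    print(destination_count_matches(reindex_response, search_response))
```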

FAQ

Q: How can I adjust the batch size and timeout for the Reindex API based on document size?

  • Adjust the batch size

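    The default batch size for a reindex operation is 1,000 documents. When reindexing from a remote cluster, each batch is buffered in memory on the destination cluster, so if your index contains large documents, reduce the batch size to avoid oversized batches and timeouts.

    In the following example, the size parameter in source is set to 10 so that each batch processes 10 documents.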

    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://otherhost:9200"
        },
        "index": "source",
        "size": 10,
        "query": {
          "match": {
            "test": "data"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
  • Adjust the timeout

    The socket_timeout parameter sets the socket read timeout, and the connect_timeout parameter sets the timeout for connecting to the remote cluster. In open source Elasticsearch, both default to 30 seconds.

    In the following example, the socket read timeout is set to 1 minute, and the connection timeout is set to 10 seconds.

    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://otherhost:9200",
          "socket_timeout": "1m",
          "connect_timeout": "10s"
        },
        "index": "source",
        "query": {
          "match": {
            "test": "data"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
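
The two adjustments can be combined in one request. The following Python sketch builds such a request body. It only mirrors the JSON structure shown in the examples above, the function name `build_remote_reindex_body` is hypothetical, and sending the body to POST _reindex is left to your HTTP client.

```python
import json


def build_remote_reindex_body(host, source_index, dest_index, query=None,
                              batch_size=None, socket_timeout=None,
                              connect_timeout=None):
    """Build a remote _reindex body with optional batch size and timeouts."""
    remote = {"host": host}
    # Timeouts belong to the remote connection settings.
    if socket_timeout is not None:
        remote["socket_timeout"] = socket_timeout
    if connect_timeout is not None:
        remote["connect_timeout"] = connect_timeout

    source = {"remote": remote, "index": source_index}
    # The batch size is a property of source, not of remote.
    if batch_size is not None:
        source["size"] = batch_size
    if query is not None:
        source["query"] = query

    return {"source": source, "dest": {"index": dest_index}}


if __name__ == "__main__":
    body = build_remote_reindex_body(
        "http://otherhost:9200", "source", "dest",
        query={"match": {"test": "data"}},
        batch_size=10, socket_timeout="1m", connect_timeout="10s",
    )
    print(json.dumps(body, indent=2))
```

Parameters that are left unset are omitted from the body, so the cluster defaults apply.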