Solution overview
Introduction:
-
OpenSearch Vector Search Edition: A large-scale, distributed vector search solution on the public cloud. It supports multiple vector search algorithms and delivers excellent performance with high accuracy. It enables you to build and query indexes for large-scale datasets cost-effectively. The service supports horizontal scaling, index merging, streaming index builds, query-as-you-ingest capabilities, and real-time dynamic data updates.
-
Data Integration: A secure, cost-effective, stable, efficient, and elastic data synchronization platform provided by Alibaba Cloud. As a core component of DataWorks, it provides high-speed, stable data movement and synchronization between numerous heterogeneous data sources in complex network environments.
To transfer existing data to OpenSearch Vector Search Edition, use Data Integration. The process involves three main steps:
-
Create an instance and perform initial configuration:
-
Create an OpenSearch Vector Search Edition instance and configure a table.
-
Purchase a Data Integration resource group and bind it to a workspace.
-
Configure a data synchronization task: Add an OpenSearch data source, verify network connectivity, and then configure a data synchronization task in the Data Integration console to transfer data from your source to OpenSearch.
-
Perform a query test: Return to the OpenSearch Vector Search Edition console to test queries on the transferred data.
Usage notes:
-
Limitations: Due to the network architecture of OpenSearch, a single resource group can synchronize data to only one OpenSearch instance at a time. In addition, make sure that your OpenSearch Vector Search Edition instance and your Data Integration resource group are in the same region.
-
Supported data sources: Data Integration supports over 40 types of data sources, including relational databases, unstructured storage, big data storage, and message queues. By defining a source and a destination data source, you can use the data reader and writer plug-ins provided by Data Integration to transfer data between any structured or semi-structured data sources.
Instance configuration
1. Create and configure a vector search instance
1.1. Create a vector search instance
Create an OpenSearch Vector Search Edition instance. For instructions, see Purchase an OpenSearch Vector Search Edition instance.
1.2. Configure the vector search instance
A newly purchased instance is in the Pending Configuration state. You need to configure a table for the instance. In the Actions column, click Configure.
-
Enter the basic information for the table and click Next.
Parameters:
-
Table Name: A custom name for your table.
-
Number of Data Shards: Specify a positive integer no greater than 256. Sharding can improve the speed of full index builds and the performance of individual queries. For some existing instances, ensure that either all index tables have the same number of data shards, or one table has a single data shard while the remaining tables share an identical number of shards.
-
Number of Resources for Data Updates: The number of compute resources for data updates. By default, each index is provided with two 4-core, 8 GB update resources for free. You are charged for resources that exceed the free quota. For more information, see Billing of OpenSearch Vector Search Edition.
-
Scenario Template: Select General Template.
-
For data synchronization, configure the data source. You must select API as the data transfer method. Then, click Next.
-
Configure the fields. Specify the source fields to index or use for queries. We recommend that you use the same field names as in the source data table to simplify field mapping in DataWorks.
Note: When you configure fields, you must define at least a primary key field and a vector field. The vector field must be set to a multi-value float type.
-
Configure the index schema. Set the parameters based on your vector dimension and vector index algorithm, and then click Next.
Parameters:
-
Index Name: You can specify a custom name.
-
The Primary Key Field is automatically populated.
-
Select the vector field that you created.
-
Configure the vector dimension, real-time indexing, distance type, and vector index algorithm based on your business requirements.
-
After you complete the table configuration, click Confirm. Wait for the table status to change to In Use. Then, go to the DataWorks console to configure the synchronization task.
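The table rules described above (the limit on Number of Data Shards, the shard-count consistency rule that applies to some existing instances, and the requirement for a primary key field and a multi-value float vector field) can be checked before you open the console. The following is a minimal sketch, not part of any OpenSearch SDK. The `Field` class, the function names, and the type label `MULTI_FLOAT` used for the vector field are illustrative assumptions.

```python
from dataclasses import dataclass

MAX_SHARDS = 256  # upper bound stated for Number of Data Shards


@dataclass
class Field:
    """Hypothetical description of one table field (not an SDK type)."""
    name: str
    type: str                 # illustrative label, e.g. "INT64" or "MULTI_FLOAT"
    primary_key: bool = False
    vector: bool = False


def check_shard_counts(shards_per_table):
    """Return a list of problems found in a {table name: shard count} mapping."""
    problems = []
    for table, n in shards_per_table.items():
        if not isinstance(n, int) or not 1 <= n <= MAX_SHARDS:
            problems.append(f"{table}: shard count must be an integer in 1..{MAX_SHARDS}")
    counts = list(shards_per_table.values())
    if len(set(counts)) > 1:
        # Allowed exception: exactly one single-shard table, all others identical.
        others = [n for n in counts if n != 1]
        if not (counts.count(1) == 1 and len(set(others)) == 1):
            problems.append(
                "shard counts must all match, or one table may use 1 shard "
                "while the remaining tables share the same count"
            )
    return problems


def check_fields(fields):
    """Return a list of problems found in the field definitions."""
    problems = []
    if not any(f.primary_key for f in fields):
        problems.append("no primary key field defined")
    vectors = [f for f in fields if f.vector]
    if not vectors:
        problems.append("no vector field defined")
    for f in vectors:
        if f.type != "MULTI_FLOAT":  # assumed label for a multi-value float type
            problems.append(f"vector field {f.name!r} must be a multi-value float type")
    return problems


if __name__ == "__main__":
    print(check_shard_counts({"docs": 4, "images": 4, "lookup": 1}))
    print(check_fields([Field("id", "INT64", primary_key=True),
                        Field("embedding", "MULTI_FLOAT", vector=True)]))
```

Both functions return an empty list when the configuration satisfies the rules, so they can be used as a quick pre-flight check on a planned table layout.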
2. Purchase and bind a Data Integration resource group
2.1. Purchase a Data Integration resource group
Go to the Data Integration purchase page to purchase a resource group. Make sure the region you select is the same as the region of your OpenSearch Vector Search Edition instance.
2.2. Bind the resource group to a workspace
Log on to the DataWorks console. Bind the purchased DataWorks resource group to a workspace. You can bind it to the default workspace or create a new one.
To create a new workspace, click Create Workspace on the workspace list page. Enter a Workspace Name, select the purchased resource group for Data Integration Resource Group in the advanced settings, and then click Create Workspace.
Add a data source
Before configuring a data synchronization task, you must add the source and destination data sources. This allows you to select them by name during task configuration.
-
Go back to the DataWorks workspace list page, find the workspace you created, and choose Shortcuts → Data Integration in the Actions column to enter the DataWorks Management Center. In the navigation pane on the left, choose Data Source → Add Data Source to create a new data source.
-
In the search box on the Add Data Source page, find and select the OpenSearch data source. On the Add OpenSearch Data Source page, first configure the Basic Information. Then, in the Connection Configuration section, click Test Network Connectivity. If the connectivity status is Connected, the data source is created.
-
If the status is Connection failed, the resource group cannot connect to the data source, and synchronization tasks that use this data source cannot run. In this case, use the Network Connectivity Diagnostic Tool panel that appears on the right to resolve the issue. The tool lists the reasons for the failure. Follow its suggestions, such as checking your account name, password, connection endpoint, or the status of the created instance.
Configure a synchronization task
-
In the navigation pane on the left, select Synchronization Task. Set the Source and Destination for the task, and then click Create Synchronization Task.
-
DataWorks supports a wide variety of data sources as inputs and outputs for Data Integration. You can create data sources in the Data Integration module for your synchronization tasks. For this tutorial, we use Elasticsearch as the source (assuming the data source is already created) and OpenSearch as the destination.
On the Create Synchronization Task page, select Single-table offline synchronization for Synchronization Type, and then go to the DataStudio page to configure the data transfer.
-
Create a node. You are redirected to the DataStudio page, where the Create Node dialog box appears. Select Offline synchronization for Node Type, set the Path as needed, and then click Confirm.
-
In the Configure Network Connections and Resource Group step, configure the Data Source, My Resource Group, and Data Destination. Then, click Next.
-
Configure the task. Set the options in the Configure Source and Destination, Field Mapping, and Channel Control sections. After the configuration is complete, click Run in the upper-left corner to start the offline data synchronization.
-
In Field Mapping, the source fields and destination fields must be mapped one-to-one. To change the mapping order, you must manually edit the field mappings. The synchronization fails if an unmapped destination field does not have a default value or does not support automatic default value population.
Data synchronization rules:
| Source data | Supported types |
| 0.123 | FLOAT/DOUBLE |
| 123 | INT8/INT32 or other INT types |
| "0.1,0.2,0.3" | MULTI_FLOAT/MULTI_DOUBLE (multi-value type, typically used for a vector field) |
| [0.1,0.2,0.3] | MULTI_FLOAT/MULTI_DOUBLE (multi-value type, typically used for a vector field) |
| ["abc","defg"] | STRING/MULTI_STRING (Both single-value and multi-value types are supported. Choose based on your business scenario.) |
| Non-string array elements, such as [{"a":b},{"c":d}] | STRING/TEXT/RAW (Multi-level object structures support only single-value push.) |
-
For multi-value data types, the names are prefixed with "MULTI_" in the DataWorks UI, such as MULTI_FLOAT.
-
View the run results. In the results pane at the bottom, you can check the run status.
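The data synchronization rules table above can be read as a lookup from the shape of a source value to the destination types that accept it. The sketch below encodes that lookup so that you can spot-check a few sample source records before running the task. The function names are hypothetical and the code does not call DataWorks. It returns an empty list for shapes that the table does not list. Treating a list of integers like a list of floats, and sending mixed lists to the last row, are assumptions.

```python
def _is_number(text):
    """True if the string parses as a float."""
    try:
        float(text.strip())
        return True
    except ValueError:
        return False


def _is_real(x):
    """True for int or float values, excluding bool."""
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def candidate_types(value):
    """Return the destination types that the rules table lists for this value."""
    if isinstance(value, bool):
        return []                              # not covered by the table
    if isinstance(value, int):
        return ["INT8", "INT32"]               # "or other INT types"
    if isinstance(value, float):
        return ["FLOAT", "DOUBLE"]
    if isinstance(value, str):
        parts = value.split(",")
        if len(parts) > 1 and all(_is_number(p) for p in parts):
            return ["MULTI_FLOAT", "MULTI_DOUBLE"]   # e.g. "0.1,0.2,0.3"
        return []                              # plain strings are not listed in the table
    if isinstance(value, list) and value:
        if all(_is_real(x) for x in value):
            return ["MULTI_FLOAT", "MULTI_DOUBLE"]   # e.g. [0.1, 0.2, 0.3]
        if all(isinstance(x, str) for x in value):
            return ["STRING", "MULTI_STRING"]        # e.g. ["abc", "defg"]
        return ["STRING", "TEXT", "RAW"]             # nested objects, single-value push only
    return []


def to_vector_string(vector):
    """Format a vector as the comma-separated string form shown in the table."""
    return ",".join(str(float(x)) for x in vector)


if __name__ == "__main__":
    sample = {"id": 123, "score": 0.123, "embedding": [0.1, 0.2, 0.3]}
    for name, value in sample.items():
        print(name, candidate_types(value))
```

If a destination field's type in the OpenSearch table is not among the candidates returned for the corresponding source value, the field mapping is worth reviewing before you click Run.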
Query test
Return to the OpenSearch Vector Search Edition console.
In the instance list, find the target instance and click Query Test in the Actions column.
On the Query Test page, select a query type (vector query, primary key query, or vector-text hybrid query), select the target table, enter your query conditions, and click Search.
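One way to sanity-check what a vector query returns is to compute exact nearest neighbors locally over a small sample of the synchronized rows and compare the primary keys. The sketch below is a brute-force reference in plain Python. It does not call OpenSearch. The two distance functions are common choices and are assumptions here, so use whichever distance type you selected in the index schema. Because a vector index may use an approximate algorithm, its results can differ slightly from this exact ranking.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance: smaller means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    """Inner product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(query, rows, k=3, distance="squared_euclidean"):
    """Rank rows ({primary key: vector}) against the query and return the top k.

    Returns a list of (primary key, score) pairs, best match first.
    """
    for pk, vec in rows.items():
        if len(vec) != len(query):
            raise ValueError(f"row {pk!r} has dimension {len(vec)}, expected {len(query)}")
    if distance == "squared_euclidean":
        score, reverse = squared_euclidean, False   # ascending distance
    elif distance == "inner_product":
        score, reverse = inner_product, True        # descending similarity
    else:
        raise ValueError(f"unknown distance type: {distance}")
    scored = [(pk, score(query, vec)) for pk, vec in rows.items()]
    scored.sort(key=lambda item: item[1], reverse=reverse)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical sample rows keyed by primary key.
    rows = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 2.0]}
    for pk, s in brute_force_top_k([0.9, 0.1], rows, k=2):
        print(pk, s)
```

If the primary keys from Query Test differ sharply from this ranking on a small sample, check the vector dimension, the distance type, and the field mapping of the vector field.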