Synchronize vector data from OpenLake-DLF to OpenSearch


OpenSearch Vector Search Edition pulls multimodal data—text, images, and video—directly from a Data Lake Formation (DLF) data source, vectorizes it using built-in models or the AI Search Open Platform, and indexes the results for similarity search. This turns unstructured lakehouse data into a searchable vector index without a separate ingestion pipeline.

Supported table formats: Paimon, Lance, and Object Table. Supported use cases: image search, text semantic search, and video search.

Prerequisites

Before you begin, ensure that you have:

  • A DLF data catalog, database, and data table configured—these identifiers are required during data synchronization

  • Reviewed the Data Lake Formation overview

Limitations

Review these constraints before configuring:

Constraint | Detail
Vector: Video Search template | Does not support DLF as a full data source
Vector: Text Semantic Search template | Supports only the Paimon table format
Common Template and Vector: Image Search | Support Paimon, Lance, and Object Table formats
Paimon primary key table | Supports add, delete, update, and query operations
Paimon append-only table | Write-only; data cannot be modified or deleted after ingestion
Existing instances | Must upgrade the engine version before using DLF data sources
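
The template and format constraints above can be checked before you start the wizard. The following is a minimal sketch that only transcribes the table into a lookup; the function and constant names are hypothetical and are not part of any OpenSearch SDK.

```python
# Scenario template -> table formats it accepts when DLF is the full data
# source, transcribed from the constraints table. Names are illustrative.
SUPPORTED_FORMATS = {
    "Common Template": {"Paimon", "Lance", "Object Table"},
    "Vector: Image Search": {"Paimon", "Lance", "Object Table"},
    "Vector: Text Semantic Search": {"Paimon"},
    "Vector: Video Search": set(),  # DLF is not supported as a full data source
}


def check_template_format(template: str, table_format: str) -> bool:
    """Return True if the template accepts the given DLF table format."""
    if template not in SUPPORTED_FORMATS:
        raise ValueError(f"unknown scenario template: {template!r}")
    return table_format in SUPPORTED_FORMATS[template]
```

For example, `check_template_format("Vector: Text Semantic Search", "Lance")` returns `False`, because that template accepts only Paimon.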

Add a DLF data source

Step 1: Set basic information

  1. On the Instance Details > Table Management page, click Add Table.

  2. In the Basic Information step, configure the following parameters, then click Next.

    Parameter | Description
    Table Name | A custom name for the table.
    Data Shards | If you create multiple index tables, they must all use the same shard count. The one exception: a single table may have one shard while the remaining tables share an identical count.
    Number of Resources for Data Updates | Controls the compute allocated for data updates. Each index includes a free quota of two resources; each resource provides 4 CPU cores and 8 GB of memory. Resources beyond the free quota are billed. See Billing of Vector Search Edition.
    Scenario Template | Choose from four built-in templates: Common Template, Vector: Image Search, Vector: Text Semantic Search, and Vector: Video Search. See Limitations for format restrictions per template.

    Figure: Basic information configuration
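
The Data Shards rule and the resource quota lend themselves to a quick sanity check. The sketch below is an illustration under stated assumptions: it reads the shard rule as "all counts equal, or exactly one table with a single shard", and the helper names are hypothetical. Pricing is not modeled; only the number of billable resources is computed.

```python
from collections import Counter

FREE_RESOURCES = 2          # free quota per index
CORES_PER_RESOURCE = 4      # CPU cores per resource
MEMORY_GB_PER_RESOURCE = 8  # GB of memory per resource


def shard_counts_compatible(shard_counts: list[int]) -> bool:
    """Check the shard rule: all equal, or exactly one single-shard table."""
    distinct = Counter(shard_counts)
    if len(distinct) <= 1:
        return True
    # Assumed reading: at most one table may deviate, and only with 1 shard.
    return len(distinct) == 2 and distinct.get(1) == 1


def update_resources(total: int) -> dict:
    """Summarize compute and the billable share for `total` update resources."""
    if total < 0:
        raise ValueError("total must be non-negative")
    return {
        "cpu_cores": total * CORES_PER_RESOURCE,
        "memory_gb": total * MEMORY_GB_PER_RESOURCE,
        "billable_resources": max(0, total - FREE_RESOURCES),
    }
```

With three resources, `update_resources(3)` reports 12 CPU cores, 24 GB of memory, and one billable resource.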

Step 2: Configure the data source

In the Data Synchronization step, configure the source, then click Next after the data source check passes.

Parameter | Description
Full Data Source | Select Data Lake Formation (DLF).
Table Format | Select Paimon, Lance, or Object Table. See Table formats for details.
Data Catalog | The ID of the target DLF data catalog.
Database | The database in the target data catalog.
Data Table | The data table in the target database.
Relative Path | The relative path to files in the object table. Applies only when Table Format is Object Table.
Data Format | Select ha3 or json. Applies only when Table Format is Object Table.
Tag | A data version tag. If specified, OpenSearch uses the tagged data for the full import. If left blank, OpenSearch uses the latest data in the table.
Data Source Check | Verifies connectivity to the data source. Proceed to the next step only after verification passes.
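
The dependencies between these parameters can be summarized in a small model. This is a sketch only: `DlfSourceConfig` is a hypothetical container that mirrors the console fields, not an official API, and it assumes that Data Format is mandatory for Object Table.

```python
from dataclasses import dataclass
from typing import Optional

TABLE_FORMATS = {"Paimon", "Lance", "Object Table"}
DATA_FORMATS = {"ha3", "json"}


@dataclass
class DlfSourceConfig:
    """Hypothetical mirror of the Data Synchronization form."""
    table_format: str
    data_catalog: str                    # ID of the DLF data catalog
    database: str
    data_table: str
    relative_path: Optional[str] = None  # Object Table only
    data_format: Optional[str] = None    # Object Table only: "ha3" or "json"
    tag: Optional[str] = None            # blank means "use the latest data"

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the form is consistent."""
        errors = []
        if self.table_format not in TABLE_FORMATS:
            errors.append(f"unsupported table format: {self.table_format}")
        if self.table_format == "Object Table":
            if self.data_format not in DATA_FORMATS:
                errors.append("Object Table requires data_format 'ha3' or 'json'")
        elif self.relative_path is not None or self.data_format is not None:
            errors.append("relative_path and data_format apply only to Object Table")
        return errors

    def import_version(self) -> str:
        """Describe which data the full import reads."""
        return f"tag:{self.tag}" if self.tag else "latest"
```

A Paimon source with no tag validates cleanly and imports the latest data, while a Lance source that sets `data_format` is flagged.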

Table formats

Paimon is a lakehouse table format that supports real-time data updates and both stream and batch processing. Paimon provides a tag feature to retain metadata and data files of specific snapshots, preventing historical data loss due to snapshot expiration. Tags can be created automatically based on write jobs, generated periodically by processing time or watermark, or created, deleted, and rolled back manually. Configure a data retention policy to control the maximum number of tags or their retention period to ensure that historical data remains queryable. For more information, see Paimon Tags.
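
To make the effect of a tag retention policy concrete, the following sketch models the two limits mentioned above: a maximum tag count and a retention period. It is an illustrative simulation only, not Paimon's implementation; the function name and the order in which the limits are applied are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional


def retained_tags(tags: dict[str, datetime], now: datetime,
                  max_tags: Optional[int] = None,
                  max_age: Optional[timedelta] = None) -> list[str]:
    """Return the names of the tags that survive, newest first.

    Tags older than max_age are dropped first; then at most max_tags
    of the newest remaining tags are kept.
    """
    ordered = sorted(tags.items(), key=lambda item: item[1], reverse=True)
    if max_age is not None:
        ordered = [(name, created) for name, created in ordered
                   if now - created <= max_age]
    if max_tags is not None:
        ordered = ordered[:max_tags]
    return [name for name, _ in ordered]
```

Only a tag that survives the policy can still be entered in the Tag parameter for a full import, so size the limits to cover the versions you expect to re-import.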

Lance is a vector table format designed for AI workloads that enables fast similarity searches on vectors. Lance uses tags to mark specific versions in a dataset's history, making it easier to track dataset evolution in frequently updated machine learning workflows. Tags do not create new versions. They are stored as metadata in a separate folder, and a tagged version is not removed by the cleanup_old_versions operation. To remove a tagged version, delete its tag first. For more information, see Lance Tags.
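
The interaction between tags and version cleanup can be pictured with a toy model. The class below is not the Lance API; it is a hypothetical, in-memory sketch that assumes cleanup removes every untagged version except the latest one.

```python
class VersionedDataset:
    """Toy model of a versioned dataset with tags (not the Lance API)."""

    def __init__(self, versions: list[int]):
        self.versions = set(versions)
        self.tags: dict[str, int] = {}

    def create_tag(self, name: str, version: int) -> None:
        if version not in self.versions:
            raise ValueError(f"version {version} does not exist")
        self.tags[name] = version  # metadata only; no new version is created

    def delete_tag(self, name: str) -> None:
        del self.tags[name]

    def cleanup_old_versions(self) -> list[int]:
        """Remove untagged versions except the latest; return what was removed."""
        protected = set(self.tags.values()) | {max(self.versions)}
        removed = sorted(self.versions - protected)
        self.versions &= protected
        return removed
```

In this model, a version tagged `baseline` survives cleanup until the tag is deleted, which mirrors the "remove the tag first" rule described above.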

Object Table is a metadata table format that lets you query and locate files stored in the cloud using SQL.

Step 3: Configure fields

In the Field Configuration step, configure the schema fields, then click Next.

Figure: Field configuration

The following fields are required:

Field | Type | Option to enable | Notes
Primary key | int or string | Primary Key | Uniquely identifies each record.
Vector | float | Vector Field | Multi-value float field by default.

For String type fields, enable Data pre-processing required and click Configure to call a model that pre-processes the field data before indexing.

If a source field is missing or empty, the system assigns a default value automatically: 0 for numeric fields and an empty string for string fields. Specify custom default values to override these defaults.
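
The default-value behavior can be sketched as follows. This is a minimal illustration, assuming that "empty" means `None` or an empty string; the function name, the type labels, and the sample field names are hypothetical.

```python
# System defaults by field type, as described above.
DEFAULTS_BY_TYPE = {"int": 0, "float": 0.0, "string": ""}


def fill_defaults(record: dict, schema: dict, custom_defaults: dict = None) -> dict:
    """Return a copy of `record` with missing or empty fields filled in.

    `schema` maps field name -> type label; `custom_defaults` maps
    field name -> value and takes precedence over the system default.
    """
    custom_defaults = custom_defaults or {}
    filled = {}
    for field, field_type in schema.items():
        value = record.get(field)
        if value is None or value == "":
            value = custom_defaults.get(field, DEFAULTS_BY_TYPE[field_type])
        filled[field] = value
    return filled
```

For a schema with `id`, `title`, and `price`, a record that carries only `id` gains an empty `title` and a `price` of 0 unless custom defaults are supplied.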

Pre-processing by data type

Text data type

Figure: Text pre-processing configuration

Setting | Options
Data type | Text
Pre-processing template | Dense vectorization, or Dense + sparse vectorization
Model sources | One of the following:
  • Built-in models: A limited selection of model types, available at no cost.
  • AI Search Open Platform: A broader model selection, billed per call. Activate a workspace and an API key on the AI Search Open Platform before use. See Billing methods and billable items.
  • Custom model: Go to Models > Custom Models on the Vector Search Edition page and click Create Model. See Custom model.

Image data type

Setting | Options
Data type | Image
Data source | One of the following:
  • Object Storage Service (OSS): Store images in an OSS folder and specify the OSS path to import them directly.
  • Base64 encoding: Encode the images first, then store them in a database or transfer them via API.
  • DLF-Object Table: Specify the corresponding data catalog, database, and data table.
Pre-processing template | Image vectorization, Image content parsing, or Image content parsing + image vectorization
Model sources | Same options as Text: built-in models, AI Search Open Platform, or custom models.
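
For the Base64 encoding option, images must be encoded before they are stored or sent. The sketch below uses only Python's standard `base64` module; the helper names are hypothetical, and how the encoded string is placed into a record is left to your pipeline.

```python
import base64


def encode_image_bytes(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII Base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def decode_image_field(encoded: str) -> bytes:
    """Decode a Base64 string back to the original image bytes."""
    return base64.b64decode(encoded)


def encode_image_file(path: str) -> str:
    """Read an image file from disk and return its Base64 encoding."""
    with open(path, "rb") as handle:
        return encode_image_bytes(handle.read())
```

Decoding the stored string with `decode_image_field` should return the exact bytes that were encoded, which is a quick way to verify that nothing was truncated in transit.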

Video data type

Setting | Options
Data type | Video
Data source | Object Storage Service (OSS)
Pre-processing template | Video processing
Model sources | Same options as Text: built-in models, AI Search Open Platform, or custom models.

Step 4: Configure indexes

In the Index Schema step, configure the indexes, then click Next.

Figure: Index schema configuration

Vector index

Parameter | Description
Vector Dimensions | Select dimensions that match the output of your embedding model.
Distance Type | Select the distance metric that matches your model output. Supported types: Squared Euclidean, Inner Product (dot product), and Cosine.
Vector Search Algorithm | Select the index algorithm for your use case. Supported algorithms: Linear, HNSW, QGraph, QC, DiskANN, and CagraHnsw.
Real-time Index | Whether to build real-time indexes for incremental data pushed via API calls. Default: true.
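
When choosing a Distance Type, it can help to compare the three metrics on sample vectors from your model. The functions below are textbook definitions written in plain Python; they are a sketch for experimentation and do not describe how OpenSearch computes scores internally.

```python
import math


def _check_same_length(a: list[float], b: list[float]) -> None:
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")


def squared_euclidean(a: list[float], b: list[float]) -> float:
    """Sum of squared per-dimension differences (smaller means closer)."""
    _check_same_length(a, b)
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a: list[float], b: list[float]) -> float:
    """Dot product (larger means more similar)."""
    _check_same_length(a, b)
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Inner product of the vectors after normalizing each to unit length."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return inner_product(a, b) / (norm_a * norm_b)
```

If your model emits unit-length vectors, inner product and cosine similarity produce the same ranking, so the choice between them matters mainly for unnormalized embeddings.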

To configure advanced vector index settings, expand the advanced section. See Common configurations for vector indexes for parameter details.

Figure: Advanced vector index configuration

Other indexes

The system automatically generates a pk field and a primary key index. For all other non-vector fields, an index with the same name is created by default.

Global index configuration

Enable automatic cleanup for expired documents. When enabled, a document is deleted automatically when the difference between the current time and the document's timestamp exceeds the specified expiration time.
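
The expiration rule reduces to a single comparison. The sketch below assumes timestamps and the expiration time are both expressed in seconds and that each document carries its timestamp in a field named `timestamp`; both are assumptions made for illustration.

```python
def is_expired(doc_timestamp: int, now: int, expiration_seconds: int) -> bool:
    """A document expires when its age exceeds the expiration time."""
    return now - doc_timestamp > expiration_seconds


def purge_expired(docs: list[dict], now: int, expiration_seconds: int,
                  ts_field: str = "timestamp") -> list[dict]:
    """Return only the documents that have not yet expired."""
    return [doc for doc in docs
            if not is_expired(doc[ts_field], now, expiration_seconds)]
```

Note that a document whose age equals the expiration time exactly is kept under this reading, because the rule above deletes it only once the difference exceeds the threshold.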

Step 5: Confirm and verify

  1. In the Confirm step, click Confirm. The system creates the table. Track progress on the Change History page.

  2. After the table status changes to In Use, run a query test on the Query Test page to verify that data is indexed and searchable.

Usage notes

When new data is written to a Paimon table in DLF, OpenSearch automatically triggers real-time indexing for that data. If you also write data manually through API calls, data consistency issues may occur. Use API writes with caution in this scenario.