Automatic embedding converts text, images, or videos into vectors using built-in pre-trained open-source models or models from Alibaba Cloud Model Studio. This process eliminates the manual work of defining vector fields required in traditional solutions. This article explains how to write and query data using automatic embeddings in Lindorm.
Prerequisites
- The vector engine must be enabled. For instructions, see Enable vector engine.
- The search engine must be enabled, and its version must be 3.9.10 or later. To enable the engine, see the Activation guide. To view or upgrade your current version, see Search engine version information and Upgrade a minor version.
  Important: If your search engine version is earlier than 3.9.10 but the console indicates that it is the latest version, contact Lindorm technical support (DingTalk ID: s0s3eg3).
- The AI engine must be enabled. For instructions, see the Activation guide. If you need to use Alibaba Cloud Model Studio models, select the Dashscope model option during activation. If this option is not available, contact Lindorm technical support (DingTalk ID: s0s3eg3).
  Note: The AI engine depends on the wide table engine, which must also be enabled.
- Your client IP address must be added to the Lindorm whitelist. For instructions, see Configure a whitelist.
Procedure overview
| Step | Engines | Description |
| --- | --- | --- |
| 1 | AI engine | Use a curl command to call the AI engine RESTful API to deploy an embedding model. This model converts specified data into vectors. |
| 2 | Search engine, AI engine (requires parameter validation) | Create a write (ingest) pipeline in the search engine to automatically convert data being written into vector data (embeddings). |
| 3 | Search engine, AI engine (requires parameter validation) | Create a query (search) pipeline in the search engine to automatically convert queries into vector data. |
| 4 | Vector engine, search engine | When creating or modifying a vector index, specify the write and query pipelines to automatically convert data and queries into vector data. |
| 5 | Vector engine, search engine, AI engine | Write data to the newly created index. The write pipeline automatically converts the data into vector data. |
| 6 | Vector engine, search engine, AI engine | Query data from the newly created index. The query pipeline automatically converts the query into vector data. |
Deploy embedding model (Optional)
For instructions on deploying models in the AI engine, see Model Management and Examples of Using the AI engine RESTful API with curl commands.
The following example shows how to deploy the BGE_VISUALIZED model. For parameter details, see Model Management.
Use the AI engine's private network endpoint as the URL for your curl request.
curl -i -k --location --header 'x-ld-ak:<username>' --header 'x-ld-sk:<password>' -X POST http://<URL>/v1/ai/models/create -H "Content-Type: application/json" -d '{
"model_name": "bge_visualized_model",
"model_path": "huggingface://BAAI/bge-visualized",
"task": "FEATURE_EXTRACTION",
"algorithm": "BGE_VISUALIZED_M3",
"settings": {"instance_count": "2"}
}'
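The same deployment request can also be prepared from a script. The following is a minimal Python sketch that uses only the standard library. The function name `build_deploy_request` and the example endpoint and credentials are hypothetical; the path, headers, and JSON body are copied from the curl command above. The request is built but not sent.

```python
import json
import urllib.request


def build_deploy_request(base_url: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the model deployment request shown above."""
    body = {
        "model_name": "bge_visualized_model",
        "model_path": "huggingface://BAAI/bge-visualized",
        "task": "FEATURE_EXTRACTION",
        "algorithm": "BGE_VISUALIZED_M3",
        "settings": {"instance_count": "2"},
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/ai/models/create",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-ld-ak": username,
            "x-ld-sk": password,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical endpoint and credentials, for illustration only.
request = build_deploy_request("http://ai-engine.example.internal:9002", "user", "secret")
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen` from a host that can reach the AI engine's private network endpoint.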
Create pipelines in the search engine
Create two types of pipelines in the search engine to automate vectorization for data ingestion and queries.
Use the VPC endpoint of the search engine as the curl request URL.
Text embedding
Create an ingest pipeline
curl -u <username>:<password> -H "Content-Type: application/json" -XPUT "http://<URL>/_ingest/pipeline/<ingest_pipeline_name>" -d '{
"description": "demo embedding pipeline",
"processors": [
{
"text-embedding": {
"inputFields": ["input_field"],
"outputFields": ["embedding_field"],
"userName": "root",
"password": "test****",
"url": "http://ld-xxxx-proxy-ai-vpc.lindorm.aliyuncs.com:9002/dashscope/compatible-mode/v1/embeddings",
"modeName": "text-embedding-v4"
}
}
]
}'
Parameters
| Parameter | Description |
| --- | --- |
| processors | The processors that transform documents during ingestion. |
| text-embedding | A fixed key that specifies the processor type. This parameter is required. |
| inputFields | The field to vectorize. |
| outputFields | The field to store the generated vector. |
| userName | The username for the Lindorm AI Engine. |
| password | The password for the Lindorm AI Engine. |
| url | The URL of the embedding service, in the format shown in the example above. |
| modeName | The name of the model. This topic uses text-embedding-v4. |
| dimension (Optional) | Specifies the dimension of the generated vector. Important: Only models from the Model Studio series support specifying vector dimensions. For more information, see Model Studio text embedding. |
The inputFields and outputFields specified in the ingest pipeline must match the input_field and embedding_field specified when you create the vector index.
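When several pipelines differ only in field names or model, generating the request body can reduce copy-and-paste mistakes. The sketch below is one way to do that in Python. The function name is hypothetical, and the placement of the optional `dimension` key next to the other processor keys is an assumption based on the parameter table, not something the example above shows.

```python
import json
from typing import Optional


def build_text_ingest_pipeline(
    input_field: str,
    output_field: str,
    ai_url: str,
    username: str,
    password: str,
    model: str = "text-embedding-v4",
    dimension: Optional[int] = None,
) -> dict:
    """Return a request body for PUT /_ingest/pipeline/<ingest_pipeline_name>."""
    processor = {
        "inputFields": [input_field],
        "outputFields": [output_field],
        "userName": username,
        "password": password,
        "url": ai_url,
        "modeName": model,  # key spelled as in the example above
    }
    if dimension is not None:
        # Assumption: the optional dimension sits next to the other processor keys.
        processor["dimension"] = dimension
    return {
        "description": "demo embedding pipeline",
        "processors": [{"text-embedding": processor}],
    }


body = build_text_ingest_pipeline(
    "input_field",
    "embedding_field",
    "http://ld-xxxx-proxy-ai-vpc.lindorm.aliyuncs.com:9002/dashscope/compatible-mode/v1/embeddings",
    "root",
    "test****",
    dimension=1024,
)
print(json.dumps(body, indent=2))
```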
Create a search pipeline
curl -u <username>:<password> -H "Content-Type: application/json" -XPUT "http://<URL>/_search/pipeline/<search_pipeline_name>" -d '{
"request_processors": [
{
"text-embedding" : {
"tag" : "auto-query-embedding",
"description" : "Auto query embedding",
"model_config" : {
"inputFields": ["text_field"],
"outputFields": ["text_field_embedding"],
"userName": "root",
"password": "test****",
"url": "http://ld-xxxx-proxy-ai-vpc.lindorm.aliyuncs.com:9002/dashscope/compatible-mode/v1/embeddings",
"modeName": "text-embedding-v4"
}
}
}
]
}'
Parameters
| Parameter | Description |
| --- | --- |
| request_processors | The processors that run on search requests. |
| text-embedding | A fixed key that specifies the processor type. This parameter is required. |
| inputFields | A placeholder for the text field to vectorize. |
| outputFields | The field to store the generated vector. |
| userName | The username for the Lindorm AI Engine. |
| password | The password for the Lindorm AI Engine. |
| url | The URL of the embedding service, in the format shown in the example above. |
| modeName | The name of the model. This topic uses text-embedding-v4. |
| dimension (Optional) | Specifies the dimension of the generated vector. Important: Only models from the Model Studio series support specifying vector dimensions. For more information, see Model Studio text embedding. |
The outputFields specified in the query pipeline must match the embedding_field specified when the vector index is created.
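One way to keep the search pipeline aligned with the ingest pipeline is to derive one from the other, so that the output field, model, and endpoint cannot drift apart. The following Python sketch assumes that queries should be embedded by the same model as the documents, because vectors produced by different models are generally not comparable. The function name and the `query_placeholder` argument are hypothetical.

```python
import copy


def search_pipeline_from_ingest(ingest_body: dict, query_placeholder: str = "text_field") -> dict:
    """Derive a text-embedding search pipeline body from an ingest pipeline body."""
    ingest_cfg = ingest_body["processors"][0]["text-embedding"]
    model_config = copy.deepcopy(ingest_cfg)  # leave the ingest body untouched
    # In a search pipeline, inputFields only acts as a placeholder.
    model_config["inputFields"] = [query_placeholder]
    return {
        "request_processors": [
            {
                "text-embedding": {
                    "tag": "auto-query-embedding",
                    "description": "Auto query embedding",
                    "model_config": model_config,
                }
            }
        ]
    }


ingest_body = {
    "processors": [
        {
            "text-embedding": {
                "inputFields": ["input_field"],
                "outputFields": ["embedding_field"],
                "userName": "root",
                "password": "test****",
                "url": "http://ld-xxxx-proxy-ai-vpc.lindorm.aliyuncs.com:9002/dashscope/compatible-mode/v1/embeddings",
                "modeName": "text-embedding-v4",
            }
        }
    ]
}
search_body = search_pipeline_from_ingest(ingest_body)
print(search_body["request_processors"][0]["text-embedding"]["model_config"]["outputFields"])
```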
Multimodal embedding
Create an ingest pipeline
curl -u <username>:<password> -H "Content-Type: application/json" -XPUT "http://<URL>/_ingest/pipeline/<ingest_pipeline_name>" -d '{
"description": "demo embedding pipeline",
"processors": [
{
"multimodal-embedding": {
"input_fields": ["input_field"],
"output_fields": ["embedding_field"],
"user_name": "root",
"password": "test****",
"url": "http://ld-xxxx-proxy-ai-vpc.lindorm.aliyuncs.com:9002/dashscope/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding",
"model_name": "tongyi-embedding-vision-plus",
"input_type": "image"
}
}
]
}'
Parameters
| Parameter | Description |
| --- | --- |
| processors | The processors that transform documents during ingestion. |
| multimodal-embedding | A fixed key that specifies the processor type. This parameter is required. |
| input_fields | The field to vectorize. |
| output_fields | The field to store the generated vector. |
| user_name | The username for the Lindorm AI Engine. |
| password | The password for the Lindorm AI Engine. |
| url | The URL of the embedding service, in the format shown in the example above. |
| model_name | The name of the model. This topic uses tongyi-embedding-vision-plus. |
| input_type | The type of data to ingest. Accepted values are 'text', 'image', and 'video'. |
The input_fields and output_fields specified for the ingest pipeline must match the input_field and embedding_field specified when you create the vector index.
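Because `input_type` accepts only three values, it can be worth validating it before the pipeline is created. The following Python sketch builds the multimodal ingest pipeline body and rejects any other value. The function name is hypothetical; the keys and accepted values are taken from the example and table above.

```python
ALLOWED_INPUT_TYPES = ("text", "image", "video")


def build_multimodal_ingest_pipeline(
    input_field: str,
    output_field: str,
    ai_url: str,
    username: str,
    password: str,
    input_type: str,
    model: str = "tongyi-embedding-vision-plus",
) -> dict:
    """Return a request body for a multimodal-embedding ingest pipeline."""
    if input_type not in ALLOWED_INPUT_TYPES:
        raise ValueError(
            f"input_type must be one of {ALLOWED_INPUT_TYPES}, got {input_type!r}"
        )
    return {
        "description": "demo embedding pipeline",
        "processors": [
            {
                "multimodal-embedding": {
                    "input_fields": [input_field],
                    "output_fields": [output_field],
                    "user_name": username,
                    "password": password,
                    "url": ai_url,
                    "model_name": model,
                    "input_type": input_type,
                }
            }
        ]
    }


pipeline = build_multimodal_ingest_pipeline(
    "input_field",
    "embedding_field",
    "http://ld-xxxx-proxy-ai-vpc.lindorm.aliyuncs.com:9002/dashscope/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding",
    "root",
    "test****",
    input_type="image",
)
print(pipeline["processors"][0]["multimodal-embedding"]["input_type"])
```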
Create a search pipeline
curl -u <username>:<password> -H "Content-Type: application/json" -XPUT "http://<URL>/_search/pipeline/<search_pipeline_name>" -d '{
"description": "demo embedding pipeline",
"request_processors": [
{
"multimodal-embedding": {
"model_config" : {
"input_fields": ["input_field"],
"output_fields": ["embedding_field"],
"user_name": "root",
"password": "test****",
"url": "http://ld-xxxx-proxy-ai-vpc.lindorm.aliyuncs.com:9002/dashscope/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding",
"model_name": "tongyi-embedding-vision-plus"
}
}
}
]
}'
Parameters
| Parameter | Description |
| --- | --- |
| request_processors | The processors that run on search requests. |
| multimodal-embedding | A fixed key that specifies the processor type. This parameter is required. |
| input_fields | In a search pipeline, this field acts only as a placeholder. |
| output_fields | The field to store the generated vector. |
| user_name | The username for the Lindorm AI Engine. |
| password | The password for the Lindorm AI Engine. |
| url | The URL of the embedding service, in the format shown in the example above. |
| model_name | The name of the model. This topic uses tongyi-embedding-vision-plus. |
The output_fields specified in the query pipeline must be consistent with the embedding_field specified when creating the vector index.
Create an index and specify a pipeline
When creating or modifying a vector index, you must specify the required pipeline.
Use the VPC endpoint of the search engine as the curl request URL.
Create a vector index
curl -u <username>:<password> -H 'Content-Type: application/json' -XPUT "http://<URL>/<index_name>" -d '
{
"settings" : {
"index": {
"number_of_shards": 2,
"knn": true,
"default_pipeline": "<ingest_pipeline_name>",
"search.default_pipeline": "<search_pipeline_name>"
}
},
"mappings": {
"_source": {
"excludes": ["embedding_field"]
},
"properties": {
"input_field": {
"type": "text",
"analyzer": "ik_max_word"
},
"embedding_field": {
"type": "knn_vector",
"dimension": 1024,
"method": {
"engine": "lvector",
"name": "hnsw",
"space_type": "cosinesimil",
"parameters": {
"m": 24,
"ef_construction": 500
}
}
},
"tag": {
"type": "keyword"
},
"brand": {
"type": "keyword"
},
"merit" : {
"type": "text",
"analyzer": "ik_max_word"
}
}
}
}'
Parameters
| Parameter | Description |
| --- | --- |
| default_pipeline | Specifies the ingest pipeline for processing documents before they are indexed. |
| search.default_pipeline | Specifies the search pipeline that automatically generates embeddings for queries. |
| type | Use 'keyword' for exact value matching or 'text' for tokenization. If you use 'text', you must specify an 'analyzer'. The example uses the ik_max_word analyzer. |
| dimension | The dimension of the vector. This must match the dimension of the vector output by the model. |
For descriptions of other parameters, see Parameter descriptions.
The inputFields and outputFields specified in the ingest and search pipelines must match the input_field and embedding_field defined when you create the vector index.
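The field-name requirement can be checked locally before any request is sent. The following Python sketch compares a processor configuration with an index creation body and reports mismatches. The function name and its `check_inputs` flag are hypothetical. The check covers only field names and the `knn_vector` type; it does not verify that `dimension` matches the model output.

```python
def check_pipeline_matches_index(processor_config: dict, index_body: dict, check_inputs: bool = True) -> list:
    """Return a list of problems; an empty list means the field names line up."""
    properties = index_body["mappings"]["properties"]
    # Text processors use camelCase keys; multimodal processors use snake_case keys.
    inputs = processor_config.get("inputFields") or processor_config.get("input_fields") or []
    outputs = processor_config.get("outputFields") or processor_config.get("output_fields") or []
    problems = []
    if check_inputs:  # set to False for search pipelines, where inputs are placeholders
        for name in inputs:
            if name not in properties:
                problems.append(f"input field '{name}' is not defined in the index mapping")
    for name in outputs:
        if properties.get(name, {}).get("type") != "knn_vector":
            problems.append(f"output field '{name}' is not a knn_vector field in the index mapping")
    return problems


index_body = {
    "mappings": {
        "properties": {
            "input_field": {"type": "text", "analyzer": "ik_max_word"},
            "embedding_field": {"type": "knn_vector", "dimension": 1024},
        }
    }
}
ingest_cfg = {"inputFields": ["input_field"], "outputFields": ["embedding_field"]}
search_cfg = {"inputFields": ["text_field"], "outputFields": ["text_field_embedding"]}

print(check_pipeline_matches_index(ingest_cfg, index_body))
print(check_pipeline_matches_index(search_cfg, index_body, check_inputs=False))
```

In this sketch the second call reports a problem, because `text_field_embedding` is not a vector field of the sample index.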
Modify existing vector index settings
You can modify the settings of an existing vector index to specify the ingest and search pipelines.
curl -u <username>:<password> -H 'Content-Type: application/json' -XPUT "http://<URL>/<index_name>/_settings" -d '
{
"index": {
"default_pipeline": "<ingest_pipeline_name>",
"search.default_pipeline": "<search_pipeline_name>"
}
}
'
Write data
Because the index specifies a write (ingest) pipeline, each write stores the scalar field input_field and also uses the pipeline to encode input_field into a vector, which is stored in embedding_field.
Use the VPC endpoint of the search engine as the curl request URL.
curl -u <username>:<password> -H 'Content-Type: application/json' -XPOST "http://<URL>/_bulk?pretty" -d '
{ "index" : { "_index" : "<index_name>", "_id" : "3982" } }
{ "input_field" : "Brand A Stylish Power-saving Wireless Mouse (Green) (12-month battery life, high-precision optical engine, 10m range)", "tag": ["Mouse", "Electronics"], "brand":"Brand A", "merit":"easy to use, stylish design"}
{ "index" : { "_index" : "<index_name>", "_id" : "323519" } }
{ "input_field" : "Brand B Optical Mouse (Black) (auto-pairing, 1000 DPI optical engine)", "tag": ["Mouse", "Electronics"], "brand":"Brand B", "merit":"good quality, fast delivery, stylish design, easy to use"}
{ "index" : { "_index" : "<index_name>", "_id" : "300265" } }
{ "input_field" : "Brand C In-ear Headphones (White) (classic and stylish)", "tag": ["Headphones", "Electronics"], "brand":"Brand C", "merit":"stylish design, good quality"}
{ "index" : { "_index" : "<index_name>", "_id" : "6797" } }
{ "input_field" : "Brand D Dual-head rechargeable electric shaver", "tag": ["Home Appliances", "Electric Shaver"], "brand":"Brand D", "merit":"easy to use, stylish design"}
{ "index" : { "_index" : "<index_name>", "_id" : "8195" } }
{ "input_field" : "Brand E Class 4 32GB microSD Card for mobile phones", "tag": ["Storage", "Memory Card", "SD Card"], "brand":"Brand E", "merit":"large capacity, fast, easy to use, good quality"}
{ "index" : { "_index" : "<index_name>", "_id" : "13316" } }
{ "input_field" : "Brand E 101 G2 32GB USB Drive", "tag": ["Storage","USB Drive"], "brand":"Brand E", "merit":"easy to use, large capacity, fast"}
{ "index" : { "_index" : "<index_name>", "_id" : "14103" } }
{ "input_field" : "Brand B 64GB Extreme High-Speed Mobile Memory Card (UHS-1, up to 30 MB/s read/write speed)", "tag": ["Storage", "Memory Card", "SD Card"], "brand":"Brand B", "merit":"large capacity, fast, easy to use"}
'
When ingesting data with a multimodal-embedding processor, the data format must match the specified input_type.
| input_type | Example |
| --- | --- |
| text | "input_field" : "multimodal vector model" |
| image | "input_field" : "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg" |
| video | "input_field" : "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4" |
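The body of a `_bulk` request alternates an action line with a document line. The following Python sketch generates that body from a dictionary of documents. The function name is hypothetical, and the trailing newline follows the convention of Elasticsearch-style bulk APIs, which is assumed to apply to the search engine here.

```python
import json


def build_bulk_body(index_name: str, docs: dict) -> str:
    """Return a newline-delimited bulk body; docs maps _id to the document fields."""
    lines = []
    for doc_id, source in docs.items():
        # Action line: index this document under the given _id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Document line: only scalar fields; the pipeline adds embedding_field.
        lines.append(json.dumps(source, ensure_ascii=False))
    # Assumption: every line, including the last one, ends with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body(
    "<index_name>",
    {
        "13316": {
            "input_field": "Brand E 101 G2 32GB USB Drive",
            "tag": ["Storage", "USB Drive"],
            "brand": "Brand E",
            "merit": "easy to use, large capacity, fast",
        }
    },
)
print(bulk_body)
```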
Data query
Use the VPC endpoint of the search engine as the curl request URL.
Pure vector
Text search
curl -u <username>:<password> -H 'Content-Type: application/json' -XGET "http://ld-xx-proxy-search-vpc.lindorm.aliyuncs.com:30070/<index_name>/_search?pretty" -d '
{
"size": 10,
"_source": true,
"query": {
"knn": {
"embedding_field": {
"query_text": "memory card",
"k": 10
}
}
},
"ext": {
"lvector": {
"ef_search": "200"
}
}
}'
Image search
curl -u <username>:<password> -H 'Content-Type: application/json' -XGET "http://ld-xx-proxy-search-vpc.lindorm.aliyuncs.com:30070/<index_name>/_search?pretty" -d '
{
"size": 10,
"_source": true,
"query": {
"knn": {
"embedding_field": {
"query_image": "https://img.alicdn.com/imgextra/i2/O1CN019eO00F1HDdlU4Syj5_!!6000000000724-2-tps-2476-1158.png",
"k": 10
}
}
},
"ext": {
"lvector": {
"ef_search": "200"
}
}
}'
- text-embedding type
  - A request_processor of the text-embedding type applies only to KNN queries, and the vector column specified in the KNN query must be included in the outputFields list specified in the pipeline configuration.
  - When constructing a KNN query, always use the query_text parameter to provide the content to be embedded.
    | Parameter | Description |
    | --- | --- |
    | query_text | The query text to be embedded. The key is fixed as query_text. |
- multimodal-embedding type
  - A request_processor of the multimodal-embedding type applies only to KNN queries, and the vector column specified in the KNN query must be included in the output_fields list specified in the pipeline configuration.
  - When constructing a KNN query, use the following parameters to provide the content to be embedded.
    | Format | Key |
    | --- | --- |
    | text | query_text |
    | image | query_image |
    | video | query_video |
For details on other parameters, see parameter descriptions.
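The mapping from content format to query key can be captured in a small helper. The following Python sketch builds the pure vector query body for text, image, or video input. The function name and its defaults are hypothetical; the keys come from the examples in this section.

```python
QUERY_KEYS = {"text": "query_text", "image": "query_image", "video": "query_video"}


def build_knn_query(vector_field: str, content: str, fmt: str = "text",
                    k: int = 10, ef_search: int = 200) -> dict:
    """Return a pure vector search body for the given content format."""
    if fmt not in QUERY_KEYS:
        raise ValueError(f"fmt must be one of {sorted(QUERY_KEYS)}, got {fmt!r}")
    return {
        "size": k,
        "_source": True,
        "query": {"knn": {vector_field: {QUERY_KEYS[fmt]: content, "k": k}}},
        # ef_search is passed as a string, as in the examples above.
        "ext": {"lvector": {"ef_search": str(ef_search)}},
    }


text_query = build_knn_query("embedding_field", "memory card")
image_query = build_knn_query(
    "embedding_field",
    "https://img.alicdn.com/imgextra/i2/O1CN019eO00F1HDdlU4Syj5_!!6000000000724-2-tps-2476-1158.png",
    fmt="image",
)
print(sorted(text_query["query"]["knn"]["embedding_field"]))
```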
Vector search with attribute filtering
curl -u <username>:<password> -H 'Content-Type: application/json' -XGET "http://ld-xx-proxy-search-vpc.lindorm.aliyuncs.com:30070/<index_name>/_search?pretty" -d '
{
"size": 10,
"_source": true,
"query": {
"knn": {
"embedding_field": {
"query_text": "memory card",
"k": 10,
"filter": {
"bool": {
"filter": [{
"match": {
"merit": "good quality"
}
},
{
"term": {
"brand": "Brand E"
}
},
{
"terms": {
"tag": ["SD Card", "Memory Card"]
}
}]
}
}
}
}
},
"ext": {
"lvector": {
"filter_type": "efficient_filter",
"ef_search": "200"
}
}
}'
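Attribute filters can be attached to an existing KNN query body programmatically. The following Python sketch copies a query, adds the clauses under `bool.filter`, and sets `filter_type` to `efficient_filter`, mirroring the request above. The function name is hypothetical. Because `brand` and `tag` are keyword fields, the values are written with the same capitalization as the ingested data.

```python
import copy


def add_attribute_filter(knn_query: dict, vector_field: str, clauses: list) -> dict:
    """Return a copy of a KNN query with filter clauses and efficient_filter set."""
    query = copy.deepcopy(knn_query)  # do not modify the caller's query
    query["query"]["knn"][vector_field]["filter"] = {"bool": {"filter": list(clauses)}}
    query.setdefault("ext", {}).setdefault("lvector", {})["filter_type"] = "efficient_filter"
    return query


base_query = {
    "size": 10,
    "_source": True,
    "query": {"knn": {"embedding_field": {"query_text": "memory card", "k": 10}}},
    "ext": {"lvector": {"ef_search": "200"}},
}
filtered_query = add_attribute_filter(
    base_query,
    "embedding_field",
    [
        {"match": {"merit": "good quality"}},
        {"term": {"brand": "Brand E"}},
        {"terms": {"tag": ["SD Card", "Memory Card"]}},
    ],
)
print(filtered_query["ext"]["lvector"])
```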
Hybrid search with attribute filtering
curl -u <username>:<password> -H 'Content-Type: application/json' -XGET "http://ld-xx-proxy-search-vpc.lindorm.aliyuncs.com:30070/<index_name>/_search?pretty" -d '
{
"size": 10,
"_source": true,
"query": {
"knn": {
"embedding_field": {
"query_text": "memory card",
"filter": {
"bool": {
"must": [{
"bool": {
"must": [{
"match": {
"input_field": {
"query": "memory card"
}
}
}]
}
},
{
"bool": {
"filter": [{
"match": {
"merit": "good quality"
}
},
{
"term": {
"brand": "Brand E"
}
},
{
"terms": {
"tag": ["SD Card", "Memory Card"]
}
}]
}
}]
}
},
"k": 10
}
}
},
"ext": {
"lvector": {
"filter_type": "efficient_filter",
"hybrid_search_type": "filter_rrf",
"rrf_rank_constant": "1",
"ef_search": "200"
}
}
}'
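The `rrf_rank_constant` setting suggests that `filter_rrf` merges the full-text and vector result lists with reciprocal rank fusion (RRF). This topic does not describe how the engine computes the fused score internally, so the following Python sketch only illustrates the standard RRF formula, in which each document receives the sum of 1 / (rank_constant + rank) over the lists it appears in. The function name is hypothetical, and the two rankings are made-up inputs, not actual query results.

```python
from collections import defaultdict


def rrf_fuse(rankings: list, rank_constant: int = 1) -> list:
    """Fuse ranked lists of document IDs with standard reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit contributes 1 / (rank_constant + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative rankings only: one from full-text matching, one from vector search.
text_ranking = ["8195", "14103", "13316"]
vector_ranking = ["14103", "13316", "8195"]
for doc_id, score in rrf_fuse([text_ranking, vector_ranking], rank_constant=1):
    print(doc_id, round(score, 4))
```

A smaller rank constant gives top-ranked hits more weight relative to lower-ranked ones.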