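Automatic embedding uses a built-in pre-trained model to convert text into vectors automatically, which eliminates the tedious process of manually defining vector fields that traditional solutions require. This topic describes how to use the Java Low Level REST Client to write and query automatically embedded data in the Lindorm vector engine.
Prerequisites
Usage notes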
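All JSON strings in the sample code in this topic are written as text blocks, a standard feature in JDK 15 and later. A text block starts and ends with three consecutive double-quote characters ("""). If your JDK version is earlier than 15, rewrite each text block as a concatenated multi-line string.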
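For JDK versions earlier than 15, the index-settings body used later in this topic could be written with plain string concatenation, roughly as in the following minimal sketch. The class name PipelineSettingsJson and the method name concatenated are illustrative and are not part of the original examples.

```java
// Sketch: the index-settings JSON body written without a text block,
// for JDK versions that do not support the """ syntax.
class PipelineSettingsJson {

    // Builds the body with explicit "\n" line breaks and escaped quotes.
    // Assumes the pipeline names contain no characters that need JSON escaping.
    static String concatenated(String writePipeline, String searchPipeline) {
        return "{\n"
            + "  \"index\": {\n"
            + "    \"default_pipeline\": \"" + writePipeline + "\",\n"
            + "    \"search.default_pipeline\": \"" + searchPipeline + "\"\n"
            + "  }\n"
            + "}\n";
    }

    public static void main(String[] args) {
        System.out.println(concatenated("write_embedding_pipeline", "knnsearch_pipeline"));
    }
}
```

The result can be passed to setJsonEntity in the same way as a text block.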
Preparations
Before you use the advanced features, install the Java Low Level REST Client and connect to the search engine. For details, see Preparations.
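Procedure overview
Step | Engine | Description |
Deploy an embedding model in the AI engine | AI engine | Call the AI engine RESTful API with curl to deploy the BGE-M3 embedding model, which converts text into vectors. |
Create a write pipeline | Search engine | Create a write pipeline in the search engine so that text is automatically converted into vectors (embeddings) when data is written. |
Create a query pipeline | Search engine | Create a query pipeline in the search engine so that text is automatically converted into vectors when data is queried. |
Create an index and specify the pipelines | Vector engine, search engine | When you create or modify a vector index, specify the write and query pipelines that automatically convert written and queried data into vectors. |
Write data | Vector engine, search engine | The specified write pipeline automatically converts the written text into vectors. |
Query data | Vector engine, search engine | The specified query pipeline automatically converts the query text into vectors. |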
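Deploy an embedding model in the AI engine
For details about how to deploy a model in the AI engine, see Model management and Examples of using the AI engine RESTful API with curl.
The following example deploys the BGE-M3 model. For parameter details, see Model management.
For the URL in the curl request, use the VPC endpoint of the AI engine.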
curl -i -k --location --header 'x-ld-ak:<username>' --header 'x-ld-sk:<password>' -X POST http://<URL>/v1/ai/models/create -H "Content-Type: application/json" -d '{
"model_name": "bge_m3_model",
"model_path": "huggingface://BAAI/bge-m3",
"task": "FEATURE_EXTRACTION",
"algorithm": "BGE_M3",
"settings": {"instance_count": "2"}
}'
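Because the rest of this topic uses Java, the same request could also be assembled with the JDK's built-in java.net.http API (Java 11 and later). The following is only a sketch under that assumption: the class name DeployModelRequest and the method name build are hypothetical, the request is built but not sent, and the path, headers, and body are copied from the curl command above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch: builds (but does not send) the model-deployment request shown in the curl example.
class DeployModelRequest {

    // baseUrl is the VPC endpoint of the AI engine, for example "http://<URL>".
    static HttpRequest build(String baseUrl, String username, String password) {
        String body = "{"
            + "\"model_name\": \"bge_m3_model\","
            + "\"model_path\": \"huggingface://BAAI/bge-m3\","
            + "\"task\": \"FEATURE_EXTRACTION\","
            + "\"algorithm\": \"BGE_M3\","
            + "\"settings\": {\"instance_count\": \"2\"}"
            + "}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/ai/models/create"))
            .header("x-ld-ak", username)   // same header as in the curl command
            .header("x-ld-sk", password)   // same header as in the curl command
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        // Placeholder endpoint and credentials, used only to show the built request.
        HttpRequest request = build("http://localhost:9002", "user", "password");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To send the request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) from a host that can reach the VPC endpoint.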
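Create pipelines in the search engine
Create two pipelines in the search engine: one that automatically vectorizes data when it is written, and one that automatically vectorizes query text.
Create a write pipeline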
String jsonString = """
{
"description": "demo_chunking pipeline",
"processors": [
{
"text-embedding": {
"inputFields": ["text_field"],
"outputFields": ["text_field_embedding"],
"userName": "user", // Username of the AI engine
"password": "test****", // Password of the AI engine
"url": "http://ld-t4n5668xk31ui****-proxy-ai-vpc.lindorm.aliyuncs.com:9002", // VPC endpoint of the AI engine
"modeName": "bge_m3_model"
}
}
]
}
""";
String pipelineName = "write_embedding_pipeline";
Request createPipelineRequest = new Request("PUT", "/_ingest/pipeline/" + pipelineName);
createPipelineRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(createPipelineRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("createPipeline responseBody = " + responseBody);
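Parameters
Parameter | Description |
processors | Defines the pipeline processors applied to written data. |
text-embedding | A fixed key. Required. |
inputFields | The text fields to vectorize. |
outputFields | The vector fields that store the vectorized results. |
userName | The username of the Lindorm AI engine. |
password | The password of the Lindorm AI engine. |
url | The endpoint of the AI engine. You must use the VPC endpoint. |
modeName | The model name, which is bge_m3_model in the examples in this topic. |
The inputFields and outputFields values specified in the write and query pipelines must match the text_field and text_field_embedding fields defined when you create the vector index.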
Create a query pipeline
String jsonString = """
{
"request_processors": [
{
"text-embedding" : {
"tag" : "auto-query-embedding",
"description" : "Auto query embedding",
"model_config" : {
"inputFields": ["text_field"],
"outputFields": ["text_field_embedding"],
"userName": "user", // Username of the AI engine
"password": "test****", // Password of the AI engine
"url": "http://ld-t4n5668xk31ui****-proxy-ai-vpc.lindorm.aliyuncs.com:9002", // VPC endpoint of the AI engine
"modeName": "bge_m3_model"
}
}
}
]
}
""";
String pipelineName = "knnsearch_pipeline";
Request createPipelineRequest = new Request("PUT", "/_search/pipeline/" + pipelineName);
createPipelineRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(createPipelineRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("create knnSearch Pipeline responseBody = " + responseBody);
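Parameters
Parameter | Description |
request_processors | Defines the pipeline processors applied to search requests. |
text-embedding | A fixed key. Required. |
inputFields | The text fields to vectorize. In a query pipeline, this value serves as a placeholder. |
outputFields | The vector fields that store the vectorized results. |
userName | The username of the Lindorm AI engine. |
password | The password of the Lindorm AI engine. |
url | The endpoint of the AI engine. You must use the VPC endpoint. |
modeName | The model name, which is bge_m3_model in the examples in this topic. |
The inputFields and outputFields values specified in the write and query pipelines must match the text_field and text_field_embedding fields defined when you create the vector index.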
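One way to keep inputFields and outputFields consistent across both pipelines is to generate the two request bodies from shared constants. The following sketch assumes this approach; the class EmbeddingPipelineBodies and its methods are hypothetical helpers, and the JSON keys are copied from the examples above. It does no JSON escaping, so it assumes the arguments contain no quotation marks or backslashes.

```java
// Sketch: builds the write and query pipeline bodies from one shared set of field names.
class EmbeddingPipelineBodies {

    // Field names shared by both pipelines and by the index mapping.
    static final String TEXT_FIELD = "text_field";
    static final String VECTOR_FIELD = "text_field_embedding";

    // Members common to the write processor and to the query processor's model_config.
    static String modelConfig(String userName, String password, String url, String modelName) {
        return "\"inputFields\": [\"" + TEXT_FIELD + "\"], "
            + "\"outputFields\": [\"" + VECTOR_FIELD + "\"], "
            + "\"userName\": \"" + userName + "\", "
            + "\"password\": \"" + password + "\", "
            + "\"url\": \"" + url + "\", "
            + "\"modeName\": \"" + modelName + "\"";
    }

    // Body for PUT /_ingest/pipeline/<name>.
    static String writePipeline(String userName, String password, String url, String modelName) {
        return "{\"description\": \"demo_chunking pipeline\", \"processors\": [{\"text-embedding\": {"
            + modelConfig(userName, password, url, modelName)
            + "}}]}";
    }

    // Body for PUT /_search/pipeline/<name>.
    static String queryPipeline(String userName, String password, String url, String modelName) {
        return "{\"request_processors\": [{\"text-embedding\": {"
            + "\"tag\": \"auto-query-embedding\", \"description\": \"Auto query embedding\", "
            + "\"model_config\": {" + modelConfig(userName, password, url, modelName) + "}}}]}";
    }

    public static void main(String[] args) {
        // Placeholder credentials and endpoint for illustration only.
        System.out.println(writePipeline("user", "secret", "http://localhost:9002", "bge_m3_model"));
        System.out.println(queryPipeline("user", "secret", "http://localhost:9002", "bge_m3_model"));
    }
}
```

Each returned string can be passed to setJsonEntity on the corresponding Request.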
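Create an index and specify the pipelines
Specify the required pipelines when you create a vector index or modify the settings of an existing vector index.
Create a vector index
// Create the index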
String indexName = "search_vector_test";
Request indexRequest = new Request("PUT", "/" + indexName);
String jsonString = """
{
"settings" : {
"index": {
"number_of_shards": 2,
"knn": true,
"default_pipeline": "write_embedding_pipeline",
"search.default_pipeline": "knnsearch_pipeline"
}
},
"mappings": {
"_source": {
"excludes": ["text_field_embedding"]
},
"properties": {
"text_field": {
"type": "text",
"analyzer": "ik_max_word"
},
"text_field_embedding": {
"type": "knn_vector",
"dimension": 1024,
"data_type": "float",
"method": {
"engine": "lvector",
"name": "hnsw",
"space_type": "cosinesimil",
"parameters": {
"m": 24,
"ef_construction": 500
}
}
},
"tag": {
"type": "keyword"
},
"brand": {
"type": "keyword"
},
"merit" : {
"type": "text",
"analyzer": "ik_max_word"
}
}
}
}
""";
indexRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(indexRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("createIndex responseBody = " + responseBody);
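Modify the settings of an existing vector index
If you have already created a vector index, you can modify its settings as follows to specify the pipelines used for writes and queries.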
String pipelineName = "write_embedding_pipeline";
String knnPipelineName = "knnsearch_pipeline";
String indexName = "vector_test6";
Request linkPipelineRequest = new Request("PUT", "/" + indexName + "/_settings");
String jsonString = """
{
"index": {
"default_pipeline": "%s",
"search.default_pipeline": "%s"
}
}
""".formatted(pipelineName, knnPipelineName);
linkPipelineRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(linkPipelineRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("linkPipeline responseBody = " + responseBody);
Write data
Because a write pipeline is specified for the index, each write stores the text field text_field and also uses the pipeline to encode text_field into a vector, which is written as text_field_embedding.
Request bulkRequest = new Request("POST", "/_bulk");
String jsonString = """
{ "index" : { "_index" : "search_vector_test", "_id" : "3982" } }
{ "text_field" : "品牌A 时尚节能无线鼠标(草绿)(眩光.悦动.时尚炫舞鼠标 12个月免换电池 高精度光学寻迹引擎 超细微接收器10米传输距离)", "tag": ["鼠标", "电子产品"], "brand":"品牌A", "merit":"好用、外观漂亮"}
{ "index" : { "_index" : "search_vector_test", "_id" : "323519" } }
{ "text_field" : "品牌B 光学鼠标(经典黑)(智能自动对码/1000DPI高精度光学引擎)", "tag": ["鼠标", "电子产品"], "brand":"品牌B", "merit":"质量好、到货速度快、外观漂亮、好用"}
{ "index" : { "_index" : "search_vector_test", "_id" : "300265" } }
{ "text_field" : "品牌C 耳塞式耳机 白色(经典时尚)", "tag": ["耳机", "电子产品"], "brand":"品牌C", "merit":"外观漂亮、质量好"}
{ "index" : { "_index" : "search_vector_test", "_id" : "6797" } }
{ "text_field" : "品牌D 两刀头充电式电动剃须刀", "tag": ["家用电器", "电动剃须刀"], "brand":"品牌D", "merit":"好用、外观漂亮"}
{ "index" : { "_index" : "search_vector_test", "_id" : "8195" } }
{ "text_field" : "品牌E Class4 32G TF卡(micro SD)手机存储卡", "tag": ["存储设备", "存储卡", "SD卡"], "brand":"品牌E", "merit":"容量挺大的、速度快、好用、质量好"}
{ "index" : { "_index" : "search_vector_test", "_id" : "13316" } }
{ "text_field" : "品牌E 101 G2 32GB 优盘", "tag": ["存储设备","U盘", "优盘"], "brand":"品牌E", "merit":"好用、容量挺大的、速度快"}
{ "index" : { "_index" : "search_vector_test", "_id" : "14103" } }
{ "text_field" : "品牌B 64GB至尊高速移动存储卡 UHS-1制式 读写速度最高可达30MB", "tag": ["存储设备", "存储卡", "SD卡"], "brand":"品牌B", "merit":"容量挺大的、速度快、好用"}
""";
bulkRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(bulkRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("bulkWriteDoc responseBody = " + responseBody);
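When the documents come from application data rather than a literal string, the bulk body can be assembled programmatically. The following sketch assumes the newline-delimited format shown above (one action line followed by one document line, with the body ending in a newline). BulkBodyBuilder and its methods are hypothetical helpers, and for brevity only text_field is written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: builds a newline-delimited _bulk body from document IDs and text values.
class BulkBodyBuilder {

    // Minimal JSON string escaping for quotes, backslashes, and control characters.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Emits one action line and one document line per entry; every line ends with "\n".
    static String build(String index, Map<String, String> textById) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : textById.entrySet()) {
            sb.append("{ \"index\" : { \"_index\" : \"").append(escape(index))
              .append("\", \"_id\" : \"").append(escape(e.getKey())).append("\" } }\n");
            sb.append("{ \"text_field\" : \"").append(escape(e.getValue())).append("\" }\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("6797", "品牌D 两刀头充电式电动剃须刀");
        docs.put("13316", "品牌E 101 G2 32GB 优盘");
        System.out.print(build("search_vector_test", docs));
    }
}
```

The write pipeline still generates text_field_embedding on the server side, so the body carries only the text.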
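Query data
The following example runs a k-NN search on text_field_embedding. Because a query pipeline is specified for the index, the query passes plain text in query_text, and the pipeline converts it into a vector before the search.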
Request searchRequest = new Request("POST", "/" + indexName + "/_search?pretty");
String jsonString = """
{
"size": 10,
"_source": true,
"query": {
"knn": {
"text_field_embedding": {
"query_text": "存储卡",
"k": 10
}
}
},
"ext": {
"lvector": {
"ef_search": "200"
}
}
}
""";
searchRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(searchRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("search responseBody = " + responseBody);
Response
search responseBody = {
"took": 46,
"timed_out": false,
"terminated_early": false,
"num_reduce_phases": 0,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 7,
"relation": "eq"
},
"max_score": 0.7433592,
"hits": [
{
"_index": "search_vector_test",
"_id": "8195",
"_score": 0.7433592,
"_source": {
"text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
"merit": "容量挺大的、速度快、好用、质量好",
"tag": [
"存储设备",
"存储卡",
"SD卡"
],
"brand": "品牌E"
}
},
{
"_index": "search_vector_test",
"_id": "14103",
"_score": 0.7116537,
"_source": {
"text_field": "品牌B 64GB至尊高速移动存储卡 UHS-1制式 读写速度最高可达30MB",
"merit": "容量挺大的、速度快、好用",
"tag": [
"存储设备",
"存储卡",
"SD卡"
],
"brand": "品牌B"
}
},
{
"_index": "search_vector_test",
"_id": "13316",
"_score": 0.6831677,
"_source": {
"text_field": "品牌E 101 G2 32GB 优盘",
"merit": "好用、容量挺大的、速度快",
"tag": [
"存储设备",
"U盘",
"优盘"
],
"brand": "品牌E"
}
},
{
"_index": "search_vector_test",
"_id": "3982",
"_score": 0.64234203,
"_source": {
"text_field": "品牌A 时尚节能无线鼠标(草绿)(眩光.悦动.时尚炫舞鼠标 12个月免换电池 高精度光学寻迹引擎 超细微接收器10米传输距离)",
"merit": "好用、外观漂亮",
"tag": [
"鼠标",
"电子产品"
],
"brand": "品牌A"
}
},
{
"_index": "search_vector_test",
"_id": "6797",
"_score": 0.6357207,
"_source": {
"text_field": "品牌D 两刀头充电式电动剃须刀",
"merit": "好用、外观漂亮",
"tag": [
"家用电器",
"电动剃须刀"
],
"brand": "品牌D"
}
},
{
"_index": "search_vector_test",
"_id": "323519",
"_score": 0.62445086,
"_source": {
"text_field": "品牌B 光学鼠标(经典黑)(智能自动对码/1000DPI高精度光学引擎)",
"merit": "质量好、到货速度快、外观漂亮、好用",
"tag": [
"鼠标",
"电子产品"
],
"brand": "品牌B"
}
},
{
"_index": "search_vector_test",
"_id": "300265",
"_score": 0.62144196,
"_source": {
"text_field": "品牌C 耳塞式耳机 白色(经典时尚)",
"merit": "外观漂亮、质量好",
"tag": [
"耳机",
"电子产品"
],
"brand": "品牌C"
}
}
]
}
}
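The following example adds a filter to the k-NN query so that only documents that match the merit, brand, and tag conditions are returned.
Request searchRequest = new Request("POST", "/" + indexName + "/_search?pretty");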
String jsonString = """
{
"size": 10,
"_source": true,
"query": {
"knn": {
"text_field_embedding": {
"query_text": "存储卡",
"k": 10,
"filter": {
"bool": {
"filter": [{
"match": {
"merit": "质量好"
}
},
{
"term": {
"brand": "品牌E"
}
},
{
"terms": {
"tag": ["SD卡", "存储卡"]
}
}]
}
}
}
}
},
"ext": {
"lvector": {
"filter_type": "efficient_filter",
"ef_search": "200"
}
}
}
""";
searchRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(searchRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("search responseBody = " + responseBody);
Response
search responseBody = {
"took": 73,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.7433592,
"hits": [
{
"_index": "search_vector_test",
"_id": "8195",
"_score": 0.7433592,
"_source": {
"text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
"merit": "容量挺大的、速度快、好用、质量好",
"tag": [
"存储设备",
"存储卡",
"SD卡"
],
"brand": "品牌E"
}
}
]
}
}
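The following example combines a full-text match on text_field with the k-NN query and sets hybrid_search_type to filter_rrf.
Request searchRequest = new Request("POST", "/" + indexName + "/_search?pretty");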
String jsonString = """
{
"size": 10,
"_source": true,
"query": {
"knn": {
"text_field_embedding": {
"query_text": "存储卡",
"filter": {
"bool": {
"must": [{
"bool": {
"must": [{
"match": {
"text_field": {
"query": "存储卡"
}
}
}]
}
},
{
"bool": {
"filter": [{
"match": {
"merit": "质量好"
}
},
{
"term": {
"brand": "品牌E"
}
},
{
"terms": {
"tag": ["SD卡", "存储卡"]
}
}]
}
}]
}
},
"k": 10
}
}
},
"ext": {
"lvector": {
"filter_type": "efficient_filter",
"hybrid_search_type": "filter_rrf",
"rrf_rank_constant": "1",
"ef_search": "200"
}
}
}
""";
searchRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(searchRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("search responseBody = " + responseBody);
Response
search responseBody = {
"took": 95,
"timed_out": false,
"terminated_early": false,
"num_reduce_phases": 0,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "search_vector_test",
"_id": "8195",
"_score": 1.0,
"_source": {
"text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
"merit": "容量挺大的、速度快、好用、质量好",
"tag": [
"存储设备",
"存储卡",
"SD卡"
],
"brand": "品牌E"
}
}
]
}
}
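The names filter_rrf and rrf_rank_constant suggest reciprocal rank fusion (RRF). This topic does not describe how Lindorm computes the fused score, so the following sketch shows only the commonly used RRF formulation as an assumption: each result list contributes 1 / (rankConstant + rank) to a document's score, with ranks starting at 1. The class RrfSketch and the sample rankings are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of standard reciprocal rank fusion; not confirmed to be Lindorm's implementation.
class RrfSketch {

    // rankings: one ordered list of document IDs per retrieval method (best match first).
    static Map<String, Double> fuse(List<List<String>> rankings, int rankConstant) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                int rank = i + 1; // ranks are 1-based
                scores.merge(ranking.get(i), 1.0 / (rankConstant + rank), Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical rankings from a vector search and a full-text search.
        List<String> vectorRanking = List.of("8195", "14103", "13316");
        List<String> textRanking = List.of("8195", "14103");
        System.out.println(fuse(List.of(vectorRanking, textRanking), 1));
    }
}
```

Under this formulation, with a rank constant of 1, a document ranked first in both lists scores 1/2 + 1/2 = 1.0. That is the same value as the _score in the sample response above, although the source does not confirm that Lindorm computes it this way.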