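Automatic embedding uses a built-in pre-trained model to convert text into vectors automatically, eliminating the tedious manual definition of vector fields that traditional approaches require. This topic describes how to use the Java High Level REST Client to write and query automatically embedded data in the Lindorm vector engine.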
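Prerequisites
Usage notes
All JSON strings in the sample code in this topic use text blocks, a standard feature in JDK 15 and later. A text block starts and ends with three double quotes (""" """). If your JDK version is earlier, convert the text blocks back to multi-line string concatenation.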
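The conversion mentioned above can be sketched as follows. This is a minimal illustration rather than code from the Lindorm samples: the class and method names are hypothetical, and the JSON is the index settings fragment used later in this topic. Compiling the class itself still requires JDK 15 or later because it contains a text block for comparison.

```java
class TextBlockFallback {

    // JDK 15+ style: a text block, as used throughout this topic.
    static String withTextBlock() {
        return """
            {
              "index": {
                "default_pipeline": "write_embedding_pipeline"
              }
            }
            """;
    }

    // Earlier JDKs: the same JSON, built by concatenating one literal per line.
    static String withConcatenation() {
        return "{\n"
            + "  \"index\": {\n"
            + "    \"default_pipeline\": \"write_embedding_pipeline\"\n"
            + "  }\n"
            + "}\n";
    }

    public static void main(String[] args) {
        // Compares the two forms character by character.
        System.out.println(withTextBlock().equals(withConcatenation()));
    }
}
```

On an older JDK, only the concatenation form is needed; each line of the text block becomes one quoted literal ending in \n, with inner double quotes escaped.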
Preparations
Before you use advanced features, install the Java High Level REST Client and connect to the search engine. For details, see Preparations.
Procedure overview
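Step | Engine | Description |
Deploy an Embedding model in the AI engine | AI engine | Call the AI engine RESTful API with curl to deploy the BGE-M3 Embedding model, which converts text data into vectors. |
Create a write pipeline | Search engine | Create a write pipeline in the search engine so that text data is automatically converted into vector data (Embedding) when it is written. |
Create a query pipeline | Search engine | Create a query pipeline in the search engine so that text data is automatically converted into vector data when it is queried. |
Create an index and specify pipelines | Vector engine, search engine | When you create or modify a vector index, specify the write and query pipelines that automatically convert written and queried data into vector data. |
Write data | Vector engine, search engine | Use the specified write pipeline to automatically convert written text data into vector data. |
Query data | Vector engine, search engine | Use the specified query pipeline to automatically convert query text into vector data. |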
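Deploy an Embedding model in the AI engine
For details on deploying models in the AI engine, see Model management and the examples of using the AI engine RESTful API with curl.
The following example deploys the BGE-M3 model. For parameter details, see Model management.
The request URL in the curl command must be the VPC endpoint of the AI engine.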
curl -i -k --location --header 'x-ld-ak:<username>' --header 'x-ld-sk:<password>' -X POST http://<URL>/v1/ai/models/create -H "Content-Type: application/json" -d '{
"model_name": "bge_m3_model",
"model_path": "huggingface://BAAI/bge-m3",
"task": "FEATURE_EXTRACTION",
"algorithm": "BGE_M3",
"settings": {"instance_count": "2"}
}'
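If you prefer to issue the deployment call from Java instead of curl, the request could be sketched with the JDK's built-in java.net.http API (JDK 11 or later). The path, headers, and body below are copied from the curl command above. The class name, the base URL, and the credentials in main are placeholders introduced for illustration, and the sketch only builds the request without sending it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class DeployModelRequest {

    // Builds (but does not send) the same POST request as the curl command.
    // baseUrl, username, and password are supplied by the caller.
    static HttpRequest build(String baseUrl, String username, String password) {
        String body = """
            {
              "model_name": "bge_m3_model",
              "model_path": "huggingface://BAAI/bge-m3",
              "task": "FEATURE_EXTRACTION",
              "algorithm": "BGE_M3",
              "settings": {"instance_count": "2"}
            }
            """;
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/ai/models/create"))
            .header("x-ld-ak", username)
            .header("x-ld-sk", password)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        // Placeholder values; replace them with the AI engine VPC endpoint and real credentials.
        HttpRequest request = build("http://localhost:9002", "user", "secret");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send the request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) from a host that can reach the AI engine over the VPC.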
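Create pipelines in the search engine
Create two pipelines in the search engine: one that vectorizes data automatically at write time and one that does so at query time.
Create a write pipeline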
String pipelineId = "write_embedding_pipeline";
String pipelineDefinition = """
{
"description": "demo_chunking pipeline",
"processors": [
{
"text-embedding": {
"inputFields": ["text_field"],
"outputFields": ["text_field_embedding"],
"userName": "user", // Username for the AI engine
"password": "test****", // Password for the AI engine
"url": "http://ld-t4n5668xk31ui****-proxy-ai-vpc.lindorm.aliyuncs.com:9002", // VPC endpoint of the AI engine
"modeName": "bge_m3_model"
}
}
]
}
""";
BytesArray source = new BytesArray(pipelineDefinition.getBytes(StandardCharsets.UTF_8));
PutPipelineRequest request = new PutPipelineRequest(pipelineId, source, XContentType.JSON);
AcknowledgedResponse response = client.ingest().putPipeline(request, RequestOptions.DEFAULT);
System.out.println("CreatePipeline Acknowledged: " + response.isAcknowledged());
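Parameters
Parameter | Description |
processors | Defines the pipeline operations applied to written data. |
text-embedding | Fixed key. Required. |
inputFields | The text fields to vectorize. |
outputFields | The vector fields that store the vectorized output. |
userName | The username for the Lindorm AI engine. |
password | The password for the Lindorm AI engine. |
url | The endpoint of the AI engine. It must be the VPC endpoint. |
modeName | The model name. The examples in this topic use bge_m3_model. |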
The inputFields and outputFields specified in the write and query pipelines must match the text_field and text_field_embedding fields specified when you create the vector index.
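One way to catch a mismatch between pipeline fields and index fields before any request is sent is a small local check. The sketch below is a hypothetical helper, not part of the Lindorm client. It assumes you keep the pipeline field names and the mapping property names in your own code, and it reports any pipeline field that the mapping does not define.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PipelineFieldCheck {

    // Returns the pipeline fields that are absent from the index mapping properties.
    static List<String> missingFromMapping(List<String> inputFields,
                                           List<String> outputFields,
                                           Set<String> mappingProperties) {
        List<String> missing = new ArrayList<>();
        for (String field : inputFields) {
            if (!mappingProperties.contains(field)) {
                missing.add(field);
            }
        }
        for (String field : outputFields) {
            if (!mappingProperties.contains(field)) {
                missing.add(field);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Property names taken from the vector index mapping used in this topic.
        Set<String> properties = Set.of("text_field", "text_field_embedding", "tag", "brand", "merit");
        List<String> missing = missingFromMapping(
            List.of("text_field"), List.of("text_field_embedding"), properties);
        System.out.println("Fields missing from mapping: " + missing);
    }
}
```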
Create a query pipeline
String jsonString = """
{
"request_processors": [
{
"text-embedding" : {
"tag" : "auto-query-embedding",
"description" : "Auto query embedding",
"model_config" : {
"inputFields": ["text_field"],
"outputFields": ["text_field_embedding"],
"userName": "user", // Username for the AI engine
"password": "test****", // Password for the AI engine
"url": "http://ld-t4n5668xk31ui****-proxy-ai-vpc.lindorm.aliyuncs.com:9002", // VPC endpoint of the AI engine
"modeName": "bge_m3_model"
}
}
}
]
}
""";
String pipelineName = "knnsearch_pipeline";
Request createPipelineRequest = new Request("PUT", "/_search/pipeline/" + pipelineName);
createPipelineRequest.setJsonEntity(jsonString);
Response response = client.getLowLevelClient().performRequest(createPipelineRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("create knnSearch Pipeline responseBody = " + responseBody);
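Parameters
Parameter | Description |
request_processors | Defines the pipeline operations applied to search requests. |
text-embedding | Fixed key. Required. |
inputFields | The text fields to vectorize. In a query pipeline this serves as a placeholder. |
outputFields | The vector fields that store the vectorized output. |
userName | The username for the Lindorm AI engine. |
password | The password for the Lindorm AI engine. |
url | The endpoint of the AI engine. It must be the VPC endpoint. |
modeName | The model name. The examples in this topic use bge_m3_model. |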
The inputFields and outputFields specified in the write and query pipelines must match the text_field and text_field_embedding fields specified when you create the vector index.
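Create an index and specify pipelines
Specify the required pipelines when you create a vector index or modify the settings of an existing one.
Create a vector index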
String indexName = "search_vector_test";
CreateIndexRequest createIndexRequest = new CreateIndexRequest(indexName);
createIndexRequest.settings(Map.of(
"index", Map.of(
"number_of_shards", 2,
"knn", true,
"default_pipeline", "write_embedding_pipeline",
"search.default_pipeline", "knnsearch_pipeline")));
createIndexRequest.mapping(Map.of(
"_source", Map.of("excludes", new String[] {"text_field_embedding"}),
"properties", Map.of(
"text_field", Map.of(
"type", "text",
"analyzer", "ik_max_word"
),
"text_field_embedding", Map.of(
"type", "knn_vector",
"dimension", 1024,
"data_type", "float",
"method", Map.of(
"engine", "lvector",
"name", "hnsw",
"space_type", "cosinesimil",
"parameters", Map.of(
"m", 24,
"ef_construction", 500
)
)
),
"tag", Map.of(
"type", "keyword"
),
"brand", Map.of(
"type", "keyword"
),
"merit", Map.of(
"type", "text",
"analyzer", "ik_max_word"
)
)
));
CreateIndexResponse createIndexResponse = client.indices().create(createIndexRequest, RequestOptions.DEFAULT);
System.out.println("createIndexResponse: " + createIndexResponse.index());
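The mapping above sets dimension to 1024 and space_type to cosinesimil. As an illustration of the metric itself, the following plain-Java sketch computes cosine similarity between two vectors and rejects vectors whose lengths differ, much as a vector whose length differs from the configured dimension would not fit the field. The class and method names are hypothetical, and how the vector engine derives the _score value from this metric is not covered here.

```java
class CosineSimilarity {

    // Cosine similarity: dot(a, b) / (|a| * |b|).
    // Returns NaN if either vector has zero length in the Euclidean sense.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Tiny 2-dimensional vectors for readability; real embeddings here have 1024 components.
        float[] same = {1f, 0f};
        float[] orthogonal = {0f, 1f};
        System.out.println(cosine(same, same));
        System.out.println(cosine(same, orthogonal));
    }
}
```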
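Modify the settings of an existing vector index
If you have already created a vector index, you can update its settings as follows to specify the pipelines used for writes and queries.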
String indexName = "search_vector_test";
String jsonString = """
{
"index": {
"default_pipeline": "write_embedding_pipeline",
"search.default_pipeline": "knnsearch_pipeline"
}
}
""";
UpdateSettingsRequest updateSettingsRequest = new UpdateSettingsRequest(indexName).settings(jsonString, XContentType.JSON);
AcknowledgedResponse response = client.indices().putSettings(updateSettingsRequest, RequestOptions.DEFAULT);
System.out.println("updateIndexSettings Acknowledged: " + response.isAcknowledged());
Write data
Because a write pipeline is specified, each write stores the text field text_field and also encodes text_field into a vector according to that pipeline, storing the result as text_field_embedding.
BulkRequest bulkRequest = new BulkRequest();
// Adding multiple IndexRequest to BulkRequest
bulkRequest.add(new IndexRequest("search_vector_test").id("3982")
.source(XContentType.JSON,
"text_field", "品牌A 时尚节能无线鼠标(草绿)(眩光.悦动.时尚炫舞鼠标 12个月免换电池 高精度光学寻迹引擎 超细微接收器10米传输距离)",
"tag", new String[] {"鼠标", "电子产品"},
"brand", "品牌A",
"merit", "好用、外观漂亮"));
bulkRequest.add(new IndexRequest("search_vector_test").id("323519")
.source(XContentType.JSON,
"text_field", "品牌B 光学鼠标(经典黑)(智能自动对码/1000DPI高精度光学引擎)",
"tag", new String[] {"鼠标", "电子产品"},
"brand", "品牌B",
"merit", "质量好、到货速度快、外观漂亮、好用"));
bulkRequest.add(new IndexRequest("search_vector_test").id("300265")
.source(XContentType.JSON,
"text_field", "品牌C 耳塞式耳机 白色(经典时尚)",
"tag", new String[] {"耳机", "电子产品"},
"brand", "品牌C",
"merit", "外观漂亮、质量好"));
bulkRequest.add(new IndexRequest("search_vector_test").id("6797")
.source(XContentType.JSON,
"text_field", "品牌D 两刀头充电式电动剃须刀",
"tag", new String[] {"家用电器", "电动剃须刀"},
"brand", "品牌D",
"merit", "好用、外观漂亮"));
bulkRequest.add(new IndexRequest("search_vector_test").id("8195")
.source(XContentType.JSON,
"text_field", "品牌E Class4 32G TF卡(micro SD)手机存储卡",
"tag", new String[] {"存储设备", "存储卡", "SD卡"},
"brand", "品牌E",
"merit", "容量挺大的、速度快、好用、质量好"));
bulkRequest.add(new IndexRequest("search_vector_test").id("13316")
.source(XContentType.JSON,
"text_field", "品牌E 101 G2 32GB 优盘",
"tag", new String[] {"存储设备", "U盘", "优盘"},
"brand", "品牌E",
"merit", "好用、容量挺大的、速度快"));
bulkRequest.add(new IndexRequest("search_vector_test").id("14103")
.source(XContentType.JSON,
"text_field", "品牌B 64GB至尊高速移动存储卡 UHS-1制式 读写速度最高可达30MB",
"tag", new String[] {"存储设备", "存储卡", "SD卡"},
"brand", "品牌B",
"merit", "容量挺大的、速度快、好用"));
bulkRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
if (bulkResponse.hasFailures()) {
// Handle any failed items
System.out.println("Bulk operation had failures:");
System.out.println(bulkResponse.buildFailureMessage());
} else {
System.out.println("Bulk operation completed successfully.");
}
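Conceptually, the write pipeline behaves like a function that receives a document, reads the input field, and adds the output field before the document is stored. The following toy model shows only that data flow. It is not how Lindorm implements the processor: the class name is hypothetical, and fakeEmbedding is a stand-in for the call to the deployed model that returns meaningless numbers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class WritePipelineSketch {

    // Copies the document and adds outputField = embedder(inputField) when the input is text.
    static Map<String, Object> apply(Map<String, Object> doc,
                                     String inputField,
                                     String outputField,
                                     Function<String, float[]> embedder) {
        Map<String, Object> enriched = new HashMap<>(doc);
        Object text = doc.get(inputField);
        if (text instanceof String) {
            enriched.put(outputField, embedder.apply((String) text));
        }
        return enriched;
    }

    // Hypothetical stand-in for the model call; NOT a real embedding.
    static float[] fakeEmbedding(String text) {
        return new float[] {text.length(), text.isEmpty() ? 0f : text.charAt(0)};
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of(
            "text_field", "品牌C 耳塞式耳机 白色(经典时尚)",
            "brand", "品牌C");
        Map<String, Object> stored = apply(
            doc, "text_field", "text_field_embedding", WritePipelineSketch::fakeEmbedding);
        // The stored document carries the original fields plus the generated vector field.
        System.out.println(stored.keySet());
    }
}
```

This is why the bulk request above only supplies text_field: the vector field is produced on the server side by the pipeline named in default_pipeline.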
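Query data
// Pure vector search: the query pipeline converts query_text into a vector before the k-NN search runs.
SearchRequest searchRequest = new SearchRequest();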
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
Map<String, Object> queryBody = Map.of(
"knn", Map.of(
"text_field_embedding", Map.of(
"query_text", "存储卡",
"k", 10
)
)
);
searchSourceBuilder.size(10);
searchSourceBuilder.query(QueryBuilders.wrapperQuery(new Gson().toJson(queryBody)));
Map<String, String> ext = Map.of("ef_search", "200");
searchSourceBuilder.ext(Collections.singletonList(new LVectorExtBuilder("lvector", ext)));
searchRequest.source(searchSourceBuilder);
searchRequest.indices("search_vector_test");
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
System.out.println(searchResponse);
Response
search responseBody = {
"took": 46,
"timed_out": false,
"terminated_early": false,
"num_reduce_phases": 0,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 7,
"relation": "eq"
},
"max_score": 0.7433592,
"hits": [
{
"_index": "search_vector_test",
"_id": "8195",
"_score": 0.7433592,
"_source": {
"text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
"merit": "容量挺大的、速度快、好用、质量好",
"tag": [
"存储设备",
"存储卡",
"SD卡"
],
"brand": "品牌E"
}
},
{
"_index": "search_vector_test",
"_id": "14103",
"_score": 0.7116537,
"_source": {
"text_field": "品牌B 64GB至尊高速移动存储卡 UHS-1制式 读写速度最高可达30MB",
"merit": "容量挺大的、速度快、好用",
"tag": [
"存储设备",
"存储卡",
"SD卡"
],
"brand": "品牌B"
}
},
{
"_index": "search_vector_test",
"_id": "13316",
"_score": 0.6831677,
"_source": {
"text_field": "品牌E 101 G2 32GB 优盘",
"merit": "好用、容量挺大的、速度快",
"tag": [
"存储设备",
"U盘",
"优盘"
],
"brand": "品牌E"
}
},
{
"_index": "search_vector_test",
"_id": "3982",
"_score": 0.64234203,
"_source": {
"text_field": "品牌A 时尚节能无线鼠标(草绿)(眩光.悦动.时尚炫舞鼠标 12个月免换电池 高精度光学寻迹引擎 超细微接收器10米传输距离)",
"merit": "好用、外观漂亮",
"tag": [
"鼠标",
"电子产品"
],
"brand": "品牌A"
}
},
{
"_index": "search_vector_test",
"_id": "6797",
"_score": 0.6357207,
"_source": {
"text_field": "品牌D 两刀头充电式电动剃须刀",
"merit": "好用、外观漂亮",
"tag": [
"家用电器",
"电动剃须刀"
],
"brand": "品牌D"
}
},
{
"_index": "search_vector_test",
"_id": "323519",
"_score": 0.62445086,
"_source": {
"text_field": "品牌B 光学鼠标(经典黑)(智能自动对码/1000DPI高精度光学引擎)",
"merit": "质量好、到货速度快、外观漂亮、好用",
"tag": [
"鼠标",
"电子产品"
],
"brand": "品牌B"
}
},
{
"_index": "search_vector_test",
"_id": "300265",
"_score": 0.62144196,
"_source": {
"text_field": "品牌C 耳塞式耳机 白色(经典时尚)",
"merit": "外观漂亮、质量好",
"tag": [
"耳机",
"电子产品"
],
"brand": "品牌C"
}
}
]
}
}
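// Vector search with attribute filtering: the k-NN query carries a bool filter on merit, brand, and tag.
SearchRequest searchRequest = new SearchRequest();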
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
Map<String, Object> queryBody = Map.of(
"knn", Map.of(
"text_field_embedding", Map.of(
"query_text", "存储卡",
"k", 10,
"filter", Map.of(
"bool", Map.of(
"filter", List.of(
Map.of("match", Map.of("merit", "质量好")),
Map.of("term", Map.of("brand", "品牌E")),
Map.of("terms", Map.of("tag", List.of("SD卡", "存储卡")))
)
)
)
)
)
);
searchSourceBuilder.size(10);
searchSourceBuilder.query(QueryBuilders.wrapperQuery(new Gson().toJson(queryBody)));
Map<String, String> ext = Map.of("filter_type", "efficient_filter", "ef_search", "200");
searchSourceBuilder.ext(Collections.singletonList(new LVectorExtBuilder("lvector", ext)));
searchRequest.source(searchSourceBuilder);
searchRequest.indices("search_vector_test");
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
System.out.println(searchResponse);
Response
search responseBody = {
"took": 73,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.7433592,
"hits": [
{
"_index": "search_vector_test",
"_id": "8195",
"_score": 0.7433592,
"_source": {
"text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
"merit": "容量挺大的、速度快、好用、质量好",
"tag": [
"存储设备",
"存储卡",
"SD卡"
],
"brand": "品牌E"
}
}
]
}
}
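To see why the filtered query returns only document 8195, the three filter clauses can be replayed locally against the sample data written earlier. The sketch below is a simplification under stated assumptions: the class is hypothetical, the match clause on merit is approximated by a substring test (the real query tokenizes text with the ik_max_word analyzer), and no vector ranking is involved.

```java
import java.util.List;
import java.util.stream.Collectors;

class FilterSketch {

    static final class Product {
        final String id;
        final String brand;
        final List<String> tags;
        final String merit;

        Product(String id, String brand, List<String> tags, String merit) {
            this.id = id;
            this.brand = brand;
            this.tags = tags;
            this.merit = merit;
        }
    }

    // The seven sample documents (id, brand, tag, merit) from the bulk write.
    static List<Product> sample() {
        return List.of(
            new Product("3982", "品牌A", List.of("鼠标", "电子产品"), "好用、外观漂亮"),
            new Product("323519", "品牌B", List.of("鼠标", "电子产品"), "质量好、到货速度快、外观漂亮、好用"),
            new Product("300265", "品牌C", List.of("耳机", "电子产品"), "外观漂亮、质量好"),
            new Product("6797", "品牌D", List.of("家用电器", "电动剃须刀"), "好用、外观漂亮"),
            new Product("8195", "品牌E", List.of("存储设备", "存储卡", "SD卡"), "容量挺大的、速度快、好用、质量好"),
            new Product("13316", "品牌E", List.of("存储设备", "U盘", "优盘"), "好用、容量挺大的、速度快"),
            new Product("14103", "品牌B", List.of("存储设备", "存储卡", "SD卡"), "容量挺大的、速度快、好用"));
    }

    // Applies the three clauses of the bool filter; all of them must hold.
    static List<String> matchingIds(List<Product> products) {
        List<String> wantedTags = List.of("SD卡", "存储卡");
        return products.stream()
            .filter(p -> p.merit.contains("质量好"))                      // approximates match on merit
            .filter(p -> p.brand.equals("品牌E"))                         // term on brand
            .filter(p -> p.tags.stream().anyMatch(wantedTags::contains))  // terms on tag
            .map(p -> p.id)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(matchingIds(sample()));
    }
}
```

Only document 8195 satisfies all three clauses, which agrees with the single hit in the response above.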
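// Full-text plus vector hybrid search: adds a match on text_field to the filter and sets hybrid_search_type to filter_rrf.
SearchRequest searchRequest = new SearchRequest();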
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
Map<String, Object> queryBody = Map.of(
"knn", Map.of(
"text_field_embedding", Map.of(
"query_text", "存储卡",
"filter", Map.of(
"bool", Map.of(
"must", List.of(
Map.of(
"bool", Map.of(
"must", List.of(
Map.of("match", Map.of("text_field", Map.of("query", "存储卡")))
)
)
),
Map.of(
"bool", Map.of(
"filter", List.of(
Map.of("match", Map.of("merit", "质量好")),
Map.of("term", Map.of("brand", "品牌E")),
Map.of("terms", Map.of("tag", List.of("SD卡", "存储卡")))
)
)
)
)
)
),
"k", 10
)
)
);
searchSourceBuilder.size(10);
searchSourceBuilder.query(QueryBuilders.wrapperQuery(new Gson().toJson(queryBody)));
Map<String, String> ext = Map.of(
"filter_type", "efficient_filter",
"hybrid_search_type", "filter_rrf",
"rrf_rank_constant", "1",
"ef_search", "200");
searchSourceBuilder.ext(Collections.singletonList(new LVectorExtBuilder("lvector", ext)));
searchRequest.source(searchSourceBuilder);
searchRequest.indices("search_vector_test");
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
System.out.println(searchResponse);
Response
search responseBody = {
"took": 95,
"timed_out": false,
"terminated_early": false,
"num_reduce_phases": 0,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "search_vector_test",
"_id": "8195",
"_score": 1.0,
"_source": {
"text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
"merit": "容量挺大的、速度快、好用、质量好",
"tag": [
"存储设备",
"存储卡",
"SD卡"
],
"brand": "品牌E"
}
}
]
}
}
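The hybrid query above sets hybrid_search_type to filter_rrf and rrf_rank_constant to 1. Assuming RRF here refers to reciprocal rank fusion in its textbook form, each document's fused score is the sum, over the ranked lists, of 1 / (rank constant + rank), with ranks starting at 1. The sketch below implements that textbook formula only. Whether Lindorm computes its scores in exactly this way is an assumption, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RrfSketch {

    // Textbook reciprocal rank fusion:
    // score(d) = sum over lists of 1 / (rankConstant + rank(d)), with rank starting at 1.
    static Map<String, Double> fuse(List<List<String>> rankedLists, int rankConstant) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                int rank = i + 1;
                scores.merge(ranked.get(i), 1.0 / (rankConstant + rank), Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // One full-text ranking and one vector ranking, each containing only document 8195.
        List<String> fullText = List.of("8195");
        List<String> vector = List.of("8195");
        System.out.println(fuse(List.of(fullText, vector), 1));
    }
}
```

Under this formula, a document ranked first in both lists with a rank constant of 1 scores 1/(1+1) + 1/(1+1) = 1.0. That equals the _score in the response above, although the match alone does not confirm the engine's internal formula.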