Write and query data with automatic embedding in Java (Lindorm)

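Automatic embedding uses a built-in pretrained model to convert text into vectors automatically, which removes the tedious manual handling of vector fields required by traditional approaches. This topic describes how to write and query automatically embedded data in the Lindorm vector engine from Java by using the Java High Level REST Client.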

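Prerequisites

A Java environment is installed. JDK 1.8 or later is required.
The vector engine is activated. For more information, see Activate the vector engine.
The search engine is activated, and its version is 3.9.10 or later. For more information, see the activation guide. To view or upgrade the current version, see the search engine release notes and Upgrade the minor version.
Important
If your search engine is earlier than 3.9.10 but the console shows that it is already the latest version, contact Lindorm technical support (DingTalk ID: s0s3eg3).
The AI engine is activated. For more information, see the activation guide.
Note
The AI engine depends on the wide table engine, so you must activate the wide table engine together with the AI engine.
The client IP address is added to the Lindorm whitelist. For more information, see Configure whitelists.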

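Usage notes

All JSON strings in the sample code in this topic are written as text blocks, a standard language feature in JDK 15 and later. A text block starts and ends with three double-quote characters ("""). If your JDK version is earlier, rewrite the text blocks as concatenated multi-line strings. The samples also call Map.of and List.of, which require JDK 9 or later.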
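For JDK versions earlier than 15, a text block can be rewritten as ordinary string concatenation. The sketch below shows one way to do that for a shortened pipeline definition. The class and method names are illustrative and are not part of the Lindorm or client API.

```java
public class PipelineJsonCompat {

    // Equivalent of a JDK 15+ text block such as:
    //
    //   String json = """
    //       {
    //         "description": "demo_chunking pipeline",
    //         "processors": []
    //       }
    //       """;
    //
    // Each line of the text block becomes one quoted literal ending in \n,
    // and the inner double quotes are escaped.
    static String pipelineJson() {
        return "{\n"
             + "  \"description\": \"demo_chunking pipeline\",\n"
             + "  \"processors\": []\n"
             + "}\n";
    }

    public static void main(String[] args) {
        System.out.print(pipelineJson());
    }
}
```

The trailing newline can be dropped if the consumer does not need it; JSON parsers ignore the surrounding whitespace either way.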

Preparations

Before you use the advanced features, install the Java High Level REST Client and connect to the search engine. For more information, see Preparations.

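Procedure overview

Step	Engine	Description
Deploy the embedding model in the AI engine	AI engine	Call the AI engine RESTful API with curl to deploy the BGE-M3 embedding model, which converts text into vectors.
Create a write pipeline	Search engine	Create a write pipeline in the search engine so that text is automatically converted into vectors (embeddings) when data is written.
Create a query pipeline	Search engine	Create a query pipeline in the search engine so that query text is automatically converted into vectors when data is queried.
Create an index and specify the pipelines	Vector engine, search engine	When you create or modify a vector index, specify the write and query pipelines so that written and queried data is automatically converted into vectors.
Write data	Vector engine, search engine	The specified write pipeline automatically converts the written text into vectors.
Query data	Vector engine, search engine	The specified query pipeline automatically converts the query text into vectors.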

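Deploy the embedding model in the AI engine

For how to deploy a model in the AI engine, see Model management and Use the AI engine RESTful API with curl.

The following example deploys the BGE-M3 model. For parameter details, see Model management.

Important

Use the VPC endpoint of the AI engine as the URL in the curl request.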

curl -i -k --location --header 'x-ld-ak:<username>' --header 'x-ld-sk:<password>' -X POST http://<URL>/v1/ai/models/create  -H "Content-Type: application/json" -d '{
          "model_name": "bge_m3_model",
          "model_path": "huggingface://BAAI/bge-m3",
          "task": "FEATURE_EXTRACTION",
          "algorithm": "BGE_M3",
          "settings": {"instance_count": "2"}
     }'
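The same request can also be issued from Java with the JDK's built-in java.net.http.HttpClient (JDK 11 or later). This is a minimal sketch that mirrors the curl command above. The class name, the buildRequest helper, and the command-line arguments are assumptions made for illustration, and the response format is not described in this topic.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DeployEmbeddingModel {

    // Request body copied from the curl example.
    static final String BODY = "{"
        + "\"model_name\": \"bge_m3_model\","
        + "\"model_path\": \"huggingface://BAAI/bge-m3\","
        + "\"task\": \"FEATURE_EXTRACTION\","
        + "\"algorithm\": \"BGE_M3\","
        + "\"settings\": {\"instance_count\": \"2\"}"
        + "}";

    // Builds the POST request. baseUrl is the AI engine VPC endpoint,
    // including scheme, host, and port.
    static HttpRequest buildRequest(String baseUrl, String username, String password) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/ai/models/create"))
            .header("x-ld-ak", username)
            .header("x-ld-sk", password)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(BODY))
            .build();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical invocation: endpoint, username, and password as arguments.
        HttpRequest request = buildRequest(args[0], args[1], args[2]);
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Keeping request construction separate from sending makes it possible to inspect the URL and headers without network access.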

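Create pipelines in the search engine

Create two pipelines in the search engine: one that vectorizes data automatically on write and one that vectorizes query text automatically on search.

Create a write pipeline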

String pipelineId = "write_embedding_pipeline";
String pipelineDefinition = """
  {
    "description": "demo_chunking pipeline",
    "processors": [
     {
        "text-embedding": {
          "inputFields": ["text_field"],
          "outputFields": ["text_field_embedding"],
          "userName": "user", // Username of the AI engine
          "password": "test****", // Password of the AI engine
          "url": "http://ld-t4n5668xk31ui****-proxy-ai-vpc.lindorm.aliyuncs.com:9002", // VPC endpoint of the AI engine
          "modeName": "bge_m3_model"
        }
     }
    ]
  }
  """;
BytesArray source = new BytesArray(pipelineDefinition.getBytes(StandardCharsets.UTF_8));
PutPipelineRequest request = new PutPipelineRequest(pipelineId, source, XContentType.JSON);
AcknowledgedResponse response = client.ingest().putPipeline(request, RequestOptions.DEFAULT);
System.out.println("CreatePipeline Acknowledged: " + response.isAcknowledged());

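Parameters

Parameter	Description
processors	Defines the pipeline processing applied to written data.
text-embedding	Fixed key. Required.
inputFields	The text fields to vectorize.
outputFields	The vector fields that store the resulting embeddings.
userName	The username of the Lindorm AI engine.
password	The password of the Lindorm AI engine.
url	The endpoint of the AI engine. Make sure to use the VPC endpoint.
modeName	The model name. In this topic, it is `bge_m3_model`.

Note

The inputFields and outputFields specified in the write and query pipelines must match the text_field and text_field_embedding fields defined when the vector index is created.

Create a query pipeline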

String jsonString = """
{
  "request_processors": [
    {
      "text-embedding" : {
        "tag" : "auto-query-embedding",
        "description" : "Auto query embedding",
        "model_config" : {
          "inputFields": ["text_field"],
          "outputFields": ["text_field_embedding"],
          "userName": "user", // Username of the AI engine
          "password": "test****", // Password of the AI engine
          "url": "http://ld-t4n5668xk31ui****-proxy-ai-vpc.lindorm.aliyuncs.com:9002", // VPC endpoint of the AI engine
          "modeName": "bge_m3_model"
        }
      }
    }
  ]
}
""";
String pipelineName = "knnsearch_pipeline";
Request createPipelineRequest = new Request("PUT", "/_search/pipeline/" + pipelineName);
createPipelineRequest.setJsonEntity(jsonString);
Response response = client.getLowLevelClient().performRequest(createPipelineRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("create knnSearch Pipeline responseBody = " + responseBody);

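Parameters

Parameter	Description
request_processors	Defines the pipeline processing applied to search requests.
text-embedding	Fixed key. Required.
inputFields	The text field to vectorize. It serves as a placeholder.
outputFields	The vector field that holds the resulting embedding.
userName	The username of the Lindorm AI engine.
password	The password of the Lindorm AI engine.
url	The endpoint of the AI engine. Make sure to use the VPC endpoint.
modeName	The model name. In this topic, it is `bge_m3_model`.

Note

The inputFields and outputFields specified in the write and query pipelines must match the text_field and text_field_embedding fields defined when the vector index is created.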
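Because the two pipelines and the index mapping must agree on field names, it can help to define the names once and generate both pipeline bodies from them. The following is a minimal sketch under that assumption. EmbeddingPipelines and its methods are hypothetical helpers, not part of the Lindorm API, and the credentials and URL in main are placeholders.

```java
public class EmbeddingPipelines {

    // Single source of truth for the field names used by both pipelines
    // and by the index mapping.
    static final String TEXT_FIELD = "text_field";
    static final String VECTOR_FIELD = "text_field_embedding";

    // Settings shared by the write and query processors.
    // Values are inserted without JSON escaping, so they must not
    // contain double quotes or backslashes.
    static String modelConfig(String user, String password, String url, String model) {
        return String.format(
            "\"inputFields\": [\"%s\"], \"outputFields\": [\"%s\"], "
          + "\"userName\": \"%s\", \"password\": \"%s\", \"url\": \"%s\", \"modeName\": \"%s\"",
            TEXT_FIELD, VECTOR_FIELD, user, password, url, model);
    }

    // Body for the write (ingest) pipeline.
    static String writePipeline(String user, String password, String url, String model) {
        return "{\"processors\": [{\"text-embedding\": {"
            + modelConfig(user, password, url, model) + "}}]}";
    }

    // Body for the query (search) pipeline.
    static String queryPipeline(String user, String password, String url, String model) {
        return "{\"request_processors\": [{\"text-embedding\": {"
            + "\"tag\": \"auto-query-embedding\", "
            + "\"model_config\": {" + modelConfig(user, password, url, model) + "}}}]}";
    }

    public static void main(String[] args) {
        System.out.println(writePipeline("user", "pwd", "http://host:9002", "bge_m3_model"));
        System.out.println(queryPipeline("user", "pwd", "http://host:9002", "bge_m3_model"));
    }
}
```

The generated strings can be passed to the same client calls shown in the pipeline examples above, and the two constants can be reused as keys when building the index mapping.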

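Create an index and specify the pipelines

Specify the required pipelines when you create a vector index or modify the settings of an existing one.

Create a vector index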

String indexName = "search_vector_test";
CreateIndexRequest createIndexRequest = new CreateIndexRequest(indexName);

createIndexRequest.settings(Map.of(
  "index", Map.of(
    "number_of_shards", 2,
    "knn", true,
    "default_pipeline", "write_embedding_pipeline",
    "search.default_pipeline", "knnsearch_pipeline")));

createIndexRequest.mapping(Map.of(
  "_source", Map.of("excludes", new String[] {"text_field_embedding"}),
  "properties", Map.of(
    "text_field", Map.of(
      "type", "text",
      "analyzer", "ik_max_word"
    ),
    "text_field_embedding", Map.of(
      "type", "knn_vector",
      "dimension", 1024,
      "data_type", "float",
      "method", Map.of(
        "engine", "lvector",
        "name", "hnsw",
        "space_type", "cosinesimil",
        "parameters", Map.of(
          "m", 24,
          "ef_construction", 500
        )
      )
    ),
    "tag", Map.of(
      "type", "keyword"
    ),
    "brand", Map.of(
      "type", "keyword"
    ),
    "merit", Map.of(
      "type", "text",
      "analyzer", "ik_max_word"
    )
  )
));

CreateIndexResponse createIndexResponse = client.indices().create(createIndexRequest, RequestOptions.DEFAULT);
System.out.println("createIndexResponse: " + createIndexResponse.index());

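Modify the settings of an existing vector index

If the vector index already exists, you can update its settings as follows to specify the pipelines used for writes and queries.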

String indexName = "search_vector_test";
String jsonString = """
  {
    "index": {
      "default_pipeline": "write_embedding_pipeline",
      "search.default_pipeline": "knnsearch_pipeline"
    }
  }
  """;
UpdateSettingsRequest updateSettingsRequest = new UpdateSettingsRequest(indexName).settings(jsonString, XContentType.JSON);
AcknowledgedResponse response = client.indices().putSettings(updateSettingsRequest, RequestOptions.DEFAULT);
System.out.println("updateIndexSettings Acknowledged: " + response.isAcknowledged());

Write data

Because a write pipeline is specified, each write stores the text field text_field and also encodes text_field into a vector according to the pipeline, storing the result as text_field_embedding.

BulkRequest bulkRequest = new BulkRequest();

// Adding multiple IndexRequest to BulkRequest
bulkRequest.add(new IndexRequest("search_vector_test").id("3982")
  .source(XContentType.JSON,
    "text_field", "品牌A 时尚节能无线鼠标(草绿)(眩光.悦动.时尚炫舞鼠标 12个月免换电池 高精度光学寻迹引擎 超细微接收器10米传输距离)",
    "tag", new String[] {"鼠标", "电子产品"},
    "brand", "品牌A",
    "merit", "好用、外观漂亮"));

bulkRequest.add(new IndexRequest("search_vector_test").id("323519")
  .source(XContentType.JSON,
    "text_field", "品牌B 光学鼠标(经典黑)(智能自动对码/1000DPI高精度光学引擎)",
    "tag", new String[] {"鼠标", "电子产品"},
    "brand", "品牌B",
    "merit", "质量好、到货速度快、外观漂亮、好用"));

bulkRequest.add(new IndexRequest("search_vector_test").id("300265")
  .source(XContentType.JSON,
    "text_field", "品牌C 耳塞式耳机 白色(经典时尚)",
    "tag", new String[] {"耳机", "电子产品"},
    "brand", "品牌C",
    "merit", "外观漂亮、质量好"));

bulkRequest.add(new IndexRequest("search_vector_test").id("6797")
  .source(XContentType.JSON,
    "text_field", "品牌D 两刀头充电式电动剃须刀",
    "tag", new String[] {"家用电器", "电动剃须刀"},
    "brand", "品牌D",
    "merit", "好用、外观漂亮"));

bulkRequest.add(new IndexRequest("search_vector_test").id("8195")
  .source(XContentType.JSON,
    "text_field", "品牌E Class4 32G TF卡(micro SD)手机存储卡",
    "tag", new String[] {"存储设备", "存储卡", "SD卡"},
    "brand", "品牌E",
    "merit", "容量挺大的、速度快、好用、质量好"));

bulkRequest.add(new IndexRequest("search_vector_test").id("13316")
  .source(XContentType.JSON,
    "text_field", "品牌E 101 G2 32GB 优盘",
    "tag", new String[] {"存储设备", "U盘", "优盘"},
    "brand", "品牌E",
    "merit", "好用、容量挺大的、速度快"));

bulkRequest.add(new IndexRequest("search_vector_test").id("14103")
  .source(XContentType.JSON,
    "text_field", "品牌B 64GB至尊高速移动存储卡 UHS-1制式 读写速度最高可达30MB",
    "tag", new String[] {"存储设备", "存储卡", "SD卡"},
    "brand", "品牌B",
    "merit", "容量挺大的、速度快、好用"));

bulkRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);

if (bulkResponse.hasFailures()) {
  // Handle any failed items
  System.out.println("Bulk operation had failures:");
  System.out.println(bulkResponse.buildFailureMessage());
} else {
  System.out.println("Bulk operation completed successfully.");
}

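Query data

The following examples cover three query types, in this order: pure vector search, vector search with attribute filtering, and vector search with full-text search and attribute filtering.

Pure vector search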

SearchRequest searchRequest = new SearchRequest();
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
Map<String, Object> queryBody = Map.of(
  "knn", Map.of(
    "text_field_embedding", Map.of(
      "query_text", "存储卡",
      "k", 10
    )
  )
);
searchSourceBuilder.size(10);
searchSourceBuilder.query(QueryBuilders.wrapperQuery(new Gson().toJson(queryBody)));
Map<String, String> ext = Map.of("ef_search", "200");
searchSourceBuilder.ext(Collections.singletonList(new LVectorExtBuilder("lvector", ext)));
searchRequest.source(searchSourceBuilder);
searchRequest.indices("search_vector_test");
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
System.out.println(searchResponse);

Response

search responseBody = {
  "took": 46,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 0,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "max_score": 0.7433592,
    "hits": [
      {
        "_index": "search_vector_test",
        "_id": "8195",
        "_score": 0.7433592,
        "_source": {
          "text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
          "merit": "容量挺大的、速度快、好用、质量好",
          "tag": [
            "存储设备",
            "存储卡",
            "SD卡"
          ],
          "brand": "品牌E"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "14103",
        "_score": 0.7116537,
        "_source": {
          "text_field": "品牌B 64GB至尊高速移动存储卡 UHS-1制式 读写速度最高可达30MB",
          "merit": "容量挺大的、速度快、好用",
          "tag": [
            "存储设备",
            "存储卡",
            "SD卡"
          ],
          "brand": "品牌B"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "13316",
        "_score": 0.6831677,
        "_source": {
          "text_field": "品牌E 101 G2 32GB 优盘",
          "merit": "好用、容量挺大的、速度快",
          "tag": [
            "存储设备",
            "U盘",
            "优盘"
          ],
          "brand": "品牌E"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "3982",
        "_score": 0.64234203,
        "_source": {
          "text_field": "品牌A 时尚节能无线鼠标(草绿)(眩光.悦动.时尚炫舞鼠标 12个月免换电池 高精度光学寻迹引擎 超细微接收器10米传输距离)",
          "merit": "好用、外观漂亮",
          "tag": [
            "鼠标",
            "电子产品"
          ],
          "brand": "品牌A"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "6797",
        "_score": 0.6357207,
        "_source": {
          "text_field": "品牌D 两刀头充电式电动剃须刀",
          "merit": "好用、外观漂亮",
          "tag": [
            "家用电器",
            "电动剃须刀"
          ],
          "brand": "品牌D"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "323519",
        "_score": 0.62445086,
        "_source": {
          "text_field": "品牌B 光学鼠标(经典黑)(智能自动对码/1000DPI高精度光学引擎)",
          "merit": "质量好、到货速度快、外观漂亮、好用",
          "tag": [
            "鼠标",
            "电子产品"
          ],
          "brand": "品牌B"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "300265",
        "_score": 0.62144196,
        "_source": {
          "text_field": "品牌C 耳塞式耳机 白色(经典时尚)",
          "merit": "外观漂亮、质量好",
          "tag": [
            "耳机",
            "电子产品"
          ],
          "brand": "品牌C"
        }
      }
    ]
  }
}

Vector search with attribute filtering

SearchRequest searchRequest = new SearchRequest();
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
Map<String, Object> queryBody = Map.of(
  "knn", Map.of(
    "text_field_embedding", Map.of(
      "query_text", "存储卡",
      "k", 10,
      "filter", Map.of(
        "bool", Map.of(
          "filter", List.of(
            Map.of("match", Map.of("merit", "质量好")),
            Map.of("term", Map.of("brand", "品牌E")),
            Map.of("terms", Map.of("tag", List.of("SD卡", "存储卡")))
          )
        )
      )
    )
  )
);

searchSourceBuilder.size(10);
searchSourceBuilder.query(QueryBuilders.wrapperQuery(new Gson().toJson(queryBody)));
Map<String, String> ext = Map.of("filter_type", "efficient_filter", "ef_search", "200");
searchSourceBuilder.ext(Collections.singletonList(new LVectorExtBuilder("lvector", ext)));
searchRequest.source(searchSourceBuilder);
searchRequest.indices("search_vector_test");
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
System.out.println(searchResponse);

Response

search responseBody = {
  "took": 73,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.7433592,
    "hits": [
      {
        "_index": "search_vector_test",
        "_id": "8195",
        "_score": 0.7433592,
        "_source": {
          "text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
          "merit": "容量挺大的、速度快、好用、质量好",
          "tag": [
            "存储设备",
            "存储卡",
            "SD卡"
          ],
          "brand": "品牌E"
        }
      }
    ]
  }
}

Vector search with full-text search and attribute filtering

SearchRequest searchRequest = new SearchRequest();
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
Map<String, Object> queryBody = Map.of(
  "knn", Map.of(
    "text_field_embedding", Map.of(
      "query_text", "存储卡",
      "filter", Map.of(
        "bool", Map.of(
          "must", List.of(
            Map.of(
              "bool", Map.of(
                "must", List.of(
                  Map.of("match", Map.of("text_field", Map.of("query", "存储卡")))
                )
              )
            ),
            Map.of(
              "bool", Map.of(
                "filter", List.of(
                  Map.of("match", Map.of("merit", "质量好")),
                  Map.of("term", Map.of("brand", "品牌E")),
                  Map.of("terms", Map.of("tag", List.of("SD卡", "存储卡")))
                )
              )
            )
          )
        )
      ),
      "k", 10
    )
  )
);
searchSourceBuilder.size(10);
searchSourceBuilder.query(QueryBuilders.wrapperQuery(new Gson().toJson(queryBody)));
Map<String, String> ext = Map.of(
  "filter_type", "efficient_filter",
  "hybrid_search_type", "filter_rrf",
  "rrf_rank_constant", "1",
  "ef_search", "200");
searchSourceBuilder.ext(Collections.singletonList(new LVectorExtBuilder("lvector", ext)));
searchRequest.source(searchSourceBuilder);
searchRequest.indices("search_vector_test");
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
System.out.println(searchResponse);

Response

search responseBody = {
  "took": 95,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 0,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "search_vector_test",
        "_id": "8195",
        "_score": 1.0,
        "_source": {
          "text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
          "merit": "容量挺大的、速度快、好用、质量好",
          "tag": [
            "存储设备",
            "存储卡",
            "SD卡"
          ],
          "brand": "品牌E"
        }
      }
    ]
  }
}
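
The filter_rrf setting and rrf_rank_constant in the last query appear to refer to reciprocal rank fusion (RRF). This topic does not spell out how Lindorm computes the fused score, so the sketch below uses the commonly published RRF formula, score(d) = sum of 1 / (k + rank(d)) over all ranked lists with 1-based ranks, purely as an illustration. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RrfSketch {

    // Fuses several ranked lists of document IDs with reciprocal rank fusion.
    // rankConstant plays the role of k in 1 / (k + rank); ranks start at 1.
    static Map<String, Double> fuse(List<List<String>> rankedLists, int rankConstant) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                double contribution = 1.0 / (rankConstant + i + 1);
                scores.merge(ranked.get(i), contribution, Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Assumed rankings: document 8195 is first in both the vector
        // list and the full-text list.
        List<String> vectorRanking = List.of("8195", "14103");
        List<String> textRanking = List.of("8195");
        System.out.println(fuse(List.of(vectorRanking, textRanking), 1));
    }
}
```

With a rank constant of 1, a document ranked first in both lists scores 1/2 + 1/2 = 1.0. That is consistent with the _score of 1.0 in the response above, although the match alone does not confirm the exact formula Lindorm uses.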