Automatic Embedding for Data Writes and Queries

Last updated: 2025-03-19 01:48:24

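Automatic Embedding uses a built-in pretrained model to convert text into vectors automatically, which removes the tedious manual handling of vector fields that traditional approaches require. This topic describes how to use the Java High Level REST Client to write and query automatically embedded data in the Lindorm vector engine.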

Prerequisites

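  • A Java environment is installed, with JDK 1.8 or later.

  • The vector engine is activated. For instructions, see Activate the vector engine.

  • The search engine is activated and runs version 3.9.10 or later. For instructions, see the Activation guide. To check or upgrade the current version, see Search engine version notes and Upgrade the minor version.

    Important

    If your search engine version is earlier than 3.9.10 but the console shows that it is already the latest version, contact Lindorm technical support (DingTalk ID: s0s3eg3).

  • The AI engine is activated. For instructions, see the Activation guide.

    Note

    The AI engine depends on the wide table engine, so you must activate the wide table engine together with the AI engine.

  • The client IP address is added to the Lindorm whitelist. For instructions, see Configure a whitelist.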

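Usage notes

All JSON strings in the sample code in this topic are written as text blocks, a standard feature in JDK 15 and later. A text block starts and ends with three double-quote characters ("""). If your JDK version is earlier, convert the text blocks back to concatenated multi-line strings. The samples also use Map.of and List.of, which require JDK 9 or later; on JDK 1.8, build the same maps and lists with HashMap and Arrays.asList.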
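The conversion from a text block to string concatenation can be sketched as follows. This is only an illustration: the class and method names are made up for this example, and the JSON is the index settings body used later in this topic. The text block method needs JDK 15 or later to compile, so on an older JDK keep only the concatenation method.

```java
public class TextBlockDemo {

    // JDK 15+: the text block keeps the JSON readable; common leading
    // indentation is stripped and the last line ends with a newline.
    static String asTextBlock() {
        return """
            {
              "index": {
                "default_pipeline": "write_embedding_pipeline",
                "search.default_pipeline": "knnsearch_pipeline"
              }
            }
            """;
    }

    // Older JDKs: the same content as concatenated string literals.
    static String asConcatenation() {
        return "{\n"
            + "  \"index\": {\n"
            + "    \"default_pipeline\": \"write_embedding_pipeline\",\n"
            + "    \"search.default_pipeline\": \"knnsearch_pipeline\"\n"
            + "  }\n"
            + "}\n";
    }

    public static void main(String[] args) {
        // Both forms produce the same string.
        System.out.println(asTextBlock().equals(asConcatenation()));
    }
}
```

Either string can be passed wherever the samples in this topic pass a text block.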

Preparation

Before you use the advanced features, install the Java High Level REST Client and connect to the search engine. For instructions, see Preparation.

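Steps overview

Step | Engines involved | Description
--- | --- | ---
Deploy an Embedding model in the AI engine | AI engine | Call the AI engine RESTful API with a curl command to deploy the BGE-M3 Embedding model, which converts text data into vectors.
Create the write Pipeline | Search engine | Create a write Pipeline in the search engine that automatically converts text data into vector data (Embedding) when data is written.
Create the query Pipeline | Search engine | Create a query Pipeline in the search engine that automatically converts text data into vector data when data is queried.
Create an index and specify the Pipelines | Vector engine, search engine | When you create or modify a vector index, specify the write and query Pipelines so that written and queried data is automatically converted into vector data.
Write data | Vector engine, search engine | The specified write Pipeline automatically converts the written text data into vector data.
Query data | Vector engine, search engine | The specified query Pipeline automatically converts the query text into vector data.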

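Deploy an Embedding model in the AI engine

For details about deploying models in the AI engine, see Model management and the examples of using the AI engine RESTful API through curl commands.

The following example deploys the BGE-M3 model. For parameter details, see Model management.

Important

The URL in the curl request must be the VPC endpoint of the AI engine.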

curl -i -k --location --header 'x-ld-ak:<username>' --header 'x-ld-sk:<password>' -X POST http://<URL>/v1/ai/models/create  -H "Content-Type: application/json" -d '{
          "model_name": "bge_m3_model",
          "model_path": "huggingface://BAAI/bge-m3",
          "task": "FEATURE_EXTRACTION",
          "algorithm": "BGE_M3",
          "settings": {"instance_count": "2"}
     }'
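Since the rest of this topic uses Java, the same request can also be assembled with the JDK's built-in java.net.http API (JDK 11 or later). The following is a minimal sketch, not an official client: the class name, the example endpoint, and the credentials are placeholders, and the request is only built, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class DeployModelRequest {

    // Builds (but does not send) the POST request that the curl command issues.
    static HttpRequest build(String endpoint, String username, String password) {
        String body = "{"
            + "\"model_name\": \"bge_m3_model\","
            + "\"model_path\": \"huggingface://BAAI/bge-m3\","
            + "\"task\": \"FEATURE_EXTRACTION\","
            + "\"algorithm\": \"BGE_M3\","
            + "\"settings\": {\"instance_count\": \"2\"}"
            + "}";
        return HttpRequest.newBuilder(URI.create(endpoint + "/v1/ai/models/create"))
            .header("x-ld-ak", username)   // AI engine username
            .header("x-ld-sk", password)   // AI engine password
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint and credentials; use your AI engine VPC endpoint.
        HttpRequest request = build("http://ai-engine.example.internal:9002", "user", "secret");
        System.out.println(request.method() + " " + request.uri());
        // To send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Unlike curl with --location, the default HttpClient does not follow redirects; configure followRedirects on the client builder if you need that behavior.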

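Create Pipelines in the search engine

Create two Pipelines in the search engine: one that vectorizes data automatically on write, and one that vectorizes query text automatically on search.

Create the write Pipeline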

// client is the RestHighLevelClient created in the Preparation step.
String pipelineId = "write_embedding_pipeline";
// Replace the placeholder values before running:
//   userName / password: the username and password of the AI engine
//   url: the VPC endpoint of the AI engine
String pipelineDefinition = """
  {
    "description": "demo_chunking pipeline",
    "processors": [
      {
        "text-embedding": {
          "inputFields": ["text_field"],
          "outputFields": ["text_field_embedding"],
          "userName": "user",
          "password": "test****",
          "url": "http://ld-t4n5668xk31ui****-proxy-ai-vpc.lindorm.aliyuncs.com:9002",
          "modeName": "bge_m3_model"
        }
      }
    ]
  }
  """;
BytesArray source = new BytesArray(pipelineDefinition.getBytes(StandardCharsets.UTF_8));
PutPipelineRequest request = new PutPipelineRequest(pipelineId, source, XContentType.JSON);
AcknowledgedResponse response = client.ingest().putPipeline(request, RequestOptions.DEFAULT);
System.out.println("CreatePipeline Acknowledged: " + response.isAcknowledged());

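Parameters

Parameter | Description
--- | ---
processors | The Pipeline operations applied to written data.
text-embedding | A fixed key. Required.
inputFields | The text fields to vectorize.
outputFields | The vector fields that store the resulting vectors.
userName | The username of the Lindorm AI engine.
password | The password of the Lindorm AI engine.
url | The endpoint of the AI engine. You must use the VPC endpoint.
modeName | The model name. In this topic, it is bge_m3_model.

Note

The inputFields and outputFields specified in the write and query Pipelines must match the text_field and text_field_embedding fields specified when the vector index is created.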
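To keep the AI engine credentials out of source code, the write Pipeline definition can be assembled at run time. The following is a minimal sketch under stated assumptions: the class, the helper methods, and the LINDORM_AI_* environment variable names are hypothetical, and only the JSON keys come from the example above.

```java
public class WritePipelineJson {

    // Escapes backslashes and double quotes, the two characters most likely to
    // break a JSON string literal. Control characters are not handled here.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds a write Pipeline definition with the same keys as the sample.
    static String build(String userName, String password, String url, String modelName) {
        return "{\n"
            + "  \"description\": \"write embedding pipeline\",\n"
            + "  \"processors\": [\n"
            + "    {\n"
            + "      \"text-embedding\": {\n"
            + "        \"inputFields\": [\"text_field\"],\n"
            + "        \"outputFields\": [\"text_field_embedding\"],\n"
            + "        \"userName\": \"" + escape(userName) + "\",\n"
            + "        \"password\": \"" + escape(password) + "\",\n"
            + "        \"url\": \"" + escape(url) + "\",\n"
            + "        \"modeName\": \"" + escape(modelName) + "\"\n"
            + "      }\n"
            + "    }\n"
            + "  ]\n"
            + "}\n";
    }

    // Returns the environment variable's value, or a fallback when it is unset.
    static String envOrDefault(String name, String fallback) {
        String value = System.getenv(name);
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        // LINDORM_AI_* are hypothetical variable names chosen for this sketch.
        String json = build(
            envOrDefault("LINDORM_AI_USER", "user"),
            envOrDefault("LINDORM_AI_PASSWORD", "changeme"),
            envOrDefault("LINDORM_AI_URL", "http://ai-engine.example.internal:9002"),
            "bge_m3_model");
        System.out.println(json);
    }
}
```

The resulting string can be used in place of the pipelineDefinition text block when constructing the PutPipelineRequest.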

Create the query Pipeline

// client is the RestHighLevelClient created in the Preparation step.
// Replace the placeholder values before running:
//   userName / password: the username and password of the AI engine
//   url: the VPC endpoint of the AI engine
String jsonString = """
{
  "request_processors": [
    {
      "text-embedding" : {
        "tag" : "auto-query-embedding",
        "description" : "Auto query embedding",
        "model_config" : {
          "inputFields": ["text_field"],
          "outputFields": ["text_field_embedding"],
          "userName": "user",
          "password": "test****",
          "url": "http://ld-t4n5668xk31ui****-proxy-ai-vpc.lindorm.aliyuncs.com:9002",
          "modeName": "bge_m3_model"
        }
      }
    }
  ]
}
""";
String pipelineName = "knnsearch_pipeline";
Request createPipelineRequest = new Request("PUT", "/_search/pipeline/" + pipelineName);
createPipelineRequest.setJsonEntity(jsonString);
Response response =  client.getLowLevelClient().performRequest(createPipelineRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("create knnSearch Pipeline responseBody = " + responseBody);

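Parameters

Parameter | Description
--- | ---
request_processors | The Pipeline operations applied to search requests.
text-embedding | A fixed key. Required.
inputFields | The text fields to vectorize. In the query Pipeline, this value acts as a placeholder.
outputFields | The vector fields that store the resulting vectors.
userName | The username of the Lindorm AI engine.
password | The password of the Lindorm AI engine.
url | The endpoint of the AI engine. You must use the VPC endpoint.
modeName | The model name. In this topic, it is bge_m3_model.

Note

The inputFields and outputFields specified in the write and query Pipelines must match the text_field and text_field_embedding fields specified when the vector index is created.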
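The consistency rule in the note can be checked locally before any Pipeline or index is created. The following is a hypothetical pre-flight check, not part of any Lindorm API: it compares the field names you plan to use in both Pipelines against the property names of the index mapping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class FieldConsistencyCheck {

    // Returns human-readable problems; an empty list means the names line up.
    static List<String> check(List<String> writeIn, List<String> writeOut,
                              List<String> queryIn, List<String> queryOut,
                              Set<String> mappingFields) {
        List<String> problems = new ArrayList<>();
        if (!writeIn.equals(queryIn)) {
            problems.add("inputFields differ between the write and query Pipelines");
        }
        if (!writeOut.equals(queryOut)) {
            problems.add("outputFields differ between the write and query Pipelines");
        }
        // Every Pipeline field must also exist in the index mapping.
        for (String field : writeIn) {
            if (!mappingFields.contains(field)) {
                problems.add("inputFields entry missing from the index mapping: " + field);
            }
        }
        for (String field : writeOut) {
            if (!mappingFields.contains(field)) {
                problems.add("outputFields entry missing from the index mapping: " + field);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Field names taken from the samples in this topic.
        List<String> problems = check(
            List.of("text_field"), List.of("text_field_embedding"),
            List.of("text_field"), List.of("text_field_embedding"),
            Set.of("text_field", "text_field_embedding", "tag", "brand", "merit"));
        System.out.println(problems.isEmpty() ? "Field names are consistent." : problems.toString());
    }
}
```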

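Create an index and specify the Pipelines

Specify the required Pipelines when you create a vector index or modify the settings of an existing one.

Create a vector index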

String indexName = "search_vector_test";
CreateIndexRequest createIndexRequest = new CreateIndexRequest(indexName);

createIndexRequest.settings(Map.of(
  "index", Map.of(
    "number_of_shards", 2,
    "knn", true,
    "default_pipeline", "write_embedding_pipeline",
    "search.default_pipeline", "knnsearch_pipeline")));

createIndexRequest.mapping(Map.of(
  "_source", Map.of("excludes", new String[] {"text_field_embedding"}),
  "properties", Map.of(
    "text_field", Map.of(
      "type", "text",
      "analyzer", "ik_max_word"
    ),
    "text_field_embedding", Map.of(
      "type", "knn_vector",
      "dimension", 1024,
      "data_type", "float",
      "method", Map.of(
        "engine", "lvector",
        "name", "hnsw",
        "space_type", "cosinesimil",
        "parameters", Map.of(
          "m", 24,
          "ef_construction", 500
        )
      )
    ),
    "tag", Map.of(
      "type", "keyword"
    ),
    "brand", Map.of(
      "type", "keyword"
    ),
    "merit", Map.of(
      "type", "text",
      "analyzer", "ik_max_word"
    )
  )
));

CreateIndexResponse createIndexResponse = client.indices().create(createIndexRequest, RequestOptions.DEFAULT);
System.out.println("createIndexResponse: " + createIndexResponse.index());
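The mapping above sets space_type to cosinesimil and dimension to 1024, which should match the length of the vectors that the deployed Embedding model outputs. As a reminder of what cosine similarity measures, here is a small self-contained sketch. It shows the textbook formula only; how the engine turns this value into the _score of a hit is not covered in this topic, and the class name is made up for the example.

```java
public class CosineSimilarity {

    // Textbook cosine similarity: dot(a, b) / (|a| * |b|).
    // Returns NaN if either vector has zero length (all components are 0).
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Tiny 2-dimensional vectors for illustration; real embeddings here have 1024 components.
        float[] same = {1f, 0f};
        float[] orthogonal = {0f, 1f};
        System.out.println(cosine(same, same));
        System.out.println(cosine(same, orthogonal));
    }
}
```

Vectors that point in the same direction score close to 1 regardless of their magnitude, which is why cosine similarity is a common choice for text embeddings.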

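Modify the settings of an existing vector index

If the vector index already exists, you can update its settings as follows to specify the Pipelines used for writes and queries.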

String indexName = "search_vector_test";
String jsonString = """
  {
     "index": {
          "default_pipeline": "write_embedding_pipeline",
           "search.default_pipeline": "knnsearch_pipeline"
       }
   }
  """;
UpdateSettingsRequest updateSettingsRequest = new UpdateSettingsRequest(indexName).settings(jsonString, XContentType.JSON);
AcknowledgedResponse response = client.indices().putSettings(updateSettingsRequest, RequestOptions.DEFAULT);
System.out.println("updateIndexSettings Acknowledged: " + response.isAcknowledged());

Write data

Because a write Pipeline is specified, each write stores the text field text_field and, in addition, the Pipeline encodes text_field into a vector and writes it as text_field_embedding.

BulkRequest bulkRequest = new BulkRequest();

// Adding multiple IndexRequest to BulkRequest
bulkRequest.add(new IndexRequest("search_vector_test").id("3982")
  .source(XContentType.JSON,
    "text_field", "品牌A 时尚节能无线鼠标(草绿)(眩光.悦动.时尚炫舞鼠标 12个月免换电池 高精度光学寻迹引擎 超细微接收器10米传输距离)",
    "tag", new String[] {"鼠标", "电子产品"},
    "brand", "品牌A",
    "merit", "好用、外观漂亮"));

bulkRequest.add(new IndexRequest("search_vector_test").id("323519")
  .source(XContentType.JSON,
    "text_field", "品牌B 光学鼠标(经典黑)(智能自动对码/1000DPI高精度光学引擎)",
    "tag", new String[] {"鼠标", "电子产品"},
    "brand", "品牌B",
    "merit", "质量好、到货速度快、外观漂亮、好用"));

bulkRequest.add(new IndexRequest("search_vector_test").id("300265")
  .source(XContentType.JSON,
    "text_field", "品牌C 耳塞式耳机 白色(经典时尚)",
    "tag", new String[] {"耳机", "电子产品"},
    "brand", "品牌C",
    "merit", "外观漂亮、质量好"));

bulkRequest.add(new IndexRequest("search_vector_test").id("6797")
  .source(XContentType.JSON,
    "text_field", "品牌D 两刀头充电式电动剃须刀",
    "tag", new String[] {"家用电器", "电动剃须刀"},
    "brand", "品牌D",
    "merit", "好用、外观漂亮"));

bulkRequest.add(new IndexRequest("search_vector_test").id("8195")
  .source(XContentType.JSON,
    "text_field", "品牌E Class4 32G TF卡(micro SD)手机存储卡",
    "tag", new String[] {"存储设备", "存储卡", "SD卡"},
    "brand", "品牌E",
    "merit", "容量挺大的、速度快、好用、质量好"));

bulkRequest.add(new IndexRequest("search_vector_test").id("13316")
  .source(XContentType.JSON,
    "text_field", "品牌E 101 G2 32GB 优盘",
    "tag", new String[] {"存储设备", "U盘", "优盘"},
    "brand", "品牌E",
    "merit", "好用、容量挺大的、速度快"));

bulkRequest.add(new IndexRequest("search_vector_test").id("14103")
  .source(XContentType.JSON,
    "text_field", "品牌B 64GB至尊高速移动存储卡 UHS-1制式 读写速度最高可达30MB",
    "tag", new String[] {"存储设备", "存储卡", "SD卡"},
    "brand", "品牌B",
    "merit", "容量挺大的、速度快、好用"));

bulkRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);

if (bulkResponse.hasFailures()) {
  // Handle any failures reported for individual items in the bulk request
  System.out.println("Bulk operation had failures:");
  System.out.println(bulkResponse.buildFailureMessage());
} else {
  System.out.println("Bulk operation completed successfully.");
}

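Query data

The following examples cover three query types, in this order: pure vector search, vector search with attribute filtering, and vector plus full-text search with attribute filtering. Each one passes the text 存储卡 ("memory card") as query_text, and the query Pipeline converts it into a vector.

Pure vector search
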
SearchRequest searchRequest = new SearchRequest();
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
Map<String, Object> queryBody = Map.of(
  "knn", Map.of(
    "text_field_embedding", Map.of(
      "query_text", "存储卡",
      "k", 10
    )
  )
);
searchSourceBuilder.size(10);
searchSourceBuilder.query(QueryBuilders.wrapperQuery(new Gson().toJson(queryBody)));
Map<String, String> ext = Map.of("ef_search", "200");
searchSourceBuilder.ext(Collections.singletonList(new LVectorExtBuilder("lvector", ext)));
searchRequest.source(searchSourceBuilder);
searchRequest.indices("search_vector_test");
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
System.out.println(searchResponse);

Response

search responseBody = {
  "took": 46,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 0,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "max_score": 0.7433592,
    "hits": [
      {
        "_index": "search_vector_test",
        "_id": "8195",
        "_score": 0.7433592,
        "_source": {
          "text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
          "merit": "容量挺大的、速度快、好用、质量好",
          "tag": [
            "存储设备",
            "存储卡",
            "SD卡"
          ],
          "brand": "品牌E"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "14103",
        "_score": 0.7116537,
        "_source": {
          "text_field": "品牌B 64GB至尊高速移动存储卡 UHS-1制式 读写速度最高可达30MB",
          "merit": "容量挺大的、速度快、好用",
          "tag": [
            "存储设备",
            "存储卡",
            "SD卡"
          ],
          "brand": "品牌B"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "13316",
        "_score": 0.6831677,
        "_source": {
          "text_field": "品牌E 101 G2 32GB 优盘",
          "merit": "好用、容量挺大的、速度快",
          "tag": [
            "存储设备",
            "U盘",
            "优盘"
          ],
          "brand": "品牌E"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "3982",
        "_score": 0.64234203,
        "_source": {
          "text_field": "品牌A 时尚节能无线鼠标(草绿)(眩光.悦动.时尚炫舞鼠标 12个月免换电池 高精度光学寻迹引擎 超细微接收器10米传输距离)",
          "merit": "好用、外观漂亮",
          "tag": [
            "鼠标",
            "电子产品"
          ],
          "brand": "品牌A"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "6797",
        "_score": 0.6357207,
        "_source": {
          "text_field": "品牌D 两刀头充电式电动剃须刀",
          "merit": "好用、外观漂亮",
          "tag": [
            "家用电器",
            "电动剃须刀"
          ],
          "brand": "品牌D"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "323519",
        "_score": 0.62445086,
        "_source": {
          "text_field": "品牌B 光学鼠标(经典黑)(智能自动对码/1000DPI高精度光学引擎)",
          "merit": "质量好、到货速度快、外观漂亮、好用",
          "tag": [
            "鼠标",
            "电子产品"
          ],
          "brand": "品牌B"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "300265",
        "_score": 0.62144196,
        "_source": {
          "text_field": "品牌C 耳塞式耳机 白色(经典时尚)",
          "merit": "外观漂亮、质量好",
          "tag": [
            "耳机",
            "电子产品"
          ],
          "brand": "品牌C"
        }
      }
    ]
  }
}

Vector search with attribute filtering

SearchRequest searchRequest = new SearchRequest();
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
Map<String, Object> queryBody = Map.of(
  "knn", Map.of(
    "text_field_embedding", Map.of(
      "query_text", "存储卡",
      "k", 10,
      "filter", Map.of(
        "bool", Map.of(
          "filter", List.of(
            Map.of("match", Map.of("merit", "质量好")),
            Map.of("term", Map.of("brand", "品牌E")),
            Map.of("terms", Map.of("tag", List.of("SD卡", "存储卡")))
          )
        )
      )
    )
  )
);

searchSourceBuilder.size(10);
searchSourceBuilder.query(QueryBuilders.wrapperQuery(new Gson().toJson(queryBody)));
Map<String, String> ext = Map.of("filter_type", "efficient_filter", "ef_search", "200");
searchSourceBuilder.ext(Collections.singletonList(new LVectorExtBuilder("lvector", ext)));
searchRequest.source(searchSourceBuilder);
searchRequest.indices("search_vector_test");
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
System.out.println(searchResponse);

Response

search responseBody = {
  "took": 73,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.7433592,
    "hits": [
      {
        "_index": "search_vector_test",
        "_id": "8195",
        "_score": 0.7433592,
        "_source": {
          "text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
          "merit": "容量挺大的、速度快、好用、质量好",
          "tag": [
            "存储设备",
            "存储卡",
            "SD卡"
          ],
          "brand": "品牌E"
        }
      }
    ]
  }
}

Vector plus full-text search with attribute filtering

SearchRequest searchRequest = new SearchRequest();
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
Map<String, Object> queryBody = Map.of(
  "knn", Map.of(
    "text_field_embedding", Map.of(
      "query_text", "存储卡",
      "filter", Map.of(
        "bool", Map.of(
          "must", List.of(
            Map.of(
              "bool", Map.of(
                "must", List.of(
                  Map.of("match", Map.of("text_field", Map.of("query", "存储卡")))
                )
              )
            ),
            Map.of(
              "bool", Map.of(
                "filter", List.of(
                  Map.of("match", Map.of("merit", "质量好")),
                  Map.of("term", Map.of("brand", "品牌E")),
                  Map.of("terms", Map.of("tag", List.of("SD卡", "存储卡")))
                )
              )
            )
          )
        )
      ),
      "k", 10
    )
  )
);
searchSourceBuilder.size(10);
searchSourceBuilder.query(QueryBuilders.wrapperQuery(new Gson().toJson(queryBody)));
Map<String, String> ext = Map.of(
  "filter_type", "efficient_filter",
  "hybrid_search_type", "filter_rrf",
  "rrf_rank_constant", "1",
  "ef_search", "200");
searchSourceBuilder.ext(Collections.singletonList(new LVectorExtBuilder("lvector", ext)));
searchRequest.source(searchSourceBuilder);
searchRequest.indices("search_vector_test");
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
System.out.println(searchResponse);

Response

search responseBody = {
  "took": 95,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 0,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "search_vector_test",
        "_id": "8195",
        "_score": 1.0,
        "_source": {
          "text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
          "merit": "容量挺大的、速度快、好用、质量好",
          "tag": [
            "存储设备",
            "存储卡",
            "SD卡"
          ],
          "brand": "品牌E"
        }
      }
    ]
  }
}
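
The hybrid query sets hybrid_search_type to filter_rrf and rrf_rank_constant to 1. Assuming RRF here refers to reciprocal rank fusion, the general idea is that each document receives 1 / (rank constant + rank) from every ranked list it appears in, and the contributions are summed. The sketch below implements that textbook formula with made-up rankings; whether Lindorm computes scores in exactly this way is an assumption, not something this topic states.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RrfSketch {

    // Textbook reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d)),
    // where rank is 1-based and k is the rank constant.
    static Map<String, Double> fuse(List<List<String>> rankedLists, int rankConstant) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankedLists) {
            for (int i = 0; i < ranking.size(); i++) {
                int rank = i + 1;
                scores.merge(ranking.get(i), 1.0 / (rankConstant + rank), Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical rankings of document IDs from a vector search and a full-text search.
        List<String> vectorRanking = List.of("8195", "14103", "13316");
        List<String> fullTextRanking = List.of("8195", "14103");
        Map<String, Double> fused = fuse(List.of(vectorRanking, fullTextRanking), 1);
        fused.forEach((id, score) -> System.out.println(id + " -> " + score));
    }
}
```

With a rank constant of 1, a document ranked first in both lists gets 1/2 + 1/2 = 1.0 under this formula, which is consistent with the _score of 1.0 in the response above, although that alone does not confirm the engine's exact scoring.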