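Write and query data with automatic embedding using Java

Automatic embedding uses a built-in pretrained model to convert text into vectors, which removes the tedious manual handling of vector fields that traditional solutions require. This topic describes how to use the Java Low Level REST Client to write and query automatically embedded data in the Lindorm vector engine.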

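Prerequisites

A Java environment is installed. JDK 1.8 or later is required.
The vector engine is activated. For more information, see Activate the vector engine.
The search engine is activated and its version is 3.9.10 or later. For more information about activation, see the activation guide. To view or upgrade the current version, see Search engine release notes and Upgrade the minor version.
Important
If your search engine version is earlier than 3.9.10 but the console shows that it is already the latest version, contact Lindorm technical support (DingTalk ID: s0s3eg3).
The AI engine is activated. For more information, see the activation guide.
Note
The AI engine depends on the wide table engine, so you must activate the wide table engine together with the AI engine.
The IP address of your client is added to the Lindorm whitelist. For more information, see Configure a whitelist.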

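Usage notes

All JSON strings in the sample code in this topic are written as text blocks, a standard feature in JDK 15 and later. A text block starts and ends with three double quotation marks ("""). The String.formatted method used in one sample was also added in JDK 15. If your JDK version is earlier, rewrite the text blocks as concatenated multi-line strings and use String.format instead.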
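For JDK versions earlier than 15, a text block can be rewritten as ordinary concatenated string literals. The following is a minimal sketch that rebuilds the index-settings request body used later in this topic with JDK 8 features only. The class and method names are illustrative and are not part of any Lindorm SDK, and the sketch assumes the pipeline names contain no characters that need JSON escaping.

```java
// Illustrative helper (hypothetical name): produces the same JSON that a
// text block would, using only string concatenation available on JDK 8.
final class TextBlockFallback {

    // Builds the body that attaches a write pipeline and a query pipeline to an index.
    // Assumption: both names are plain identifiers that need no JSON escaping.
    static String pipelineSettings(String writePipeline, String searchPipeline) {
        return "{\n"
             + "  \"index\": {\n"
             + "    \"default_pipeline\": \"" + writePipeline + "\",\n"
             + "    \"search.default_pipeline\": \"" + searchPipeline + "\"\n"
             + "  }\n"
             + "}\n";
    }

    public static void main(String[] args) {
        System.out.println(pipelineSettings("write_embedding_pipeline", "knnsearch_pipeline"));
    }
}
```

The returned string can be passed to setJsonEntity in the same way as the text blocks in the samples.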

Preparations

Before you use the advanced features, install the Java Low Level REST Client and connect to the search engine. For more information, see Preparations.

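Procedure overview

Step	Engine	Description
Deploy an embedding model in the AI engine	AI engine	Call the AI engine RESTful API with curl to deploy the BGE-M3 embedding model, which converts text into vectors.
Create a write pipeline	Search engine	Create a write pipeline in the search engine that automatically converts text into vectors (embeddings) when data is written.
Create a query pipeline	Search engine	Create a query pipeline in the search engine that automatically converts query text into vectors when data is queried.
Create an index and specify the pipelines	Vector engine, search engine	When you create or modify a vector index, specify the write and query pipelines so that written data and query text are automatically converted into vectors.
Write data	Vector engine, search engine	The specified write pipeline automatically converts the written text into vectors.
Query data	Vector engine, search engine	The specified query pipeline automatically converts the query text into vectors.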

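Deploy an embedding model in the AI engine

For more information about how to deploy a model in the AI engine, see Model management and Use the AI engine RESTful API with curl.

The following example deploys the BGE-M3 model. For more information about the parameters, see Model management.

Important

In the curl request URL, use the VPC endpoint of the AI engine.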

curl -i -k --location --header 'x-ld-ak:<username>' --header 'x-ld-sk:<password>' -X POST http://<URL>/v1/ai/models/create  -H "Content-Type: application/json" -d '{
          "model_name": "bge_m3_model",
          "model_path": "huggingface://BAAI/bge-m3",
          "task": "FEATURE_EXTRACTION",
          "algorithm": "BGE_M3",
          "settings": {"instance_count": "2"}
     }'
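If you prefer to send the same deployment request from Java rather than curl, the following is a minimal sketch based on the JDK built-in java.net.http.HttpClient, which requires JDK 11 or later. The class name and the command-line arguments are assumptions made for illustration. The endpoint, username, and password are placeholders that you supply, and the headers and request body mirror the curl command above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative sketch (hypothetical class name): the curl deployment call expressed with the JDK HTTP client.
final class DeployModelRequest {

    // Builds the POST request. endpoint is the VPC endpoint of the AI engine,
    // for example "http://<host>:<port>" (placeholder, supplied by the caller).
    static HttpRequest build(String endpoint, String username, String password) {
        String body = "{"
            + "\"model_name\": \"bge_m3_model\","
            + "\"model_path\": \"huggingface://BAAI/bge-m3\","
            + "\"task\": \"FEATURE_EXTRACTION\","
            + "\"algorithm\": \"BGE_M3\","
            + "\"settings\": {\"instance_count\": \"2\"}"
            + "}";
        return HttpRequest.newBuilder()
            .uri(URI.create(endpoint + "/v1/ai/models/create"))
            .header("x-ld-ak", username)
            .header("x-ld-sk", password)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: DeployModelRequest <endpoint> <username> <password>");
            return;
        }
        // Follow redirects, similar to curl --location.
        HttpClient client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();
        HttpResponse<String> response =
            client.send(build(args[0], args[1], args[2]), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The sketch does not reproduce the curl -k option, which only affects HTTPS connections.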

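Create pipelines in the search engine

Create two pipelines in the search engine: one vectorizes data automatically at write time, and the other vectorizes query text automatically at query time.

Create a write pipeline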

String jsonString = """
  {
    "description": "demo_chunking pipeline",
    "processors": [
      {
        "text-embedding": {
          "inputFields": ["text_field"],
          "outputFields": ["text_field_embedding"],
          "userName": "user", // Username of the AI engine
          "password": "test****", // Password of the AI engine
          "url": "http://ld-t4n5668xk31ui****-proxy-ai-vpc.lindorm.aliyuncs.com:9002", // VPC endpoint of the AI engine
          "modeName": "bge_m3_model"
        }
      }
    ]
  }
""";
String pipelineName = "write_embedding_pipeline";
Request createPipelineRequest = new Request("PUT", "/_ingest/pipeline/" + pipelineName);
createPipelineRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(createPipelineRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("createPipeline responseBody = " + responseBody);

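Parameters

Parameter	Description
processors	The pipeline processors applied to written data.
text-embedding	A fixed key. Required.
inputFields	The text fields to vectorize.
outputFields	The vector fields that store the resulting embeddings.
userName	The username of the Lindorm AI engine.
password	The password of the Lindorm AI engine.
url	The endpoint of the AI engine. You must use the VPC endpoint.
modeName	The model name. In this topic, it is `bge_m3_model`.

Note

The inputFields and outputFields specified in the write and query pipelines must match the text_field and text_field_embedding fields that you define when you create the vector index.

Create a query pipeline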

String jsonString = """
{
  "request_processors": [
    {
      "text-embedding" : {
        "tag" : "auto-query-embedding",
        "description" : "Auto query embedding",
        "model_config" : {
          "inputFields": ["text_field"],
          "outputFields": ["text_field_embedding"],
          "userName": "user", // Username of the AI engine
          "password": "test****", // Password of the AI engine
          "url": "http://ld-t4n5668xk31ui****-proxy-ai-vpc.lindorm.aliyuncs.com:9002", // VPC endpoint of the AI engine
          "modeName": "bge_m3_model"
        }
      }
    }
  ]
}
""";
String pipelineName = "knnsearch_pipeline";
Request createPipelineRequest = new Request("PUT", "/_search/pipeline/" + pipelineName);
createPipelineRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(createPipelineRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("create knnSearch Pipeline responseBody = " + responseBody);

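Parameters

Parameter	Description
request_processors	The pipeline processors applied to search requests.
text-embedding	A fixed key. Required.
inputFields	The text fields to vectorize. In the query pipeline, this value serves as a placeholder.
outputFields	The vector fields that hold the resulting embeddings.
userName	The username of the Lindorm AI engine.
password	The password of the Lindorm AI engine.
url	The endpoint of the AI engine. You must use the VPC endpoint.
modeName	The model name. In this topic, it is `bge_m3_model`.

Note

The inputFields and outputFields specified in the write and query pipelines must match the text_field and text_field_embedding fields that you define when you create the vector index.

Create an index and specify the pipelines

When you create a vector index or modify the settings of an existing one, specify the pipelines to use.

Create a vector index

// Create the index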
String indexName = "search_vector_test";
Request indexRequest = new Request("PUT", "/" + indexName);
String jsonString = """
{
 "settings" : {
    "index": {
      "number_of_shards": 2,
      "knn": true,
      "default_pipeline": "write_embedding_pipeline",
      "search.default_pipeline": "knnsearch_pipeline"
    }
  },
  "mappings": {
    "_source": {
      "excludes": ["text_field_embedding"]
    },
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "ik_max_word" 
      },
      "text_field_embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "data_type": "float",
        "method": {
          "engine": "lvector",
          "name": "hnsw", 
          "space_type": "cosinesimil",
          "parameters": {
            "m": 24,
            "ef_construction": 500
         }
       }
      },
      "tag": {
          "type": "keyword"
      },
      "brand": {
          "type": "keyword"
      },
      "merit" : {
        "type": "text",
        "analyzer": "ik_max_word"  
      }
    }
  }
}
""";
indexRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(indexRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("createIndex responseBody = " + responseBody);
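The index mapping above sets space_type to cosinesimil, which means vectors are compared by cosine similarity. As a reminder of what that measure computes, the following is a minimal, self-contained sketch. The class name is illustrative, and this topic does not state how the lvector engine converts the similarity into the _score value, so the numbers produced here are not expected to match the scores in search responses.

```java
// Illustrative sketch (hypothetical class name): plain cosine similarity between two vectors.
final class CosineSimilarity {

    // Returns dot(a, b) / (|a| * |b|). The result is NaN if either vector is all zeros.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] x = {1f, 0f};
        float[] y = {0f, 1f};
        System.out.println(cosine(x, x)); // identical direction
        System.out.println(cosine(x, y)); // orthogonal vectors
    }
}
```

Because cosine similarity depends only on direction, vectors do not need to be normalized before they are written.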

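Modify the settings of an existing vector index

If you have already created a vector index, you can modify its settings as follows to specify the pipelines used for writes and queries.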

String pipelineName = "write_embedding_pipeline";
String knnPipelineName = "knnsearch_pipeline";
String indexName = "vector_test6";
Request linkPipelineRequest = new Request("PUT", "/" + indexName + "/_settings");
String jsonString = """
  {
    "index": {
      "default_pipeline": "%s",
      "search.default_pipeline": "%s"
    }
  }
""".formatted(pipelineName, knnPipelineName);
linkPipelineRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(linkPipelineRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("linkPipeline responseBody = " + responseBody);

Write data

Because a write pipeline is specified, each write stores the text field text_field and also encodes text_field into a vector according to that pipeline, then stores the vector as text_field_embedding.

Request bulkRequest = new Request("POST", "/_bulk");
String jsonString = """
{ "index" : { "_index" : "search_vector_test", "_id" : "3982" } }
{ "text_field" : "品牌A 时尚节能无线鼠标(草绿)(眩光.悦动.时尚炫舞鼠标 12个月免换电池 高精度光学寻迹引擎 超细微接收器10米传输距离)", "tag": ["鼠标", "电子产品"], "brand":"品牌A", "merit":"好用、外观漂亮"}
{ "index" : { "_index" : "search_vector_test", "_id" : "323519" } }
{ "text_field" : "品牌B 光学鼠标(经典黑)(智能自动对码/1000DPI高精度光学引擎)", "tag": ["鼠标", "电子产品"], "brand":"品牌B", "merit":"质量好、到货速度快、外观漂亮、好用"}
{ "index" : { "_index" : "search_vector_test", "_id" : "300265" } }
{ "text_field" : "品牌C 耳塞式耳机 白色(经典时尚)", "tag": ["耳机", "电子产品"], "brand":"品牌C", "merit":"外观漂亮、质量好"}
{ "index" : { "_index" : "search_vector_test", "_id" : "6797" } }
{ "text_field" : "品牌D 两刀头充电式电动剃须刀", "tag": ["家用电器", "电动剃须刀"], "brand":"品牌D", "merit":"好用、外观漂亮"}
{ "index" : { "_index" : "search_vector_test", "_id" : "8195" } }
{ "text_field" : "品牌E Class4 32G TF卡(micro SD)手机存储卡", "tag": ["存储设备", "存储卡", "SD卡"], "brand":"品牌E", "merit":"容量挺大的、速度快、好用、质量好"}
{ "index" : { "_index" : "search_vector_test", "_id" : "13316" } }
{ "text_field" : "品牌E 101 G2 32GB 优盘", "tag": ["存储设备","U盘", "优盘"], "brand":"品牌E", "merit":"好用、容量挺大的、速度快"}
{ "index" : { "_index" : "search_vector_test", "_id" : "14103" } }
{ "text_field" : "品牌B 64GB至尊高速移动存储卡 UHS-1制式 读写速度最高可达30MB", "tag": ["存储设备", "存储卡", "SD卡"], "brand":"品牌B", "merit":"容量挺大的、速度快、好用"}
""";
bulkRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(bulkRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("bulkWriteDoc responseBody = " + responseBody);
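In practice, the bulk body is usually assembled from application data rather than written by hand. The following is a minimal sketch of such a builder. The class and method names are illustrative assumptions, each document is expected to be supplied as a single-line JSON string, and the escaping covers only backslashes and double quotation marks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (hypothetical class name): assembles a newline-delimited _bulk body.
final class BulkBodyBuilder {

    // Minimal escaping: handles backslashes and double quotation marks only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // docsById maps a document ID to its source as a single-line JSON string.
    // Every action line and every source line ends with a newline, including the last one.
    static String build(String index, Map<String, String> docsById) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : docsById.entrySet()) {
            if (e.getValue().indexOf('\n') >= 0) {
                throw new IllegalArgumentException("document source must be a single line: " + e.getKey());
            }
            sb.append("{ \"index\" : { \"_index\" : \"").append(escape(index))
              .append("\", \"_id\" : \"").append(escape(e.getKey())).append("\" } }\n");
            sb.append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("6797", "{ \"text_field\" : \"品牌D 两刀头充电式电动剃须刀\", \"brand\": \"品牌D\" }");
        System.out.print(build("search_vector_test", docs));
    }
}
```

The resulting string can be passed to bulkRequest.setJsonEntity in place of the hand-written text block.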

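Query data

The following three examples show, in order, a pure vector query, a vector query with attribute filtering, and a vector query combined with full-text search and attribute filtering.

Pure vector search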

Request searchRequest = new Request("POST", "/" + indexName + "/_search?pretty");
String jsonString = """
{
  "size": 10,
  "_source": true,
  "query": {
    "knn": {
      "text_field_embedding": {
        "query_text": "存储卡",
        "k": 10
      }
    }
  },
  "ext": {
    "lvector": {
      "ef_search": "200"
    }
  }
}  
""";
searchRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(searchRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("search responseBody = " + responseBody);

Response

search responseBody = {
  "took": 46,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 0,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "max_score": 0.7433592,
    "hits": [
      {
        "_index": "search_vector_test",
        "_id": "8195",
        "_score": 0.7433592,
        "_source": {
          "text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
          "merit": "容量挺大的、速度快、好用、质量好",
          "tag": [
            "存储设备",
            "存储卡",
            "SD卡"
          ],
          "brand": "品牌E"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "14103",
        "_score": 0.7116537,
        "_source": {
          "text_field": "品牌B 64GB至尊高速移动存储卡 UHS-1制式 读写速度最高可达30MB",
          "merit": "容量挺大的、速度快、好用",
          "tag": [
            "存储设备",
            "存储卡",
            "SD卡"
          ],
          "brand": "品牌B"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "13316",
        "_score": 0.6831677,
        "_source": {
          "text_field": "品牌E 101 G2 32GB 优盘",
          "merit": "好用、容量挺大的、速度快",
          "tag": [
            "存储设备",
            "U盘",
            "优盘"
          ],
          "brand": "品牌E"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "3982",
        "_score": 0.64234203,
        "_source": {
          "text_field": "品牌A 时尚节能无线鼠标(草绿)(眩光.悦动.时尚炫舞鼠标 12个月免换电池 高精度光学寻迹引擎 超细微接收器10米传输距离)",
          "merit": "好用、外观漂亮",
          "tag": [
            "鼠标",
            "电子产品"
          ],
          "brand": "品牌A"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "6797",
        "_score": 0.6357207,
        "_source": {
          "text_field": "品牌D 两刀头充电式电动剃须刀",
          "merit": "好用、外观漂亮",
          "tag": [
            "家用电器",
            "电动剃须刀"
          ],
          "brand": "品牌D"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "323519",
        "_score": 0.62445086,
        "_source": {
          "text_field": "品牌B 光学鼠标(经典黑)(智能自动对码/1000DPI高精度光学引擎)",
          "merit": "质量好、到货速度快、外观漂亮、好用",
          "tag": [
            "鼠标",
            "电子产品"
          ],
          "brand": "品牌B"
        }
      },
      {
        "_index": "search_vector_test",
        "_id": "300265",
        "_score": 0.62144196,
        "_source": {
          "text_field": "品牌C 耳塞式耳机 白色(经典时尚)",
          "merit": "外观漂亮、质量好",
          "tag": [
            "耳机",
            "电子产品"
          ],
          "brand": "品牌C"
        }
      }
    ]
  }
}
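The pure vector query above hard-codes the query text, k, and ef_search. The following is a minimal sketch of a helper that fills these values in at run time. The class and method names are illustrative assumptions, the generated body has the same structure as the sample query, and the escaping covers only backslashes and double quotation marks.

```java
import java.util.Locale;

// Illustrative sketch (hypothetical class name): builds the pure vector query body from parameters.
final class KnnQueryBuilder {

    // Minimal escaping: handles backslashes and double quotation marks only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // size is set to k so that all k nearest neighbors can be returned.
    static String build(String vectorField, String queryText, int k, int efSearch) {
        return String.format(Locale.ROOT,
              "{\n"
            + "  \"size\": %d,\n"
            + "  \"_source\": true,\n"
            + "  \"query\": {\n"
            + "    \"knn\": {\n"
            + "      \"%s\": { \"query_text\": \"%s\", \"k\": %d }\n"
            + "    }\n"
            + "  },\n"
            + "  \"ext\": { \"lvector\": { \"ef_search\": \"%d\" } }\n"
            + "}\n",
            k, escape(vectorField), escape(queryText), k, efSearch);
    }

    public static void main(String[] args) {
        System.out.print(build("text_field_embedding", "存储卡", 10, 200));
    }
}
```

The returned string can be passed to searchRequest.setJsonEntity in the same way as the text block in the sample.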

Vector search with attribute filtering

Request searchRequest = new Request("POST", "/" + indexName + "/_search?pretty");
String jsonString = """
{
  "size": 10,
  "_source": true,
  "query": {
    "knn": {
      "text_field_embedding": {
        "query_text": "存储卡",
        "k": 10,
        "filter": {
          "bool": {
            "filter": [{
              "match": {
                "merit": "质量好"
              }
            },
            {
              "term": {
                "brand": "品牌E"
              }
            },
            {
              "terms": {
                "tag": ["SD卡", "存储卡"]
              }
            }]
          }
        }
      }
    }
  },
  "ext": {
    "lvector": {
      "filter_type": "efficient_filter",
      "ef_search": "200"
    }
  }
}
""";
searchRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(searchRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("search responseBody = " + responseBody);

Response

search responseBody = {
  "took": 73,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.7433592,
    "hits": [
      {
        "_index": "search_vector_test",
        "_id": "8195",
        "_score": 0.7433592,
        "_source": {
          "text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
          "merit": "容量挺大的、速度快、好用、质量好",
          "tag": [
            "存储设备",
            "存储卡",
            "SD卡"
          ],
          "brand": "品牌E"
        }
      }
    ]
  }
}

Vector search with full-text search and attribute filtering

Request searchRequest = new Request("POST", "/" + indexName + "/_search?pretty");
String jsonString = """
{
  "size": 10,
  "_source": true,
  "query": {
    "knn": {
      "text_field_embedding": {
        "query_text": "存储卡",
        "filter": {
          "bool": {
            "must": [{
              "bool": {
                "must": [{
                  "match": {
                    "text_field": {
                      "query": "存储卡"
                    }
                  }
                }]
              }
            },
            {
              "bool": {
                "filter": [{
                  "match": {
                    "merit": "质量好"
                  }
                },
                {
                  "term": {
                    "brand": "品牌E"
                  }
                },
                {
                  "terms": {
                    "tag": ["SD卡", "存储卡"]
                  }
                }]
              }
            }]
          }
        },
        "k": 10
      }
    }
  },
  "ext": {
    "lvector": {
      "filter_type": "efficient_filter",
      "hybrid_search_type": "filter_rrf",
      "rrf_rank_constant": "1",
      "ef_search": "200"
    }
  }
}  
""";
searchRequest.setJsonEntity(jsonString);
Response response = restClient.performRequest(searchRequest);
String responseBody = EntityUtils.toString(response.getEntity());
System.out.println("search responseBody = " + responseBody);

Response

search responseBody = {
  "took": 95,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 0,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "search_vector_test",
        "_id": "8195",
        "_score": 1.0,
        "_source": {
          "text_field": "品牌E Class4 32G TF卡(micro SD)手机存储卡",
          "merit": "容量挺大的、速度快、好用、质量好",
          "tag": [
            "存储设备",
            "存储卡",
            "SD卡"
          ],
          "brand": "品牌E"
        }
      }
    ]
  }
}
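The hybrid query above sets hybrid_search_type to filter_rrf and rrf_rank_constant to 1. This topic does not describe how the engine computes the fused score. Assuming it follows the standard reciprocal rank fusion (RRF) formula, in which each result list contributes 1 / (rank_constant + rank) with ranks starting at 1, a document ranked first by both the vector query and the full-text query would score 1/(1+1) + 1/(1+1) = 1.0, which is consistent with the _score shown in the response. The following is a minimal sketch of that formula; the class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (hypothetical class name): standard reciprocal rank fusion.
// Assumption: this mirrors the textbook formula, not a confirmed Lindorm implementation.
final class RrfSketch {

    // Each list holds document IDs ordered from best to worst.
    // A document's fused score is the sum of 1 / (rankConstant + rank) over the lists that contain it.
    static Map<String, Double> fuse(int rankConstant, List<List<String>> rankedLists) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                int rank = i + 1; // ranks start at 1
                scores.merge(list.get(i), 1.0 / (rankConstant + rank), Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> vectorHits = Arrays.asList("8195");
        List<String> fullTextHits = Arrays.asList("8195");
        System.out.println(fuse(1, Arrays.asList(vectorHits, fullTextHits)));
    }
}
```

A larger rank constant flattens the difference between adjacent ranks, so documents that appear in both lists gain relatively more than documents that rank high in only one.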