Full-text and vector hybrid search - Lindorm cloud-native multi-model database - Alibaba Cloud Help Center

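Full-text and vector hybrid search combines full-text search with pure vector search. Compared with using either method alone, it usually returns more accurate results with higher similarity. This topic describes how to use the full-text and vector hybrid search feature of the Lindorm vector engine.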

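Prerequisites

The vector engine is activated. For details, see Activate the vector engine.
The search engine is activated. For details, see the activation guide.
The search engine version is 3.9.10 or later. To view or upgrade the current version, see the search engine release notes and Upgrade the minor version.
Important
If your search engine version is earlier than 3.9.10 but the console shows that it is already the latest version, contact Lindorm technical support (DingTalk ID: s0s3eg3).
The client IP address is added to the Lindorm whitelist. For details, see Configure a whitelist.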

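Preparations

Before you use the advanced features, connect to the search engine with curl. For the procedure and the connection parameters, see Connect to the search engine.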

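Full-text + vector two-way recall (RRF fusion search)

In some query scenarios, you need to take the rankings from both the full-text index and the vector index into account: the results returned by each are weighted according to a scoring rule to produce the final ranking.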

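Create an index

The following example uses the hnsw algorithm.

Important

If you use the ivfpq algorithm, first set knn.offline.construction to true, import the offline data, and then trigger index building. Queries can be run only after the index is built successfully. For details, see Create a vector index and Index building.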

curl -u <username>:<password> -H 'Content-Type: application/json' -XPUT "http://ld-t4n566i****.lindorm.aliyuncs.com:30070/vector_text_hybridSearch?pretty"  -d '
{
 "settings" : {
    "index": {
      "number_of_shards": 2,
      "knn": true
    }
  },
  "mappings": {
    "_source": {
      "excludes": ["vector1"]
    },
    "properties": {
      "vector1": {
        "type": "knn_vector",
        "dimension": 3,
        "data_type": "float",
        "method": {
          "engine": "lvector",
          "name": "hnsw", 
          "space_type": "l2",
          "parameters": {
            "m": 24,
            "ef_construction": 500
         }
       }
      },
      "text_field": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "field1": {
        "type": "long"
      },
      "field2": {
        "type": "keyword"
      }
    }
  }
}'

Write data

curl -u <username>:<password> -H "Content-Type: application/json" -XPOST "http://ld-t4n566i****-proxy-search-pub.lindorm.rds.aliyuncs.com:30070/_bulk?pretty" -d '
{ "index" : { "_index" : "vector_text_hybridSearch", "_id" : "1" } }
{ "field1" : 1, "field2" : "flag1", "vector1": [2.5, 2.3, 2.4], "text_field": "hello test5"}
{ "index" : { "_index" : "vector_text_hybridSearch", "_id" : "2" } }
{ "field1" : 2, "field2" : "flag1", "vector1": [2.6, 2.3, 2.4], "text_field": "hello test6 test5"}
{ "index" : { "_index" : "vector_text_hybridSearch", "_id" : "3" } }
{ "field1" : 3, "field2" : "flag1", "vector1": [2.7, 2.3, 2.4], "text_field": "hello test7"}
{ "index" : { "_index" : "vector_text_hybridSearch", "_id" : "4" } }
{ "field1" : 4, "field2" : "flag2","vector1": [2.8, 2.3, 2.4], "text_field": "hello test8 test7"}
{ "index" : { "_index" : "vector_text_hybridSearch", "_id" : "5" } }
{ "field1" : 5, "field2" : "flag2","vector1": [2.9, 2.3, 2.4], "text_field": "hello test9"}
'
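
The bulk payload above is newline-delimited JSON: one action line followed by one document line per record. As a minimal sketch (the helper name `build_bulk_body` is hypothetical and not part of Lindorm), the same payload could be generated from Python records like this. It only builds the request body and sends nothing.

```python
import json

def build_bulk_body(index, docs):
    """Return an NDJSON bulk payload: one action line plus one source line per doc."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps(source))
    # Elasticsearch-style bulk bodies must end with a newline character.
    return "\n".join(lines) + "\n"

# Two of the sample documents from the curl example above.
docs = [
    (1, {"field1": 1, "field2": "flag1", "vector1": [2.5, 2.3, 2.4], "text_field": "hello test5"}),
    (2, {"field1": 2, "field2": "flag1", "vector1": [2.6, 2.3, 2.4], "text_field": "hello test6 test5"}),
]
body = build_bulk_body("vector_text_hybridSearch", docs)
print(body)
```

The resulting string can be passed as the request body of the `_bulk` call with any HTTP client.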

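Query data (fusion query)

RRF scores are calculated as follows:

At query time, the system uses the rrf_rank_constant parameter to process the topK results returned separately by full-text search and vector search. For each returned document _id, the score is 1/(rrf_rank_constant + rank(i)), where rank(i) is the rank of that document in the result list.

If a document _id appears in the topK results of both full-text search and vector search, its final score is the sum of the scores from the two searches. A document that appears in only one result list keeps only the score from that search.

With rrf_rank_constant = 1, the calculation is as follows: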

# doc   | queryA     | queryB         | score
_id: 1 =  1.0/(1+1)  + 0              = 0.5
_id: 2 =  1.0/(1+2)  + 0              = 0.33
_id: 4 =    0        + 1.0/(1+2)      = 0.33
_id: 5 =    0        + 1.0/(1+1)      = 0.5
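
The calculation above can be reproduced with a few lines of Python. This is a client-side sketch of the formula only, not the engine's implementation; the function name `rrf_fuse` and the tie-breaking by document id are assumptions made for illustration.

```python
from collections import defaultdict

def rrf_fuse(rankings, rank_constant=60):
    """Sum 1 / (rank_constant + rank) over every ranking a document appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties are broken by doc id (an assumption).
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

query_a = ["1", "2"]  # ranked results of queryA
query_b = ["5", "4"]  # ranked results of queryB
fused = rrf_fuse([query_a, query_b], rank_constant=1)
for doc_id, score in fused:
    print(doc_id, round(score, 2))
```

With rrf_rank_constant = 1, documents 1 and 5 each score 0.5 and documents 2 and 4 each score about 0.33, matching the table.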

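Fusion queries can be run through either the _search interface or the _msearch_rrf interface. The two compare as follows:

Interface	Open-source compatibility	Readability	Supports adjusting the full-text/vector ratio
_search	Compatible	Harder to read	Yes
_msearch_rrf	Proprietary interface	Easy to read	No

The following sections show how to write the query with each interface in two scenarios: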

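Scenario without scalar field filtering

Use the open-source _search interface

Pros: Compatible with the open-source _search interface, and supports adjusting the ratio between full-text search and pure vector search through the rrf_knn_weight_factor parameter.

Cons: The syntax is relatively complex.

If filter_type is not set in the ext.lvector extension field, the RRF search consists only of full-text search and pure vector search, and the vector search applies no scalar field filtering.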

curl -u <username>:<password> -H "Content-Type: application/json" -XPOST "http://ld-t4n566i****-proxy-search-vpc.lindorm.rds.aliyuncs.com:30070/vector_text_hybridSearch/_search?pretty" -d '{
  "size": 10,
  "_source": false,
  "query": {
    "knn": {
      "vector1": {
        "vector": [2.8, 2.3, 2.4],
        "filter": {
          "match": {
             "text_field": "test5 test6 test7 test8 test9"
          }
        },
        "k": 10
      }
    }
  },
  "ext": {"lvector": {
    "hybrid_search_type": "filter_rrf", 
    "rrf_rank_constant": "60",
    "rrf_knn_weight_factor": "0.5"
  }}
}'
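
Because the `_search` form nests the full-text clause inside the knn filter, it can help to assemble the body programmatically. The sketch below mirrors the request above; the helper name `build_filter_rrf_query` is hypothetical, the field names `vector1` and `text_field` come from the sample index, and the code only builds the JSON without sending it.

```python
import json

def build_filter_rrf_query(vector, text, k=10, rank_constant=60, knn_weight=0.5):
    """Assemble a _search body for RRF hybrid search without scalar filtering."""
    return {
        "size": k,
        "_source": False,
        "query": {
            "knn": {
                "vector1": {
                    "vector": vector,
                    # With filter_rrf, this match clause is the full-text leg.
                    "filter": {"match": {"text_field": text}},
                    "k": k,
                }
            }
        },
        "ext": {
            "lvector": {
                "hybrid_search_type": "filter_rrf",
                # The curl example passes these values as strings.
                "rrf_rank_constant": str(rank_constant),
                "rrf_knn_weight_factor": str(knn_weight),
            }
        },
    }

body = build_filter_rrf_query([2.8, 2.3, 2.4], "test5 test6 test7 test8 test9")
print(json.dumps(body, indent=2))
```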

If you use the ivfpq algorithm in a scenario without attribute filtering, the ext.lvector extension parameters can be set as follows:

 "ext": {"lvector": {
    "hybrid_search_type": "filter_rrf", 
    "rrf_rank_constant": "60",
    "rrf_knn_weight_factor": "0.5",
    "nprobe": "80", 
    "reorder_factor": "2",
    "client_refactor":"true"
  }}

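Note

You can increase nprobe moderately, for example to 80, 100, 120, 140, or 160. nprobe costs far less performance than reorder_factor, but it should still not be set too high.
If k in the query is large, for example 100 or more, set reorder_factor to 1 or 2.

Response: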

{
  "took": 4,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 0,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.032522473,
    "hits": [
      {
        "_index": "vector_text_hybridSearch",
        "_id": "4",
        "_score": 0.032522473
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "2",
        "_score": 0.03201844
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "5",
        "_score": 0.031746034
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "3",
        "_score": 0.031513646
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "1",
        "_score": 0.031009614
      }
    ]
  }
}
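
As a sanity check, the scores in the response above are numerically consistent with the formula 1/(rrf_rank_constant + rank) summed over the two searches with rrf_rank_constant = 60. The rank pairs in the sketch below are inferred from the scores, not stated in the source, and the sketch does not determine which rank belongs to the full-text search and which to the vector search.

```python
def rrf_score(ranks, rank_constant=60):
    """Sum of 1 / (rank_constant + rank) over the ranks of one document."""
    return sum(1.0 / (rank_constant + r) for r in ranks)

# (_id, returned _score, assumed pair of ranks across the two searches)
observed = [
    ("4", 0.032522473, (1, 2)),
    ("2", 0.03201844, (1, 4)),
    ("5", 0.031746034, (3, 3)),
    ("3", 0.031513646, (2, 5)),
    ("1", 0.031009614, (4, 5)),
]
for doc_id, score, ranks in observed:
    matches = abs(rrf_score(ranks) - score) < 1e-7
    print(doc_id, ranks, f"{rrf_score(ranks):.9f}", matches)
```

Each rank from 1 to 5 is used exactly twice across the pairs, as expected when both searches return all five documents.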

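Use the proprietary _msearch_rrf interface

Pros: The syntax is clearer.

Cons: Not compatible with the open-source _search interface, and does not support adjusting the ratio between full-text search and pure vector search.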

curl -u  <username>:<password> -H 'Content-Type: application/json' -XGET "http://ld-t4n566i****-proxy-search-vpc.lindorm.aliyuncs.com:30070/_msearch_rrf?re_score=true&rrf_rank_constant=60&pretty"  -d '
{"index": "vector_text_hybridSearch"}
{"size":10,"_source":false, "query":{"match":{"text_field":"test5 test6 test7 test8 test9"}}}
{"index": "vector_text_hybridSearch"}
{"size":10,"_source":false,"query":{"knn":{"vector1":{"vector":[2.8,2.3,2.4],"k":10}}}}
'

Note

The request URL must include the re_score=true parameter.
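
The `_msearch_rrf` body shown above is newline-delimited JSON in which each query is preceded by a header line that names the index. A minimal sketch for building it follows; the helper name `build_msearch_rrf_body` is hypothetical, and the trailing newline follows the convention of the open-source `_msearch` format rather than anything the source states for `_msearch_rrf`.

```python
import json

def build_msearch_rrf_body(index, queries):
    """Pair each query with a header line naming the index, one JSON object per line."""
    lines = []
    for query in queries:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps(query))
    # Trailing newline kept, as in the open-source _msearch format (assumption).
    return "\n".join(lines) + "\n"

text_query = {"size": 10, "_source": False,
              "query": {"match": {"text_field": "test5 test6 test7 test8 test9"}}}
knn_query = {"size": 10, "_source": False,
             "query": {"knn": {"vector1": {"vector": [2.8, 2.3, 2.4], "k": 10}}}}
body = build_msearch_rrf_body("vector_text_hybridSearch", [text_query, knn_query])
print(body)
```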

Response:

{
  "took": 6,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 0,
  "_shards": {
    "total": 4,
    "successful": 4,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.032522473,
    "hits": [
      {
        "_index": "vector_text_hybridSearch",
        "_id": "4",
        "_score": 0.032522473
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "2",
        "_score": 0.03201844
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "5",
        "_score": 0.031746034
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "3",
        "_score": 0.031513646
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "1",
        "_score": 0.031009614
      }
    ]
  }
}

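Scenario with scalar field filtering

Use the open-source _search interface

Setting the filter_type parameter in the ext.lvector extension field indicates that the vector search within the RRF search also filters on scalar fields.

Note

To carry filter conditions in an RRF fusion search, put the full-text query condition and the filter condition into two separate bool expressions and join them with bool.must. The first bool expression in must is used for full-text search and produces the full-text relevance score. The second bool expression in must, a bool filter, supplies the filter conditions for the knn search.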
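
The nesting described in this note can be sketched as a small Python helper. The name `build_rrf_filter` is hypothetical and `text_field` comes from the sample index; the result is the value to place under the knn `filter` key, together with `filter_type` in ext.lvector.

```python
import json

def build_rrf_filter(text, scalar_filters):
    """Join the full-text query and the scalar filters as two bool clauses under bool.must."""
    return {
        "bool": {
            "must": [
                # First bool: the scored full-text leg.
                {"bool": {"must": [{"match": {"text_field": {"query": text}}}]}},
                # Second bool: filter conditions applied to the knn search.
                {"bool": {"filter": list(scalar_filters)}},
            ]
        }
    }

flt = build_rrf_filter(
    "test5 test6 test7 test8 test9",
    [{"range": {"field1": {"gt": 2}}}, {"term": {"field2": "flag2"}}],
)
print(json.dumps(flt, indent=2))
```

Keeping the two bool expressions in this order matters, because the note assigns the first to full-text scoring and the second to knn filtering.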

Filter with a single condition

curl -u <username>:<password> -H "Content-Type: application/json" -XPOST "http://ld-t4n566i****-proxy-search-pub.lindorm.rds.aliyuncs.com:30070/vector_text_hybridSearch/_search?pretty" -d '{
  "size": 10,
  "_source": false,
  "query": {
    "knn": {
      "vector1": {
        "vector": [2.8, 2.3, 2.4],
        "filter": {
          "bool": {
             "must": [
                {
                  "bool": {
                    "must":[{
                      "match": {
                        "text_field": {
                          "query": "test5 test6 test7 test8 test9"
                        }
                      }
                    }]
                  }
                },
                {
                  "bool": {
                    "filter": [{
                      "range": {
                        "field1": {
                          "gt": 2
                        }
                      }
                    }]
                  }
                }
              ]
          }
        },
        "k": 10
      }
    }
  },
  "ext": {"lvector": {
    "filter_type": "efficient_filter",
    "hybrid_search_type": "filter_rrf", 
    "rrf_rank_constant": "60"
  }}
}'

If you use the ivfpq algorithm in a scenario with attribute filtering, the ext.lvector extension parameters can be set as follows:

 "ext": {"lvector": {
    "filter_type": "efficient_filter",
    "hybrid_search_type": "filter_rrf", 
    "rrf_rank_constant": "60",
    "rrf_knn_weight_factor": "0.5",
    "nprobe": "80", 
    "reorder_factor": "2",
    "client_refactor":"true"
  }}

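Note

You can increase nprobe moderately, for example to 80, 100, 120, 140, or 160. nprobe costs far less performance than reorder_factor, but it should still not be set too high.
If k in the query is large, for example 100 or more, set reorder_factor to 1 or 2.

Response: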

 {
  "took": 42,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 0,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.032786883,
    "hits": [
      {
        "_index": "vector_text_hybridSearch",
        "_id": "4",
        "_score": 0.032786883
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "3",
        "_score": 0.032002047
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "5",
        "_score": 0.032002047
      }
    ]
  }
}

Filter with multiple conditions

Note

For more ways to write multi-condition filters, see Extended syntax for attribute filter expressions.

curl -u <username>:<password> -H "Content-Type: application/json" -XPOST "http://ld-t4n566i****-proxy-search-vpc.lindorm.rds.aliyuncs.com:30070/vector_text_hybridSearch/_search?pretty" -d '{
  "size": 10,
  "_source": ["field1", "field2"],
  "query": {
    "knn": {
      "vector1": {
        "vector": [2.8, 2.3, 2.4],
        "filter": {
          "bool": {
             "must": [
                {
                  "bool": {
                    "must":[{
                      "match": {
                        "text_field": {
                          "query": "test5 test6 test7 test8 test9"
                        }
                      }
                    }]
                  }
                },
                {
                  "bool": {
                    "filter": [{
                      "range": {
                        "field1": {
                          "gt": 2
                        }
                      }
                    }, 
                    {
                      "term": {
                        "field2":"flag2"
                      }
                    }
                    ]
                  }
                }
              ]
          }
        },
        "k": 100
      }
    }
  },
  "ext": {"lvector": {
    "filter_type": "efficient_filter",
    "hybrid_search_type": "filter_rrf", 
    "rrf_rank_constant": "60"
  }}
}'

Response:

{
  "took": 6,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 0,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.032786883,
    "hits": [
      {
        "_index": "vector_text_hybridSearch",
        "_id": "4",
        "_score": 0.032786883,
        "_source": {
          "field1": 4,
          "field2": "flag2"
        }
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "5",
        "_score": 0.032258064,
        "_source": {
          "field1": 5,
          "field2": "flag2"
        }
      }
    ]
  }
}

Use the proprietary _msearch_rrf interface

curl -u <username>:<password> -H 'Content-Type: application/json' -XGET "http://ld-t4n566i****-proxy-search-vpc.lindorm.aliyuncs.com:30070/_msearch_rrf?re_score=true&rrf_rank_constant=60&pretty"  -d '
{"index": "vector_text_hybridSearch"}
{"size": 10,"_source":false,"query":{"bool":{"must":[{"match":{"text_field":"test5 test6 test7 test8 test9"}}],"filter":[{"range":{"field1":{"gt":2}}}]}}}
{"index": "vector_text_hybridSearch"}
{"size":10,"_source":false,"query":{"knn":{"vector1":{"vector":[2.8,2.3,2.4],"filter":{"range":{"field1":{"gt":2}}},"k":10}}},"ext":{"lvector":{"filter_type":"post_filter"}}}
'

Note

The request URL must include the re_score=true parameter.

Response:

{
  "took" : 6,
  "timed_out" : false,
  "terminated_early" : false,
  "num_reduce_phases" : 0,
  "_shards" : {
    "total" : 4,
    "successful" : 4,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.032786883,
    "hits" : [
      {
        "_index" : "vector_text_hybridSearch",
        "_id" : "3",
        "_score" : 0.032786883
      },
      {
        "_index" : "vector_text_hybridSearch",
        "_id" : "2",
        "_score" : 0.032002047
      },
      {
        "_index" : "vector_text_hybridSearch",
        "_id" : "4",
        "_score" : 0.032002047
      }
    ]
  }
}

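Important

The _msearch_rrf interface returns results differently from the _msearch interface: _msearch returns multiple independent query results, whereas _msearch_rrf fuses the results of the queries with RRF ranking and returns a single list.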

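Parameters

Parameter	Required	Default	Description
filter_type	No	None	The query mode. Valid values: pre_filter, post_filter, and efficient_filter. For details, see the parameter description. Important: This parameter is required in scenarios with scalar field filtering and can be omitted in scenarios without it.
hybrid_search_type	Yes	None	Set to `filter_rrf` to run an RRF fusion search. Note: This parameter is ignored by the proprietary _msearch_rrf interface.
rrf_rank_constant	No	60	The constant used to compute scores in the RRF formula `1/(rrf_rank_constant + rank(i))`.
rrf_window_size	No	topK	The number of intermediate results that full-text search returns. Defaults to the topK of the knn search. Note: This parameter is ignored by the proprietary _msearch_rrf interface.
rrf_knn_weight_factor	No	0.5	Value range: `(0, 1)`. `0.01` represents pure full-text search and `0.99` represents pure vector search. Important: Only search engine versions 3.9.3 and later support this parameter. Note: This parameter is ignored by the proprietary _msearch_rrf interface.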