Full-text and vector hybrid search


Full-text and vector hybrid search combines full-text search with pure vector search. Compared to using only one of these methods, hybrid search typically provides more accurate and relevant results. This topic describes how to use the full-text and vector hybrid search feature of the Lindorm vector engine.

Prerequisites

  • The vector engine is enabled. For more information, see Enable vector engine.

  • The search engine is enabled. For more information, see Activation guide.

  • The search engine is version 3.9.10 or later. To view or upgrade the current version, see Search engine version guide and Minor version update.

    Important

    If your search engine version is earlier than 3.9.10 but the console indicates that you are using the latest version, contact Lindorm technical support (DingTalk ID: s0s3eg3).

  • The client IP address is added to the Lindorm whitelist. For more information, see Configure a whitelist.

Preparations

Before you can use advanced features, you must connect to the search engine using the curl command. For more information about the connection method and parameters, see Connect to the search engine.

Full-text and vector dual-channel retrieval (RRF-based hybrid search)

Some queries need to combine results from both the full-text index and the vector index. The results from each index are scored with Reciprocal Rank Fusion (RRF) and then merged to determine the final ranking.

Create an index

The following example uses the HNSW algorithm.

Important

If you use the IVFPQ algorithm, you must first set knn.offline.construction to true, import offline data, and then start index construction. You can perform queries only after the index construction is complete. For more information, see Create a vector index and Index construction.

curl -u <username>:<password> -H 'Content-Type: application/json' -XPUT "http://ld-t4n566i****.lindorm.aliyuncs.com:30070/vector_text_hybridSearch?pretty"  -d '
{
 "settings" : {
    "index": {
      "number_of_shards": 2,
      "knn": true
    }
  },
  "mappings": {
    "_source": {
      "excludes": ["vector1"]
    },
    "properties": {
      "vector1": {
        "type": "knn_vector",
        "dimension": 3,
        "method": {
          "engine": "lvector",
          "name": "hnsw", 
          "space_type": "l2",
          "parameters": {
            "m": 24,
            "ef_construction": 500
         }
       }
      },
      "text_field": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "field1": {
        "type": "long"
      },
      "field2": {
        "type": "keyword"
      }
    }
  }
}'

Write data

curl -u <username>:<password> -H "Content-Type: application/json" -XPOST "http://ld-t4n566i****-proxy-search-pub.lindorm.rds.aliyuncs.com:30070/_bulk?pretty" -d '
{ "index" : { "_index" : "vector_text_hybridSearch", "_id" : "1" } }
{ "field1" : 1, "field2" : "flag1", "vector1": [2.5, 2.3, 2.4], "text_field": "hello test5"}
{ "index" : { "_index" : "vector_text_hybridSearch", "_id" : "2" } }
{ "field1" : 2, "field2" : "flag1", "vector1": [2.6, 2.3, 2.4], "text_field": "hello test6 test5"}
{ "index" : { "_index" : "vector_text_hybridSearch", "_id" : "3" } }
{ "field1" : 3, "field2" : "flag1", "vector1": [2.7, 2.3, 2.4], "text_field": "hello test7"}
{ "index" : { "_index" : "vector_text_hybridSearch", "_id" : "4" } }
{ "field1" : 4, "field2" : "flag2","vector1": [2.8, 2.3, 2.4], "text_field": "hello test8 test7"}
{ "index" : { "_index" : "vector_text_hybridSearch", "_id" : "5" } }
{ "field1" : 5, "field2" : "flag2","vector1": [2.9, 2.3, 2.4], "text_field": "hello test9"}
'

Query data (hybrid query)

The Reciprocal Rank Fusion (RRF) calculation is as follows:

When you perform a query, the system retrieves the top K results from the full-text search and from the vector search separately. It then calculates a score for each returned document ID in each result set using the RRF formula 1/(rrf_rank_constant + rank(i)), where rrf_rank_constant is a configurable parameter and rank(i) is the 1-based rank of the document in that result set.

If a document ID appears in the top K results of both the full-text search and the vector search, its final score is the sum of its scores from both search methods. If a document ID appears in the results of only one search method, its final score is the score from that method.

For example, if rrf_rank_constant = 1, the results are calculated as follows:

# doc   | queryA     | queryB         | score
_id: 1 =  1.0/(1+1)  + 0              = 0.5
_id: 2 =  1.0/(1+2)  + 0              = 0.33
_id: 4 =    0        + 1.0/(1+2)      = 0.33
_id: 5 =    0        + 1.0/(1+1)      = 0.5
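
The calculation above can be sketched in a few lines of Python. This is a minimal illustration of the formula as described in this topic, not the engine's implementation; the function name rrf_fuse and the two input rankings are hypothetical and chosen only to reproduce the table.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, rank_constant=60):
    """Sum 1 / (rank_constant + rank) over every list in which a document appears."""
    scores = defaultdict(float)
    for ranked_ids in ranked_lists:
        # rank is 1-based: the top document of each list has rank 1.
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

query_a = ["1", "2"]  # hypothetical ranking returned by queryA
query_b = ["5", "4"]  # hypothetical ranking returned by queryB
fused = rrf_fuse([query_a, query_b], rank_constant=1)
for doc_id, score in fused:
    print(doc_id, round(score, 2))
```

With rrf_rank_constant = 1, documents 1 and 5 each score 0.5 and documents 2 and 4 each score about 0.33, which matches the table. A document that appears in both lists would receive the sum of its two terms.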

You can perform hybrid queries using the _search interface or the _msearch_rrf interface. The following table compares the two interfaces.

Interface    | Open source compatibility | Readability       | Adjustable full-text/vector ratio
_search      | Compatible                | Difficult to read | Yes
_msearch_rrf | Proprietary               | Easy to read      | No

The following examples show how to use the _search and _msearch_rrf interfaces in two scenarios:

Scenario without scalar field filtering

Use the open source _search interface

Pros: It is compatible with the open source _search interface and lets you adjust the weight ratio between the full-text search and the vector search using the rrf_knn_weight_factor parameter.

Cons: The syntax is complex.

If you do not set the filter_type parameter in the ext.lvector extension field, the RRF search includes only the full-text search and the vector search. Scalar field filtering is not performed during the vector search.

curl -u <username>:<password> -H "Content-Type: application/json" -XPOST "http://ld-t4n566i****-proxy-search-vpc.lindorm.rds.aliyuncs.com:30070/vector_text_hybridSearch/_search?pretty" -d '{
  "size": 10,
  "_source": false,
  "query": {
    "knn": {
      "vector1": {
        "vector": [2.8, 2.3, 2.4],
        "filter": {
          "match": {
             "text_field": "test5 test6 test7 test8 test9"
          }
        },
        "k": 10
      }
    }
  },
  "ext": {"lvector": {
    "hybrid_search_type": "filter_rrf", 
    "rrf_rank_constant": "60",
    "rrf_knn_weight_factor": "0.5"
  }}
}'

If you use the IVFPQ algorithm without property filtering, you can set the ext.lvector extension parameters as follows:

 "ext": {"lvector": {
    "hybrid_search_type": "filter_rrf", 
    "rrf_rank_constant": "60",
    "rrf_knn_weight_factor": "0.5",
    "nprobe": "80", 
    "reorder_factor": "2",
    "client_refactor":"true"
  }}
Note
  • You can increase the value of nprobe to 80, 100, 120, 140, or 160. The nprobe parameter has a much smaller impact on performance than the reorder_factor parameter. However, do not set the nprobe value too high.

  • If the value of the k parameter in the query statement is large, such as 100 or greater, set reorder_factor to 1 or 2.

Results:

{
  "took": 4,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 0,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.032522473,
    "hits": [
      {
        "_index": "vector_text_hybridSearch",
        "_id": "4",
        "_score": 0.032522473
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "2",
        "_score": 0.03201844
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "5",
        "_score": 0.031746034
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "3",
        "_score": 0.031513646
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "1",
        "_score": 0.031009614
      }
    ]
  }
}
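
As a worked check of the formula against this sample response, the following Python sketch searches for pairs of ranks whose two RRF terms add up to each returned _score, with rrf_rank_constant = 60. The helper name rank_pairs_for and the tolerance are assumptions made for illustration. This topic does not state how rrf_knn_weight_factor enters the calculation, so treat the result as an observation about this particular response and not as a specification.

```python
from itertools import combinations_with_replacement

def rank_pairs_for(score, rank_constant=60, max_rank=5, tol=1e-7):
    """Find unordered rank pairs (a, b) whose two RRF terms add up to the given score."""
    matches = []
    for a, b in combinations_with_replacement(range(1, max_rank + 1), 2):
        total = 1.0 / (rank_constant + a) + 1.0 / (rank_constant + b)
        if abs(total - score) < tol:
            matches.append((a, b))
    return matches

# _id -> _score, copied from the sample response above.
response_scores = {
    "4": 0.032522473,
    "2": 0.03201844,
    "5": 0.031746034,
    "3": 0.031513646,
    "1": 0.031009614,
}
decomposed = {doc_id: rank_pairs_for(s) for doc_id, s in response_scores.items()}
for doc_id, pairs in decomposed.items():
    print(doc_id, pairs)
```

Each score matches exactly one pair. For example, document 4 matches ranks (1, 2), that is, 1/61 + 1/62. The pairs are unordered, so the check does not tell you which rank came from the full-text search and which from the vector search.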

Use the proprietary _msearch_rrf interface

Pros: The syntax is clearer.

Cons: It is not compatible with the open source _search interface and does not support adjusting the weight ratio between the full-text search and the vector search.

curl -u <username>:<password> -H 'Content-Type: application/json' -XGET "http://ld-t4n566i****-proxy-search-vpc.lindorm.aliyuncs.com:30070/_msearch_rrf?re_score=true&rrf_rank_constant=60&pretty" -d '
{"index": "vector_text_hybridSearch"}
{"size":10,"_source":false, "query":{"match":{"text_field":"test5 test6 test7 test8 test9"}}}
{"index": "vector_text_hybridSearch"}
{"size":10,"_source":false,"query":{"knn":{"vector1":{"vector":[2.8,2.3,2.4],"k":10}}}}
'
Note

You must add re_score=true to the connection parameters.
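
If you generate the request from application code, the URL and the newline-delimited body can be assembled as in the following Python sketch. It only builds strings and sends nothing. The function name build_msearch_rrf_request and the base URL are placeholders introduced for illustration; replace the base URL with your own search engine endpoint.

```python
import json
from urllib.parse import urlencode

def build_msearch_rrf_request(base_url, index_name, queries, rank_constant=60):
    """Return (url, ndjson_body) for an _msearch_rrf call."""
    # re_score=true is required by this interface, so it is always included.
    params = {"re_score": "true", "rrf_rank_constant": rank_constant, "pretty": "true"}
    url = f"{base_url}/_msearch_rrf?{urlencode(params)}"
    lines = []
    for query in queries:
        lines.append(json.dumps({"index": index_name}))  # header line
        lines.append(json.dumps(query))                  # query line
    # End the body with a newline, as in the curl example.
    return url, "\n".join(lines) + "\n"

text_query = {"size": 10, "_source": False,
              "query": {"match": {"text_field": "test5 test6 test7 test8 test9"}}}
knn_query = {"size": 10, "_source": False,
             "query": {"knn": {"vector1": {"vector": [2.8, 2.3, 2.4], "k": 10}}}}

# Placeholder endpoint, not a real Lindorm address.
msearch_url, msearch_body = build_msearch_rrf_request(
    "http://search.example.internal:30070",
    "vector_text_hybridSearch",
    [text_query, knn_query],
)
print(msearch_url)
print(msearch_body, end="")
```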

Results:

{
  "took": 6,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 0,
  "_shards": {
    "total": 4,
    "successful": 4,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.032522473,
    "hits": [
      {
        "_index": "vector_text_hybridSearch",
        "_id": "4",
        "_score": 0.032522473
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "2",
        "_score": 0.03201844
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "5",
        "_score": 0.031746034
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "3",
        "_score": 0.031513646
      },
      {
        "_index": "vector_text_hybridSearch",
        "_id": "1",
        "_score": 0.031009614
      }
    ]
  }
}

Scenario with scalar field filtering

Use the open source _search interface

You can set the filter_type parameter in the ext.lvector extension field to specify that the vector search component of the RRF search also requires scalar field filtering.

Note

For an RRF-based hybrid search with a filter condition, you must place the full-text search query condition and the filter condition into two separate bool expressions connected by bool.must. The first bool expression in must is used for the full-text search to calculate the full-text match score. The second bool filter expression is used as the filter condition for the k-nearest neighbor (k-NN) search.
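
The nesting described in this note is easy to get wrong by hand. The following Python sketch builds the request body programmatically; the helper name build_filter_rrf_query is hypothetical, and the field names and values are taken from the sample index in this topic.

```python
import json

def build_filter_rrf_query(vector_field, vector, text_field, text, filters, k=10):
    """Build a knn query whose filter holds the text match and the scalar filters
    in two separate bool expressions under bool.must."""
    text_clause = {"bool": {"must": [{"match": {text_field: {"query": text}}}]}}
    filter_clause = {"bool": {"filter": list(filters)}}
    return {
        "knn": {
            vector_field: {
                "vector": vector,
                # First expression: full-text scoring. Second: k-NN filter condition.
                "filter": {"bool": {"must": [text_clause, filter_clause]}},
                "k": k,
            }
        }
    }

query = build_filter_rrf_query(
    "vector1", [2.8, 2.3, 2.4],
    "text_field", "test5 test6 test7 test8 test9",
    filters=[{"range": {"field1": {"gt": 2}}}],
)
request_body = {
    "size": 10,
    "_source": False,
    "query": query,
    "ext": {"lvector": {
        "filter_type": "efficient_filter",
        "hybrid_search_type": "filter_rrf",
        "rrf_rank_constant": "60",
    }},
}
print(json.dumps(request_body, indent=2))
```

To add more filter conditions, append further clauses such as {"term": {"field2": "flag2"}} to the filters list; they all end up in the second bool expression.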

  • Set a filter with a single condition

    curl -u <username>:<password> -H "Content-Type: application/json" -XPOST "http://ld-t4n566i****-proxy-search-pub.lindorm.rds.aliyuncs.com:30070/vector_text_hybridSearch/_search?pretty" -d '{
      "size": 10,
      "_source": false,
      "query": {
        "knn": {
          "vector1": {
            "vector": [2.8, 2.3, 2.4],
            "filter": {
              "bool": {
                 "must": [
                    {
                      "bool": {
                        "must":[{
                          "match": {
                            "text_field": {
                              "query": "test5 test6 test7 test8 test9"
                            }
                          }
                        }]
                      }
                    },
                    {
                      "bool": {
                        "filter": [{
                          "range": {
                            "field1": {
                              "gt": 2
                            }
                          }
                        }]
                      }
                    }
                  ]
              }
            },
            "k": 10
          }
        }
      },
      "ext": {"lvector": {
        "filter_type": "efficient_filter",
        "hybrid_search_type": "filter_rrf", 
        "rrf_rank_constant": "60"
      }}
    }'

    If you use the IVFPQ algorithm with property filtering, you can set the ext.lvector extension parameters as follows:

     "ext": {"lvector": {
        "filter_type": "efficient_filter",
        "hybrid_search_type": "filter_rrf", 
        "rrf_rank_constant": "60",
        "rrf_knn_weight_factor": "0.5",
        "nprobe": "80", 
        "reorder_factor": "2",
        "client_refactor":"true"
      }}
    Note
    • You can increase the value of nprobe to 80, 100, 120, 140, or 160. The nprobe parameter has a much smaller impact on performance than the reorder_factor parameter. However, do not set the nprobe value too high.

    • If the value of the k parameter in the query statement is large, such as 100 or greater, set reorder_factor to 1 or 2.

    Results:

     {
      "took": 42,
      "timed_out": false,
      "terminated_early": false,
      "num_reduce_phases": 0,
      "_shards": {
        "total": 2,
        "successful": 2,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 3,
          "relation": "eq"
        },
        "max_score": 0.032786883,
        "hits": [
          {
            "_index": "vector_text_hybridSearch",
            "_id": "4",
            "_score": 0.032786883
          },
          {
            "_index": "vector_text_hybridSearch",
            "_id": "3",
            "_score": 0.032002047
          },
          {
            "_index": "vector_text_hybridSearch",
            "_id": "5",
            "_score": 0.032002047
          }
        ]
      }
    }
  • Set a filter with multiple conditions

    Note

    For more information about how to write multi-condition filters, see Additional property filter expressions.

    curl -u <username>:<password> -H "Content-Type: application/json" -XPOST "http://ld-t4n566i****-proxy-search-vpc.lindorm.rds.aliyuncs.com:30070/vector_text_hybridSearch/_search?pretty" -d '{
      "size": 10,
      "_source": ["field1", "field2"],
      "query": {
        "knn": {
          "vector1": {
            "vector": [2.8, 2.3, 2.4],
            "filter": {
              "bool": {
                 "must": [
                    {
                      "bool": {
                        "must":[{
                          "match": {
                            "text_field": {
                              "query": "test5 test6 test7 test8 test9"
                            }
                          }
                        }]
                      }
                    },
                    {
                      "bool": {
                        "filter": [{
                          "range": {
                            "field1": {
                              "gt": 2
                            }
                          }
                        }, 
                        {
                          "term": {
                            "field2":"flag2"
                          }
                        }
                        ]
                      }
                    }
                  ]
              }
            },
            "k": 100
          }
        }
      },
      "ext": {"lvector": {
        "filter_type": "efficient_filter",
        "hybrid_search_type": "filter_rrf", 
        "rrf_rank_constant": "60"
      }}
    }'

    Results:

    {
      "took": 6,
      "timed_out": false,
      "terminated_early": false,
      "num_reduce_phases": 0,
      "_shards": {
        "total": 2,
        "successful": 2,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 2,
          "relation": "eq"
        },
        "max_score": 0.032786883,
        "hits": [
          {
            "_index": "vector_text_hybridSearch",
            "_id": "4",
            "_score": 0.032786883,
            "_source": {
              "field1": 4,
              "field2": "flag2"
            }
          },
          {
            "_index": "vector_text_hybridSearch",
            "_id": "5",
            "_score": 0.032258064,
            "_source": {
              "field1": 5,
              "field2": "flag2"
            }
          }
        ]
      }
    }
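
To see why only documents 4 and 5 can be returned by the multi-condition query, you can mirror the two filter clauses locally against the sample data written earlier. The following Python sketch reproduces only the boolean logic of the filter, not the engine's scoring or ranking; the function name passes_filter is hypothetical.

```python
# Scalar fields of the five sample documents written in the "Write data" step.
sample_docs = {
    "1": {"field1": 1, "field2": "flag1"},
    "2": {"field1": 2, "field2": "flag1"},
    "3": {"field1": 3, "field2": "flag1"},
    "4": {"field1": 4, "field2": "flag2"},
    "5": {"field1": 5, "field2": "flag2"},
}

def passes_filter(doc):
    """Mirror the range clause (field1 > 2) and the term clause (field2 == "flag2")."""
    return doc["field1"] > 2 and doc["field2"] == "flag2"

matching_ids = sorted(doc_id for doc_id, doc in sample_docs.items() if passes_filter(doc))
print(matching_ids)
```

Document 3 satisfies the range clause but not the term clause, so it appears in the single-condition results and is excluded here.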

Use the proprietary _msearch_rrf interface

curl -u <username>:<password> -H 'Content-Type: application/json' -XGET "http://ld-t4n566i****-proxy-search-vpc.lindorm.aliyuncs.com:30070/_msearch_rrf?re_score=true&rrf_rank_constant=60&pretty" -d '
{"index": "vector_text_hybridSearch"}
{"size": 10,"_source":false,"query":{"bool":{"must":[{"match":{"text_field":"test5 test6 test7 test8 test9"}}],"filter":[{"range":{"field1":{"gt":2}}}]}}}
{"index": "vector_text_hybridSearch"}
{"size":10,"_source":false,"query":{"knn":{"vector1":{"vector":[2.8,2.3,2.4],"filter":{"range":{"field1":{"gt":2}}},"k":10}}},"ext":{"lvector":{"filter_type":"post_filter"}}}
'
Note

You must add re_score=true to the connection parameters.

Results:

{
  "took" : 6,
  "timed_out" : false,
  "terminated_early" : false,
  "num_reduce_phases" : 0,
  "_shards" : {
    "total" : 4,
    "successful" : 4,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.032786883,
    "hits" : [
      {
        "_index" : "vector_text_hybridSearch",
        "_id" : "3",
        "_score" : 0.032786883
      },
      {
        "_index" : "vector_text_hybridSearch",
        "_id" : "2",
        "_score" : 0.032002047
      },
      {
        "_index" : "vector_text_hybridSearch",
        "_id" : "4",
        "_score" : 0.032002047
      }
    ]
  }
}
Important

The results from the _msearch_rrf interface differ from the results from the _msearch interface. The _msearch interface returns multiple independent query result sets, whereas the _msearch_rrf interface merges and sorts multiple query result sets using RRF before it returns them.

Parameter description

Parameter | Required | Default value | Description
filter_type | No | None | The query mode. Valid values: pre_filter, post_filter, and efficient_filter. For more information about the parameters, see Parameter description. Important: This parameter is required for scenarios that include scalar field filtering. It is not required for scenarios without scalar field filtering.
hybrid_search_type | Yes | None | Set this parameter to filter_rrf to perform an RRF-based hybrid search. Note: This parameter is ignored when you use the proprietary `_msearch_rrf` interface.
rrf_rank_constant | No | 60 | The constant used for score calculation in the RRF formula. The formula is 1/(rrf_rank_constant + rank(i)).
rrf_window_size | No | topK | The number of intermediate results to return from the full-text search. By default, this is the same as the topK value of the k-NN search. Note: This parameter is ignored when you use the proprietary `_msearch_rrf` interface.
rrf_knn_weight_factor | No | 0.5 | The weight ratio between the full-text search and the vector search. The value must be in the range of (0, 1). A value close to 0, such as 0.01, makes the query effectively a pure full-text search, and a value close to 1, such as 0.99, makes it effectively a pure vector search. Important: This parameter is supported only by search engine versions 3.9.3 and later. Note: This parameter is ignored when you use the proprietary `_msearch_rrf` interface.
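
The constraints in this table can be checked on the client before a request is sent. The following Python sketch assembles the ext.lvector section for the _search interface and validates the documented value ranges. The function name build_lvector_ext is hypothetical. Values are serialized as strings because the request examples in this topic pass them that way; sending rrf_window_size as a string is an assumption, because this topic does not show that parameter in a request.

```python
VALID_FILTER_TYPES = {"pre_filter", "post_filter", "efficient_filter"}

def build_lvector_ext(filter_type=None, rank_constant=60, knn_weight_factor=0.5,
                      window_size=None):
    """Assemble ext.lvector for the _search interface and check documented constraints."""
    if filter_type is not None and filter_type not in VALID_FILTER_TYPES:
        raise ValueError(f"unsupported filter_type: {filter_type}")
    if not 0 < knn_weight_factor < 1:
        raise ValueError("rrf_knn_weight_factor must be in the open interval (0, 1)")
    ext = {
        "hybrid_search_type": "filter_rrf",  # required for RRF-based hybrid search
        "rrf_rank_constant": str(rank_constant),
        "rrf_knn_weight_factor": str(knn_weight_factor),
    }
    if filter_type is not None:
        # Only needed when the query includes scalar field filtering.
        ext["filter_type"] = filter_type
    if window_size is not None:
        ext["rrf_window_size"] = str(window_size)  # string form is an assumption
    return {"ext": {"lvector": ext}}

ext_params = build_lvector_ext(filter_type="efficient_filter")
print(ext_params)
```

Merge the returned dictionary into the request body next to size and query. Calling build_lvector_ext(knn_weight_factor=1.0) raises ValueError because the range excludes both endpoints.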