Hybrid search

更新时间:
复制 MD 格式

In text retrieval, hybrid search combines full-text search and vector search to deliver more accurate results than either method alone. Reciprocal Rank Fusion (RRF) is an algorithm that optimizes the final ranking by fusing results from multiple search systems. It calculates a combined score by summing the reciprocal ranks from each system. This topic describes how to perform a dual-path recall using full-text and vector search and then implement hybrid search with the RRF algorithm in AnalyticDB for MySQL.

Prerequisites

  • xuanwu_v1 table engine: Only `l2_distance` is supported. Your cluster must run kernel version 3.1.4.0 or later. The vector index feature is more stable on clusters that run kernel version 3.1.5.16, 3.1.6.8, 3.1.8.6, or a later version.

  • xuanwu_v2 table engine: `l2_distance` and `cosine_similarity` are supported. To use `l2_distance`, your cluster must run kernel version 3.2.6.0 or later. To use `cosine_similarity`, your cluster must run kernel version 3.2.7.0 or later.

The examples in this topic use cosine_similarity for vector search.

Note

To view and update the minor version, go to the Configuration Information section on the Cluster Information page in the AnalyticDB for MySQL console.

Prepare data

  1. Create a table and indexes

    Create a document table that includes a text field, a vector field, and scalar filter fields. Then, create a full-text index and a vector index on the table.

    DROP TABLE IF EXISTS documents;
    
    -- Create a table with text, vector, and scalar fields, plus full-text and vector indexes.
    CREATE TABLE documents (
        id INT,
        text_field TEXT, -- Full-text search field
        float_feature ARRAY < FLOAT >(3), -- Vector field
        field1 INT, -- Scalar field
        field2 TEXT, -- Scalar field
        FULLTEXT INDEX idx_text_field(`text_field`) WITH ANALYZER ik, -- Use the IK analyzer
        ANN INDEX idx_float_feature(`float_feature`),
        PRIMARY KEY (id)
    ) DISTRIBUTED BY HASH(id);
  2. Import data

    INSERT INTO documents (id, text_field, float_feature, field1, field2) VALUES
      (1, 'Customers need better products and services', '[2.5, 2.3, 2.4]', 1, 'flag1'),
      (2, 'Wuhan Yangtze River Bridge', '[2.6, 2.3, 2.4]', 2, 'flag1'),
      (3, 'Hangzhou, Zhejiang Province', '[2.7, 2.3, 2.4]', 3, 'flag1'),
      (4, 'The user value and business value of products', '[2.8, 2.3, 2.4]', 4, 'flag2');

Single-path recall

  • Vector search

    SELECT
      id,
      similarity,
      ROW_NUMBER() OVER (
        ORDER BY
          similarity DESC
      ) AS vec_rank
    FROM
      (
        SELECT
          id,
          cosine_similarity(float_feature, '[2.8, 2.3, 2.4]') AS similarity
        FROM
          documents
        WHERE
          field1 > 1
        ORDER BY
          similarity DESC
        LIMIT
          100
      ) inner_vec

    Query results

    id  similarity  vec_rank
    4   1.0000001   1
    3   0.99984056  2
    2   0.999343    3
  • Full-text search

    SELECT
      id,
      text_field,
      MATCH(text_field) AGAINST ('product services') AS match_score,
      ROW_NUMBER() OVER (
        ORDER BY
          MATCH(text_field) AGAINST ('product services') DESC
      ) AS txt_rank
    FROM
      documents
    WHERE
      MATCH(text_field) AGAINST ('product services')
    ORDER BY
      match_score DESC
    LIMIT
      100;

    Query results

    id   text_field                                    match_score           txt_rank
    1    Customers need better products and services     0.2615291476249695    1
    4    The user value and business value of products   0.13076457381248474   2

Hybrid search

Use a common table expression (CTE) to perform full-text search and vector search separately. Then, use the RRF algorithm to fuse the results from both paths.

-- Step 1: Perform vector search with scalar filtering
WITH vector_search AS (
  SELECT
    id,
    similarity,
    ROW_NUMBER() OVER (
      ORDER BY
        similarity DESC
    ) AS vec_rank
  FROM
    (
      SELECT
        id,
        cosine_similarity(float_feature, '[2.8, 2.3, 2.4]') AS similarity
      FROM
        documents
      WHERE
        field1 > 1
      ORDER BY
        similarity DESC
      LIMIT
        100
    ) inner_vec
), 
-- Step 2: Perform full-text search
fulltext_search AS (
  SELECT
    id,
    MATCH(text_field) AGAINST ('product services') AS match_score,
    ROW_NUMBER() OVER (
      ORDER BY
        MATCH(text_field) AGAINST ('product services') DESC
    ) AS txt_rank
  FROM
    documents
  WHERE
    MATCH(text_field) AGAINST ('product services')
  ORDER BY
    match_score DESC
  LIMIT
    100
)
SELECT
  COALESCE(fulltext_search.id, vector_search.id) AS doc_id,
  -- Calculate the RRF score using the formula: sum(1 / (k + rank)), where the constant k is 60.
  (
    CASE WHEN fulltext_search.txt_rank IS NOT NULL THEN 1.0 / (60 + fulltext_search.txt_rank) ELSE 0 END
  ) + (
    CASE WHEN vector_search.vec_rank IS NOT NULL THEN 1.0 / (60 + vector_search.vec_rank) ELSE 0 END
  ) AS rrf_score,
  -- (Optional) Join with the original table to display additional fields.
  documents.text_field,
  documents.field1,
  documents.field2
FROM
  fulltext_search FULL
  JOIN vector_search ON fulltext_search.id = vector_search.id
  LEFT JOIN documents ON COALESCE(fulltext_search.id, vector_search.id) = documents.id 
-- Sort by rrf_score in descending order and return the top 10 results.
ORDER BY
  rrf_score DESC
LIMIT
  10;

Query results

doc_id  rrf_score                text_field                                     field1   field2
4       0.032522474881015335801  The user value and business value of products  4        flag2
1       0.016393442622950819672  Customers need better products and services    1        flag1
3       0.016129032258064516129  Hangzhou, Zhejiang Province                    3        flag1
2       0.015873015873015873016  Wuhan Yangtze River Bridge                     2        flag1