Hybrid search


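In text retrieval workloads, hybrid search that combines full-text search with vector search can return more accurate results than either method alone. Reciprocal Rank Fusion (RRF) is an algorithm that refines the final ordering by merging the ranked results of multiple retrieval systems: for each document, it sums the reciprocals of the document's ranks across those systems to produce a combined score. This topic describes how to implement hybrid search in AnalyticDB for MySQL by running full-text search and vector search as two recall paths and fusing their results with RRF.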
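
The reciprocal-rank sum described above can be written compactly as follows. The notation is an assumption introduced here for illustration: R is the set of retrieval paths, rank_r(d) is the 1-based rank of document d in path r, and k is a smoothing constant (the hybrid query in this topic uses 60). A document that is missing from a path contributes 0 for that path.

```latex
\mathrm{RRF}(d) \;=\; \sum_{r \in R,\ d \in r} \frac{1}{k + \mathrm{rank}_r(d)}
```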

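Prerequisites

  • xuanwu_v1 table engine: supports only l2_distance. The cluster kernel version must be 3.1.4.0 or later. Vector indexes are relatively stable on clusters that run 3.1.5.16, 3.1.6.8, 3.1.8.6, or later.

  • xuanwu_v2 table engine: supports l2_distance and cosine_similarity. l2_distance requires kernel version 3.2.6.0 or later, and cosine_similarity requires kernel version 3.2.7.0 or later.

The vector search examples in this topic use cosine_similarity.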

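Note

To view or upgrade the kernel version, go to the Configuration Information section of the Cluster Information page in the AnalyticDB for MySQL console.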

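Data preparation

  1. Create a table and indexes

    Create a documents table that contains a text column, a vector column, and scalar filter columns, and create a full-text index and a vector index on it.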

    DROP TABLE IF EXISTS documents;
    
    -- Create the documents table (text, vector, and scalar filter columns) with a full-text index and a vector index
    CREATE TABLE documents (
        id INT,
        text_field TEXT, -- Column used for full-text search
        float_feature ARRAY < FLOAT >(3), -- Vector column
        field1 INT, -- Scalar column
        field2 TEXT, -- Scalar column
        FULLTEXT INDEX idx_text_field(`text_field`) WITH ANALYZER ik, -- Use the IK analyzer
        ANN INDEX idx_float_feature(`float_feature`),
        PRIMARY KEY (id)
    ) DISTRIBUTED BY HASH(id);
  2. Import data

    INSERT INTO documents (id, text_field, float_feature, field1, field2) VALUES
      (1, '客户需要更好的产品和服务', '[2.5, 2.3, 2.4]', 1, 'flag1'),
      (2, '武汉市长江大桥', '[2.6, 2.3, 2.4]', 2, 'flag1'),
      (3, 'Hangzhou, Zhejiang Province', '[2.7, 2.3, 2.4]', 3, 'flag1'),
      (4, '产品的用户价值和商业价值', '[2.8, 2.3, 2.4]', 4, 'flag2');

Single-path retrieval

  • Vector search

    SELECT
      id,
      similarity,
      ROW_NUMBER() OVER (
        ORDER BY
          similarity DESC
      ) AS vec_rank
    FROM
      (
        SELECT
          id,
          cosine_similarity(float_feature, '[2.8, 2.3, 2.4]') AS similarity
        FROM
          documents
        WHERE
          field1 > 1
        ORDER BY
          similarity DESC
        LIMIT
          100
      ) inner_vec

    Query result

    id  similarity  vec_rank
    4   1.0000001   1
    3   0.99984056  2
    2   0.999343    3
  • Full-text search

    SELECT
      id,
      text_field,
      MATCH(text_field) AGAINST ('产品服务') AS match_score,
      ROW_NUMBER() OVER (
        ORDER BY
          MATCH(text_field) AGAINST ('产品服务') DESC
      ) AS txt_rank
    FROM
      documents
    WHERE
      MATCH(text_field) AGAINST ('产品服务')
    ORDER BY
      match_score DESC
    LIMIT
      100;

    Query result

    id   text_field                match_score           txt_rank
    1    客户需要更好的产品和服务    0.2615291476249695    1
    4    产品的用户价值和商业价值    0.13076457381248474   2
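
As a sanity check on the vector search result above, the similarity values can be recomputed outside the database. The following is a minimal sketch in plain Python, assuming that cosine_similarity is the standard cosine of the angle between two vectors. The helper function and the docs dictionary are illustrative names, not part of AnalyticDB for MySQL. The values are computed in double precision, so they can differ from the engine's output in the last digits.

```python
import math

def cosine_similarity(a, b):
    """Standard cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Sample rows from the documents table: id -> (float_feature, field1)
docs = {
    1: ([2.5, 2.3, 2.4], 1),
    2: ([2.6, 2.3, 2.4], 2),
    3: ([2.7, 2.3, 2.4], 3),
    4: ([2.8, 2.3, 2.4], 4),
}
query = [2.8, 2.3, 2.4]

# Apply the same scalar filter as the SQL (field1 > 1), then sort by similarity
ranked = sorted(
    ((doc_id, cosine_similarity(vec, query))
     for doc_id, (vec, field1) in docs.items() if field1 > 1),
    key=lambda pair: pair[1],
    reverse=True,
)

for vec_rank, (doc_id, similarity) in enumerate(ranked, start=1):
    print(doc_id, round(similarity, 6), vec_rank)
```

The ordering (4, 3, 2) matches the vec_rank column returned by the query. Document 1 is absent because the filter field1 > 1 excludes it.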

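Hybrid search

Use common table expressions (CTEs) to run the vector search and the full-text search separately, and then fuse the two result sets with the RRF algorithm.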

-- Step 1: vector search with scalar filtering
WITH vector_search AS (
  SELECT
    id,
    similarity,
    ROW_NUMBER() OVER (
      ORDER BY
        similarity DESC
    ) AS vec_rank
  FROM
    (
      SELECT
        id,
        cosine_similarity(float_feature, '[2.8, 2.3, 2.4]') AS similarity
      FROM
        documents
      WHERE
        field1 > 1
      ORDER BY
        similarity DESC
      LIMIT
        100
    ) inner_vec
),
-- Step 2: full-text search
fulltext_search AS (
  SELECT
    id,
    MATCH(text_field) AGAINST ('产品服务') AS match_score,
    ROW_NUMBER() OVER (
      ORDER BY
        MATCH(text_field) AGAINST ('产品服务') DESC
    ) AS txt_rank
  FROM
    documents
  WHERE
    MATCH(text_field) AGAINST ('产品服务')
  ORDER BY
    match_score DESC
  LIMIT
    100
)
SELECT
  COALESCE(fulltext_search.id, vector_search.id) AS doc_id,
  -- RRF score formula: sum(1 / (rrf_rank_constant + rank)), with rrf_rank_constant set to 60
  (
    CASE WHEN fulltext_search.txt_rank IS NOT NULL THEN 1.0 / (60 + fulltext_search.txt_rank) ELSE 0 END
  ) + (
    CASE WHEN vector_search.vec_rank IS NOT NULL THEN 1.0 / (60 + vector_search.vec_rank) ELSE 0 END
  ) AS rrf_score,
  -- Join the original columns for easier inspection (optional)
  documents.text_field,
  documents.field1,
  documents.field2
FROM
  fulltext_search FULL
  JOIN vector_search ON fulltext_search.id = vector_search.id
  LEFT JOIN documents ON COALESCE(fulltext_search.id, vector_search.id) = documents.id
  -- Sort by RRF score in descending order and keep the top 10
ORDER BY
  rrf_score DESC
LIMIT
  10;

Query result

doc_id  rrf_score                text_field                   field1   field2
4       0.032522474881015335801  产品的用户价值和商业价值           4        flag2
1       0.016393442622950819672  客户需要更好的产品和服务           1        flag1
3       0.016129032258064516129  Hangzhou, Zhejiang Province    3        flag1
2       0.015873015873015873016  武汉市长江大桥                   2        flag1
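
The fused scores above can be checked by hand. The following is a minimal sketch in plain Python, assuming the ranks from the two single-path results in this topic (vector ranks 4, 3, 2 and full-text ranks 1, 4) and a rank constant of 60. The rrf_fuse function is an illustrative helper, not an AnalyticDB for MySQL API.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several {doc_id: rank} mappings (1-based ranks) into RRF scores."""
    scores = defaultdict(float)
    for ranking in rankings:
        for doc_id, rank in ranking.items():
            # A document missing from a ranking simply contributes nothing
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Ranks taken from the single-path query results in this topic
vec_rank = {4: 1, 3: 2, 2: 3}   # vector search with field1 > 1
txt_rank = {1: 1, 4: 2}         # full-text search for '产品服务'

fused = rrf_fuse([txt_rank, vec_rank])
for doc_id, score in fused:
    print(doc_id, round(score, 6))
# prints:
# 4 0.032522
# 1 0.016393
# 3 0.016129
# 2 0.015873
```

Document 4 ranks first because it is the only document returned by both paths (1/61 + 1/62). The other documents each receive a single term: 1/61, 1/62, and 1/63.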