Hybrid search
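In text retrieval workloads, hybrid search that combines full-text search and vector search can return more accurate results than either method alone. Reciprocal Rank Fusion (RRF) is an algorithm that merges the ranked results of multiple retrieval systems into one final ranking: for each document, it sums the reciprocals of the document's (offset) ranks across the systems to produce a combined score. This topic describes how to run full-text and vector dual-path recall in AnalyticDB for MySQL and fuse the two result sets with RRF to implement hybrid search.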
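For reference, the RRF score described above can be written as follows. The notation is introduced here for illustration only: $R$ is the set of retrieval paths (full-text and vector in this topic), $\mathrm{rank}_r(d)$ is the 1-based rank of document $d$ in path $r$, and $k$ is the rank constant, which the hybrid query in this topic sets to 60. A document that a path does not return contributes nothing for that path.

```latex
\mathrm{RRF}(d) = \sum_{\substack{r \in R \\ d \in r}} \frac{1}{k + \mathrm{rank}_r(d)}
```

For example, a document ranked 1st by one path and 2nd by the other scores $\frac{1}{60+1} + \frac{1}{60+2} \approx 0.03252$.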
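Prerequisites
xuanwu_v1 table engine: supports only l2_distance. The cluster kernel version must be 3.1.4.0 or later. The vector index feature is relatively stable on kernel versions 3.1.5.16, 3.1.6.8, 3.1.8.6, and later.
xuanwu_v2 table engine: supports l2_distance and cosine_similarity. l2_distance requires kernel version 3.2.6.0 or later, and cosine_similarity requires kernel version 3.2.7.0 or later.
The vector search examples in this topic use cosine_similarity.
Note
To view or update the kernel version, go to the Configuration Information section of the Cluster Information page in the AnalyticDB for MySQL console.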
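Prepare data
Create a table and indexes
Create a document table that contains a text column, a vector column, and scalar filter columns, and create a full-text index and a vector index on it.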
DROP TABLE IF EXISTS documents;

-- Create the document table (text, vector, and scalar filter columns)
-- together with a full-text index and a vector index
CREATE TABLE documents (
  id INT,
  text_field TEXT,                 -- full-text search column
  float_feature ARRAY<FLOAT>(3),   -- vector column
  field1 INT,                      -- scalar column
  field2 TEXT,                     -- scalar column
  FULLTEXT INDEX idx_text_field(`text_field`) WITH ANALYZER ik,  -- use the IK analyzer
  ANN INDEX idx_float_feature(`float_feature`),
  PRIMARY KEY (id)
) DISTRIBUTED BY HASH(id);
Import data
INSERT INTO documents (id, text_field, float_feature, field1, field2)
VALUES
  (1, '客户需要更好的产品和服务', '[2.5, 2.3, 2.4]', 1, 'flag1'),
  (2, '武汉市长江大桥', '[2.6, 2.3, 2.4]', 2, 'flag1'),
  (3, 'Hangzhou, Zhejiang Province', '[2.7, 2.3, 2.4]', 3, 'flag1'),
  (4, '产品的用户价值和商业价值', '[2.8, 2.3, 2.4]', 4, 'flag2');
Single-path recall
Vector search
SELECT
  id,
  similarity,
  ROW_NUMBER() OVER (ORDER BY similarity DESC) AS vec_rank
FROM (
  SELECT
    id,
    cosine_similarity(float_feature, '[2.8, 2.3, 2.4]') AS similarity
  FROM documents
  WHERE field1 > 1   -- scalar filter
  ORDER BY similarity DESC
  LIMIT 100
) inner_vec;
Query results
id  similarity  vec_rank
4   1.0000001   1
3   0.99984056  2
2   0.999343    3
Full-text search
SELECT
  id,
  text_field,
  MATCH(text_field) AGAINST ('产品服务') AS match_score,
  ROW_NUMBER() OVER (
    ORDER BY MATCH(text_field) AGAINST ('产品服务') DESC
  ) AS txt_rank
FROM documents
WHERE MATCH(text_field) AGAINST ('产品服务')
ORDER BY match_score DESC
LIMIT 100;
Query results
id  text_field                match_score          txt_rank
1   客户需要更好的产品和服务  0.2615291476249695   1
4   产品的用户价值和商业价值  0.13076457381248474  2
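As a cross-check on the vector search results earlier in this section, the following Python sketch recomputes the similarities outside the database. It assumes that cosine_similarity is the standard cosine (dot product divided by the product of the vector norms). The helper names `cosine_similarity` and `vector_search` are illustrative and are not part of any AnalyticDB client API. The sketch only mimics the filter, sort, and ranking steps of the SQL query on the four sample rows.

```python
import math


def cosine_similarity(a, b):
    """Standard cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Sample rows from the documents table: id -> (float_feature, field1)
rows = {
    1: ([2.5, 2.3, 2.4], 1),
    2: ([2.6, 2.3, 2.4], 2),
    3: ([2.7, 2.3, 2.4], 3),
    4: ([2.8, 2.3, 2.4], 4),
}
query_vector = [2.8, 2.3, 2.4]


def vector_search(rows, query_vector, min_field1=1, limit=100):
    """Mimic the SQL: keep field1 > min_field1, sort by similarity, add 1-based ranks."""
    scored = [
        (doc_id, cosine_similarity(vec, query_vector))
        for doc_id, (vec, field1) in rows.items()
        if field1 > min_field1
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [
        (doc_id, sim, rank)
        for rank, (doc_id, sim) in enumerate(scored[:limit], start=1)
    ]


vec_results = vector_search(rows, query_vector)
for doc_id, sim, rank in vec_results:
    print(doc_id, round(sim, 6), rank)
```

The ranking comes out as 4, 3, 2, and the similarities agree with the query results to about five decimal places. Small differences in the last digits (such as 1.0000001 for the identical vector) are presumably due to single-precision arithmetic in the engine.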
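Hybrid search
Use common table expressions (CTEs) to run the full-text search and the vector search separately, then fuse the two result sets with the RRF algorithm.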
-- Step 1: vector search with scalar filtering
WITH vector_search AS (
  SELECT
    id,
    similarity,
    ROW_NUMBER() OVER (ORDER BY similarity DESC) AS vec_rank
  FROM (
    SELECT
      id,
      cosine_similarity(float_feature, '[2.8, 2.3, 2.4]') AS similarity
    FROM documents
    WHERE field1 > 1   -- scalar filter; the query results shown for this statement reflect it
    ORDER BY similarity DESC
    LIMIT 100
  ) inner_vec
),
-- Step 2: full-text search
fulltext_search AS (
  SELECT
    id,
    MATCH(text_field) AGAINST ('产品服务') AS match_score,
    ROW_NUMBER() OVER (
      ORDER BY MATCH(text_field) AGAINST ('产品服务') DESC
    ) AS txt_rank
  FROM documents
  WHERE MATCH(text_field) AGAINST ('产品服务')
  ORDER BY match_score DESC
  LIMIT 100
)
-- Step 3: fuse the two ranked lists with RRF
SELECT
  COALESCE(fulltext_search.id, vector_search.id) AS doc_id,
  -- RRF score: sum(1 / (rrf_rank_constant + rank)), with rrf_rank_constant = 60.
  -- A document missing from one path contributes 0 for that path.
  (
    CASE WHEN fulltext_search.txt_rank IS NOT NULL
         THEN 1.0 / (60 + fulltext_search.txt_rank) ELSE 0 END
  ) + (
    CASE WHEN vector_search.vec_rank IS NOT NULL
         THEN 1.0 / (60 + vector_search.vec_rank) ELSE 0 END
  ) AS rrf_score,
  -- Join the original columns for easier inspection (optional)
  documents.text_field,
  documents.field1,
  documents.field2
FROM fulltext_search
FULL JOIN vector_search
  ON fulltext_search.id = vector_search.id
LEFT JOIN documents
  ON COALESCE(fulltext_search.id, vector_search.id) = documents.id
-- Sort by RRF score in descending order and return the top 10
ORDER BY rrf_score DESC
LIMIT 10;
Query results
doc_id  rrf_score                text_field                   field1  field2
4       0.032522474881015335801  产品的用户价值和商业价值     4       flag2
1       0.016393442622950819672  客户需要更好的产品和服务     1       flag1
3       0.016129032258064516129  Hangzhou, Zhejiang Province  3       flag1
2       0.015873015873015873016  武汉市长江大桥               2       flag1
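The fusion step can also be reproduced outside the database. The following Python sketch is a minimal illustration of RRF, not the AnalyticDB implementation. The function name `rrf_fuse` and its parameter `k` are hypothetical. The two input lists are the document IDs in rank order taken from the single-path results in this topic (full-text: 1, 4; vector with `field1 > 1`: 4, 3, 2).

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    rankings: iterable of lists, each holding document IDs in rank order (best first).
    k: rank constant; this topic's SQL uses 60.
    Returns (doc_id, score) pairs sorted by score, highest first.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # A document absent from a list simply gets no contribution from it.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fulltext_ids = [1, 4]    # txt_rank 1, 2
vector_ids = [4, 3, 2]   # vec_rank 1, 2, 3

fused = rrf_fuse([fulltext_ids, vector_ids])
for doc_id, score in fused:
    print(doc_id, score)
```

Document 4 is the only one returned by both paths, so it receives 1/(60+2) + 1/(60+1) and ranks first. The other documents each receive a single term, which matches the order 4, 1, 3, 2 in the query results.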