Dual-path retrieval with vector search and full-text search in V7.0
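In most scenarios, vector search alone achieves a high recall rate for similarity-based retrieval. In some cases, however, vector similarity alone may fall short of expectations, for example when the embedding model performs poorly or when a complex query produces a vector that is far from the data stored in the database. To improve the recall rate in such cases, you can use a dual-path retrieval strategy that combines vector search with full-text search.
In AnalyticDB for PostgreSQL, dual-path retrieval recalls one part of the data through vector search and another part through full-text search, merges the two result sets, and then applies fine-grained ranking and post-processing to obtain better retrieval results. The steps are as follows.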
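Vector search: uses the dense representation of embedding vectors and approximate nearest neighbor (ANN) search to capture semantic relevance and recall the top-K most similar items.
Full-text search: performs exact matching based on statistical features such as term frequency and inverse document frequency, which adds results that are strongly related to the keywords. In AnalyticDB for PostgreSQL V6.0, full-text search relies on GIN indexes. V7.0 upgrades to BM25 indexes based on pgsearch, which further improves retrieval efficiency and relevance.
Fine-grained ranking and post-processing: merges the data recalled by the two paths and then sorts and processes it further to ensure that the final results are relevant and accurate.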
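This topic describes dual-path retrieval with vector search and full-text search in AnalyticDB for PostgreSQL V7.0. If your instance runs V6.0, see the V6.0 dual-path retrieval topic.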
Version limits
AnalyticDB for PostgreSQL V7.0 instances with kernel version 7.2.1.0 or later.
Prerequisites
Procedure
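Step 1: Create a sample table
Create a sample table named documents and insert five rows of test data.
-- The vector column stores the embedding vector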
CREATE TABLE IF NOT EXISTS documents(
id TEXT,
docname TEXT,
title TEXT,
vector real[],
text TEXT);
-- Set the vector column to inline (PLAIN) storage
ALTER TABLE documents ALTER COLUMN vector SET STORAGE PLAIN;
-- Insert sample data
INSERT INTO documents (id, docname, title, vector, text) VALUES
('1', 'doc_1', 'Exploring the Universe',
'{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}',
'The universe is vast, filled with mysteries and astronomical wonders waiting to be discovered.'),
('2', 'doc_2', 'The Art of Cooking',
'{0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1}',
'Cooking combines ingredients artfully, creating flavors that nourish and bring people together.'),
('3', 'doc_3', 'Technology and Society',
'{0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2}',
'Technology transforms society, reshaping communication, work, and our daily interactions significantly.'),
('4', 'doc_4', 'Psychology of Happiness',
'{0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3}',
'Happiness is complex, influenced by relationships, gratitude, and the pursuit of meaningful experiences.'),
('5', 'doc_5', 'Sustainable Living Practices',
'{0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4}',
'Sustainable living involves eco-friendly choices, reducing waste, and promoting environmental awareness.');
Step 2: Create indexes
Create a vector index on the vector column.
CREATE INDEX documents_idx ON documents USING ann(vector) WITH (dim = 10, algorithm = hnswflat, distancemeasure = L2, vector_include = 0);
Create a full-text index on the text column.
CALL pgsearch.create_bm25( index_name => 'documents_bm25_idx', table_name => 'documents', text_fields => '{text: {}}' );
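Step 3: Run a dual-path retrieval query
The first temporary table, t1, recalls 1 result through full-text search, and the second temporary table, t2, recalls 5 results through vector search. A FULL OUTER JOIN combines the BM25 score and the vector similarity score into a total score, and the results are returned in descending order of that total score.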
WITH t1 AS (
SELECT
id,
docname,
title,
text,
text @@@ pgsearch.config('text:astronomical') AS score,
2 AS source
FROM
documents
ORDER BY score
LIMIT 10
),
t2 AS (
SELECT
id,
docname,
title,
text,
cosine_similarity(vector,ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]::real[]) AS score,
1 AS source
FROM
documents
ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]::real[]
LIMIT 10
)
SELECT t2.*, COALESCE(ABS(t1.score), 0.0) * 0.2 + COALESCE(t2.score, 0.0) * 0.8 AS hybrid_score
-- The score weights here are for demonstration only. Choose parameters and a calculation method that suit your business requirements.
FROM t1
FULL OUTER JOIN t2 ON t1.id = t2.id
ORDER BY hybrid_score DESC;
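To make the scoring arithmetic of the query above concrete, the following Python sketch mirrors it outside the database. It is an illustration only: the function names are hypothetical, and the BM25 value of -1.5 is a made-up placeholder, because real BM25 scores come from pgsearch. The vectors are taken from the sample data for documents 1 and 2.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def weighted_fusion(bm25_scores, vector_scores, bm25_weight=0.2, vector_weight=0.8):
    """Combine two {id: score} maps the way the FULL OUTER JOIN query does."""
    fused = {}
    for doc_id in set(bm25_scores) | set(vector_scores):
        bm25 = abs(bm25_scores.get(doc_id, 0.0))  # COALESCE(ABS(t1.score), 0.0)
        vec = vector_scores.get(doc_id, 0.0)      # COALESCE(t2.score, 0.0)
        fused[doc_id] = bm25 * bm25_weight + vec * vector_weight
    # Highest hybrid score first, like ORDER BY hybrid_score DESC
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

query = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
sample_vectors = {
    "1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "2": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1],
}
vector_scores = {doc_id: cosine_similarity(vec, query)
                 for doc_id, vec in sample_vectors.items()}
bm25_scores = {"1": -1.5}  # hypothetical placeholder, not a real pgsearch score

ranked = weighted_fusion(bm25_scores, vector_scores)
```

Document 1 matches both paths, so it receives a contribution from each term. Document 2 appears only in the vector path, so its BM25 term falls back to 0, just as COALESCE does in the SQL.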
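You can also use Reciprocal Rank Fusion (RRF) to compute the final score. RRF determines the final rank of each result by combining its ranks from vector search and full-text search. In general, a result that ranks near the top in both retrieval methods receives a higher combined score.
The parameter k in the RRF formula smooths the influence of rank on the final score. A larger k reduces the score difference between ranks and therefore produces a stronger smoothing effect. By default, k is 60.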
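The RRF calculation can be sketched in Python as follows. This is a minimal illustration rather than the product's implementation: it assumes the commonly used form of RRF, in which each result scores the sum of 1 / (k + rank) over the retrieval paths that returned it, with ranks starting at 1. The function name rrf_fuse and the two example rankings are assumptions made for the sketch.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion.

    Each id scores the sum of 1 / (k + rank) over the lists that
    contain it, where rank starts at 1 for the top result.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example rankings, assumed for illustration: full-text search returns
# one id and vector search returns five.
bm25_ranking = ["1"]
vector_ranking = ["1", "2", "3", "4", "5"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
fused_small_k = rrf_fuse([bm25_ranking, vector_ranking], k=1)

# Score gap between the 2nd and 3rd results for each k
gap_default = fused[1][1] - fused[2][1]
gap_small_k = fused_small_k[1][1] - fused_small_k[2][1]
```

Here document 1 ranks first in both lists and therefore ends up on top. Comparing gap_default with gap_small_k shows the smoothing effect: with k = 60 the gap between neighboring ranks is far smaller than with k = 1.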
Step 4: Wrap the query in a function and call it (optional)
Wrap the query from Step 3 in a function to simplify calls.
Create the function.
CREATE OR REPLACE FUNCTION search_documents(
    table_name TEXT,
    vector_column TEXT,
    text_column TEXT,
    search_keyword TEXT,
    search_vector REAL[],
    limit_size INT,
    hnsw_weight FLOAT8 DEFAULT 0.8  -- Default weight for hnsw
)
RETURNS TABLE (
    id TEXT,
    docname TEXT,
    title TEXT,
    text TEXT,
    hybrid_score FLOAT8
) AS $$
DECLARE
    query_string TEXT;
    bm25_weight FLOAT8;
BEGIN
    bm25_weight := 1.0 - hnsw_weight;
    query_string :=
        'WITH t1 AS ( SELECT id, docname, title, ' || text_column || ', '
        || text_column || ' @@@ pgsearch.config(''' || search_keyword || ''') AS score, 2 AS source'
        || ' FROM ' || table_name || ' ORDER BY score LIMIT ' || limit_size || ' ),'
        || ' t2 AS ( SELECT id, docname, title, ' || text_column || ','
        || ' cosine_similarity(' || vector_column || ', $1) AS score, 1 AS source'
        || ' FROM ' || table_name || ' ORDER BY ' || vector_column || ' <-> $1 LIMIT ' || limit_size || ' )'
        || ' SELECT t2.id, t2.docname, t2.title, t2.' || text_column || ','
        || ' COALESCE(ABS(t1.score), 0.0) * ' || bm25_weight
        || ' + COALESCE(t2.score, 0.0) * ' || hnsw_weight || ' AS hybrid_score'
        || ' FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id'
        || ' ORDER BY hybrid_score DESC;';
    RETURN QUERY EXECUTE query_string USING search_vector;
END;
$$ LANGUAGE plpgsql;
Call the search_documents function to run the query.
SELECT * FROM search_documents(
    'documents',
    'vector',
    'text',
    'astronomical',
    ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],
    10,
    0.8
);
Related documents