Two-way retrieval with vector search and full-text search (V7.0)

Last updated: 2025-03-28 01:53:54

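In most scenarios, vector search alone achieves a high recall rate for similarity retrieval. In some cases, however, such as when the embedding model performs poorly or a complex query produces a vector that lies far from the relevant data in the database, vector similarity alone may not deliver the expected results. To improve the recall rate in these cases, you can combine vector search and full-text search in a two-way retrieval strategy.

Two-way retrieval in AnalyticDB for PostgreSQL retrieves one set of candidates through vector search and another through full-text search, merges the two sets, and then reranks and post-processes them to improve retrieval quality. The steps are as follows.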

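  1. Vector search: uses dense embedding representations and approximate nearest neighbor (ANN) search to capture semantic relevance and retrieve the top-K most similar items.

  2. Full-text search: matches exactly on statistical features such as term frequency and inverse document frequency, adding results that are strongly keyword-relevant. In AnalyticDB for PostgreSQL V6.0, full-text search relies on GIN indexes. V7.0 uses pgsearch BM25 indexes instead, which improves retrieval efficiency and relevance.

  3. Reranking and post-processing: merges the results from both paths, then sorts and processes them further to ensure that the final results are relevant and accurate.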

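This topic describes two-way retrieval with vector search and full-text search in AnalyticDB for PostgreSQL V7.0. If your instance runs V6.0, see the V6.0 two-way retrieval topic.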

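Version limits

AnalyticDB for PostgreSQL V7.0 instances with minor engine version 7.2.1.0 or later.

Note

You can check the minor engine version on the Basic Information page of the instance in the console. If your instance does not meet this requirement, upgrade the minor engine version.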

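Prerequisites

  • Vector search engine optimization is enabled for the instance.

  • The pgsearch extension is installed. If it is installed, pgsearch appears in the schema list of the database. If it is not installed, submit a ticket to ask technical support to install it (the instance must be restarted).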

Procedure

Step 1: Create a sample table

Create a sample table named documents and insert five rows of test data.

-- The vector column stores the embedding vector.
CREATE TABLE IF NOT EXISTS documents(
                id TEXT,
                docname TEXT,
                title TEXT,
                vector real[],
                text TEXT);
-- Store the vector column inline (PLAIN storage: no compression, no out-of-line storage).
ALTER TABLE documents ALTER COLUMN vector SET STORAGE PLAIN;
-- Insert sample data.
INSERT INTO documents (id, docname, title, vector, text) VALUES
('1', 'doc_1', 'Exploring the Universe', 
'{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}', 
'The universe is vast, filled with mysteries and astronomical wonders waiting to be discovered.'),

('2', 'doc_2', 'The Art of Cooking', 
'{0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1}', 
'Cooking combines ingredients artfully, creating flavors that nourish and bring people together.'),

('3', 'doc_3', 'Technology and Society', 
'{0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2}', 
'Technology transforms society, reshaping communication, work, and our daily interactions significantly.'),

('4', 'doc_4', 'Psychology of Happiness', 
'{0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3}', 
'Happiness is complex, influenced by relationships, gratitude, and the pursuit of meaningful experiences.'),

('5', 'doc_5', 'Sustainable Living Practices', 
'{0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4}', 
'Sustainable living involves eco-friendly choices, reducing waste, and promoting environmental awareness.');

Step 2: Create indexes

  • Create a vector index on the vector column.

    CREATE INDEX documents_idx ON documents USING ann(vector) WITH (dim = 10, algorithm = hnswflat, distancemeasure = L2, vector_include = 0);
  • Create a full-text index on the text column.

    CALL pgsearch.create_bm25(
        index_name => 'documents_bm25_idx',
        table_name => 'documents',
        text_fields => '{text: {}}'
    );
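The vector index above is built with distancemeasure = L2, while the query in step 3 reports cosine_similarity as the vector score. The following standalone Python sketch compares the two measures on the sample vectors from step 1, using the first row's vector as the query. The helper names l2_distance and cosine_similarity_py are illustrative only and are not database functions; the sketch runs outside the database and makes no claim about how the index computes distances internally.

```python
import math

# Sample vectors copied from the INSERT statement in step 1.
docs = {
    "1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "2": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1],
    "3": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2],
    "4": [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3],
    "5": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4],
}
query = docs["1"]  # same values as the query vector used in step 3


def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity_py(a, b):
    """Cosine of the angle between a and b: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Rank by ascending L2 distance and by descending cosine similarity.
by_l2 = sorted(docs, key=lambda d: l2_distance(docs[d], query))
by_cos = sorted(docs, key=lambda d: cosine_similarity_py(docs[d], query), reverse=True)

for doc_id in by_l2:
    print(doc_id,
          round(l2_distance(docs[doc_id], query), 4),
          round(cosine_similarity_py(docs[doc_id], query), 4))
```

For this sample data the two measures produce the same ordering. That is a property of these particular vectors, not a general guarantee, so check both on your own embeddings if the ordering and the reported score need to agree.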

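Step 3: Run a two-way retrieval query

The first CTE, t1, retrieves 1 result through full-text search, and the second CTE, t2, retrieves 5 results through vector search. A FULL OUTER JOIN combines the BM25 score and the vector similarity score into a total score, and the results are returned in descending order of that total score.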

WITH t1 AS (
    SELECT
            id,
            docname,
            title,
            text,
            text @@@ pgsearch.config('text:astronomical') AS score,
            2 AS source
    FROM
        documents
    ORDER BY score
    LIMIT 10
),
t2 AS (
    SELECT
        id,
        docname,
        title,
        text,
        cosine_similarity(vector,ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]::real[]) AS score,
        1 AS source
    FROM
        documents
    ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
    LIMIT 10
)
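-- COALESCE keeps the columns of rows that only one of the two paths returned.
SELECT
    COALESCE(t1.id, t2.id) AS id,
    COALESCE(t1.docname, t2.docname) AS docname,
    COALESCE(t1.title, t2.title) AS title,
    COALESCE(t1.text, t2.text) AS text,
    COALESCE(ABS(t1.score), 0.0) * 0.2 + COALESCE(t2.score, 0.0) * 0.8 AS hybrid_score
-- The weights above are for demonstration only. Choose parameters and a scoring method that fit your workload.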
FROM t1
FULL OUTER JOIN t2 ON t1.id = t2.id 
ORDER BY  hybrid_score DESC;
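The score arithmetic in the query above can be reproduced outside the database to experiment with weights. The following is a minimal Python sketch, assuming each path returns a map from id to score. The function name weighted_fusion and all score values are hypothetical and are not output from AnalyticDB. A missing score counts as 0, mirroring COALESCE, and the absolute value of the BM25 score is taken, mirroring ABS.

```python
def weighted_fusion(bm25_scores, vector_scores, bm25_weight=0.2, vector_weight=0.8):
    """Combine two {id: score} maps the way the FULL OUTER JOIN query does."""
    hybrid = {}
    # The union of ids plays the role of FULL OUTER JOIN.
    for doc_id in set(bm25_scores) | set(vector_scores):
        bm25 = abs(bm25_scores.get(doc_id, 0.0))   # COALESCE(ABS(t1.score), 0.0)
        vec = vector_scores.get(doc_id, 0.0)       # COALESCE(t2.score, 0.0)
        hybrid[doc_id] = bm25 * bm25_weight + vec * vector_weight
    # ORDER BY hybrid_score DESC
    return sorted(hybrid.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for illustration only.
bm25_scores = {"1": -1.5, "7": -2.0}             # "7" was found only by full-text search
vector_scores = {"1": 1.0, "2": 0.9, "3": 0.8}   # "2" and "3" were found only by vector search

ranked = weighted_fusion(bm25_scores, vector_scores)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Because the two scores are on different scales, a fixed weight pair can let one path dominate. Normalizing each path's scores before weighting, or using rank-based fusion, are common alternatives.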

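You can also use Reciprocal Rank Fusion (RRF) to compute the final score. RRF determines the final ranking by combining each result's rank in vector search with its rank in full-text search. A result that ranks high in both searches generally receives a higher combined score.

The parameter k in the RRF formula smooths the effect of rank on the final score. A larger k reduces the score differences between ranks, which produces a smoother result. By default, k is 60.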
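The effect of k can be checked numerically. The following is a minimal standalone Python sketch, assuming each retrieval path contributes 1 / (k + rank) to a result's score, which is the form used by the RRF SQL example in this topic. The function name rrf_score and the sample ranks are hypothetical.

```python
def rrf_score(ranks, k=60):
    """Sum 1 / (k + rank) over the paths that returned the item.

    ranks holds one 1-based rank per retrieval path; None means that the
    path did not return the item, so it contributes nothing.
    """
    return sum(1.0 / (k + rank) for rank in ranks if rank is not None)


# An item ranked first by both paths outscores an item ranked first by one path only.
both_paths = rrf_score([1, 1])
one_path = rrf_score([1, None])
print(both_paths, one_path)

# A larger k shrinks the score gap between rank 1 and rank 10.
gaps = {k: rrf_score([1], k) - rrf_score([10], k) for k in (1, 60, 1000)}
for k, gap in gaps.items():
    print(k, gap)
```

With a small k the top ranks dominate the fused score, while with a large k the contributions of different ranks become nearly equal, so appearing in both result lists matters more than the exact position in either.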

RRF example

WITH bm25 AS (
        SELECT
            id,
            docname,
            title,
            text,
            text @@@ pgsearch.config('text:astronomical') AS score,
            2 AS source,
            -- OVER () has no ORDER BY of its own, so rank_bm25 follows the order in which rows are produced.
            ROW_NUMBER() OVER () AS rank_bm25
        FROM
            documents
        ORDER BY score
        LIMIT 10
), hnsw AS (
        SELECT
            id,
            docname,
            title,
            text,
            vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1] AS score,
            1 AS source,
            ROW_NUMBER() OVER (ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]) AS rank_hnsw
        FROM
            documents
        ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
        LIMIT 10
)
SELECT 
    COALESCE(bm25.id, hnsw.id) AS id,
    COALESCE(bm25.docname, hnsw.docname) AS docname,
    COALESCE(bm25.title, hnsw.title) AS title,
    COALESCE(bm25.text, hnsw.text) as text,
    CASE 
        WHEN bm25.rank_bm25 > 0 AND hnsw.rank_hnsw > 0 THEN 
            COALESCE(1.0 / (60 + bm25.rank_bm25), 0) + COALESCE(1.0 / (60 + hnsw.rank_hnsw), 0)
        WHEN bm25.rank_bm25 > 0 THEN 
            COALESCE(1.0 / (60 + bm25.rank_bm25), 0)
        WHEN hnsw.rank_hnsw > 0 THEN 
            COALESCE(1.0 / (60 + hnsw.rank_hnsw), 0)
        ELSE 0
    END AS hybrid_score
FROM 
    bm25
FULL OUTER JOIN hnsw ON bm25.id = hnsw.id 
ORDER BY hybrid_score DESC;

Step 4: Wrap the query in a function and call it (optional)

Wrap the query from step 3 in a function to simplify calls.

  1. Define the function.

    CREATE OR REPLACE FUNCTION search_documents(
        table_name TEXT,
        vector_column TEXT,
        text_column TEXT,
        search_keyword TEXT,
        search_vector REAL[],
        limit_size INT,
        hnsw_weight FLOAT8 DEFAULT 0.8  -- Default weight for hnsw
    )
    RETURNS TABLE (
        id TEXT,
        docname TEXT,
        title TEXT,
        text TEXT,
        hybrid_score FLOAT8
    ) AS $$
    DECLARE
        query_string TEXT;
        bm25_weight FLOAT8;
    BEGIN
        bm25_weight := 1.0 - hnsw_weight;
    
        query_string := 'WITH t1 AS (
                                SELECT
                                    id,
                                    docname,
                                    title,
                                    ' || text_column || ',
                                    ' || text_column || ' @@@ pgsearch.config(''' || search_keyword || ''') AS score,
                                    2 AS source
                                FROM
                                    ' || table_name || '
                                ORDER BY score	
                                LIMIT ' || limit_size || '
                        ), t2 AS (
                                SELECT
                                    id,
                                    docname,
                                    title,
                                    ' || text_column || ',
                                    cosine_similarity(' || vector_column || ', $1) AS score,
                                    1 AS source
                                FROM
                                    ' || table_name || '
                                ORDER BY ' || vector_column || ' <-> $1
                                LIMIT ' || limit_size || '
                        )
                        SELECT COALESCE(t1.id, t2.id), COALESCE(t1.docname, t2.docname), COALESCE(t1.title, t2.title),
                        COALESCE(t1.' || text_column || ', t2.' || text_column || '),
                        COALESCE(ABS(t1.score), 0.0) * ' || bm25_weight || ' + 
                        COALESCE(t2.score, 0.0) * ' || hnsw_weight || ' AS hybrid_score
                        FROM t1
                        FULL OUTER JOIN t2 ON t1.id = t2.id 
                        ORDER BY hybrid_score DESC;';
        
        RETURN QUERY EXECUTE query_string USING search_vector;
    END; $$
    LANGUAGE plpgsql;
  2. Call the search_documents function.

    SELECT * 
    FROM search_documents(
        'documents', 
        'vector',
        'text', 
        'astronomical', 
        ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], 
        10,
        0.8
    );

Related documents

Hybrid search guide for V6.0
