Implement dual-path recall with vector search and full-text search in AnalyticDB for PostgreSQL V7.0

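In most scenarios, vector search alone achieves a high recall rate for similarity-based retrieval. In some cases, however, it falls short of expectations: for example, when the embedding model performs poorly, or when a complex query produces a vector that lies far from the relevant data in the database. In such cases, you can use dual-path recall, which combines vector search and full-text search, to improve the recall rate.

In AnalyticDB for PostgreSQL, dual-path recall retrieves one set of candidates through vector search and another through full-text search, merges the two sets, and then applies fine-grained ranking and post-processing to obtain better recall results. The process consists of the following stages.

Vector search: Uses dense embedding representations and approximate nearest neighbor (ANN) search to capture semantic relevance and recall the top-K most similar items.
Full-text search: Performs exact matching based on statistical features such as term frequency and inverse document frequency, which adds results that are strongly related to the keywords. In AnalyticDB for PostgreSQL V6.0, full-text search relies on GIN indexes. In V7.0, it is upgraded to BM25 indexes based on pgsearch, which further improves retrieval efficiency and relevance.
Fine-grained ranking and post-processing: Merges the results of the two recall paths, then sorts and processes them further to ensure that the final results are relevant and accurate.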

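This topic describes dual-path recall with vector search and full-text search in AnalyticDB for PostgreSQL V7.0. If your instance runs V6.0, see the V6.0 dual-path recall topic.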

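Version limits

AnalyticDB for PostgreSQL V7.0 instances with minor engine version 7.2.1.0 or later.

Note

You can view the minor engine version on the Basic Information page of the instance in the console. If the instance does not meet this requirement, update the minor engine version.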

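Prerequisites

Vector engine optimization is enabled for the instance.
The pgsearch extension is installed. If it is installed, pgsearch appears in the schema list of the database. If it is not installed, submit a ticket to contact technical support for assistance with the installation. The instance must be restarted during installation.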

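Procedure

Step 1: Create a sample table

Create a sample table named documents and insert five rows of test data.

-- The vector column stores the embedding vectors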
CREATE TABLE IF NOT EXISTS documents(
                id TEXT,
                docname TEXT,
                title TEXT,
                vector real[],
                text TEXT);
-- Set the vector column to inline (PLAIN) storage
ALTER TABLE documents ALTER COLUMN vector SET STORAGE PLAIN;
-- Insert sample data
INSERT INTO documents (id, docname, title, vector, text) VALUES
('1', 'doc_1', 'Exploring the Universe', 
'{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}', 
'The universe is vast, filled with mysteries and astronomical wonders waiting to be discovered.'),

('2', 'doc_2', 'The Art of Cooking', 
'{0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1}', 
'Cooking combines ingredients artfully, creating flavors that nourish and bring people together.'),

('3', 'doc_3', 'Technology and Society', 
'{0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2}', 
'Technology transforms society, reshaping communication, work, and our daily interactions significantly.'),

('4', 'doc_4', 'Psychology of Happiness', 
'{0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3}', 
'Happiness is complex, influenced by relationships, gratitude, and the pursuit of meaningful experiences.'),

('5', 'doc_5', 'Sustainable Living Practices', 
'{0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4}', 
'Sustainable living involves eco-friendly choices, reducing waste, and promoting environmental awareness.');

Step 2: Create indexes

Create a vector index on the vector column.

CREATE INDEX documents_idx ON documents USING ann(vector) WITH (dim = 10, algorithm = hnswflat, distancemeasure = L2, vector_include = 0);
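
The index above is built with the L2 distance measure, while the query in step 3 reports cosine similarity as the vector score. The following Python sketch is an illustration outside the database, not AnalyticDB code. It rebuilds the five sample vectors from step 1 and compares the two measures for a query equal to the vector of document 1. The helper names l2_distance and cosine_similarity are local to this sketch. Plain Euclidean distance is an assumption here; a monotonic variant such as the squared distance would give the same order.

```python
import math

# Rebuild the ten-dimensional sample vectors from step 1:
# document i holds [0.1 * i, 0.1 * (i + 1), ..., 0.1 * (i + 9)].
vectors = {
    str(i): [round(0.1 * (i + j), 1) for j in range(10)]
    for i in range(1, 6)
}


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The query vector used in step 3 is identical to the vector of document 1.
query = vectors["1"]

# Smaller L2 distance means closer; larger cosine similarity means closer.
by_l2 = sorted(vectors, key=lambda doc_id: l2_distance(query, vectors[doc_id]))
by_cosine = sorted(
    vectors,
    key=lambda doc_id: cosine_similarity(query, vectors[doc_id]),
    reverse=True,
)

for doc_id in by_l2:
    print(
        doc_id,
        round(l2_distance(query, vectors[doc_id]), 4),
        round(cosine_similarity(query, vectors[doc_id]), 4),
    )
```

For this sample data the two measures produce the same order. That is not guaranteed in general, because L2 distance is sensitive to vector length and cosine similarity is not.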

Create a full-text index on the text column.

CALL pgsearch.create_bm25(
    index_name => 'documents_bm25_idx',
    table_name => 'documents',
    text_fields => '{text: {}}'
);
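
To give a feel for what a BM25 index scores, the following Python sketch applies the textbook Okapi BM25 formula to the five sample texts. It is an illustration only. The tokenizer, the parameters k1 = 1.2 and b = 0.75, and the IDF variant are assumptions made for this sketch, and the tokenization, parameters, and score sign used by pgsearch may differ.

```python
import math
import re

# Sample texts copied from the documents table in step 1.
docs = {
    "1": "The universe is vast, filled with mysteries and astronomical wonders waiting to be discovered.",
    "2": "Cooking combines ingredients artfully, creating flavors that nourish and bring people together.",
    "3": "Technology transforms society, reshaping communication, work, and our daily interactions significantly.",
    "4": "Happiness is complex, influenced by relationships, gratitude, and the pursuit of meaningful experiences.",
    "5": "Sustainable living involves eco-friendly choices, reducing waste, and promoting environmental awareness.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, corpus, k1=1.2, b=0.75):
    """Return {doc_id: score} for documents that contain at least one query term."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in corpus.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    scores = {}
    for term in tokenize(query):
        # Document frequency: number of documents that contain the term.
        df = sum(1 for tokens in tokenized.values() if term in tokens)
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tokens in tokenized.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            # Term-frequency saturation with document-length normalization.
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * tf * (k1 + 1) / denom
    return scores


hits = bm25_scores("astronomical", docs)
print(hits)
```

Only document 1 contains the word astronomical, so it is the only document that receives a score. This matches the single full-text result described in step 3.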

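Step 3: Run a dual-path recall query

The first common table expression (CTE), t1, recalls one result through full-text search, and the second CTE, t2, recalls five results through vector search. A FULL OUTER JOIN combines the BM25 score and the vector similarity score into a total score, and the results are returned in descending order of that score.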

WITH t1 AS (
    SELECT
            id,
            docname,
            title,
            text,
            text @@@ pgsearch.config('text:astronomical') AS score,
            2 AS source
    FROM
        documents
    ORDER BY score
    LIMIT 10
),
t2 AS (
    SELECT
        id,
        docname,
        title,
        text,
        cosine_similarity(vector,ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]::real[]) AS score,
        1 AS source
    FROM
        documents
    ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
    LIMIT 10
)
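SELECT
    -- Take each column from whichever side matched, so that rows recalled only by t1 are not returned as NULL
    COALESCE(t1.id, t2.id) AS id,
    COALESCE(t1.docname, t2.docname) AS docname,
    COALESCE(t1.title, t2.title) AS title,
    COALESCE(t1.text, t2.text) AS text,
    COALESCE(ABS(t1.score), 0.0) * 0.2 + COALESCE(t2.score, 0.0) * 0.8 AS hybrid_score
-- The weights used here are for demonstration only. Choose parameters and a scoring method that suit your business requirements.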
FROM t1
FULL OUTER JOIN t2 ON t1.id = t2.id 
ORDER BY  hybrid_score DESC;
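
The weighted fusion performed by the query above can be sketched outside the database as follows. This is a minimal Python illustration, not AnalyticDB code. The function name weighted_fusion and all score values are hypothetical. The negative BM25 value mirrors the ABS() call in the query, which suggests that the raw score can be negative.

```python
def weighted_fusion(bm25_hits, vector_hits, vector_weight=0.8):
    """Combine two {doc_id: score} maps into a list sorted by hybrid score.

    A document missing from one side contributes 0 for that side, which is
    what COALESCE(..., 0.0) does over the FULL OUTER JOIN in the SQL query.
    """
    bm25_weight = 1.0 - vector_weight
    fused = {}
    for doc_id in set(bm25_hits) | set(vector_hits):
        bm25_score = abs(bm25_hits.get(doc_id, 0.0))
        vector_score = vector_hits.get(doc_id, 0.0)
        fused[doc_id] = bm25_weight * bm25_score + vector_weight * vector_score
    # Highest hybrid score first, as in ORDER BY hybrid_score DESC.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores: one full-text hit and three vector hits.
bm25_hits = {"1": -1.5}
vector_hits = {"1": 1.0, "2": 0.99, "3": 0.98}

ranking = weighted_fusion(bm25_hits, vector_hits)
for doc_id, score in ranking:
    print(doc_id, round(score, 4))
```

BM25 scores are unbounded, whereas cosine similarity lies between -1 and 1, so a fixed pair of weights can let one side dominate. Normalizing both scores to a common range before weighting is one option to consider.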

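You can also use Reciprocal Rank Fusion (RRF) to compute the final score. RRF determines the final ranking of the recalled results by combining their ranks from vector search and full-text search. In general, a result that ranks near the top in both searches receives a higher combined score.

Each search contributes 1 / (k + rank) to the score of a result. The parameter k smooths the effect of rank on the final score: a larger k reduces the score difference between ranks and therefore produces a stronger smoothing effect. By default, k is 60.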
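
The RRF calculation can be sketched in a few lines of Python. This is an illustration outside the database. The function name rrf_fuse is hypothetical, and the two input rankings are assumed to mirror the sample data: one full-text hit and five vector hits.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every document it contains,
    where rank starts at 1 for the best hit.
    """
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Assumed rankings: full-text search finds only document 1,
# vector search returns all five documents, nearest first.
bm25_ranking = ["1"]
vector_ranking = ["1", "2", "3", "4", "5"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 6))
```

Document 1 ranks first in both lists, so it receives 1/61 from each and ends up with roughly twice the score of document 2, which appears only in the vector ranking.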

RRF example

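WITH bm25_hits AS (
        SELECT
            id,
            docname,
            title,
            text,
            text @@@ pgsearch.config('text:astronomical') AS score,
            2 AS source
        FROM
            documents
        ORDER BY score
        LIMIT 10
), bm25 AS (
        -- Assign ranks after sorting, so that rank_bm25 is guaranteed to follow the BM25 order
        SELECT
            *,
            ROW_NUMBER() OVER (ORDER BY score) AS rank_bm25
        FROM
            bm25_hits
), hnsw AS (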
        SELECT
            id,
            docname,
            title,
            text,
            vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1] AS score,
            1 AS source,
            ROW_NUMBER() OVER (ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]) AS rank_hnsw
        FROM
            documents
        ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
        LIMIT 10
)
SELECT 
    COALESCE(bm25.id, hnsw.id) AS id,
    COALESCE(bm25.docname, hnsw.docname) AS docname,
    COALESCE(bm25.title, hnsw.title) AS title,
    COALESCE(bm25.text, hnsw.text) as text,
    CASE 
        WHEN bm25.rank_bm25 > 0 AND hnsw.rank_hnsw > 0 THEN 
            COALESCE(1.0 / (60 + bm25.rank_bm25), 0) + COALESCE(1.0 / (60 + hnsw.rank_hnsw), 0)
        WHEN bm25.rank_bm25 > 0 THEN 
            COALESCE(1.0 / (60 + bm25.rank_bm25), 0)
        WHEN hnsw.rank_hnsw > 0 THEN 
            COALESCE(1.0 / (60 + hnsw.rank_hnsw), 0)
        ELSE 0
    END AS hybrid_score
FROM 
    bm25
FULL OUTER JOIN hnsw ON bm25.id = hnsw.id 
ORDER BY hybrid_score DESC;

Step 4: Wrap the query in a function and call it (optional)

Wrap the query from step 3 in a function to simplify calls.

Create the function.

CREATE OR REPLACE FUNCTION search_documents(
    table_name TEXT,
    vector_column TEXT,
    text_column TEXT,
    search_keyword TEXT,
    search_vector REAL[],
    limit_size INT,
    hnsw_weight FLOAT8 DEFAULT 0.8  -- Default weight for hnsw
)
RETURNS TABLE (
    id TEXT,
    docname TEXT,
    title TEXT,
    text TEXT,
    hybrid_score FLOAT8
) AS $$
DECLARE
    query_string TEXT;
    bm25_weight FLOAT8;
BEGIN
    bm25_weight := 1.0 - hnsw_weight;

    query_string := 'WITH t1 AS (
                            SELECT
                                id,
                                docname,
                                title,
                                ' || text_column || ',
                                ' || text_column || ' @@@ pgsearch.config(' || quote_literal(search_keyword) || ') AS score,
                                2 AS source
                            FROM
                                ' || table_name || '
                            ORDER BY score	
                            LIMIT ' || limit_size || '
                    ), t2 AS (
                            SELECT
                                id,
                                docname,
                                title,
                                ' || text_column || ',
                                cosine_similarity(' || vector_column || ', $1) AS score,
                                1 AS source
                            FROM
                                ' || table_name || '
                            ORDER BY ' || vector_column || ' <-> $1
                            LIMIT ' || limit_size || '
                    )
                    SELECT COALESCE(t1.id, t2.id), COALESCE(t1.docname, t2.docname), COALESCE(t1.title, t2.title),
                    COALESCE(t1.' || text_column || ', t2.' || text_column || '),
                    COALESCE(ABS(t1.score), 0.0) * ' || bm25_weight || ' + 
                    COALESCE(t2.score, 0.0) * ' || hnsw_weight || ' AS hybrid_score
                    FROM t1
                    FULL OUTER JOIN t2 ON t1.id = t2.id 
                    ORDER BY hybrid_score DESC;';
    
 RETURN QUERY EXECUTE query_string USING search_vector;
END; $$
LANGUAGE plpgsql;

Call the search_documents function to run the query.

SELECT * 
FROM search_documents(
    'documents', 
    'vector',
    'text', 
    'astronomical', 
    ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], 
    10,
    0.8
);
