Hybrid search
In text retrieval, hybrid search combines full-text search and vector search to deliver more accurate results than either method alone. Reciprocal Rank Fusion (RRF) is an algorithm that optimizes the final ranking by fusing results from multiple search systems. It calculates a combined score by summing the reciprocal ranks from each system. This topic describes how to perform a dual-path recall using full-text and vector search and then implement hybrid search with the RRF algorithm in AnalyticDB for MySQL.
Prerequisites
xuanwu_v1 table engine: Only `l2_distance` is supported. Your cluster must run kernel version 3.1.4.0 or later. The vector index feature is more stable on clusters that run kernel version 3.1.5.16, 3.1.6.8, 3.1.8.6, or a later version.
xuanwu_v2 table engine: `l2_distance` and `cosine_similarity` are supported. To use `l2_distance`, your cluster must run kernel version 3.2.6.0 or later. To use `cosine_similarity`, your cluster must run kernel version 3.2.7.0 or later.
The examples in this topic use cosine_similarity for vector search.
To view and update the minor version, go to the Configuration Information section on the Cluster Information page in the AnalyticDB for MySQL console.
Prepare data
Create a table and indexes
Create a document table that includes a text field, a vector field, and scalar filter fields. Then, create a full-text index and a vector index on the table.
DROP TABLE IF EXISTS documents; -- Create a table with text, vector, and scalar fields, plus full-text and vector indexes. CREATE TABLE documents ( id INT, text_field TEXT, -- Full-text search field float_feature ARRAY < FLOAT >(3), -- Vector field field1 INT, -- Scalar field field2 TEXT, -- Scalar field FULLTEXT INDEX idx_text_field(`text_field`) WITH ANALYZER ik, -- Use the IK analyzer ANN INDEX idx_float_feature(`float_feature`), PRIMARY KEY (id) ) DISTRIBUTED BY HASH(id);Import data
INSERT INTO documents (id, text_field, float_feature, field1, field2) VALUES (1, 'Customers need better products and services', '[2.5, 2.3, 2.4]', 1, 'flag1'), (2, 'Wuhan Yangtze River Bridge', '[2.6, 2.3, 2.4]', 2, 'flag1'), (3, 'Hangzhou, Zhejiang Province', '[2.7, 2.3, 2.4]', 3, 'flag1'), (4, 'The user value and business value of products', '[2.8, 2.3, 2.4]', 4, 'flag2');
Single-path recall
Vector search
SELECT id, similarity, ROW_NUMBER() OVER ( ORDER BY similarity DESC ) AS vec_rank FROM ( SELECT id, cosine_similarity(float_feature, '[2.8, 2.3, 2.4]') AS similarity FROM documents WHERE field1 > 1 ORDER BY similarity DESC LIMIT 100 ) inner_vecQuery results
id similarity vec_rank 4 1.0000001 1 3 0.99984056 2 2 0.999343 3Full-text search
SELECT id, text_field, MATCH(text_field) AGAINST ('product services') AS match_score, ROW_NUMBER() OVER ( ORDER BY MATCH(text_field) AGAINST ('product services') DESC ) AS txt_rank FROM documents WHERE MATCH(text_field) AGAINST ('product services') ORDER BY match_score DESC LIMIT 100;Query results
id text_field match_score txt_rank 1 Customers need better products and services 0.2615291476249695 1 4 The user value and business value of products 0.13076457381248474 2
Hybrid search
Use a common table expression (CTE) to perform full-text search and vector search separately. Then, use the RRF algorithm to fuse the results from both paths.
-- Step 1: Perform vector search with scalar filtering
WITH vector_search AS (
SELECT
id,
similarity,
ROW_NUMBER() OVER (
ORDER BY
similarity DESC
) AS vec_rank
FROM
(
SELECT
id,
cosine_similarity(float_feature, '[2.8, 2.3, 2.4]') AS similarity
FROM
documents
WHERE
field1 > 1
ORDER BY
similarity DESC
LIMIT
100
) inner_vec
),
-- Step 2: Perform full-text search
fulltext_search AS (
SELECT
id,
MATCH(text_field) AGAINST ('product services') AS match_score,
ROW_NUMBER() OVER (
ORDER BY
MATCH(text_field) AGAINST ('product services') DESC
) AS txt_rank
FROM
documents
WHERE
MATCH(text_field) AGAINST ('product services')
ORDER BY
match_score DESC
LIMIT
100
)
SELECT
COALESCE(fulltext_search.id, vector_search.id) AS doc_id,
-- Calculate the RRF score using the formula: sum(1 / (k + rank)), where the constant k is 60.
(
CASE WHEN fulltext_search.txt_rank IS NOT NULL THEN 1.0 / (60 + fulltext_search.txt_rank) ELSE 0 END
) + (
CASE WHEN vector_search.vec_rank IS NOT NULL THEN 1.0 / (60 + vector_search.vec_rank) ELSE 0 END
) AS rrf_score,
-- (Optional) Join with the original table to display additional fields.
documents.text_field,
documents.field1,
documents.field2
FROM
fulltext_search FULL
JOIN vector_search ON fulltext_search.id = vector_search.id
LEFT JOIN documents ON COALESCE(fulltext_search.id, vector_search.id) = documents.id
-- Sort by rrf_score in descending order and return the top 10 results.
ORDER BY
rrf_score DESC
LIMIT
10;Query results
doc_id rrf_score text_field field1 field2
4 0.032522474881015335801 The user value and business value of products 4 flag2
1 0.016393442622950819672 Customers need better products and services 1 flag1
3 0.016129032258064516129 Hangzhou, Zhejiang Province 3 flag1
2 0.015873015873015873016 Wuhan Yangtze River Bridge 2 flag1