Hybrid search-PolarDB(PolarDB)-阿里云帮助中心

PolarDB for PostgreSQL and support multiple retrieval methods, including dense search, sparse search, and hybrid search.

Background

Dense search: Uses semantic context to understand the meaning behind a query.
Sparse search: Emphasizes text matching to find results based on specific terms. This is equivalent to full-text search.
Hybrid search: Combines the strengths of dense search and sparse search to capture both full context and specific keywords, delivering comprehensive search results.

Prepare data

Use a privileged account to create the extensions required for search.
```
CREATE EXTENSION IF NOT EXISTS pg_jieba;
CREATE EXTENSION IF NOT EXISTS rum;
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS polar_ai;
```
The extensions provide the following features:
- pg_jieba (Chinese word segmentation): Performs Chinese word segmentation.
- rum (full-text search acceleration): Supports full-text search and relevance sorting.
- vector (vector search): Supports vector search.
- polar_ai: Lets you create models for text vectorization.

Create a table and insert test data.

CREATE TABLE t_chunk(id serial, chunk text, embedding vector(1536), v tsvector);

INSERT INTO t_chunk(chunk) VALUES('大模型产品解决方案文档与社区权益中心定价云市场合作伙伴支持与服务了解阿里云首页云原生数据库 PolarDB云原生数据库 PolarDBPolarDB是阿里巴巴自研的新一代云原生数据库，在计算存储分离架构下，利用了软硬件结合的优势，为用户提供具备极致弹性、高性能、海量存储、安全可靠的数据库服务。100%兼容MySQL和PostgreSQL生态，高度兼容Oracle语法。'); 
INSERT INTO t_chunk(chunk) VALUES('PolarDB产品系列介绍快速入门价格计算器资源包规格计算器相关技术圈PolarDB引擎版本PolarDB MySQL版多主多写、多活容灾、HTAP特性，交易性能高达开源数据库的6倍，分析性能高达开源数据库的400倍，TCO 低于自建数据库50%'); 
INSERT INTO t_chunk(chunk) VALUES('PolarDB分布式版高吞吐、大存储、低延时、易扩展和超高可用的云时代数据库服务PolarDB PostgreSQL版快速弹性、高性能、海量存储、安全可靠的数据库服务，支持阿里云自研Ganos多维多模时空信息引擎及开源PostGIS地理信息引擎PolarDB PostgreSQL版（兼容Oracle）存储计算分离架构，软硬件结合，提供极致弹性、高性能、海量存储、安全可靠的数据库服务');
INSERT INTO t_chunk(chunk) VALUES('体验教程免费云资源，真实云环境，沉浸式所见即所得体验丰富实践场景与解决方案PolarDB MySQL Serverless极致弹性体验Serverless动态弹性升降资源体验弹性过程中的无感伸缩和秒级弹升体验并观测Serverless的性能和价格力PolarDB MySQL无感秒切无需购买任何资源，真正免费体验在线体验热备“无感”“秒切”的效果'); 
INSERT INTO t_chunk(chunk) VALUES('PolarDB MySQL列存索引IMCI体验IMCI对复杂查询的加速作用体验IMCI在OLAP场景下高效处理能力PolarDB MySQL弹性并行查询ePQ体验ePQ带来的查询加速体验ePQ带来的性能提升体验ePQ如何集群整体资源利用率PolarDB PostgreSQL Serverless极致弹性体验Serverless动态弹性升降资源体验弹性过程中的无感伸缩和秒级弹升体验并观测Serverless的性能和价格力');
INSERT INTO t_chunk(chunk) VALUES('PolarDB PostgreSQL一站式HTAP体验主节点与分析节点之间的实时数据同步体验一站式HTAP下处理复杂查询SQL的高效能PolarDB-X透明分布式体验PolarDB-X的Auto Partitioning能力体验自动化拆分和分布式Online DDL能力体验数据回流及数据一致性的特点智能SQL转换领航助手体验基于规则的智能转换Prompt体验基于LLM的交互式智能订正Oracle迁移至PolarDB场景下的智能SQL转换助手RDS MySQL迁移至PolarDB MySQL版解决问题：容量过大，存不下解决问题：多副本复制延迟解决问题：增减节点耗时长Serverless高可用架构卓越效能极简运维');
INSERT INTO t_chunk(chunk) VALUES('为什么选择阿里云什么是云计算全球基础设施技术领先稳定可靠安全合规分析师报告产品和定价全部产品免费试用');
INSERT INTO t_chunk(chunk) VALUES('产品动态产品定价配置报价器云上成本管理解决方案技术解决方案文档与社区文档开发者社区天池大赛培训与认证权益中心免费试用高校计划企业扶持计划推荐返现计划支持与服务基础服务企业增值服务迁云服务官网公告健康看板信任中心');

Generate vector data. You can create a custom model and call it to convert text into vectors. This example uses the text_embedding_v2 model provided by Alibaba Cloud Model Studio.

Note
Currently, only PolarDB for PostgreSQL Standard Edition clusters in the China (Beijing) region can call the built-in model text_embedding_v2. For more information, see How to quickly perform text vectorization.
1. Bind an API key
  
  Before you call a built-in model for the first time, you must first go to Alibaba Cloud Model Studio to activate the service and obtain an API key. Then, execute the following SQL command to bind your API key to the specified model. A return value of t indicates success, and f indicates failure.
```
SELECT polar_ai.AI_SetModelToken('_dashscope/text_embedding/text_embedding_v2', '<YOUR_API_KEY>');
```
2. Embed text with SQL
  
  Call the AI_Text_Embedding function to perform text vectorization.
```
-- Perform embedding
UPDATE t_chunk SET embedding = polar_ai.ai_text_embedding(chunk);
```

Create the indexes required for search.

Create a vector index. This example uses L2 distance, which you can change as needed.
```
CREATE INDEX ON t_chunk using hnsw(embedding vector_l2_ops);
```

Create a full-text index.

UPDATE  t_chunk SET v = to_tsvector('jiebacfg', chunk);

CREATE INDEX ON t_chunk USING rum (v rum_tsvector_ops);

Search

Hybrid search

Merge the results from both query methods to perform multi-channel recall.

WITH t AS (
SELECT chunk, embedding <-> polar_ai.ai_text_embedding('PolarDB有哪些产品系列')::vector(1536) as dist
FROM t_chunk
ORDER by dist ASC
limit 5 ),
t2 as (
  SELECT chunk, v <=> to_tsquery('jiebacfg', 'PolarDB|PostgreSQL版|快速') as rank
FROM t_chunk 
WHERE v @@ to_tsquery('jiebacfg', 'PolarDB|PostgreSQL版|快速')
ORDER by rank ASC
LIMIT 5
)
SELECT * FROM t
UNION ALL
SELECT * FROM t2;

Because the distance calculation methods for these two search types are different, their scores cannot be directly compared. To solve this, you can use Reciprocal Rank Fusion (RRF) to combine and re-rank the results. RRF merges multiple result sets from different search methods into a single list. It does not require tuning and produces high-quality results even when the relevance metrics of the different methods are not correlated. The basic steps are as follows:

Collect ranked lists

Multiple retrievers (each representing a recall channel) generate separate ranked lists of results for a given query.
Fuse ranks

RRF uses a simple scoring function to combine the ranks from each list. The RRF score for each document is calculated using the following formula:

$RRF_score (d) = i = 1 \sum N \frac{1}{k + r an k _{i} ( d )}$

Where $N$ is the number of different recall paths, $r an k_{i} (d)$ is retriever $i$ 's rank for document $d$ , and $k$ is a smoothing parameter, typically set to 60.
Re-rank results

Re-rank the documents based on their combined RRF scores to produce the final result list.

In this query, if you are not satisfied with the result order, adjust the parameter $k$ to change it. The system combines results from a full-text search and a vector search. Based on the $k$ parameter passed in the query, the system retrieves the $t o p K$ results from each search. It then scores each returned document using the formula $\frac{1}{k + r ank ( i )}$ . In this formula, $r ank (i)$ is the document's rank at position $i$ . If a document from the full-text $t o p K$ results does not appear in the vector search $t o p K$ results, it receives a single score. The same applies to documents that appear only in the vector search results. If a document appears in the $t o p K$ result sets of both searches, their scores are added together.

Note

The smoothing parameter $k$ determines how much documents in a single result set for each query affect the final ranking. The higher the value, the greater the impact that lower-ranked documents have on the final ranking.

-- 密集向量召回
WITH t1 as 
(
SELECT chunk, embedding <-> polar_ai.ai_text_embedding('PolarDB有哪些产品系列')::vector(1536) as dist
FROM t_chunk
ORDER by dist ASC
limit 5
),
t2 as (
SELECT ROW_NUMBER() OVER (ORDER BY dist ASC) AS row_num,
chunk
FROM t1
),
-- 稀疏向量召回
t3 as 
(
  SELECT chunk, v <=> to_tsquery('jiebacfg', 'PolarDB|PostgreSQL版|快速') as rank
  FROM t_chunk 
  WHERE v @@ to_tsquery('jiebacfg', 'PolarDB|PostgreSQL版|快速')
  ORDER by rank ASC
  LIMIT 5
),
t4 as (  
SELECT ROW_NUMBER() OVER (ORDER BY rank DESC) AS row_num,
chunk
FROM t3
),
-- Calculate RRF scores separately
t5 AS (
SELECT 1.0/(60+row_num) as score, chunk FROM t2
UNION ALL 
SELECT 1.0/(60+row_num), chunk FROM t4
)
-- 评分进行合并
SELECT sum(score) as score, chunk
FROM t5
GROUP BY chunk
ORDER BY score DESC;

Apply weights

You can also assign different weights to each result set. For example, you can assign a weight of 0.8 to the dense search results and 0.2 to the sparse search results.

-- 密集向量召回
WITH t1 as 
(
SELECT chunk, embedding <-> polar_ai.ai_text_embedding('PolarDB有哪些产品系列')::vector(1536) as dist
FROM t_chunk
ORDER by dist ASC
limit 5
),
t2 as (
SELECT ROW_NUMBER() OVER (ORDER BY dist ASC) AS row_num,
chunk
FROM t1
),
-- 稀疏向量召回
t3 as 
(
  SELECT chunk, v <=> to_tsquery('jiebacfg', 'PolarDB|PostgreSQL版|快速') as rank
  FROM t_chunk 
  WHERE v @@ to_tsquery('jiebacfg', 'PolarDB|PostgreSQL版|快速')
  ORDER by rank ASC
  LIMIT 5
),
t4 as (  
SELECT ROW_NUMBER() OVER (ORDER BY rank DESC) AS row_num,
chunk
FROM t3
),
-- Calculate RRF scores separately with weights 0.8 and 0.2
t5 as (
SELECT (1.0/(60+row_num)) * 0.8 as score , chunk FROM t2
UNION ALL 
SELECT (1.0/(60+row_num)) * 0.2, chunk FROM t4
)
-- 评分进行合并
SELECT sum(score) as score, chunk
FROM t5
GROUP BY chunk
ORDER BY score DESC;

Dense search

Performs a search based only on vectors, where a smaller distance indicates higher semantic similarity.

SELECT chunk, embedding <-> polar_ai.ai_text_embedding('PolarDB有哪些产品系列')::vector(1536) as dist
FROM t_chunk
ORDER by dist ASC
limit 5;

Sparse search

Performs a search based only on full-text matching, where a smaller rank value indicates higher relevance.

SELECT chunk, v <=> to_tsquery('jiebacfg', 'PolarDB|PostgreSQL版|快速') as rank
FROM t_chunk 
WHERE v @@ to_tsquery('jiebacfg', 'PolarDB|PostgreSQL版|快速')
ORDER by rank ASC
LIMIT 5;