Quick Start

更新时间:
复制 MD 格式

When keyword search with LIKE is too slow or lacks linguistic intelligence — no stemming, no relevance ranking, no index support — full-text search solves these problems. This guide walks you through full-text search on PolarDB for PostgreSQL (Compatible with Oracle) using the RUM extension: installing extensions, creating tables and indexes, and running ranked queries on both English and Chinese text.

Prerequisites

Before you begin, make sure you have:

  • A PolarDB for PostgreSQL (Compatible with Oracle) cluster

  • Database connection credentials with the CREATE EXTENSION privilege

Install extensions

Install the RUM extension for full-text search:

CREATE EXTENSION rum;

To support Chinese text, also install the pg_jieba extension for Chinese word segmentation:

CREATE EXTENSION pg_jieba;
Note

For version requirements, see pg_jieba and rum.

Create tables

Create a table for English text:

CREATE TABLE test_english(id serial, t text, d timestamp);

Create a table for Chinese text. The v column stores the pre-computed tsvector used for indexing and querying:

CREATE TABLE test_chinese(id serial, t text, d timestamp, v tsvector);

Insert data

The examples in this guide use the following test data.

English table (`test_english`)

id

t

d

1

The situation is most beautiful

2016-05-01 20:21:22

2

It is a beautiful

2016-05-01 21:21:22

3

It looks like a beautiful place

2016-05-01 22:21:22

4

It looks l a beautiful place

2016-05-02 00:21:22

5

It looks like a beautiful places

2016-05-02 02:21:22

6

It looks like a beaut places

2016-05-02 03:21:22

7

It looks like a pig

2016-05-02 03:21:22

Chinese table (`test_chinese`) — four rows of Chinese financial news text, all timestamped 2016-05-01 20:21:22.

id

t (excerpt)

d

1

According to Nancai Wealth Management data,5May public wealth management products...

2016-05-01 20:21:22

2

Among them,CITIC Bank Wealth Management"Anying Xiang Fixed Income Steady Profit..."...

2016-05-01 20:21:22

3

5The largest fundraising scale for new products launched in May was CITIC Bank Wealth Management...

2016-05-01 20:21:22

4

According to Nancai Wealth Management data,"Anying Xiang Fixed Income Steady Profit..."...

2016-05-01 20:21:22

Populate the English table:

INSERT INTO test_english(t,d) VALUES ('The situation is most beautiful', '2016-05-01 20:21:22.326724');
INSERT INTO test_english(t,d) VALUES ('It is a beautiful', '2016-05-01 21:21:22.326724');
INSERT INTO test_english(t,d) VALUES ('It looks like a beautiful place', '2016-05-01 22:21:22.326724');
INSERT INTO test_english(t,d) VALUES ('It looks l a beautiful place', '2016-05-02 00:21:22.326724');
INSERT INTO test_english(t,d) VALUES ('It looks like a beautiful places', '2016-05-02 02:21:22.326724');
INSERT INTO test_english(t,d) VALUES ('It looks like a beaut places', '2016-05-02 03:21:22.326724');
INSERT INTO test_english(t,d) VALUES ('It looks like a pig', '2016-05-02 03:21:22.326724');

Populate the Chinese table, then compute the tsvector column using the jiebacfg configuration for Chinese word segmentation:

INSERT INTO test_chinese(t,d) VALUES ('南财理财通数据显示,5月公募理财产品吸金榜前十名归属于5家机构,信银理财有5只产品上榜,南银理财占两席,民生理财、恒丰理财、浦银理财各占一席,上榜产品均为中低风险的封闭式固收类产品。从投资周期看,本月榜单产品中有6只产品投资周期为1-2年,3-6个月及6-12个月期限产品各有2只。', '2016-05-01 20:21:22.326724');
INSERT INTO test_chinese(t,d) VALUES ('其中,信银理财"安盈象固收稳利十四个月封闭式95号"理财产品夺得吸金榜冠军,募集规模达到106.204 亿元;民生理财"富竹固收优选14个月封闭39号"位列第二,募集规模为46.606亿元;信银理财"安盈象固收稳利十四个月封闭式94号"排名第三,募集规模为41.365 亿元。', '2016-05-01 20:21:22.326724');
INSERT INTO test_chinese(t,d) VALUES ('5月新发产品募集规模最大的是信银理财"安盈象固收稳利十四个月封闭式95号",实际募集资金高达106.204亿元,该系列产品已蝉联4个月榜首。', '2016-05-01 20:21:22.326724');
INSERT INTO test_chinese(t,d) VALUES ('南财理财通数据显示,"安盈象固收稳利十四个月封闭式95号"理财产品成立于2024年05月16日,将于2025年07月16日到期,风险评级PR2(中低风险),产品有A、C、D、K类份额,投资周期1-2年,业绩比较基准2.90%~3.45%。', '2016-05-01 20:21:22.326724');

-- Populate the tsvector column. Run this after each INSERT, or automate it with a trigger.
UPDATE test_chinese
SET v = to_tsvector('jiebacfg', t);
Note

The UPDATE above populates existing rows. In production, keep the v column current by running this UPDATE after each write, or by creating a trigger that calls to_tsvector('jiebacfg', t) automatically on every INSERT or UPDATE.

Create indexes

Create a RUM index on the English table using rum_text_ops, which indexes the raw text column directly:

CREATE INDEX ON test_english USING rum (t rum_text_ops);

Create a RUM index on the Chinese table using rum_tsvector_ops, which indexes the pre-computed tsvector column:

CREATE INDEX ON test_chinese USING rum (v rum_tsvector_ops);

Run queries

Two operators power full-text search queries:

Operator

Description

@@

Match operator. Returns true if a tsvector matches a tsquery.

<=>

Distance operator (RUM-specific). Returns a float score where a higher value means lower relevance (greater distance from the query terms).

Query the English table

Match rows containing `beautiful` or `place`:

SELECT * FROM test_english
WHERE to_tsvector('english', t) @@ to_tsquery('english', 'beautiful | place');

Output:

 id |                t                 |             d
----+----------------------------------+----------------------------
  1 | The situation is most beautiful  | 2016-05-01 20:21:22.326724
  2 | It is a beautiful                | 2016-05-01 21:21:22.326724
  3 | It looks like a beautiful place  | 2016-05-01 22:21:22.326724
  4 | It looks l a beautiful place     | 2016-05-02 00:21:22.326724
  5 | It looks like a beautiful places | 2016-05-02 02:21:22.326724
  6 | It looks like a beaut places     | 2016-05-02 03:21:22.326724

Match rows containing `beautiful` or `place`, ordered by distance score:

The <=> operator assigns a distance score to each matching row. Because a higher score means lower relevance, ordering by DESC places the least-relevant rows first. Rows that match only one query term receive a higher score than rows that match both terms.

SELECT *, to_tsvector('english', t) <=> to_tsquery('english', 'beautiful | place') AS rank
FROM test_english
WHERE to_tsvector('english', t) @@ to_tsquery('english', 'beautiful | place')
ORDER BY to_tsvector('english', t) <=> to_tsquery('english', 'beautiful | place') DESC;

Output:

 id |                t                 |             d              |  rank
----+----------------------------------+----------------------------+---------
  1 | The situation is most beautiful  | 2016-05-01 20:21:22.326724 | 16.4493
  2 | It is a beautiful                | 2016-05-01 21:21:22.326724 | 16.4493
  6 | It looks like a beaut places     | 2016-05-02 03:21:22.326724 | 16.4493
  3 | It looks like a beautiful place  | 2016-05-01 22:21:22.326724 | 8.22467
  4 | It looks l a beautiful place     | 2016-05-02 00:21:22.326724 | 8.22467
  5 | It looks like a beautiful places | 2016-05-02 02:21:22.326724 | 8.22467
(6 rows)

Rows 1, 2, and 6 each match only one query term (beautiful or place but not both), giving them a higher distance score (16.4493). Rows 3–5 match both terms, so their distance is lower (8.22467). To sort by most-relevant first, change DESC to ASC.

Query the Chinese table

Match rows containing `数据` or `理财`, ordered by distance score:

SELECT id, v <=> to_tsquery('jiebacfg', '数据 | 理财') as rank
FROM test_chinese
WHERE v @@ to_tsquery('jiebacfg', '数据 | 理财');

Output:

 id |   rank
----+-----------
  1 | 6.6024785
  2 |  12.08523
  3 |  16.44934
  4 |   8.22467

Match rows where `v` contains `数据`, ranked by similarity to `数据 | 理财`:

SELECT id, v <=> to_tsquery('jiebacfg', '数据 | 理财') as rank
FROM test_chinese
WHERE v @@ to_tsquery('jiebacfg', '数据');

Output:

 id |   rank
----+-----------
  1 | 6.6024785
  4 |   8.22467

Uninstall extensions (optional)

DROP EXTENSION rum CASCADE;
DROP EXTENSION pg_jieba CASCADE;

What's next

  • pg_jieba — Chinese word segmentation configuration and options

  • rum — RUM extension reference, including version requirements and operator details