倒排索引
当您需要快速检索云数据库 SelectDB 版中的大量文本数据时,本文档将指导您如何创建和使用倒排索引,以支持您的文本检索应用,并提高检索速度。
背景信息
倒排索引,是信息检索领域常用的索引技术。通过将文本分割成一个个词然后构建索引,以快速查找一个词在文档中出现的位置。云数据库 SelectDB 版支持倒排索引,可用于支持文本类型的全文检索、普通数值和日期类型的等值或范围查询,快速从大量数据中过滤出满足条件的数据。在SelectDB的倒排索引实现中,表的一行对应一个文档,一列对应文档中的一个字段,利用倒排索引可以根据关键词快速定位包含它的行,达到对WHERE子句加速的目的。
与SelectDB其他索引不同的是,倒排索引在存储层使用独立的文件,跟Segment文件有逻辑对应关系,但存储文件相互独立。因此可以做到在创建、删除索引时不用重写Tablet和Segment文件,大幅降低处理开销。
功能介绍
SelectDB倒排索引的功能包含如下。
支持字符串类型的全文检索。
支持字符串全文检索,包括同时匹配多个关键字
MATCH_ALL
、匹配任意一个关键字MATCH_ANY
、匹配短语词组MATCH_PHRASE
。支持字符串数组类型的全文检索。
支持英文、中文以及Unicode多语言分词。
加速普通等值,范围查询覆盖Bitmap索引的功能。
支持字符串、数值、日期时间类型的
=、!=、>、>=、<、<=
快速过滤。支持字符串、数字、日期时间数组类型的
=、!=、>、>=、<、<=
。
支持完善的逻辑组合。
新增索引对
OR、NOT
逻辑的下推。支持多个条件的任意
AND、OR、NOT
组合。
灵活、快速的索引管理。
支持在创建表上定义倒排索引。
支持在已有的表上增加倒排索引,而且支持增量构建倒排索引,无需重写表中的已有数据。
支持删除已有表上的倒排索引,无需重写表中的已有数据。
倒排索引管理
建表时定义索引
在创建表时定义倒排索引,示例如下。
CREATE TABLE <table_name>
(
columns_difinition,
INDEX idx_name1(column_name1) USING INVERTED [PROPERTIES("parser" = "english|unicode|chinese")] [COMMENT 'your comment']
INDEX idx_name2(column_name2) USING INVERTED [PROPERTIES("parser" = "english|unicode|chinese")] [COMMENT 'your comment']
INDEX idx_name3(column_name3) USING INVERTED [PROPERTIES("parser" = "chinese", "parser_mode" = "fine_grained|coarse_grained")] [COMMENT 'your comment']
INDEX idx_name4(column_name4) USING INVERTED [PROPERTIES("parser" = "english|unicode|chinese", "support_phrase" = "true|false")] [COMMENT 'your comment']
INDEX idx_name5(column_name4) USING INVERTED [PROPERTIES("char_filter_type" = "char_replace", "char_filter_pattern" = "._"), "char_filter_replacement" = " "] [COMMENT 'your comment']
INDEX idx_name5(column_name4) USING INVERTED [PROPERTIES("char_filter_type" = "char_replace", "char_filter_pattern" = "._")] [COMMENT 'your comment']
)
table_properties;
参数说明:
参数 | 必填 | 描述 |
USING INVERTED | 是 | 用于指定索引类型是倒排索引。 |
COMMENT | 否 | 用于指定注释。 |
PROPERTIES | 否 | 用于指定倒排索引的额外属性。详细参数说明,请参见PROPERTIES参数说明。 |
PROPERTIES参数说明:
参数名称 | 参数说明 |
parser | 指定分词器。默认不指定代表不分词。
|
parser_mode | 指定分词模式,默认coarse_grained。目前parser=chinese时支持如下模式:
|
support_phrase | 用于指定索引是否支持MATCH_PHRASE短语查询加速,默认false。
|
char_filter | 在分词前对字符串提前处理。目前char_filter_type仅支持char_replace。 char_replace将pattern中每个char替换为一个replacement中的char。
|
倒排索引在不同数据模型中有不同的使用限制:
Aggregate模型:只能为Key列建立倒排索引。
Unique模型:需要开启Merge on Write特性。开启后,可以为任意列建立倒排索引。
Duplicate模型:可以为任意列建立倒排索引。
已有表增加索引
增量数据生成倒排索引,语法如下。
-- 语法1,不进行分词
CREATE INDEX <index_name> ON table_name(column_name) USING INVERTED PROPERTIES("parser" = "english|unicode|chinese");
-- 语法2
ALTER TABLE <table_name> ADD INDEX index_name(column_name) USING INVERTED PROPERTIES("parser" = "english|unicode|chinese");
存量数据增加倒排索引,语法如下。
-- 语法1,默认给全表的存量数据加上倒排索引
BUILD INDEX <index_name> ON <table_name>;
-- 语法2,可指定partition,可指定一个或多个
BUILD INDEX <index_name> ON <table_name> PARTITIONS(<partition_name1>, <partition_name2>);
执行原数据添加倒排索引之前需要先执行增量数据生成倒排索引操作。
查看BUILD INDEX进展,语法如下。
SHOW BUILD INDEX [FROM db_name];
-- 示例1,查看所有的BUILD INDEX任务进展
SHOW BUILD INDEX;
-- 示例2,查看指定table的BUILD INDEX任务进展
SHOW BUILD INDEX where TableName = "table1";
取消BUILD INDEX,语法如下。
CANCEL BUILD INDEX ON <table_name>;
CANCEL BUILD INDEX ON <table_name> (job_id1,jobid_2,...);
删除索引
删除已有倒排索引,语法如下。
-- 语法1
DROP INDEX <index_name> ON <table_name>;
-- 语法2
ALTER TABLE <table_name> DROP INDEX <index_name>;
查看索引变更进度
ALTER TABLE相关索引变更操作是异步过程,可通过命令查看任务进度,语法如下。
SHOW ALTER TABLE COLUMN;
利用索引加速查询
利用倒排索引加速查询,示例如下。
-- 1. 全文检索关键词匹配,通过MATCH_ANY MATCH_ALL完成
SELECT * FROM table_name WHERE column_name MATCH_ANY | MATCH_ALL 'keyword1 ...';
-- 1.1 logmsg中包含keyword1的行
SELECT * FROM table_name WHERE logmsg MATCH_ANY 'keyword1';
-- 1.2 logmsg中包含keyword1或者keyword2的行,后面还可以添加多个keyword
SELECT * FROM table_name WHERE logmsg MATCH_ANY 'keyword1 keyword2';
-- 1.3 logmsg中同时包含keyword1和keyword2的行,后面还可以添加多个keyword
SELECT * FROM table_name WHERE logmsg MATCH_ALL 'keyword1 keyword2';
-- 1.4 logmsg中同时包含keyword1和keyword2的行,并且按照keyword1在前,keyword2在后的顺序
SELECT * FROM table_name WHERE logmsg MATCH_PHRASE 'keyword1 keyword2';
-- 2. 普通等值、范围、IN、NOT IN,正常的SQL语句即可,例如
SELECT * FROM table_name WHERE id = 123;
SELECT * FROM table_name WHERE ts > '2023-01-01 00:00:00';
SELECT * FROM table_name WHERE op_type IN ('add', 'delete');
分词函数
检查分词实际效果或者对一段文本进行分词的话,可以使用TOKENIZE
函数,示例如下。
SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"');
+-----------------------------------------------------------------------------------+
| tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') |
+-----------------------------------------------------------------------------------+
| ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"] |
+-----------------------------------------------------------------------------------+
1 row in set (0.02 sec)
SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="fine_grained"');
+--------------------------------------------------------------------------------------+
| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') |
+--------------------------------------------------------------------------------------+
| ["武汉", "武汉市", "市长", "长江", "长江大桥", "大桥"] |
+--------------------------------------------------------------------------------------+
1 row in set (0.02 sec)
SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"');
+----------------------------------------------------------------------------------------+
| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"') |
+----------------------------------------------------------------------------------------+
| ["武汉市", "长江大桥"] |
+----------------------------------------------------------------------------------------+
1 row in set (0.02 sec)
SELECT TOKENIZE('I love CHINA','"parser"="english"');
+------------------------------------------------+
| tokenize('I love CHINA', '"parser"="english"') |
+------------------------------------------------+
| ["i", "love", "china"] |
+------------------------------------------------+
1 row in set (0.02 sec)
SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"');
+-------------------------------------------------------------------+
| tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"') |
+-------------------------------------------------------------------+
| ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] |
+-------------------------------------------------------------------+
1 row in set (0.02 sec)
最佳实践
使用hackernews
100万条数据展示倒排索引的创建、全文检索、普通查询,并与无索引的查询性能进行简单对比。
创建表
创建数据库,示例如下。
CREATE DATABASE test_inverted_index;
切换到创建的数据库下,示例如下。
USE test_inverted_index;
创建数据表,示例如下。
CREATE TABLE hackernews_1m ( `id` BIGINT, `deleted` TINYINT, `type` String, `author` String, `timestamp` DateTimeV2, `comment` String, `dead` TINYINT, `parent` BIGINT, `poll` BIGINT, `children` Array<BIGINT>, `url` String, `score` INT, `title` String, `parts` Array<INT>, `descendants` INT, INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment' ) DUPLICATE KEY(`id`) DISTRIBUTED BY HASH(`id`) BUCKETS 10; -- 创建表的同时创建了comment的倒排索引idx_comment -- USING INVERTED 指定索引类型是倒排索引 -- PROPERTIES("parser" = "english") 指定采用english分词,还支持"chinese"中文分词和"unicode"中英文多语言混合分词,如果不指定"parser"参数表示不分词
导入数据
通过Stream Load导入数据。示例如下。
wget https://doris-build-1308700295.cos.ap-beijing.myqcloud.com/regression/index/hacknernews_1m.csv.gz curl --location-trusted -u root: -H "compress_type:gz" -T hacknernews_1m.csv.gz http://<IP地址>:<端口>/api/test_inverted_index/hackernews_1m/_stream_load { "TxnId": 2, "Label": "a8a3e802-2329-49e8-912b-04c800a461a6", "TwoPhaseCommit": "false", "Status": "Success", "Message": "OK", "NumberTotalRows": 1000000, "NumberLoadedRows": 1000000, "NumberFilteredRows": 0, "NumberUnselectedRows": 0, "LoadBytes": 130618406, "LoadTimeMs": 8988, "BeginTxnTimeMs": 23, "StreamLoadPutTimeMs": 113, "ReadDataTimeMs": 4788, "WriteDataTimeMs": 8811, "CommitAndPublishTimeMs": 38 }
执行SQL
count()
确认导入数据成功。示例如下。SELECT count() FROM hackernews_1m; +---------+ | count() | +---------+ | 1000000 | +---------+ 1 row in set (0.02 sec)
查询
全文检索
匹配comment中含有
OLAP
的行数。用LIKE匹配计算comment中含有
OLAP
的行数,耗时0.18s。SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%'; +---------+ | count() | +---------+ | 34 | +---------+ 1 row in set (0.18 sec)
用基于倒排索引的全文检索MATCH_ANY计算comment中含有
OLAP
的行数,耗时0.02s,加速9倍,在更大的数据集上效果会更加明显。说明数据结果条数的差异,是因为倒排索引对comment分词后,还会对词进行统一成小写等归一化处理,因此MATCH_ANY比LIKE的结果多一些。
SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP'; +---------+ | count() | +---------+ | 35 | +---------+ 1 row in set (0.02 sec)
匹配comment中含有
OLTP
的行数。用LIKE匹配计算comment中含有
OLTP
的行数,耗时0.07s。SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 48 | +---------+ 1 row in set (0.07 sec)
用基于倒排索引的全文检索MATCH_ANY计算comment中含有
OLTP
的行数,耗时0.01s,加速7倍,由于缓存的原因LIKE和MATCH_ANY都有提升,倒排索引仍然有7倍加速。SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP'; +---------+ | count() | +---------+ | 51 | +---------+ 1 row in set (0.01 sec)
匹配comment中含有
OLAP
和OLTP
两个词的行数。用LIKE匹配计算,耗时0.13s。
SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 14 | +---------+ 1 row in set (0.13 sec)
用基于倒排索引的全文检索
MATCH_ALL
计算。使用MATCH_ALL 'keyword1 keyword2 ...'
,耗时0.01s。加速12倍。SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; +---------+ | count() | +---------+ | 15 | +---------+ 1 row in set (0.01 sec)
普通等值、范围查询
DataTime类型的列范围查询。
SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ 1 row in set (0.03 sec)
为timestamp列增加一个倒排索引。
CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED; Query OK, 0 rows affected (0.03 sec) -- 对于日期时间类型USING INVERTED,不用指定分词 -- CREATE INDEX 是第一种建索引的语法,另外一种在后面展示 BUILD INDEX idx_timestamp ON hackernews_1m; Query OK, 0 rows affected (0.01 sec)
查看索引创建进度,通过FinishTime和CreateTime的差值,可以看到100万条数据对timestamp列建倒排索引只用了1s。
SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031 | 10008 | 1:1994690496 | 3 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ 1 row in set (0.00 sec) -- 若table没有分区,PartitionName默认就是TableName SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 10191 | hackernews_1m | hackernews_1m | [ADD INDEX idx_timestamp (`timestamp`) USING INVERTED], | 2023-06-26 15:32:33.894 | 2023-06-26 15:32:34.847 | 3 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ 1 row in set (0.04 sec)
索引创建后,范围查询用同样的查询方式,SelectDB会自动识别索引进行优化,但是这里由于数据量小性能差别不大。
SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00'; +---------+ | count() | +---------+ | 999081 | +---------+ 1 row in set (0.01 sec)
在数值类型的列parent进行类似timestamp的操作,这里查询使用等值匹配。
SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ 1 row in set (0.01 sec) -- 对于数值类型USING INVERTED,不用指定分词 -- ALTER TABLE t ADD INDEX 是第二种建索引的语法 ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED; Query OK, 0 rows affected (0.01 sec) BUILD INDEX idx_parent ON hackernews_1m; Query OK, 0 rows affected (0.01 sec) SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031 | 10008 | 1:1994690496 | 3 | FINISHED | | NULL | 2592000 | | 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054 | 10008 | 1:378856428 | 4 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ SHOW BUILD INDEX; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 11005 | hackernews_1m | hackernews_1m | [ADD INDEX idx_parent (`parent`) USING INVERTED], | 2023-06-26 16:25:10.167 | 2023-06-26 16:25:10.838 | 1002 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ 1 row in set (0.01 sec) SELECT count() FROM hackernews_1m WHERE parent = 11189; +---------+ | count() | +---------+ | 2 | +---------+ 1 row in set (0.01 sec)
对字符串类型的author建立不分词的倒排索引,等值查询也可以利用索引加速。
SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ 1 row in set (0.03 sec) -- 这里只用了USING INVERTED,不对author分词,整个当做一个词处理 ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED; Query OK, 0 rows affected (0.01 sec) BUILD INDEX idx_author ON hackernews_1m; Query OK, 0 rows affected (0.01 sec) -- 100万条author数据增量建索引仅消耗1.5s SHOW ALTER TABLE COLUMN; +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031 | 10008 | 1:1994690496 | 3 | FINISHED | | NULL | 2592000 | | 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054 | 10008 | 1:378856428 | 4 | FINISHED | | NULL | 2592000 | | 10076 | hackernews_1m | 2023-02-10 19:54:20.046 | 2023-02-10 19:54:21.521 | hackernews_1m | 10077 | 10008 | 1:1335127701 | 5 | FINISHED | | NULL | 2592000 | +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+ SHOW BUILD INDEX order by CreateTime desc limit 1; +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | JobId | TableName | PartitionName | AlterInvertedIndexes | CreateTime | FinishTime | TransactionId | State | Msg | Progress | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ | 13006 | hackernews_1m | hackernews_1m | [ADD INDEX idx_author (`author`) USING INVERTED], | 2023-06-26 17:23:02.610 | 2023-06-26 17:23:03.755 | 3004 | FINISHED | | NULL | +-------+---------------+---------------+----------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+ 1 row in set (0.01 sec) -- 创建索引后,字符串等值匹配也有明显加速 SELECT count() FROM hackernews_1m WHERE author = 'faster'; +---------+ | count() | +---------+ | 20 | +---------+ 1 row in set (0.01 sec)
任意出现'OLAP'和'OLTP'其中一个词,0.12s vs 0.01s,12倍加速。
只要求多个词任意一个或多个出现时(sOR关系)使用
MATCH_ANY 'keyword1 keyword2 ...'
SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 68 | +---------+ 1 row in set (0.12 sec) SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP'; +---------+ | count() | +---------+ | 71 | +---------+ 1 row in set (0.01 sec)
- 本页导读 (1)