Use full-text inverted indexes for full-text search in Hologres

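Starting with V4.0, Hologres supports full-text inverted indexes. The feature is built on Tantivy, a high-performance full-text search engine, and supports the BM25 similarity scoring algorithm. It provides document ranking, keyword search, phrase search, and related capabilities.

How it works

After the search source text is written to Hologres, Hologres builds a full-text inverted index file for each data file according to the index configuration. A tokenizer first splits the text into tokens. The index then records, for each token, its mapping to the search source text together with position, term frequency, and related information.

When you run a search, the search expression is tokenized first to obtain a set of query tokens. The BM25 algorithm then computes a relevance score between each search source text and the set of query tokens, which makes the search both fast and accurate.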
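For reference, the relevance score described above can be written in the textbook Okapi BM25 form below. This is the standard formula rather than something stated in this document: the symbols and the commonly used defaults $k_1 = 1.2$ and $b = 0.75$ are assumptions, and Hologres may use a variant.

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n(t) + 0.5}{n(t) + 0.5}\right)
```

Here $f(t, D)$ is the frequency of token $t$ in document $D$, $|D|$ is the document length in tokens, $\mathrm{avgdl}$ is the average document length, $N$ is the number of documents, and $n(t)$ is the number of documents that contain $t$. Because $N$, $n(t)$, and $\mathrm{avgdl}$ are collection statistics, a score computed over a small set of documents can differ from one computed over a larger set.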

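Usage notes

Full-text inverted indexes are supported only on column-oriented tables and row-column hybrid tables in Hologres V4.0 and later. Row-oriented tables are not supported.
Full-text inverted indexes can be created only on columns of type TEXT, CHAR, or VARCHAR.
A full-text inverted index covers a single column, and each column can have only one full-text inverted index. To index multiple columns, create one index per column.
After you create a full-text inverted index, the index files are built asynchronously as part of Compaction. Until the index files are built, the BM25 relevance score of the data is 0.
Full-text search can run only on columns that have a full-text index. Brute-force computation on unindexed columns is not supported.
We recommend that you use Serverless Computing resources for batch data imports. Serverless resources complete Compaction and full-text index building synchronously during the import. For details, see "Use Serverless Computing to run read and write tasks" and "Use Serverless Computing to run Compaction tasks". If you do not use Serverless resources, we recommend that you manually run the following command to trigger Compaction after a batch import or after modifying an index.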
```
VACUUM <schema_name>.<table_name>;
```
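BM25 relevance scores are computed per file. If you import only a small amount of data, we recommend triggering Compaction manually as needed so that files are merged and search accuracy improves.
Full-text search queries can run on Serverless Computing resources.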
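The notes above can be sketched as one short workflow. The table docs, its columns, the index names, and the public schema are placeholders chosen for illustration, and the sketch assumes docs is created as a column-oriented table.

```sql
-- Hypothetical table; all names here are placeholders.
CREATE TABLE docs (id int, title text, body text);

-- One full-text inverted index per column: two columns need two indexes.
CREATE INDEX docs_title_ft ON docs USING FULLTEXT (title);
CREATE INDEX docs_body_ft  ON docs USING FULLTEXT (body);

-- ... batch import into docs without Serverless Computing ...

-- Trigger Compaction manually so the index files are built
-- and small files are merged before BM25 scores are relied on.
VACUUM public.docs;
```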

Manage indexes

Create an index

Syntax

CREATE INDEX [ IF NOT EXISTS ] <idx_name> ON <table_name>
       USING FULLTEXT (<column_name>)
       [ WITH ( <storage_parameter> = '<storage_value>' [ , ... ] ) ];

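Parameters

Parameter	Description
idx_name	Index name.
table_name	Target table name.
column_name	Name of the column on which to build the full-text inverted index.
storage_parameter	Full-text inverted index parameter. Two parameters are supported: tokenizer: the tokenizer name. analyzer_params: the tokenizer configuration, which must be a JSON string. Every tokenizer has a default analyzer_params configuration. In most cases, use the default, that is, specify only tokenizer and do not set analyzer_params explicitly. Note: an index supports only one tokenizer and one analyzer_params setting.
storage_value	Value of the full-text inverted index parameter. When storage_parameter is tokenizer, the valid values are: jieba (default): a Chinese tokenizer that combines rule-based matching with a statistical model. whitespace: splits on whitespace. standard: splits according to Unicode Standard Annex #29. simple: splits on whitespace and punctuation. keyword: a no-op tokenizer that keeps the input unchanged; suited to term queries. icu: a tokenizer for multilingual text. When storage_parameter is analyzer_params, parts of the configuration can be customized. For details, see "Advanced: customize the tokenizer configuration".

Examples

Build a full-text inverted index with the default tokenizer and configuration, that is, the jieba tokenizer.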
```
CREATE INDEX idx1 ON tbl 
       USING FULLTEXT (col1);
```

Explicitly specify the standard tokenizer with its default configuration.

CREATE INDEX idx1 ON tbl 
       USING FULLTEXT (col1)
       WITH (tokenizer = 'standard');
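The syntax also allows analyzer_params to be set together with tokenizer. A minimal sketch, reusing the JSON shown for jieba exact mode later in this document; idx2 and col2 are placeholder names, and whether keys omitted from the JSON (such as filter) fall back to the tokenizer defaults is not stated here, so verify the result with TOKENIZE before relying on it.

```sql
-- Placeholder names: idx2, tbl, col2.
-- Build the index with jieba in exact mode instead of the default search mode.
CREATE INDEX idx2 ON tbl
       USING FULLTEXT (col2)
       WITH (tokenizer = 'jieba',
             analyzer_params = '{"tokenizer": {"type": "jieba", "mode": "exact"}}');
```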

Modify an index

Syntax

-- Modify index settings
ALTER INDEX [ IF EXISTS ] <idx_name> SET ( <storage_parameter> = '<storage_value>' [ , ... ] );

-- Restore default settings
ALTER INDEX [ IF EXISTS ] <idx_name> RESET ( <storage_parameter> [ , ... ] );

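Parameters

For parameter details, see the Parameters table under "Create an index".

Examples

Note

After you modify a full-text inverted index, the index files are rebuilt asynchronously as the data goes through Compaction. We recommend that you manually run VACUUM <schema_name>.<table_name>; after modifying an index to trigger Compaction synchronously. For details, see Compaction.

Change the index tokenizer to standard.

ALTER INDEX idx1 SET (tokenizer = 'standard');

Restore the default tokenizer, jieba, with jieba's default analyzer_params configuration.

ALTER INDEX idx1 RESET (tokenizer);
ALTER INDEX idx1 RESET (tokenizer, analyzer_params);

Restore the default analyzer_params configuration of the current tokenizer.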
```
ALTER INDEX idx1 RESET (analyzer_params);
```
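Putting the note and the SET syntax together, a change of tokenizer configuration followed by an immediate rebuild might look like the sketch below. The analyzer_params value is the exact-mode JSON used elsewhere in this document; the public schema and the table name tbl are assumptions.

```sql
-- Change both tokenizer and analyzer_params in one statement.
ALTER INDEX idx1 SET (tokenizer = 'jieba',
                      analyzer_params = '{"tokenizer": {"type": "jieba", "mode": "exact"}}');

-- Trigger Compaction so the index files are rebuilt with the new settings.
-- Assumes tbl is in the public schema.
VACUUM public.tbl;
```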

Drop an index

Syntax

DROP INDEX [ IF EXISTS ] <idx_name> [ RESTRICT ];

Parameters

For parameter details, see the Parameters table under "Create an index".
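A minimal instance of the syntax above, using the index name idx1 from the earlier examples:

```sql
-- Drop idx1; IF EXISTS avoids an error when the index is already gone.
DROP INDEX IF EXISTS idx1;
```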

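View indexes

Hologres provides the hologres.hg_index_properties system table, which lists the full-text inverted indexes that have been created and where they are located.

SELECT * FROM hologres.hg_index_properties;

Run the following SQL to find the table and column that an index belongs to.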

SELECT 
    t.relname AS table_name, 
    a.attname AS column_name
FROM pg_class t
    JOIN pg_index i ON t.oid = i.indrelid
    JOIN pg_class idx ON i.indexrelid = idx.oid
    JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = ANY(i.indkey)
WHERE t.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = '<namespace>')
    AND idx.relname = '<indexname>'
LIMIT 1;

Parameters:

namespace: the value of the table_namespace field returned by SELECT * FROM hologres.hg_index_properties;.
indexname: the name of the index that you created.
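As an illustration, the lookup can be narrowed to one schema first. Only the table_namespace field is confirmed by this document; the value 'public' is an assumed schema name.

```sql
-- List the indexes recorded for one schema.
-- 'public' is an assumed value; use the schema your tables live in.
SELECT *
FROM hologres.hg_index_properties
WHERE table_namespace = 'public';
```

The table_namespace value found this way is what replaces <namespace> in the catalog query above.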

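Full-text search with an index

Hologres supports several search modes so that you can tailor full-text search to your business logic.

Search mode	Description
Keyword match	Searches by the keywords obtained from tokenizing the search expression. The relationship between keywords can be AND or OR.
Phrase search	Searches by a phrase in the search expression. A document matches only if the distance between the words meets the requirement.
Natural language search	Lets you define complex query conditions freely, such as AND/OR relationships, required and excluded terms, and phrases.
Term search	Searches for the search expression exactly. A document matches only if the index contains the query string exactly.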

The TEXT_SEARCH search function

TEXT_SEARCH computes the BM25 relevance score of the search source text against the search expression.

Syntax

TEXT_SEARCH (
  <search_data> TEXT/VARCHAR/CHAR
  ,<search_expression> TEXT
  [ ,<mode> TEXT DEFAULT 'match'
  ,<operator> TEXT DEFAULT 'OR'
  ,<tokenizer> TEXT DEFAULT ''
  ,<analyzer_params> TEXT DEFAULT ''
  ,<options> TEXT DEFAULT '']
)

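Parameters

Parameter	Required	Description
search_data	Yes	The search source. Supported data types: TEXT, VARCHAR, CHAR. Only a column is accepted, and the column must have a full-text index; otherwise an error is reported.
search_expression	Yes	The search expression. Supported data types: TEXT, VARCHAR, CHAR. Only a constant is accepted.
mode	No	The search mode. Supported modes: match (default): keyword match. Each token is a keyword; the relationship between keywords is set by the operator parameter and defaults to OR. phrase: phrase search. The allowed distance between the words of the phrase is set with slop in the options parameter; the default is 0, meaning the words must be adjacent. natural_language: natural language search. Supports complex query conditions such as AND/OR keywords, required terms, excluded terms, and phrases. For details, see Tantivy. term: term search. search_expression is not tokenized or otherwise processed; it is matched exactly against the index.
operator	No	The logical operator between keywords. Takes effect only when mode is match. Valid values: OR (default): when the search expression yields multiple tokens, a document matches if any token matches. AND: when the search expression yields multiple tokens, a document matches only if all tokens match.
tokenizer, analyzer_params	No	The tokenizer and configuration applied to search_expression. You usually do not need to set them. When unspecified, the tokenizer and configuration of the full-text inverted index on the search_data column are used; if the search source is a constant, the default tokenizer jieba is used. When specified, search_expression is tokenized with the specified tokenizer and configuration.
options	No	Other full-text search parameters, in the format `'key1=v1;key2=v2;....;keyN=vN;'`. Currently only slop is supported, and it takes effect only when mode is phrase. slop must be 0 (default) or a positive integer and defines the tolerated distance between the words of the phrase. Note: slop is the maximum allowed gap (or transposition cost) between the words of the phrase. For tokenizers such as jieba, keyword, and icu, the gap is measured in characters, not tokens. For tokenizers such as standard, simple, and whitespace, the gap is measured in tokens.

Return value

Returns a non-negative FLOAT: the BM25 relevance score between the search source and the search expression. Higher relevance yields a higher score. The score is 0 when the texts are completely unrelated.

Examples

Use keyword match mode and change the operator to AND.

-- Recommended: pass arguments by name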
SELECT TEXT_SEARCH (content, 'machine learning', operator => 'AND') FROM tbl;

-- Without argument names, arguments must be given explicitly in positional order
SELECT TEXT_SEARCH (content, 'machine learning', 'match', 'AND') FROM tbl;

Use phrase search mode with slop set to 2.

SELECT TEXT_SEARCH (content, 'machine learning', 'phrase', options => 'slop=2;') FROM tbl;

Use natural language search mode.

-- Define the matching logic with the AND and OR operators
SELECT TEXT_SEARCH (content, 'machine AND (system OR recognition)', 'natural_language') FROM tbl;

-- Define the matching logic with + (required term) and - (excluded term)
SELECT TEXT_SEARCH (content, '+learning -machine system', 'natural_language') FROM tbl;
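The examples above do not cover term mode. A sketch in the same named-argument style follows; the search string is an arbitrary choice.

```sql
-- Term search: search_expression is not tokenized or filtered,
-- it is looked up in the index exactly as written.
SELECT TEXT_SEARCH (content, 'learning', mode => 'term') FROM tbl;
```

Because the expression is not processed, it presumably has to equal a token as stored in the index. Under the default jieba configuration, which lowercases and stems tokens, the stored form may differ from the original word. This is an inference from the parameter description, so check the stored tokens with TOKENIZE when a term query returns nothing.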

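The TOKENIZE tokenization function

TOKENIZE outputs the tokens produced by a given tokenizer configuration, so that you can debug how a full-text inverted index tokenizes text.

Syntax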

TOKENIZE (
  <search_data> TEXT
  [ ,<tokenizer> TEXT DEFAULT ''
  ,<analyzer_params> TEXT DEFAULT '']
)

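Parameters

search_data: required. The text to tokenize. Constants are accepted.
tokenizer, analyzer_params: optional. The tokenizer and configuration to apply to search_data. The jieba tokenizer is used by default.

Return value

Returns the set of tokens for the input text as a TEXT array.

Verify index usage

Check the execution plan to see whether a SQL statement uses the full-text inverted index. If Fulltext Filter appears in the plan, the index is used. For details about execution plans, see EXPLAIN and EXPLAIN ANALYZE.

Sample SQL: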

EXPLAIN ANALYZE SELECT * FROM wiki_articles WHERE text_search(content, '长江') > 0;

The execution plan is as follows. It contains a Fulltext Filter entry, which indicates that the statement uses the full-text inverted index.

QUERY PLAN
Gather  (cost=0.00..1.00 rows=1 width=12)
  ->  Local Gather  (cost=0.00..1.00 rows=1 width=12)
        ->  Index Scan using Clustering_index on wiki_articles  (cost=0.00..1.00 rows=1 width=12)
              Fulltext Filter: (text_search(content, search_expression => '长江'::text, mode => match, operator => OR, tokenizer => jieba, analyzer_params => {"filter":["removepunct","lowercase",{"stop_words":["_english_"],"type":"stop"},{"language":"english","type":"stemmer"}],"tokenizer":{"hmm":true,"mode":"search","type":"jieba"}}, options => ) > '0'::double precision)
Query Queue: init_warehouse.default_queue
Optimizer: HQO version 4.0.0

Examples

Prepare data

Run the following SQL to create a test table and insert data.

-- Create the table
CREATE TABLE wiki_articles (id int, content text);

-- Create the index
CREATE INDEX ft_idx_1 ON wiki_articles
       USING FULLTEXT (content)
       WITH (tokenizer = 'jieba');

-- Insert data
INSERT INTO wiki_articles VALUES
  (1, '长江是中国第一大河，世界第三长河，全长约6,300公里。'),
  (2, 'Li was born in 1962 in Wendeng County, Shandong.'),
  (3, 'He graduated from the department of physics at Shandong University.'),
  (4, '春节，即农历新年，是中国最重要的传统节日。'),
  (5, '春节通常在公历1月下旬至2月中旬之间。春节期间的主要习俗包括贴春联、放鞭炮、吃年夜饭、拜年等。'),
  (6, '2006年，春节被国务院批准为第一批国家级非物质文化遗产。'),
  (7, 'Shandong has dozens of universities.'),
  (8, 'ShanDa is a famous university of Shandong.');

-- Compaction
VACUUM wiki_articles;

-- Query the table
SELECT * FROM wiki_articles LIMIT 1;

Sample result:

id |                       content                       
---+---------------------------------------------------
 1 | 长江是中国第一大河，世界第三长河，全长约6,300公里。

Examples of different search modes

Keyword match.

-- (K1) Keyword match (default operator = OR): documents containing 'shandong' or 'university' match.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0;

-- Result
 id |                               content                               
----+---------------------------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.

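-- (K2) Keyword match (operator = AND): a document matches only if it contains both 'shandong' and 'university'.
-- Row 7 matches because the default jieba configuration stems both 'university' and 'universities' to 'univers'.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', operator => 'AND') > 0;

-- Result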
 id |                               content                               
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.

Phrase search.

-- (P1) Phrase search (default slop = 0): matches only when 'university' immediately follows 'shandong'.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase') > 0;

-- Result
 id |                               content                               
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  
-- (P2) Phrase search with slop = 14: 'shandong' and 'university' may be up to 14 characters apart, so "Shandong has dozens of universities." also matches.
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=14;') > 0;

-- Result
 id |                               content                               
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.

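-- (P3) Phrase search also matches phrases whose words appear out of order, but slop is computed differently and must be larger than in the in-order case.
-- 'university of Shandong' therefore matches the query below with slop = 23, but would not match with slop = 22.
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=23;') > 0;

-- Result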
 id |                               content                               
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.

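-- (P4) Punctuation is ignored (jieba tokenizer in this example).
-- The row matches even though the text has a comma between 长河 and 全长 while the query string has a full stop.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '长河。全长', mode => 'phrase') > 0;

-- Result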
 id |                       content                       
----+-----------------------------------------------------
  1 | 长江是中国第一大河，世界第三长河，全长约6,300公里。

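Natural language search.

-- (N1) Natural language search: with no operators or symbols, the query behaves like keyword match. Equivalent to (K1).
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, 'shandong university', 'natural_language') > 0;

-- Result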
 id |                               content                               
----+---------------------------------------------------------------------
  7 | Shandong has dozens of universities.
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  8 | ShanDa is a famous university of Shandong.

-- (N2) Natural language search, keyword match: a document matches if it contains both 'shandong' and 'university', or if it contains '文化'. AND has higher precedence than OR.
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, '(shandong AND university) OR 文化', 'natural_language') > 0;
-- Equivalent to
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, 'shandong AND university OR 文化', 'natural_language') > 0;
-- Equivalent to
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, '(+shandong +university) 文化', 'natural_language') > 0;

-- Result
 id |                               content                               
----+---------------------------------------------------------------------
  8 | ShanDa is a famous university of Shandong.
  7 | Shandong has dozens of universities.
  3 | He graduated from the department of physics at Shandong University.
  6 | 2006年，春节被国务院批准为第一批国家级非物质文化遗产。

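-- (N3) Natural language search, keyword match: must contain 'shandong', must not contain 'university', and may contain '文化'.
--      '文化' has no + or - prefix, so it does not affect which rows match, but it does affect the score: rows containing '文化' score higher.
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, '+shandong -university 文化', 'natural_language') > 0;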
 id |                     content                      
----+--------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.

-- Must contain 'shandong', must not contain 'physics', and may contain 'famous'. Rows containing 'famous' get a higher relevance score.
-- Note: these scores were computed on a single shard. BM25 scores can differ with a different shard count or file layout.
SELECT id,
       content,
       TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') as score
FROM wiki_articles
WHERE TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') > 0
ORDER BY score DESC;

-- Result
 id |                     content                      |  score   
----+--------------------------------------------------+----------
  8 | ShanDa is a famous university of Shandong.       |  2.92376
  7 | Shandong has dozens of universities.             | 0.863399
  2 | Li was born in 1962 in Wendeng County, Shandong. | 0.716338

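-- (N4) Natural language search, phrase search: equivalent to (P1). Wrap the phrase in double quotes (""); escape any double quote inside the phrase with a backslash (\).
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, '"shandong university"', 'natural_language') > 0;

-- Result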
 id |                               content                               
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.

-- (N5) Natural language search, phrase search: equivalent to (P3), which also uses slop = 23. The ~ syntax sets slop.
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, '"shandong university"~23', 'natural_language') > 0;

-- Result
 id |                               content                               
----+---------------------------------------------------------------------
  8 | ShanDa is a famous university of Shandong.
  7 | Shandong has dozens of universities.
  3 | He graduated from the department of physics at Shandong University.

-- (N6) Natural language search: match all documents
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, '*', 'natural_language') > 0;

-- Result
 id |                                           content                                            
----+----------------------------------------------------------------------------------------------
  1 | 长江是中国第一大河，世界第三长河，全长约6,300公里。
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  4 | 春节，即农历新年，是中国最重要的传统节日。
  5 | 春节通常在公历1月下旬至2月中旬之间。春节期间的主要习俗包括贴春联、放鞭炮、吃年夜饭、拜年等。
  6 | 2006年，春节被国务院批准为第一批国家级非物质文化遗产。
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.

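Complex query examples

Combine full-text search with a filter on the key column (id).

-- Rows whose content contains shandong or university, and whose id is 3.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 AND id = 3;

-- Result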
 id |                               content                               
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  

-- Rows whose content contains shandong or university, or whose id is less than 2.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 OR id < 2;

-- Result
 id |                               content                               
----+---------------------------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.
  8 | ShanDa is a famous university of Shandong.
  1 | 长江是中国第一大河，世界第三长河，全长约6,300公里。
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.

Return the scores and take the top 3.

SELECT id,
       content,
       TEXT_SEARCH(content, 'shandong university') AS score,
       TOKENIZE(content, 'jieba')
  FROM wiki_articles
ORDER BY score DESC
LIMIT 3;

-- Result
id  |                               content                               |  score  |                     tokenize                     
----+---------------------------------------------------------------------+---------+--------------------------------------------------
  8 | ShanDa is a famous university of Shandong.                          | 2.74634 | {shanda,famous,univers,shandong}
  7 | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}

Use TEXT_SEARCH in both the output list and the WHERE clause.

SELECT id,
       content,
       TEXT_SEARCH(content, 'shandong university') AS score,
       TOKENIZE(content, 'jieba')
  FROM wiki_articles
 WHERE TEXT_SEARCH(content, 'shandong university') > 0
ORDER BY score DESC;

-- Result
id  |                               content                               |  score  |                     tokenize                     
----+---------------------------------------------------------------------+---------+--------------------------------------------------
  7 | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  8 | ShanDa is a famous university of Shandong.                          | 2.74634 | {shanda,famous,univers,shandong}
  3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
  2 | Li was born in 1962 in Wendeng County, Shandong.                    | 1.09244 | {li,born,1962,wendeng,counti,shandong}

Find the documents from the wiki source that are most relevant to shandong university.

-- Source table, used for the JOIN.
CREATE TABLE article_source (id int primary key, source text);
INSERT INTO article_source VALUES (1, 'baike'), (2, 'wiki'), (3, 'wiki'), (4, 'baike'),
                                  (5, 'baike'), (6, 'baike'), (7, 'wiki'), (8, 'paper'),
                                  (9, 'http_log'), (10, 'http_log'), (11, 'http_log');
                                  
SELECT a.id,
       source, content,
       TEXT_SEARCH(content, 'shandong university') AS score,
       TOKENIZE(a.content, 'jieba')
  FROM wiki_articles a
  JOIN article_source b
    ON (a.id = b.id)
 WHERE TEXT_SEARCH(a.content, 'shandong university') > 0
   AND b.source = 'wiki'
ORDER BY score DESC;

-- Result
id  | source |                               content                               |  score  |                     tokenize                     
----+--------+---------------------------------------------------------------------+---------+--------------------------------------------------
  7 | wiki   | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  3 | wiki   | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
  2 | wiki   | Li was born in 1962 in Wendeng County, Shandong.                    | 1.09244 | {li,born,1962,wendeng,counti,shandong}

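Tokenizer examples

Use the default jieba tokenizer. It runs in search mode by default, which emits additional overlapping word forms to improve search results.

SELECT TOKENIZE('他来到北京清华大学', 'jieba');

-- Result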
                tokenize                
--------------------------------------
{他,来到,北京,清华,华大,大学,清华大学}

Use a custom jieba tokenizer in exact mode, which does not emit the extra word forms.

SELECT TOKENIZE('他来到北京清华大学', 'jieba', '{"tokenizer": {"type": "jieba", "mode": "exact"}}');

-- Result
        tokenize         
-----------------------
{他,来到,北京,清华大学}

Tokenizer comparison.

SELECT TOKENIZE('他来到北京清华大学', 'jieba') as jieba,
       TOKENIZE('他来到北京清华大学', 'keyword') as keyword,
       TOKENIZE('他来到北京清华大学', 'whitespace') as whitespace,
       TOKENIZE('他来到北京清华大学', 'simple') as simple,
       TOKENIZE('他来到北京清华大学', 'standard') as standard,
       TOKENIZE('他来到北京清华大学', 'icu') as icu;
       
-- Result
-[ RECORD 1 ]--------------------------------------
jieba      | {他,来到,北京,清华,华大,大学,清华大学}
keyword    | {他来到北京清华大学}
whitespace | {他来到北京清华大学}
simple     | {他来到北京清华大学}
standard   | {他,来,到,北,京,清,华,大,学}
icu        | {他,来到,北京,清华大学}

Tokenizer comparison on an HTTP log line.

SELECT TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'jieba') as jieba,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'keyword') as keyword,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'whitespace') as whitespace,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'simple') as simple,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'standard') as standard,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'icu') as icu;
       
-- Result
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------
jieba      | {211.11,9.0,1998-06,21t15,00,01-05,00,get,english,index,html,http,1.0,304,0}
keyword    | {"211.11.9.0 - - [1998-06-21T15:00:01-05:00] \\\"GET /english/index.html HTTP/1.0\\\" 304 0"}
whitespace | {211.11.9.0,-,-,[1998-06-21T15:00:01-05:00],"\\\"GET",/english/index.html,"HTTP/1.0\\\"",304,0}
simple     | {211,11,9,0,1998,06,21t15,00,01,05,00,get,english,index,html,http,1,0,304,0}
standard   | {211.11.9.0,1998,06,21t15,00,01,05,00,get,english,index.html,http,1.0,304,0}
icu        | {211.11.9.0,1998,06,21t15,00,01,05,00,get,english,index.html,http,1.0,304,0}

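Advanced: customize the tokenizer configuration

Hologres recommends the default tokenizer configuration. In practice, however, the defaults may not meet your business requirements. In that case, you can customize the tokenizer configuration for more flexible tokenization.

analyzer_params requirements

The analyzer_params tokenizer configuration must meet the following requirements:

It must be a JSON string.

The top level of the JSON supports two keys, tokenizer and filter:

Parameter	Description

tokenizer

Required. A JSON object that configures the tokenizer. The object supports the following keys:

type: required. The tokenizer name.
mode: optional. The tokenization mode. Supported only by the jieba tokenizer. Valid values:
- search (default): enumerates multiple possible segmentations and allows redundancy. For example, 传统节日 yields three tokens: 传统, 节日, and 传统节日.
- exact: does not produce redundant segmentations. For example, 传统节日 yields the single token 传统节日.
hmm: optional. Whether to use a hidden Markov model to recognize words that are not in the dictionary, which improves new-word detection. Supported only by the jieba tokenizer. Valid values:
- true (default): enabled.
- false: disabled.

filter

Optional. A JSON array that configures token filters. When multiple filters are configured, they are applied to each token strictly in the configured order.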
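A quick way to try a configuration that follows these rules is to pass it to TOKENIZE before putting it on an index. The sketch below combines the tokenizer key with a two-step filter array. The filter names come from the filter reference in this document; the input sentence is arbitrary, and no output is shown because it has not been verified against a live instance.

```sql
-- Top-level keys: "tokenizer" (required object) and "filter" (optional array).
-- Filters run in array order: lowercase first, then English stop-word removal.
SELECT TOKENIZE(
  'The Quick Brown Foxes',
  'standard',
  '{"tokenizer": {"type": "standard"}, "filter": ["lowercase", {"type": "stop", "stop_words": ["_english_"]}]}'
);
```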

Default analyzer_params configuration

The default analyzer_params configuration of each tokenizer is as follows:

Tokenizer	Default analyzer_params configuration	Tokenization example
jieba (default tokenizer)	`{ "tokenizer": { "type": "jieba", "mode": "search", "hmm": true }, "filter": [ "removepunct", "lowercase", {"type": "stop", "stop_words": ["_english_"]}, {"type": "stemmer", "language": "english"} ] }`	`春节，即农历新年，是中国最重要的传统节日传统节日。`
whitespace	`{ "tokenizer": { "type": "whitespace" } }`	`春节，即农历新年，是中国最重要的传统节日。`
keyword	`{ "tokenizer": { "type": "keyword" } }`	`春节，即农历新年，是中国最重要的传统节日。`
simple	`{ "tokenizer": { "type": "simple" }, "filter": [ "lowercase" ] }`	`春节即农历新年是中国最重要的传统节日`
standard	`{ "tokenizer": { "type": "standard", "max_token_length": 255 }, "filter": [ "lowercase" ] }`	`春节即农历新年是中国最重要的传统节日`
icu	`{ "tokenizer": { "type": "icu" }, "filter": [ "removepunct", "lowercase" ] }`	`春节，即农历新年，是中国最重要的传统节日。`

filter configuration in analyzer_params

Hologres supports the following filters (token filters) in analyzer_params.

Note

When multiple token filters are configured, they are applied to each token strictly in the configured order.

Filter	Description	Parameter format	Example
lowercase	Converts uppercase letters in a token to lowercase.	Declare lowercase only. `"lowercase"`	Filter definition: `"filter": ["lowercase"]` Result: `["Hello", "WORLD"]`->`["hello", "world"]`
stop	Removes stop-word tokens.	`stop_words`: the stop-word list, which must be a list of strings only. Custom stop words are supported, as are the following built-in stop-word dictionaries for specific languages: `"_english_" "_danish_" "_dutch_" "_finnish_" "_french_" "_german_" "_hungarian_" "_italian_" "_norwegian_" "_portuguese_" "_russian_" "_spanish_" "_swedish_"`	Filter definition: `"filter": [{ "type": "stop", "stop_words": ["_english_", "cat"] }]` Result: `["the", "cat", "is", "on", "a", "mat"]`->`["mat"]` Note: "cat" is a custom stop word; "the", "is", "on", and "a" are stop words in the built-in "_english_" dictionary.
stemmer	Reduces each token to its stem according to the grammar rules of the given language.	`language`: the language. The following built-in languages are supported: `"arabic", "danish", "dutch", "english", "finnish", "french", "german", "greek", "hungarian", "italian", "norwegian", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", "turkish"`	Filter definition: `"filter": [{ "type": "stemmer", "language": "english" }]` Result: `["machine", "learning"]`->`["machin", "learn"]`
length	Removes tokens longer than the specified length.	`max`: the maximum length to keep, which must be a positive integer. `{"type": "length", "max": 10}`	Filter definition: `"filter": [{"type": "length", "max": 10}]` Result: `["AI", "for", "Artificial", "Intelligence"]`->`["AI", "for", "Artificial"]`
removepunct	Removes tokens that consist only of punctuation characters.	Declare removepunct only. `"removepunct"`	Filter definition: `"filter": ["removepunct"]` Result: `["中文", "english", "中文。", "english.", "124", "124!=8", "。", "、", "，，", " ..."]`->`["中文", "english", "中文。", "english.", "124", "124!=8"]`
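The filters above can be chained on an index. The following is a sketch rather than a recommended configuration: idx_custom and col3 are placeholder names, and the chosen order assumes that stop-word matching is case-sensitive, which is why lowercase comes first.

```sql
-- Placeholder names: idx_custom, tbl, col3.
-- Filter order: lowercase -> stop words -> drop tokens longer than 10 -> stem.
CREATE INDEX idx_custom ON tbl
       USING FULLTEXT (col3)
       WITH (tokenizer = 'standard',
             analyzer_params = '{
               "tokenizer": {"type": "standard"},
               "filter": [
                 "lowercase",
                 {"type": "stop", "stop_words": ["_english_"]},
                 {"type": "length", "max": 10},
                 {"type": "stemmer", "language": "english"}
               ]
             }');
```

Since filters apply strictly in the configured order, placing length before stemmer filters on the original token length rather than the stemmed length. Swap the two if the stemmed length is what matters.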