pg_similarity
是PolarDB PostgreSQL版支持的一款第三方插件,用于相似性查询。
前提条件
支持的PolarDB PostgreSQL版的版本如下:
PostgreSQL 14(内核小版本14.10.18.0及以上)。
您可通过如下语句查看PolarDB PostgreSQL版的内核小版本的版本号:
select version();
概述
pg_similarity
插件用于相似性查询。除了传统的运算符(例如,=
和<>
)之外,还可以使用pg_similarity
定义的~~
和!
两个运算符(都表示一个相似度函数)进行查询。pg_similarity
拥有以下三个主要组件:
函数:一组实现文献中可用相似性算法的函数。这些函数可以用作UDF,并将作为实现相似性运算符的基础;
运算符:在相似性函数的顶部定义了一组运算符。它们使用相似性函数来获取相似性阈值,并将其值与用户定义的阈值进行比较,以决定是否为匹配项;
会话级变量:一组用于存储相似性函数参数的变量。这些变量可以在运行时定义。
函数与运算符
pg_similarity
插件支持了大部分知名的相似性算法,每个算法都适用于特定领域。提供的算法如下所示:
L1 Distance (as known as City Block or Manhattan Distance);
Cosine Distance;
Dice Coefficient;
Euclidean Distance;
Hamming Distance;
Jaccard Coefficient;
Jaro Distance;
Jaro-Winkler Distance;
Levenshtein Distance;
Matching Coefficient;
Monge-Elkan Coefficient;
Needleman-Wunsch Coefficient;
Overlap Coefficient;
Q-Gram Distance;
Smith-Waterman Coefficient;
Smith-Waterman-Gotoh Coefficient;
Soundex Distance.
算法 | 函数定义 | 操作符 | 是否使用索引 | 参数说明 |
L1 Distance | block(text, text) returns float8 | ~++ | yes | pg_similarity.block_tokenizer (enum) pg_similarity.block_threshold (float8) pg_similarity.block_is_normalized (bool) |
Cosine Distance | cosine(text, text) returns float8 | ~## | yes | pg_similarity.cosine_tokenizer (enum) pg_similarity.cosine_threshold (float8) pg_similarity.cosine_is_normalized (bool) |
Dice Coefficient | dice(text, text) returns float8 | ~-~ | yes | pg_similarity.dice_tokenizer (enum) pg_similarity.dice_threshold (float8) pg_similarity.dice_is_normalized (bool) |
Euclidean Distance | euclidean(text, text) returns float8 | ~!! | yes | pg_similarity.euclidean_tokenizer (enum) pg_similarity.euclidean_threshold (float8) pg_similarity.euclidean_is_normalized (bool) |
Hamming Distance | hamming(bit varying, bit varying) returns float8 hamming_text(text, text) returns float8 | ~@~ | no | pg_similarity.hamming_threshold (float8) pg_similarity.hamming_is_normalized (bool) |
Jaccard Coefficient | jaccard(text, text) returns float8 | ~?? | yes | pg_similarity.jaccard_tokenizer (enum) pg_similarity.jaccard_threshold (float8) pg_similarity.jaccard_is_normalized (bool) |
Jaro Distance | jaro(text, text) returns float8 | ~%% | no | pg_similarity.jaro_threshold (float8) pg_similarity.jaro_is_normalized (bool) |
Jaro-Winkler Distance | jarowinkler(text, text) returns float8 | ~@@ | no | pg_similarity.jarowinkler_threshold (float8) pg_similarity.jarowinkler_is_normalized (bool) |
Levenshtein Distance | lev(text, text) returns float8 | ~== | no | pg_similarity.levenshtein_threshold (float8) pg_similarity.levenshtein_is_normalized (bool) |
Matching Coefficient | matchingcoefficient(text, text) returns float8 | ~^^ | yes | pg_similarity.matching_tokenizer (enum) pg_similarity.matching_threshold (float8) pg_similarity.matching_is_normalized (bool) |
Monge-Elkan Coefficient | mongeelkan(text, text) returns float8 | ~|| | no | pg_similarity.mongeelkan_tokenizer (enum) pg_similarity.mongeelkan_threshold (float8) pg_similarity.mongeelkan_is_normalized (bool) |
Needleman-Wunsch Coefficient | needlemanwunsch(text, text) returns float8 | ~#~ | no | pg_similarity.nw_threshold (float8) pg_similarity.nw_is_normalized (bool) |
Overlap Coefficient | overlapcoefficient(text, text) returns float8 | ~** | yes | pg_similarity.overlap_tokenizer (enum) pg_similarity.overlap_threshold (float8) pg_similarity.overlap_is_normalized (bool) |
Q-Gram Distance | qgram(text, text) returns float8 | ~~~ | yes | pg_similarity.qgram_threshold (float8) pg_similarity.qgram_is_normalized (bool) |
Smith-Waterman Coefficient | smithwaterman(text, text) returns float8 | ~=~ | no | pg_similarity.sw_threshold (float8) pg_similarity.sw_is_normalized (bool) |
Smith-Waterman-Gotoh Coefficient | smithwatermangotoh(text, text) returns float8 | ~!~ | no | pg_similarity.swg_threshold (float8) pg_similarity.swg_is_normalized (bool) |
Soundex Distance | soundex(text, text) returns float8 | ~*~ | no | - |
一些参数控制着pg_similarity
函数和运算符的行为。这些参数可以被归类为三类:分词器(tokenizer)、阈值(threshold)和归一化(normalized)。
分词器:控制字符串如何被分词。取值为:
alnum
、gram
、word
和camelcase
。默认值为alnum
。所有分词都是小写(该选项可以在编译时设置,参见源代码中的PGS_IGNORE_CASE
)。alnum
:分隔符是任何非字母数字字符。这意味着,只有标准C语言环境中的字母字符和数字(0-9)会被接受为分词。例如,字符串 "Euler_Taveira_de_Oliveira 22/02/2011" 被分词为 "Euler"、"Taveira"、"de"、"Oliveira"、"22"、"02"、"2011"。gram
:一个n-gram是长度为n的子序列。通过使用滑动窗口技术,即通过一个字符滑动一个长度为n的窗口来从字符串中提取n-gram。例如,字符串 "euler taveira"(使用n=3 )被分词为 "eul"、"ule"、"ler"、"er "、"r t"、" ta"、"tav"、"ave"、"vei"、"eir" 和 "ira"。有些会将 " e"、" eu"、"ra " 和 "a " 添加到分词集中,即为完整n-grams(该选项可以在编译时设置,参见源代码中的PGS_FULL_NGRAM
)。word
:分隔符是空白字符(空格、换页符、换行符、回车符、水平制表符和垂直制表符)。例如,字符串 "Euler Taveira de Oliveira 22/02/2011" 被分词为 "Euler"、"Taveira"、"de"、"Oliveira" 和 "22/02/2011"。camelcase
:分隔符是大写字符,但它们也包含作为第一个分词字符。例如,字符串 "EulerTaveira de Oliveira" 被分词为 "Euler"、"Taveira de " 和 "Oliveira"。
阈值:控制比较字符串被判定为匹配的临界值。对于每对字符串,如果计算出的值(使用相应的相似性函数)大于或等于阈值,则认为匹配。取值范围为:0.0~1.0。默认值为0.7。
归一化:控制相似性系数/距离是否被归一化(如果相似性系数被归一化,则系数的范围会在[0,1]之间)。归一化值会被运算符自动用来匹配字符串,也就是说,该参数只有在使用相似性函数时才有意义。默认值是
true
。
使用方法
创建插件
CREATE EXTENSION pg_similarity;
在运行时设置参数
SHOW pg_similarity.levenshtein_threshold;
pg_similarity.levenshtein_threshold
-------------------------------------
0.7
(1 row)
SET pg_similarity.levenshtein_threshold TO 0.5;
SHOW pg_similarity.levenshtein_threshold;
pg_similarity.levenshtein_threshold
-------------------------------------
0.5
(1 row)
SET pg_similarity.cosine_tokenizer TO camelcase;
SET pg_similarity.euclidean_is_normalized TO false;
创建测试表
CREATE TABLE foo (a TEXT);
INSERT INTO foo VALUES
('Euler'),
('Oiler'),
('Euler Taveira de Oliveira'),
('Maria Taveira dos Santos'),
('Carlos Santos Silva');
CREATE TABLE bar (b TEXT);
INSERT INTO bar VALUES
('Euler T. de Oliveira'),
('Euller'),
('Oliveira, Euler Taveira'),
('Sr. Oliveira');
使用jaro函数
SELECT a, b, jaro(a, b) FROM foo, bar;
a | b | jaro
---------------------------+-------------------------+--------------------
Euler | Euler T. de Oliveira | 0.75
Euler | Euller | 0.9444444444444444
Euler | Oliveira, Euler Taveira | 0.6057971014492753
Euler | Sr. Oliveira | 0.5055555555555555
Oiler | Euler T. de Oliveira | 0.4722222222222222
Oiler | Euller | 0.7
Oiler | Oliveira, Euler Taveira | 0.672463768115942
Oiler | Sr. Oliveira | 0.6722222222222223
Euler Taveira de Oliveira | Euler T. de Oliveira | 0.7980701754385964
Euler Taveira de Oliveira | Euller | 0.6777777777777777
Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.7731884057971014
Euler Taveira de Oliveira | Sr. Oliveira | 0.5922222222222222
Maria Taveira dos Santos | Euler T. de Oliveira | 0.6023504273504273
Maria Taveira dos Santos | Euller | 0.3055555555555556
Maria Taveira dos Santos | Oliveira, Euler Taveira | 0.5350241545893719
Maria Taveira dos Santos | Sr. Oliveira | 0.6342592592592593
Carlos Santos Silva | Euler T. de Oliveira | 0.5421052631578946
Carlos Santos Silva | Euller | 0.3128654970760234
Carlos Santos Silva | Oliveira, Euler Taveira | 0.6066615814899567
Carlos Santos Silva | Sr. Oliveira | 0.5077276524644945
(20 rows)
levenshtein运算符
SHOW pg_similarity.levenshtein_threshold;
pg_similarity.levenshtein_threshold
-------------------------------------
0.5
(1 row)
SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
a | b | lev
---------------------------+----------------------+--------------------
Euler | Euller | 0.8333333333333334
Oiler | Euller | 0.5
Euler Taveira de Oliveira | Euler T. de Oliveira | 0.76
(3 rows)
SET pg_similarity.levenshtein_threshold = 0.7;
SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
a | b | lev
---------------------------+----------------------+--------------------
Euler | Euller | 0.8333333333333334
Euler Taveira de Oliveira | Euler T. de Oliveira | 0.76
(2 rows)
运算符效果比较
SELECT * FROM bar WHERE b ~@@ 'euler'; -- jaro-winkler operator
b
----------------------
Euler T. de Oliveira
Euller
(2 rows)
SELECT * FROM bar WHERE b ~~~ 'euler'; -- qgram operator
b
---
(0 rows)
SELECT * FROM bar WHERE b ~== 'euler'; -- levenshtein operator
b
--------
Euller
(1 row)
SELECT * FROM bar WHERE b ~## 'euler'; -- cosine operator
b
---
(0 rows)