文档

pg_similarity

更新时间:

pg_similarityPolarDB PostgreSQL版支持的一款第三方插件,用于相似性查询。

前提条件

支持的PolarDB PostgreSQL版的版本如下:

PostgreSQL 14(内核小版本14.10.18.0及以上)。

说明

您可通过如下语句查看PolarDB PostgreSQL版的内核小版本的版本号:

select version();

概述

pg_similarity插件用于相似性查询。除了传统的运算符(例如,=<>)之外,还可以使用pg_similarity定义的~~!两个运算符(都表示一个相似度函数)进行查询。pg_similarity拥有以下三个主要组件:

  • 函数:一组实现文献中可用相似性算法的函数。这些函数可以用作UDF,并将作为实现相似性运算符的基础;

  • 运算符:在相似性函数的顶部定义了一组运算符。它们使用相似性函数来获取相似性阈值,并将其值与用户定义的阈值进行比较,以决定是否为匹配项;

  • 会话级变量:一组用于存储相似性函数参数的变量。这些变量可以在运行时定义。

函数与运算符

pg_similarity插件支持了大部分知名的相似性算法,每个算法都适用于特定领域。提供的算法如下所示:

  • L1 Distance (as known as City Block or Manhattan Distance);

  • Cosine Distance;

  • Dice Coefficient;

  • Euclidean Distance;

  • Hamming Distance;

  • Jaccard Coefficient;

  • Jaro Distance;

  • Jaro-Winkler Distance;

  • Levenshtein Distance;

  • Matching Coefficient;

  • Monge-Elkan Coefficient;

  • Needleman-Wunsch Coefficient;

  • Overlap Coefficient;

  • Q-Gram Distance;

  • Smith-Waterman Coefficient;

  • Smith-Waterman-Gotoh Coefficient;

  • Soundex Distance.

算法

函数定义

操作符

是否使用索引

参数说明

L1 Distance

block(text, text) returns float8

~++

yes

pg_similarity.block_tokenizer (enum)

pg_similarity.block_threshold (float8)

pg_similarity.block_is_normalized (bool)

Cosine Distance

cosine(text, text) returns float8

~##

yes

pg_similarity.cosine_tokenizer (enum)

pg_similarity.cosine_threshold (float8)

pg_similarity.cosine_is_normalized (bool)

Dice Coefficient

dice(text, text) returns float8

~-~

yes

pg_similarity.dice_tokenizer (enum)

pg_similarity.dice_threshold (float8)

pg_similarity.dice_is_normalized (bool)

Euclidean Distance

euclidean(text, text) returns float8

~!!

yes

pg_similarity.euclidean_tokenizer (enum)

pg_similarity.euclidean_threshold (float8)

pg_similarity.euclidean_is_normalized (bool)

Hamming Distance

hamming(bit varying, bit varying) returns float8

hamming_text(text, text) returns float8

~@~

no

pg_similarity.hamming_threshold (float8)

pg_similarity.hamming_is_normalized (bool)

Jaccard Coefficient

jaccard(text, text) returns float8

~??

yes

pg_similarity.jaccard_tokenizer (enum)

pg_similarity.jaccard_threshold (float8)

pg_similarity.jaccard_is_normalized (bool)

Jaro Distance

jaro(text, text) returns float8

~%%

no

pg_similarity.jaro_threshold (float8)

pg_similarity.jaro_is_normalized (bool)

Jaro-Winkler Distance

jarowinkler(text, text) returns float8

~@@

no

pg_similarity.jarowinkler_threshold (float8)

pg_similarity.jarowinkler_is_normalized (bool)

Levenshtein Distance

lev(text, text) returns float8

~==

no

pg_similarity.levenshtein_threshold (float8)

pg_similarity.levenshtein_is_normalized (bool)

Matching Coefficient

matchingcoefficient(text, text) returns float8

~^^

yes

pg_similarity.matching_tokenizer (enum)

pg_similarity.matching_threshold (float8)

pg_similarity.matching_is_normalized (bool)

Monge-Elkan Coefficient

mongeelkan(text, text) returns float8

~||

no

pg_similarity.mongeelkan_tokenizer (enum)

pg_similarity.mongeelkan_threshold (float8)

pg_similarity.mongeelkan_is_normalized (bool)

Needleman-Wunsch Coefficient

needlemanwunsch(text, text) returns float8

~#~

no

pg_similarity.nw_threshold (float8)

pg_similarity.nw_is_normalized (bool)

Overlap Coefficient

overlapcoefficient(text, text) returns float8

~**

yes

pg_similarity.overlap_tokenizer (enum)

pg_similarity.overlap_threshold (float8)

pg_similarity.overlap_is_normalized (bool)

Q-Gram Distance

qgram(text, text) returns float8

~~~

yes

pg_similarity.qgram_threshold (float8)

pg_similarity.qgram_is_normalized (bool)

Smith-Waterman Coefficient

smithwaterman(text, text) returns float8

~=~

no

pg_similarity.sw_threshold (float8)

pg_similarity.sw_is_normalized (bool)

Smith-Waterman-Gotoh Coefficient

smithwatermangotoh(text, text) returns float8

~!~

no

pg_similarity.swg_threshold (float8)

pg_similarity.swg_is_normalized (bool)

Soundex Distance

soundex(text, text) returns float8

~*~

no

-

一些参数控制着pg_similarity函数和运算符的行为。这些参数可以被归类为三类:分词器(tokenizer)、阈值(threshold)和归一化(normalized)。

  • 分词器:控制字符串如何被分词。取值为:alnumgramwordcamelcase。默认值为alnum。所有分词都是小写(该选项可以在编译时设置,参见源代码中的PGS_IGNORE_CASE)。

    • alnum:分隔符是任何非字母数字字符。这意味着,只有标准C语言环境中的字母字符和数字(0-9)会被接受为分词。例如,字符串 "Euler_Taveira_de_Oliveira 22/02/2011" 被分词为 "Euler"、"Taveira"、"de"、"Oliveira"、"22"、"02"、"2011"。

    • gram:一个n-gram是长度为n的子序列。通过使用滑动窗口技术,即通过一个字符滑动一个长度为n的窗口来从字符串中提取n-gram。例如,字符串 "euler taveira"(使用n=3 )被分词为 "eul"、"ule"、"ler"、"er "、"r t"、" ta"、"tav"、"ave"、"vei"、"eir" 和 "ira"。有些会将 " e"、" eu"、"ra " 和 "a " 添加到分词集中,即为完整n-grams(该选项可以在编译时设置,参见源代码中的PGS_FULL_NGRAM)。

    • word:分隔符是空白字符(空格、换页符、换行符、回车符、水平制表符和垂直制表符)。例如,字符串 "Euler Taveira de Oliveira 22/02/2011" 被分词为 "Euler"、"Taveira"、"de"、"Oliveira" 和 "22/02/2011"。

    • camelcase:分隔符是大写字符,但它们也包含作为第一个分词字符。例如,字符串 "EulerTaveira de Oliveira" 被分词为 "Euler"、"Taveira de " 和 "Oliveira"。

  • 阈值:控制比较字符串被判定为匹配的临界值。对于每对字符串,如果计算出的值(使用相应的相似性函数)大于或等于阈值,则认为匹配。取值范围为:0.0~1.0。默认值为0.7。

  • 归一化:控制相似性系数/距离是否被归一化(如果相似性系数被归一化,则系数的范围会在[0,1]之间)。归一化值会被运算符自动用来匹配字符串,也就是说,该参数只有在使用相似性函数时才有意义。默认值是true

使用方法

创建插件

CREATE EXTENSION pg_similarity;

在运行时设置参数

SHOW pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.7
(1 row)

SET pg_similarity.levenshtein_threshold TO 0.5;
SHOW pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.5
(1 row)
SET pg_similarity.cosine_tokenizer TO camelcase;
SET pg_similarity.euclidean_is_normalized TO false;

创建测试表

CREATE TABLE foo (a TEXT);
INSERT INTO foo VALUES
  ('Euler'),
  ('Oiler'),
  ('Euler Taveira de Oliveira'),
  ('Maria Taveira dos Santos'),
  ('Carlos Santos Silva');
CREATE TABLE bar (b TEXT);
INSERT INTO bar VALUES
  ('Euler T. de Oliveira'),
  ('Euller'),
  ('Oliveira, Euler Taveira'),
  ('Sr. Oliveira');

使用jaro函数

SELECT a, b, jaro(a, b) FROM foo, bar;
             a             |            b            |        jaro
---------------------------+-------------------------+--------------------
 Euler                     | Euler T. de Oliveira    |               0.75
 Euler                     | Euller                  | 0.9444444444444444
 Euler                     | Oliveira, Euler Taveira | 0.6057971014492753
 Euler                     | Sr. Oliveira            | 0.5055555555555555
 Oiler                     | Euler T. de Oliveira    | 0.4722222222222222
 Oiler                     | Euller                  |                0.7
 Oiler                     | Oliveira, Euler Taveira |  0.672463768115942
 Oiler                     | Sr. Oliveira            | 0.6722222222222223
 Euler Taveira de Oliveira | Euler T. de Oliveira    | 0.7980701754385964
 Euler Taveira de Oliveira | Euller                  | 0.6777777777777777
 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.7731884057971014
 Euler Taveira de Oliveira | Sr. Oliveira            | 0.5922222222222222
 Maria Taveira dos Santos  | Euler T. de Oliveira    | 0.6023504273504273
 Maria Taveira dos Santos  | Euller                  | 0.3055555555555556
 Maria Taveira dos Santos  | Oliveira, Euler Taveira | 0.5350241545893719
 Maria Taveira dos Santos  | Sr. Oliveira            | 0.6342592592592593
 Carlos Santos Silva       | Euler T. de Oliveira    | 0.5421052631578946
 Carlos Santos Silva       | Euller                  | 0.3128654970760234
 Carlos Santos Silva       | Oliveira, Euler Taveira | 0.6066615814899567
 Carlos Santos Silva       | Sr. Oliveira            | 0.5077276524644945
(20 rows)

levenshtein运算符

SHOW pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.5
(1 row)

SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
             a             |          b           |        lev
---------------------------+----------------------+--------------------
 Euler                     | Euller               | 0.8333333333333334
 Oiler                     | Euller               |                0.5
 Euler Taveira de Oliveira | Euler T. de Oliveira |               0.76
(3 rows)
SET pg_similarity.levenshtein_threshold = 0.7;
SELECT a, b, lev(a,b) FROM foo, bar WHERE a ~== b;
             a             |          b           |        lev
---------------------------+----------------------+--------------------
 Euler                     | Euller               | 0.8333333333333334
 Euler Taveira de Oliveira | Euler T. de Oliveira |               0.76
(2 rows)

运算符效果比较

SELECT * FROM bar WHERE b ~@@ 'euler'; -- jaro-winkler operator
          b
----------------------
 Euler T. de Oliveira
 Euller
(2 rows)
SELECT * FROM bar WHERE b ~~~ 'euler'; -- qgram operator
 b
---
(0 rows)
SELECT * FROM bar WHERE b ~== 'euler'; -- levenshtein operator
   b
--------
 Euller
(1 row)
SELECT * FROM bar WHERE b ~## 'euler'; -- cosine operator
 b
---
(0 rows)