GIN索引

本文介绍PolarDB通用倒排索引GIN(Generalized Inverted Index)。

GIN是一种存储(Key和Posting list)对集合的索引结构,其中Key是一个键值,Posting list代表一组包含该Key的位置。例如,在'hello', '14:2 23:4'中,表示字符串hello在位置14:223:4中出现过,这些位置实际上对应于元组的tid(行号,包括数据块ID,大小为32 bit,以及item point,大小为16 bit)。通过这种索引结构,可以快速查找到包含指定关键字的元组,因此GIN索引特别适用于多值类型元素的搜索。

操作符

操作符

示例

<@

select * from test where id <@ array[1,2];

@>

select * from test where id @> array[1,2];

=

select * from test where id = array[1,2];

&&

select * from test where id && array[1,2];

您也可以通过btree_gin插件,支持btree相关的操作符类。

索引结构

GIN索引结构图

组成结构

说明

Entry

GIN索引中的一个元素。

Entry tree

在Entry上构建的B树。

Posting list

Entry物理位置的链表。

Posting tree

Posting list构建的B树。

Pending list

索引元组的临时存储链表,用于fastupdate插入。

应用场景

搜索多值类型

  1. 创建表及插入数据。

    CREATE TABLE t_gin1 (id int, arr int[]);
    do language plpgsql $$
      declare
      begin
        for i in 1..10000 loop
        insert into t_gin1 select i, array(select random()*1000 from generate_series(1,10));
      end loop;
    end;
    $$;
  2. 查询数据。

    select * from t_gin1 limit 3;
     id | arr
      1 | {819,751,515,192,909,255,683,39,258,423}
      2 | {844,43,834,710,447,344,480,749,814,18}
      3 | {339,69,28,428,859,510,560,465,906,788}
  3. 创建索引。

    CREATE INDEX idx_t_gin1_1 ON t_gin1 USING gin (arr);
  4. 执行计划。

    EXPLAIN (analyze, verbose, timing, costs, buffers) SELECT * FROM t_gin1 WHERE arr && array[1,2];
    Bitmap Heap Scan on public.t_gin1  (cost=7.08..126.44 rows=204 width=65) (actual time=0.042..0.161 rows=204 loops=1)
      Output: id, arr
      Recheck Cond: (t_gin1.arr && '{1,2}'::integer[])
      Heap Blocks: exact=106
      Buffers: shared hit=111 (main=111 vm=0 fsm=0)
      ->  Bitmap Index Scan on idx_t_gin1_1  (cost=0.00..7.03 rows=204 width=0) (actual time=0.026..0.026 rows=204 loops=1)
         Index Cond: (t_gin1.arr && '{1,2}'::integer[])
         Buffers: shared hit=5 (main=5 vm=0 fsm=0)
    Query Identifier: -6536619930866726302
    Planning:
      Buffers: shared hit=44 (main=44 vm=0 fsm=0) dirtied=2 (main=0 vm=0 fsm=0)
    Planning Time: 0.160 ms
    Execution Time: 0.196 ms

按照任意列进行搜索

  1. 安装插件。

    create extension btree_gin;
  2. 创建表及插入数据。

    CREATE TABLE t_gin2 (id int, c1 int);
    INSERT INTO t_gin2 SELECT generate_series(1,100000), random()*10 ;
  3. 创建索引。

    CREATE INDEX idx_t_gin2_1 on t_gin2 using gin (c1);
  4. 执行计划。

    EXPLAIN (analyze,verbose,timing,costs,buffers) SELECT * FROM t_gin2 WHERE c1=1;
    Bitmap Heap Scan on public.t_gin2  (cost=81.66..745.49 rows=9827 width=8) (actual time=0.743..2.775 rows=9964 loops=1)
      Output: id, c1
      Recheck Cond: (t_gin2.c1 = 1)
      Heap Blocks: exact=541
      Buffers: shared hit=546 (main=546 vm=0 fsm=0)
      ->  Bitmap Index Scan on idx_t_gin2_1  (cost=0.00..79.20 rows=9827 width=0) (actual time=0.691..0.692 rows=9964 loops=1)
         Index Cond: (t_gin2.c1 = 1)
         Buffers: shared hit=5 (main=5 vm=0 fsm=0)
    Query Identifier: -7119324855706379201
    Planning:
      Buffers: shared hit=43 (main=43 vm=0 fsm=0) dirtied=1 (main=0 vm=0 fsm=0)
    Planning Time: 0.136 ms
    Execution Time: 3.490 ms

查找的数据比较稀疏

  1. 创建表及插入数据。

    CREATE TABLE t_gin3 (id int, c1 int, c2 int, c3 int, c4 int, c5 int, c6 int, c7 int, c8 int, c9 int);
    INSERT INTO t_gin3 SELECT generate_series(1,100000), random()*10, random()*20, random()*30, random()*40, random()*50, random()*60, random()*70, random()*80, random()*90;
  2. 创建索引。

    CREATE INDEX idx_t_gin3_1 ON t_gin3 USING gin (c1,c2,c3, c4, c5, c6,c7,c8,c9);
  3. 执行计划。

    EXPLAIN (analyze,verbose,timing,costs, buffers) SELECT * FROm t_gin3 WHERE c1=1 OR c2=1 AND c3=1 OR c4=1 AND (c6=1 OR c7=2) OR c8=9 OR c9=10;
    Bitmap Heap Scan on public.t_gin3  (cost=141.08..1445.89 rows=12051 width=40) (actual time=1.539..4.541 rows=12240 loops=1)
      Output: id, c1, c2, c3, c4, c5, c6, c7, c8, c9
      Recheck Cond: ((t_gin3.c1 = 1) OR ((t_gin3.c2 = 1) AND (t_gin3.c3 = 1)) OR (((t_gin3.c4 = 1) AND (t_gin3.c6 = 1)) OR ((t_gin3.c4 = 1) AND (t_gin3.c7 = 2))) OR (t_gin3.c8 = 9) OR (t_gin3.c9 = 10))
      Heap Blocks: exact=935
      Buffers: shared hit=968 (main=968 vm=0 fsm=0)
      ->  BitmapOr  (cost=141.08..141.08 rows=12327 width=0) (actual time=1.441..1.445 rows=0 loops=1)
         Buffers: shared hit=33 (main=33 vm=0 fsm=0)
         ->  Bitmap Index Scan on idx_t_gin3_1  (cost=0.00..79.12 rows=9670 width=0) (actual time=0.816..0.817 rows=9844 loops=1)
               Index Cond: (t_gin3.c1 = 1)
               Buffers: shared hit=6 (main=6 vm=0 fsm=0)
         ->  Bitmap Index Scan on idx_t_gin3_1  (cost=0.00..7.09 rows=159 width=0) (actual time=0.262..0.262 rows=178 loops=1)
               Index Cond: ((t_gin3.c2 = 1) AND (t_gin3.c3 = 1))
               Buffers: shared hit=8 (main=8 vm=0 fsm=0)
         ->  BitmapOr  (cost=17.85..17.85 rows=82 width=0) (actual time=0.222..0.224 rows=0 loops=1)
               Buffers: shared hit=13 (main=13 vm=0 fsm=0)
               ->  Bitmap Index Scan on idx_t_gin3_1  (cost=0.00..5.94 rows=44 width=0) (actual time=0.119..0.119 rows=49 loops=1)
                     Index Cond: ((t_gin3.c4 = 1) AND (t_gin3.c6 = 1))
                     Buffers: shared hit=6 (main=6 vm=0 fsm=0)
               ->  Bitmap Index Scan on idx_t_gin3_1  (cost=0.00..5.88 rows=38 width=0) (actual time=0.103..0.103 rows=38 loops=1)
                     Index Cond: ((t_gin3.c4 = 1) AND (t_gin3.c7 = 2))
                     Buffers: shared hit=7 (main=7 vm=0 fsm=0)
         ->  Bitmap Index Scan on idx_t_gin3_1  (cost=0.00..12.93 rows=1283 width=0) (actual time=0.072..0.072 rows=1289 loops=1)
               Index Cond: (t_gin3.c8 = 9)
               Buffers: shared hit=3 (main=3 vm=0 fsm=0)
         ->  Bitmap Index Scan on idx_t_gin3_1  (cost=0.00..11.80 rows=1133 width=0) (actual time=0.063..0.063 rows=1143 loops=1)
               Index Cond: (t_gin3.c9 = 10)
               Buffers: shared hit=3 (main=3 vm=0 fsm=0)
    Query Identifier: -7228854819162599043
    Planning:
      Buffers: shared hit=111 (main=111 vm=0 fsm=0) dirtied=4 (main=0 vm=0 fsm=0)
    Planning Time: 0.220 ms
    Execution Time: 5.459 ms