AnalyticDB PostgreSQL7.0版支持了列组(Column Group)统计信息功能,您可以指定收集多个目标列的联合统计信息,在代价估算时打破独立分布的假设,提高代价估算的准确性,进而提升查询性能。

注意事项

列组统计信息功能目前仅支持Planner优化器(原Legacy优化器),如何设置优化器,请参见参数配置

语法

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON column_name, column_name [, ...]
    FROM table_name

参数介绍

参数 说明
statistics_name 列组统计信息的名称。
statistics_kind 需要收集的统计信息种类,目前支持以下三种:
  • ndistinct:多列组合值中不同值(DISTINCT)的数量。
  • dependencies:多列之间的Functional Dependency,可以理解为多列之间的相关性。
  • mcv:多列组合值中出现最多的值及其频率。
不指定该参数的话,以上三种类型的统计信息都会采集。
column_name 列组统计信息包含的列的名称,至少需要指定两个列。
table_name 上述列所属的表名。

示例

  1. 创建表ndistinct,并插入数据。
    CREATE TABLE ndistinct(a int, b int, c int, d int) DISTRIBUTED BY (d);
    INSERT INTO ndistinct(a, b) SELECT i/100, i/200, i/100, i FROM generate_series(1, 10000000) i;

    其中a、b列的值有强相关性,不符合独立分布的假设。

  2. 创建统计估算行数和实际行数的函数。
    CREATE FUNCTION check_estimated_rows(text) returns table (estimated int, actual int)
    LANGUAGE plpgsql AS
    $$
    DECLARE
        ln text;
        tmp text[];
        first_row bool := true;
    BEGIN
        FOR ln in
            execute format('explain analyze %s', $1)
        LOOP
            IF first_row then
                first_row := false;
                tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
                return query select tmp[1]::int, tmp[2]::int;
            END IF;
        END LOOP;
    END;
    $$;
  3. 查看估算行数和实际行数。
    SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');

    返回信息如下:

     estimated | actual
    -----------+--------
       1000000 | 100001
    (1 row)

    本次示例中估算行数为1000000行,实际行数为100001行,差距较大。

  4. 查看执行计划。
    EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;

    执行计划返回信息如下,用于和创建列组统计信息后的执行计划进行对比:

                                                                            QUERY PLAN
    -------------------------------------------------------------------------------------------------------------------------------------------------------------
     Gather Motion 3:1  (slice1; segments: 3)  (cost=137814.33..154481.00 rows=1000000 width=20) (actual time=3986.143..4014.601 rows=100001 loops=1)
       ->  HashAggregate  (cost=137814.33..141147.67 rows=333333 width=20) (actual time=3977.037..4002.222 rows=33661 loops=1)
             Group Key: a, b, c
             Peak Memory Usage: 0 kB
             ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..104481.00 rows=3333333 width=12) (actual time=0.079..1492.983 rows=3366100 loops=1)
                   Hash Key: a, b, c
                   ->  Seq Scan on ndistinct  (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.050..1101.632 rows=3334839 loops=1)
     Planning Time: 0.161 ms
       (slice0)    Executor memory: 12336K bytes.
       (slice1)    Executor memory: 13884K bytes avg x 3 workers, 13899K bytes max (seg0).  Work_mem: 16401K bytes max.
       (slice2)    Executor memory: 37K bytes avg x 3 workers, 37K bytes max (seg0).
     Memory used:  128000kB
     Optimizer: Postgres query optimizer
     Execution Time: 4041.613 ms
    (14 rows)
  5. 创建列组统计信息,并执行ANALYZE收集统计信息。
    CREATE STATISTICS s_a_b(ndistinct) ON a,b FROM ndistinct;
    ANALYZE ndistinct;
  6. 再次查看估算行数和实际行数。
    SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');

    返回信息如下:

     estimated | actual
    -----------+--------
         99431 | 100001
    (1 row)

    本次示例中估算行数为99431行,实际行数为100001行,相比较创建列组统计信息前,估算的行数更准确。

  7. 查看执行计划。
    EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;

    执行计划返回信息如下,从执行计划可以看出,因为估算更准确,所以执行计划中生成了Partial HashAggregate,进而缩短了查询时间:

                                                                               QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------
     Gather Motion 3:1  (slice1; segments: 3)  (cost=75124.91..76782.09 rows=99431 width=20) (actual time=2854.765..2879.734 rows=100001 loops=1)
       ->  Finalize HashAggregate  (cost=75124.91..75456.34 rows=33144 width=20) (actual time=2853.610..2868.194 rows=33661 loops=1)
             Group Key: a, b, c
             Peak Memory Usage: 0 kB
             ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=71147.67..74130.60 rows=99431 width=20) (actual time=2269.435..2759.413 rows=100983 loops=1)
                   Hash Key: a, b, c
                   ->  Partial HashAggregate  (cost=71147.67..72141.98 rows=99431 width=20) (actual time=2744.039..2794.808 rows=100001 loops=1)
                         Group Key: a, b, c
                         Peak Memory Usage: 0 kB
                         ->  Seq Scan on ndistinct  (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.028..454.030 rows=3334839 loops=1)
     Planning Time: 0.173 ms
       (slice0)    Executor memory: 4670K bytes.
       (slice1)    Executor memory: 3134K bytes avg x 3 workers, 3149K bytes max (seg0).  Work_mem: 5649K bytes max.
       (slice2)    Executor memory: 13848K bytes avg x 3 workers, 13848K bytes max (seg0).  Work_mem: 14353K bytes max.
     Memory used:  128000kB
     Optimizer: Postgres query optimizer
     Execution Time: 2893.470 ms
    (17 rows)