AnalyticDB PostgreSQL版7.0版支持了列组(Column Group)统计信息功能,您可以指定收集多个目标列的联合统计信息,在代价估算时打破独立分布的假设,提高代价估算的准确性,进而提升查询性能。
注意事项
列组统计信息功能目前仅支持Planner优化器(原Legacy优化器),如何设置优化器,请参见参数配置。
语法
CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
[ ( statistics_kind [, ... ] ) ]
ON column_name, column_name [, ...]
FROM table_name
参数介绍
参数 |
说明 |
statistics_name |
列组统计信息的名称。 |
statistics_kind |
需要收集的统计信息种类,目前支持以下三种:
- ndistinct:多列组合值中不同值(DISTINCT)的数量。
- dependencies:多列之间的Functional Dependency,可以理解为多列之间的相关性。
- mcv:多列组合值中出现最多的值及其频率。
不指定该参数的话,以上三种类型的统计信息都会采集。
|
column_name |
列组统计信息包含的列的名称,至少需要指定两个列。 |
table_name |
上述列所属的表名。 |
示例
- 创建表ndistinct,并插入数据。
CREATE TABLE ndistinct(a int, b int, c int, d int) DISTRIBUTED BY (d);
INSERT INTO ndistinct(a, b) SELECT i/100, i/200, i/100, i FROM generate_series(1, 10000000) i;
其中a、b列的值有强相关性,不符合独立分布的假设。
- 创建统计估算行数和实际行数的函数。
CREATE FUNCTION check_estimated_rows(text) returns table (estimated int, actual int)
LANGUAGE plpgsql AS
$$
DECLARE
ln text;
tmp text[];
first_row bool := true;
BEGIN
FOR ln in
execute format('explain analyze %s', $1)
LOOP
IF first_row then
first_row := false;
tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
return query select tmp[1]::int, tmp[2]::int;
END IF;
END LOOP;
END;
$$;
- 查看估算行数和实际行数。
SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');
返回信息如下:
estimated | actual
-----------+--------
1000000 | 100001
(1 row)
本次示例中估算行数为1000000行,实际行数为100001行,差距较大。
- 查看执行计划。
EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;
执行计划返回信息如下,用于和创建列组统计信息后的执行计划进行对比:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=137814.33..154481.00 rows=1000000 width=20) (actual time=3986.143..4014.601 rows=100001 loops=1)
-> HashAggregate (cost=137814.33..141147.67 rows=333333 width=20) (actual time=3977.037..4002.222 rows=33661 loops=1)
Group Key: a, b, c
Peak Memory Usage: 0 kB
-> Redistribute Motion 3:3 (slice2; segments: 3) (cost=0.00..104481.00 rows=3333333 width=12) (actual time=0.079..1492.983 rows=3366100 loops=1)
Hash Key: a, b, c
-> Seq Scan on ndistinct (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.050..1101.632 rows=3334839 loops=1)
Planning Time: 0.161 ms
(slice0) Executor memory: 12336K bytes.
(slice1) Executor memory: 13884K bytes avg x 3 workers, 13899K bytes max (seg0). Work_mem: 16401K bytes max.
(slice2) Executor memory: 37K bytes avg x 3 workers, 37K bytes max (seg0).
Memory used: 128000kB
Optimizer: Postgres query optimizer
Execution Time: 4041.613 ms
(14 rows)
- 创建列组统计信息,并执行ANALYZE收集统计信息。
CREATE STATISTICS s_a_b(ndistinct) ON a,b FROM ndistinct;
ANALYZE ndistinct;
- 再次查看估算行数和实际行数。
SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');
返回信息如下:
estimated | actual
-----------+--------
99431 | 100001
(1 row)
本次示例中估算行数为99431行,实际行数为100001行,相比较创建列组统计信息前,估算的行数更准确。
- 查看执行计划。
EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;
执行计划返回信息如下,从执行计划可以看出,因为估算更准确,所以执行计划中生成了Partial HashAggregate,进而缩短了查询时间:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=75124.91..76782.09 rows=99431 width=20) (actual time=2854.765..2879.734 rows=100001 loops=1)
-> Finalize HashAggregate (cost=75124.91..75456.34 rows=33144 width=20) (actual time=2853.610..2868.194 rows=33661 loops=1)
Group Key: a, b, c
Peak Memory Usage: 0 kB
-> Redistribute Motion 3:3 (slice2; segments: 3) (cost=71147.67..74130.60 rows=99431 width=20) (actual time=2269.435..2759.413 rows=100983 loops=1)
Hash Key: a, b, c
-> Partial HashAggregate (cost=71147.67..72141.98 rows=99431 width=20) (actual time=2744.039..2794.808 rows=100001 loops=1)
Group Key: a, b, c
Peak Memory Usage: 0 kB
-> Seq Scan on ndistinct (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.028..454.030 rows=3334839 loops=1)
Planning Time: 0.173 ms
(slice0) Executor memory: 4670K bytes.
(slice1) Executor memory: 3134K bytes avg x 3 workers, 3149K bytes max (seg0). Work_mem: 5649K bytes max.
(slice2) Executor memory: 13848K bytes avg x 3 workers, 13848K bytes max (seg0). Work_mem: 14353K bytes max.
Memory used: 128000kB
Optimizer: Postgres query optimizer
Execution Time: 2893.470 ms
(17 rows)