Column group statistics

更新时间:
复制 MD 格式

Column group statistics let AnalyticDB for PostgreSQL V7.0 collect statistics across multiple columns together, rather than treating each column in isolation. This removes the independent-distribution assumption the query planner makes by default, so the planner produces more accurate row count estimates and chooses better execution plans.

Usage notes

Column group statistics work only with the planner optimizer (formerly known as legacy optimizer). To configure the optimizer, see Configure parameters.

Background

By default, the query planner assumes that column values are statistically independent. When a query filters or groups on multiple correlated columns—for example, a and b where a is derived from b—the planner multiplies individual selectivity estimates together, which can produce a large gap between the estimated and actual row count. The misestimate causes the planner to pick suboptimal join orders and aggregation strategies.

Column group statistics address this by teaching the planner about the actual correlation between columns. Create them on column combinations that appear together in WHERE filters or GROUP BY clauses and where you observe a large gap between estimated and actual row counts in EXPLAIN ANALYZE output.

Create column group statistics only on column combinations you actually use for grouping or filtering, and only when EXPLAIN ANALYZE shows a significant gap between estimated and actual row counts. Collecting statistics on every possible column combination increases planning time and storage without benefit.

Syntax

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON column_name, column_name [, ...]
    FROM table_name

Parameters

ParameterDescription
statistics_nameName of the column group statistics object.
statistics_kindType of statistics to collect. Specify one or more values separated by commas. If omitted, all three types are collected. See Statistics kinds.
column_nameColumn to include. Specify at least two columns.
table_nameTable that contains the columns.

Statistics kinds

KindWhat it tracksUse when
ndistinctNumber of distinct value combinations across the specified columnsGROUP BY on correlated columns produces a bad row count estimate
dependenciesFunctional dependencies between columns (for example, if knowing column a determines the value of column b)WHERE clauses filter on columns that are functionally dependent, causing the planner to misestimate result size.
mcvMost common value combinations and their frequenciesQueries that filter on correlated columns where both ndistinct and dependencies give insufficient precision.

Example

This example shows how column group statistics improve row count estimates and execution plan quality for a table with correlated columns.

Set up the table and helper function

Create a table where columns a and b are strongly correlated—both derived from the same sequence value—and insert 10,000,000 rows.

CREATE TABLE ndistinct(a int, b int, c int, d int) DISTRIBUTED BY (d);
INSERT INTO ndistinct(a, b) SELECT i/100, i/200, i/100, i FROM generate_series(1, 10000000) i;

Create a helper function that extracts the planner's estimated row count and the actual row count from EXPLAIN ANALYZE output.

CREATE FUNCTION check_estimated_rows(text) returns table (estimated int, actual int)
LANGUAGE plpgsql AS
$$
DECLARE
    ln text;
    tmp text[];
    first_row bool := true;
BEGIN
    FOR ln in
        execute format('explain analyze %s', $1)
    LOOP
        IF first_row then
            first_row := false;
            tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
            return query select tmp[1]::int, tmp[2]::int;
        END IF;
    END LOOP;
END;
$$;

Check estimates before creating statistics

Run the helper function to compare estimated and actual row counts for a GROUP BY on a and b.

SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');

Result:

 estimated | actual
-----------+--------
   1000000 | 100001
(1 row)

The planner estimates 1,000,000 rows but the query returns 100,001. A large difference exists because the planner treats a and b as independent and multiplies their individual selectivities.

Check the execution plan before creating statistics.

EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;

Result:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=137814.33..154481.00 rows=1000000 width=20) (actual time=3986.143..4014.601 rows=100001 loops=1)
   ->  HashAggregate  (cost=137814.33..141147.67 rows=333333 width=20) (actual time=3977.037..4002.222 rows=33661 loops=1)
         Group Key: a, b, c
         Peak Memory Usage: 0 kB
         ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..104481.00 rows=3333333 width=12) (actual time=0.079..1492.983 rows=3366100 loops=1)
               Hash Key: a, b, c
               ->  Seq Scan on ndistinct  (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.050..1101.632 rows=3334839 loops=1)
 Planning Time: 0.161 ms
   (slice0)    Executor memory: 12336K bytes.
   (slice1)    Executor memory: 13884K bytes avg x 3 workers, 13899K bytes max (seg0).  Work_mem: 16401K bytes max.
   (slice2)    Executor memory: 37K bytes avg x 3 workers, 37K bytes max (seg0).
 Memory used:  128000kB
 Optimizer: Postgres query optimizer
 Execution Time: 4041.613 ms
(14 rows)

The planner uses a plain HashAggregate with no partial aggregation. Execution time: 4,041 ms.

Create column group statistics

Create an ndistinct statistics object on columns a and b, then run ANALYZE to collect statistics.

CREATE STATISTICS s_a_b(ndistinct) ON a,b FROM ndistinct;
ANALYZE ndistinct;

Verify the improved estimates

Run the helper function again to check row count estimates.

SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');

Result:

 estimated | actual
-----------+--------
     99431 | 100001
(1 row)

The estimated number of rows is 99,431, and the actual number of rows is 100,001. The estimated number of rows is more accurate after the column group statistical item is created.

Check the execution plan after creating statistics.

EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;

Result:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=75124.91..76782.09 rows=99431 width=20) (actual time=2854.765..2879.734 rows=100001 loops=1)
   ->  Finalize HashAggregate  (cost=75124.91..75456.34 rows=33144 width=20) (actual time=2853.610..2868.194 rows=33661 loops=1)
         Group Key: a, b, c
         Peak Memory Usage: 0 kB
         ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=71147.67..74130.60 rows=99431 width=20) (actual time=2269.435..2759.413 rows=100983 loops=1)
               Hash Key: a, b, c
               ->  Partial HashAggregate  (cost=71147.67..72141.98 rows=99431 width=20) (actual time=2744.039..2794.808 rows=100001 loops=1)
                     Group Key: a, b, c
                     Peak Memory Usage: 0 kB
                     ->  Seq Scan on ndistinct  (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.028..454.030 rows=3334839 loops=1)
 Planning Time: 0.173 ms
   (slice0)    Executor memory: 4670K bytes.
   (slice1)    Executor memory: 3134K bytes avg x 3 workers, 3149K bytes max (seg0).  Work_mem: 5649K bytes max.
   (slice2)    Executor memory: 13848K bytes avg x 3 workers, 13848K bytes max (seg0).  Work_mem: 14353K bytes max.
 Memory used:  128000kB
 Optimizer: Postgres query optimizer
 Execution Time: 2893.470 ms
(17 rows)

With accurate row count estimates, the planner now knows the aggregation produces roughly 99,000 distinct groups instead of 1,000,000. This lets it introduce Partial HashAggregate on each segment before redistributing data, pre-aggregating locally and reducing the volume of data sent over the network. Execution time drops from 4,041 ms to 2,893 ms.