Execution plan optimization

更新时间:
复制 MD 格式

Run EXPLAIN ANALYZE on a slow query, then use the following diagnostic questions to pinpoint the bottleneck. Each question maps directly to a symptom you will see in the plan output.

Are the slow operators obvious at the top of the plan?

Read the plan from top to bottom. The top nodes show the highest cumulative time — the entry point for narrowing down where time is spent. Find the operator with the largest gap between its start and end actual time values and focus your analysis there.

Are the row estimates way off?

Compare the rows estimate with actual rows for each operator. A large discrepancy means the optimizer made decisions based on incorrect assumptions — often because table statistics are stale.

Check EXPLAIN ANALYZE output for row estimate mismatches, especially on operators that feed into joins or aggregations.

To refresh statistics, run:

ANALYZE <table_name>;

Are there operators that rarely belong in analytical workloads?

Nested Loop joins and combinations of Sort and GroupByAgg are rarely appropriate for analytical processing (AP) queries on large datasets. If you see these in a plan with high actual row counts, investigate whether a hash join or hash aggregate is feasible.

Are motion operators doing unnecessary work?

Motion operators — Broadcast Motion and Redistribute Motion — move data between segments and are a common source of overhead. Ask:

  • Can you optimize the distribution key? If a Broadcast Motion scans a large table and broadcasts it to all segments, the inner table is likely misidentified as small. Updating statistics or changing the distribution key can switch the plan to a Redistribute Motion, which is more efficient for large tables.

  • Is a replicated table an option? Small dimension tables that are frequently broadcast are good candidates for replication, which eliminates the motion entirely.

  • Are the inner and outer tables joined in the right order? The build side of a hash join should be the smaller table. If the optimizer picks the wrong side, stale statistics are usually the cause.

Are scan operators using indexes and partition pruning?

For partitioned tables, check that the plan shows Dynamic Seq Scan with selective partition pruning. If all partitions are scanned when only a subset is needed, verify that the filter column matches the partition key.

For large tables, check whether an index scan is available and whether the optimizer is using it.

Is memory pressure causing spills to disk?

Look for Workfile: (N spilling) in the plan output — this indicates operators that overflowed to disk. The Work_mem wanted line quantifies how much memory is needed to avoid the spill:

work_mem: 33535270kB  Segments: 96  Max: 349326kB (segment 33)  Workfile: (96 spilling)
Work_mem wanted: 26684983K bytes avg, 26684983K bytes max (seg0) to lessen workfile I/O affecting 96 workers.

Use the Work_mem wanted value to set statement_mem to a higher value for the session or query, reducing disk I/O.

Example: diagnosing a slow UPDATE

The following execution plan was produced by running EXPLAIN ANALYZE on a slow UPDATE statement. It demonstrates how to apply each diagnostic question.

Update (cost=0.00..1274.11 rows=1 width=1) (actual time=995096.707..1517097.191 rows=245136 loops=1)
  Executor Memory: 1kB  Segments: 96  Max: 1kB (segment 0)
  ->  Partition Selector for t2 (cost=0.00..1274.04 rows=1 width=842) (actual time=995096.480..1514408.806 rows=245136 loops=1)
    ->  Redistribute Motion 96:96 (slice2; segments: 96) (cost=0.00..1274.04 rows=1 width=838) (actual time=995096.440..1513830.155 rows=245136 loops=1)
          Hash Key: t2.c1, t2.c2
      ->  Split (cost=0.00..1274.04 rows=1 width=838) (actual time=995080.103..1496878.037 rows=245136 loops=1)
            Executor Memory: 1kB  Segments: 96  Max: 1kB (segment 0)
        ->  Hash Join (cost=0.00..1274.04 rows=1 width=1484) (actual time=995080.071..1496625.817 rows=122568 loops=1)
              Hash Cond: ((t1.c1)::text = (t2.c1)::text)
              Executor Memory: 33535270kB  Segments: 96  Max: 349326kB (segment 33)
              work_mem: 33535270kB  Segments: 96  Max: 349326kB (segment 33)  Workfile: (96 spilling)
              Work_mem wanted: 26684983K bytes avg, 26684983K bytes max (seg0) to lessen workfile I/O affecting 96 workers.
          ->  Seq Scan on t1 (cost=0.00..672.28 rows=121412 width=736) (actual time=672.771..1039.167 rows=122568 loops=1)
                Filter: ((t1.c2 = '2019-05-17'::date) AND ((t1.c3)::text = '0'::text))
          ->  Hash (cost=431.00..431.00 rows=1 width=762) (actual time=994417.443..994417.443 rows=34583155 loops=1)
            ->  Broadcast Motion 96:96 (slice1; segments: 96) (cost=0.00..431.00 rows=1 width=762) (actual time=25.562..912862.203 rows=34583155 loops=1)
              ->  Sequence (cost=0.00..431.00 rows=1 width=762) (actual time=34.475..4822.173 rows=361460 loops=1)
                ->  Partition Selector for t2 (dynamic scan id: 1) (cost=10.00..100.00 rows=2 width=4) (never executed)
                      Partitions selected: 27 (out of 27)
                ->  Dynamic Seq Scan on t2 (dynamic scan id: 1) (cost=0.00..431.00 rows=1 width=762) (actual time=34.441..4680.938 rows=361460 loops=1)
                      Partitions scanned:  Avg 27.0 (out of 27) x 96 workers.  Max 27 parts (seg0).

Step 1: Find the slowest operator

Reading from top to bottom, the Hash Join has the largest actual time span (995080.071..1496625.817), making it the primary bottleneck.

Step 2: Check for disk spills

The Hash Join output shows:

Workfile: (96 spilling)
Work_mem wanted: 26684983K bytes avg, 26684983K bytes max (seg0) to lessen workfile I/O affecting 96 workers.

All 96 segments spilled to disk. The Work_mem wanted line shows the memory required to eliminate the spill.

Step 3: Check row estimates vs. actual rows

The Broadcast Motion fed 34,583,155 actual rows into the hash build phase — but the optimizer estimated only rows=1. This discrepancy means the optimizer incorrectly treated t2 as a tiny table and chose it as the inner (build) side of the hash join.

The root cause is stale statistics on t1. Without accurate statistics for t1, the optimizer could not correctly estimate join cardinality and made the wrong build-side decision, broadcasting the much larger t2 table to all 96 segments instead.

Step 4: Identify the root cause

Because statistics on t1 were not up to date, the optimizer considered t2 as a small table and selected it as the inner table for the hash join. The optimizer then broadcast t2 — a large partitioned table with over 34 million rows — to all 96 segments. That volume exceeded available memory, causing t2 data to spill to disk on every segment and making the query slow.

Resolution

Collect fresh statistics on t2:

ANALYZE t2;

After running ANALYZE, the optimizer will have accurate row count estimates and is likely to pick a more efficient join plan — avoiding the oversized broadcast and the disk spill.