Troubleshoot and resolve compute skew
In AnalyticDB for PostgreSQL, compute skew occurs when uneven data distribution or poorly designed query logic causes the load on some compute nodes to be significantly higher than on others. This imbalance can lead to reduced resource utilization, decreased query performance, and problems such as memory overflow or disk I/O bottlenecks on high-load nodes. This topic describes how to detect and mitigate compute skew in a Massively Parallel Processing (MPP) database to improve overall cluster performance and stability.
Common symptoms
The following symptoms may appear during monitoring or in error messages:
CPU: When an SQL statement is executed, the CPU usage of one or more nodes is significantly higher than on other nodes.
Memory: When an SQL statement is executed, the memory usage of one or more nodes is significantly higher than on other nodes. This may be accompanied by the error
ERROR: Canceling query because of high VMEM usage.Disk: When an SQL statement is executed, the size of spill files on one or more nodes is significantly larger than on other nodes. This may be accompanied by the error
ERROR: workfile per segment size limit exceeded.
Diagnostic methods
You can use different diagnostic methods based on the execution status of the SQL statement.
Normal SQL execution
When the SQL statement executes normally, you can use EXPLAIN ANALYZE to view detailed information about the statement.
You can add the following statement before your SQL statement to print the performance statistics for each compute node (Segment).
SET gp_enable_explain_allstat TO ON;For example:
SET gp_enable_explain_allstat TO ON; EXPLAIN ANALYZE <sql>;In the output, locate the
allstatfield.The format is
allstat: seg_firststart_total_ntuples. Each Segment provides three metrics: firststart (in ms), total (in ms), and ntuples (number of rows processed). The statistics for different Segments are separated by forward slashes (/). The metrics for each Segment are separated by underscores (_). The//endat the end indicates the end of the statistics block. If a Segment processes significantly more rows than other nodes, compute skew exists.For example:
allstat: .../seg1_0.618ms_3.569ms_100000/...//end
Example
The output of EXPLAIN ANALYZE is as follows:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=0.00..862.00 rows=1 width=8) (actual time=54.648..54.649 rows=1 loops=1)
allstat: seg_firststart_total_ntuples/seg-1_0.433 ms_55 ms_1//end
-> Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..862.00 rows=2 width=1) (actual time=40.810..54.268 rows=599999 loops=1)
allstat: seg_firststart_total_ntuples/seg-1_0.435 ms_54 ms_599999//end
-> Hash Left Join (cost=0.00..862.00 rows=1 width=1) (actual time=20.366..54.055 rows=266546 loops=1)
Hash Cond: (skew_hashjoin1.a = skew_hashjoin2.c)
Extra Text: (seg1) Hash chain length 100000.0 avg, 100000 max, using 1 of 131072 buckets.
allstat: seg_firststart_total_ntuples/seg0_0.611 ms_41 ms_166579/seg1_0.618 ms_54 ms_266546/seg2_0.658 ms_40 ms_166874//end
-> Redistribute Motion 3:3 (slice2; segments: 3) (cost=0.00..431.00 rows=1 width=8) (actual time=0.023..36.498 rows=166874 loops=1)
Hash Key: skew_hashjoin1.a
allstat: seg_firststart_total_ntuples/seg0_4.405 ms_37 ms_166579/seg1_7.684 ms_5.770 ms_166547/seg2_4.415 ms_36 ms_166874//end
-> Seq Scan on skew_hashjoin1 (cost=0.00..431.00 rows=1 width=8) (actual time=0.055..4.708 rows=166874 loops=1)
allstat: seg_firststart_total_ntuples/seg0_1.946 ms_4.398 ms_166579/seg1_1.941 ms_4.789 ms_166547/seg2_1.943 ms_4.708 ms_166874//end
-> Hash (cost=431.00..431.00 rows=1 width=8) (actual time=7.064..7.064 rows=100000 loops=1)
Buckets: 1 Batches: 1 Memory Usage: 5305kB
allstat: seg_firststart_total_ntuples/seg0_0.612 ms_3.792 ms_0/seg1_0.618 ms_7.064 ms_100000/seg2_0.659 ms_3.755 ms_0//end
-> Redistribute Motion 3:3 (slice3; segments: 3) (cost=0.00..431.00 rows=1 width=8) (actual time=2.895..3.569 rows=100000 loops=1)
Hash Key: skew_hashjoin2.c
allstat: seg_firststart_total_ntuples/seg0_0.612 ms_3.790 ms_0/seg1_0.618 ms_3.569 ms_100000/seg2_0.659 ms_3.752 ms_0//end
-> Seq Scan on skew_hashjoin2 (cost=0.00..431.00 rows=1 width=8) (actual time=0.050..0.915 rows=33462 loops=1)
allstat: seg_firststart_total_ntuples/seg0_1.946 ms_0.915 ms_33462/seg1_1.925 ms_1.013 ms_33327/seg2_1.923 ms_0.989 ms_33211//end
Optimizer: GPORCA
Planning Time: 4.612 ms
(slice0) Executor memory: 134K bytes.
(slice1) Executor memory: 2691K bytes avg x 3 workers, 7300K bytes max (seg1). Work_mem: 5305K bytes max.
(slice2) Executor memory: 233K bytes avg x 3 workers, 233K bytes max (seg0).
(slice3) Executor memory: 233K bytes avg x 3 workers, 233K bytes max (seg0).
Memory used: 4194304kB
Execution Time: 55.433 ms
(29 rows)Locate the allstat field. The statistics are as follows:
allstat: seg_firststart_total_ntuples/seg0_0.612 ms_3.790 ms_0/seg1_0.618 ms_3.569 ms_100000/seg2_0.659 ms_3.752 ms_0//endThis means:
seg0: firststart=0.612 ms, total=3.790 ms, ntuples=0
seg1: firststart=0.618 ms, total=3.569 ms, ntuples=100000
seg2: firststart=0.659 ms, total=3.752 ms, ntuples=0
The Redistribute Motion operation concentrates the data on seg1, which processed 100,000 rows, while the other Segments processed no data. This indicates that compute skew has occurred.
SQL execution error
If an SQL statement fails with the following error and monitoring shows that the memory usage of some nodes is significantly higher than that of others, you can follow these steps to diagnose the issue.
ERROR: Canceling query because of high VMEM usage. Used: 6975MB, available 25MB, red zone: 6300MB (seg0 slice1 xxx.xxx.xxx.xxx:xxxx pid=xxx)This error does not always indicate compute skew. You must confirm the diagnosis using monitoring data. Compute skew does not always cause this error. This section uses this error as an example to describe the diagnostic steps.
You can use
EXPLAINto view the execution plan.EXPLAIN statement;The following is the execution plan for the SQL statement that caused the error:
QUERY PLAN ------------------------------------------------------------------------------------------------------------ Aggregate (cost=0.00..862.00 rows=1 width=8) -> Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..862.00 rows=2 width=1) -> Hash Left Join (cost=0.00..862.00 rows=1 width=1) Hash Cond: (skew_hashjoin1.a = skew_hashjoin2.c) -> Redistribute Motion 3:3 (slice2; segments: 3) (cost=0.00..431.00 rows=1 width=8) Hash Key: skew_hashjoin1.a -> Seq Scan on skew_hashjoin1 (cost=0.00..431.00 rows=1 width=8) -> Hash (cost=431.00..431.00 rows=1 width=8) -> Redistribute Motion 3:3 (slice3; segments: 3) (cost=0.00..431.00 rows=1 width=8) Hash Key: skew_hashjoin2.c -> Seq Scan on skew_hashjoin2 (cost=0.00..431.00 rows=1 width=8) Optimizer: GPORCA (12 rows)Reading a query plan with EXPLAIN: The query plan shows that the query is executed in three slices. Slice 1 performs a
Hash Left Joinoperation. The left table in slice 2 is redistributed based on the distribution keyskew_hashjoin1.a, and the right table in slice 3 is redistributed based on the keyskew_hashjoin2.c.Verify the data distribution.
Analyze the data distribution of the join keys to check for hot spot values. If the query result shows that the
cntvalue for one or more key values is much higher than for others, this confirms that data skew exists, which leads to compute skew.SELECT count(*) AS cnt, distribute_key_1, distribute_key_2, ..., distribute_key_n FROM [ table | sub_select ] GROUP BY distribute_key_1, distribute_key_2, ... distribute_key_n ORDER BY cnt DESC LIMIT N;For example:
-- Verify the data distribution of the left table in slice1 after the shuffle SELECT count(*) AS cnt, skew_hashjoin1.a FROM skew_hashjoin1 GROUP BY skew_hashjoin1.a ORDER BY cnt DESC LIMIT 10; -- Verify the data distribution of the right table in slice2 after the shuffle SELECT count(*) AS cnt, skew_hashjoin2.c FROM skew_hashjoin2 GROUP BY skew_hashjoin2.c ORDER BY cnt DESC LIMIT 10;The following example shows the returned results. The results show that column c in
skew_hashjoin2has value skew. This confirms that the SQL statement has compute skew.-- Verify the data distribution of the left table in slice1 after the shuffle cnt | a -----+---- 1 | 10 1 | 11 1 | 13 1 | 14 1 | 17 1 | 21 1 | 52 1 | 56 1 | 58 1 | 9 (10 rows) -- Verify the data distribution of the right table in slice2 after the shuffle cnt | c -----------+--- 100000000 | 1 (1 row)
Solutions
In a distributed scenario, compute skew is usually caused by value skew in the distribution key when data is redistributed. Therefore, solutions are categorized into two types: resolving value skew and resolving redistribution issues.
Resolve value skew
Business optimization
The purpose of redistribution is to place identical join keys on the same node during a
JOINoperation. Therefore, if the skewed values in the join key have matching data, you should optimize the data from a business perspective. Avoid concentrating key values at the data source.Handle NULL value skew
If the skewed join key has no matching data, such as a NULL value in an
Inner Join, you can map the NULL values to a range of values that will not be matched at runtime. For example, if all join keys are positive numbers, you can map the NULL values to random negative numbers. This allows the data to be evenly distributed across all nodes during redistribution without affecting theJOINresult. Finally, you can convert all negative values back to NULL.If NULL value skew occurs during a
JOINoperation, you can check the corresponding business SQL statement. For example, you can useEXISTSand try to avoid using theINexpression.
Resolve redistribution issues
If a table does not have data skew, you can create small tables as replicated tables to avoid the redistribution of join keys at runtime.