Troubleshoot and resolve compute skew

更新时间:
复制 MD 格式

In AnalyticDB for PostgreSQL, compute skew occurs when uneven data distribution or poorly designed query logic causes the load on some compute nodes to be significantly higher than on others. This imbalance can lead to reduced resource utilization, decreased query performance, and problems such as memory overflow or disk I/O bottlenecks on high-load nodes. This topic describes how to detect and mitigate compute skew in a Massively Parallel Processing (MPP) database to improve overall cluster performance and stability.

Common symptoms

The following symptoms may appear during monitoring or in error messages:

  • CPU: When an SQL statement is executed, the CPU usage of one or more nodes is significantly higher than on other nodes.

  • Memory: When an SQL statement is executed, the memory usage of one or more nodes is significantly higher than on other nodes. This may be accompanied by the error ERROR: Canceling query because of high VMEM usage.

  • Disk: When an SQL statement is executed, the size of spill files on one or more nodes is significantly larger than on other nodes. This may be accompanied by the error ERROR: workfile per segment size limit exceeded.

Diagnostic methods

You can use different diagnostic methods based on the execution status of the SQL statement.

Normal SQL execution

When the SQL statement executes normally, you can use EXPLAIN ANALYZE to view detailed information about the statement.

  1. You can add the following statement before your SQL statement to print the performance statistics for each compute node (Segment).

    SET gp_enable_explain_allstat TO ON;

    For example:

    SET gp_enable_explain_allstat TO ON;
    EXPLAIN ANALYZE <sql>;
  2. In the output, locate the allstat field.

    The format is allstat: seg_firststart_total_ntuples. Each Segment provides three metrics: firststart (in ms), total (in ms), and ntuples (number of rows processed). The statistics for different Segments are separated by forward slashes (/). The metrics for each Segment are separated by underscores (_). The //end at the end indicates the end of the statistics block. If a Segment processes significantly more rows than other nodes, compute skew exists.

    For example:

    allstat: .../seg1_0.618ms_3.569ms_100000/...//end  

Example

The output of EXPLAIN ANALYZE is as follows:

                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=0.00..862.00 rows=1 width=8) (actual time=54.648..54.649 rows=1 loops=1)
   allstat: seg_firststart_total_ntuples/seg-1_0.433 ms_55 ms_1//end
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..862.00 rows=2 width=1) (actual time=40.810..54.268 rows=599999 loops=1)
         allstat: seg_firststart_total_ntuples/seg-1_0.435 ms_54 ms_599999//end
         ->  Hash Left Join  (cost=0.00..862.00 rows=1 width=1) (actual time=20.366..54.055 rows=266546 loops=1)
               Hash Cond: (skew_hashjoin1.a = skew_hashjoin2.c)
               Extra Text: (seg1)   Hash chain length 100000.0 avg, 100000 max, using 1 of 131072 buckets.
               allstat: seg_firststart_total_ntuples/seg0_0.611 ms_41 ms_166579/seg1_0.618 ms_54 ms_266546/seg2_0.658 ms_40 ms_166874//end
               ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..431.00 rows=1 width=8) (actual time=0.023..36.498 rows=166874 loops=1)
                     Hash Key: skew_hashjoin1.a
                     allstat: seg_firststart_total_ntuples/seg0_4.405 ms_37 ms_166579/seg1_7.684 ms_5.770 ms_166547/seg2_4.415 ms_36 ms_166874//end
                     ->  Seq Scan on skew_hashjoin1  (cost=0.00..431.00 rows=1 width=8) (actual time=0.055..4.708 rows=166874 loops=1)
                           allstat: seg_firststart_total_ntuples/seg0_1.946 ms_4.398 ms_166579/seg1_1.941 ms_4.789 ms_166547/seg2_1.943 ms_4.708 ms_166874//end
               ->  Hash  (cost=431.00..431.00 rows=1 width=8) (actual time=7.064..7.064 rows=100000 loops=1)
                     Buckets: 1  Batches: 1  Memory Usage: 5305kB
                     allstat: seg_firststart_total_ntuples/seg0_0.612 ms_3.792 ms_0/seg1_0.618 ms_7.064 ms_100000/seg2_0.659 ms_3.755 ms_0//end
                     ->  Redistribute Motion 3:3  (slice3; segments: 3)  (cost=0.00..431.00 rows=1 width=8) (actual time=2.895..3.569 rows=100000 loops=1)
                           Hash Key: skew_hashjoin2.c
                           allstat: seg_firststart_total_ntuples/seg0_0.612 ms_3.790 ms_0/seg1_0.618 ms_3.569 ms_100000/seg2_0.659 ms_3.752 ms_0//end
                           ->  Seq Scan on skew_hashjoin2  (cost=0.00..431.00 rows=1 width=8) (actual time=0.050..0.915 rows=33462 loops=1)
                                 allstat: seg_firststart_total_ntuples/seg0_1.946 ms_0.915 ms_33462/seg1_1.925 ms_1.013 ms_33327/seg2_1.923 ms_0.989 ms_33211//end
 Optimizer: GPORCA
 Planning Time: 4.612 ms
   (slice0)    Executor memory: 134K bytes.
   (slice1)    Executor memory: 2691K bytes avg x 3 workers, 7300K bytes max (seg1).  Work_mem: 5305K bytes max.
   (slice2)    Executor memory: 233K bytes avg x 3 workers, 233K bytes max (seg0).
   (slice3)    Executor memory: 233K bytes avg x 3 workers, 233K bytes max (seg0).
 Memory used:  4194304kB
 Execution Time: 55.433 ms
(29 rows)

Locate the allstat field. The statistics are as follows:

allstat: seg_firststart_total_ntuples/seg0_0.612 ms_3.790 ms_0/seg1_0.618 ms_3.569 ms_100000/seg2_0.659 ms_3.752 ms_0//end

This means:

  • seg0: firststart=0.612 ms, total=3.790 ms, ntuples=0

  • seg1: firststart=0.618 ms, total=3.569 ms, ntuples=100000

  • seg2: firststart=0.659 ms, total=3.752 ms, ntuples=0

The Redistribute Motion operation concentrates the data on seg1, which processed 100,000 rows, while the other Segments processed no data. This indicates that compute skew has occurred.

SQL execution error

If an SQL statement fails with the following error and monitoring shows that the memory usage of some nodes is significantly higher than that of others, you can follow these steps to diagnose the issue.

ERROR:  Canceling query because of high VMEM usage. Used: 6975MB, available 25MB, red zone: 6300MB  (seg0 slice1 xxx.xxx.xxx.xxx:xxxx pid=xxx)
Important

This error does not always indicate compute skew. You must confirm the diagnosis using monitoring data. Compute skew does not always cause this error. This section uses this error as an example to describe the diagnostic steps.

  1. You can use EXPLAIN to view the execution plan.

    EXPLAIN statement;

    The following is the execution plan for the SQL statement that caused the error:

                                                     QUERY PLAN                                                 
    ------------------------------------------------------------------------------------------------------------
     Aggregate  (cost=0.00..862.00 rows=1 width=8)
       ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..862.00 rows=2 width=1)
             ->  Hash Left Join  (cost=0.00..862.00 rows=1 width=1)
                   Hash Cond: (skew_hashjoin1.a = skew_hashjoin2.c)
                   ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..431.00 rows=1 width=8)
                         Hash Key: skew_hashjoin1.a
                         ->  Seq Scan on skew_hashjoin1  (cost=0.00..431.00 rows=1 width=8)
                   ->  Hash  (cost=431.00..431.00 rows=1 width=8)
                         ->  Redistribute Motion 3:3  (slice3; segments: 3)  (cost=0.00..431.00 rows=1 width=8)
                               Hash Key: skew_hashjoin2.c
                               ->  Seq Scan on skew_hashjoin2  (cost=0.00..431.00 rows=1 width=8)
     Optimizer: GPORCA
    (12 rows)

    Reading a query plan with EXPLAIN: The query plan shows that the query is executed in three slices. Slice 1 performs a Hash Left Join operation. The left table in slice 2 is redistributed based on the distribution key skew_hashjoin1.a, and the right table in slice 3 is redistributed based on the key skew_hashjoin2.c.

  2. Verify the data distribution.

    Analyze the data distribution of the join keys to check for hot spot values. If the query result shows that the cnt value for one or more key values is much higher than for others, this confirms that data skew exists, which leads to compute skew.

    SELECT count(*) AS cnt, 
           distribute_key_1, 
           distribute_key_2, 
           ..., 
           distribute_key_n
    FROM [ table | sub_select ]
    GROUP BY distribute_key_1, distribute_key_2, ... distribute_key_n
    ORDER BY cnt DESC
    LIMIT N;

    For example:

    -- Verify the data distribution of the left table in slice1 after the shuffle
    SELECT count(*) AS cnt, 
           skew_hashjoin1.a
    FROM skew_hashjoin1
    GROUP BY skew_hashjoin1.a
    ORDER BY cnt DESC
    LIMIT 10;
    
    -- Verify the data distribution of the right table in slice2 after the shuffle
    SELECT count(*) AS cnt, 
           skew_hashjoin2.c
    FROM skew_hashjoin2
    GROUP BY skew_hashjoin2.c
    ORDER BY cnt DESC
    LIMIT 10;

    The following example shows the returned results. The results show that column c in skew_hashjoin2 has value skew. This confirms that the SQL statement has compute skew.

    -- Verify the data distribution of the left table in slice1 after the shuffle
    cnt | a  
    -----+----
       1 | 10
       1 | 11
       1 | 13
       1 | 14
       1 | 17
       1 | 21
       1 | 52
       1 | 56
       1 | 58
       1 | 9
    (10 rows)
    
    -- Verify the data distribution of the right table in slice2 after the shuffle
        cnt    | c 
    -----------+---
     100000000 | 1
    (1 row)

Solutions

In a distributed scenario, compute skew is usually caused by value skew in the distribution key when data is redistributed. Therefore, solutions are categorized into two types: resolving value skew and resolving redistribution issues.

Resolve value skew

  • Business optimization

    The purpose of redistribution is to place identical join keys on the same node during a JOIN operation. Therefore, if the skewed values in the join key have matching data, you should optimize the data from a business perspective. Avoid concentrating key values at the data source.

  • Handle NULL value skew

    • If the skewed join key has no matching data, such as a NULL value in an Inner Join, you can map the NULL values to a range of values that will not be matched at runtime. For example, if all join keys are positive numbers, you can map the NULL values to random negative numbers. This allows the data to be evenly distributed across all nodes during redistribution without affecting the JOIN result. Finally, you can convert all negative values back to NULL.

    • If NULL value skew occurs during a JOIN operation, you can check the corresponding business SQL statement. For example, you can use EXISTS and try to avoid using the IN expression.

Resolve redistribution issues

If a table does not have data skew, you can create small tables as replicated tables to avoid the redistribution of join keys at runtime.