使用分区连接提升分区表的连接查询性能-云原生数据库PolarDB-阿里云

PolarDB PostgreSQL版支持分区连接（Partition-Wise Join）功能，可以减少分区之间的无效连接，提升连接查询的性能。

概述

分区连接用于两个分区表之间Join优化。当分区表之间使用分区键进行Join时，可以通过分区连接减少分区之间无效的连接，提升连接查询的性能。

使用说明

可通过如下语句开启分区连接功能：

set enable_partitionwise_join to on;

示例

下文通过两个简单易理解的示例来详细介绍分区连接。

示例中包含两个表measurement和sales，具体如下：

CREATE TABLE measurement(
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
) PARTITION BY RANGE (logdate);
CREATE TABLE measurement_y2023q1 PARTITION OF measurement
    FOR VALUES FROM ('2023-01-01') TO ('2023-04-01');
CREATE TABLE measurement_y2023q2 PARTITION OF measurement
    FOR VALUES FROM ('2023-04-01') TO ('2023-07-01');
CREATE TABLE measurement_y2023q3 PARTITION OF measurement
    FOR VALUES FROM ('2023-07-01') TO ('2023-10-01');
CREATE TABLE measurement_y2023q4 PARTITION OF measurement
    FOR VALUES FROM ('2023-10-01') TO ('2024-04-01');
    
CREATE TABLE sales (
    dept_no     integer,
    part_no     varchar(2),
    country     varchar(20),
    date        date,
    amount      decimal
) PARTITION BY RANGE (date);
CREATE TABLE sales_y2023q1 PARTITION OF sales
    FOR VALUES FROM ('2023-01-01') TO ('2023-04-01');
CREATE TABLE sales_y2023q2 PARTITION OF sales
    FOR VALUES FROM ('2023-04-01') TO ('2023-07-01');
CREATE TABLE sales_y2023q3 PARTITION OF sales
    FOR VALUES FROM ('2023-07-01') TO ('2023-10-01');
CREATE TABLE sales_y2023q4 PARTITION OF sales
    FOR VALUES FROM ('2023-10-01') TO ('2024-04-01');

从以上建表语句可以看到：

表measurement有四个分区，即measurement_y2023q1,measurement_y2023q2, measurement_y2023q3, measurement_y2023q4，分别对应了2023年的四个季度。
表sales也有四个分区，即sales_y2023q1,sales_y2023q2, sales_y2023q3, sales_y2023q4，分别对应了2023年的四个季度。

此时，执行表measurement和表sales的连接查询SQL语句，并查看其查询计划：

explain select a.* from sales a join measurement b on a.date = b.logdate where b.unitsales > 10;

当未开启分区连接功能时，表measurement 和表sales 全连接，查询计划如下：

                                            QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
 Aggregate  (cost=871.75..871.76 rows=1 width=8)
   ->  Merge Join  (cost=448.58..812.79 rows=23587 width=32)
         Merge Cond: (a.date = b.logdate)
         ->  Sort  (cost=185.83..191.03 rows=2080 width=40)
               Sort Key: a.date
               ->  Append  (cost=0.00..71.20 rows=2080 width=40)
                     ->  Seq Scan on sales_y2023q1 a  (cost=0.00..15.20 rows=520 width=40)
                     ->  Seq Scan on sales_y2023q2 a_1  (cost=0.00..15.20 rows=520 width=40)
                     ->  Seq Scan on sales_y2023q3 a_2  (cost=0.00..15.20 rows=520 width=40)
                     ->  Seq Scan on sales_y2023q4 a_3  (cost=0.00..15.20 rows=520 width=40)
         ->  Sort  (cost=262.75..268.42 rows=2268 width=8)
               Sort Key: b.logdate
               ->  Append  (cost=0.00..136.34 rows=2268 width=8)
                     ->  Seq Scan on measurement_y2023q1 b  (cost=0.00..31.25 rows=567 width=8)
                           Filter: (unitsales > 10)
                     ->  Seq Scan on measurement_y2023q2 b_1  (cost=0.00..31.25 rows=567 width=8)
                           Filter: (unitsales > 10)
                     ->  Seq Scan on measurement_y2023q3 b_2  (cost=0.00..31.25 rows=567 width=8)
                           Filter: (unitsales > 10)
                     ->  Seq Scan on measurement_y2023q4 b_3  (cost=0.00..31.25 rows=567 width=8)
                           Filter: (unitsales > 10)
(21 rows)

可以看到，该查询计划是使用表measurement的所有数据和表sales的所有数据进行全连接。但此时是存在无效连接的，比如sales_y2023q1和measurement_y2023q3他们之间的Join一定是空的，因为连接条件是分区键相等，而sales_y2023q1和measurement_y2023q3的分区键是不相等的。只有比如当sales_y2023q1和measurement_y2023q1连接后分区键相等才会有结果。

此时如果开启分区连接功能：

set enable_partitionwise_join to on;

然后再执行同样的表measurement和表sales的连接查询SQL语句，其查询计划如下：

explain select a.* from sales a join measurement b on a.date = b.logdate where b.unitsales > 10;
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Append  (cost=21.70..453.33 rows=5896 width=128)
   ->  Hash Join  (cost=21.70..105.96 rows=1474 width=128)
         Hash Cond: (b.logdate = a.date)
         ->  Seq Scan on measurement_y2023q1 b  (cost=0.00..31.25 rows=567 width=8)
               Filter: (unitsales > 10)
         ->  Hash  (cost=15.20..15.20 rows=520 width=128)
               ->  Seq Scan on sales_y2023q1 a  (cost=0.00..15.20 rows=520 width=128)
   ->  Hash Join  (cost=21.70..105.96 rows=1474 width=128)
         Hash Cond: (b_1.logdate = a_1.date)
         ->  Seq Scan on measurement_y2023q2 b_1  (cost=0.00..31.25 rows=567 width=8)
               Filter: (unitsales > 10)
         ->  Hash  (cost=15.20..15.20 rows=520 width=128)
               ->  Seq Scan on sales_y2023q2 a_1  (cost=0.00..15.20 rows=520 width=128)
   ->  Hash Join  (cost=21.70..105.96 rows=1474 width=128)
         Hash Cond: (b_2.logdate = a_2.date)
         ->  Seq Scan on measurement_y2023q3 b_2  (cost=0.00..31.25 rows=567 width=8)
               Filter: (unitsales > 10)
         ->  Hash  (cost=15.20..15.20 rows=520 width=128)
               ->  Seq Scan on sales_y2023q3 a_2  (cost=0.00..15.20 rows=520 width=128)
   ->  Hash Join  (cost=21.70..105.96 rows=1474 width=128)
         Hash Cond: (b_3.logdate = a_3.date)
         ->  Seq Scan on measurement_y2023q4 b_3  (cost=0.00..31.25 rows=567 width=8)
               Filter: (unitsales > 10)
         ->  Hash  (cost=15.20..15.20 rows=520 width=128)
               ->  Seq Scan on sales_y2023q4 a_3  (cost=0.00..15.20 rows=520 width=128)
(25 rows)

可以看到，当开启分区连接后，分布连接优化的效果很明显：sales_y2023q2只需要和measurement_y2023q2连接，sales_y2023q3只需要和measurement_y2023q3连接，sales_y2023q4只需要和measurement_y2023q4连接。分区之间无效的连接被大大减少，从而显著提升连接查询的性能。