使用实时物化视图加速带可变参数的查询
物化视图或实时物化视图都需要针对查询提前创建,在查询语句带有可变参数的场景中(例如查询某一天或者某一个小时内所有订单的数量),您可以通过实时物化视图配合自动查询改写功能来加速带有可变参数的查询。
操作方法
以下内容以TPC-H Q1的查询为例,介绍如何通过实时物化视图与自动查询改写功能对带有可变参数的查询进行加速。
该查询语句引用于TPC-H。
本文的TPC-H的实现基于TPC-H的基准测试,并不能与已发布的TPC-H基准测试结果相比较,本文中的测试并不符合TPC-H基准测试的所有要求。
建表语句如下:
CREATE TABLE lineitem (
l_orderkey bigint not null,
l_partkey integer not null,
l_suppkey integer not null,
l_linenumber integer not null,
l_quantity numeric not null,
l_extendedprice numeric not null,
l_discount numeric not null,
l_tax numeric not null,
l_returnflag "char" not null,
l_linestatus "char" not null,
l_shipdate date not null,
l_commitdate date not null,
l_receiptdate date not null,
l_shipinstruct char(25) not null,
l_shipmode char(10) not null,
l_comment varchar(44) not null
) DISTRIBUTED BY (l_orderkey);
TPC-H Q1查询语句如下:
SELECT l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
FROM
lineitem
WHERE
l_shipdate <= date '1998-12-01' - interval '$1' day --动态条件 $1 = 取值范围为[60,120]
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus
LIMIT 1;
通过TPC-H Q1的查询可以看出,WHERE条件中的l_shipdate的范围根据业务会在[60,120]
之间变化。
根据TPC-H Q1查询语句,物化视图设计如下:
CREATE INCREMENTAL MATERIALIZED VIEW q1_mv
AS
SELECT
l_returnflag,
l_linestatus,
l_shipdate,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
count(*) as count_order,
sum(l_extendedprice) as sum_price,
sum(l_discount) as sum_disc,
count(l_quantity) as count_qty,
count(l_extendedprice) as count_price,
count(l_discount) as count_disc
FROM
lineitem
WHERE
l_shipdate <= date '1998-12-01' - interval '60' day
GROUP BY
l_returnflag,
l_linestatus,
l_shipdate
DISTRIBUTED BY (l_returnflag, l_linestatus);
以上操作中自动查询改写主要使用了如下特性:
重聚合补偿
按照SQL执行规则,SQL会先执行WHERE条件过滤,再执行GROUP BY。TPC-H Q1的原始查询GROUP BY只有
l_returnflag,l_linestatus
2个字段,而为了支持WHERE条件中包含可变参数,创建的实时物化视图GROUP BY中包含了l_returnflag,l_linestatus
和l_shipdate
3个字段。在执行真实查询时,在物化视图上应用动态的WHERE条件,然后执行重聚合,最终得到GROUP BYl_returnflag,l_linestatus
2个字段的结果。说明由于avg函数不支持重聚合,所以这里利用了自动查询改写的表达式补偿机制,在物化视图中创建了支持重聚合的sum和count,通过
avg = sum/count
这一特性完成改写。WHERE条件范围匹配
虽然在创建物化视图时不指定WHERE条件情况下,自动查询改写也会补充上条件,但是该业务限制了查询条件动态变化的范围为
[60,120]
。在创建物化视图时,您可以将WHERE条件范围缩小至60,物化视图即可包含60~120间的任意取值。自动改写查询会判断实际查询SQL的条件是否属于物化视图的子集,如果属于,将进行改写并补偿需要增加的条件。
执行查询计划,查看物化视图的使用情况。查询计划示例如下:
EXPLAIN SELECT
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
FROM
lineitem
WHERE
l_shipdate <= date '1998-12-01' - interval '100' day
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus
LIMIT 1;
返回示例如下:
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Limit (cost=1.01..1.13 rows=1 width=234)
-> Gather Motion 3:1 (slice1; segments: 3) (cost=1.01..1.13 rows=1 width=234)
Merge Key: l_returnflag, l_linestatus
-> Limit (cost=1.01..1.11 rows=1 width=234)
-> GroupAggregate (cost=1.01..1.11 rows=1 width=234)
Group Key: l_returnflag, l_linestatus
-> Sort (cost=1.01..1.01 rows=1 width=194)
Sort Key: l_returnflag, l_linestatus
-> Seq Scan on q1_mv (cost=0.00..1.00 rows=1 width=194)
Filter: (l_shipdate <= '1998-08-23 00:00:00'::timestamp without time zone)
Optimizer: Postgres query optimizer
(11 rows)
测试结果
TPC-H Q1查询语句经过实时物化视图与自动查询改写的加速后,查询速度得到了巨大的提升。测试数据表明,在16个计算节点组成的AnalyticDB PostgreSQL版实例上进行1 TB数据的TPC-H测试,Q1的查询使用实时物化视图配合自动查询改写,可以将查询时间从约340s优化至0.04s,查询速度提升接近一万倍。