窗口函数
窗口函数常用于计算分组排名,移动平均,累计和等复杂计算。本文介绍云原生数据仓库 AnalyticDB MySQL 版窗口函数的用法与示例。
排序函数
CUME_DIST:返回一组数值中每个值的累计分布。
RANK:返回数据集中每个值的排名。
DENSE_RANK:返回一组数值中每个数值的排名。
NTILE:将每个窗口分区的数据分散到桶号从1到n的n个桶中。
ROW_NUMBER:根据行在窗口分区内的顺序,为每行数据返回一个唯一的有序行号,行号从1开始。
PERCENT_RANK:返回数据集中每个数据的排名百分比,其结果由
(r - 1) / (n - 1)
计算得出。其中r为RANK()计算的当前行排名, n为当前窗口分区内总的行数。
值函数
FIRST_VALUE:返回窗口分区第1行的值。
LAST_VALUE返回窗口分区最后1行的值。
LAG:返回窗口内距离当前行之前偏移offset后的值。
LEAD:返回窗口内距离当前行偏移offset后的值。
NTH_VALUE:返回窗口内偏移指定offset后的值,偏移量从1开始。
概述
窗口函数基于查询结果的行数据进行计算,窗口函数运行在HAVING
子句之后、 ORDER BY
子句之前。窗口函数需要特殊的关键字OVER
子句来指定窗口即触发一个窗口函数。
分析型数据库MySQL版支持三种类型的窗口函数:聚合函数、排序函数和值函数。
语法
function over ([partition by a] order by b RANGE|ROWS BETWEEN start AND end)
窗口函数包含以下三个部分。
分区规范(可选):用于将输入行分散到不同的分区中,过程和
GROUP BY
子句的分散过程相似。排序规范:决定输入数据行在窗口函数中执行的顺序。
窗口区间:指定计算数据的窗口边界。
窗口区间支持
RANGE
、ROWS
两种模式:RANGE
按照计算列值的范围进行定义。ROWS
按照计算列的行数进行范围定义。RANGE
、ROWS
中可以使用BETWEEN start AND end
指定边界可取值。BETWEEN start AND end
取值为:CURRENT ROW
,当前行。N PRECEDING
,前n
行。UNBOUNDED PRECEDING
,直到第1
行。N FOLLOWING
,后n
行。UNBOUNDED FOLLOWING
,直到最后1
行。
例如,以下查询根据当前窗口的每行数据计算profit
的部分总和。
select year,country,profit,sum(profit) over (partition by country order by year ROWS BETWEEN UNBOUNDED PRECEDING and CURRENT ROW) as slidewindow from testwindow;
+------+---------+--------+-------------+
| year | country | profit | slidewindow |
+------+---------+--------+-------------+
| 2001 | USA | 50 | 50 |
| 2001 | USA | 1500 | 1550 |
| 2000 | Germany | 75 | 75 |
| 2000 | Germany | 75 | 150 |
| 2001 | Germany | 79 | 229 |
| 2000 | Finland | 1500 | 1500 |
| 2001 | Finland | 10 | 1510 |
而以下查询只能计算出profit
的总和。
select country,sum(profit) over (partition by country) from testwindow;
+---------+-----------------------------------------+
| country | sum(profit) OVER (PARTITION BY country) |
+---------+-----------------------------------------+
| Germany | 229 |
| Germany | 229 |
| Germany | 229 |
| USA | 1550 |
| USA | 1550 |
| Finland | 1510 |
| Finland | 1510 |
注意事项
边界值的取值有如下要求:
start
不能为UNBOUNDED FOLLOWING
,否则提示Window frame start cannot be UNBOUNDED FOLLOWING
错误。end
不能为UNBOUNDED PRECEDING
,否则提示Window frame end cannot be UNBOUNDED PRECEDING
错误。start
为CURRENT ROW
并且end
为N PRECEDING
时,将提示Window frame starting from CURRENT ROW cannot end with PRECEDING
错误。start
为N FOLLOWING
并且end
为N PRECEDING
时,将提示Window frame starting from FOLLOWING cannot end with PRECEDING
错误。start
为N FOLLOWING
并且end
为CURRENT ROW
,将提示Window frame starting from FOLLOWING cannot end with CURRENT ROW
错误。
当模式为RANGE
时:
start
或者end
为N PRECEDING
时,将提示Window frame RANGE PRECEDING is only supported with UNBOUNDED
错误。start
或者end
为N FOLLOWING
时,将提示Window frame RANGE FOLLOWING is only supported with UNBOUNDED
错误。
准备工作
本文中的窗口函数均以testwindow
表为测试数据。
create table testwindow(year int, country varchar(20), product varchar(20), profit int) distributed by hash(year);
insert into testwindow values (2000,'Finland','Computer',1500);
insert into testwindow values (2001,'Finland','Phone',10);
insert into testwindow values (2000,'Germany','Calculator',75);
insert into testwindow values (2000,'Germany','Calculator',75);
insert into testwindow values (2001,'Germany','Calculator',79);
insert into testwindow values (2001,'USA','Calculator',50);
insert into testwindow values (2001,'USA','Computer',1500);
SELECT * FROM testwindow;
+------+---------+------------+--------+
| year | country | product | profit |
+------+---------+------------+--------+
| 2000 | Finland | Computer | 1500 |
| 2001 | Finland | Phone | 10 |
| 2000 | Germany | Calculator | 75 |
| 2000 | Germany | Calculator | 75 |
| 2001 | Germany | Calculator | 79 |
| 2001 | USA | Calculator | 50 |
| 2001 | USA | Computer | 1500 |
聚合函数
所有聚合函数都可以通过添加OVER
子句来作为窗口函数使用,聚合函数将基于当前滑动窗口内的数据行计算每一行数据。
例如,通过以下查询循环显示每个店员每天的订单额总和。
SELECT clerk, orderdate, orderkey, totalprice,sum(totalprice) OVER (PARTITION BY clerk ORDER BY orderdate) AS rolling_sum FROM orders ORDER BY clerk, orderdate, orderkey
CUME_DIST
CUME_DIST()
命令说明:返回一组数值中每个值的累计分布。
返回结果:在窗口分区中对窗口进行排序后的数据集,包括当前行和当前行之前的数据行数。排序中任何关联值均会计算成相同的分布值。
返回值类型:DOUBLE。
示例:
select year,country,product,profit,cume_dist() over (partition by country order by profit) as cume_dist from testwindow; +------+---------+------------+--------+--------------------+ | year | country | product | profit | cume_dist | +------+---------+------------+--------+--------------------+ | 2001 | USA | Calculator | 50 | 0.5 | | 2001 | USA | Computer | 1500 | 1.0 | | 2001 | Finland | Phone | 10 | 0.5 | | 2000 | Finland | Computer | 1500 | 1.0 | | 2000 | Germany | Calculator | 75 | 0.6666666666666666 | | 2000 | Germany | Calculator | 75 | 0.6666666666666666 | | 2001 | Germany | Calculator | 79 | 1.0 |
RANK
RANK()
命令说明:返回数据集中每个值的排名。
排名值是将当前行之前的行数加1,不包含当前行。因此,排序的关联值可能产生顺序上的空隙,而且这个排名会对每个窗口分区进行计算。
返回值类型:BIGINT。
示例:
select year,country,product,profit,rank() over (partition by country order by profit) as rank from testwindow; +------+---------+------------+--------+------+ | year | country | product | profit | rank | +------+---------+------------+--------+------+ | 2001 | Finland | Phone | 10 | 1 | | 2000 | Finland | Computer | 1500 | 2 | | 2001 | USA | Calculator | 50 | 1 | | 2001 | USA | Computer | 1500 | 2 | | 2000 | Germany | Calculator | 75 | 1 | | 2000 | Germany | Calculator | 75 | 1 | | 2001 | Germany | Calculator | 79 | 3 |
DENSE_RANK
DENSE_RANK()
命令说明:返回一组数值中每个数值的排名。
DENSE_RANK()
与RANK()
功能相似,但是DENSE_RANK()
关联值不会产生顺序上的空隙。返回值类型:BIGINT。
示例:
select year,country,product,profit,dense_rank() over (partition by country order by profit) as dense_rank from testwindow; +------+---------+------------+--------+------------+ | year | country | product | profit | dense_rank | +------+---------+------------+--------+------------+ | 2001 | Finland | Phone | 10 | 1 | | 2000 | Finland | Computer | 1500 | 2 | | 2001 | USA | Calculator | 50 | 1 | | 2001 | USA | Computer | 1500 | 2 | | 2000 | Germany | Calculator | 75 | 1 | | 2000 | Germany | Calculator | 75 | 1 | | 2001 | Germany | Calculator | 79 | 2 |
NTILE
NTILE(n)
命令说明:将每个窗口分区的数据分散到桶号从
1
到n
的n
个桶中。桶号值最多间隔
1
,如果窗口分区中的数据行数不能均匀地分散到每一个桶中,则剩余值将从第1
个桶开始,每1
个桶分1
行数据。例如,有6行数据和4个桶, 最终桶号值为1 1 2 2 3 4
。返回值类型:BIGINT。
示例:
select year,country,product,profit,ntile(2) over (partition by country order by profit) as ntile2 from testwindow; +------+---------+------------+--------+--------+ | year | country | product | profit | ntile2 | +------+---------+------------+--------+--------+ | 2001 | USA | Calculator | 50 | 1 | | 2001 | USA | Computer | 1500 | 2 | | 2001 | Finland | Phone | 10 | 1 | | 2000 | Finland | Computer | 1500 | 2 | | 2000 | Germany | Calculator | 75 | 1 | | 2000 | Germany | Calculator | 75 | 1 | | 2001 | Germany | Calculator | 79 | 2 |
ROW_NUMBER
ROW_NUMBER()
命令说明:根据行在窗口分区内的顺序,为每行数据返回一个唯一的有序行号,行号从
1
开始。返回值类型:BIGINT。
示例:
SELECT year, country, product, profit, ROW_NUMBER() OVER(PARTITION BY country) AS row_num1 FROM testwindow; +------+---------+------------+--------+----------+ | year | country | product | profit | row_num1 | +------+---------+------------+--------+----------+ | 2001 | USA | Calculator | 50 | 1 | | 2001 | USA | Computer | 1500 | 2 | | 2000 | Germany | Calculator | 75 | 1 | | 2000 | Germany | Calculator | 75 | 2 | | 2001 | Germany | Calculator | 79 | 3 | | 2000 | Finland | Computer | 1500 | 1 | | 2001 | Finland | Phone | 10 | 2 |
PERCENT_RANK
PERCENT_RANK()
命令说明:返回数据集中每个数据的排名百分比,其结果由
(r - 1) / (n - 1)
计算得出。其中,r
为RANK()
计算的当前行排名,n
为当前窗口分区内总的行数。返回值类型:DOUBLE。
示例:
select year,country,product,profit,PERCENT_RANK() over (partition by country order by profit) as ntile3 from testwindow; +------+---------+------------+--------+--------+ | year | country | product | profit | ntile3 | +------+---------+------------+--------+--------+ | 2001 | Finland | Phone | 10 | 0.0 | | 2000 | Finland | Computer | 1500 | 1.0 | | 2001 | USA | Calculator | 50 | 0.0 | | 2001 | USA | Computer | 1500 | 1.0 | | 2000 | Germany | Calculator | 75 | 0.0 | | 2000 | Germany | Calculator | 75 | 0.0 | | 2001 | Germany | Calculator | 79 | 1.0 |
FIRST_VALUE
FIRST_VALUE(x)
命令说明:返回窗口分区第一行的值。
返回值类型:与输入参数类型相同。
示例:
select year,country,product,profit,first_value(profit) over (partition by country order by profit) as firstValue from testwindow; +------+---------+------------+--------+------------+ | year | country | product | profit | firstValue | +------+---------+------------+--------+------------+ | 2000 | Germany | Calculator | 75 | 75 | | 2000 | Germany | Calculator | 75 | 75 | | 2001 | Germany | Calculator | 79 | 75 | | 2001 | USA | Calculator | 50 | 50 | | 2001 | USA | Computer | 1500 | 50 | | 2001 | Finland | Phone | 10 | 10 | | 2000 | Finland | Computer | 1500 | 10 |
LAST_VALUE
LAST_VALUE(x)
命令说明:返回窗口分区最后一行的值。LAST_VALUE默认统计范围是 rows between unbounded preceding and current row,即取当前行数据与当前行之前的数据进行比较。如果像FIRST_VALUE那样直接在每行数据中显示最后一行数据,需要在 order by 条件的后面加上语句:rows between unbounded preceding and unbounded following。
返回值类型:与输入参数类型相同。
示例1:
select year,country,product,profit,last_value(profit) over (partition by country order by profit) as firstValue from testwindow; +----------------+-------------------+-------------------+------------------+----------------------+ | year | country | product | profit | firstValue | +----------------+-------------------+-------------------+------------------+----------------------+ | 2001 | USA | Calculator | 50 | 50 | | 2001 | USA | Computer | 1500 | 1500 | | 2001 | Finland | Phone | 10 | 10 | | 2000 | Finland | Computer | 1500 | 1500 | | 2000 | Germany | Calculator | 75 | 75 | | 2000 | Germany | Calculator | 75 | 75 | | 2001 | Germany | Calculator | 79 | 79 |
示例2:
select year,country,product,profit,last_value(profit) over (partition by country order by profitrows between unbounded preceding and unbounded following) as lastValue from testwindow; +------+---------+-----------+--------+-----------+ | year | country | product | profit | lastValue | +------+---------+-----------+--------+-----------+ | 2001 | Finland | Phone | 10 | 1500 | | 2000 | Finland | Computer | 1500 | 1500 | | 2000 | Germany | Calculator| 75 | 79 | | 2000 | Germany | Calculator| 75 | 79 | | 2001 | Germany | Calculator| 79 | 79 | | 2001 | USA | Calculator| 50 | 1500 | | 2001 | USA | Computer | 1500 | 1500 | +------+---------+-----------+--------+-----------+
LAG
LAG(x[, offset[, default_value]])
命令说明:返回窗口内距离当前行之前偏移
offset
后的值。偏移量起始值是
0
,也就是当前数据行。偏移量可以是标量表达式,默认offset
是1
。如果偏移量的值是
null
或者大于窗口长度,则返回default_value
;如果没有指定default_value
,则返回null
。返回值类型:与输入参数类型相同。
示例:
select year,country,product,profit,lag(profit) over (partition by country order by profit) as lag from testwindow; +------+---------+------------+--------+------+ | year | country | product | profit | lag | +------+---------+------------+--------+------+ | 2001 | USA | Calculator | 50 | NULL | | 2001 | USA | Computer | 1500 | 50 | | 2000 | Germany | Calculator | 75 | NULL | | 2000 | Germany | Calculator | 75 | 75 | | 2001 | Germany | Calculator | 79 | 75 | | 2001 | Finland | Phone | 10 | NULL | | 2000 | Finland | Computer | 1500 | 10 |
LEAD
LEAD(x[,offset[, default_value]])
命令说明:返回窗口内距离当前行偏移
offset
后的值。偏移量
offset
起始值是0
,也就是当前数据行。偏移量可以是标量表达式,默认offset
是1
。如果偏移量的值是
null
或者大于窗口长度,则返回default_value
;如果没有指定default_value
,则返回null
。返回值类型:与输入参数类型相同。
示例:
select year,country,product,profit,lead(profit) over (partition by country order by profit) as lead from testwindow; +------+---------+------------+--------+------+ | year | country | product | profit | lead | +------+---------+------------+--------+------+ | 2000 | Germany | Calculator | 75 | 75 | | 2000 | Germany | Calculator | 75 | 79 | | 2001 | Germany | Calculator | 79 | NULL | | 2001 | Finland | Phone | 10 | 1500 | | 2000 | Finland | Computer | 1500 | NULL | | 2001 | USA | Calculator | 50 | 1500 | | 2001 | USA | Computer | 1500 | NULL |
NTH_VALUE
NTH_VALUE(x, offset)
命令说明:返回窗口内偏移指定
offset
后的值,偏移量从1
开始。如果偏移量
offset
是null
或者大于窗口内值的个数,则返回null
;如果偏移量offset
为0
或者负数,则系统提示报错。返回值类型:与输入参数类型相同。
示例:
select year,country,product,profit,nth_value(profit,1) over (partition by country order by profit) as nth_value from testwindow; +------+---------+------------+--------+-----------+ | year | country | product | profit | nth_value | +------+---------+------------+--------+-----------+ | 2001 | Finland | Phone | 10 | 10 | | 2000 | Finland | Computer | 1500 | 10 | | 2001 | USA | Calculator | 50 | 50 | | 2001 | USA | Computer | 1500 | 50 | | 2000 | Germany | Calculator | 75 | 75 | | 2000 | Germany | Calculator | 75 | 75 | | 2001 | Germany | Calculator | 79 | 75 |