窗口函数

窗口函数常用于计算分组排名,移动平均,累计和等复杂计算。本文介绍云原生数据仓库 AnalyticDB MySQL 版窗口函数的用法与示例。

  • 聚合函数

  • 排序函数

    • CUME_DIST:返回一组数值中每个值的累计分布。

    • RANK:返回数据集中每个值的排名。

    • DENSE_RANK:返回一组数值中每个数值的排名。

    • NTILE:将每个窗口分区的数据分散到桶号从1到n的n个桶中。

    • ROW_NUMBER:根据行在窗口分区内的顺序,为每行数据返回一个唯一的有序行号,行号从1开始。

    • PERCENT_RANK:返回数据集中每个数据的排名百分比,其结果由(r - 1) / (n - 1)计算得出。其中r为RANK()计算的当前行排名, n为当前窗口分区内总的行数。

  • 值函数

    • FIRST_VALUE:返回窗口分区第1行的值。

    • LAST_VALUE返回窗口分区最后1行的值。

    • LAG:返回窗口内距离当前行之前偏移offset后的值。

    • LEAD:返回窗口内距离当前行偏移offset后的值。

    • NTH_VALUE:返回窗口内偏移指定offset后的值,偏移量从1开始。

概述

窗口函数基于查询结果的行数据进行计算,窗口函数运行在HAVING子句之后、 ORDER BY子句之前。窗口函数需要特殊的关键字OVER子句来指定窗口即触发一个窗口函数。

分析型数据库MySQL版支持三种类型的窗口函数:聚合函数、排序函数和值函数。

语法

function over ([partition by a] order by b RANGE|ROWS BETWEEN start AND end)                

窗口函数包含以下三个部分。

  • 分区规范(可选):用于将输入行分散到不同的分区中,过程和GROUP BY子句的分散过程相似。

  • 排序规范:决定输入数据行在窗口函数中执行的顺序。

  • 窗口区间:指定计算数据的窗口边界。

    窗口区间支持RANGEROWS两种模式:

    • RANGE按照计算列值的范围进行定义。

    • ROWS按照计算列的行数进行范围定义。

    • RANGEROWS中可以使用BETWEEN start AND end指定边界可取值。BETWEEN start AND end取值为:

      • CURRENT ROW,当前行。

      • N PRECEDING,前n行。

      • UNBOUNDED PRECEDING,直到第1行。

      • N FOLLOWING,后n行。

      • UNBOUNDED FOLLOWING,直到最后1行。

例如,以下查询根据当前窗口的每行数据计算profit的部分总和。

select year,country,profit,sum(profit) over (partition by country order by year ROWS BETWEEN UNBOUNDED PRECEDING and CURRENT ROW) as slidewindow from testwindow;
+------+---------+--------+-------------+
| year | country | profit | slidewindow |
+------+---------+--------+-------------+
| 2001 | USA     |     50 |          50 |
| 2001 | USA     |   1500 |        1550 |
| 2000 | Germany |     75 |          75 |
| 2000 | Germany |     75 |         150 |
| 2001 | Germany |     79 |         229 |
| 2000 | Finland |   1500 |        1500 |
| 2001 | Finland |     10 |        1510 |        

而以下查询只能计算出profit的总和。

select country,sum(profit) over (partition by country) from testwindow;
+---------+-----------------------------------------+
| country | sum(profit) OVER (PARTITION BY country) |
+---------+-----------------------------------------+
| Germany |                                     229 |
| Germany |                                     229 |
| Germany |                                     229 |
| USA     |                                    1550 |
| USA     |                                    1550 |
| Finland |                                    1510 |
| Finland |                                    1510 |        

注意事项

边界值的取值有如下要求:

  • start不能为UNBOUNDED FOLLOWING,否则提示Window frame start cannot be UNBOUNDED FOLLOWING错误。

  • end不能为UNBOUNDED PRECEDING,否则提示Window frame end cannot be UNBOUNDED PRECEDING错误。

  • startCURRENT ROW并且endN PRECEDING时,将提示Window frame starting from CURRENT ROW cannot end with PRECEDING错误。

  • startN FOLLOWING并且endN PRECEDING时,将提示Window frame starting from FOLLOWING cannot end with PRECEDING错误。

  • startN FOLLOWING并且endCURRENT ROW,将提示Window frame starting from FOLLOWING cannot end with CURRENT ROW错误。

当模式为RANGE时:

  • start或者endN PRECEDING时,将提示Window frame RANGE PRECEDING is only supported with UNBOUNDED错误。

  • start或者endN FOLLOWING时,将提示Window frame RANGE FOLLOWING is only supported with UNBOUNDED错误。

准备工作

本文中的窗口函数均以testwindow表为测试数据。

create table testwindow(year int, country varchar(20), product varchar(20), profit int) distributed by hash(year);        
insert into testwindow values (2000,'Finland','Computer',1500);
insert into testwindow values (2001,'Finland','Phone',10);
insert into testwindow values (2000,'Germany','Calculator',75);
insert into testwindow values (2000,'Germany','Calculator',75);
insert into testwindow values (2001,'Germany','Calculator',79);
insert into testwindow values (2001,'USA','Calculator',50);
insert into testwindow values (2001,'USA','Computer',1500);        
SELECT * FROM testwindow;
+------+---------+------------+--------+
| year | country | product    | profit |
+------+---------+------------+--------+
| 2000 | Finland | Computer   |   1500 |
| 2001 | Finland | Phone      |     10 |
| 2000 | Germany | Calculator |     75 |
| 2000 | Germany | Calculator |     75 |
| 2001 | Germany | Calculator |     79 |
| 2001 | USA     | Calculator |     50 |
| 2001 | USA     | Computer   |   1500 |        

聚合函数

所有聚合函数都可以通过添加OVER子句来作为窗口函数使用,聚合函数将基于当前滑动窗口内的数据行计算每一行数据。

例如,通过以下查询循环显示每个店员每天的订单额总和。

SELECT clerk, orderdate, orderkey, totalprice,sum(totalprice) OVER (PARTITION BY clerk ORDER BY orderdate) AS rolling_sum FROM orders ORDER BY clerk, orderdate, orderkey            

CUME_DIST

CUME_DIST()           
  • 命令说明:返回一组数值中每个值的累计分布。

    返回结果:在窗口分区中对窗口进行排序后的数据集,包括当前行和当前行之前的数据行数。排序中任何关联值均会计算成相同的分布值。

  • 返回值类型:DOUBLE。

  • 示例:

    select year,country,product,profit,cume_dist() over (partition by country order by profit) as cume_dist from testwindow;
    +------+---------+------------+--------+--------------------+
    | year | country | product    | profit | cume_dist          |
    +------+---------+------------+--------+--------------------+
    | 2001 | USA     | Calculator |     50 |                0.5 |
    | 2001 | USA     | Computer   |   1500 |                1.0 |
    | 2001 | Finland | Phone      |     10 |                0.5 |
    | 2000 | Finland | Computer   |   1500 |                1.0 |
    | 2000 | Germany | Calculator |     75 | 0.6666666666666666 |
    | 2000 | Germany | Calculator |     75 | 0.6666666666666666 |
    | 2001 | Germany | Calculator |     79 |                1.0 |                

RANK

RANK()            
  • 命令说明:返回数据集中每个值的排名。

    排名值是将当前行之前的行数加1,不包含当前行。因此,排序的关联值可能产生顺序上的空隙,而且这个排名会对每个窗口分区进行计算。

  • 返回值类型:BIGINT。

  • 示例:

    select year,country,product,profit,rank() over (partition by country order by profit) as rank from testwindow;
    +------+---------+------------+--------+------+
    | year | country | product    | profit | rank |
    +------+---------+------------+--------+------+
    | 2001 | Finland | Phone      |     10 |    1 |
    | 2000 | Finland | Computer   |   1500 |    2 |
    | 2001 | USA     | Calculator |     50 |    1 |
    | 2001 | USA     | Computer   |   1500 |    2 |
    | 2000 | Germany | Calculator |     75 |    1 |
    | 2000 | Germany | Calculator |     75 |    1 |
    | 2001 | Germany | Calculator |     79 |    3 |                    

DENSE_RANK

DENSE_RANK()            
  • 命令说明:返回一组数值中每个数值的排名。

    DENSE_RANK()RANK()功能相似,但是DENSE_RANK()关联值不会产生顺序上的空隙。

  • 返回值类型:BIGINT。

  • 示例:

    select year,country,product,profit,dense_rank() over (partition by country order by profit) as dense_rank from testwindow;
    +------+---------+------------+--------+------------+
    | year | country | product    | profit | dense_rank |
    +------+---------+------------+--------+------------+
    | 2001 | Finland | Phone      |     10 |          1 |
    | 2000 | Finland | Computer   |   1500 |          2 |
    | 2001 | USA     | Calculator |     50 |          1 |
    | 2001 | USA     | Computer   |   1500 |          2 |
    | 2000 | Germany | Calculator |     75 |          1 |
    | 2000 | Germany | Calculator |     75 |          1 |
    | 2001 | Germany | Calculator |     79 |          2 |                   

NTILE

NTILE(n)            
  • 命令说明:将每个窗口分区的数据分散到桶号从1nn个桶中。

    桶号值最多间隔1,如果窗口分区中的数据行数不能均匀地分散到每一个桶中,则剩余值将从第1个桶开始,每1个桶分1行数据。例如,有6行数据和4个桶, 最终桶号值为1 1 2 2 3 4

  • 返回值类型:BIGINT。

  • 示例:

    select year,country,product,profit,ntile(2) over (partition by country order by profit) as ntile2 from testwindow;
    +------+---------+------------+--------+--------+
    | year | country | product    | profit | ntile2 |
    +------+---------+------------+--------+--------+
    | 2001 | USA     | Calculator |     50 |      1 |
    | 2001 | USA     | Computer   |   1500 |      2 |
    | 2001 | Finland | Phone      |     10 |      1 |
    | 2000 | Finland | Computer   |   1500 |      2 |
    | 2000 | Germany | Calculator |     75 |      1 |
    | 2000 | Germany | Calculator |     75 |      1 |
    | 2001 | Germany | Calculator |     79 |      2 |                    

ROW_NUMBER

ROW_NUMBER()            
  • 命令说明:根据行在窗口分区内的顺序,为每行数据返回一个唯一的有序行号,行号从1开始。

  • 返回值类型:BIGINT。

  • 示例:

    SELECT year, country, product, profit, ROW_NUMBER() OVER(PARTITION BY country) AS row_num1 FROM testwindow;
    +------+---------+------------+--------+----------+
    | year | country | product    | profit | row_num1 |
    +------+---------+------------+--------+----------+
    | 2001 | USA     | Calculator |     50 |        1 |
    | 2001 | USA     | Computer   |   1500 |        2 |
    | 2000 | Germany | Calculator |     75 |        1 |
    | 2000 | Germany | Calculator |     75 |        2 |
    | 2001 | Germany | Calculator |     79 |        3 |
    | 2000 | Finland | Computer   |   1500 |        1 |
    | 2001 | Finland | Phone      |     10 |        2 |                    

PERCENT_RANK

PERCENT_RANK()            
  • 命令说明:返回数据集中每个数据的排名百分比,其结果由(r - 1) / (n - 1)计算得出。其中,rRANK()计算的当前行排名, n为当前窗口分区内总的行数。

  • 返回值类型:DOUBLE。

  • 示例:

    select year,country,product,profit,PERCENT_RANK() over (partition by country order by profit) as ntile3 from testwindow;
    +------+---------+------------+--------+--------+
    | year | country | product    | profit | ntile3 |
    +------+---------+------------+--------+--------+
    | 2001 | Finland | Phone      |     10 |    0.0 |
    | 2000 | Finland | Computer   |   1500 |    1.0 |
    | 2001 | USA     | Calculator |     50 |    0.0 |
    | 2001 | USA     | Computer   |   1500 |    1.0 |
    | 2000 | Germany | Calculator |     75 |    0.0 |
    | 2000 | Germany | Calculator |     75 |    0.0 |
    | 2001 | Germany | Calculator |     79 |    1.0 |                    

FIRST_VALUE

FIRST_VALUE(x)        
  • 命令说明:返回窗口分区第一行的值。

  • 返回值类型:与输入参数类型相同。

  • 示例:

    select year,country,product,profit,first_value(profit) over (partition by country order by profit) as firstValue from testwindow;
    +------+---------+------------+--------+------------+
    | year | country | product    | profit | firstValue |
    +------+---------+------------+--------+------------+
    | 2000 | Germany | Calculator |     75 |         75 |
    | 2000 | Germany | Calculator |     75 |         75 |
    | 2001 | Germany | Calculator |     79 |         75 |
    | 2001 | USA     | Calculator |     50 |         50 |
    | 2001 | USA     | Computer   |   1500 |         50 |
    | 2001 | Finland | Phone      |     10 |         10 |
    | 2000 | Finland | Computer   |   1500 |         10 |                

LAST_VALUE

LAST_VALUE(x)            
  • 命令说明:返回窗口分区最后一行的值。LAST_VALUE默认统计范围是 rows between unbounded preceding and current row,即取当前行数据与当前行之前的数据进行比较。如果像FIRST_VALUE那样直接在每行数据中显示最后一行数据,需要在 order by 条件的后面加上语句:rows between unbounded preceding and unbounded following。

  • 返回值类型:与输入参数类型相同。

  • 示例1:

    select year,country,product,profit,last_value(profit) over (partition by country order by profit) as firstValue from testwindow;
    +----------------+-------------------+-------------------+------------------+----------------------+
    | year           | country           | product           | profit           | firstValue           |
    +----------------+-------------------+-------------------+------------------+----------------------+
    |           2001 | USA               | Calculator        |               50 |                   50 |
    |           2001 | USA               | Computer          |             1500 |                 1500 |
    |           2001 | Finland           | Phone             |               10 |                   10 |
    |           2000 | Finland           | Computer          |             1500 |                 1500 |
    |           2000 | Germany           | Calculator        |               75 |                   75 |
    |           2000 | Germany           | Calculator        |               75 |                   75 |
    |           2001 | Germany           | Calculator        |               79 |                   79 |                 
  • 示例2:

    select year,country,product,profit,last_value(profit) over (partition by country order by profitrows between unbounded preceding and unbounded following) as lastValue from testwindow;
    +------+---------+-----------+--------+-----------+
    | year | country | product   | profit | lastValue |
    +------+---------+-----------+--------+-----------+
    | 2001 | Finland | Phone     |   10   |   1500    |
    | 2000 | Finland | Computer  |  1500  |   1500    |
    | 2000 | Germany | Calculator|   75   |    79     |
    | 2000 | Germany | Calculator|   75   |    79     |
    | 2001 | Germany | Calculator|   79   |    79     |
    | 2001 | USA     | Calculator|   50   |   1500    |
    | 2001 | USA     | Computer  |  1500  |   1500    |
    +------+---------+-----------+--------+-----------+

LAG

LAG(x[, offset[, default_value]])           
  • 命令说明:返回窗口内距离当前行之前偏移offset后的值。

    偏移量起始值是0,也就是当前数据行。偏移量可以是标量表达式,默认offset1

    如果偏移量的值是null或者大于窗口长度,则返回default_value;如果没有指定default_value,则返回null

  • 返回值类型:与输入参数类型相同。

  • 示例:

    select year,country,product,profit,lag(profit) over (partition by country order by profit) as lag from testwindow;
    +------+---------+------------+--------+------+
    | year | country | product    | profit | lag  |
    +------+---------+------------+--------+------+
    | 2001 | USA     | Calculator |     50 | NULL |
    | 2001 | USA     | Computer   |   1500 |   50 |
    | 2000 | Germany | Calculator |     75 | NULL |
    | 2000 | Germany | Calculator |     75 |   75 |
    | 2001 | Germany | Calculator |     79 |   75 |
    | 2001 | Finland | Phone      |     10 | NULL |
    | 2000 | Finland | Computer   |   1500 |   10 |                    

LEAD

LEAD(x[,offset[, default_value]])            
  • 命令说明:返回窗口内距离当前行偏移offset后的值。

    偏移量offset起始值是0,也就是当前数据行。偏移量可以是标量表达式,默认offset1

    如果偏移量的值是null或者大于窗口长度,则返回default_value;如果没有指定default_value,则返回null

  • 返回值类型:与输入参数类型相同。

  • 示例:

    select year,country,product,profit,lead(profit) over (partition by country order by profit) as lead from testwindow;
    +------+---------+------------+--------+------+
    | year | country | product    | profit | lead |
    +------+---------+------------+--------+------+
    | 2000 | Germany | Calculator |     75 |   75 |
    | 2000 | Germany | Calculator |     75 |   79 |
    | 2001 | Germany | Calculator |     79 | NULL |
    | 2001 | Finland | Phone      |     10 | 1500 |
    | 2000 | Finland | Computer   |   1500 | NULL |
    | 2001 | USA     | Calculator |     50 | 1500 |
    | 2001 | USA     | Computer   |   1500 | NULL |                    

NTH_VALUE

NTH_VALUE(x, offset)            
  • 命令说明:返回窗口内偏移指定offset后的值,偏移量从1开始。

    如果偏移量offsetnull或者大于窗口内值的个数,则返回null;如果偏移量offset0或者负数,则系统提示报错。

  • 返回值类型:与输入参数类型相同。

  • 示例:

    select year,country,product,profit,nth_value(profit,1) over (partition by country order by profit) as nth_value from testwindow;
    +------+---------+------------+--------+-----------+
    | year | country | product    | profit | nth_value |
    +------+---------+------------+--------+-----------+
    | 2001 | Finland | Phone      |     10 |        10 |
    | 2000 | Finland | Computer   |   1500 |        10 |
    | 2001 | USA     | Calculator |     50 |        50 |
    | 2001 | USA     | Computer   |   1500 |        50 |
    | 2000 | Germany | Calculator |     75 |        75 |
    | 2000 | Germany | Calculator |     75 |        75 |
    | 2001 | Germany | Calculator |     79 |        75 |