窗口函数语法与用法示例-云原生数据仓库AnalyticDB-阿里云

窗口函数常用于计算分组排名，移动平均，累计和等复杂计算。本文介绍云原生数据仓库 AnalyticDB MySQL 版窗口函数的用法与示例。

聚合函数
排序函数
- CUME_DIST：返回一组数值中每个值的累计分布。
- RANK：返回数据集中每个值的排名。
- DENSE_RANK：返回一组数值中每个数值的排名。
- NTILE：将每个窗口分区的数据分散到桶号从1到n的n个桶中。
- ROW_NUMBER：根据行在窗口分区内的顺序，为每行数据返回一个唯一的有序行号，行号从1开始。
- PERCENT_RANK：返回数据集中每个数据的排名百分比，其结果由(r - 1) / (n - 1)计算得出。其中r为RANK()计算的当前行排名， n为当前窗口分区内总的行数。
值函数
- FIRST_VALUE：返回窗口分区第1行的值。
- LAST_VALUE返回窗口分区最后1行的值。
- LAG：返回窗口内距离当前行之前偏移offset后的值。
- LEAD：返回窗口内距离当前行偏移offset后的值。
- NTH_VALUE：返回窗口内偏移指定offset后的值，偏移量从1开始。

概述

窗口函数基于查询结果的行数据进行计算，窗口函数运行在HAVING子句之后、 ORDER BY子句之前。窗口函数需要特殊的关键字OVER子句来指定窗口即触发一个窗口函数。

分析型数据库MySQL版支持三种类型的窗口函数：聚合函数、排序函数和值函数。

语法

function over ([partition by a] order by b RANGE|ROWS BETWEEN start AND end)

窗口函数包含以下三个部分。

分区规范（可选）：用于将输入行分散到不同的分区中，过程和GROUP BY子句的分散过程相似。
排序规范：决定输入数据行在窗口函数中执行的顺序。
窗口区间：指定计算数据的窗口边界。
窗口区间支持RANGE、ROWS两种模式：
- RANGE按照计算列值的范围进行定义。
- ROWS按照计算列的行数进行范围定义。
- RANGE、ROWS中可以使用BETWEEN start AND end指定边界可取值。BETWEEN start AND end取值为：
  - CURRENT ROW，当前行。
  - N PRECEDING，前n行。
  - UNBOUNDED PRECEDING，直到第1行。
  - N FOLLOWING，后n行。
  - UNBOUNDED FOLLOWING，直到最后1行。

例如，以下查询根据当前窗口的每行数据计算profit的部分总和。

select year,country,profit,sum(profit) over (partition by country order by year ROWS BETWEEN UNBOUNDED PRECEDING and CURRENT ROW) as slidewindow from testwindow;
+------+---------+--------+-------------+
| year | country | profit | slidewindow |
+------+---------+--------+-------------+
| 2001 | USA     |     50 |          50 |
| 2001 | USA     |   1500 |        1550 |
| 2000 | Germany |     75 |          75 |
| 2000 | Germany |     75 |         150 |
| 2001 | Germany |     79 |         229 |
| 2000 | Finland |   1500 |        1500 |
| 2001 | Finland |     10 |        1510 |

而以下查询只能计算出profit的总和。

select country,sum(profit) over (partition by country) from testwindow;
+---------+-----------------------------------------+
| country | sum(profit) OVER (PARTITION BY country) |
+---------+-----------------------------------------+
| Germany |                                     229 |
| Germany |                                     229 |
| Germany |                                     229 |
| USA     |                                    1550 |
| USA     |                                    1550 |
| Finland |                                    1510 |
| Finland |                                    1510 |

注意事项

边界值的取值有如下要求：

start不能为UNBOUNDED FOLLOWING，否则提示Window frame start cannot be UNBOUNDED FOLLOWING错误。
end不能为UNBOUNDED PRECEDING，否则提示Window frame end cannot be UNBOUNDED PRECEDING错误。
start为CURRENT ROW并且end为N PRECEDING时，将提示Window frame starting from CURRENT ROW cannot end with PRECEDING错误。
start为N FOLLOWING并且end为N PRECEDING时，将提示Window frame starting from FOLLOWING cannot end with PRECEDING错误。
start为N FOLLOWING并且end为CURRENT ROW，将提示Window frame starting from FOLLOWING cannot end with CURRENT ROW错误。

当模式为RANGE时：

start或者end为N PRECEDING时，将提示Window frame RANGE PRECEDING is only supported with UNBOUNDED错误。
start或者end为N FOLLOWING时，将提示Window frame RANGE FOLLOWING is only supported with UNBOUNDED错误。

准备工作

本文中的窗口函数均以testwindow表为测试数据。

create table testwindow(year int, country varchar(20), product varchar(20), profit int) distributed by hash(year);

insert into testwindow values (2000,'Finland','Computer',1500);
insert into testwindow values (2001,'Finland','Phone',10);
insert into testwindow values (2000,'Germany','Calculator',75);
insert into testwindow values (2000,'Germany','Calculator',75);
insert into testwindow values (2001,'Germany','Calculator',79);
insert into testwindow values (2001,'USA','Calculator',50);
insert into testwindow values (2001,'USA','Computer',1500);

SELECT * FROM testwindow;
+------+---------+------------+--------+
| year | country | product    | profit |
+------+---------+------------+--------+
| 2000 | Finland | Computer   |   1500 |
| 2001 | Finland | Phone      |     10 |
| 2000 | Germany | Calculator |     75 |
| 2000 | Germany | Calculator |     75 |
| 2001 | Germany | Calculator |     79 |
| 2001 | USA     | Calculator |     50 |
| 2001 | USA     | Computer   |   1500 |

聚合函数

所有聚合函数都可以通过添加OVER子句来作为窗口函数使用，聚合函数将基于当前滑动窗口内的数据行计算每一行数据。

例如，通过以下查询循环显示每个店员每天的订单额总和。

SELECT clerk, orderdate, orderkey, totalprice,sum(totalprice) OVER (PARTITION BY clerk ORDER BY orderdate) AS rolling_sum FROM orders ORDER BY clerk, orderdate, orderkey

CUME_DIST

CUME_DIST()

命令说明：返回一组数值中每个值的累计分布。
返回结果：在窗口分区中对窗口进行排序后的数据集，包括当前行和当前行之前的数据行数。排序中任何关联值均会计算成相同的分布值。
返回值类型：DOUBLE。

示例：

select year,country,product,profit,cume_dist() over (partition by country order by profit) as cume_dist from testwindow;
+------+---------+------------+--------+--------------------+
| year | country | product    | profit | cume_dist          |
+------+---------+------------+--------+--------------------+
| 2001 | USA     | Calculator |     50 |                0.5 |
| 2001 | USA     | Computer   |   1500 |                1.0 |
| 2001 | Finland | Phone      |     10 |                0.5 |
| 2000 | Finland | Computer   |   1500 |                1.0 |
| 2000 | Germany | Calculator |     75 | 0.6666666666666666 |
| 2000 | Germany | Calculator |     75 | 0.6666666666666666 |
| 2001 | Germany | Calculator |     79 |                1.0 |

RANK

RANK()

命令说明：返回数据集中每个值的排名。
排名值是将当前行之前的行数加1，不包含当前行。因此，排序的关联值可能产生顺序上的空隙，而且这个排名会对每个窗口分区进行计算。
返回值类型：BIGINT。

示例：

select year,country,product,profit,rank() over (partition by country order by profit) as rank from testwindow;
+------+---------+------------+--------+------+
| year | country | product    | profit | rank |
+------+---------+------------+--------+------+
| 2001 | Finland | Phone      |     10 |    1 |
| 2000 | Finland | Computer   |   1500 |    2 |
| 2001 | USA     | Calculator |     50 |    1 |
| 2001 | USA     | Computer   |   1500 |    2 |
| 2000 | Germany | Calculator |     75 |    1 |
| 2000 | Germany | Calculator |     75 |    1 |
| 2001 | Germany | Calculator |     79 |    3 |

DENSE_RANK

DENSE_RANK()

命令说明：返回一组数值中每个数值的排名。
DENSE_RANK()与RANK()功能相似，但是DENSE_RANK()关联值不会产生顺序上的空隙。
返回值类型：BIGINT。

示例：

select year,country,product,profit,dense_rank() over (partition by country order by profit) as dense_rank from testwindow;
+------+---------+------------+--------+------------+
| year | country | product    | profit | dense_rank |
+------+---------+------------+--------+------------+
| 2001 | Finland | Phone      |     10 |          1 |
| 2000 | Finland | Computer   |   1500 |          2 |
| 2001 | USA     | Calculator |     50 |          1 |
| 2001 | USA     | Computer   |   1500 |          2 |
| 2000 | Germany | Calculator |     75 |          1 |
| 2000 | Germany | Calculator |     75 |          1 |
| 2001 | Germany | Calculator |     79 |          2 |

NTILE

NTILE(n)

命令说明：将每个窗口分区的数据分散到桶号从1到n的n个桶中。
桶号值最多间隔1，如果窗口分区中的数据行数不能均匀地分散到每一个桶中，则剩余值将从第1个桶开始，每1个桶分1行数据。例如，有6行数据和4个桶，最终桶号值为1 1 2 2 3 4。
返回值类型：BIGINT。

示例：

select year,country,product,profit,ntile(2) over (partition by country order by profit) as ntile2 from testwindow;
+------+---------+------------+--------+--------+
| year | country | product    | profit | ntile2 |
+------+---------+------------+--------+--------+
| 2001 | USA     | Calculator |     50 |      1 |
| 2001 | USA     | Computer   |   1500 |      2 |
| 2001 | Finland | Phone      |     10 |      1 |
| 2000 | Finland | Computer   |   1500 |      2 |
| 2000 | Germany | Calculator |     75 |      1 |
| 2000 | Germany | Calculator |     75 |      1 |
| 2001 | Germany | Calculator |     79 |      2 |

ROW_NUMBER

ROW_NUMBER()

命令说明：根据行在窗口分区内的顺序，为每行数据返回一个唯一的有序行号，行号从1开始。
返回值类型：BIGINT。

示例：

SELECT year, country, product, profit, ROW_NUMBER() OVER(PARTITION BY country) AS row_num1 FROM testwindow;
+------+---------+------------+--------+----------+
| year | country | product    | profit | row_num1 |
+------+---------+------------+--------+----------+
| 2001 | USA     | Calculator |     50 |        1 |
| 2001 | USA     | Computer   |   1500 |        2 |
| 2000 | Germany | Calculator |     75 |        1 |
| 2000 | Germany | Calculator |     75 |        2 |
| 2001 | Germany | Calculator |     79 |        3 |
| 2000 | Finland | Computer   |   1500 |        1 |
| 2001 | Finland | Phone      |     10 |        2 |

PERCENT_RANK

PERCENT_RANK()

命令说明：返回数据集中每个数据的排名百分比，其结果由(r - 1) / (n - 1)计算得出。其中，r为RANK()计算的当前行排名， n为当前窗口分区内总的行数。
返回值类型：DOUBLE。

示例：

select year,country,product,profit,PERCENT_RANK() over (partition by country order by profit) as ntile3 from testwindow;
+------+---------+------------+--------+--------+
| year | country | product    | profit | ntile3 |
+------+---------+------------+--------+--------+
| 2001 | Finland | Phone      |     10 |    0.0 |
| 2000 | Finland | Computer   |   1500 |    1.0 |
| 2001 | USA     | Calculator |     50 |    0.0 |
| 2001 | USA     | Computer   |   1500 |    1.0 |
| 2000 | Germany | Calculator |     75 |    0.0 |
| 2000 | Germany | Calculator |     75 |    0.0 |
| 2001 | Germany | Calculator |     79 |    1.0 |

FIRST_VALUE

FIRST_VALUE(x)

命令说明：返回窗口分区第一行的值。
返回值类型：与输入参数类型相同。

示例：

select year,country,product,profit,first_value(profit) over (partition by country order by profit) as firstValue from testwindow;
+------+---------+------------+--------+------------+
| year | country | product    | profit | firstValue |
+------+---------+------------+--------+------------+
| 2000 | Germany | Calculator |     75 |         75 |
| 2000 | Germany | Calculator |     75 |         75 |
| 2001 | Germany | Calculator |     79 |         75 |
| 2001 | USA     | Calculator |     50 |         50 |
| 2001 | USA     | Computer   |   1500 |         50 |
| 2001 | Finland | Phone      |     10 |         10 |
| 2000 | Finland | Computer   |   1500 |         10 |

LAST_VALUE

LAST_VALUE(x)

命令说明：返回窗口分区最后一行的值。LAST_VALUE默认统计范围是 rows between unbounded preceding and current row，即取当前行数据与当前行之前的数据进行比较。如果像FIRST_VALUE那样直接在每行数据中显示最后一行数据，需要在 order by 条件的后面加上语句：rows between unbounded preceding and unbounded following。
返回值类型：与输入参数类型相同。

示例1：

select year,country,product,profit,last_value(profit) over (partition by country order by profit) as firstValue from testwindow;
+----------------+-------------------+-------------------+------------------+----------------------+
| year           | country           | product           | profit           | firstValue           |
+----------------+-------------------+-------------------+------------------+----------------------+
|           2001 | USA               | Calculator        |               50 |                   50 |
|           2001 | USA               | Computer          |             1500 |                 1500 |
|           2001 | Finland           | Phone             |               10 |                   10 |
|           2000 | Finland           | Computer          |             1500 |                 1500 |
|           2000 | Germany           | Calculator        |               75 |                   75 |
|           2000 | Germany           | Calculator        |               75 |                   75 |
|           2001 | Germany           | Calculator        |               79 |                   79 |

示例2：

select year,country,product,profit,last_value(profit) over (partition by country order by profitrows between unbounded preceding and unbounded following) as lastValue from testwindow;
+------+---------+-----------+--------+-----------+
| year | country | product   | profit | lastValue |
+------+---------+-----------+--------+-----------+
| 2001 | Finland | Phone     |   10   |   1500    |
| 2000 | Finland | Computer  |  1500  |   1500    |
| 2000 | Germany | Calculator|   75   |    79     |
| 2000 | Germany | Calculator|   75   |    79     |
| 2001 | Germany | Calculator|   79   |    79     |
| 2001 | USA     | Calculator|   50   |   1500    |
| 2001 | USA     | Computer  |  1500  |   1500    |
+------+---------+-----------+--------+-----------+

LAG

LAG(x[, offset[, default_value]])

命令说明：返回窗口内距离当前行之前偏移offset后的值。
偏移量起始值是0，也就是当前数据行。偏移量可以是标量表达式，默认offset是1 。
如果偏移量的值是null或者大于窗口长度，则返回default_value；如果没有指定default_value，则返回null。
返回值类型：与输入参数类型相同。

示例：

select year,country,product,profit,lag(profit) over (partition by country order by profit) as lag from testwindow;
+------+---------+------------+--------+------+
| year | country | product    | profit | lag  |
+------+---------+------------+--------+------+
| 2001 | USA     | Calculator |     50 | NULL |
| 2001 | USA     | Computer   |   1500 |   50 |
| 2000 | Germany | Calculator |     75 | NULL |
| 2000 | Germany | Calculator |     75 |   75 |
| 2001 | Germany | Calculator |     79 |   75 |
| 2001 | Finland | Phone      |     10 | NULL |
| 2000 | Finland | Computer   |   1500 |   10 |

LEAD

LEAD(x[,offset[, default_value]])

命令说明：返回窗口内距离当前行偏移offset后的值。
偏移量offset起始值是0，也就是当前数据行。偏移量可以是标量表达式，默认offset是1 。
如果偏移量的值是null或者大于窗口长度，则返回default_value；如果没有指定default_value，则返回null。
返回值类型：与输入参数类型相同。

示例：

select year,country,product,profit,lead(profit) over (partition by country order by profit) as lead from testwindow;
+------+---------+------------+--------+------+
| year | country | product    | profit | lead |
+------+---------+------------+--------+------+
| 2000 | Germany | Calculator |     75 |   75 |
| 2000 | Germany | Calculator |     75 |   79 |
| 2001 | Germany | Calculator |     79 | NULL |
| 2001 | Finland | Phone      |     10 | 1500 |
| 2000 | Finland | Computer   |   1500 | NULL |
| 2001 | USA     | Calculator |     50 | 1500 |
| 2001 | USA     | Computer   |   1500 | NULL |

NTH_VALUE

NTH_VALUE(x, offset)

命令说明：返回窗口内偏移指定offset后的值，偏移量从1开始。
如果偏移量offset是null或者大于窗口内值的个数，则返回null；如果偏移量offset为0或者负数，则系统提示报错。
返回值类型：与输入参数类型相同。

示例：

select year,country,product,profit,nth_value(profit,1) over (partition by country order by profit) as nth_value from testwindow;
+------+---------+------------+--------+-----------+
| year | country | product    | profit | nth_value |
+------+---------+------------+--------+-----------+
| 2001 | Finland | Phone      |     10 |        10 |
| 2000 | Finland | Computer   |   1500 |        10 |
| 2001 | USA     | Calculator |     50 |        50 |
| 2001 | USA     | Computer   |   1500 |        50 |
| 2000 | Germany | Calculator |     75 |        75 |
| 2000 | Germany | Calculator |     75 |        75 |
| 2001 | Germany | Calculator |     79 |        75 |