基本功能

使用GROUP BY子句可以对SELECT结果进行分组。GROUP BY子句支持任意的表达式,既可以使用列名,也可以使用序号(从 1 开始)。

下列示例中的查询语句是等价的(列nationkey的位置为 2)。

--- 使用列序号
SELECT count(*), nationkey FROM customer GROUP BY 2;

--- 使用列名
SELECT count(*), nationkey FROM customer GROUP BY nationkey;

没有在输出列表中指定的列也可以用于GROUP BY子句,如下所示:

--- 列mktsegment没有在SELECT列表中指定,
--- 结果集中不包括mktsegment列的内容。
SELECT count(*) FROM customer GROUP BY mktsegment;

 _col0
-------
 29968
 30142
 30189
 29949
 29752
(5 rows)

需要注意的是,在SELECT语句中使用GROUP BY子句时,其输出表达式只能是聚合函数或者是GROUP BY子句中使用的列。

复杂分组操作

Presto 支持如下 3 中复杂的聚合语法,可以在一个查询中实现多个列集合的聚合分析:

  • GROUPING SETS(分组集)
  • CUBE(多维立方)
  • ROLLUP(汇总)

Presto 的复杂分组操作只支持使用列名和序号,不支持表达式。

GROUPING SETS

GROUPING SETS可以在一条查询语句中完成多个列的分组聚合。没有在分组列表中的列使用NULL进行填充。

表 shipping 是一个包含 5 个列的数据表,如下所示:

SELECT * FROM shipping;

 origin_state | origin_zip | destination_state | destination_zip | package_weight
--------------+------------+-------------------+-----------------+----------------
 California   |      94131 | New Jersey        |            8648 |             13
 California   |      94131 | New Jersey        |            8540 |             42
 New Jersey   |       7081 | Connecticut       |            6708 |            225
 California   |      90210 | Connecticut       |            6927 |           1337
 California   |      94131 | Colorado          |           80302 |              5
 New York     |      10002 | New Jersey        |            8540 |              3
(6 rows)

现在希望在一个查询中获取如下几个分组结果:

  • 按 origin_state 进行分组,获取 package_weight 的总和;
  • 按 origin_state 和 origin_zip 分组,获取 package_weight 的总和;
  • 按 destination_state 分组,获取 package_weight 的总和。

使用GROUPING SETS可以在一条语句中获取上述 3 个分组的结果集,如下所示:

SELECT origin_state, origin_zip, destination_state, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
    (origin_state),
    (origin_state, origin_zip),
    (destination_state));
	
 origin_state | origin_zip | destination_state | _col0
--------------+------------+-------------------+-------
 New Jersey   | NULL       | NULL              |   225
 California   | NULL       | NULL              |  1397
 New York     | NULL       | NULL              |     3
 California   |      90210 | NULL              |  1337
 California   |      94131 | NULL              |    60
 New Jersey   |       7081 | NULL              |   225
 New York     |      10002 | NULL              |     3
 NULL         | NULL       | Colorado          |     5
 NULL         | NULL       | New Jersey        |    58
 NULL         | NULL       | Connecticut       |  1562
(10 rows)

上述查询逻辑也可以通过UNION ALL多个GROUP BY查询实现:

SELECT origin_state, NULL, NULL, sum(package_weight)
FROM shipping GROUP BY origin_state

UNION ALL

SELECT origin_state, origin_zip, NULL, sum(package_weight)
FROM shipping GROUP BY origin_state, origin_zip

UNION ALL

SELECT NULL, NULL, destination_state, sum(package_weight)
FROM shipping GROUP BY destination_state;

但是,使用GROUPING SETS语法往往能获得更好的性能,因为,该语法在执行时,只会读取一次基表数据,而使用上述UNION ALL方式,会读取3次,因此,如果在查询期间基表数据有变化,使用UNION ALL的方式容易出现不一致的结果。

CUBE

使用CUBE可以获得给定列列表所有可能的分组结果。如下所示:

SELECT origin_state, destination_state, sum(package_weight)
FROM shipping
GROUP BY CUBE (origin_state, destination_state);

 origin_state | destination_state | _col0
--------------+-------------------+-------
 California   | New Jersey        |    55
 California   | Colorado          |     5
 New York     | New Jersey        |     3
 New Jersey   | Connecticut       |   225
 California   | Connecticut       |  1337
 California   | NULL              |  1397
 New York     | NULL              |     3
 New Jersey   | NULL              |   225
 NULL         | New Jersey        |    58
 NULL         | Connecticut       |  1562
 NULL         | Colorado          |     5
 NULL         | NULL              |  1625
(12 rows)

该查询等价于如下语句:

SELECT origin_state, destination_state, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
    (origin_state, destination_state),
    (origin_state),
    (destination_state),
    ());

ROLLUP

使用ROLLUP可以获得给定列集和的小记结果。如下所示:

SELECT origin_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY ROLLUP (origin_state, origin_zip);

 origin_state | origin_zip | _col2
--------------+------------+-------
 California   |      94131 |    60
 California   |      90210 |  1337
 New Jersey   |       7081 |   225
 New York     |      10002 |     3
 California   | NULL       |  1397
 New York     | NULL       |     3
 New Jersey   | NULL       |   225
 NULL         | NULL       |  1625
(8 rows)
上述示例等价与如下语句:
SELECT origin_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS ((origin_state, origin_zip), (origin_state), ());

综合使用

下列 3 个语句是等价的:

SELECT origin_state, destination_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY
    GROUPING SETS ((origin_state, destination_state)),
    ROLLUP (origin_zip);
SELECT origin_state, destination_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY
    GROUPING SETS ((origin_state, destination_state)),
    GROUPING SETS ((origin_zip), ());
SELECT origin_state, destination_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
    (origin_state, destination_state, origin_zip),
    (origin_state, destination_state));

输出结果如下:

 origin_state | destination_state | origin_zip | _col3
--------------+-------------------+------------+-------
 New York     | New Jersey        |      10002 |     3
 California   | New Jersey        |      94131 |    55
 New Jersey   | Connecticut       |       7081 |   225
 California   | Connecticut       |      90210 |  1337
 California   | Colorado          |      94131 |     5
 New York     | New Jersey        | NULL       |     3
 New Jersey   | Connecticut       | NULL       |   225
 California   | Colorado          | NULL       |     5
 California   | Connecticut       | NULL       |  1337
 California   | New Jersey        | NULL       |    55
(10 rows)

GROUP BY子句中,可以使用ALLDISTINCT修饰符,用来说明是否可以生成重复的统计维度。如下所示:

SELECT origin_state, destination_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY ALL
    CUBE (origin_state, destination_state),
    ROLLUP (origin_state, origin_zip);

等价于:

SELECT origin_state, destination_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
    (origin_state, destination_state, origin_zip),
    (origin_state, origin_zip),
    (origin_state, destination_state, origin_zip),
    (origin_state, origin_zip),
    (origin_state, destination_state),
    (origin_state),
    (origin_state, destination_state),
    (origin_state),
    (origin_state, destination_state),
    (origin_state),
    (destination_state),
    ());

其中有多个维度是重复的。如果使用DISTINCT,则只会输出不重复的维度。

SELECT origin_state, destination_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY DISTINCT
    CUBE (origin_state, destination_state),
    ROLLUP (origin_state, origin_zip);

等价于:

SELECT origin_state, destination_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
    (origin_state, destination_state, origin_zip),
    (origin_state, origin_zip),
    (origin_state, destination_state),
    (origin_state),
    (destination_state),
    ());
说明 GROUP BY 默认使用的修饰符为 ALL。

GROUPING 函数

Presto 提供了一个grouping函数,用于生成一个标记数,该数中的每一个比特位表示对应的列是否出现在该分组条件中。语法如下:

grouping(col1, ..., colN) -> bigint

grouping通常与GROUPING SETS,ROLLUP,CUBEGROUP BY一起使用。grouping中的列必须与GROUPING SETS,ROLLUP,CUBEGROUP BY中指定的列一一对应。

SELECT origin_state, origin_zip, destination_state, sum(package_weight),
       grouping(origin_state, origin_zip, destination_state)
FROM shipping
GROUP BY GROUPING SETS (
        (origin_state),
        (origin_state, origin_zip),
        (destination_state));
		
origin_state | origin_zip | destination_state | _col3 | _col4
--------------+------------+-------------------+-------+-------
California   | NULL       | NULL              |  1397 |     3   --- 011
New Jersey   | NULL       | NULL              |   225 |     3   --- 011
New York     | NULL       | NULL              |     3 |     3   --- 011
California   |      94131 | NULL              |    60 |     1   --- 001
New Jersey   |       7081 | NULL              |   225 |     1   --- 001
California   |      90210 | NULL              |  1337 |     1   --- 001
New York     |      10002 | NULL              |     3 |     1   --- 001
NULL         | NULL       | New Jersey        |    58 |     6   --- 100
NULL         | NULL       | Connecticut       |  1562 |     6   --- 100
NULL         | NULL       | Colorado          |     5 |     6   --- 100
(10 rows)

如上所示,grouping函数返回的是右对齐的比特标记数,0表示列存在,1表示列不存在。