EXPLAIN

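During development, you often need to analyze a query statement or a table schema to locate performance bottlenecks. MaxCompute SQL provides the EXPLAIN statement for this purpose. This topic describes the functionality, syntax, and usage examples of EXPLAIN.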

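Description

The EXPLAIN statement displays the structure of the execution plan (the program that implements the SQL semantics) of a DML statement in MaxCompute SQL. It helps you understand how an SQL statement is processed so that you can optimize it. A query can correspond to multiple jobs, and a job can correspond to multiple tasks.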

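Note

If a query is complex enough, the EXPLAIN result can be large. A result larger than 4 MB triggers the API limit, and the complete EXPLAIN result cannot be returned. In this case, split the query and run EXPLAIN on each part separately to understand the structure of the jobs.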
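
As a minimal sketch of that approach, the statements below run EXPLAIN separately on the two subqueries of a join rather than on the whole statement. The table names sale_detail and sale_detail_jt come from the sample data later in this topic; how a real query should be split is a judgment call and is assumed here purely for illustration.

```sql
-- Part 1: plan for the first subquery on its own.
EXPLAIN
SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china';

-- Part 2: plan for the second subquery on its own.
EXPLAIN
SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china';
```

Each result covers only its own part, so the plan for the step that combines the parts (for example, the JOIN) is not shown by either statement.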

Syntax

EXPLAIN <dml query>;

dml query: required. A SELECT statement. For more information, see SELECT syntax.

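Return values

The EXPLAIN result contains the following information:

  • Dependencies between jobs

    For example, job0 is root job. If the query needs only one job (job0), only one line is displayed.

  • Dependencies between tasks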

    In Job job0:
    root Tasks: M1, M2
    J3_1_2_Stg1 depends on: M1, M2

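    job0 contains three tasks: M1, M2, and J3_1_2_Stg1. The system runs M1 and M2 first and, after both finish, runs J3_1_2_Stg1.

    Task names follow these rules:

    • MaxCompute has four task types: MapTask, ReduceTask, JoinTask, and LocalWork. The first letter of a task name indicates the task type. For example, M2Stg1 is a MapTask.

    • The number that immediately follows the first letter is the task ID. The ID is unique among all tasks of the current query.

    • The numbers separated by underscores (_) are the direct dependencies of the task. For example, J3_1_2_Stg1 means that the task has ID 3 and depends on the two tasks with ID 1 (M1) and ID 2 (M2).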
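
    The naming rule can be checked by hand with ordinary string functions. The sketch below assumes that SPLIT_PART and SUBSTR behave as documented for MaxCompute SQL (1-based positions) and that a SELECT without a FROM clause is allowed in your project; the column aliases are hypothetical.

    ```sql
    -- Decode the task name J3_1_2_Stg1 according to the naming rule.
    SELECT SUBSTR(SPLIT_PART('J3_1_2_Stg1', '_', 1), 1, 1) AS task_type,   -- first letter: task type
           SUBSTR(SPLIT_PART('J3_1_2_Stg1', '_', 1), 2)    AS task_id,     -- digits after the letter: task ID
           SPLIT_PART('J3_1_2_Stg1', '_', 2)               AS first_dep,   -- first direct dependency
           SPLIT_PART('J3_1_2_Stg1', '_', 3)               AS second_dep;  -- second direct dependency
    ```

    Under the rule above, the four columns should read J, 3, 1, and 2, which matches the description of J3_1_2_Stg1.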

  • Dependency structure of all operators in a task

    The operator string describes the execution semantics of a task. Sample structure:

    In Task M2:
        Data source: mf_mc_bj.sale_detail_jt/sale_date=2013/region=china  # "Data source" describes the input of the current task.
        TS: mf_mc_bj.sale_detail_jt/sale_date=2013/region=china           # TableScanOperator
            FIL: ISNOTNULL(customer_id)                                   # FilterOperator
                RS: order: +                                              # ReduceSinkOperator
                    nullDirection: *
                    optimizeOrderBy: False
                    valueDestLimit: 0
                    dist: HASH
                    keys:
                          customer_id
                    values:
                          customer_id (string)
                          total_price (double)
                    partitions:
                          customer_id
    
    
    In Task J3_1_2:
        JOIN:                                                           # JoinOperator
             StreamLineRead1 INNERJOIN StreamLineRead2
             keys:
                 0:customer_id
                 1:customer_id
    
            AGGREGATE: group by:customer_id                            # GroupByOperator
             UDAF: SUM(total_price) (__agg_0_sum)[Complete],SUM(total_price) (__agg_1_sum)[Complete]
                RS: order: +
                    nullDirection: *
                    optimizeOrderBy: True
                    valueDestLimit: 10
                    dist: HASH
                    keys:
                          customer_id
                    values:
                          customer_id (string)
                          __agg_0 (double)
                          __agg_1 (double)
                    partitions:
    
    
    In Task R4_3:
        SEL: customer_id,__agg_0,__agg_1                               # SelectOperator
            LIM:limit 10                                               # LimitOperator
                FS: output: Screen                                     # FileSinkOperator
                    schema:
                      customer_id (string) AS ashop
                      __agg_0 (double) AS ap
                      __agg_1 (double) AS bp

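    The operators have the following meanings:

    • TableScanOperator (TS): describes the logic of the FROM clause in the query. The EXPLAIN result shows the name (alias) of the input table.

    • SelectOperator (SEL): describes the logic of the SELECT clause in the query. The EXPLAIN result shows the columns passed to the next operator, separated by commas.

      • A column reference is displayed as <alias>.<column_name>.

      • The result of an expression is displayed in function form, for example func1(arg1_1, arg1_2, func2(arg2_1, arg2_2)).

      • A constant is displayed as its value.

    • FilterOperator (FIL): describes the logic of the WHERE clause in the query. The EXPLAIN result shows a WHERE condition expression, which follows the same display rules as SelectOperator.

    • JoinOperator (JOIN): describes the logic of the JOIN clause in the query. The EXPLAIN result shows which tables are joined and how.

    • GroupByOperator (AGGREGATE): describes the logic of aggregation. This structure appears if the query uses aggregate functions, and the EXPLAIN result shows the aggregate functions.

    • ReduceSinkOperator (RS): describes the logic of data distribution between tasks. If the result of the current task is passed to another task, a ReduceSinkOperator must distribute the data at the end of the current task. The EXPLAIN result shows the sort order of the output, the distribution keys and values, and the columns used to compute the hash value.

    • FileSinkOperator (FS): describes the storage of the final data. If the query contains an INSERT clause, the EXPLAIN result shows the name of the destination table.

    • LimitOperator (LIM): describes the logic of the LIMIT clause in the query. The EXPLAIN result shows the LIMIT count.

    • MapjoinOperator (HASHJOIN): similar to JoinOperator. It describes JOIN operations on large tables.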
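
To relate these operators to a query, the annotated statement below marks which clause each operator typically describes. It is a sketch based on the descriptions above and on the sample table sale_detail from this topic; the condition total_price > 100 is hypothetical, and the optimizer may merge, reorder, or omit operators, so the actual EXPLAIN result can differ.

```sql
EXPLAIN
SELECT customer_id,                        -- SEL: columns passed to the next operator
       SUM(total_price) AS total           -- AGGREGATE: aggregate function
FROM sale_detail                           -- TS: input table
WHERE sale_date='2013' AND region='china'  -- partition predicates: narrow the "Data source" partitions
  AND total_price > 100                    -- FIL: row-level WHERE condition
GROUP BY customer_id                       -- AGGREGATE: group-by key; an RS typically distributes rows to the next task
LIMIT 10;                                  -- LIM: LIMIT count
```

For instance, in the Example 1 result in this topic no LIM line appears, while the RS operator in task R3_2 shows valueDestLimit: 10.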

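Sample data

To make the examples easier to follow, this topic provides source data on which the examples are based. Create the tables sale_detail and sale_detail_jt and add data to them with the following commands:

-- Create the partitioned tables sale_detail and sale_detail_jt.
CREATE TABLE IF NOT EXISTS sale_detail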
(
shop_name     STRING,
customer_id   STRING,
total_price   DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);

CREATE TABLE IF NOT EXISTS sale_detail_jt
(
shop_name     STRING,
customer_id   STRING,
total_price   DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);

-- Add partitions to the source tables.
ALTER TABLE sale_detail ADD PARTITION (sale_date='2013', region='china') PARTITION (sale_date='2014', region='shanghai');
ALTER TABLE sale_detail_jt ADD PARTITION (sale_date='2013', region='china');

-- Append data to the source tables.
INSERT INTO sale_detail PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
INSERT INTO sale_detail PARTITION (sale_date='2014', region='shanghai') VALUES ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
INSERT INTO sale_detail_jt PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s5','c2',100.2);

-- Query the data in the tables sale_detail and sale_detail_jt.
SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail;
-- Result
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
| null       | c5          | NULL        | 2014       | shanghai   |
| s6         | c6          | 100.4       | 2014       | shanghai   |
| s7         | c7          | 100.5       | 2014       | shanghai   |
+------------+-------------+-------------+------------+------------+

SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail_jt;
-- Result
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s5         | c2          | 100.2       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

-- Create the table used for joins.
SET odps.sql.allow.fullscan=true;
CREATE TABLE shop AS SELECT shop_name, customer_id, total_price FROM sale_detail;

Examples

The following examples are based on the sample data.

  • Example 1

    • Query statement:

      SELECT a.customer_id AS ashop, SUM(a.total_price) AS ap,COUNT(b.total_price) AS bp 
      FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a 
      INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b 
      ON a.customer_id=b.customer_id 
      GROUP BY a.customer_id 
      ORDER BY a.customer_id 
      LIMIT 10;
    • Get the execution plan of the query with the following command:

      EXPLAIN 
      SELECT a.customer_id AS ashop, SUM(a.total_price) AS ap,COUNT(b.total_price) AS bp 
      FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a 
      INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b 
      ON a.customer_id=b.customer_id 
      GROUP BY a.customer_id 
      ORDER BY a.customer_id 
      LIMIT 10;

      The following result is returned:

      job0 is root job
      
      In Job job0:
      root Tasks: M1
      M2_1 depends on: M1
      R3_2 depends on: M2_1
      R4_3 depends on: R3_2
      
      In Task M1:
          Data source: doc_****.default.sale_detail/sale_date=2013/region=china
          TS: doc_****.default.sale_detail/sale_date=2013/region=china
              Statistics: Num rows: 3.0, Data size: 324.0
              FIL: ISNOTNULL(customer_id)
                  Statistics: Num rows: 2.7, Data size: 291.6
                  RS: valueDestLimit: 0
                      dist: BROADCAST
                      keys:
                      values:
                            customer_id (string)
                            total_price (double)
                      partitions:
      
                      Statistics: Num rows: 2.7, Data size: 291.6
      
      In Task M2_1:
          Data source: doc_****.default.sale_detail_jt/sale_date=2013/region=china
          TS: doc_****.default.sale_detail_jt/sale_date=2013/region=china
              Statistics: Num rows: 3.0, Data size: 324.0
              FIL: ISNOTNULL(customer_id)
                  Statistics: Num rows: 2.7, Data size: 291.6
                  HASHJOIN:
                           Filter1 INNERJOIN StreamLineRead1
                           keys:
                               0:customer_id
                               1:customer_id
                           non-equals:
                               0:
                               1:
                           bigTable: Filter1
      
                      Statistics: Num rows: 3.6450000000000005, Data size: 787.32
                      RS: order: +
                          nullDirection: *
                          optimizeOrderBy: False
                          valueDestLimit: 0
                          dist: HASH
                          keys:
                                customer_id
                          values:
                                customer_id (string)
                                total_price (double)
                                total_price (double)
                          partitions:
                                customer_id
      
                          Statistics: Num rows: 3.6450000000000005, Data size: 422.82000000000005
      
      In Task R3_2:
          AGGREGATE: group by:customer_id
           UDAF: SUM(total_price) (__agg_0_sum)[Complete],COUNT(total_price) (__agg_1_count)[Complete]
              Statistics: Num rows: 1.0, Data size: 116.0
              RS: order: +
                  nullDirection: *
                  optimizeOrderBy: True
                  valueDestLimit: 10
                  dist: HASH
                  keys:
                        customer_id
                  values:
                        customer_id (string)
                        __agg_0 (double)
                        __agg_1 (bigint)
                  partitions:
      
                  Statistics: Num rows: 1.0, Data size: 116.0
      
      In Task R4_3:
          SEL: customer_id,__agg_0,__agg_1
              Statistics: Num rows: 1.0, Data size: 116.0
              SEL: customer_id ashop, __agg_0 ap, __agg_1 bp, customer_id
                  Statistics: Num rows: 1.0, Data size: 216.0
                  FS: output: Screen
                      schema:
                        ashop (string)
                        ap (double)
                        bp (bigint)
      
                      Statistics: Num rows: 1.0, Data size: 116.0
      
      OK
  • Example 2

    • Query statement:

      SELECT /*+ mapjoin(a) */
             a.customer_id AS ashop, SUM(a.total_price) AS ap,COUNT(b.total_price) AS bp 
      FROM (SELECT * FROM sale_detail_jt 
      WHERE sale_date='2013' AND region='china') a 
      INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b 
      ON a.total_price<b.total_price 
      GROUP BY a.customer_id 
      ORDER BY a.customer_id 
      LIMIT 10;
    • Get the execution plan of the query with the following command:

      EXPLAIN 
      SELECT /*+ mapjoin(a) */
             a.customer_id AS ashop, SUM(a.total_price) AS ap,COUNT(b.total_price) AS bp 
      FROM (SELECT * FROM sale_detail_jt 
      WHERE sale_date='2013' AND region='china') a 
      INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b 
      ON a.total_price<b.total_price 
      GROUP BY a.customer_id 
      ORDER BY a.customer_id 
      LIMIT 10;

      The following result is returned:

      job0 is root job
      
      In Job job0:
      root Tasks: M1
      M2_1 depends on: M1
      R3_2 depends on: M2_1
      R4_3 depends on: R3_2
      
      In Task M1:
          Data source: doc_****.sale_detail_jt/sale_date=2013/region=china
          TS: doc_****.sale_detail_jt/sale_date=2013/region=china
              Statistics: Num rows: 3.0, Data size: 324.0
              RS: valueDestLimit: 0
                  dist: BROADCAST
                  keys:
                  values:
                        customer_id (string)
                        total_price (double)
                  partitions:
      
                  Statistics: Num rows: 3.0, Data size: 324.0
      
      In Task M2_1:
          Data source: doc_****.sale_detail/sale_date=2013/region=china
          TS: doc_****.sale_detail/sale_date=2013/region=china
              Statistics: Num rows: 3.0, Data size: 24.0
              HASHJOIN:
                       StreamLineRead1 INNERJOIN TableScan2
                       keys:
                           0:
                           1:
                       non-equals:
                           0:
                           1:
                       bigTable: TableScan2
      
                  Statistics: Num rows: 9.0, Data size: 1044.0
                  FIL: LT(total_price,total_price)
                      Statistics: Num rows: 6.75, Data size: 783.0
                      AGGREGATE: group by:customer_id
                       UDAF: SUM(total_price) (__agg_0_sum)[Partial_1],COUNT(total_price) (__agg_1_count)[Partial_1]
                          Statistics: Num rows: 2.3116438356164384, Data size: 268.1506849315069
                          RS: order: +
                              nullDirection: *
                              optimizeOrderBy: False
                              valueDestLimit: 0
                              dist: HASH
                              keys:
                                    customer_id
                              values:
                                    customer_id (string)
                                    __agg_0_sum (double)
                                    __agg_1_count (bigint)
                              partitions:
                                    customer_id
      
                              Statistics: Num rows: 2.3116438356164384, Data size: 268.1506849315069
      
      In Task R3_2:
          AGGREGATE: group by:customer_id
           UDAF: SUM(__agg_0_sum)[Final] __agg_0,COUNT(__agg_1_count)[Final] __agg_1
              Statistics: Num rows: 1.6875, Data size: 195.75
              RS: order: +
                  nullDirection: *
                  optimizeOrderBy: True
                  valueDestLimit: 10
                  dist: HASH
                  keys:
                        customer_id
                  values:
                        customer_id (string)
                        __agg_0 (double)
                        __agg_1 (bigint)
                  partitions:
      
                  Statistics: Num rows: 1.6875, Data size: 195.75
      
      In Task R4_3:
          SEL: customer_id,__agg_0,__agg_1
              Statistics: Num rows: 1.6875, Data size: 195.75
              SEL: customer_id ashop, __agg_0 ap, __agg_1 bp, customer_id
                  Statistics: Num rows: 1.6875, Data size: 364.5
                  FS: output: Screen
                      schema:
                        ashop (string)
                        ap (double)
                        bp (bigint)
      
                      Statistics: Num rows: 1.6875, Data size: 195.75
      
      OK