Common Table Expression (CTE)

更新时间:
复制 MD 格式

A COMMON TABLE EXPRESSION (CTE) is a temporarily named result set that simplifies SQL. MaxCompute supports standard SQL CTEs to improve the readability and execution efficiency of SQL statements. This topic describes the features, syntax, and examples of CTEs.

Overview

  • A CTE can be considered a temporary result set defined within the execution scope of a single DML statement. Similar to a derived table, a CTE is not stored as an object and persists only for the duration of the query. Using CTEs during development improves SQL readability and simplifies the maintenance of complex queries.

  • A CTE is a statement-level clause expression that begins with WITH, followed by the expression name. MaxCompute supports two forms of CTE:

    • NON RECURSIVE CTE: A CTE that does not reference itself and does not iterate. Use this to simplify queries that reuse the same subquery logic.

    • RECURSIVE CTE: A CTE that can reference itself iteratively, enabling recursive query capabilities in SQL. Typically used to traverse hierarchical data such as organizational charts.

  • Materialized CTE: When defining a CTE, use the MATERIALIZE HINT in the SELECT statement to cache the CTE result in a temporary table. Subsequent references read from the cache, avoiding memory limit issues in deeply nested CTE scenarios and improving performance.

NON RECURSIVE CTE

Syntax

WITH
  <cte_name> [(col_name [, col_name] ...)] AS (
    <cte_query>
  )
  [, <cte_name> [(col_name [, col_name] ...)] AS (
    <cte_query2>
  )
  , ...]

Parameters

Parameter

Required

Description

cte_name

Yes

The name of the CTE. Must be unique within the WITH clause. Any reference to cte_name later in the statement reads from this CTE.

col_name

No

Output column names for the CTE. If omitted, column names are inherited from the SELECT list in cte_query.

cte_query

Yes

A SELECT statement whose result set defines the CTE.

Example

The following query uses a UNION ALL to combine two JOIN operations. Both joins share the same left-hand subquery, which must be duplicated without a CTE:

INSERT OVERWRITE TABLE srcp PARTITION (p='abc')
SELECT * FROM (
    SELECT a.key, b.value
    FROM (
        SELECT * FROM src WHERE key IS NOT NULL) a
    JOIN (
        SELECT * FROM src2 WHERE value > 0) b
    ON a.key = b.key
) c
UNION ALL
SELECT * FROM (
    SELECT a.key, b.value
    FROM (
        SELECT * FROM src WHERE key IS NOT NULL) a
    LEFT OUTER JOIN (
        SELECT * FROM src3 WHERE value > 0) b
    ON a.key = b.key AND b.key IS NOT NULL
) d;

Rewriting with a CTE removes the duplication. The subquery a is defined once and reused by both joins:

WITH
  a AS (SELECT * FROM src WHERE key IS NOT NULL),
  b AS (SELECT * FROM src2 WHERE value > 0),
  c AS (SELECT * FROM src3 WHERE value > 0),
  d AS (SELECT a.key, b.value FROM a JOIN b ON a.key = b.key),
  e AS (SELECT a.key, c.value FROM a LEFT OUTER JOIN c ON a.key = c.key AND c.key IS NOT NULL)
INSERT OVERWRITE TABLE srcp PARTITION (p='abc')
SELECT * FROM d UNION ALL SELECT * FROM e;

RECURSIVE CTE

Syntax

WITH RECURSIVE <cte_name> [(col_name [, col_name] ...)] AS (
  <initial_part> UNION ALL <recursive_part>
)
SELECT ... FROM ...;

Parameters

Parameter

Required

Description

RECURSIVE

Yes

The recursive CTE clause must start with WITH RECURSIVE.

cte_name

Yes

The name of the CTE. Must be unique within the current WITH clause.

col_name

No

Output column names. If omitted, column names are inferred from initial_part.

initial_part

Yes

A SELECT statement that produces the seed dataset (iteration 0). Cannot reference cte_name.

recursive_part

Yes

A SELECT statement that references cte_name to compute the next iteration from the previous one.

UNION ALL

Yes

Connects initial_part and recursive_part. UNION (without ALL) is not supported.

Limitations

  • Recursive CTEs cannot appear in IN, EXISTS, or scalar subqueries.

  • The default maximum number of iterations is 10. Increase the limit by setting odps.sql.rcte.max.iterate.num (maximum value: 100).

  • Intermediate results are not saved between iterations. If the task fails, execution restarts from the beginning. For long-running recursive computations, either limit the number of iterations or store intermediate results in a temporary table.

  • Recursive CTEs are not supported in Query Acceleration (MaxQA/MCQA) mode.

    • In MCQA mode, if interactive_auto_rerun=true is set, the task falls back to normal mode. Otherwise, the task fails.

    • In MaxQA mode, auto-fallback is not supported. The job fails directly and must be manually submitted to a batch processing quota group for retry.

Examples

  • Example 1: Define a recursive CTE named cte_name.

    -- Method 1: Explicitly specify output column names
    WITH RECURSIVE cte_name(a, b) AS (
      SELECT 1L, 1L                                   -- initial_part: iteration 0
      UNION ALL
      SELECT a+1, b+1 FROM cte_name WHERE a+1 <= 5    -- recursive_part: references previous iteration
    )
    SELECT * FROM cte_name ORDER BY a LIMIT 100;
    
    -- Method 2: Infer column names from initial_part
    WITH RECURSIVE cte_name AS (
      SELECT 1L AS a, 1L AS b
      UNION ALL
      SELECT a+1, b+1 FROM cte_name WHERE a+1 <= 5
    )
    SELECT * FROM cte_name ORDER BY a LIMIT 100;
    Note
    • In recursive_part, set a termination condition to avoid an infinite loop. In this example, WHERE a + 1 <= 5 serves as the termination criterion. If the WHERE condition is not met, the dataset generated in the current iteration is empty, and the iteration stops.

    • If output column names are not explicitly specified, the system supports automatic inference. For example, in Method 2, the output column names of initial_part are used as the output column names of the recursive CTE.

    Result:

    +------------+------------+
    | a          | b          |
    +------------+------------+
    | 1          | 1          |
    | 2          | 2          |
    | 3          | 3          |
    | 4          | 4          |
    | 5          | 5          |
    +------------+------------+

  • Example 2: Recursive CTE in a subquery (compile error)

    Recursive CTEs are not allowed inside IN, EXISTS, or scalar subqueries. The following query fails to compile:

    WITH RECURSIVE cte_name(a, b) AS (
      SELECT 1L, 1L
      UNION ALL
      SELECT a+1, b+1 FROM cte_name WHERE a+1 <= 5)
    SELECT x, x IN (SELECT a FROM cte_name) FROM VALUES (1L), (2L) AS t(x);

    Error:

    FAILED: ODPS-0130071:[5,31] Semantic analysis exception - using Recursive-CTE cte_name in scalar/in/exists sub-query is not allowed, please check your query, the query text location is from [line 5, column 13] to [line 5, column 40]

  • Example 3: Traverse an organizational hierarchy

    Create an employees table and insert data:

    CREATE TABLE employees(name STRING, boss_name STRING);
    INSERT INTO TABLE employees VALUES
      ('zhang_3', null),
      ('li_4',    'zhang_3'),
      ('wang_5',  'zhang_3'),
      ('zhao_6',  'li_4'),
      ('qian_7',  'wang_5');

    Define a recursive CTE named company_hierarchy with three output columns: the employee's name, their manager's name, and their level in the hierarchy:

    WITH RECURSIVE company_hierarchy(name, boss_name, level) AS (
        SELECT name, boss_name, 0L FROM employees WHERE boss_name IS NULL
        UNION ALL
        SELECT e.name, e.boss_name, h.level + 1
          FROM employees e
          JOIN company_hierarchy h ON e.boss_name = h.name
    )
    SELECT * FROM company_hierarchy ORDER BY level, boss_name, name LIMIT 1000;

    Execution proceeds as follows:

    • Iteration 0 (initial_part): Selects employees where boss_name IS NULL, assigning level = 0. Result: ('zhang_3', NULL, 0).

    • Iteration 1 (recursive_part): Joins employees with the working table (iteration 0). The condition e.boss_name = h.name finds employees managed by zhang_3. Result: li_4 and wang_5 at level = 1.

    • Iteration 2: Finds employees managed by li_4 or wang_5. Result: zhao_6 and qian_7 at level = 2.

    • Iteration 3: No employees have zhao_6 or qian_7 as managers. The working table is empty, and iteration stops.

    Result:

    +---------+-----------+------------+
    | name    | boss_name | level      |
    +---------+-----------+------------+
    | zhang_3 | NULL      | 0          |
    | li_4    | zhang_3   | 1          |
    | wang_5  | zhang_3   | 1          |
    | zhao_6  | li_4      | 2          |
    | qian_7  | wang_5    | 2          |
    +---------+-----------+------------+

  • Example 4: Cyclic data causes an infinite loop

    Insert a record into the employees table from Example 3 where an employee is their own manager.

    INSERT INTO TABLE employees VALUES('qian_7', 'qian_7');

    This record declares that the manager of qian_7 is qian_7 themselves. Running the recursive CTE defined earlier results in an infinite loop. The system limits the maximum number of iterations. The query eventually fails.

    Error:

    FAILED: ODPS-0010000:System internal error - recursive-cte: company_hierarchy exceed max iterate number 10

Materialized CTE

For non-recursive CTEs, MaxCompute expands all CTEs inline when generating an execution plan. For example:

WITH v1 AS (SELECT SIN(1.0) AS a)
SELECT a FROM v1 UNION ALL SELECT a FROM v1;

Results:

+------------+
| a          |
+------------+
| 0.8414709848078965 |
| 0.8414709848078965 |
+------------+

This is equivalent to running SIN(1.0) twice:

SELECT a FROM (SELECT SIN(1.0) AS a)
UNION ALL
SELECT a FROM (SELECT SIN(1.0) AS a);

In complex scenarios with deeply nested CTEs, if all CTEs are expanded into the most basic leaf nodes, the result is a very large syntax tree. This may cause failures due to excessive syntax tree nodes during execution plan generation and may lead to memory limit issues. For example:

WITH
  v1 AS (SELECT 1L AS a, 2L AS b, 3L AS c),
  v2 AS (SELECT * FROM v1 UNION ALL SELECT * FROM v1 UNION ALL SELECT * FROM v1),
  v3 AS (SELECT * FROM v2 UNION ALL SELECT * FROM v2 UNION ALL SELECT * FROM v2),
  v4 AS (SELECT * FROM v3 UNION ALL SELECT * FROM v3 UNION ALL SELECT * FROM v3),
  v5 AS (SELECT * FROM v4 UNION ALL SELECT * FROM v4 UNION ALL SELECT * FROM v4),
  v6 AS (SELECT * FROM v5 UNION ALL SELECT * FROM v5 UNION ALL SELECT * FROM v5),
  v7 AS (SELECT * FROM v6 UNION ALL SELECT * FROM v6 UNION ALL SELECT * FROM v6),
  v8 AS (SELECT * FROM v7 UNION ALL SELECT * FROM v7 UNION ALL SELECT * FROM v7),
  v9 AS (SELECT * FROM v8 UNION ALL SELECT * FROM v8 UNION ALL SELECT * FROM v8)
SELECT * FROM v9;

To address this issue, MaxCompute provides the materialized CTE capability, which caches CTE computation results for reference by SQL outside the WITH clause without full expansion. This mechanism effectively avoids memory limit issues caused by nested CTE expansion and improves CTE statement performance.

Usage example

Add the /*+ MATERIALIZE */ hint to the top-level SELECT of a non-recursive CTE to cache its result in a temporary table. Subsequent references read from the cache instead of re-running the query:

WITH v1 AS (SELECT /*+ MATERIALIZE */ SIN(1.0) AS a)
SELECT a FROM v1 UNION ALL SELECT a FROM v1;

-- Result:
+------------+
| a          |
+------------+
| 0.8414709848078965 |
| 0.8414709848078965 |
+------------+

When the hint is effective, the Job Details tab in LogView shows multiple Fuxi Jobs, confirming that the intermediate result was stored.

Limitations

  • The MATERIALIZE hint must be applied to the top-level SELECT statement of a non-recursive CTE. Recursive CTEs do not need the MATERIALIZE hint.

    • Incorrect example: In the following CTE, the top-level statement is UNION rather than SELECT, so the MATERIALIZE hint does not take effect.

      WITH v1 AS (
        SELECT /*+ MATERIALIZE */ SIN(1.0) AS a
        UNION ALL
        SELECT /*+ MATERIALIZE */ SIN(1.0) AS a)
      SELECT a FROM v1 UNION ALL SELECT a FROM v1;
    • Correct example: Rewrite the incorrect example above by wrapping it in a subquery.

      WITH 
        v1 AS (SELECT /*+ MATERIALIZE */ * FROM 
                 (SELECT SIN(1.0) AS a
                  UNION ALL 
                  SELECT SIN(1.0) AS a)
              ) 
      SELECT a FROM v1 UNION ALL SELECT a FROM v1;
  • If the CTE uses a non-deterministic function such as RAND or a non-deterministic Java/Python UDF, materializing the CTE caches a single evaluation of the function. Subsequent references return the cached value, which changes the semantic behavior of queries that expect independent random values per call.

  • Materialized CTEs are not supported in Query Acceleration (MCQA). If the task runs in MCQA mode with interactive_auto_rerun=true, it falls back to normal mode. Otherwise, the task fails.