`split_size` hint

更新时间:
复制 MD 格式

When a MaxCompute SQL job runs slower than expected, the cause is often a mismatch between task count and available resources. The split_size hint controls how MaxCompute divides table data into chunks for parallel processing. Each chunk becomes one map task, so the split size directly determines job concurrency. Adjust this value to balance task count with available resources: smaller splits create more map tasks for higher parallelism, while larger splits reduce task count to ease scheduling pressure.

SELECT ...
FROM <table_name> /*+split_size(<value_in_mb>)*/
...

The default split size is 256 MB.

How it works

MaxCompute reads a table's data and divides it into splits of the specified size. The number of map tasks is approximately:

map tasks = table data size / split_size

For example, a 10 GB table with split_size(256) creates about 40 map tasks. Dropping to split_size(64) creates about 160 map tasks — four times the parallelism.

MaxCompute honors the split_size hint for most tables. One exception: if the query optimizer uses a clustered table's bucketing properties for optimization, it ignores the hint for that table.

Tune the split size

A mismatch between task count and available resources is one of the most common causes of slow SQL jobs:

  • Too many tasks for available resources -- Tasks queue for execution slots, and scheduling overhead increases. The job spends more time waiting than computing.

  • Too few tasks for available resources -- Resources sit idle while a small number of tasks process large data chunks sequentially. The job takes longer than necessary.

Scenario

Action

Why

Tasks are queuing for resources

Increase split_size (e.g., 512 or 1024)

Fewer map tasks reduce contention and scheduling overhead

Resources are idle while the job runs

Decrease split_size (e.g., 64 or 128)

More map tasks run in parallel, improving resource utilization

Set split_size in multiples of 256 MB (e.g., 256, 512, 1024) for optimal performance.

Examples

All examples place the hint immediately after the table name in the FROM clause.

Set the split size to 1 MB when reading src. The src table data is divided into 1 MB chunks, each processed by a separate map task. For a 10 GB table, this creates about 10,240 map tasks instead of the default 40:

SELECT a.key
FROM src a /*+split_size(1)*/
JOIN src2 b ON a.key = b.key;

Increase the split size to 512 MB to reduce task count on a large table. For a 10 GB table, this cuts the task count from 40 to about 20:

SELECT *
FROM large_table /*+split_size(512)*/
WHERE dt = '2024-01-01';

Behavior with clustered tables and multiple hints

  • Clustered tables: The query optimizer ignores the split_size hint for a clustered table if it uses the table's bucketing properties for optimization.

  • Multiple hints on the same table: If a query reads the same table multiple times with different split_size hints, MaxCompute uses the smallest value. For example:

    • split_size(1) and split_size(10) on the same table -- MaxCompute uses 1.

    • split_size(1) on one reference and no hint on another -- MaxCompute uses 1 (not the 256 MB default).

Reference

Property

Value

Description

Hint name

split_size

The name used in the hint comment

Syntax

/*+split_size(<value_in_mb>)*/

Inline comment syntax placed after the table name

Placement

After the table name in a query

Applies to the specific table it follows

Unit

MB

The split size value is interpreted in megabytes

Default

256

Used when no hint is specified

Recommended values

Multiples of 256 (e.g., 256, 512, 1024)

Aligns with internal block boundaries for optimal performance

Conflict resolution

Smallest value across all hints on the same table

Applies when the same table appears multiple times with different hints