Data reordering sorts table rows by specific columns to improve the compression ratio and reduce storage costs. When storage budgets are tight but compute resources are available, reorder large tables to reclaim storage space.
How it works
MaxCompute stores data in columnar format with built-in compression. When rows with similar column values are adjacent, the compression algorithm finds longer runs of repeated or near-identical values, producing smaller compressed blocks and a lower overall storage footprint.
Data reordering reads an existing table, sorts the rows by one or more columns, and writes the sorted result to a new table. The reordered table contains the same data but occupies less storage.
Before you reorder
Reduce storage first. Apply other data governance measures before reordering — shorten data lifecycle settings and delete unnecessary data. Reordering is most effective after the table footprint is already minimized.
Check column value distribution. Choose columns with high data volume for better compression results.
Evaluate cost against savings. Reordering consumes compute resources. On pay-as-you-go instances, those resources incur charges. Confirm that the storage savings justify the compute cost.
Target large tables. Reordering is most effective on tables with large amounts of data. Small tables typically show minimal storage improvement.
Sorting methods
MaxCompute supports three SQL approaches for sorting data. For large tables, use DISTRIBUTE BY ... SORT BY or CLUSTER BY instead of ORDER BY.
|
Method |
Scope |
Parallelism |
Best for |
Avoid when |
|
|
Local sort within each reducer |
Multiple reducers |
Large datasets requiring flexible sort order |
You need ascending-only and want simpler syntax |
|
|
Local sort in ascending order within each reducer |
Multiple reducers |
Large datasets with simpler syntax |
You need descending sort order or real-time inserts |
|
|
Global sort across all data |
Single reducer |
Small datasets requiring exact global order |
Tables are large — the single-reducer bottleneck makes this impractical |
DISTRIBUTE BY + SORT BY
DISTRIBUTE BY controls how mapper output is distributed to reducers — records with the same field value go to the same reducer. SORT BY then sorts the data within each reducer before output. This combination enables distributed processing across multiple reducers while achieving effective data ordering for compression.
For more information, see DISTRIBUTE BY and SORT BY.
CLUSTER BY
CLUSTER BY sorts a specific field in ascending order. It has a similar effect to using DISTRIBUTE BY together with SORT BY on the same column, but with simpler syntax.
After using CLUSTER BY to reorder data, real-time inserts may fail in certain scenarios.
ORDER BY
ORDER BY sorts all data through a single reducer, which guarantees global order but prevents distributed processing. The resulting compute bottleneck makes ORDER BY impractical for large tables.
For more information, see ORDER BY.
Example: reorder the TPC-DS customer table
The following example reorders the customer table from the MaxCompute TPC-DS public dataset (1 TB) using two methods — DISTRIBUTE BY ... SORT BY and CLUSTER BY — then compares the resulting storage sizes.
For more information about the TPC-DS public dataset, see TPC-DS data.
Step 1: Create destination tables
Create two tables — storage_test1 and storage_test2 — with the same schema as the customer table.
create table if not exists <storage_test1|storage_test2>(
c_customer_sk bigint,
c_customer_id char(16),
c_current_cdemo_sk bigint,
c_current_hdemo_sk bigint,
c_current_addr_sk bigint,
c_first_shipto_date_sk bigint,
c_first_sales_date_sk bigint,
c_salutation char(10),
c_first_name char(20),
c_last_name char(30),
c_preferred_cust_flag char(1),
c_birth_day bigint,
c_birth_month bigint,
c_birth_year bigint,
c_birth_country varchar(20),
c_login char(13),
c_email_address char(50),
c_last_review_date_sk char(10)
);
Step 2: Reorder data and write to destination tables
Check the value distribution in the table before reordering. Choose columns with high data volume. In this example, the c_email_address column contains a large amount of data.
Method 1: DISTRIBUTE BY + SORT BY
SET odps.namespace.schema = true;
insert overwrite table storage_test1
select * from bigdata_public_dataset.tpcds_1t.customer
distribute by c_email_address
sort by c_email_address;
Method 2: CLUSTER BY
SET odps.namespace.schema = true;
insert overwrite table storage_test2
select * from bigdata_public_dataset.tpcds_1t.customer
cluster by c_email_address;
Step 3: Compare storage sizes
Run DESC commands to view the size of each table. For more information about DESC, see DESC TABLE/VIEW.
SET odps.namespace.schema = true;
desc bigdata_public_dataset.tpcds_1t.customer;
desc storage_test1;
desc storage_test2;
Results
|
Table |
Size (bytes) |
Method |
Storage reduction |
|
|
469,939,587 |
N/A |
-- |
|
|
449,472,614 |
|
~4.355% |
|
|
449,565,021 |
|
~4.336% |
Storage reduction formulas:
DISTRIBUTE BY+SORT BY: (469,939,587 - 449,472,614) / 469,939,587 x 100% = ~4.355%CLUSTER BY: (469,939,587 - 449,565,021) / 469,939,587 x 100% = ~4.336%
Both methods produce similar storage savings. DISTRIBUTE BY + SORT BY achieved a slightly higher reduction in this test.