Data reordering

更新时间:
复制 MD 格式

Data reordering sorts table rows by specific columns to improve the compression ratio and reduce storage costs. When storage budgets are tight but compute resources are available, reorder large tables to reclaim storage space.

How it works

MaxCompute stores data in columnar format with built-in compression. When rows with similar column values are adjacent, the compression algorithm finds longer runs of repeated or near-identical values, producing smaller compressed blocks and a lower overall storage footprint.

Data reordering reads an existing table, sorts the rows by one or more columns, and writes the sorted result to a new table. The reordered table contains the same data but occupies less storage.

Before you reorder

  • Reduce storage first. Apply other data governance measures before reordering — shorten data lifecycle settings and delete unnecessary data. Reordering is most effective after the table footprint is already minimized.

  • Check column value distribution. Choose columns with high data volume for better compression results.

  • Evaluate cost against savings. Reordering consumes compute resources. On pay-as-you-go instances, those resources incur charges. Confirm that the storage savings justify the compute cost.

  • Target large tables. Reordering is most effective on tables with large amounts of data. Small tables typically show minimal storage improvement.

Sorting methods

MaxCompute supports three SQL approaches for sorting data. For large tables, use DISTRIBUTE BY ... SORT BY or CLUSTER BY instead of ORDER BY.

Method

Scope

Parallelism

Best for

Avoid when

DISTRIBUTE BY + SORT BY

Local sort within each reducer

Multiple reducers

Large datasets requiring flexible sort order

You need ascending-only and want simpler syntax

CLUSTER BY

Local sort in ascending order within each reducer

Multiple reducers

Large datasets with simpler syntax

You need descending sort order or real-time inserts

ORDER BY

Global sort across all data

Single reducer

Small datasets requiring exact global order

Tables are large — the single-reducer bottleneck makes this impractical

DISTRIBUTE BY + SORT BY

DISTRIBUTE BY controls how mapper output is distributed to reducers — records with the same field value go to the same reducer. SORT BY then sorts the data within each reducer before output. This combination enables distributed processing across multiple reducers while achieving effective data ordering for compression.

For more information, see DISTRIBUTE BY and SORT BY.

CLUSTER BY

CLUSTER BY sorts a specific field in ascending order. It has a similar effect to using DISTRIBUTE BY together with SORT BY on the same column, but with simpler syntax.

Note

After using CLUSTER BY to reorder data, real-time inserts may fail in certain scenarios.

ORDER BY

ORDER BY sorts all data through a single reducer, which guarantees global order but prevents distributed processing. The resulting compute bottleneck makes ORDER BY impractical for large tables.

For more information, see ORDER BY.

Example: reorder the TPC-DS customer table

The following example reorders the customer table from the MaxCompute TPC-DS public dataset (1 TB) using two methods — DISTRIBUTE BY ... SORT BY and CLUSTER BY — then compares the resulting storage sizes.

For more information about the TPC-DS public dataset, see TPC-DS data.

Step 1: Create destination tables

Create two tables — storage_test1 and storage_test2 — with the same schema as the customer table.

create table if not exists <storage_test1|storage_test2>(
c_customer_sk bigint,
c_customer_id char(16),
c_current_cdemo_sk bigint,
c_current_hdemo_sk bigint,
c_current_addr_sk bigint,
c_first_shipto_date_sk bigint,
c_first_sales_date_sk bigint,
c_salutation char(10),
c_first_name char(20),
c_last_name char(30),
c_preferred_cust_flag char(1),
c_birth_day bigint,
c_birth_month bigint,
c_birth_year bigint,
c_birth_country varchar(20),
c_login char(13),
c_email_address char(50),
c_last_review_date_sk char(10)
);

Step 2: Reorder data and write to destination tables

Check the value distribution in the table before reordering. Choose columns with high data volume. In this example, the c_email_address column contains a large amount of data.

Method 1: DISTRIBUTE BY + SORT BY

SET odps.namespace.schema = true;
insert overwrite table storage_test1
select * from bigdata_public_dataset.tpcds_1t.customer
distribute by c_email_address
sort by c_email_address;

Method 2: CLUSTER BY

SET odps.namespace.schema = true;
insert overwrite table storage_test2
select * from bigdata_public_dataset.tpcds_1t.customer
cluster by c_email_address;

Step 3: Compare storage sizes

Run DESC commands to view the size of each table. For more information about DESC, see DESC TABLE/VIEW.

SET odps.namespace.schema = true;
desc bigdata_public_dataset.tpcds_1t.customer;
desc storage_test1;
desc storage_test2;

Results

Table

Size (bytes)

Method

Storage reduction

bigdata_public_dataset.tpcds_1t.customer (original)

469,939,587

N/A

--

storage_test1

449,472,614

DISTRIBUTE BY + SORT BY

~4.355%

storage_test2

449,565,021

CLUSTER BY

~4.336%

Storage reduction formulas:

  • DISTRIBUTE BY + SORT BY: (469,939,587 - 449,472,614) / 469,939,587 x 100% = ~4.355%

  • CLUSTER BY: (469,939,587 - 449,565,021) / 469,939,587 x 100% = ~4.336%

Both methods produce similar storage savings. DISTRIBUTE BY + SORT BY achieved a slightly higher reduction in this test.

References