Handle data bloat - MaxCompute

This topic describes the causes of data bloat (also called data expansion) and the measures that you can take to handle it.

Problem description

In Logview, you may notice that the output data volume of a Fuxi task is significantly larger than its input data volume. You can check the input and output volumes by using the I/O Record and I/O Bytes attributes of the Fuxi task.

For example, a 1 GB input can expand to 1 TB after processing. If a single instance then has to process that 1 TB of data, performance drops significantly.

Causes and measures

The following table describes the possible causes of this issue and related measures that can be taken.

Cause	Description	Measure
Bug in code	The code is defective. Examples: the `JOIN` condition is incorrect, so the join is effectively written as a Cartesian product; or a user-defined table-valued function (UDTF) is faulty. As a result, the amount of output data is much greater than the amount of input data.	Fix the bugs in the code.
Improper aggregation operations	Most aggregation operations produce small amounts of intermediate data and have low computational complexity, so they complete quickly even on large datasets and rarely cause issues. However, certain aggregation operations, such as `collect_list` and `median`, must retain all intermediate data. When these are combined with other aggregation techniques, data expansion can occur. Examples: (1) A `select` statement that applies aggregation with DISTINCT on several different dimensions; each DISTINCT operation expands the data. (2) `grouping sets`, `cube`, or `rollup`, which can multiply the size of intermediate data.	Avoid aggregation operations that cause data expansion.
Improper `JOIN` operations	For example, the left table of a `JOIN` operation contains a large amount of population data, and the right table is a dimension table that holds hundreds of rows for each gender. If you join the two tables on gender, each left-table row matches hundreds of right-table rows, so the data can expand to hundreds of times its original size.	To prevent data expansion, aggregate the right table before joining it with the left table.
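
The `JOIN` expansion described in the table can be reproduced on a small scale. The following is a minimal sketch, not MaxCompute code: it uses Python's built-in `sqlite3` module as a stand-in SQL engine, and the table names `population` and `gender_dim`, the column `score`, and the function `join_row_counts` are hypothetical names chosen for illustration. It counts the rows produced by a direct join on gender and by a join against a pre-aggregated right table.

```python
import sqlite3


def join_row_counts(people_per_gender=1000, dim_rows_per_gender=100):
    """Return (left_rows, naive_join_rows, preaggregated_join_rows)."""
    genders = ("F", "M")
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Hypothetical tables standing in for the population (left) table
    # and the dimension (right) table described in the text.
    cur.execute("CREATE TABLE population (person_id INTEGER, gender TEXT)")
    cur.execute("CREATE TABLE gender_dim (gender TEXT, score INTEGER)")
    cur.executemany(
        "INSERT INTO population VALUES (?, ?)",
        [(i, genders[i % 2]) for i in range(2 * people_per_gender)],
    )
    cur.executemany(
        "INSERT INTO gender_dim VALUES (?, ?)",
        [(g, k) for g in genders for k in range(dim_rows_per_gender)],
    )

    left_rows = cur.execute("SELECT COUNT(*) FROM population").fetchone()[0]

    # Direct join: every person matches every dimension row of the same gender.
    naive_rows = cur.execute(
        "SELECT COUNT(*) FROM population p "
        "JOIN gender_dim d ON p.gender = d.gender"
    ).fetchone()[0]

    # Pre-aggregated join: the right table is reduced to one row per gender first.
    preagg_rows = cur.execute(
        "SELECT COUNT(*) FROM population p "
        "JOIN (SELECT gender, SUM(score) AS total_score "
        "      FROM gender_dim GROUP BY gender) d "
        "ON p.gender = d.gender"
    ).fetchone()[0]

    conn.close()
    return left_rows, naive_rows, preagg_rows


if __name__ == "__main__":
    print(join_row_counts())
```

With the default arguments, the left table has 2,000 rows, the direct join produces 200,000 rows, and the pre-aggregated join produces 2,000 rows. Pre-aggregation only applies when the query needs a per-gender summary of the right table rather than each individual dimension row.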