This topic describes the causes of data expansion and the measures that you can take to handle it.
Problem description
In Logview, you may notice that the output data volume of a Fuxi task is significantly larger than its input data volume. You can check the input and output volumes by using the I/O Record and I/O Bytes attributes of the Fuxi task.
For example, a 1 GB input can expand to 1 TB after processing. Processing 1 TB of data on a single instance significantly reduces performance.
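The arithmetic behind the example above can be sketched in a few lines of Python. This is only an illustration: the function name `expansion_factor` is hypothetical, and the byte counts are assumed to be copied by hand from the I/O Bytes attribute shown in Logview rather than read through any API.

```python
def expansion_factor(input_bytes: int, output_bytes: int) -> float:
    """Return how many times larger the output is than the input."""
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    return output_bytes / input_bytes


# Binary units, used here only to express the 1 GB -> 1 TB example.
GIB = 1024 ** 3
TIB = 1024 ** 4

# Hypothetical values, as if read from the I/O Bytes attribute of a Fuxi task.
factor = expansion_factor(1 * GIB, 1 * TIB)
print(f"expansion factor: {factor:.0f}x")
```

A factor far above 1 for a single task suggests that the task is worth checking against the causes listed below.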
Causes and measures
The following table describes the possible causes of this issue and related measures that can be taken.
| Cause | Description | Measure |
| --- | --- | --- |
| Bug in code | The code is defective. | Fix bugs in the code. |
| Improper aggregation operations | Typically, aggregation operations produce small amounts of intermediate data and have low computational complexity, so they complete quickly even on large datasets and rarely cause issues. However, certain aggregation operations, such as collect_list and median, must retain all intermediate data. When combined with other aggregation techniques, this can lead to data expansion. | Do not perform aggregation operations that cause data expansion. |
| Improper JOIN operations | For example, the left table of a JOIN operation contains a large amount of population data, and the right table is a dimension table that records hundreds of rows for each gender. If you join the two tables on gender, each left-table row matches hundreds of right-table rows, so the data can expand to hundreds of times its original size. | To prevent data expansion, aggregate the right table before joining it with the left table. |
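The JOIN case can be reproduced on a small scale. The following sketch uses Python's built-in `sqlite3` module as a stand-in for the actual SQL engine. The table names (`population`, `gender_dim`), the column names, and the row counts are all hypothetical and chosen only to mirror the scenario described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE population (person_id INTEGER, gender TEXT)")
cur.execute("CREATE TABLE gender_dim (gender TEXT, score INTEGER)")

# Left table: 1,000 people, alternating between two gender values.
cur.executemany(
    "INSERT INTO population VALUES (?, ?)",
    [(i, "F" if i % 2 == 0 else "M") for i in range(1000)],
)
# Right table: 200 dimension rows for each gender value.
cur.executemany(
    "INSERT INTO gender_dim VALUES (?, ?)",
    [(g, n) for g in ("F", "M") for n in range(200)],
)

# Naive join: every person matches all 200 rows for their gender.
naive_rows = cur.execute(
    "SELECT COUNT(*) FROM population p JOIN gender_dim d ON p.gender = d.gender"
).fetchone()[0]

# Aggregate the right table to one row per gender before joining.
preagg_rows = cur.execute(
    """
    SELECT COUNT(*)
    FROM population p
    JOIN (
        SELECT gender, SUM(score) AS total_score
        FROM gender_dim
        GROUP BY gender
    ) d ON p.gender = d.gender
    """
).fetchone()[0]

print(naive_rows, preagg_rows)  # 200000 1000
conn.close()
```

In this sketch the naive join produces 200 times as many rows as the left table, while the pre-aggregated join keeps the row count unchanged. Whether `SUM` or another aggregate is appropriate depends on what the query actually needs from the right table.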