How to tune data import for different scenarios-AnalyticDB(AnalyticDB)-阿里云帮助中心

AnalyticDB for MySQL provides various data import methods for different scenarios. However, data import performance can be affected by factors such as improper table modeling that leads to data skew, or inefficient resource use caused by suboptimal import configurations. This topic describes how to tune data import performance in different scenarios.

Optimize external table import

Check the distribution key

The distribution key determines how data is distributed among shards for concurrent import. If data is not evenly distributed, shards that receive more data can become bottlenecks, causing data skew and slowing the entire import job. To avoid this, ensure your data is distributed evenly during import. For more information about how to select a distribution key, see Choose a distribution key.

Check if the distribution key is appropriate:

Before you import data, evaluate your chosen distribution key based on its business logic. For example, in the Lineitem table, if you choose the l_discount column as the distribution key, its low cardinality (only 11 distinct values) will cause severe data skew. All data with the same discount value will be sent to the same shard, creating a bottleneck and degrading performance. A better choice is the l_orderkey column, as order IDs are unique and distribute data more evenly.
After you import data, if the Data Modeling Diagnostics report indicates distribution field skew, your distribution key is causing uneven data distribution. For more information about how to view these diagnostics, see Storage diagnostics.

Check the partition key

When you use INSERT OVERWRITE SELECT, the imported partition overwrites any existing partition with the same name in the target table. Within each shard, data is then imported into its respective partition based on the partition key. To avoid triggering an external sorting process that degrades performance, do not import too many partitions at once. For more information about how to select a partition key, see Choose a partition key.

Check if the partition key is appropriate:

Before you import data, evaluate your partition key based on business needs and data distribution. For example, if you partition the Lineitem table by the l_shipdate column over a 7-year period, partitioning by day would create over 2,000 partitions, which is inefficient. Partitioning by month or year is a more suitable approach.
After you import data, if the Data Modeling Diagnostics report shows an unreasonable number of partitions, your partition key is not suitable. For more information about how to view these diagnostics, see Partitioned table diagnostics.

Check the index

By default, AnalyticDB for MySQL indexes all columns when a table is created. Building a full-column index on a wide table is resource-intensive. When you import data into a wide table, use a primary key index. A primary key index is used for deduplication, but using too many columns in the primary key can slow down this process. For more information about how to select a primary key, see Choose a primary key.

Check if the index is appropriate:

In offline import scenarios, a primary key index is typically unnecessary because an offline computing job has already deduplicated the data.
On the Monitoring Information > Table Information Statistics tab, view the size of table data, index data, and primary key index data. If the index data size exceeds the table data size, check for columns with long string values. Indexing these columns is time-consuming and consumes significant storage. You can delete these indexes. For more information, see ALTER TABLE.

Note
A primary key index cannot be deleted. You must rebuild the table.

Add a hint to accelerate the import

You can add the direct_batch_load=true hint to your import statement to accelerate the import.

Note

This hint is supported only for elastic Data Warehouse Edition clusters of V3.1.5 or later. If you do not see a significant performance improvement, Submit a ticket.

Example:

SUBMIT JOB /*+ direct_batch_load=true*/INSERT OVERWRITE adb_table
SELECT * FROM adb_external_table;

Accelerate imports with elastic import

Note

The elastic import feature is available only for clusters with a kernel version of 3.1.10.0 or later.
This feature is supported on Enterprise Edition, Basic Edition, and Data Lakehouse Edition clusters that have a job resource group.
Elastic import supports only data from MaxCompute and CSV, Parquet, or ORC data from Object Storage Service (OSS).
When using elastic import, ensure the job resource group has sufficient resources to prevent long queue times, slow execution, and job failures.

The elastic import feature allows you to run multiple import jobs concurrently. You can also accelerate a single import job by allocating more resources to it. For more information, see Data import methods.

Example:

/*+ elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
submit job insert overwrite adb_table select * from adb_external_table;

For information about the parameters, see Hint parameters.

Optimize DataWorks data import

Optimize job configurations

Optimize Data Records Per Write

This parameter specifies the batch size for each import. In most cases, we recommend keeping the default value of 2048.

However, if a single record is large, such as 512 KB, consider setting this parameter to 16. This keeps the batch size under 8 MB and prevents high memory usage on frontend nodes.
Optimize channel control
- Data synchronization performance is directly proportional to the value of the Expected Maximum Concurrency parameter. We recommend that you increase the Expected Maximum Concurrency as much as possible.
  
  Important
  A higher value for Expected Maximum Concurrency consumes more DataWorks resources. Choose a value that fits your needs.
- For better synchronization performance, enable Distributed Execution.

Common issues and solutions

If the client sends data too slowly, the cluster's CPU utilization, disk I/O, and write response time will also be low. Although the database server can process the data sent from the client, the low inbound volume results in a write TPS that is lower than expected.

Solution: Increase the values of Data Records Per Write and Expected Maximum Concurrency. The data import performance improves linearly as the import pressure increases.
If data skew exists in the target table, some cluster nodes become overloaded, degrading import performance. In this case, the CPU and disk I/O utilization are low, but the write response time is high. You can identify skewed tables on the Diagnostics and Optimization > Data Modeling Diagnostics page.

Solution: Redesign the table schema and then re-import the data. For more information, see table schema design.

Optimize JDBC data import

Client-side optimization

Batching data in your application
- When importing data using a JDBC program, use batch imports to reduce network and connection overhead. Avoid single-row imports unless you have specific requirements.
- A batch size of 2048 records is recommended. If a single record is large, keep the total batch size under 8 MB. You can calculate the number of records per batch by dividing 8 MB by the size of a single record. An oversized batch can consume too much memory on the frontend node and degrade import performance.
Configure application concurrency
- When importing data from an application, use multiple concurrent threads. A single thread cannot fully utilize client resources and often cannot keep up with the database's import speed due to client-side data processing and batching. Concurrent imports can significantly increase speed.
- The optimal level of concurrency depends on factors like batching, data source, and client machine load. There is no single best value. Test to find the optimal concurrency for your workload. If the import speed is below expectations, try doubling the concurrency. If the speed then decreases, gradually reduce it to find the optimal setting.

Common issues and solutions

If you experience poor performance when importing data into AnalyticDB for MySQL from a custom program, first check for client-side performance bottlenecks.

Ensure your data source can provide data at a sufficient rate. If the data comes from other systems or files, check the client for any output bottlenecks.
Ensure the data processing speed, verify that data production and consumption are synchronized, and ensure that sufficient data is waiting to be imported into AnalyticDB for MySQL.
Check the client machine load to ensure sufficient system resources, such as CPU and disk I/O.