What is the relationship between concurrency and throttling in batch synchronization (DataWorks)

This topic explains how to configure Channel Control parameters for batch synchronization tasks to minimize configuration errors and support requests. It describes the relationship between concurrency and throttling in batch synchronization.

Key concepts and best practices

Concurrency

This section answers the following questions:

Question 1: How do I configure concurrency for a data synchronization task?
Question 2: Why is my data synchronization task running slowly with insufficient actual concurrency?
Question 3: Why is my synchronization task still slow even with high concurrency configured, and why does my exclusive resource group frequently wait for resources?

Concurrency is the maximum number of threads that read data from the source and write data to the destination in parallel within a data synchronization task. Increasing the concurrency can shorten the time required for data migration. To set it, go to the Channel Control section of the task configuration page and select a value from the Desired Maximum Task Parallelism drop-down list. The value that you select is the configured concurrency of the task.

The configured value is the desired maximum. Data Integration attempts to run the task at the configured concurrency, but the actual concurrency during execution can be lower because of factors such as the limits of the Data Integration resource group or the characteristics of the task itself. In billing scenarios (such as when you use a Data Integration debugging resource group), you are charged based on the actual concurrency of the task. Common scenarios where the actual execution concurrency is less than the configured concurrency include:

When reading from a relational database such as MySQL, PolarDB, SQL Server, PostgreSQL, or Oracle, the task cannot split and read table data in parallel without a valid, configured split key (splitPk). The split key must be of an integer type. For Oracle, time-type columns are also supported in addition to integer types.
For PolarDB-X (DRDS), data is split into slices based on the physical topology of the logical tables. If the number of physical table shards is less than the configured concurrency, the actual concurrency is limited to the number of shards.
For file-based sources (OSS, FTP, HDFS, and S3), data is read concurrently on a per-file basis. If the number of files to read is less than the configured concurrency, the actual concurrency is limited to the number of files.
If the data distribution at the source is highly uneven, some data slices may take longer to process. In the later stages of the task run, after other slices have completed, the actual concurrency will decrease below the configured concurrency.
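
The scenarios above can be summarized as a rough upper bound on the actual concurrency. The following is a minimal sketch, not Data Integration's scheduling logic: the function name, the `source_type` labels, and the parameters are hypothetical and exist only to illustrate the rules described in this section. It does not model resource group limits or data skew.

```python
def estimate_actual_concurrency(configured, source_type, split_units=None, has_split_key=True):
    """Estimate an upper bound on the actual concurrency of one task.

    configured    -- the configured (desired maximum) concurrency
    source_type   -- hypothetical label: "relational", "sharded", or "file"
    split_units   -- number of files (file sources) or physical shards (PolarDB-X)
    has_split_key -- whether a valid splitPk is configured (relational sources)
    """
    if configured < 1:
        raise ValueError("configured concurrency must be at least 1")

    if source_type == "relational":
        # Without a valid splitPk the table cannot be split, so it is read by one thread.
        return configured if has_split_key else 1

    if source_type in ("sharded", "file"):
        # PolarDB-X splits per physical table shard; file sources split per file.
        if split_units is None or split_units < 1:
            raise ValueError("split_units must be a positive integer")
        return min(configured, split_units)

    raise ValueError(f"unknown source_type: {source_type!r}")


# A task configured for 8 threads that reads only 3 OSS files runs at most 3 in parallel.
print(estimate_actual_concurrency(8, "file", split_units=3))
```

If the estimate is well below the configured value, lowering the configured concurrency to match avoids requesting resources that the task cannot use.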

Best practices for configuring concurrency:

The higher the concurrency, the more resources the task needs to acquire. Data Integration allocates resources on a first-in, first-out (FIFO) basis, meaning tasks submitted earlier are allocated resources first. Configure a reasonable concurrency to avoid long-running, high-concurrency tasks that block subsequent tasks from acquiring resources.
For tables with small data volumes, configure a low concurrency. This requires fewer resources, allowing the task to quickly acquire fragmented resources and start running. Because the data volume is small, the execution time can be kept within a reasonable range.
For synchronization tasks on the same data source, stagger the execution times. This helps balance the resource utilization of the resource group and reduce the concurrent access pressure on the data source.

Synchronization speed

This section answers the following questions:

Question 1: How do I configure the data synchronization speed? How do I understand throttling and non-throttling?
Question 2: Why does throttling sometimes not take effect for data synchronization?
Question 3: Why is the actual data synchronization speed sometimes significantly lower than the throttling threshold?

The data synchronization speed and the desired maximum task concurrency are closely related parameters. Together, they protect the source and destination from excessive read/write pressure, so that data synchronization tasks do not place significant load on the data source or affect its stability.

Synchronization speed (non-throttling) means the task runs at the configured desired maximum task concurrency (assume the actual running concurrency is ActualConcurrent), and each concurrent slice runs without speed limits (assume the actual speed of each slice is Speed). The overall actual speed of the task is ActualConcurrent × Speed. In non-throttling mode, Data Integration provides the maximum transfer performance possible under the current task configuration (concurrency and memory) and hardware environment (data source specifications and network). To use this mode, select Non-throttling for Synchronization Speed in the Channel Control section.

Synchronization speed (throttling) means the task runs with an overall speed limit and the configured maximum concurrency. When Data Integration creates the execution plan, the speed of each concurrent slice is calculated as (job speed / job concurrency, rounded up). The minimum speed per slice is 1 MB/s. Therefore, the upper limit of the actual task speed is the actual concurrency × the actual speed limit per slice. To use this mode, select Throttling and enter a speed value (in MB/s) in the field on the right. The following common examples illustrate throttling scenarios:

If you configure a concurrency of 5 and a speed limit of 5 MB/s, the task attempts to split into 5 slices for concurrent execution, with each slice limited to 1 MB/s.
- If the actual concurrency is 5, the maximum overall speed is 5 MB/s, which is less than or equal to the task speed limit.
- The actual execution concurrency depends on the specific characteristics of the data source. The actual concurrency may be less than the configured 5 (see the section on desired maximum task concurrency). If the actual running concurrency is 1, the maximum overall speed is 1 MB/s, which is less than or equal to the task speed limit.
If you configure a concurrency of 5 and a speed limit of 3 MB/s, the task attempts to split into 5 slices for concurrent execution, with each slice speed calculated as 3 / 5, rounded up to 1 MB/s.
- If the actual execution concurrency is 5, the maximum overall speed is 5 MB/s, which exceeds the task speed limit.
- If the actual execution concurrency is 1, the maximum overall speed is 1 MB/s, which is less than or equal to the task speed limit.
If you configure a concurrency of 5 and a speed limit of 10 MB/s, the task attempts to split into 5 slices for concurrent execution, with each slice speed calculated as 10 / 5 = 2 MB/s.
- If the actual execution concurrency is 5, the maximum overall speed is 10 MB/s, which is less than or equal to the task speed limit.
- If the actual execution concurrency is 1, the maximum overall speed is 2 MB/s, which is less than or equal to the task speed limit.
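
The arithmetic behind these examples can be reproduced with a short calculation. This is a minimal sketch of the rule stated above (job speed divided by configured concurrency, rounded up, with a minimum of 1 MB/s per slice); the function names are illustrative and are not part of any Data Integration API.

```python
import math

MIN_SLICE_SPEED_MBPS = 1  # minimum per-slice speed limit stated in this section


def per_slice_limit(job_speed_mbps, configured_concurrency):
    """Per-slice speed limit: job speed / configured concurrency, rounded up, at least 1 MB/s."""
    return max(MIN_SLICE_SPEED_MBPS, math.ceil(job_speed_mbps / configured_concurrency))


def max_overall_speed(job_speed_mbps, configured_concurrency, actual_concurrency):
    """Upper limit of the task speed: actual concurrency x per-slice limit."""
    return actual_concurrency * per_slice_limit(job_speed_mbps, configured_concurrency)


# Reproduce the three scenarios: (speed limit in MB/s, configured concurrency).
for job_speed, configured in [(5, 5), (3, 5), (10, 5)]:
    for actual in (5, 1):
        ceiling = max_overall_speed(job_speed, configured, actual)
        verdict = "exceeds" if ceiling > job_speed else "within"
        print(f"limit={job_speed} MB/s, configured={configured}, actual={actual}: "
              f"up to {ceiling} MB/s ({verdict} the task speed limit)")
```

The second scenario shows why throttling sometimes appears not to take effect: when the speed limit is lower than the configured concurrency, the 1 MB/s minimum per slice lets the total exceed the limit. The cases with an actual concurrency of 1 show why the real speed can be far below the threshold.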

Distributed processing

This section answers the following questions:

Question 1: In what scenarios do I need to configure distributed mode for synchronization jobs?
Question 2: Why is the synchronization job still slow when running in distributed mode?

Without distributed mode, the configured concurrency runs only on a single machine as process-level parallelism and cannot leverage multi-machine computing. Distributed execution mode distributes your task slices across multiple execution nodes for concurrent processing. This allows the synchronization speed to scale horizontally with the cluster size, breaking through single-machine bottlenecks. If you have high synchronization performance requirements, use distributed mode. Additionally, distributed mode can utilize fragmented resources across machines, which improves resource utilization.

Limits and best practices:

In distributed execution mode, configuring a high concurrency may create significant access pressure on your data storage. Evaluate the access load of your data storage before using this mode.
If your exclusive resource group has only one machine, distributed execution mode is not recommended because the execution processes are still distributed on a single worker node, which does not maximize the benefits of multi-machine distributed processing.
For synchronization tasks with small data volumes, distributed mode is not recommended. Configure a single-machine task with low concurrency instead.
Distributed mode can be enabled only when the concurrency is 8 or higher.

Dirty data limits

This section answers the following questions:

Question 1: What is dirty data in data synchronization?
Question 2: How do I configure dirty data limits for data synchronization tasks?
Question 3: What is the relationship between data synchronization speed and dirty data?

Dirty data limits control the behavior of a task when dirty data is encountered. Dirty data refers to data records that encounter an exception during the write process to the destination data source. Due to the complexity and differences in data processing across heterogeneous systems, the current policy is that all data that fails to be written is classified as dirty data. In some data synchronization scenarios, dirty data can degrade synchronization efficiency. For example, when writing to a relational database, the default mode is batch write. When dirty data is encountered, the process falls back to single-record write mode (to identify which record in the batch is dirty and ensure normal records are written successfully). However, single-record write is much slower than batch write, so encountering a large amount of dirty data can significantly slow down the overall task performance.
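
The batch-to-single-record fallback can be illustrated as follows. This is a simplified sketch, not the actual writer implementation: `InMemoryDestination` is a hypothetical stand-in for a relational destination that rejects any record with a null `id`, and `write_with_fallback` is an illustrative name.

```python
class InMemoryDestination:
    """Hypothetical destination: rejects records whose "id" is None."""

    def __init__(self):
        self.rows = []

    def write_one(self, record):
        if record.get("id") is None:
            raise ValueError("id must not be null")
        self.rows.append(record)

    def write_batch(self, batch):
        # All-or-nothing: one bad record causes the whole batch to be rejected.
        if any(record.get("id") is None for record in batch):
            raise ValueError("batch rejected")
        self.rows.extend(batch)


def write_with_fallback(batch, write_batch, write_one):
    """Try a batch write first; on failure, write record by record to isolate dirty data."""
    try:
        write_batch(batch)
        return []
    except Exception:
        dirty = []
        for record in batch:
            try:
                write_one(record)  # one round trip per record: much slower than a batch
            except Exception:
                dirty.append(record)
        return dirty


dest = InMemoryDestination()
dirty = write_with_fallback([{"id": 1}, {"id": None}, {"id": 3}], dest.write_batch, dest.write_one)
print(len(dest.rows), "records written,", len(dirty), "dirty")
```

A single dirty record turns one batch call into one call per record, which is why a large amount of dirty data slows the whole task.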

Currently, most Data Integration channels support dirty data threshold limits. For channels that support this feature, common configuration scenarios are described as follows:

No dirty data limit configured: All dirty data is tolerated, dirty data does not cause the task to fail, and the errorLimit field in the task configuration is left empty.
Dirty data limit set to 0: No dirty data is tolerated. The task fails as soon as one dirty data record is encountered.
Dirty data limit set to a positive integer N: A maximum of N dirty data records are tolerated. The task fails when the number of dirty data records exceeds N. Enter the dirty data threshold in the Error Count Exceeds field in the Channel Control section. Leave this field empty to indicate no limit.
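
The three cases reduce to a single comparison. The sketch below uses a hypothetical helper name and represents "no limit configured" as `None`; it only illustrates the rule and is not how Data Integration evaluates the setting internally.

```python
def exceeds_dirty_limit(dirty_count, limit=None):
    """Return True if the task should fail for the given number of dirty records.

    limit=None -- no limit configured: all dirty data is tolerated
    limit=0    -- no dirty data is tolerated
    limit=N    -- at most N dirty records are tolerated
    """
    if limit is None:
        return False
    return dirty_count > limit


print(exceeds_dirty_limit(100))          # no limit configured
print(exceeds_dirty_limit(1, limit=0))   # first dirty record with a limit of 0
print(exceeds_dirty_limit(5, limit=5))   # exactly N dirty records with a limit of N
```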

Best practices:

For scenarios that are sensitive to data quality, such as relational databases (MySQL, SQL Server, PostgreSQL, Oracle, PolarDB, and PolarDB-X), Hologres, ClickHouse, and AnalyticDB for MySQL, set the dirty data limit to 0 to identify data quality risks in a timely manner.
For scenarios that are not sensitive to data quality, do not configure a dirty data limit, or configure a reasonable dirty data threshold based on your business requirements. This helps reduce the operational burden of handling dirty data on a daily basis.
Configure task failure and latency alerts for critical tasks to detect issues in a timely manner.
For tasks that can be rerun, configure automatic rerun on failure to reduce the impact of occasional environment issues.

Data source connection quota limits

This section answers the following questions:

Question 1: What is the data source connection quota limit, and how do I configure a reasonable connection limit?
Question 2: Why are batch full synchronization tasks in the data synchronization solution running slowly and remaining in the Submit state for a long time?

The data source connection quota limits consist of the following two settings:

Destination write concurrency: The maximum number of threads that write data to the destination in a real-time synchronization task. Configure this value based on the resource group size and the actual scale of the destination. The configurable upper limit is 32, and the default value is 3.
Maximum source read connections: During the batch full data initialization phase of a data synchronization solution, JDBC connections are established to the database to read all historical data. This setting controls the maximum number of JDBC connections to the source, preventing a large number of tasks from starting simultaneously and exhausting the database connection pool, which could affect database stability. Configure this value based on the actual capacity of your database resources. The default value is 15. If you find that tasks remain in the Submit state for a long time, this is usually caused by the maximum data source connection limit (try staggering task execution or increasing the maximum connection limit).
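
The effect of the maximum source read connections setting can be pictured as a fixed pool of connection slots. The following sketch is an analogy built on a standard-library semaphore, assuming one slot per JDBC connection; the class and method names are hypothetical, and the real admission logic in Data Integration may differ.

```python
import threading


class SourceConnectionQuota:
    """Hypothetical model of a per-data-source JDBC connection quota."""

    def __init__(self, max_connections=15):  # 15 is the default stated in this section
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self):
        """Take one connection slot; return False if the quota is exhausted."""
        return self._slots.acquire(blocking=False)

    def release(self):
        """Return a slot when a task closes its connection."""
        self._slots.release()


quota = SourceConnectionQuota(max_connections=2)
print(quota.try_acquire())  # first task gets a connection
print(quota.try_acquire())  # second task gets a connection
print(quota.try_acquire())  # third task must wait (analogous to staying in the Submit state)
quota.release()             # a running task finishes
print(quota.try_acquire())  # the waiting task can now proceed
```

In this model, staggering task start times reduces how often the quota is exhausted, and raising the limit allows more tasks to start at once at the cost of more connections on the database.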

You can configure data source connection quota limits in the data synchronization solution as follows: In the Runtime Resource Settings step, you can set Destination Write Concurrency in the real-time synchronization section (default: 3, configurable upper limit: 32). In the 6.2 Batch Full Synchronization section, you can set Maximum Source Read Connections (default: 15) to control the number of JDBC connections to the database during the full data initialization phase.