Example scenarios for configuring incremental synchronization tasks

You can use a filter condition in a batch synchronization node to synchronize either full data or incremental data. With a filter condition, Data Integration synchronizes only data that meets the specified criteria. You can also combine scheduling parameters with the filter condition to dynamically filter data based on the node's runtime, enabling incremental synchronization. This topic shows you how to configure a batch synchronization node for incremental synchronization.

Usage notes

Incremental synchronization is not supported for some data sources, such as HBase and OTSStream. To check if incremental synchronization is supported for a specific data source, see the documentation for the corresponding reader plug-in.

The required parameters for incremental synchronization vary by reader plug-in. For details, see the documentation for the specific plug-in and Supported data sources and plug-ins. For example:

Reader plug-in	Required parameter	Supported syntax
MySQL Reader	where (Note: In wizard mode, this is the filter condition parameter.)	Database syntax. (Note: You can use this parameter with scheduling parameters to read data from a specified time range each day.)
MongoDB Reader	query (Note: In wizard mode, this is the Search Condition parameter.)	Similar to database syntax. (Note: You can use this parameter with scheduling parameters to read data from a specified time range each day.)
OSS Reader	Object	Specify the object path. (Note: You can use this parameter with scheduling parameters to read data from a specified file each day.)
...	...	...

Configure incremental synchronization

In a Data Integration batch synchronization node, you can use scheduling parameters to specify data paths and ranges for the source and destination tables. The configuration is the same as for other node types.

At runtime, the system replaces all placeholder parameters configured in the node with the actual values represented by the scheduling parameter expressions, and then performs the data synchronization.
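The replacement step described above can be pictured with a minimal Python sketch. It is not how DataWorks implements substitution; the function name render_placeholders and the sample value 20240115 are hypothetical and serve only to show a ${name} placeholder being replaced before the synchronization runs.

```python
import re

def render_placeholders(text: str, params: dict[str, str]) -> str:
    """Replace every ${name} in text with params[name].

    Hypothetical helper for illustration only; unknown names raise an
    error instead of being silently left in place.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value for scheduling parameter: {name}")
        return params[name]

    return re.sub(r"\$\{(\w+)\}", substitute, text)

# The value 20240115 stands in for whatever the scheduler resolves at runtime.
print(render_placeholders("pt=${bizdate}", {"bizdate": "20240115"}))
```

Failing on an unknown name, rather than leaving the placeholder untouched, makes a misconfigured parameter visible before any data is read.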

Take MySQL data synchronization as an example:

If you do not configure Data Filtering, all data is synchronized to the destination table by default.
If you configure Data Filtering, only data that meets the filter condition is synchronized to the destination table.

The partition name of the destination MaxCompute table is specified by a scheduling parameter. $bizdate represents the business date. When a scheduled task runs, the partition expression configured for the task is replaced with the business date represented by the scheduling parameter. For detailed configuration instructions on scheduling parameter expressions, see Application scenarios of scheduling parameters in Data Integration.

Take a batch synchronization task as an example. You need to configure the bizdate parameter in three places to implement incremental synchronization:

In the Data Filtering section of the source, enter STR_TO_DATE('${bizdate}','%Y%m%d') <= gmt_modify_time AND gmt_modify_time < DATE_ADD(STR_TO_DATE('${bizdate}','%Y%m%d'), interval 1 day) to filter data modified on the business date.
In the Partition Information section of the destination, enter pt=${bizdate} to write data to the corresponding date partition, and set the Cleanup Rule to Clean up existing data before writing (Insert Overwrite).
In the Parameters section of Schedule Settings on the right side, enter bizdate=$bizdate so that the scheduling system automatically replaces ${bizdate} with the actual business date at runtime.

When you configure incremental data synchronization:

Incremental synchronization based on time-type columns: You can use scheduling parameters to dynamically replace time-type data. During task scheduling, the scheduling parameters are automatically replaced with specific values based on the business date. For more information about scheduling parameters, see Configure scheduling parameters.
Incremental synchronization based on non-time-type columns: You can use an assignment node to convert the column to the target data type and then pass it to Data Integration for synchronization. For more information about assignment nodes, see Create an assignment node.
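The Data Filtering expression in the MySQL example above selects a half-open interval: from midnight of the business date (inclusive) to midnight of the next day (exclusive). The following Python sketch mirrors that logic so the boundary behavior is easy to check. The function names are hypothetical, and the sketch assumes the business date has the yyyymmdd format implied by the %Y%m%d pattern in the filter.

```python
from datetime import datetime, timedelta

def bizdate_window(bizdate: str) -> tuple[datetime, datetime]:
    """Return [start, end) for a yyyymmdd business date (hypothetical helper)."""
    start = datetime.strptime(bizdate, "%Y%m%d")
    return start, start + timedelta(days=1)

def modified_on_bizdate(modified: datetime, bizdate: str) -> bool:
    """Same test as the filter: start <= gmt_modify_time < start + 1 day."""
    start, end = bizdate_window(bizdate)
    return start <= modified < end

print(modified_on_bizdate(datetime(2024, 1, 15, 23, 59, 59), "20240115"))
print(modified_on_bizdate(datetime(2024, 1, 16, 0, 0, 0), "20240115"))
```

Because the upper bound is exclusive, a row modified exactly at midnight belongs to the next business date only, so consecutive daily runs neither skip nor duplicate it.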

Notes

When you configure an incremental synchronization task, note the following:

Safety of clean up existing data before writing (Insert Overwrite): When multiple synchronization tasks write to different partitions of the same MaxCompute table, the Insert Overwrite strategy is safe. This strategy clears only the partition data specified by the current task and does not affect data in other partitions of the table, preventing data conflicts or accidental deletion.
Partition range batch overwrite limitation: DataWorks does not support specifying an hour range (such as hh=00-23) in the partition configuration for batch overwrite. To overwrite data for multiple hours, configure separate tasks for each hour. Partition parameters currently support only a single specific value or the wildcard *.
Wildcard syntax: When the source contains an hour-level partition but the destination has only a day-level partition, enter the wildcard * in the hour partition field to match all hourly data. Enter * directly, without quotes (do not enter "*"); otherwise, a syntax error occurs.
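Because an hour range cannot be specified in a single partition configuration, overwriting a full day of hourly partitions means preparing one partition value per hour. The following Python sketch lists those values. The partition column names pt and hh and the comma-separated key=value format are assumptions for illustration; use the names defined on your own table.

```python
def hourly_partition_specs(bizdate: str, day_key: str = "pt", hour_key: str = "hh") -> list[str]:
    """Build one partition value per hour of the day (hypothetical helper).

    day_key and hour_key are assumed column names, not fixed by DataWorks.
    """
    return [f"{day_key}={bizdate},{hour_key}={hour:02d}" for hour in range(24)]

specs = hourly_partition_specs("20240115")
print(len(specs))
print(specs[0])
print(specs[-1])
```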

Timestamp-based high-frequency scheduled incremental synchronization

DataWorks supports timestamp-based scheduled incremental synchronization by combining batch synchronization tasks with periodic scheduling (such as every 5 minutes or every hour). This approach is suitable for T+1 or near-real-time synchronization scenarios from RDS MySQL to destinations such as SelectDB and StarRocks. It implements incremental synchronization through SQL filtering without requiring real-time CDC tasks, which avoids the costs of continuously running tasks.

Key configuration points:

In the where condition of the source Data Filtering, use a timestamp column as a variable for filtering. For example: gmt_modify_time >= '$[yyyymmddhhmiss-10/mi]' AND gmt_modify_time < '$[yyyymmddhhmiss]'.
Configure periodic scheduling parameters (such as $[yyyymmddhhmiss]) to dynamically calculate the time range, ensuring that each scheduling run synchronizes only the incremental data within the specified time interval. For detailed configuration of scheduling parameters, see Application scenarios of scheduling parameters in Data Integration.
In the destination column mapping, manually add a constant parameter mapped to the partition column to enable dynamic partition writing.
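The time-range filter in the first configuration point can be checked with a short Python sketch. It assumes that the scheduling parameter resolves relative to the scheduled run time and that the look-back window is 10 minutes, as in the example expression. The function names are hypothetical.

```python
from datetime import datetime, timedelta

STAMP = "%Y%m%d%H%M%S"  # same layout as yyyymmddhhmiss

def window_for_run(scheduled: datetime, minutes: int = 10) -> tuple[str, str]:
    """Return (lower, upper) bounds of the window that ends at the scheduled time."""
    start = scheduled - timedelta(minutes=minutes)
    return start.strftime(STAMP), scheduled.strftime(STAMP)

def where_clause(scheduled: datetime, column: str = "gmt_modify_time", minutes: int = 10) -> str:
    """Render the half-open filter: lower <= column < upper."""
    lower, upper = window_for_run(scheduled, minutes)
    return f"{column} >= '{lower}' AND {column} < '{upper}'"

run = datetime(2024, 1, 15, 10, 30, 0)
print(where_clause(run))
# The next run, 10 minutes later, starts exactly where this one ends.
print(window_for_run(run + timedelta(minutes=10)))
```

When the scheduling interval equals the window length, consecutive windows meet end to start, so each modified row falls into exactly one run. A window shorter than the interval leaves gaps, and a longer one reads some rows twice.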

Incremental configuration for database-level batch synchronization tasks

In addition to single-table synchronization tasks, you can also create a database-level batch synchronization task to implement periodic incremental synchronization. When creating the task, select incremental synchronization and configure the incremental condition in the database-level synchronization task. This enables efficient day-level incremental partition synchronization (for example, filtering by the create_time column).

This approach is suitable for scenarios where you want to centrally manage synchronization for multiple tables but only need incremental processing for specific tables. For the complete configuration process of database-level batch synchronization tasks, see Configure a database-level batch synchronization task.

Examples

Synchronize historical data: If you need to synchronize historical incremental data to the corresponding time partitions of the destination table, you can use the backfill data feature in Operation Center. For more information about the backfill data feature, see Backfill data. Configure the node and run the backfill as follows:

In the data synchronization node configuration, select MySQL as the data source and MaxCompute (ODPS) as the destination, and set the table name to a value such as czd.
In the data filter condition, use ${bizdate} to control the incremental range (for example, STR_TO_DATE('${bizdate}','%Y%m%d') <= gmt_modify_time).
Set the partition information to ds=${bizdate}, and set the cleanup rule to Clean up existing data before writing (Insert Overwrite).
In the parameters section of the schedule settings, define bizdate=$bizdate. This scheduling parameter is automatically replaced with the specific date value based on the business date during backfill.
When you run the backfill, you can set multiple business date ranges (for example, 2022-05-01 to 2022-05-31 and 2022-04-01 to 2022-04-30), select Immediately Run Backfill Instances Whose Scheduled Time Is Later Than the Current Time, and select Ascending Order of Business Dates for execution.

Incremental data synchronization from ApsaraDB RDS to MaxCompute
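To see what a backfill over several business date ranges amounts to, the following Python sketch expands the two example ranges from the historical data scenario into individual business dates in ascending order. Each date corresponds to one run that writes one ds partition. The function name is hypothetical, and the sketch only models the date expansion, not the backfill feature itself.

```python
from datetime import date, timedelta

def business_dates(ranges: list[tuple[date, date]]) -> list[str]:
    """Expand inclusive date ranges into sorted, de-duplicated yyyymmdd strings."""
    days: set[date] = set()
    for start, end in ranges:
        current = start
        while current <= end:
            days.add(current)
            current += timedelta(days=1)
    return [day.strftime("%Y%m%d") for day in sorted(days)]

dates = business_dates([
    (date(2022, 5, 1), date(2022, 5, 31)),
    (date(2022, 4, 1), date(2022, 4, 30)),
])
print(len(dates), dates[0], dates[-1])
```

Because each run overwrites only its own partition, rerunning the same business date is repeatable: the partition is cleared and rewritten rather than appended to.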

FAQ

What do I do if a partition not found error occurs after I use scheduling parameters in the MaxCompute Reader partition filter?

Cause: The configured scheduling parameters are not correctly resolved to actual partition values at runtime, or the resolved values do not match the actual partitions in the source table.

Solution: If the partition value is passed from the outputs parameter of an upstream node, check the parameter configuration in Data Studio and make sure the following conditions are met:

The configured parameter name is consistent with the input parameter name.
The passed parameter value exactly matches the actual partition in MaxCompute.
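The two causes described above can be told apart with a simple check on the resolved partition value. The following Python sketch is a troubleshooting aid only, not a DataWorks API. The function name and the sample partition values are hypothetical, and the list of existing partitions is assumed to come from your own inspection of the source table.

```python
from typing import Optional

def diagnose_partition(resolved: str, existing: set[str]) -> Optional[str]:
    """Return a short diagnosis, or None when the value matches a partition."""
    if "$" in resolved:
        # A leftover $ means the scheduling parameter was never replaced.
        return "placeholder was not resolved"
    if resolved not in existing:
        return "resolved value does not match any existing partition"
    return None

existing = {"pt=20240114", "pt=20240115"}
print(diagnose_partition("pt=${bizdate}", existing))
print(diagnose_partition("pt=2024-01-15", existing))
print(diagnose_partition("pt=20240115", existing))
```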

Does a DataWorks batch synchronization task perform full synchronization or incremental synchronization by default? How do I configure incremental synchronization for a source table without partition columns?

Default behavior

DataWorks Data Integration batch synchronization tasks perform full synchronization by default, which means all data is synchronized each time. Incremental synchronization is enabled only when you configure a Data Filtering condition combined with scheduling parameters.

Handling source tables without partition columns

If the source database table (such as an RDS table) does not have a time or partition column, you cannot directly filter incremental data by using a where condition. We recommend that you add a time column (such as dt or gmt_modify_time) to the source table as the basis for incremental filtering. After the column is ready, configure incremental synchronization logic by referring to the "Configure incremental synchronization" section in this topic.

Key concepts

Parameter definitions

splitPk (split key): Specifies a primary key column. DataWorks splits data into multiple chunks based on the value range of this column to enable multi-threaded concurrent reads.
splitFactor (split factor): Controls the split granularity. A larger value results in finer splits and more read threads.
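The idea behind range-based splitting can be sketched in Python as follows. This is not the DataWorks implementation. The function name is hypothetical, the split column is assumed to be an integer, and deriving the chunk count as concurrency multiplied by the split factor is an assumption used only to show why a larger factor yields finer splits.

```python
def split_pk_ranges(min_pk: int, max_pk: int, chunks: int) -> list[tuple[int, int]]:
    """Split the inclusive range [min_pk, max_pk] into contiguous inclusive chunks."""
    if max_pk < min_pk:
        raise ValueError("max_pk must not be smaller than min_pk")
    total = max_pk - min_pk + 1
    chunks = max(1, min(chunks, total))  # never more chunks than key values
    base, extra = divmod(total, chunks)
    ranges = []
    start = min_pk
    for index in range(chunks):
        size = base + (1 if index < extra else 0)  # spread the remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges

concurrency, split_factor = 1, 3  # assumed values for illustration
print(split_pk_ranges(1, 10, concurrency * split_factor))
```

Each chunk translates into one range query on the split column, which is why an index on that column matters for read efficiency.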

Performance impact and recommendations

Enabling splitPk and splitFactor increases the load on the source database. To reduce that load, we recommend that you do the following:

Reduce the concurrency, or set the concurrency of the database-level batch synchronization task to 1.
Make sure that the split column (splitPk) is indexed to improve read efficiency.