Batch database synchronization configuration

更新时间:
复制 MD 格式

The batch database synchronization feature in DataWorks lets you periodically synchronize the schema and data of some or all tables from a source database to a destination. It supports full, incremental, or combined synchronization modes. This topic uses the migration of a MySQL database to MaxCompute as an example to describe how to configure these tasks.

Prerequisites

Applicability

You can configure batch database synchronization tasks in both DataStudio and Data Integration. The features are identical in both modules.

  • Consistent configuration: The configuration interface, parameter settings, and underlying functions are identical whether you create the task in DataStudio or Data Integration.

  • Bidirectional synchronization: Data Integration automatically synchronizes tasks to and displays them in the data_integration_jobs directory in DataStudio. These tasks are categorized by the SourceType-DestinationType channel for unified management.

Configure a synchronization task

Step 1: Create a synchronization task

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Integration > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.

  2. In the left-side navigation pane, click Synchronization Task, and then click Create Synchronization Task at the top of the page. Configure the task information:

    • Data Source Type: MySQL.

    • Data Source Type: MaxCompute.

    • Specific Type: batch database synchronization.

    • Synchronization Mode: The Synchronization steps setting works with the subsequent full and incremental control setting. You can combine them to create different synchronization solutions. For details, see Full and incremental control.

      • Schema Migration: Automatically creates database objects like tables, fields, and data types in the destination to match the source schema, but does not migrate data.

      • Full Synchronization (optional): Copies all historical data from specified source objects, such as tables, to the destination in a single operation. This is typically used for initial data migration or data initialization.

      • Incremental Sync (optional): After a full synchronization is complete, this mode continuously captures new data from the source based on the incremental condition and synchronizes it to the destination.

Step 2: Configure data sources and runtime resources

  1. In the Source Data Source section, select the MySQL data source in your workspace. In the Destination section, select the added MaxCompute data source.

  2. In the Running Resources section, select the Resource Group for the synchronization task and allocate Resource Group CUs. If your synchronization task fails due to an out-of-memory (OOM) error, increase the allocated Resource Group CUs.

  3. Ensure that both the source and destination data sources pass the Connectivity Check.

Step 3: Configure the synchronization solution

1. Data source

In this step, select the tables to synchronize from the source data source in the Source tables section. Click the image icon to move them to the Selected tables list on the right.

image

If there are many databases or tables, you can use Database Filtering or Table filtering and configure regular expressions to select the tables you want to synchronize.

2. Destination

Click the Configuration button next to Partition Initialization Configuration to uniformly initialize partitions for new destination tables. Changes made here apply to all new destination tables and do not affect existing tables.

3. Full and incremental control

  1. Configure the task execution frequency.

    • If you select "Full synchronization" or "Incremental synchronization", you can choose to run the task as a One-time task or a Recurring task.

    • If you select both "Full synchronization" and "Incremental synchronization", the system uses a built-in mode: "One-time full synchronization for the first run, followed by recurring incremental synchronization". This option cannot be changed.

      Synchronization steps

      Full and incremental control

      Behavior

      Use case

      Full synchronization

      One-time

      After the task starts, it synchronizes all data from the source table once to the destination table or a specified partition.

      Data initialization, system migration

      Recurring

      Based on the configured scheduling cycle, it periodically synchronizes all data from the source table to the destination table or a specified partition.

      Data reconciliation, T+1 full snapshots

      Incremental synchronization

      One-time

      After the task starts, it synchronizes incremental data once to a specified partition based on the incremental condition you define.

      Manual repair of a specific data batch

      Recurring

      After the task starts, it periodically synchronizes incremental data to a specified partition based on the configured scheduling cycle and incremental condition.

      Daily ETL, building slowly changing dimension tables

      Full synchronization & Incremental synchronization

      (Built-in mode, not selectable)

      First run: Automatically performs a one-time schema migration and full synchronization of historical data.

      Subsequent runs: Periodically synchronizes incremental data to a specified partition based on the configured scheduling cycle and incremental condition.

      One-click data ingestion into a data warehouse or data lake

      Note
      • For batch database synchronization, the system generates instances for recurring tasks in the same way as for tasks set to Start immediately after deployment. For more information, see Instance generation method: Start immediately after deployment.

      • You can define the partition generation method in the subsequent Value assignment step. You can use constants or dynamically generate partitions using system-predefined variables and scheduling parameters.

      • The configurations for the scheduling cycle, incremental condition, and partition generation method are interdependent. For more information, see 6. Incremental condition.

  2. Configure scheduling parameters.

    If your task involves recurring synchronization, click Scheduling Parameters for Periodical Scheduling to configure them. These parameters can be used later when configuring the incremental condition and field value assignment in the destination table mapping.

4. Destination table mapping

In this step, define the mapping rules between source and destination tables, and specify the data writing method by defining the recurring configuration and incremental condition.

image

Actions

Description

Refresh mapping

The system automatically lists the source tables you selected, but you must refresh the mapping to apply the destination table properties.

  • Select the tables to be synchronized in bulk and click Batch Refresh Mapping.

  • Destination table name: The destination table name is automatically generated based on the Customize Mapping Rules for Destination Table Names rules. The default is ${SourceDatabaseName}_${TableName}. If a table with this name does not exist at the destination, the system automatically creates it.

Edit field type mapping (optional)

The system has a default mapping between source and destination field types. You can click Edit Mapping of Field Data Types in the upper-right corner to customize the mapping. After configuration, click Apply and Refresh Mapping.

When editing field type mappings, ensure that the conversion rules are correct. Otherwise, type conversion may fail, resulting in dirty data and affecting task execution.

Customize destination table name mapping (optional)

The system has a default table name generation rule: ${SourceDatabaseName}_${TableName}. You can also click the Edit button in the Customize Mapping Rules for Destination Table Names column to add a custom rule.

  • Rule Name: Define a name for the rule. We recommend using a name with clear business meaning.

  • Destination Table Name: You can construct the destination table name by clicking the image button and combining Manually enter and Built-in Variable. Supported variables include the source data source name, source database name, and source table name.

  • Edit Built-in Variables: You can apply string transformations to the built-in variables.

This feature supports the following scenarios:

  1. Add prefixes or suffixes to names: Use constants to add a prefix or suffix to the source table name.

    Rule configuration

    Result

    image

    image

  2. Global string replacement: Replace the string dev_ in source table names with prd_.

    Rule configuration

    Result

    image

    image

  3. Write data from multiple tables into a single table.

    Rule configuration

    Result

    image

    image

Customize destination database name mapping (optional)

Some destination data sources, such as Hologres, support defining mapping rules for destination databases. The configuration method is the same as for Customize destination table name mapping.

Customize destination schema name mapping (optional)

Some destination data sources, such as Hologres, support defining mapping rules for destination schemas. The configuration method is the same as for Customize destination table name mapping.

Edit destination table schema (optional)

The system automatically generates the destination table schema based on the source table schema. You do not usually need to modify it. For special requirements, you can customize it as follows:

  • Add a field to a single table: Click the image.png button in the Target Table column to add a field.

  • Add fields in bulk: Select all tables to be synchronized, then at the bottom of the table, select Batch Edit > Batch Modify Table Schema.

  • Renaming columns is not supported.

Value assignment for destination table fields

You must manually assign values to partition fields and any newly added fields. Standard fields are automatically mapped based on matching names in the source and destination tables. Follow these steps:

  • Assign values for a single table: Click the Configuration button in the Value assignment column to assign values to the destination table fields.

  • Assign values in bulk: At the bottom of the list, select Batch Edit > Value assignment to assign values to identical fields across multiple destination tables.

You can assign constants or variables. In the Value Type for Table fields and Partition fields, you can switch between the following types:

  • Manual assignment: Enter a constant value or a scheduling parameter directly, such as abc, or use the ${xxx} format to enter a scheduling parameter name, such as ${bizdate}.

  • Select variable: Select a system-supported variable from the drop-down list. You can view the specific meaning of each variable in the image tooltip on the interface.

Both variables and scheduling parameters are automatically replaced with date-specific values when the task is scheduled to run.

Set source sharding column

In the source sharding column, you can select a field from the source table or choose Disable. When the synchronization task runs, DataWorks splits the task into multiple subtasks based on this field to enable concurrent, batched data reading.

It is recommended to use the table's primary key as the source sharding column. String, floating-point, and date types are not supported.

Currently, the source sharding column is supported only for MySQL sources.

Customize advanced parameters

Modify these parameters only if you fully understand them. Incorrect settings can cause task delays, resource contention, or data loss. You can set individual writer-side configurations and runtime configurations for subtasks.

Table type

Table Type: MaxCompute supports standard tables, PK Delta Tables, and Append Delta Tables. If the destination table status is "To be created", you can select the table type when editing the destination table schema. The type of an existing table cannot be changed.

For more information about Delta Tables, see Delta Table.

5. Recurring configuration

If incremental synchronization is configured as Recurring, you need to complete the recurring configuration for the destination table. This includes settings like Scheduling Frequency, Data Timestamp, and Resource Group for Scheduling. The scheduling configuration for this synchronization task is consistent with the node scheduling configuration in DataStudio. For parameter details, see Configure node scheduling.

Note

If a one-time synchronization involves a large number of tables, we recommend staggering their execution times in the schedule to prevent task buildup and resource contention.

6. Incremental condition

If the task needs to synchronize incremental data, you must configure an incremental condition. This condition determines which data each scheduled instance synchronizes.

  1. Function and syntax

    • Function: The incremental condition is essentially a WHERE clause that filters the source data.

    • Syntax: When configuring, you only need to enter the conditional expression that follows the WHERE keyword. Do not include the WHERE keyword itself.

  2. Use scheduling parameters for incremental synchronization

    To achieve periodic incremental synchronization, you can use scheduling parameters in the incremental condition. For example, configuring the condition as STR_TO_DATE('${bizdate}', '%Y%m%d') <= columnName AND columnName < DATE_ADD(STR_TO_DATE('${bizdate}', '%Y%m%d'), INTERVAL 1 DAY) synchronizes data generated on the previous day.

  3. Write to a specific partition

    By combining the incremental condition with the destination table's partition field, you can ensure that each batch of incremental data is written to the correct partition.

    For example, using the incremental condition from the previous step, you can set the partition field to ds=${bizdate} and configure the destination table to be partitioned by day. This way, each daily instance synchronizes only the data from the corresponding date in the source and writes it to the matching partition in the destination table.

Important
By properly combining the time range specified by the incremental condition, the time interval for partition generation, and the scheduling cycle from the recurring configuration, you can create an automated T+n incremental ETL pipeline where business rules and physical partitions are strictly aligned.

Step 4: Configure advanced settings

Advanced parameters

If you need to fine-tune the task to meet custom synchronization requirements, you can modify the advanced parameters.

  1. Click Advanced configuration in the upper-right corner to go to the advanced parameter configuration page.

  2. Modify the parameter values based on the tooltips. The meaning of each parameter is explained next to its name.

  3. You can also use AI-assisted configuration by entering natural language instructions, such as adjusting task concurrency. The AI model generates recommended parameter values, which you can then review and accept.

    image

Important

Modify these parameters only if you fully understand them. Incorrect settings can cause task delays, resource contention, or data loss.

Engine parameters

You do not normally need to configure engine parameters. If changes are required, use them only with guidance from technical support.

Step 5: Deploy and run the task

  1. After completing all configurations, click Save at the bottom of the page to finish configuring the task.

  2. You cannot debug batch synchronization tasks directly. You must Deploy them to Operation Center to run them. Therefore, any new or edited tasks must be deployed for the changes to take effect.

  3. During deployment, if you select Start immediately after deployment, the task starts automatically. Otherwise, after deployment is complete, go to the Data Integration > Synchronization Task page and manually start the task from the Actions column.

  4. Click the Name/ID of the task in the Tasks to view the detailed execution process.

Step 6: Configure alert rules

You must configure alert rules for batch database synchronization tasks on their corresponding subtasks in Operation Center.

  1. On the Data Integration > Synchronization Task page, get the Task ID of the target task.

  2. In the Operation Center > Auto Triggered Task column, find the corresponding full-database offline subtask based on the ID of the integration task (for example, if the integration task ID is 34862, the incremental synchronization auto triggered subtask is named offline_odps_cyc_sync_mysql_test_timetest_to_mysql_test_timetest_34862), and click More > Add Alert Rule to go to the Rule Management page.

  3. Click Create Custom Rule and set the Rule Object, Trigger Method, and Alert Details. For more information, see Rule Management.

    You can search for the subtask ID you obtained in the Rule Object to find the target task and set an alert for it.

Manage synchronization tasks

Edit a task

  1. On the Data Integration > Synchronization Task page, find the created synchronization task. In the Operation column, click More, then click Edit to modify the task information. The steps are the same as for task configuration.

  2. For a task that is not running, you can directly modify the configuration, save it, and deploy it for the changes to take effect.

  3. For a task that is running, if you edit and deploy it without selecting Start immediately after deployment, the original action button changes to Apply Updates. You must click this button for the changes to take effect in the production environment.

  4. After you click Apply Updates, the system performs three steps on the changes: Stop, Deploy, and Restart.

    • If you add a table:

      After you click Apply Updates, a synchronization subtask for the new table is added. The task's schema migration and one-time full synchronization start immediately, followed by incremental synchronization according to the schedule.

    • If you switch the destination table (which is equivalent to deleting the old table and adding a new one):

      After you click Apply Updates, the old table's subtask is deleted, a new subtask is created, and the new task's schema migration and one-time full synchronization start immediately. The new task then performs incremental synchronization according to the schedule.

    • If other information is modified:

      The table's schema migration and one-time full synchronization remain unchanged. Newly generated instances for incremental synchronization will use the updated configuration, while already generated instances are not affected.

    Unmodified tables are not affected and will not be re-run.

View tasks

After creating a synchronization task, you can view the list of created tasks and their basic information on the synchronization task page.

image

  • In the Actions column, you can Start or Stop the synchronization task. In the More menu, you can perform operations such as Edit and View.

  • For a running task, you can see the basic status in the Execution Overview. You can also click the corresponding overview area to view execution details.

    image

Quotas and limitations

  • Debugging limitation: You cannot debug batch synchronization tasks directly in Data Integration or DataStudio. You must deploy them to Operation Center to run them.

  • Sharding column limitation: The Set source sharding column feature currently supports only MySQL sources. The sharding column field must be of a numeric type; string, floating-point, or date types are not supported.

  • Schema change limitation: When you edit the destination table schema, renaming columns is not supported. For an existing destination table, you cannot change its Table type.

Next steps

After the task starts, you can click the task name to view its run details and perform operations, maintenance (O&M), and tuning.

FAQ

For frequently asked questions about batch database synchronization tasks, see FAQ about full and incremental synchronization tasks.

More use cases

See Batch database synchronization use cases.