Configure real-time full-database synchronization

更新时间:
复制 MD 格式

The real-time full-database synchronization feature combines a one-time full migration with continuous incremental capture to synchronize an entire source database, such as MySQL or Oracle, to a destination system with low latency. This task supports full synchronization of historical data from the source database and automatically initializes the schema and data for the destination table. The task then automatically switches to a real-time incremental mode, using technologies like change data capture (CDC) to continuously capture and synchronize subsequent changes. This feature is suitable for scenarios such as building real-time data warehouses and data lakes. This topic uses an example of synchronizing data from a MySQL database to MaxCompute in real time to describe how to configure a synchronization task.

Prerequisites

Usage notes

  • DataWorks supports two types of full-database synchronization: real-time full-database synchronization and full & incremental (near real-time). Both types can perform full synchronization for historical data in a source database and then automatically switch to a real-time incremental mode. However, the two types differ in latency and destination table requirements:

    • Timeliness: The real-time full-database synchronization feature provides a latency of seconds to minutes. The full & incremental (near real-time) feature provides a T+1 timeliness.

    • Destination table (MaxCompute):

      • PK Delta Table: All features of real-time full-database synchronization are supported.

      • Regular table and Append Delta Table: Only the Append mode is supported when you select incremental synchronization mode in a real-time full-database synchronization task.

      • Full & incremental (near real-time): All the preceding table types are supported.

  • Real-time full-database synchronization tasks can be configured in DataStudio (Data Studio) and Data Integration. The two modules are functionally interoperable.

    • Consistent configuration: Whether you create a task in Data Studio or the Data Integration module, the configuration interface, parameter settings, and underlying features are identical.

    • Bidirectional synchronization: Tasks created in the Data Integration module are automatically synchronized and displayed in the data_integration_jobs directory in the Data Studio module. These tasks are categorized by channels in the format of source type-destination type for unified management.

Configure a task

Step 1: Create a synchronization task

  1. Log on to the DataWorks console. In the target region, click Data Integration > Data Integration in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Integration.

  2. In the left-side navigation pane, click Synchronization Task, click Create Synchronization Task at the top of the page, and configure the task information:

    • Source Type: MySQL.

    • Destination Type: MaxCompute.

    • Specific Type: Real-time Full-database.

    • Synchronization Mode:

      • Schema Migration: Automatically creates database objects (such as tables, columns, and data types) that match the source on the destination, without including data.

      • Full Synchronization (optional): Performs a one-time copy of all historical data from the specified objects (such as tables) on the source to the destination. This is typically used for initial data migration or data initialization.

      • Incremental Sync (optional): After full synchronization is complete, continuously captures change data (inserts, updates, and deletes) from the source and synchronizes it to the destination.

Step 2: Configure data sources and running resources

  1. In the Source Data Source section, select the MySQL data source that has been added to the workspace. In the Destination section, select the MaxCompute data source that has been added.

  2. In the Running Resources section, select the Resource Group for the synchronization task, and allocate Resource Group CU to the task.

    Note

    If the task logs display a message such as Please confirm whether there are enough resources..., it indicates that the available compute units (CUs) in the current resource group are insufficient for starting or running the task. You can increase the number of CUs allocated to the task in the Configure Resource Group panel to allocate more compute resources.

    For recommended resource size values, see Recommended CUs for Data Integration. Adjust the values based on your actual requirements.

  3. Ensure that both the source data source and the destination data source pass the Connectivity Check.

Step 3: Configure the synchronization solution

1. Configure the data source

  • In this step, you can select the tables to synchronize from the source data source in the Source Tables section, and click the image icon to move them to the Selected Tables section on the right. If there are many tables, you can use Database Filtering or Table filtering to select the tables to synchronize by configuring regular expressions.

    image

  • To write data from multiple sharded tables (with the same schema) to a single destination table, you can Select Tables by Regex.

    image
    Enter a regular expression in the source table configuration. DataWorks automatically identifies and collects all matching source tables and writes their data to the destination table mapped by the expression.

    Note

    This method is applicable to sharded table merge synchronization scenarios (similar to sharding-based synchronization), improving configuration efficiency and avoiding the need to repeatedly add many-to-one synchronization rules.

2. Configure the data destination

If only Incremental Sync is selected for the real-time full-database synchronization task, you can configure the incremental synchronization mode for writing to the destination table.

  • Replay: Only PK Delta Table is supported. Similar to normal synchronization, only data columns are synchronized.

  • Append: Regular tables and Append Delta Tables are supported. Source table real-time data is written to the destination table with appended metadata such as insert, update, and delete operations. For the format of the append log table, see Append log table format.

3. Destination table mapping

In this step, you must define the mapping rules between source tables and destination tables, and specify the primary key, dynamic partition, DDL/DML configuration, and other rules to determine how data is written.

Operation

Description

Refresh

The system automatically lists the source tables you selected, but the specific properties of the destination tables take effect only after you refresh and confirm them.

  • Select the tables to synchronize in batches and click Batch Refresh Mapping.

  • Destination table name: The destination table name is automatically generated based on the Customize Mapping Rules for Destination Table Names rules. The default format is ${source_database_name}_${table_name}. If a table with the same name does not exist on the destination, the system automatically creates one.

Customize Mapping Rules for Destination Table Names (optional)

The system has a default table name generation rule: ${source_database_name}_${table_name}. You can also click the Edit button in the Customize Mapping Rules for Destination Table Names column to add custom destination table name rules.

  • Rule name: Define a rule name. We recommend that you use a name with a clear business meaning.

  • Destination table name: You can click the image button to select Manually enter and Built-in Variable to concatenate and generate the destination table name. The supported variables include source data source name, source database name, and source table name.

  • Edit built-in variables: Built-in variables support string transformations on top of the original values.

The following scenarios are supported:

  1. Add a prefix or suffix to a name: Add a prefix or suffix to the source table name by setting a constant.

    Rule configuration

    Result

    image

    image

  2. Unified string replacement: Replace the string dev_ in the source table name with prd_.

    Rule configuration

    Result

    image

    image

  3. Write multiple tables to a single table: Set the destination table name to a constant.

    Rule configuration

    Result

    image

    image

Edit column type mapping (optional)

The system has default column type mappings between source and destination. You can click Edit Mapping of Field Data Types in the upper-right corner of the table to customize the column type mapping between source and destination tables, and then click Apply and Refresh Mapping.

When editing column type mappings, make sure the type conversion rules are correct. Incorrect rules may cause type conversion failures that produce dirty data and affect task execution.

Edit destination table schema (optional)

The system automatically creates destination tables that do not exist based on the custom table name mapping rules, or reuses existing tables with the same name.

DataWorks automatically generates the destination table schema based on the source table schema. Manual intervention is not required in most cases. You can also modify the table schema in the following ways:

  • Add columns to a single table: Click the image.png button in the Target Table column to add columns.

  • Add columns in batches: Select all tables to synchronize, and choose Batch Edit > Destination Table Schema - Batch Modify and Add Field at the bottom of the table.

  • Renaming column names is not supported.

For existing tables, you can only add columns. For new tables, you can add columns and partition columns, and set the table type or table properties. For details, see the editable areas on the interface.

Value assignment

Native columns are automatically mapped based on columns with the same name in the source and destination tables. The newly added columns and partition columns from the preceding steps must be manually assigned values. Perform the following operations:

  • Assign values to a single table: Click the Configuration button in the Value assignment column to assign values to destination table columns.

  • Assign values in batches: Choose Batch Edit > Value assignment at the bottom of the list to assign values in batches to the same columns in the destination table.

When assigning values, you can assign constants and variables. Switch the type in Value Type. The following methods are supported:

  • Table columns

    • Manual input: Directly enter a constant value, such as abc.

    • Select a variable: Select a system-supported variable from the drop-down list. You can view the specific meaning of the variable in the image tooltip on the interface.

    • Function: You can use functions to perform simple transformations on the destination columns. For more information, see Function reference.

  • Partition columns: You can dynamically create partitions by using the enumeration values of a source column or the event time as the partition value.

    • Manual input: Directly enter a constant value, such as abc.

    • Source column: Use the values of a source table column as the partition column values. The value type can be a column value or a time value.

      • Column value: The enumeration values from the source column. We recommend that you use a column with a limited number of enumeration values to prevent too many partitions and overly scattered data.

      • Time value: If the values in the source column are timestamps, you can process them based on different formats and specify a Target format to format the partition values.

        • Time string: A string that represents a time, such as "2018-10-23 02:13:56" or "2021/05/18". You can serialize it into a time value by specifying the source and destination time formats. For the preceding examples, you can use the yyyy-MM-dd HH:mm:ss and yyyy/MM/dd formats for serialization.

        • Time object: If the source value is already a time type such as Date or Datetime, select this type directly.

        • Unix timestamp (seconds): A second-level timestamp. This also supports numbers or strings in the 10-digit timestamp format, such as 1610529203 or "1610529203".

        • Unix timestamp (milliseconds): A millisecond-level timestamp. This also supports numbers or strings in the 13-digit timestamp format, such as 1610529203002 or "1610529203002".

    • Select a variable: You can use the source event change time EVENT_TIME as the partition value source. The usage is similar to that of source columns.

    • Function: You can use functions to perform simple transformations on the source column before using the result as a partition value. For more information, see Function reference.

Note

Too many partitions can affect synchronization efficiency. If more than 1,000 new partitions are created per day, partition creation fails and the task is terminated. Therefore, when defining partition column value assignment methods, estimate the number of partitions that may be generated. Use second-level and millisecond-level partition creation methods with caution.

Source Split Key

You can select a column from the source table or select Not Split in the source split key drop-down list. During task execution, data is split into multiple tasks based on this column to enable concurrent, batch data reads.

We recommend that you use the table primary key as the source split key. String, floating-point, date, and other types are not supported.

Currently, the source split key is supported only when the source is MySQL.

Skip Full Synchronization

If you have configured full synchronization in Step 3, you can individually skip full data synchronization for specific tables. This is applicable to scenarios where full data has already been synchronized to the destination through other methods.

Full condition

Apply conditional filtering to the source during the full synchronization phase. You only need to write the WHERE clause here, without the WHERE keyword.

Configure DML Rule

DML message processing is used to perform fine-grained filtering and control on change data captured from the source (Insert, Update, Delete) before writing the data to the destination. This rule takes effect only during the incremental phase.

Others

Table Type: MaxCompute supports regular tables, PK Delta Table, and Append Delta Table. If the destination table status is to be created, you can select the table type when editing the destination table schema. The type cannot be changed for existing tables.

  • The full + incremental mode of real-time full-database synchronization supports only the PK Delta Table table type as the destination.

  • In incremental-only mode, the replay mode supports PK Delta Table, and the append mode supports regular table types and Append Delta Table.

For more information about Delta Table, see Delta Table overview.

Step 4: Advanced configuration

Advanced parameter configuration

To fine-tune the task configuration and achieve custom synchronization requirements, go to the Advanced Parameters tab and modify the advanced parameters.

  1. Click Advanced Configuration in the upper-right corner of the page to go to the advanced parameter configuration page.

  2. Modify the parameter values based on the parameter descriptions. The meaning of each parameter is described after the parameter name.

  3. AI-assisted configuration is also supported. Enter natural language instructions, such as adjusting the task concurrency. The large language model generates recommended parameter values. You can decide whether to accept the AI-generated parameters based on your actual requirements.

    image

Important

Modify parameters only when you fully understand their meanings. Incorrect modifications may cause unexpected issues such as task latency, excessive resource consumption that blocks other tasks, and data loss.

DDL capability configuration

Some real-time synchronization channels can detect metadata changes in the source table schema and notify the destination to synchronize the updates, or take other actions such as alerting, ignoring, or terminating the task.

You can click Configure DDL Capability in the upper-right corner to set the processing policy for each type of change. The supported processing policies vary by channel.

  • Normal processing: The destination processes the DDL change information from the source.

  • Ignore: The change message is ignored, and the destination is not modified.

  • Error: The real-time full-database synchronization task is terminated, and the status is set to Error.

  • Alert: An alert is sent to you when this type of change occurs on the source. You must configure a DDL notification rule in Configure Alert Rule.

Note

After a new column is added on the source and created on the destination through DDL synchronization, the system does not backfill data for the existing data in the destination table.

Step 5: Deploy and run the task

  1. After you complete all configurations, click Save at the bottom of the page to save the task configuration.

  2. Full-database synchronization tasks do not support direct debugging. You must deploy them to Operation Center for execution. Therefore, you must perform the Deploy operation to make any new or edited task take effect.

  3. During deployment, if you select Start immediately after deployment, the task starts simultaneously with the deployment. Otherwise, after deployment, you must go to Data Integration > Synchronization Task and manually start the task in the Operation column of the target task.

  4. Click the Name/ID of the corresponding task in Tasks to view the detailed execution process of the task.

Step 6: Alert configuration

1. Create an alert

In the Data Integration > Synchronization Task list, find the real-time full-database task, and click More > Alerts in the Operation column to configure alert policies for the task.

image

(1) Click Create Rule to configure an alert rule.

You can set Alert Reason to monitor task metrics such as Business delay, Failover, Task status, DDL Notification, and Task Resource Utilization, and set CRITICAL or WARNING alert levels based on specified thresholds.

  • By configuring Configure Advanced Parameters, you can control the time interval between alert messages to prevent sending too many messages at once, which can cause waste and message accumulation.

  • If the alert reason is set to Business delay, Task status, or Task Resource Utilization, you can also enable recovery notifications to notify recipients when the task returns to normal.

(2) Manage alert rules.

For existing alert rules, you can use the alert switch to enable or disable alert rules. You can also send alerts to different people based on the alert level.

2. View alerts

Click More > Configure Alert Rule in the task list to expand the panel and go to the alert events page, where you can view alerts that have occurred.

Manage tasks

Edit a task

  1. On the Data Integration > Synchronization Task page, find the synchronization task you created, click More in the Operation column, and click Edit to modify the task information. The procedure is the same as for task configuration.

  2. For tasks that are not running, you can directly modify the configuration, save it, and then deploy the task to the production environment for it to take effect.

  3. For Running tasks, when you edit and deploy the task without selecting Start immediately after deployment, the original operation button changes to Apply Updates. You must click this button for the changes to take effect in the production environment.

  4. After you click Apply Updates, the system performs three steps: Stop, Deploy, and Restart.

    • If the change involves adding new tables or switching existing tables:

      Selecting a checkpoint is not supported when applying updates. After you confirm, the system performs schema migration and full initialization for the new tables. After full initialization is complete, the new tables begin incremental operations together with the original tables.

    • If other information is modified:

      Selecting a checkpoint is supported when applying updates. After you confirm, the task continues running from the specified checkpoint. If no checkpoint is specified, the task resumes from the checkpoint at which it was last stopped.

    Unmodified tables are not affected. After the update and restart, they continue running from the point at which the task was last stopped.

View a task

After you create a synchronization task, you can view the list of created synchronization tasks and the basic information of each task on the synchronization task page.

image

  • You can click Start or Stop in the Operation column to start or stop a synchronization task. In the More menu, you can edit or View the synchronization task.

  • For started tasks, you can view the basic execution status in Execution Overview, and click the corresponding summary area to view the execution details.

    image

Checkpoint-based resumption

Use cases

Manually resetting the checkpoint when starting or restarting a task is mainly applicable to the following scenarios:

  • Task recovery and data resumption: After a task is interrupted, you may need to manually specify a checkpoint at the interruption time to ensure data resumes from the exact breakpoint.

  • Data troubleshooting and replay: If you find that synchronized data is missing or abnormal, you can roll back the checkpoint to a time before the issue occurred to replay and fix the problematic data.

  • Major task configuration changes: After making significant changes to the task configuration (such as destination table schema or column mapping), we recommend that you reset the checkpoint to start synchronization from a specific time to ensure data accuracy under the new configuration.

Instructions

Click Start, and in the dialog, select Whether to reset the site:

image

  • Do not select reset checkpoint and run directly: The task continues running from the checkpoint at which it was last stopped (the last checkpoint).

  • Reset checkpoint and select a time: The task starts running from the specified checkpoint. Make sure the selected time does not exceed the earliest time supported by the source Binlog.

Important

If you receive a checkpoint error or a message indicating that the checkpoint does not exist when running a synchronization task, try the following solutions:

  • Reset the checkpoint: When starting the real-time synchronization task, reset the checkpoint and select the earliest available checkpoint for the source database.

  • Adjust the log retention period: If the database checkpoint has expired, consider adjusting the log retention period in the database, for example, to 7 days.

  • Data synchronization: If data has been lost, consider performing a full synchronization again, or configuring a batch synchronization task to manually synchronize the missing data.

Task operations and tuning

After the task starts, if you encounter issues such as data consumption latency, stalling, or poor performance, see Troubleshoot and tune real-time synchronization tasks for solutions.

FAQ

For frequently asked questions about real-time full-database synchronization, see FAQ about real-time full-database synchronization.

Appendix: Append log table format

Source columns flattened

Column name

Description

sequence_id

The record ID of the incremental event. The value is unique and incrementing.

operation_type

The operation type (I/D/U).

execute_time

The timestamp of the data.

before_image

Whether this is the pre-change image (Y/N).

after_image

Whether this is the post-change image (Y/N).

src_datasource

The data source from which the data originates.

src_database

The database from which the data originates.

src_table

The table from which the data originates.

Column 1

Actual data column 1.

Column 2

Actual data column 2.

Column 3

Actual data column 3.

Source columns merged into JSON

Column name

Description

sequence_id

The record ID of the incremental event. The value is unique and incrementing.

operation_type

The operation type (I/D/U).
DDL: ALTER, TRUNCATE, RENAME

execute_time

The timestamp of the data.

before_image

Whether this is the pre-change image (Y/N).

after_image

Whether this is the post-change image (Y/N).

src_datasource

The data source from which the data originates.

src_database

The database from which the data originates.

src_table

The table from which the data originates.

ddl_sql

When the operation is a DDL type, the DDL statement is written to this column.

data_columns

Actual data columns merged into JSON.