Data Integration provides a wizard-based guide to periodically synchronize full or incremental data from a single source table or sharded source tables to a destination data table. You can configure a task without writing any code by selecting a data source and a destination in the UI and using DataWorks scheduling parameters. This topic describes the general configurations for a single-table offline task in wizard mode. The configurations for different data sources may vary. For more information, see Supported data sources and synchronization solutions.
Prerequisites
-
Data sources are configured. Before you create a Data Integration task, make sure that the required source and destination data sources are configured in Data Source Management in the DataWorks console. For more information, see Data source list.
Note-
For information about the data sources that support single-table batch synchronization and their configurations, see Supported data sources and synchronization solutions.
-
For an overview of data source features, see Data Source Management.
-
-
You have purchased a resource group with appropriate specifications and associated it with the workspace. For more information, see Use a Serverless resource group.
-
Network connectivity is established between the resource group and the data sources. For more information, see Configure network connectivity.
Step 1: Create a Data Integration node
Data Studio (new version)
-
Log on to the DataWorks console and switch to the target region. In the navigation pane on the left, choose . Select the desired workspace from the drop-down list and click <p><a href={url} target="_blank">Learn more.</a></p> Data Studio.
-
Create a workflow. For more information, see Workflows.
-
Create a Data Integration node. You can create a Data Integration node in one of the following ways:
-
Method 1: In the upper-right corner of the workflow list, click the
icon and choose . -
Method 2: Double-click the workflow name and drag the Data Integration node from the Data Integration directory to the workflow canvas on the right.
-
-
Configure the source and destination types, select Single Table Batch Sync as the specific type, and then click OK to create the node.
Data Development (legacy)
-
Log on to the DataWorks console and switch to the target region. In the navigation pane on the left, choose . Select the desired workspace from the drop-down list and click Data Analytics.
-
Create a workflow. For more information, see Create a workflow.
-
Create a batch synchronization node. You can create a batch synchronization node in one of the following ways:
-
Method 1: Expand the workflow, right-click .
-
Method 2: Double-click the workflow name and drag the Batch Synchronization node from the Data Integration directory to the workflow canvas on the right.
-
-
Follow the on-screen instructions to create the batch synchronization node.
Step 2: Configure data sources and runtime resources
In this example, the Source data source type is set to MySQL and the specific data source is mysql. The Destination data source type is set to MaxCompute, and the data source is own_mc. The Resource Group is set to dwGroup, and the CU is set to 0.5 CU.
-
In the Source Information and Destination sections, select the data source objects that you want to read from and write to.
-
In the Runtime Resource section, select the Resource Group for the synchronization task, and allocate Resource Group CU to the task. If your synchronization task encounters an out-of-memory (OOM) error due to insufficient resources, adjust the CU allocation for the resource group. For the recommended resource quota configurations, see Resource Group Performance Metrics - Data Integration.
-
Make sure that both the source and destination data sources pass the Connectivity Check. If a data source fails the connectivity check, follow the on-screen instructions or see Configure network connectivity to establish a network connection.
-
If a resource group that you created is not displayed, make sure that the resource group is associated with the workspace. For more information, see Use a Serverless resource group.
Step 3: Configure the synchronization solution
In the Source and Destination sections, configure the source and destination tables, and specify the range of data to synchronize.
Plugin configurations may vary. The following sections describe common settings. To learn if a configuration is supported and how to implement it, see the documentation for the specific plugin. For more information, see Data source list.
1. Source
In the Source section, configure the data table and specify the required parameters as prompted.
|
Actions |
Description |
|
Configure data filtering |
Some source types support data filtering. You can specify a condition (a To perform incremental synchronization, you can combine this filter condition with scheduling parameters to make the condition dynamic. For example, if you use the The method for configuring incremental synchronization varies by data source (plugin). If you do not configure a data filter, all data in the table is synchronized by default. |
|
Configure a sharding key for a relational database |
When the task runs, it splits the data into multiple subtasks based on this field for concurrent, batched data reads. We recommend using the table's primary key as the sharding key (splitPk) because primary keys are usually distributed evenly, which helps prevent data skew across the shards. Currently, splitPk supports sharding only for integer data types. String, float, and date types are not supported. If you specify a field of an unsupported data type, DataWorks ignores the splitPk setting and uses a single channel for synchronization. If you do not specify a value for splitPk (if the parameter is omitted or left empty), a single channel is used to synchronize the data. Not all plugins support specifying a sharding key to configure the sharding logic. The information above is for reference only. For more information, see the documentation for the specific plugin. For more information, see Data source list. |
2. Data processing
The data processing feature is available in Data Studio. To use this feature in Data Development (legacy), you must select Use New Version (with data processing feature) when you create a task. We recommend upgrading your workspace to the latest version to use all available features. For more information, see Data Studio upgrade guide.
The data processing feature lets you transform source data by using methods such as string replacement, AI-assisted processing, and data embedding before writing the data to the destination table.
For example, to configure a Replace String node, you need to provide a Name and Description. For each replacement rule, select a Field Name, specify the Content to be replaced (with support for Regular expression and Case-sensitive matching), and enter the Replacement content. Click Add Rule to add more rules. You can view the results in the Data output preview panel.
-
Click the switch to enable data processing.
-
In the Data Processing List, click Add Node and select a processing type: Replace String, AI-assisted processing, or data embedding. You can add multiple data processing nodes. DataWorks processes them in sequence.
-
Configure the data processing rules as prompted. For information about AI-assisted processing and data embedding, see Data processing.
NoteData processing consumes additional computing resources and increases task runtime. To maintain synchronization efficiency, keep the processing logic as simple as possible.
3. Destination
In the Destination section, configure the data table and specify the required parameters as prompted.
|
Actions |
Description |
|
Configure pre- and post-synchronization statements |
Some data sources allow you to execute SQL statements on the destination before (pre-synchronization) or after (post-synchronization) the task writes data to the destination. Example: The MySQL writer supports the preSql and postSql parameters, which allow you to run MySQL commands before or after writing data to a MySQL database. For example, you can set the Statement Run Before Writing (preSql) parameter to |
|
Define the write mode for conflicts |
Specifies how to handle write conflicts, such as those that involve paths or primary keys. The available options depend on the destination data source and the writer plugin. For configuration details, see the documentation for the specific writer plugin. |
4. Configure field mapping
After you select a source and a destination, you must map the source fields to the destination fields. The task uses this mapping to write data from each source field to the corresponding destination field.
-
If a source field is not mapped to a destination field, its data is not synchronized.
-
If the automatic mapping result is incorrect, you must adjust it manually.
-
To exclude a field from synchronization, you can delete the connecting line between the source and destination fields.
Data type mismatches between source and destination fields can produce dirty data and cause write failures. You can configure the tolerance for dirty data in the Advanced configuration section, as described in the next step.
Data Integration supports mapping by name, by row, and by using the smart mapping and rule-based mapping features. You can also perform the following actions:
-
Smart mapping: To improve configuration efficiency and reduce manual errors, Data Integration now supports smart mapping. This feature uses AI-powered semantic analysis to automatically identify field names, data types, and comments in the source and destination tables, and then recommends an optimal mapping. You can then confirm or slightly adjust the recommendations to complete the configuration.
In the field mapping area, click Smart Mapping to open the dialog box. You can describe your mapping requirements in natural language.
-
Use case
Typical example
Recommended prompt
Global semantic matching
Field names are completely different but have the same meaning
(Example:user_idtodevice_id)Semantically match all fields between the source and destination tables and automatically identify fields with the same meaning.Specific business domain matching
Map only fields related to a specific business
(Example: user or order information)Map only the fields that contain user information, such as name, phone number, and ID, from the source table to the corresponding fields in the destination table.
(Note: You can replace the keywords with "order," "logistics," or "payment.")Prefix/suffix convention differences
Core names are the same, but prefixes or suffixes are different
(Example:src_user_nametotgt_user_name)Ignore field prefixes and suffixes and perform semantic matching based only on the core name.Abbreviation and full name matching
One uses an abbreviation, the other uses the full name
(Example:amttoamount)Identify common English abbreviation-full name pairs, such as amt=amount and addr=address, and create mappings.Excluding specific fields
Some fields are similar but should not be synchronized
(Example: excludingcreate_time)Perform semantic matching, but exclude all fields that contain "time" or "log".Correcting complex logic
The automatic mapping result is incorrect and requires manual guidance
Do not map the id field of the source table to the order_id field of the destination table. Generate the mapping suggestions again. -
After you enter your description, click Generate Preview. The system displays the suggested mappings in the Matching Result Preview area. You can review and select the mappings you need, and then click Apply to add them to the field mapping configuration. If you are not satisfied with the results, you can modify your description and generate a new preview.
-
-
Rule-based mapping: If systematic differences exist between the source and destination field names, you can use the rule-based mapping feature to create field mappings in batches by configuring rules such as prefix and suffix matching or character replacement. In the field mapping area, click Rule-based Mapping, select a mapping method and rule, preview the results, and click Apply.
-
Assign values to destination fields: In the Source Table Field column, click Add Fields to add constants, scheduling parameters, or built-in variables to the destination table. For example, you can add '123', '${scheduling_parameter}', or '#{built_in_variable}#'.
NoteFor more information about how to use scheduling parameters, see Formats of scheduling parameters.
-
Add built-in variables: You can manually add built-in variables and map them to destination fields to pass their values to downstream components.
The following table describes the built-in variables that are available for different plugins.
Built-in variable
Description
Supported plugins
'
#{DATASOURCE_NAME_SRC}#'Name of the source data source
-
MySQL Reader
-
MySQL (Sharded) Reader
-
PolarDB Reader
-
PolarDB (Sharded) Reader
-
PostgreSQL Reader
-
PolarDB-O Reader
-
PolarDB-O (Sharded) Reader
'
#{DB_NAME_SRC}#'Name of the database where the source table resides
-
MySQL Reader
-
MySQL (Sharded) Reader
-
PolarDB Reader
-
PolarDB (Sharded) Reader
-
PostgreSQL Reader
-
PolarDB-O Reader
-
PolarDB-O (Sharded) Reader
'
#{SCHEMA_NAME_SRC}#'Name of the schema where the source table resides
-
PolarDB Reader
-
PolarDB (Sharded) Reader
-
PostgreSQL Reader
-
PolarDB-O Reader
-
PolarDB-O (Sharded) Reader
'
#{TABLE_NAME_SRC}#'Name of the source table
-
MySQL Reader
-
MySQL (Sharded) Reader
-
PolarDB Reader
-
PolarDB (Sharded) Reader
-
PostgreSQL Reader
-
PolarDB-O Reader
-
PolarDB-O (Sharded) Reader
'
#{FILE_NAME_SRC}#'File name
-
OSS Reader
-
HDFS Reader
-
FTP Reader
-
TOS Reader
-
COS Reader
-
S3 Reader
-
Azure Blob Reader
'
#{FILE_PATH_SRC}#'Absolute file path
-
OSS Reader
-
HDFS Reader
-
FTP Reader
-
TOS Reader
-
COS Reader
-
S3 Reader
-
Azure Blob Reader
-
-
Edit source fields: Click Manually Edit Mapping to perform the following operations:
-
Apply functions that are supported by the source database to source fields. For example, you can use the
Max(id)function to synchronize only the maximum value. -
Manually edit source fields if automatic mapping did not retrieve all of them.
NoteThe MaxCompute reader does not support functions.
-
Step 4: Configure advanced settings
Advanced settings was known as the Channel feature in earlier versions of Data Integration.
You can use advanced settings to control data synchronization properties. For more information about the parameters, see Relationship between concurrency and throttling for batch synchronization.
|
Parameter |
Description |
|
Expected Maximum Concurrency |
The maximum number of threads for concurrently reading data from the source or writing data to the destination. Note
|
|
Sync Rate |
This setting controls the data transfer speed.
Note
This traffic metric measures the throughput within Data Integration, not the actual network interface card (NIC) traffic. NIC traffic is typically one to two times the channel traffic, depending on the serialization overhead of the data storage system. |
|
Policy for Dirty Data Records |
Dirty data refers to records that the task fails to write to the destination due to an exception, such as a data type conflict or a constraint violation. You can define a dirty data policy that specifies a tolerance threshold and how dirty data affects the task.
Important
An excessive amount of dirty data can slow down the overall synchronization speed. |
|
Distributed Execution |
This setting controls whether to run the task in distributed mode.
Use distributed mode for high-performance requirements or to make efficient use of fragmented machine resources. Important
|
|
Time Zone |
To synchronize data across time zones, specify the source time zone to enable conversion. |
The overall synchronization speed is affected not only by the settings above but also by other factors, such as source data source performance and network conditions. For more information about synchronization speed and how to tune it, see Accelerate or rate-limit batch synchronization tasks.
Step 5: Configure scheduling properties
For periodically scheduled tasks, you must configure their auto-scheduling properties. On the node configuration page, click Scheduling Settings in the right-side panel.
You need to configure scheduling parameters, a scheduling policy, a schedule time, and scheduling dependencies for the synchronization task. This process is the same as for other data development nodes and is not detailed here.
-
For Data Studio, see Node scheduling (new version).
-
For Data Development (legacy), see Node scheduling configuration (legacy version).
For information about how to use scheduling parameters, see Typical scenarios for using scheduling parameters in Data Integration.
Step 6: Test and publish the task
-
Configure run parameters.
In the right-side panel, click Run Configuration and configure the parameters for a test run.
Parameter
Description
Resource Group
Select a resource group that is connected to the data sources.
Script Parameters
Assign values to placeholder parameters in the data synchronization task. For example, if the task uses the
${bizdate}parameter, you must enter a date parameter in theyyyymmddformat. -
Run the task.
Click the
Run icon in the toolbar to run and debug the task in Data Studio. After the task completes, you can create a node for the destination data source type to query the destination table and verify the results. -
Publish the task.
If the task passes the test run and needs to run on a schedule, click the
Publish icon at the top of the node configuration page to publish the task to the production environment. For more information, see Publish a task.
Limitations
-
You can configure single-table batch synchronization tasks only in Data Studio or Data Development.
-
The codeless UI does not support configuring these tasks for some data sources.
If a message indicates that the codeless UI is not supported for your selected data source, click the
icon in the toolbar to switch to the code editor and continue to configure the task. For more information, see Configure a task in the code editor. -
Although the codeless UI is easy to use, it lacks some advanced features. For more fine-grained configuration, click the script conversion icon in the toolbar to switch to the code editor.
Next steps
After publishing the task to the production environment, you can view it in Operation Center. For more information on running, managing, and monitoring the task, and maintaining the resource group, see O&M for batch synchronization tasks.