Configure a single-table offline task using the wizard

更新时间:
复制 MD 格式

Data Integration provides a wizard-based guide to periodically synchronize full or incremental data from a single source table or sharded source tables to a destination data table. You can configure a task without writing any code by selecting a data source and a destination in the UI and using DataWorks scheduling parameters. This topic describes the general configurations for a single-table offline task in wizard mode. The configurations for different data sources may vary. For more information, see Supported data sources and synchronization solutions.

Prerequisites

  • Data sources are configured. Before you create a Data Integration task, make sure that the required source and destination data sources are configured in Data Source Management in the DataWorks console. For more information, see Data source list.

    Note
  • You have purchased a resource group with appropriate specifications and associated it with the workspace. For more information, see Use a Serverless resource group.

  • Network connectivity is established between the resource group and the data sources. For more information, see Configure network connectivity.

Step 1: Create a Data Integration node

Data Studio (new version)

  1. Log on to the DataWorks console and switch to the target region. In the navigation pane on the left, choose Data Development and O&M > DataStudio. Select the desired workspace from the drop-down list and click <p><a href={url} target="_blank">Learn more.</a></p> Data Studio.

  2. Create a workflow. For more information, see Workflows.

  3. Create a Data Integration node. You can create a Data Integration node in one of the following ways:

    • Method 1: In the upper-right corner of the workflow list, click the image icon and choose Create Node > Data Integration.

    • Method 2: Double-click the workflow name and drag the Data Integration node from the Data Integration directory to the workflow canvas on the right.

  4. Configure the source and destination types, select Single Table Batch Sync as the specific type, and then click OK to create the node.

Data Development (legacy)

  1. Log on to the DataWorks console and switch to the target region. In the navigation pane on the left, choose Data Development and O&M > DataStudio. Select the desired workspace from the drop-down list and click Data Analytics.

  2. Create a workflow. For more information, see Create a workflow.

  3. Create a batch synchronization node. You can create a batch synchronization node in one of the following ways:

    • Method 1: Expand the workflow, right-click Data Integration > Create Node > Batch Synchronization.

    • Method 2: Double-click the workflow name and drag the Batch Synchronization node from the Data Integration directory to the workflow canvas on the right.

  4. Follow the on-screen instructions to create the batch synchronization node.

Step 2: Configure data sources and runtime resources

In this example, the Source data source type is set to MySQL and the specific data source is mysql. The Destination data source type is set to MaxCompute, and the data source is own_mc. The Resource Group is set to dwGroup, and the CU is set to 0.5 CU.

  1. In the Source Information and Destination sections, select the data source objects that you want to read from and write to.

  2. In the Runtime Resource section, select the Resource Group for the synchronization task, and allocate Resource Group CU to the task. If your synchronization task encounters an out-of-memory (OOM) error due to insufficient resources, adjust the CU allocation for the resource group. For the recommended resource quota configurations, see Resource Group Performance Metrics - Data Integration.

  3. Make sure that both the source and destination data sources pass the Connectivity Check. If a data source fails the connectivity check, follow the on-screen instructions or see Configure network connectivity to establish a network connection.

Note
  • If a resource group that you created is not displayed, make sure that the resource group is associated with the workspace. For more information, see Use a Serverless resource group.

Step 3: Configure the synchronization solution

In the Source and Destination sections, configure the source and destination tables, and specify the range of data to synchronize.

Important

Plugin configurations may vary. The following sections describe common settings. To learn if a configuration is supported and how to implement it, see the documentation for the specific plugin. For more information, see Data source list.

1. Source

In the Source section, configure the data table and specify the required parameters as prompted.

Actions

Description

Configure data filtering

Some source types support data filtering. You can specify a condition (a WHERE clause without the where keyword) to filter source data. When the task runs, it synchronizes only the data that meets this condition. For more information, see Scenario: Configure a batch synchronization task for incremental data.

To perform incremental synchronization, you can combine this filter condition with scheduling parameters to make the condition dynamic. For example, if you use the gmt_create >= '${bizdate}' condition, the task synchronizes only the data that is generated on the current day each time it runs. When you configure scheduling properties, you must assign a value to the variable defined in this condition. For more information, see Formats of scheduling parameters.

The method for configuring incremental synchronization varies by data source (plugin).

If you do not configure a data filter, all data in the table is synchronized by default.

Configure a sharding key for a relational database

When the task runs, it splits the data into multiple subtasks based on this field for concurrent, batched data reads.

We recommend using the table's primary key as the sharding key (splitPk) because primary keys are usually distributed evenly, which helps prevent data skew across the shards.

Currently, splitPk supports sharding only for integer data types. String, float, and date types are not supported. If you specify a field of an unsupported data type, DataWorks ignores the splitPk setting and uses a single channel for synchronization.

If you do not specify a value for splitPk (if the parameter is omitted or left empty), a single channel is used to synchronize the data.

Not all plugins support specifying a sharding key to configure the sharding logic. The information above is for reference only. For more information, see the documentation for the specific plugin. For more information, see Data source list.

2. Data processing

Important

The data processing feature is available in Data Studio. To use this feature in Data Development (legacy), you must select Use New Version (with data processing feature) when you create a task. We recommend upgrading your workspace to the latest version to use all available features. For more information, see Data Studio upgrade guide.

The data processing feature lets you transform source data by using methods such as string replacement, AI-assisted processing, and data embedding before writing the data to the destination table.

For example, to configure a Replace String node, you need to provide a Name and Description. For each replacement rule, select a Field Name, specify the Content to be replaced (with support for Regular expression and Case-sensitive matching), and enter the Replacement content. Click Add Rule to add more rules. You can view the results in the Data output preview panel.

  1. Click the switch to enable data processing.

  2. In the Data Processing List, click Add Node and select a processing type: Replace String, AI-assisted processing, or data embedding. You can add multiple data processing nodes. DataWorks processes them in sequence.

  3. Configure the data processing rules as prompted. For information about AI-assisted processing and data embedding, see Data processing.

    Note

    Data processing consumes additional computing resources and increases task runtime. To maintain synchronization efficiency, keep the processing logic as simple as possible.

3. Destination

In the Destination section, configure the data table and specify the required parameters as prompted.

Actions

Description

Configure pre- and post-synchronization statements

Some data sources allow you to execute SQL statements on the destination before (pre-synchronization) or after (post-synchronization) the task writes data to the destination.

Example: The MySQL writer supports the preSql and postSql parameters, which allow you to run MySQL commands before or after writing data to a MySQL database. For example, you can set the Statement Run Before Writing (preSql) parameter to truncate table tablename. This clears existing data from the destination table before new data is written.

Define the write mode for conflicts

Specifies how to handle write conflicts, such as those that involve paths or primary keys. The available options depend on the destination data source and the writer plugin. For configuration details, see the documentation for the specific writer plugin.

4. Configure field mapping

After you select a source and a destination, you must map the source fields to the destination fields. The task uses this mapping to write data from each source field to the corresponding destination field.

  • If a source field is not mapped to a destination field, its data is not synchronized.

  • If the automatic mapping result is incorrect, you must adjust it manually.

  • To exclude a field from synchronization, you can delete the connecting line between the source and destination fields.

Data type mismatches between source and destination fields can produce dirty data and cause write failures. You can configure the tolerance for dirty data in the Advanced configuration section, as described in the next step.

Data Integration supports mapping by name, by row, and by using the smart mapping and rule-based mapping features. You can also perform the following actions:

  • Smart mapping: To improve configuration efficiency and reduce manual errors, Data Integration now supports smart mapping. This feature uses AI-powered semantic analysis to automatically identify field names, data types, and comments in the source and destination tables, and then recommends an optimal mapping. You can then confirm or slightly adjust the recommendations to complete the configuration.

    In the field mapping area, click Smart Mapping to open the dialog box. You can describe your mapping requirements in natural language.

    • Use case

      Typical example

      Recommended prompt

      Global semantic matching

      Field names are completely different but have the same meaning
      (Example: user_id to device_id)

      Semantically match all fields between the source and destination tables and automatically identify fields with the same meaning.

      Specific business domain matching

      Map only fields related to a specific business
      (Example: user or order information)

      Map only the fields that contain user information, such as name, phone number, and ID, from the source table to the corresponding fields in the destination table.
      (Note: You can replace the keywords with "order," "logistics," or "payment.")

      Prefix/suffix convention differences

      Core names are the same, but prefixes or suffixes are different
      (Example: src_user_name to tgt_user_name)

      Ignore field prefixes and suffixes and perform semantic matching based only on the core name.

      Abbreviation and full name matching

      One uses an abbreviation, the other uses the full name
      (Example: amt to amount)

      Identify common English abbreviation-full name pairs, such as amt=amount and addr=address, and create mappings.

      Excluding specific fields

      Some fields are similar but should not be synchronized
      (Example: excluding create_time)

      Perform semantic matching, but exclude all fields that contain "time" or "log".

      Correcting complex logic

      The automatic mapping result is incorrect and requires manual guidance

      Do not map the id field of the source table to the order_id field of the destination table. Generate the mapping suggestions again.

    • After you enter your description, click Generate Preview. The system displays the suggested mappings in the Matching Result Preview area. You can review and select the mappings you need, and then click Apply to add them to the field mapping configuration. If you are not satisfied with the results, you can modify your description and generate a new preview.

  • Rule-based mapping: If systematic differences exist between the source and destination field names, you can use the rule-based mapping feature to create field mappings in batches by configuring rules such as prefix and suffix matching or character replacement. In the field mapping area, click Rule-based Mapping, select a mapping method and rule, preview the results, and click Apply.

  • Assign values to destination fields: In the Source Table Field column, click Add Fields to add constants, scheduling parameters, or built-in variables to the destination table. For example, you can add '123', '${scheduling_parameter}', or '#{built_in_variable}#'.

    Note

    For more information about how to use scheduling parameters, see Formats of scheduling parameters.

  • Add built-in variables: You can manually add built-in variables and map them to destination fields to pass their values to downstream components.

    The following table describes the built-in variables that are available for different plugins.

    Built-in variable

    Description

    Supported plugins

    '#{DATASOURCE_NAME_SRC}#'

    Name of the source data source

    • MySQL Reader

    • MySQL (Sharded) Reader

    • PolarDB Reader

    • PolarDB (Sharded) Reader

    • PostgreSQL Reader

    • PolarDB-O Reader

    • PolarDB-O (Sharded) Reader

    '#{DB_NAME_SRC}#'

    Name of the database where the source table resides

    • MySQL Reader

    • MySQL (Sharded) Reader

    • PolarDB Reader

    • PolarDB (Sharded) Reader

    • PostgreSQL Reader

    • PolarDB-O Reader

    • PolarDB-O (Sharded) Reader

    '#{SCHEMA_NAME_SRC}#'

    Name of the schema where the source table resides

    • PolarDB Reader

    • PolarDB (Sharded) Reader

    • PostgreSQL Reader

    • PolarDB-O Reader

    • PolarDB-O (Sharded) Reader

    '#{TABLE_NAME_SRC}#'

    Name of the source table

    • MySQL Reader

    • MySQL (Sharded) Reader

    • PolarDB Reader

    • PolarDB (Sharded) Reader

    • PostgreSQL Reader

    • PolarDB-O Reader

    • PolarDB-O (Sharded) Reader

    '#{FILE_NAME_SRC}#'

    File name

    • OSS Reader

    • HDFS Reader

    • FTP Reader

    • TOS Reader

    • COS Reader

    • S3 Reader

    • Azure Blob Reader

    '#{FILE_PATH_SRC}#'

    Absolute file path

    • OSS Reader

    • HDFS Reader

    • FTP Reader

    • TOS Reader

    • COS Reader

    • S3 Reader

    • Azure Blob Reader

  • Edit source fields: Click Manually Edit Mapping to perform the following operations:

    • Apply functions that are supported by the source database to source fields. For example, you can use the Max(id) function to synchronize only the maximum value.

    • Manually edit source fields if automatic mapping did not retrieve all of them.

    Note

    The MaxCompute reader does not support functions.

Step 4: Configure advanced settings

Important

Advanced settings was known as the Channel feature in earlier versions of Data Integration.

You can use advanced settings to control data synchronization properties. For more information about the parameters, see Relationship between concurrency and throttling for batch synchronization.

Parameter

Description

Expected Maximum Concurrency

The maximum number of threads for concurrently reading data from the source or writing data to the destination.

Note
  • Due to factors such as resource specifications, the actual concurrency at runtime may be less than or equal to the value you specify. The resource group used for debugging is charged based on the actual concurrency. For more information, see Performance metrics.

  • Task scheduling fees depend on the number of single-table batch synchronization tasks, not the configured concurrency.

Sync Rate

This setting controls the data transfer speed.

  • Throttling: You can limit the synchronization rate to protect the source database from being overloaded by high extraction speeds. The minimum speed limit is 1 MB/s.

  • No throttling: The task runs at the maximum transfer rate supported by the hardware within the configured concurrency limit.

Note

This traffic metric measures the throughput within Data Integration, not the actual network interface card (NIC) traffic. NIC traffic is typically one to two times the channel traffic, depending on the serialization overhead of the data storage system.

Policy for Dirty Data Records

Dirty data refers to records that the task fails to write to the destination due to an exception, such as a data type conflict or a constraint violation. You can define a dirty data policy that specifies a tolerance threshold and how dirty data affects the task.

  • By default, dirty data is allowed and does not affect task execution.

  • If you set the tolerance to 0, no dirty data is allowed. If any dirty data is generated during synchronization, the task fails.

  • If you allow dirty data and set a threshold:

    • If the number of dirty data records is within the threshold, the task ignores these records (they are not written to the destination) and continues to run.

    • If the number of dirty data records exceeds the threshold, the task fails.

Important

An excessive amount of dirty data can slow down the overall synchronization speed.

Distributed Execution

This setting controls whether to run the task in distributed mode.

  • Enabled: Splits the task into multiple concurrent processes. This mode helps overcome single-process bottlenecks and improve synchronization efficiency.

  • Disabled: The task runs as a single process.

Use distributed mode for high-performance requirements or to make efficient use of fragmented machine resources.

Important
  • Distributed execution can be enabled only when the concurrency is set to 8 or greater.

  • Enabling distributed execution consumes more resources. If an out-of-memory (OOM) error occurs at runtime, try disabling this setting.

Time Zone

To synchronize data across time zones, specify the source time zone to enable conversion.

Note

The overall synchronization speed is affected not only by the settings above but also by other factors, such as source data source performance and network conditions. For more information about synchronization speed and how to tune it, see Accelerate or rate-limit batch synchronization tasks.

Step 5: Configure scheduling properties

For periodically scheduled tasks, you must configure their auto-scheduling properties. On the node configuration page, click Scheduling Settings in the right-side panel.

You need to configure scheduling parameters, a scheduling policy, a schedule time, and scheduling dependencies for the synchronization task. This process is the same as for other data development nodes and is not detailed here.

For information about how to use scheduling parameters, see Typical scenarios for using scheduling parameters in Data Integration.

Step 6: Test and publish the task

  • Configure run parameters.

    In the right-side panel, click Run Configuration and configure the parameters for a test run.

    Parameter

    Description

    Resource Group

    Select a resource group that is connected to the data sources.

    Script Parameters

    Assign values to placeholder parameters in the data synchronization task. For example, if the task uses the ${bizdate} parameter, you must enter a date parameter in the yyyymmdd format.

  • Run the task.

    Click the image Run icon in the toolbar to run and debug the task in Data Studio. After the task completes, you can create a node for the destination data source type to query the destination table and verify the results.

  • Publish the task.

    If the task passes the test run and needs to run on a schedule, click the image Publish icon at the top of the node configuration page to publish the task to the production environment. For more information, see Publish a task.

Limitations

  • You can configure single-table batch synchronization tasks only in Data Studio or Data Development.

  • The codeless UI does not support configuring these tasks for some data sources.

    If a message indicates that the codeless UI is not supported for your selected data source, click the image.png icon in the toolbar to switch to the code editor and continue to configure the task. For more information, see Configure a task in the code editor.

  • Although the codeless UI is easy to use, it lacks some advanced features. For more fine-grained configuration, click the script conversion icon in the toolbar to switch to the code editor.

Next steps

After publishing the task to the production environment, you can view it in Operation Center. For more information on running, managing, and monitoring the task, and maintaining the resource group, see O&M for batch synchronization tasks.

Related documentation