首页 Tablestore User Guide Data Channels Data import Synchronize MaxCompute data

Synchronize MaxCompute data

更新时间: 2026-06-03 04:25:35

You can use DataWorks data integration to export all data from MaxCompute to Tablestore.

Background

DataWorks is a unified, end-to-end big data development and governance platform built on big data engines such as MaxCompute, Hologres, EMR, AnalyticDB, and CDP. It provides comprehensive support for data warehouse, data lake, and lakehouse solutions.

Data Integration is a stable, efficient, and elastically scalable platform that reliably synchronizes data at high speeds between various heterogeneous data sources in complex network environments.

Synchronizing processed data from MaxCompute to Tablestore with DataWorks lets you use the high-concurrency throughput and complex query capabilities of Tablestore for real-time analytics and versatile retrieval.

image

Prerequisites

Important

If your MaxCompute and Tablestore instances are in different regions, create a VPC peering connection to establish cross-region network connectivity.

Create a VPC peering connection for cross-region connectivity

This example assumes the DataWorks workspace and MaxCompute instance are in the China (Hangzhou) region, and the Tablestore instance is in the China (Shanghai) region.

  1. Bind a VPC to the Tablestore instance.

    1. Log on to the Tablestore console and select a region in the top navigation bar.

    2. Click the instance alias to go to the Instance Management page.

    3. Go to the Network Management tab, click Bind VPC, select a VPC and a vSwitch, enter a name for the VPC, and then click Yes.

    4. After the VPC is bound, the page refreshes automatically. You can then view the VPC ID and VPC endpoint in the VPC list.

      Note

      Use this VPC endpoint when you add a Tablestore data source in the DataWorks console.

      image

  2. Obtain the VPC information of the DataWorks workspace resource group.

    1. Log on to the DataWorks console, select the region of the workspace from the top navigation bar, and then click Workspaces in the navigation pane on the left to go to the Workspace list page.

    2. Click the workspace name to go to the Workspace Details page. In the navigation pane on the left, click Resource Group to view the list of resource groups bound to the workspace.

    3. Click Network Settings to the right of the target resource group. In the Resource Scheduling & Data Integration area, view the VPC ID of the bound VPC.

  3. Create a VPC peering connection and configure routes.

    1. Log on to the VPC console. In the navigation pane on the left, click VPC. Select the regions of the Tablestore instance and the DataWorks workspace, and record the CIDR block for each VPC.

      image

    2. In the navigation pane on the left, click VPC Peering Connection. On the VPC Peering Connection page, click Create VPC Peering Connection.

    3. On the Create VPC Peering Connection page, enter a name for the peering connection, select the requester VPC, accepter account type, accepter region, and accepter VPC, and then click OK.

    4. On the VPC Peering Connection page, find the VPC peering connection that you created. In the Requester VPC and Accepter VPC columns, click Configure route.

      The destination CIDR block must be the CIDR block of the peer VPC. When you configure a route for the requester VPC, enter the CIDR block of the accepter VPC. When you configure a route for the accepter VPC, enter the CIDR block of the requester VPC.

Procedure

Step 1: Add a Table Store data source

  1. Go to the Data Integration page.

    Log on to the DataWorks console. After switching to the destination region, click Data Integration > Data Integration in the navigation pane on the left. In the drop-down list, select the corresponding workspace and click Go to Data Integration.

  2. In the navigation pane on the left, click Data Source.

  3. On the Data Sources page, click Add Data Source.

  4. In the Add Data Source dialog box, select Tablestore as the data source type.

  5. In the Add Tablestore Data Source dialog box, set the data source parameters as outlined in the table below.

    Parameter

    Description

    Data Source Name

    The name of the data source. The name can contain only letters, digits, and underscores (_), and must start with a letter.

    Data Source Description

    The description of the data source. The description cannot exceed 80 characters in length.

    Region

    Select the region where the Tablestore instance is located.

    Tablestore Instance Name

    The name of the Tablestore instance.

    Endpoint

    The endpoint of the Tablestore instance. We recommend that you use the VPC address.

    AccessKey ID

    The AccessKey ID and AccessKey secret of an Alibaba Cloud account or a Resource Access Management (RAM) user.

    AccessKey Secret

  6. Test the connectivity of the resource group. You must perform this test when you create a data source to ensure that the resource group used by the sync task can connect to the data source. Otherwise, the data synchronization task cannot run properly.

    1. In the Connection Configuration section, for the target resource group, click Test Network Connectivity in the Connection Status column.

    2. After the connectivity test passes, the Connectivity Status changes to Connected. Click Complete. The new data source then appears in the data source list.

      Note

      If the connectivity test fails and the status is Failed, use the connectivity diagnostic tool to troubleshoot the issue. If the resource group still cannot connect to the data source, submit a ticket or .

Step 2: Add MaxCompute data source

Similar to Step 1, in the Add Data Source dialog box, search for and select MaxCompute as the data source type, and then configure the related data source parameters.

Step 3: Configure an offline synchronization

DataStudio (legacy)

Step 1: Create a task node

  1. Go to the Data Development page.

    1. Log on to the DataWorks console.

    2. In the top navigation bar, select a resource group and region.

    3. In the navigation pane on the left, click Data Development and O&M > Data Development.

    4. On the Data Development page, select the target workspace from the drop-down list and click the Go to Data Development button.

  2. On the Data Development page of the Data Studio console, under the Business Flow node, click the target business flow.

    For more information, see Create a business flow.

  3. Right-click the Data Integration node and choose Create Node > Batch Synchronization.

  4. In the Create Node dialog box, select a path, enter a name, and click OK.

    The new batch synchronization node is displayed under the Data Integration node.

Step 2: Configure the synchronization task

  1. Under the Data Integration node, double-click the new batch synchronization task node.

  2. Configure the network and resources.

    Select the data source, data destination, and the resource group for the synchronization task, and then test the connectivity.

    1. In the Configure Network and Resources step, set Data Source to MaxCompute and select the newly added MaxCompute data source for Data Source Name.

    2. Select a resource group.

      After you select a resource group, the system displays its region and specifications, and automatically tests the connectivity between the resource group and the selected data source.

      Note

      Serverless resource groups allow you to specify an upper limit of compute units (CUs) for a synchronization task. If your synchronization task fails with an Out of Memory (OOM) error due to insufficient resources, you must increase the number of CUs for the resource group.

    3. Set Data Destination to Tablestore and select the newly added Tablestore data source for Data Source Name.

      The system automatically tests the connectivity between the resource group and the selected data source.

    4. After the connectivity test succeeds, click Next.

  3. Configure and save the task.

    Wizard mode

    1. In the Configure Task step, configure the data source and data destination in the Configure Data Source and Destination area.

      Data source

      Parameter

      Description

      Data Source

      Defaults to the MaxCompute data source that you selected in the previous step.

      Tunnel resource group

      This parameter specifies the Tunnel quota. By default, the public data transfer resource is used, which is the free quota for MaxCompute.

      For more information about MaxCompute data transfer resources, see Purchase and use a dedicated data transfer service resource group.

      Note

      If a dedicated Tunnel quota becomes unavailable due to overdue payments or expiration, the task automatically switches to the public data transfer resource.

      Table

      The source table.

      Filter by

      The filtering logic for data synchronization. The following methods are supported:

      • Partition Filter: Filters the scope of offline synchronization based on a partition expression. If you select this method, you must also configure the Partition Information and If Partition Does Not Exist parameters.

      • Data Filter: Filters the scope of data synchronization by using a WHERE clause. You do not need to enter the WHERE keyword.

      Partition Information

      Note

      This parameter is required if you set Filter by to Partition Filter.

      Specify the value of the partition column.

      • You can use a fixed value, such as ds=20220101.

      • You can use a scheduling parameter, such as ds=${bizdate}. The scheduling parameter is replaced with its actual value at runtime.

      If Partition Does Not Exist

      Note

      This parameter is required if you set Filter by to Partition Filter.

      The policy for handling the synchronization task if the partition does not exist.

      • Fail

      • Ignore non-existent partitions and run the task

      Data destination

      Parameter

      Description

      Data Source

      Defaults to the Tablestore data source that you selected in the previous step.

      Table

      The destination table.

      Primary Key Information

      The primary key of the destination table.

      write mode

      The mode for writing data to Tablestore. The following modes are supported:

      • PutRow: Corresponds to the PutRow API operation of Tablestore. Inserts data into a specified row. If the row does not exist, a new row is created. If the row exists, the existing row is overwritten.

      • UpdateRow: Corresponds to the UpdateRow API operation of Tablestore. Updates data in a specified row. If the row does not exist, a new row is created. If the row exists, the values of specified columns are added, modified, or deleted based on the request.

    2. Configure field mapping.

      You must specify the mapping between source fields and target fields. The task uses this mapping to write data from the source fields to the corresponding target fields. For more information, see Configure field mapping.

      Important
      • You must specify the primary key in the source field list to read the primary key data.

      • Because the primary key information for the target table is already configured under Data Destination, you cannot configure the primary key information again in the target field list.

      • For fields with an INTEGER data type, you must configure them as INT. DataWorks automatically converts INT to the INTEGER type. If you configure the type as INTEGER directly, the task fails with an error.

    3. Configure channel control.

      You can use channel control to manage properties related to the offline synchronization process, such as concurrency and rate limiting. For more information about the parameters, see Relationship between concurrency and rate limiting for offline synchronization.

    4. Click the image.png icon to save the configuration.

    Script mode

    1. In the Configure Task step, click the image.png icon, and then click OK in the dialog box that appears. In the toolbar of the task configuration page, click the </> icon to switch to script mode.

    2. On the script configuration page, edit the script.

      The following code provides a script example. Replace the parameters with your specific values.

      {
        "type": "job",
        "version": "2.0",
        "steps": [
          {
            "stepType": "odps",
            "parameter": {
              "partition": [],
              "datasource": "MyMaxComputeDataSource",
              "table": "source_table",
              "column": [
                "pk",
                "col_string",
                "col_long",
                "col_double",
                "col_bool"
              ]
            },
            "name": "Reader",
            "category": "reader"
          },
          {
            "stepType": "ots",
            "parameter": {
              "datasource": "MyTablestoreDataSource",
              "table": "target_table",
              "newVersion": "true",
              "mode": "normal",
              "writeMode": "updateRow",
              "column": [
                {
                  "name": "col_string",
                  "type": "STRING"
                },
                {
                  "name": "col_long",
                  "type": "INT"
                },
                {
                  "name": "col_double",
                  "type": "DOUBLE"
                },
                {
                  "name": "col_bool",
                  "type": "BOOL"
                }
              ],
              "primaryKey": [
                {
                  "name": "pk",
                  "type": "STRING"
                }
              ]
            },
            "name": "Writer",
            "category": "writer"
          },
          {
            "name": "Processor",
            "stepType": null,
            "category": "processor",
            "parameter": {}
          }
        ],
        "setting": {
          "executeMode": null,
          "errorLimit": {
            "record": "0"
          },
          "speed": {
            "throttle": true,
            "concurrent": 2,
            "mbps": "12"
          }
        },
        "order": {
          "hops": [
            {
              "from": "Reader",
              "to": "Writer"
            }
          ]
        }
      }
      Important
      • You must specify the primary key in the column parameter of MaxCompute Reader to read the primary key data.

      • For fields with an INTEGER data type, you must configure them as INT. DataWorks automatically converts INT to the INTEGER type. If you configure the type as INTEGER directly, the task fails with an error.

      • The following table describes the parameters that you need to replace for MaxCompute Reader.

        Parameter

        Description

        partition

        The partition from which data is read.

        Note

        This parameter is required only for partitioned tables.

        datasource

        The name of the MaxCompute data source.

        table

        The name of the source table.

        column

        The columns to read from the MaxCompute source table.

      • The following table describes the parameters that you need to replace for Tablestore Writer.

        Parameter

        Description

        datasource

        The name of the Tablestore data source.

        table

        The name of the destination table.

        primaryKey

        The primary key columns of the destination table.

        column

        The attribute columns to write.

        Important

        The order of columns must correspond to the order of columns specified in the MaxCompute Reader configuration.

    3. Click the image.png icon to save the configuration.

Step 3: Run the synchronization task

  1. Click the 1680170333627-a1e19a43-4e2a-4340-9564-f53f2fa6806e icon.

  2. In the Parameters dialog box, select the resource group for the task.

  3. Click Run.

Data studio (new version)

Step 1: Create a task node

  1. Go to the Data Development page.

    1. Log on to the DataWorks console.

    2. In the top navigation bar, select a resource group and region.

    3. In the navigation pane on the left, click Data Development and O&M > Data Development.

    4. On the Data Development page, select your target workspace from the drop-down list and click Go to Data Studio.

  2. In data development page of the Data Studio console, click the image icon to the right of Workspace Directories, and select Create Node > Data Integration > Batch Synchronization.

    Note

    If this is your first time using Workspace Directories, you can also click the Create Node button.

  3. In the Create Node dialog box, select a path, enter a name, and click OK.

    The new batch synchronization node appears in the Workspace Directories.

Step 2: Configure the synchronization task

  1. In the project directory, click the new offline synchronization task node.

  2. Configure the network and resources.

    Select the data source, data destination, and the resource group for the synchronization task, and then test the connectivity.

    1. In the Configure Network and Resources step, set Data Source to MaxCompute and select the newly added MaxCompute data source for Data Source Name.

    2. Select a resource group.

      After you select a resource group, the system displays its region and specifications, and automatically tests the connectivity between the resource group and the selected data source.

      Note

      Serverless resource groups allow you to specify an upper limit of compute units (CUs) for a synchronization task. If your synchronization task fails with an Out of Memory (OOM) error due to insufficient resources, you must increase the number of CUs for the resource group.

    3. Set Data Destination to Tablestore and select the newly added Tablestore data source for Data Source Name.

      The system automatically tests the connectivity between the resource group and the selected data source.

    4. After the connectivity test succeeds, click Next.

  3. Configure and save the task.

    Wizard mode

    1. In the Configure Task step, configure the data source and data destination in the Configure Data Source and Destination area.

      Data source

      Parameter

      Description

      Data Source

      Defaults to the MaxCompute data source that you selected in the previous step.

      Tunnel resource group

      This parameter specifies the Tunnel quota. By default, the public data transfer resource is used, which is the free quota for MaxCompute.

      For more information about MaxCompute data transfer resources, see Purchase and use a dedicated data transfer service resource group.

      Note

      If a dedicated Tunnel quota becomes unavailable due to overdue payments or expiration, the task automatically switches to the public data transfer resource.

      Table

      The source table.

      Filter by

      The filtering logic for data synchronization. The following methods are supported:

      • Partition Filter: Filters the scope of offline synchronization based on a partition expression. If you select this method, you must also configure the Partition Information and If Partition Does Not Exist parameters.

      • Data Filter: Filters the scope of data synchronization by using a WHERE clause. You do not need to enter the WHERE keyword.

      Partition Information

      Note

      This parameter is required if you set Filter by to Partition Filter.

      Specify the value of the partition column.

      • You can use a fixed value, such as ds=20220101.

      • You can use a scheduling parameter, such as ds=${bizdate}. The scheduling parameter is replaced with its actual value at runtime.

      If Partition Does Not Exist

      Note

      This parameter is required if you set Filter by to Partition Filter.

      The policy for handling the synchronization task if the partition does not exist.

      • Fail

      • Ignore non-existent partitions and run the task

      Data destination

      Parameter

      Description

      Data Source

      Defaults to the Tablestore data source that you selected in the previous step.

      Table

      The destination table.

      Primary Key Information

      The primary key of the destination table.

      write mode

      The mode for writing data to Tablestore. The following modes are supported:

      • PutRow: Corresponds to the PutRow API operation of Tablestore. Inserts data into a specified row. If the row does not exist, a new row is created. If the row exists, the existing row is overwritten.

      • UpdateRow: Corresponds to the UpdateRow API operation of Tablestore. Updates data in a specified row. If the row does not exist, a new row is created. If the row exists, the values of specified columns are added, modified, or deleted based on the request.

    2. Configure field mapping.

      You must specify the mapping between source fields and target fields. The task uses this mapping to write data from the source fields to the corresponding target fields. For more information, see Configure field mapping.

      Important
      • You must specify the primary key in the source field list to read the primary key data.

      • Because the primary key information for the target table is already configured under Data Destination, you cannot configure the primary key information again in the target field list.

      • For fields with an INTEGER data type, you must configure them as INT. DataWorks automatically converts INT to the INTEGER type. If you configure the type as INTEGER directly, the task fails with an error.

    3. Configure channel control.

      You can use channel control to manage properties related to the offline synchronization process, such as concurrency and rate limiting. For more information about the parameters, see Relationship between concurrency and rate limiting for offline synchronization.

    4. Click Save.

    Script mode

    1. In the Configure Task step, click Script Mode, and then click OK in the dialog box that appears. In the toolbar on the task configuration page, click Script Mode to switch to script editing mode.

    2. On the script configuration page, edit the script.

      The following code provides a script example. Replace the parameters with your specific values.

      {
        "type": "job",
        "version": "2.0",
        "steps": [
          {
            "stepType": "odps",
            "parameter": {
              "partition": [],
              "datasource": "MyMaxComputeDataSource",
              "table": "source_table",
              "column": [
                "pk",
                "col_string",
                "col_long",
                "col_double",
                "col_bool"
              ]
            },
            "name": "Reader",
            "category": "reader"
          },
          {
            "stepType": "ots",
            "parameter": {
              "datasource": "MyTablestoreDataSource",
              "table": "target_table",
              "newVersion": "true",
              "mode": "normal",
              "writeMode": "updateRow",
              "column": [
                {
                  "name": "col_string",
                  "type": "STRING"
                },
                {
                  "name": "col_long",
                  "type": "INT"
                },
                {
                  "name": "col_double",
                  "type": "DOUBLE"
                },
                {
                  "name": "col_bool",
                  "type": "BOOL"
                }
              ],
              "primaryKey": [
                {
                  "name": "pk",
                  "type": "STRING"
                }
              ]
            },
            "name": "Writer",
            "category": "writer"
          },
          {
            "name": "Processor",
            "stepType": null,
            "category": "processor",
            "parameter": {}
          }
        ],
        "setting": {
          "executeMode": null,
          "errorLimit": {
            "record": "0"
          },
          "speed": {
            "throttle": true,
            "concurrent": 2,
            "mbps": "12"
          }
        },
        "order": {
          "hops": [
            {
              "from": "Reader",
              "to": "Writer"
            }
          ]
        }
      }
      Important
      • You must specify the primary key in the column parameter of MaxCompute Reader to read the primary key data.

      • For fields with an INTEGER data type, you must configure them as INT. DataWorks automatically converts INT to the INTEGER type. If you configure the type as INTEGER directly, the task fails with an error.

      • The following table describes the parameters that you need to replace for MaxCompute Reader.

        Parameter

        Description

        partition

        The partition from which data is read.

        Note

        This parameter is required only for partitioned tables.

        datasource

        The name of the MaxCompute data source.

        table

        The name of the source table.

        column

        The columns to read from the MaxCompute source table.

      • The following table describes the parameters that you need to replace for Tablestore Writer.

        Parameter

        Description

        datasource

        The name of the Tablestore data source.

        table

        The name of the destination table.

        primaryKey

        The primary key columns of the destination table.

        column

        The attribute columns to write.

        Important

        The order of columns must correspond to the order of columns specified in the MaxCompute Reader configuration.

    3. Click Save.

Step 3: Run the synchronization task

  1. Click Runtime Configuration to the right of the task, and select the resource group for the task.

  2. Click Run.

Step 4: View synchronization results

After the synchronization task runs, you can view its execution status in the log and the synchronized data in the destination data table in the Tablestore console.

  1. View the task execution status.

    1. On the Result tab of the synchronization task, check the value of Current task status.

      If Current task status is FINISH, the task is complete.

    2. To view detailed execution logs, click the Detail log url link.

  2. View the synchronized data in the destination data table.

    1. Go to the Instance Management page.

      1. Log in to the Tablestore console.

      2. In the top navigation bar, select a resource group and a region.

      3. On the Overview page, click the instance alias, or click Instance Management in the Actions column of the instance.

    2. On the Instance Details tab, click the Data Table List tab.

    3. On the Data Table List tab, click the name of the destination data table.

    4. On the Data Management tab, you can view the data synchronized to this data table.

上一篇: Synchronize HBase data 下一篇: Synchronize Tablestore data
阿里云首页 表格存储 相关技术圈