Export all data to MaxCompute

Use the DataWorks data integration feature to export full data from Tablestore to MaxCompute for offline analysis and processing.

Prerequisites

Before you export data, complete the following tasks:

Note

If your DataWorks workspace and Tablestore instance are in different regions, you must create a VPC peering connection to enable cross-region network connectivity.

Create a VPC peering connection for cross-region network connectivity

The following example demonstrates a scenario where the source Tablestore instance is in the China (Shanghai) region and the DataWorks workspace is in the China (Hangzhou) region.

  1. Attach a VPC to the Tablestore instance.

    1. Log on to the Tablestore console. In the top navigation bar, select the region where the target table is located.

    2. Click the instance alias to navigate to the Instance Management page.

    3. On the Network Management tab, click Bind VPC. Select a VPC and vSwitch, enter a VPC name, and then click OK.

    4. Wait for the VPC to attach. The page automatically refreshes to display the VPC ID and VPC Address in the VPC list.

      Note

      When you add a Tablestore data source in the DataWorks console, you must use this VPC address.

  2. Obtain the VPC information for the DataWorks workspace resource group.

    1. Log on to the DataWorks console. In the top navigation bar, select the region where your workspace is located. In the navigation pane on the left, click Workspace to go to the Workspaces page.

    2. Click the workspace name to go to the Workspace Details page. In the left navigation pane, click Resource Group to view the resource groups attached to the workspace.

    3. To the right of the target resource group, click Network Settings. In the Data Scheduling & Data Integration section, view the VPC ID of the attached virtual private cloud.

  3. Create a VPC peering connection and configure routes.

    1. Log on to the VPC console. In the navigation pane on the left, click VPC Peering Connection and then click Create VPC Peering Connection.

    2. On the Create VPC Peering Connection page, enter a name for the peering connection and select the requester VPC instance, accepter account type, accepter region, and accepter VPC instance. Then, click OK.

    3. On the VPC Peering Connection page, find the VPC peering connection and click Configure route in the Requester VPC and Accepter columns.

      For the destination CIDR block, enter the CIDR block of the peer VPC. For example, when you configure a route entry for the requester VPC, enter the CIDR block of the accepter VPC. When you configure a route entry for the accepter VPC, enter the CIDR block of the requester VPC.
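
The route rule above (each side points at the other side's CIDR block) can be sketched in a few lines. This is a minimal illustration, not a console or API call: the function name `peering_routes` and both CIDR blocks are hypothetical, and the overlap check is an added assumption that overlapping blocks cannot be routed unambiguously.

```python
import ipaddress


def peering_routes(requester_cidr: str, accepter_cidr: str) -> dict:
    """Return the destination CIDR block to enter on each side of the peering."""
    requester = ipaddress.ip_network(requester_cidr)
    accepter = ipaddress.ip_network(accepter_cidr)
    # Assumption: overlapping blocks cannot be routed unambiguously, so flag them early.
    if requester.overlaps(accepter):
        raise ValueError(f"CIDR blocks overlap: {requester} and {accepter}")
    return {
        # The route entry in the requester VPC targets the accepter VPC's block.
        "requester_route_destination": str(accepter),
        # The route entry in the accepter VPC targets the requester VPC's block.
        "accepter_route_destination": str(requester),
    }


# Hypothetical CIDR blocks for the requester and accepter VPCs.
routes = peering_routes("192.168.0.0/16", "172.16.0.0/12")
print(routes)
```

Replace the two sample blocks with the CIDR blocks shown for your own VPCs in the VPC console.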

Procedure

Follow these steps to configure a full data export from Tablestore to MaxCompute.

Step 1: Add Tablestore data source

Add a Tablestore data source in DataWorks to establish a connection with the source table.

  1. Log on to the DataWorks console. Switch to the destination region. In the navigation pane on the left, choose Data Integration > Data Integration. From the drop-down list, select the workspace and click Go to Data Integration.

  2. In the navigation pane on the left, click Data source.

  3. On the Data Sources page, click Add Data Source.

  4. In the Add Data Source dialog box, search for and select Tablestore as the data source type.

  5. In the Add OTS Data Source dialog box, configure the data source parameters as described in the following table.

    | Parameter | Description |
    | --- | --- |
    | Data Source Name | The data source name must be a combination of letters, digits, and underscores (_). It cannot start with a digit or an underscore (_). |
    | Data Source Description | A brief description of the data source. The description cannot exceed 80 characters in length. |
    | Region | Select the region where the Tablestore instance resides. |
    | Tablestore Instance Name | The name of the Tablestore instance. |
    | Endpoint | The endpoint of the Tablestore instance. Use the VPC address. |
    | AccessKey ID | The AccessKey ID of the Alibaba Cloud account or RAM user. |
    | AccessKey Secret | The AccessKey secret of the Alibaba Cloud account or RAM user. |

  6. Test the resource group connectivity.

    When you create a data source, you must test the connectivity of the resource group to ensure that the resource group for the sync task can connect to the data source. Otherwise, the data sync task cannot run.

    1. In the Connection Configuration section, click Test Network Connectivity in the Connection Status column for the resource group.

    2. After the connectivity test passes, click Complete. The new data source appears in the data source list.

      If the connectivity test fails, use the Network Connectivity Diagnostic Tool to troubleshoot the issue.
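
The naming and description limits listed for the data source parameters can be checked before you type them into the dialog box. The following is a minimal local sketch, not part of DataWorks: the helper `check_data_source` is hypothetical, and it assumes that "letters" means ASCII letters, which may be narrower than what the console accepts.

```python
import re

# Assumption: "letters" means ASCII letters; the console may accept a wider set.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
MAX_DESCRIPTION_LENGTH = 80


def check_data_source(name: str, description: str = "") -> list:
    """Return a list of rule violations; an empty list means both values pass."""
    problems = []
    # The name may contain letters, digits, and underscores,
    # and cannot start with a digit or an underscore.
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name must use letters, digits, and underscores, and start with a letter")
    # The description cannot exceed 80 characters.
    if len(description) > MAX_DESCRIPTION_LENGTH:
        problems.append("description exceeds 80 characters")
    return problems


print(check_data_source("source_data", "Tablestore source for full export"))
print(check_data_source("_source", "x" * 81))
```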

Step 2: Add MaxCompute data source

Add and configure a MaxCompute data source as the destination for the data export.

  1. Click Add data source again. Select MaxCompute as the data source type and configure the parameters.

    | Parameter | Description |
    | --- | --- |
    | Data source name | The name must consist of letters, digits, and underscores (_). It cannot start with a digit or an underscore (_). |
    | Data source description | A brief description of the data source, not exceeding 80 characters. |
    | Authentication method | This value is set to Alibaba Cloud account and Alibaba Cloud RAM role by default and cannot be changed. |
    | Alibaba Cloud Account | Current Alibaba Cloud Account: Select the MaxCompute project name and Default Access Identity for the current account in the specified region. Another Alibaba Cloud Account: Enter the UID of the other Alibaba Cloud account, and specify its MaxCompute project and RAM role in the specified region. |
    | Region | The region where the MaxCompute project is located. |
    | Endpoint | The default value is Auto adapt. You can also select Custom configuration as needed. |

  2. After you configure the parameters and the connectivity test succeeds, click Complete to add the data source.

Step 3: Configure batch synchronization task

Create a data synchronization task to define the transfer rules and field mappings for exporting data from Tablestore to MaxCompute.

Create task node

  1. Go to the Data development page.

    1. Log on to the DataWorks console.

    2. In the top navigation bar, select a resource group and a region.

    3. In the left navigation pane, click .

    4. Select the target workspace and click Go to Data Studio.

  2. On the Data development page of the Data Studio console, click the icon to the right of Workspace Directory, and then choose Batch Synchronization.

  3. In the Create node dialog box, select a Path. Set the data source to Tablestore and the destination to MaxCompute (ODPS). Enter a Name and click OK.

Configure synchronization task

Under Workspace Directory, click to open the newly created batch synchronization task node. You can configure the task by using either the codeless UI or the code editor.

Codeless UI (default)

Configure the following items in the codeless UI:

  • Data source: Select the source and destination data sources.

  • Runtime Resource: Select a resource group. The system then automatically tests the data source connectivity.

  • Data Source:

    • Table: Select the source data table from the drop-down list.

    • Primary Key Range (Start): The start of the primary key range from which to read data. The value must be in JSON array format. inf_min represents negative infinity.

      For example, if the primary key consists of an int column named id and a string column named name, the sample configuration is as follows:

      Specific primary key range:

      [
        {
          "type": "int",
          "value": "000"
        },
        {
          "type": "string",
          "value": "aaa"
        }
      ]

      Full data:

      [
        {
          "type": "inf_min"
        },
        {
          "type": "inf_min"
        }
      ]
    • Primary Key Range (End): The end of the primary key range from which to read data. The value must be in JSON array format. inf_max represents positive infinity.

      For example, if the primary key consists of an int column named id and a string column named name, the sample configuration is as follows:

      Specific primary key range:

      [
        {
          "type": "int",
          "value": "999"
        },
        {
          "type": "string",
          "value": "zzz"
        }
      ]

      Full data:

      [
        {
          "type": "inf_max"
        },
        {
          "type": "inf_max"
        }
      ]
    • Splitting Configuration: Custom splitting configuration in JSON array format. In most cases, you do not need to configure this parameter (set it to []).

      If data hotspots occur in your Tablestore instance and the automatic splitting policy of Tablestore Reader is ineffective, we recommend that you use custom splitting rules. A split specifies the split points within the primary key range. You only need to configure the shard keys, not all primary keys.

  • Destination: Configure the following items. You can keep the default values for other parameters or modify them as needed.

    • Project Name in Production Environment: Displays the name of the MaxCompute project associated with the destination data source.

    • Tunnel Resource Group: By default, Common transmission resources is selected, which is the free quota of MaxCompute. You can select a dedicated Tunnel resource group as needed.

    • Table: Select the destination table. You can click Generate Target Table Schema to automatically generate the destination table.

    • Partition: The synchronized data is saved in a partition for a specified date. This can be used for daily incremental synchronization.

    • Write Mode: Select whether to clear existing data or append new data.

  • Destination Field Mapping: Maps the fields from the source to the destination table. The system provides a default mapping based on the source table fields, which you can modify as needed.

After you complete the configuration, click Save at the top of the page.
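
The Primary Key Range (Start) and Primary Key Range (End) values described above can also be generated rather than typed by hand. The following is a minimal sketch: the helpers `full_range` and `bounded_range` are hypothetical, and the `begin`, `end`, and `split` keys mirror the `range` object used in the code editor script. It emits one entry per primary key column, in primary key order.

```python
import json


def full_range(primary_key_count: int) -> dict:
    """Range that covers every row: inf_min to inf_max on each primary key column."""
    return {
        "begin": [{"type": "inf_min"} for _ in range(primary_key_count)],
        "end": [{"type": "inf_max"} for _ in range(primary_key_count)],
        "split": [],  # empty list: no custom splitting configuration
    }


def bounded_range(begin, end) -> dict:
    """Range between explicit values; each argument is a list of (type, value) pairs."""
    if len(begin) != len(end):
        raise ValueError("begin and end need one entry per primary key column")
    return {
        "begin": [{"type": t, "value": v} for t, v in begin],
        "end": [{"type": t, "value": v} for t, v in end],
        "split": [],
    }


# Primary key from the example above: an int column id and a string column name.
specific = bounded_range(
    begin=[("int", "000"), ("string", "aaa")],
    end=[("int", "999"), ("string", "zzz")],
)
print(json.dumps(specific["begin"], indent=2))
print(json.dumps(full_range(2)["end"], indent=2))
```

Paste the printed arrays into the corresponding fields in the codeless UI.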

Code editor

To edit the script, click Code Editor at the top of the page.

The following sample script is for a source data table where the primary key consists of an int column named id and a string column named name. The attribute column is an int field named age. In your script, replace the values for the datasource and table parameters with your actual values.
{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "ots",
            "parameter": {
                "datasource": "source_data",
                "column": [
                    {
                        "name": "id",
                        "type": "INTEGER"
                    },
                    {
                        "name": "name",
                        "type": "STRING"
                    },
                    {
                        "name": "age",
                        "type": "INTEGER"
                    }
                ],
                "range": {
                    "begin": [
                        {
                            "type": "inf_min"
                        },
                        {
                            "type": "inf_min"
                        }
                    ],
                    "end": [
                        {
                            "type": "inf_max"
                        },
                        {
                            "type": "inf_max"
                        }
                    ],
                    "split": []
                },
                "table": "source_table",
                "newVersion": "true"
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "odps",
            "parameter": {
                "partition": "pt=${bizdate}",
                "truncate": true,
                "datasource": "target_data",
                "tunnelQuota": "default",
                "column": [
                    "id",
                    "name",
                    "age"
                ],
                "emptyAsNull": false,
                "guid": null,
                "table": "source_table",
                "consistencyCommit": false
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "0"
        },
        "speed": {
            "concurrent": 2,
            "throttle": false
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

After you finish editing the script, click Save at the top of the page.
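
Before saving, a quick local check can catch structural slips in the script. The following is a rough sketch under stated assumptions, not a DataWorks feature: `check_job` is a hypothetical helper, and it assumes that reader and writer columns are paired by position, so the two column lists should have the same length.

```python
def check_job(job: dict) -> list:
    """Return a list of problems found in a batch synchronization script."""
    problems = []
    steps = {step["name"]: step for step in job.get("steps", [])}
    readers = [s for s in steps.values() if s.get("category") == "reader"]
    writers = [s for s in steps.values() if s.get("category") == "writer"]
    if len(readers) != 1 or len(writers) != 1:
        return ["expected exactly one reader step and one writer step"]

    reader_params = readers[0]["parameter"]
    writer_params = writers[0]["parameter"]

    # Assumption: source and destination columns are paired by position,
    # so both lists should have the same length.
    if len(reader_params["column"]) != len(writer_params["column"]):
        problems.append("reader and writer column counts differ")

    # begin and end each need one entry per primary key column.
    key_range = reader_params["range"]
    if len(key_range["begin"]) != len(key_range["end"]):
        problems.append("range.begin and range.end have different lengths")

    # Every hop must connect two steps that exist in the script.
    for hop in job.get("order", {}).get("hops", []):
        for step_name in (hop["from"], hop["to"]):
            if step_name not in steps:
                problems.append(f"hop references unknown step: {step_name}")
    return problems


# Trimmed copy of the sample script above, written as a Python dict.
sample_job = {
    "steps": [
        {
            "name": "Reader",
            "category": "reader",
            "parameter": {
                "column": [
                    {"name": "id", "type": "INTEGER"},
                    {"name": "name", "type": "STRING"},
                    {"name": "age", "type": "INTEGER"},
                ],
                "range": {
                    "begin": [{"type": "inf_min"}, {"type": "inf_min"}],
                    "end": [{"type": "inf_max"}, {"type": "inf_max"}],
                    "split": [],
                },
            },
        },
        {
            "name": "Writer",
            "category": "writer",
            "parameter": {"column": ["id", "name", "age"]},
        },
    ],
    "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
}

print(check_job(sample_job))
```

To check a saved script instead of the inline dict, load it with `json.load` and pass the result to `check_job`.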

Run synchronization task

  1. On the right side of the page, click Debug configuration, select the resource group to use for the run, and add the Script parameters.

    • bizdate: The data partition of the MaxCompute destination table, such as 20251120.

  2. Click Run at the top of the page to start the synchronization task.
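
To see how the bizdate script parameter ends up in the writer's partition value `pt=${bizdate}`, the substitution can be imitated locally. This sketch only emulates the placeholder replacement with Python's `string.Template`; DataWorks performs the real substitution at run time. The helper names are hypothetical, and using the previous day as the partition date is an assumption, not a rule stated in this topic.

```python
from datetime import date, timedelta
from string import Template


def render_partition(partition_expr: str, bizdate: str) -> str:
    """Replace the ${bizdate} placeholder in a partition expression."""
    return Template(partition_expr).substitute(bizdate=bizdate)


def previous_day(today: date) -> str:
    """Assumed convention: partition by the day before the run date, as yyyymmdd."""
    return (today - timedelta(days=1)).strftime("%Y%m%d")


# Value taken from the example above.
print(render_partition("pt=${bizdate}", "20251120"))

# Derive a value from a run date instead of typing it.
print(render_partition("pt=${bizdate}", previous_day(date(2025, 11, 21))))
```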

Step 4: View synchronization results

After the task runs, view its execution status in the logs and check the synchronized data in the DataWorks console.

  1. View the task running status and result at the bottom of the page. The following log information indicates that the sync task ran successfully.

    2025-11-18 11:16:23 INFO Shell run successfully!
    2025-11-18 11:16:23 INFO Current task status: FINISH
    2025-11-18 11:16:23 INFO Cost time is: 77.208s
  2. View the results in the destination table.

    Go to the DataWorks console. In the left navigation pane, click . Then, click Go to data map to view the destination table and its data.
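
If you save the run log to a file, the success lines shown in step 1 can be checked with a short script. This is a minimal sketch: `summarize` is a hypothetical helper, and it assumes the log wording matches the sample lines above exactly.

```python
import re

# Sample lines copied from the log excerpt above.
LOG = """\
2025-11-18 11:16:23 INFO Shell run successfully!
2025-11-18 11:16:23 INFO Current task status: FINISH
2025-11-18 11:16:23 INFO Cost time is: 77.208s
"""


def summarize(log_text: str) -> dict:
    """Extract the final task status and the run time in seconds, if present."""
    status = re.search(r"Current task status: (\w+)", log_text)
    cost = re.search(r"Cost time is: ([\d.]+)s", log_text)
    return {
        "status": status.group(1) if status else None,
        "cost_seconds": float(cost.group(1)) if cost else None,
    }


summary = summarize(LOG)
print(summary)
```

A status other than FINISH, or a missing status line, means the log needs a closer look.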

FAQ