Synchronize MaxCompute data
You can use DataWorks data integration to export all data from MaxCompute to Tablestore.
Background
DataWorks is a unified, end-to-end big data development and governance platform built on big data engines such as MaxCompute, Hologres, EMR, AnalyticDB, and CDP. It provides comprehensive support for data warehouse, data lake, and lakehouse solutions.
Data Integration is a stable, efficient, and elastically scalable platform that reliably synchronizes data at high speeds between various heterogeneous data sources in complex network environments.
Synchronizing processed data from MaxCompute to Tablestore with DataWorks lets you use the high-concurrency throughput and complex query capabilities of Tablestore for real-time analytics and versatile retrieval.
Prerequisites
-
You have information about the MaxCompute project and the source table.
-
You have enabled the Tablestore service, created an instance, and created a destination data table. You also have the instance name, instance endpoint, and region ID.
-
You have created an AccessKey for your Alibaba Cloud account or a RAM user with the required permissions for Tablestore and MaxCompute.
-
You have enabled DataWorks and created a workspace in the same region as your MaxCompute or Tablestore instance.
-
You have created a Serverless resource group and bound it to the workspace. For billing information, see Billing of Serverless resource groups.
If your MaxCompute and Tablestore instances are in different regions, create a VPC peering connection to establish cross-region network connectivity.
Procedure
Step 1: Add a Table Store data source
Go to the Data Integration page.
Log on to the DataWorks console. After switching to the destination region, click in the navigation pane on the left. In the drop-down list, select the corresponding workspace and click Go to Data Integration.
In the navigation pane on the left, click Data Source.
On the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, select Tablestore as the data source type.
In the Add Tablestore Data Source dialog box, set the data source parameters as outlined in the table below.
Parameter
Description
Data Source Name
The name of the data source. The name can contain only letters, digits, and underscores (_), and must start with a letter.
Data Source Description
The description of the data source. The description cannot exceed 80 characters in length.
Region
Select the region where the Tablestore instance is located.
Tablestore Instance Name
The name of the Tablestore instance.
Endpoint
The endpoint of the Tablestore instance. We recommend that you use the VPC address.
AccessKey ID
The AccessKey ID and AccessKey secret of an Alibaba Cloud account or a Resource Access Management (RAM) user.
AccessKey Secret
Test the connectivity of the resource group. You must perform this test when you create a data source to ensure that the resource group used by the sync task can connect to the data source. Otherwise, the data synchronization task cannot run properly.
In the Connection Configuration section, for the target resource group, click Test Network Connectivity in the Connection Status column.
After the connectivity test passes, the Connectivity Status changes to Connected. Click Complete. The new data source then appears in the data source list.
NoteIf the connectivity test fails and the status is Failed, use the connectivity diagnostic tool to troubleshoot the issue. If the resource group still cannot connect to the data source, submit a ticket or .
Step 2: Add MaxCompute data source
Similar to Step 1, in the Add Data Source dialog box, search for and select MaxCompute as the data source type, and then configure the related data source parameters.
Step 3: Configure an offline synchronization
DataStudio (legacy)
Step 1: Create a task node
Go to the Data Development page.
Log on to the DataWorks console.
In the top navigation bar, select a resource group and region.
In the navigation pane on the left, click .
On the Data Development page, select the target workspace from the drop-down list and click the Go to Data Development button.
On the Data Development page of the Data Studio console, under the Business Flow node, click the target business flow.
For more information, see Create a business flow.
Right-click the Data Integration node and choose Create Node > Batch Synchronization.
In the Create Node dialog box, select a path, enter a name, and click OK.
The new batch synchronization node is displayed under the Data Integration node.
Step 2: Configure the synchronization task
-
Under the Data Integration node, double-click the new batch synchronization task node.
-
Configure the network and resources.
Select the data source, data destination, and the resource group for the synchronization task, and then test the connectivity.
-
In the Configure Network and Resources step, set Data Source to MaxCompute and select the newly added MaxCompute data source for Data Source Name.
-
Select a resource group.
After you select a resource group, the system displays its region and specifications, and automatically tests the connectivity between the resource group and the selected data source.
NoteServerless resource groups allow you to specify an upper limit of compute units (CUs) for a synchronization task. If your synchronization task fails with an Out of Memory (OOM) error due to insufficient resources, you must increase the number of CUs for the resource group.
-
Set Data Destination to Tablestore and select the newly added Tablestore data source for Data Source Name.
The system automatically tests the connectivity between the resource group and the selected data source.
-
After the connectivity test succeeds, click Next.
-
-
Configure and save the task.
Wizard mode
-
In the Configure Task step, configure the data source and data destination in the Configure Data Source and Destination area.
Data source
Parameter
Description
Data Source
Defaults to the MaxCompute data source that you selected in the previous step.
Tunnel resource group
This parameter specifies the Tunnel quota. By default, the public data transfer resource is used, which is the free quota for MaxCompute.
For more information about MaxCompute data transfer resources, see Purchase and use a dedicated data transfer service resource group.
NoteIf a dedicated Tunnel quota becomes unavailable due to overdue payments or expiration, the task automatically switches to the public data transfer resource.
Table
The source table.
Filter by
The filtering logic for data synchronization. The following methods are supported:
-
Partition Filter: Filters the scope of offline synchronization based on a partition expression. If you select this method, you must also configure the Partition Information and If Partition Does Not Exist parameters.
-
Data Filter: Filters the scope of data synchronization by using a WHERE clause. You do not need to enter the
WHEREkeyword.
Partition Information
NoteThis parameter is required if you set Filter by to Partition Filter.
Specify the value of the partition column.
-
You can use a fixed value, such as
ds=20220101. -
You can use a scheduling parameter, such as
ds=${bizdate}. The scheduling parameter is replaced with its actual value at runtime.
If Partition Does Not Exist
NoteThis parameter is required if you set Filter by to Partition Filter.
The policy for handling the synchronization task if the partition does not exist.
-
Fail
-
Ignore non-existent partitions and run the task
Data destination
Parameter
Description
Data Source
Defaults to the Tablestore data source that you selected in the previous step.
Table
The destination table.
Primary Key Information
The primary key of the destination table.
write mode
The mode for writing data to Tablestore. The following modes are supported:
-
PutRow: Corresponds to the
PutRowAPI operation of Tablestore. Inserts data into a specified row. If the row does not exist, a new row is created. If the row exists, the existing row is overwritten. -
UpdateRow: Corresponds to the
UpdateRowAPI operation of Tablestore. Updates data in a specified row. If the row does not exist, a new row is created. If the row exists, the values of specified columns are added, modified, or deleted based on the request.
-
-
Configure field mapping.
You must specify the mapping between source fields and target fields. The task uses this mapping to write data from the source fields to the corresponding target fields. For more information, see Configure field mapping.
Important-
You must specify the primary key in the source field list to read the primary key data.
-
Because the primary key information for the target table is already configured under Data Destination, you cannot configure the primary key information again in the target field list.
-
For fields with an INTEGER data type, you must configure them as INT. DataWorks automatically converts INT to the INTEGER type. If you configure the type as INTEGER directly, the task fails with an error.
-
-
Configure channel control.
You can use channel control to manage properties related to the offline synchronization process, such as concurrency and rate limiting. For more information about the parameters, see Relationship between concurrency and rate limiting for offline synchronization.
-
Click the
icon to save the configuration.
Script mode
-
In the Configure Task step, click the
icon, and then click OK in the dialog box that appears. In the toolbar of the task configuration page, click the </> icon to switch to script mode. -
On the script configuration page, edit the script.
The following code provides a script example. Replace the parameters with your specific values.
{ "type": "job", "version": "2.0", "steps": [ { "stepType": "odps", "parameter": { "partition": [], "datasource": "MyMaxComputeDataSource", "table": "source_table", "column": [ "pk", "col_string", "col_long", "col_double", "col_bool" ] }, "name": "Reader", "category": "reader" }, { "stepType": "ots", "parameter": { "datasource": "MyTablestoreDataSource", "table": "target_table", "newVersion": "true", "mode": "normal", "writeMode": "updateRow", "column": [ { "name": "col_string", "type": "STRING" }, { "name": "col_long", "type": "INT" }, { "name": "col_double", "type": "DOUBLE" }, { "name": "col_bool", "type": "BOOL" } ], "primaryKey": [ { "name": "pk", "type": "STRING" } ] }, "name": "Writer", "category": "writer" }, { "name": "Processor", "stepType": null, "category": "processor", "parameter": {} } ], "setting": { "executeMode": null, "errorLimit": { "record": "0" }, "speed": { "throttle": true, "concurrent": 2, "mbps": "12" } }, "order": { "hops": [ { "from": "Reader", "to": "Writer" } ] } }Important-
You must specify the primary key in the
columnparameter of MaxCompute Reader to read the primary key data. -
For fields with an INTEGER data type, you must configure them as INT. DataWorks automatically converts INT to the INTEGER type. If you configure the type as INTEGER directly, the task fails with an error.
-
The following table describes the parameters that you need to replace for MaxCompute Reader.
Parameter
Description
partition
The partition from which data is read.
NoteThis parameter is required only for partitioned tables.
datasource
The name of the MaxCompute data source.
table
The name of the source table.
column
The columns to read from the MaxCompute source table.
-
The following table describes the parameters that you need to replace for Tablestore Writer.
Parameter
Description
datasource
The name of the Tablestore data source.
table
The name of the destination table.
primaryKey
The primary key columns of the destination table.
column
The attribute columns to write.
ImportantThe order of columns must correspond to the order of columns specified in the MaxCompute Reader configuration.
-
-
Click the
icon to save the configuration.
-
Step 3: Run the synchronization task
-
Click the
icon. -
In the Parameters dialog box, select the resource group for the task.
-
Click Run.
Data studio (new version)
Step 1: Create a task node
Go to the Data Development page.
Log on to the DataWorks console.
In the top navigation bar, select a resource group and region.
In the navigation pane on the left, click .
On the Data Development page, select your target workspace from the drop-down list and click Go to Data Studio.
In data development page of the Data Studio console, click the
icon to the right of Workspace Directories, and select .NoteIf this is your first time using Workspace Directories, you can also click the Create Node button.
In the Create Node dialog box, select a path, enter a name, and click OK.
The new batch synchronization node appears in the Workspace Directories.
Step 2: Configure the synchronization task
-
In the project directory, click the new offline synchronization task node.
-
Configure the network and resources.
Select the data source, data destination, and the resource group for the synchronization task, and then test the connectivity.
-
In the Configure Network and Resources step, set Data Source to MaxCompute and select the newly added MaxCompute data source for Data Source Name.
-
Select a resource group.
After you select a resource group, the system displays its region and specifications, and automatically tests the connectivity between the resource group and the selected data source.
NoteServerless resource groups allow you to specify an upper limit of compute units (CUs) for a synchronization task. If your synchronization task fails with an Out of Memory (OOM) error due to insufficient resources, you must increase the number of CUs for the resource group.
-
Set Data Destination to Tablestore and select the newly added Tablestore data source for Data Source Name.
The system automatically tests the connectivity between the resource group and the selected data source.
-
After the connectivity test succeeds, click Next.
-
-
Configure and save the task.
Wizard mode
-
In the Configure Task step, configure the data source and data destination in the Configure Data Source and Destination area.
Data source
Parameter
Description
Data Source
Defaults to the MaxCompute data source that you selected in the previous step.
Tunnel resource group
This parameter specifies the Tunnel quota. By default, the public data transfer resource is used, which is the free quota for MaxCompute.
For more information about MaxCompute data transfer resources, see Purchase and use a dedicated data transfer service resource group.
NoteIf a dedicated Tunnel quota becomes unavailable due to overdue payments or expiration, the task automatically switches to the public data transfer resource.
Table
The source table.
Filter by
The filtering logic for data synchronization. The following methods are supported:
-
Partition Filter: Filters the scope of offline synchronization based on a partition expression. If you select this method, you must also configure the Partition Information and If Partition Does Not Exist parameters.
-
Data Filter: Filters the scope of data synchronization by using a WHERE clause. You do not need to enter the
WHEREkeyword.
Partition Information
NoteThis parameter is required if you set Filter by to Partition Filter.
Specify the value of the partition column.
-
You can use a fixed value, such as
ds=20220101. -
You can use a scheduling parameter, such as
ds=${bizdate}. The scheduling parameter is replaced with its actual value at runtime.
If Partition Does Not Exist
NoteThis parameter is required if you set Filter by to Partition Filter.
The policy for handling the synchronization task if the partition does not exist.
-
Fail
-
Ignore non-existent partitions and run the task
Data destination
Parameter
Description
Data Source
Defaults to the Tablestore data source that you selected in the previous step.
Table
The destination table.
Primary Key Information
The primary key of the destination table.
write mode
The mode for writing data to Tablestore. The following modes are supported:
-
PutRow: Corresponds to the
PutRowAPI operation of Tablestore. Inserts data into a specified row. If the row does not exist, a new row is created. If the row exists, the existing row is overwritten. -
UpdateRow: Corresponds to the
UpdateRowAPI operation of Tablestore. Updates data in a specified row. If the row does not exist, a new row is created. If the row exists, the values of specified columns are added, modified, or deleted based on the request.
-
-
Configure field mapping.
You must specify the mapping between source fields and target fields. The task uses this mapping to write data from the source fields to the corresponding target fields. For more information, see Configure field mapping.
Important-
You must specify the primary key in the source field list to read the primary key data.
-
Because the primary key information for the target table is already configured under Data Destination, you cannot configure the primary key information again in the target field list.
-
For fields with an INTEGER data type, you must configure them as INT. DataWorks automatically converts INT to the INTEGER type. If you configure the type as INTEGER directly, the task fails with an error.
-
-
Configure channel control.
You can use channel control to manage properties related to the offline synchronization process, such as concurrency and rate limiting. For more information about the parameters, see Relationship between concurrency and rate limiting for offline synchronization.
-
Click Save.
Script mode
-
In the Configure Task step, click Script Mode, and then click OK in the dialog box that appears. In the toolbar on the task configuration page, click Script Mode to switch to script editing mode.
-
On the script configuration page, edit the script.
The following code provides a script example. Replace the parameters with your specific values.
{ "type": "job", "version": "2.0", "steps": [ { "stepType": "odps", "parameter": { "partition": [], "datasource": "MyMaxComputeDataSource", "table": "source_table", "column": [ "pk", "col_string", "col_long", "col_double", "col_bool" ] }, "name": "Reader", "category": "reader" }, { "stepType": "ots", "parameter": { "datasource": "MyTablestoreDataSource", "table": "target_table", "newVersion": "true", "mode": "normal", "writeMode": "updateRow", "column": [ { "name": "col_string", "type": "STRING" }, { "name": "col_long", "type": "INT" }, { "name": "col_double", "type": "DOUBLE" }, { "name": "col_bool", "type": "BOOL" } ], "primaryKey": [ { "name": "pk", "type": "STRING" } ] }, "name": "Writer", "category": "writer" }, { "name": "Processor", "stepType": null, "category": "processor", "parameter": {} } ], "setting": { "executeMode": null, "errorLimit": { "record": "0" }, "speed": { "throttle": true, "concurrent": 2, "mbps": "12" } }, "order": { "hops": [ { "from": "Reader", "to": "Writer" } ] } }Important-
You must specify the primary key in the
columnparameter of MaxCompute Reader to read the primary key data. -
For fields with an INTEGER data type, you must configure them as INT. DataWorks automatically converts INT to the INTEGER type. If you configure the type as INTEGER directly, the task fails with an error.
-
The following table describes the parameters that you need to replace for MaxCompute Reader.
Parameter
Description
partition
The partition from which data is read.
NoteThis parameter is required only for partitioned tables.
datasource
The name of the MaxCompute data source.
table
The name of the source table.
column
The columns to read from the MaxCompute source table.
-
The following table describes the parameters that you need to replace for Tablestore Writer.
Parameter
Description
datasource
The name of the Tablestore data source.
table
The name of the destination table.
primaryKey
The primary key columns of the destination table.
column
The attribute columns to write.
ImportantThe order of columns must correspond to the order of columns specified in the MaxCompute Reader configuration.
-
-
Click Save.
-
Step 3: Run the synchronization task
-
Click Runtime Configuration to the right of the task, and select the resource group for the task.
-
Click Run.
Step 4: View synchronization results
After the synchronization task runs, you can view its execution status in the log and the synchronized data in the destination data table in the Tablestore console.
-
View the task execution status.
-
On the Result tab of the synchronization task, check the value of
Current task status.If
Current task statusis FINISH, the task is complete. -
To view detailed execution logs, click the
Detail log urllink.
-
-
View the synchronized data in the destination data table.
-
Go to the Instance Management page.
-
Log in to the Tablestore console.
-
In the top navigation bar, select a resource group and a region.
-
On the Overview page, click the instance alias, or click Instance Management in the Actions column of the instance.
-
-
On the Instance Details tab, click the Data Table List tab.
-
On the Data Table List tab, click the name of the destination data table.
-
On the Data Management tab, you can view the data synchronized to this data table.
-

