How to synchronize data from MaxCompute to MongoDB using Data Integration

Prerequisites

Before you begin, ensure that you meet the following requirements:

DataWorks is activated and you have added a MaxCompute data source.
This tutorial uses an exclusive resource group for Data Integration to run the batch task. You must purchase and configure this resource group before proceeding. For more information, see Use an exclusive resource group for Data Integration.

Note
You can also use a serverless resource group. For more information, see Use a serverless resource group.

Prepare sample data

Prepare a MongoDB collection and a MaxCompute table for batch synchronization.

Create a MaxCompute table and add data to it.

Create a partitioned table named test_write_mongo. Set the partition field to pt.

CREATE TABLE IF NOT EXISTS test_write_mongo (
    id          STRING,
    col_string  STRING,
    col_int     INT,
    col_bigint  BIGINT,
    col_decimal DECIMAL,
    col_date    DATETIME,
    col_boolean BOOLEAN,
    col_array   STRING
) PARTITIONED BY (pt STRING) LIFECYCLE 10;

Add a partition with the value 20230215 and insert one sample row into it.

insert into test_write_mongo partition (pt='20230215')
values ('11','name11',1,111,1.22,cast('2023-02-15 15:01:01' as datetime),true,'1,2,3');

Query the partition to verify that the table and the sample row were created correctly.
```
SELECT * FROM test_write_mongo
WHERE pt='20230215';
```

Prepare the destination MongoDB collection.
This tutorial uses ApsaraDB for MongoDB as an example. Create a collection named test_write_mongo.
```
db.createCollection('test_write_mongo')
```

Batch task configuration

Step 1: Add a MongoDB data source

Add a MongoDB data source and establish a network connection between it and the exclusive resource group for Data Integration. For more information, see Configure a MongoDB data source.

Step 2: Create and configure a batch sync node

In DataWorks DataStudio, create a batch sync node and configure its source and destination. This section describes only the key parameters. Leave other parameters at their default values. For detailed instructions, see Configure a task in the codeless UI.

Configure the network connection for synchronization.

Select the MongoDB data source, the MaxCompute data source, and the exclusive resource group for Data Integration. Then, test the connectivity.

Configure the source and destination.

Select the MaxCompute partitioned table and MongoDB collection. The following table describes the key parameters.

Parameter

Description

write mode

Determines whether to overwrite existing data. Two related settings control this behavior: write mode and business primary key.

write mode:
- No (default): the plugin inserts each record as a new document.
- Yes: the plugin overwrites documents that have the same business primary key as the incoming record. You must specify a business primary key.
business primary key: The field that identifies each record for the overwrite operation. You can specify only one field. In MongoDB, this typically corresponds to the _id field.

Note

If write mode is set to Yes and the business primary key is a field other than _id, the task may fail with an error similar to the following: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2". The error occurs when an incoming record matches an existing document on the business primary key but carries a different _id, because MongoDB does not allow the _id of an existing document to change. For more information, see Error: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2".
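
The two write modes and the _id error can be illustrated with a small in-memory simulation. This is a sketch only, not the plugin's implementation: the function write_record, the exception ImmutableIdError, and the use of a Python list as a stand-in for a collection are all hypothetical, and inserting a record when no document matches the key is an assumption.

```python
class ImmutableIdError(Exception):
    """Raised when a replacement would change the _id of an existing document."""


def write_record(collection, record, replace=False, key=None):
    """Write one record into `collection`, a list of dicts standing in for MongoDB."""
    if not replace:
        # write mode = No: every record becomes a new document.
        collection.append(dict(record))
        return "inserted"
    if key is None:
        raise ValueError("write mode Yes requires a business primary key")
    for index, existing in enumerate(collection):
        if existing.get(key) == record.get(key):
            # MongoDB rejects a replacement that would alter _id.
            if "_id" in record and record["_id"] != existing.get("_id"):
                raise ImmutableIdError(
                    f"_id would change from {existing.get('_id')!r} to {record['_id']!r}"
                )
            replaced = dict(record)
            if "_id" in existing:
                replaced["_id"] = existing["_id"]
            collection[index] = replaced
            return "replaced"
    # Assumption: a record with no matching document is inserted.
    collection.append(dict(record))
    return "inserted"


docs = []
write_record(docs, {"_id": "1", "id": "11", "col_string": "name11"})

# Overwrite by _id: the existing document is replaced.
write_record(docs, {"_id": "1", "id": "11", "col_string": "renamed"},
             replace=True, key="_id")

# Overwrite by another field while the record carries a different _id: rejected.
try:
    write_record(docs, {"_id": "2", "id": "11"}, replace=True, key="id")
except ImmutableIdError as error:
    print("rejected:", error)
```

In this sketch, choosing _id as the business primary key can never trigger the error, because a record that matches on _id necessarily carries the same _id as the document it replaces.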

Pre-Import Statement

Specifies a preparatory operation (PreSQL) to run before the import task. Configure it in JSON format with the type and json properties.

type: Required. Valid values are remove and drop (in lowercase).
json:
- If type is set to remove, this parameter is required. Its value must be a standard MongoDB query. For more information, see Query Documents.
- If type is set to drop, this parameter is not required.
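
The rules for the Pre-Import Statement can be checked before you paste the value into the task configuration. The following is a minimal sketch: validate_pre_sql is a hypothetical helper, the filter on the id field is only an illustration, and encoding the json value as a string that contains the query is an assumption about the expected format.

```python
import json

VALID_TYPES = {"remove", "drop"}


def validate_pre_sql(text):
    """Parse a Pre-Import Statement and check the type/json rules described above."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("the statement must be a JSON object")
    kind = config.get("type")
    if kind not in VALID_TYPES:
        raise ValueError("type must be 'remove' or 'drop' (lowercase)")
    if kind == "remove" and not config.get("json"):
        raise ValueError("json is required when type is 'remove'")
    return config


# Hypothetical statement: remove documents whose id equals "11" before writing.
# Assumption: the query is passed as a string inside the json property.
remove_stmt = json.dumps({"type": "remove", "json": json.dumps({"id": "11"})})

# drop does not need a json property.
drop_stmt = json.dumps({"type": "drop"})

print(validate_pre_sql(remove_stmt))
print(validate_pre_sql(drop_stmt))
```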

Configure field mapping.

When the data source is MongoDB, Peer mapping is used by default. You can also click the edit icon to manually edit the source table fields. The following is an example of manual editing.

{"name":"id","type":"string"}
{"name":"col_string","type":"string"}
{"name":"col_int","type":"long"}
{"name":"col_bigint","type":"long"}
{"name":"col_decimal","type":"double"}
{"name":"col_date","type":"date"}
{"name":"col_boolean","type":"bool"}
{"name":"col_array","type":"array","splitter":","}

After you manually edit the fields, the UI displays the mapping between source and destination fields.

Source field	Destination field
id	`{"name":"id","type":"string"}`
col_string	`{"name":"col_string","type":"string"}`
col_int	`{"name":"col_int","type":"long"}`
col_bigint	`{"name":"col_bigint","type":"long"}`
col_decimal	`{"name":"col_decimal","type":"double"}`
col_date	`{"name":"col_date","type":"date"}`
col_boolean	`{"name":"col_boolean","type":"bool"}`
col_array	`{"name":"col_array","type":"array","splitter":","}`
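
The positional pairing shown in the table above can be reproduced with a short script, which is also a convenient way to catch a missing line or an unsupported type before saving the task. This is an illustrative sketch: pair_fields is a hypothetical helper, and matching type names case-insensitively against the list in the appendix is an assumption.

```python
import json

# Type names listed in the appendix of this tutorial.
SUPPORTED_TYPES = {"INT", "LONG", "DOUBLE", "STRING", "BOOL", "DATE", "ARRAY"}

# Columns of test_write_mongo in table order (the partition column is excluded).
SOURCE_FIELDS = [
    "id", "col_string", "col_int", "col_bigint",
    "col_decimal", "col_date", "col_boolean", "col_array",
]

# Destination field definitions, one JSON object per line.
DESTINATION_LINES = """
{"name":"id","type":"string"}
{"name":"col_string","type":"string"}
{"name":"col_int","type":"long"}
{"name":"col_bigint","type":"long"}
{"name":"col_decimal","type":"double"}
{"name":"col_date","type":"date"}
{"name":"col_boolean","type":"bool"}
{"name":"col_array","type":"array","splitter":","}
"""


def pair_fields(source_fields, destination_lines):
    """Pair source fields with destination definitions by position."""
    destinations = [
        json.loads(line) for line in destination_lines.splitlines() if line.strip()
    ]
    if len(destinations) != len(source_fields):
        raise ValueError("source and destination field counts differ")
    for field in destinations:
        # Assumption: type names are matched case-insensitively.
        type_name = field["type"].upper()
        if type_name not in SUPPORTED_TYPES:
            raise ValueError(f"unsupported type: {field['type']}")
        if type_name == "ARRAY" and not field.get("splitter"):
            raise ValueError(f"array field {field['name']} needs a splitter")
    return list(zip(source_fields, destinations))


for source, destination in pair_fields(SOURCE_FIELDS, DESTINATION_LINES):
    print(source, "->", json.dumps(destination, separators=(",", ":")))
```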

Step 3: Commit and deploy the batch sync node

To run this task periodically in a standard-mode workspace, you must commit and deploy the batch sync node to the production environment. For more information, see Node deployment.

Step 4: Run the batch sync node and view results

After you configure the node, run it. When the task is complete, view the synchronized data in the MongoDB collection. The task creates one document in the destination database with the following fields and values:

- _id: 63ecc513143e565c9b037623
- col_array: [1, 2, 3]
- col_bigint: 111
- col_boolean: true
- col_date: 2023-02-15T07:01:01.000Z
- col_decimal: 1.22
- col_int: 1
- col_string: name11
- id: 11
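
The col_date value differs from the inserted value 2023-02-15 15:01:01 by eight hours. MongoDB stores dates in UTC, so one plausible explanation is that the source value was interpreted as UTC+8 and converted on write. The sketch below reproduces the arithmetic under that assumption; the source time zone is not stated in this tutorial, and to_mongo_iso is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the MaxCompute DATETIME value is interpreted as UTC+8.
SOURCE_TZ = timezone(timedelta(hours=8))


def to_mongo_iso(value, source_tz=SOURCE_TZ):
    """Convert a 'YYYY-MM-DD HH:MM:SS' local time to a UTC ISO 8601 string."""
    local = datetime.strptime(value, "%Y-%m-%d %H:%M:%S").replace(tzinfo=source_tz)
    utc = local.astimezone(timezone.utc)
    # Format with millisecond precision and a trailing Z.
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


print(to_mongo_iso("2023-02-15 15:01:01"))  # 2023-02-15T07:01:01.000Z
```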

Appendix: Data type conversion

Supported values for the type parameter

The supported values for the type parameter are: INT, LONG, DOUBLE, STRING, BOOL, DATE, and ARRAY.

Array type

To write data as a MongoDB array, set the type parameter to ARRAY and configure the splitter property. For example:

- The source data is the string a,b,c.
- In the task configuration, type is set to ARRAY and splitter is set to a comma (,).
- After the task runs, the resulting data in MongoDB is ["a","b","c"].
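
The effect of the splitter can be sketched in a few lines. The helper split_to_array is hypothetical and only mirrors the example above; keeping each element as a string is an assumption based on the ["a","b","c"] result.

```python
def split_to_array(value, splitter):
    """Split a source string into a list, as the ARRAY type does with its splitter."""
    if not splitter:
        raise ValueError("splitter must not be empty")
    return value.split(splitter)


print(split_to_array("a,b,c", ","))  # ['a', 'b', 'c']
```

A value that does not contain the splitter yields a single-element list in this sketch.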