Migrate JSON data from OSS to MaxCompute

This topic describes how to use DataWorks Data Integration to migrate JSON data from OSS to MaxCompute, and then use the built-in MaxCompute string function GET_JSON_OBJECT to extract information from the JSON data.

Prerequisites

MaxCompute is activated.
A workflow has been created in DataWorks. This example uses a workspace in basic mode. For more information, see Create a workflow.

A JSON file with the .txt extension has been uploaded to OSS. In this example, the OSS bucket is in the China (Shanghai) region. The following is an example file.

{
    "store": {
        "book": [
             {
                "category": "reference",
                "author": "Nigel Rees",
                "title": "Sayings of the Century",
                "price": 8.95
             },
             {
                "category": "fiction",
                "author": "Evelyn Waugh",
                "title": "Sword of Honour",
                "price": 12.99
             },
             {
                 "category": "fiction",
                 "author": "J. R. R. Tolkien",
                 "title": "The Lord of the Rings",
                 "isbn": "0-395-19395-8",
                 "price": 22.99
             }
          ],
          "bicycle": {
              "color": "red",
              "price": 19.95
          }
    },
    "expensive": 10
}
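Before you upload the file, it can help to confirm that it parses as JSON and to store it as a single compact line. The following is a minimal Python sketch, not part of DataWorks or OSS. It assumes that the reader treats each line of the text file as one record and splits records on the field delimiter (`^` in the reader configuration later in this topic). The function name `compact_json` is hypothetical.

```python
import json


def compact_json(text, delimiter="^"):
    """Validate text as JSON and return it as a single compact line."""
    # json.loads raises json.JSONDecodeError if the text is not valid JSON.
    document = json.loads(text)
    # Compact separators and no indentation keep the whole document on one
    # line; ensure_ascii=False preserves non-ASCII characters as they are.
    line = json.dumps(document, separators=(",", ":"), ensure_ascii=False)
    # Assumption: a field delimiter inside the data would split the line into
    # several columns, so report it instead of uploading such a file.
    if delimiter and delimiter in line:
        raise ValueError(f"JSON contains the field delimiter {delimiter!r}")
    return line


pretty = """{
    "store": {"bicycle": {"color": "red", "price": 19.95}},
    "expensive": 10
}"""
print(compact_json(pretty))
# {"store":{"bicycle":{"color":"red","price":19.95}},"expensive":10}
```

Write the returned line to a file with the .txt extension and upload that file to the OSS bucket.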

Migrate JSON data

Add an OSS data source. For more information, see Configure an OSS data source.
Create a table in DataWorks to store the migrated JSON data.
1. Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.
2. Click Data Source to go to the Data Source page, and click New data source to add a MaxCompute project.
3. Click Add Data Source and Bind to Data Development to complete the binding.
4. On the Data Development page, move the pointer over the icon and choose Create Table > Table.
5. In the Create Table dialog box, select a Path, enter a Name, and click Create.
  
  Note
  If multiple instances are bound, you must select a MaxCompute engine instance.
6. On the table editing page, click DDL Statement.
7. In the DDL Statement dialog box, enter the following statement and click Generate Table Schema.
```
create table mqdata (mq_data string);
```
8. In the Confirm operation dialog box, click Confirm.
9. After the table schema is generated, go to the General section, enter a Display Name for the table, and then click Commit to Development Environment and Commit to Production Environment.
  
  Note
  If you are using a workspace in basic mode, you need to click only Commit to Production Environment.

Create a batch synchronization node.

Go to the Data Development page. Right-click the workflow and choose Create Node > Data Integration > Offline synchronization.
In the Create Node dialog box, enter a Name and click Confirm.
In the top navigation bar, click the icon that switches the node to script mode.
In script mode, click the icon that opens the import Template dialog box.
In the import Template dialog box, select the source type, source data source, target type, and target data source, and then click Confirm.

Modify the JSON configuration, and then click the button.

The following code is provided for reference.

{
    "type": "job",
    "steps": [
        {
            "stepType": "oss",
            "parameter": {
                "fieldDelimiterOrigin": "^",
                "nullFormat": "",
                "compress": "",
                "datasource": "OSS_userlog",
                "column": [
                    {
                        "name": 0,
                        "type": "string",
                        "index": 0
                    }
                ],
                "skipHeader": "false",
                "encoding": "UTF-8",
                "fieldDelimiter": "^",
                "fileFormat": "binary",
                "object": [
                    "applog.txt"
                ]
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "odps",
            "parameter": {
                "partition": "",
                "isCompress": false,
                "truncate": true,
                "datasource": "odps_first",
                "column": [
                    "mq_data"
                ],
                "emptyAsNull": false,
                "table": "mqdata"
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "version": "2.0",
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    },
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "concurrent": 2,
            "throttle": false
        }
    }
}
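A few structural mistakes in this configuration are easy to catch locally before you run the node. The following Python sketch is a hedged illustration, not DataWorks' own validation. It assumes that reader and writer columns are matched by position and that the writer columns must name columns of the destination table. `check_job` and the trimmed `JOB` document are hypothetical and contain only the fields the checks read.

```python
import json

# Trimmed copy of the job configuration: only the fields the checks read.
JOB = json.loads("""
{
  "steps": [
    {"name": "Reader", "category": "reader",
     "parameter": {"column": [{"name": 0, "type": "string", "index": 0}]}},
    {"name": "Writer", "category": "writer",
     "parameter": {"table": "mqdata", "column": ["mq_data"]}}
  ],
  "order": {"hops": [{"from": "Reader", "to": "Writer"}]}
}
""")


def check_job(job, table_columns):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    steps = {step["name"]: step for step in job["steps"]}

    # Every hop must connect two steps that are actually defined.
    for hop in job["order"]["hops"]:
        for end in ("from", "to"):
            if hop[end] not in steps:
                problems.append(f"hop '{end}' names unknown step '{hop[end]}'")

    reader_cols = [c for s in steps.values() if s["category"] == "reader"
                   for c in s["parameter"]["column"]]
    writer_cols = [c for s in steps.values() if s["category"] == "writer"
                   for c in s["parameter"]["column"]]

    # Assumption: columns are paired by position, so the counts must match.
    if len(reader_cols) != len(writer_cols):
        problems.append(
            f"reader has {len(reader_cols)} column(s), writer has {len(writer_cols)}")

    # Writer columns should exist in the destination table.
    for col in writer_cols:
        if col not in table_columns:
            problems.append(f"writer column '{col}' is not in the table")
    return problems


# mq_data is the only column created by the DDL statement for the mqdata table.
print(check_job(JOB, table_columns={"mq_data"}))  # prints []
```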

Verify the results

Create an ODPS SQL node.

Right-click the workflow and choose New > MaxCompute > ODPS SQL.
In the dialog box that appears, enter a name for the node and click Submit.

On the ODPS SQL node configuration tab, enter the following statements.

-- Query data in the mqdata table.
SELECT * FROM mqdata;
-- Get the value of the 'expensive' field from the JSON data.
SELECT GET_JSON_OBJECT(mqdata.mq_data,'$.expensive') FROM mqdata;

Click the run icon to run the code.
You can view the results in the operation log.
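To know what the queries should return before you read the log, you can resolve the same paths against the sample file locally. The following Python sketch is a simplified stand-in for that lookup, not the MaxCompute implementation of GET_JSON_OBJECT. It supports only `$`, `.key`, and `[index]` steps. The `lookup` helper is hypothetical, and `SAMPLE` is a shortened copy of the example file. Note that GET_JSON_OBJECT returns its result as a string, whereas this sketch returns Python values.

```python
import json
import re

# Shortened copy of the example file from the prerequisites.
SAMPLE = json.loads("""
{"store": {"book": [{"title": "Sayings of the Century", "price": 8.95}],
           "bicycle": {"color": "red", "price": 19.95}},
 "expensive": 10}
""")

# One path step: either ".key" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def lookup(document, path):
    """Resolve a path such as '$.store.book[0].title'; return None if absent."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    rest = path[1:]
    pos = 0
    current = document
    while pos < len(rest):
        match = _STEP.match(rest, pos)
        if match is None:
            raise ValueError(f"unsupported path syntax at offset {pos + 1}")
        key, index = match.group(1), match.group(2)
        if key is not None:
            # Object member access.
            if not isinstance(current, dict) or key not in current:
                return None
            current = current[key]
        else:
            # Array element access.
            i = int(index)
            if not isinstance(current, list) or i >= len(current):
                return None
            current = current[i]
        pos = match.end()
    return current


print(lookup(SAMPLE, "$.expensive"))            # 10
print(lookup(SAMPLE, "$.store.bicycle.color"))  # red
print(lookup(SAMPLE, "$.store.book[0].title"))  # Sayings of the Century
```

The first path is the one used in the second query above, so that query is expected to return 10 for the sample file.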