Data Lake Formation data source

Alibaba Cloud Data Lake Formation (DLF) is a fully managed platform that provides unified services for metadata management, storage management, permission management, storage analysis, and optimization. DataWorks Data Integration can write data to DLF data sources. This topic explains how to configure and use a Data Lake Formation data source in DataWorks.

Limitations

You can use a Data Lake Formation data source only in Data Integration, and only with serverless resource groups.

Create a data source

  1. Go to the Data Sources page.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane of the Management Center page, click Data Sources.

  2. Click Add Data Source. In the dialog box that appears, search for and select DLF. Configure the parameters as described in the following table.

    | Parameter | Description |
    | --- | --- |
    | Data Source Name | Enter a custom name for the data source. The name must be unique within the workspace, can contain only letters, digits, and underscores (_), and cannot start with a digit or an underscore. |
    | Configuration Mode | Only Alibaba Cloud Instance Mode is supported. |
    | Endpoint | Select the endpoint of the DLF engine instance from the drop-down list. |
    | Access identity | Select one of the following: Alibaba Cloud Account, Alibaba Cloud RAM User, or Alibaba Cloud RAM Role. Note: If you select a RAM user or RAM role as the access identity, you must grant the required permissions to that RAM user or RAM role. |
    | DLF data catalog | Select a DLF data catalog in the same region as your DataWorks workspace. |
    | Database Name | Select a database from the data catalog. |

    After you configure the parameters, test the connectivity between the data source and the serverless resource group. If the test succeeds, click Complete Modification. If it fails, see Network connectivity configuration to troubleshoot the issue.
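
The character rules for Data Source Name can be checked before you open the dialog box. The following is a minimal Python sketch and not part of DataWorks. The helper name `is_valid_data_source_name` is hypothetical, and the sketch assumes that "letters" means ASCII letters, which this topic does not confirm. It does not check uniqueness within the workspace.

```python
import re

# Assumption: "letters" means ASCII letters (A-Z, a-z).
# A name must start with a letter and may continue with letters,
# digits, or underscores, so it can never start with a digit or "_".
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_data_source_name(name: str) -> bool:
    """Return True if the name satisfies the character rules only."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("guxuan_dlf", "_dlf", "1dlf", "dlf-prod"):
        print(candidate, is_valid_data_source_name(candidate))
```

Of the four candidates, only `guxuan_dlf` passes: the others start with an underscore, start with a digit, or contain a hyphen.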

Create a Data Integration task

You can use a Data Lake Formation data source in a DataWorks Data Integration task. For more information, see Synchronize data to Data Lake Formation.

Appendix: Script examples and parameters

Offline task script configuration

When you configure an offline task using the code editor, you must add the parameters to the task script in the required format. For more information, see Configure a task in the code editor. The following sections describe the data source parameters that you can use in the code editor.

Reader script example

{
   "type": "job",
   "version": "2.0",
   "steps": [
      {
         "stepType": "dlf",
         "parameter": {
            "datasource": "guxuan_dlf",
            "table": "auto_ob_3088545_0523",
            "column": [
               "id",
               "col1",
               "col2",
               "col3"
            ],
            "tableType": "table",
            "where": "id > 1"
         },
         "name": "Reader",
         "category": "reader"
      },
      {
         "stepType": "stream",
         "parameter": {
            "print": false
         },
         "name": "Writer",
         "category": "writer"
      }
   ],
   "setting": {
      "errorLimit": {
         "record": "" // The number of error records that are allowed.
      },
      "speed": {
         "throttle": true, // Set to true to enable rate limiting. If set to false, rate limiting is disabled and the mbps parameter is ignored.
         "concurrent": 20, // The number of concurrent threads.
         "mbps": "12" // The maximum data transfer rate. Unit: MB/s.
      }
   },
   "order": {
      "hops": [
         {
            "from": "Reader",
            "to": "Writer"
         }
      ]
   }
}
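
The comments in the `speed` block state that `mbps` takes effect only when `throttle` is true. The following Python sketch expresses that rule. It assumes the `speed` object has already been loaded into a dict with the `//` comments removed, and `effective_rate_limit` is a hypothetical helper, not a DataWorks API.

```python
from typing import Optional


def effective_rate_limit(speed: dict) -> Optional[float]:
    """Return the rate limit in MB/s, or None if rate limiting is disabled."""
    if not speed["throttle"]:
        # throttle is false: rate limiting is off and mbps is ignored.
        return None
    # mbps is written as a string in the script example, so convert it.
    return float(speed["mbps"])


if __name__ == "__main__":
    print(effective_rate_limit({"throttle": True, "concurrent": 20, "mbps": "12"}))
    print(effective_rate_limit({"throttle": False, "concurrent": 20, "mbps": "12"}))
```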

Reader script parameters

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| datasource | The name of the DLF data source. | Yes | N/A |
| table | The name of the source table. | Yes | N/A |
| tableType | The table type. Valid values: table (Paimon table), format-table (Format table), and iceberg-table (Iceberg table). | No | table |
| column | The names of columns to read from the source table. | Yes | N/A |
| where | The filter condition. | No | N/A |
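
The reader parameters above can be validated before a script is pasted into the code editor. The following Python sketch builds the reader step as a dict and rejects missing required parameters and unknown `tableType` values. The function `build_dlf_reader_step` is hypothetical and is not provided by DataWorks. The step layout copies the reader script example, and the checks reflect only what the parameter descriptions state.

```python
VALID_TABLE_TYPES = {"table", "format-table", "iceberg-table"}


def build_dlf_reader_step(datasource, table, column, table_type="table", where=None):
    """Build a DLF reader step dict from the documented parameters."""
    if not datasource or not table:
        raise ValueError("datasource and table are required")
    if not column:
        raise ValueError("column must list at least one column name")
    if table_type not in VALID_TABLE_TYPES:
        raise ValueError(f"unsupported tableType: {table_type!r}")
    parameter = {
        "datasource": datasource,
        "table": table,
        "column": list(column),
        "tableType": table_type,
    }
    if where:
        # where is optional, so it is added only when a filter is given.
        parameter["where"] = where
    return {
        "stepType": "dlf",
        "parameter": parameter,
        "name": "Reader",
        "category": "reader",
    }


if __name__ == "__main__":
    step = build_dlf_reader_step(
        "guxuan_dlf", "auto_ob_3088545_0523", ["id", "col1"], where="id > 1"
    )
    print(step["parameter"])
```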

Writer script example

{
   "type": "job",
   "version": "2.0",
   "steps": [
      {
         "stepType": "stream",
         "parameter": {
         },
         "name": "Reader",
         "category": "reader"
      },
      {
         "stepType": "dlf",
         "parameter": {
            "datasource": "guxuan_dlf",
            "column": [
               "id",
               "col1",
               "col2",
               "col3"
            ],
            "tableType": "table",
            "table": "auto_ob_3088545_0523"
         },
         "name": "Writer",
         "category": "writer"
      }
   ],
   "setting": {
      "errorLimit": {
         "record": "" // The number of error records that are allowed.
      },
      "speed": {
         "throttle": true, // Set to true to enable rate limiting. If set to false, rate limiting is disabled and the mbps parameter is ignored.
         "concurrent": 20, // The number of concurrent threads.
         "mbps": "12" // The maximum data transfer rate. Unit: MB/s.
      }
   },
   "order": {
      "hops": [
         {
            "from": "Reader",
            "to": "Writer"
         }
      ]
   }
}

Writer script parameters

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| datasource | The name of the DLF data source. | Yes | N/A |
| table | The name of the destination table. | Yes | N/A |
| tableType | The table type. Valid values: table (Paimon table), format-table (Format table), and iceberg-table (Iceberg table). | No | table |
| column | The columns in the destination table. | Yes | N/A |
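
A writer script can also be generated programmatically, which keeps the step names and the `hops` entry consistent. The following Python sketch assembles the same structure as the writer script example and serializes it as comment-free JSON. The helpers `build_dlf_writer_step` and `build_job` are hypothetical names. The `setting` values are copied from the example and are not recommendations.

```python
import json

VALID_TABLE_TYPES = {"table", "format-table", "iceberg-table"}


def build_dlf_writer_step(datasource, table, column, table_type="table"):
    """Build a DLF writer step; datasource, table, and column are required."""
    if not (datasource and table and column):
        raise ValueError("datasource, table, and column are required")
    if table_type not in VALID_TABLE_TYPES:
        raise ValueError(f"unsupported tableType: {table_type!r}")
    return {
        "stepType": "dlf",
        "parameter": {
            "datasource": datasource,
            "column": list(column),
            "tableType": table_type,
            "table": table,
        },
        "name": "Writer",
        "category": "writer",
    }


def build_job(reader_step, writer_step):
    """Wrap a reader step and a writer step in the job envelope from the example."""
    return {
        "type": "job",
        "version": "2.0",
        "steps": [reader_step, writer_step],
        "setting": {
            "errorLimit": {"record": ""},
            "speed": {"throttle": True, "concurrent": 20, "mbps": "12"},
        },
        # The hop is derived from the step names so the two cannot drift apart.
        "order": {"hops": [{"from": reader_step["name"], "to": writer_step["name"]}]},
    }


if __name__ == "__main__":
    stream_reader = {"stepType": "stream", "parameter": {}, "name": "Reader", "category": "reader"}
    writer = build_dlf_writer_step(
        "guxuan_dlf", "auto_ob_3088545_0523", ["id", "col1", "col2", "col3"]
    )
    print(json.dumps(build_job(stream_reader, writer), indent=3))
```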