Alibaba Cloud Data Lake Formation (DLF) is a fully managed platform that provides unified services for metadata management, storage management, permission management, storage analysis, and optimization. DataWorks Data Integration can write data to DLF data sources. This topic explains how to configure and use a Data Lake Formation data source in DataWorks.
Limitations
You can use a Data Lake Formation data source only in Data Integration, and only with serverless resource groups.
Create a data source
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.
Click Add Data Source. In the dialog box that appears, search for and select DLF. Configure the parameters as described in the following table.
| Parameter | Description |
| --- | --- |
| Data Source Name | Enter a custom name for the data source. The name must be unique within the workspace, consist of only letters, digits, and underscores (_), and cannot start with a digit or an underscore. |
| Configuration Mode | Only Alibaba Cloud Instance Mode is supported. |
| Endpoint | Select the endpoint of the DLF engine instance from the drop-down list. |
| Access identity | Select one of the following: Alibaba Cloud Account, Alibaba Cloud RAM User, or Alibaba Cloud RAM Role. |
| DLF data catalog | Select a DLF data catalog in the same region as your DataWorks workspace. |
| Database Name | Select a database from the data catalog. |

Note: If you select Alibaba Cloud RAM User or Alibaba Cloud RAM Role as the access identity, you must grant the following permissions to the RAM user or RAM role:
- In the RAM console, attach the AliyunDataWorksDIAccessDLF system policy to the RAM user or RAM role to grant permissions for accessing DLF metadata. For more information, see Grant permissions to a RAM user.
- In the Data Lake Formation console, grant the Data Editor permission to the RAM role or RAM user on the tables to be synchronized. For more information, see Data Authorization Management.
After you configure the parameters, test the connectivity between the data source and the serverless resource group. If the test succeeds, click Complete Modification. If the test fails, see Network connectivity configuration to troubleshoot the issue.
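The naming rule for Data Source Name can be checked before you open the dialog box, for example when names are generated by a script. The following is a minimal sketch, not part of DataWorks: the function name is hypothetical, and it assumes that "letters" means ASCII letters only. It cannot check uniqueness within the workspace.

```python
import re

# Assumption: "letters" means ASCII letters (A-Z, a-z).
# The first character must be a letter, so a leading digit or underscore is rejected.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_data_source_name(name: str) -> bool:
    """Return True if the name satisfies the documented character rules."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ("guxuan_dlf", "_dlf", "1dlf", "dlf-prod"):
    print(candidate, is_valid_data_source_name(candidate))
# guxuan_dlf True
# _dlf False
# 1dlf False
# dlf-prod False
```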
Create a Data Integration task
You can use a Data Lake Formation data source in a DataWorks Data Integration task. For more information, see Synchronize data to Data Lake Formation.
Appendix: Script examples and parameters
Offline task script configuration
When you configure an offline task using the code editor, you must add the parameters to the task script in the required format. For more information, see Configure a task in the code editor. The following sections describe the data source parameters that you can use in the code editor.
Reader script example
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "dlf",
      "parameter": {
        "datasource": "guxuan_dlf",
        "table": "auto_ob_3088545_0523",
        "column": [
          "id",
          "col1",
          "col2",
          "col3"
        ],
        "tableType": "table",
        "where": "id > 1"
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {
        "print": false
      },
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "" // The number of error records that are allowed.
    },
    "speed": {
      "throttle": true, // Set to true to enable rate limiting. If set to false, rate limiting is disabled and the mbps parameter is ignored.
      "concurrent": 20, // The number of concurrent threads.
      "mbps": "12" // The maximum data transfer rate. Unit: MB/s.
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
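The `//` annotations in the example are not part of standard JSON, so a strict parser such as Python's `json` module rejects the script as written. If you want to lint or post-process such a script outside DataWorks, one option is to strip the comments first. The following is a minimal sketch under that assumption; `strip_line_comments` is a hypothetical helper, not a DataWorks tool, and it handles only `//` line comments.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that appear outside JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# The "setting" block from the example, including its annotations.
snippet = """
{
  "errorLimit": {
    "record": "" // The number of error records that are allowed.
  },
  "speed": {
    "throttle": true, // Set to true to enable rate limiting.
    "concurrent": 20, // The number of concurrent threads.
    "mbps": "12" // The maximum data transfer rate. Unit: MB/s.
  }
}
"""

setting = json.loads(strip_line_comments(snippet))
print(setting["speed"]["concurrent"])  # 20
```

The scanner tracks whether it is inside a string literal, so a value that contains `//`, such as a URL, is left untouched.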
Reader script parameters
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| datasource | The name of the DLF data source. | Yes | N/A |
| table | The name of the source table. | Yes | N/A |
| tableType | The table type. Valid values: | No | table |
| column | The names of columns to read from the source table. | Yes | N/A |
| where | The filter condition. | No | N/A |
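If you generate task scripts programmatically, the Required and Default columns above translate into a small builder. The following is a sketch only: `dlf_reader_step` is a hypothetical helper, not a DataWorks API, and treating an empty column list as an error is an assumption.

```python
import json


def dlf_reader_step(datasource, table, column, table_type="table", where=None):
    """Build the Reader step of a DLF offline task script as a dict."""
    if not datasource or not table:
        raise ValueError("datasource and table are required")
    if not column:
        # Assumption: an empty column list is treated as a missing required value.
        raise ValueError("column must list at least one column name")
    parameter = {
        "datasource": datasource,
        "table": table,
        "column": list(column),
        "tableType": table_type,  # documented default: table
    }
    if where:
        # Optional filter condition; left out of the script when not set.
        parameter["where"] = where
    return {
        "stepType": "dlf",
        "parameter": parameter,
        "name": "Reader",
        "category": "reader",
    }


step = dlf_reader_step(
    "guxuan_dlf",
    "auto_ob_3088545_0523",
    ["id", "col1", "col2", "col3"],
    where="id > 1",
)
print(json.dumps(step, indent=2))
```

The values passed here are the ones from the reader script example, so the printed step should match the first entry of its `steps` array apart from key order.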
Writer script example
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "stream",
      "parameter": {
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "dlf",
      "parameter": {
        "datasource": "guxuan_dlf",
        "column": [
          "id",
          "col1",
          "col2",
          "col3"
        ],
        "tableType": "table",
        "table": "auto_ob_3088545_0523"
      },
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "" // The number of error records that are allowed.
    },
    "speed": {
      "throttle": true, // Set to true to enable rate limiting. If set to false, rate limiting is disabled and the mbps parameter is ignored.
      "concurrent": 20, // The number of concurrent threads.
      "mbps": "12" // The maximum data transfer rate. Unit: MB/s.
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
Writer script parameters
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| datasource | The name of the DLF data source. | Yes | N/A |
| table | The name of the destination table. | Yes | N/A |
| tableType | The table type. Valid values: | No | table |
| column | The columns in the destination table. | Yes | N/A |
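The writer parameters and the surrounding job structure can be assembled in the same way. The following sketch uses two hypothetical helpers, `dlf_writer_step` and `build_job`, that are not DataWorks APIs. It assumes that leaving out `mbps` is acceptable when `throttle` is false, because the parameter is ignored in that case.

```python
import json


def dlf_writer_step(datasource, table, column, table_type="table"):
    """Build the Writer step of a DLF offline task script as a dict."""
    for name, value in (("datasource", datasource), ("table", table), ("column", column)):
        if not value:
            raise ValueError(f"{name} is required")
    return {
        "stepType": "dlf",
        "parameter": {
            "datasource": datasource,
            "column": list(column),
            "tableType": table_type,  # documented default: table
            "table": table,
        },
        "name": "Writer",
        "category": "writer",
    }


def build_job(reader_step, writer_step, concurrent=20, mbps=None):
    """Wrap a reader step and a writer step in the job structure of the examples."""
    # Rate limiting is enabled only when a limit is given.
    speed = {"throttle": mbps is not None, "concurrent": concurrent}
    if mbps is not None:
        speed["mbps"] = str(mbps)  # the examples pass the rate as a string
    return {
        "type": "job",
        "version": "2.0",
        "steps": [reader_step, writer_step],
        "setting": {"errorLimit": {"record": ""}, "speed": speed},
        "order": {"hops": [{"from": reader_step["name"], "to": writer_step["name"]}]},
    }


# The stream reader from the writer script example.
stream_reader = {"stepType": "stream", "parameter": {}, "name": "Reader", "category": "reader"}
writer = dlf_writer_step("guxuan_dlf", "auto_ob_3088545_0523", ["id", "col1", "col2", "col3"])
job = build_job(stream_reader, writer, mbps=12)
print(json.dumps(job, indent=2))
```

The output of `json.dumps` is strict JSON without annotations, so it can be parsed again by any JSON tool before you paste it into the code editor.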