The HDFS input component reads data from HDFS data sources. To synchronize HDFS data to other data sources, configure the HDFS input component first, and then configure the target data source.
Prerequisites
-
An HDFS data source has been created. For more information, see .
-
The account used to configure the HDFS input component must have read-through permission for the data source. If you do not have the required permission, request access. For more information, see Request Data Source Permission.
Procedure
-
In the top menu bar on the Dataphin home page, select Development > Data Integration.
-
In the top menu bar on the integration page, select a project. In Dev-Prod mode, you must also select an environment.
-
In the navigation pane on the left, click Batch Pipeline. In the Batch Pipeline list, click the batch pipeline that you want to develop to open its configuration page.
-
Click Component Library in the upper-right corner of the page to open the Component Library panel.
-
In the navigation pane on the left of the Component Library panel, select Input, find the HDFS component in the input component list on the right, and drag the component to the canvas.
-
Click the icon in the HDFS input component card to open the HDFS Input Configuration dialog box.
-
In the HDFS Input Configuration dialog box, configure the parameters.
Parameter
Description
Step Name
The name of the HDFS input component. Dataphin generates a default step name, which you can modify as needed. Naming rules:
-
The name can contain only Chinese characters, letters, underscores (_), and digits.
-
The name can be up to 64 characters in length.
Datasource
The drop-down list displays all HDFS data sources in Dataphin, including those you have read-through permission for and those you do not. Click the icon to copy the data source name.
-
For data sources without read-through permission, click Request next to the data source to request permission. For more information, see Request Data Source Permission.
-
If no HDFS data source exists, click Create Data Source to create one. For more information, see .
File Path
The absolute path of the file. You do not need to include the hdfs://{namenode}:{port} prefix because the NameNode is already configured in the data source. For example, /hadoop/input/file.txt. The system accesses the file at: hdfs://{NameNode configured for the data source}:{IPC Port configured for the data source}{the file path you entered}.
File Type
The file type. Supported file types: Text, ORC, RC, Sequence, CSV, and Parquet.
When File Does Not Exist
Specifies the behavior when the file does not exist.
-
Ignore: Skip the missing file and continue reading other files.
-
Set Task To Fail: Terminate the task and mark it as failed.
When File Is Empty
Specifies the behavior when the file is empty.
-
Ignore: Skip the empty file and continue reading other files.
-
Set Task To Fail: Terminate the task and mark it as failed.
Data Content Starting Line
Required when the file type is Text or CSV. Default value: 1 (data starts from the first line). To skip the first N lines, set this value to N+1.
File Encoding (optional)
The file encoding. Supported encodings: UTF-8 and GBK.
Field Separator (optional)
Required when the file type is Text or CSV. Specify the delimiter between fields based on the actual file format. Default: comma (,).
Compression Format (optional)
The compression format of the file. Supported formats:
-
zip
-
gzip
-
bzip2
Output Fields
The output fields of the component. You can add output fields manually:
-
Click Batch Add. JSON and TEXT formats are supported for batch configuration.
-
Batch configuration in JSON format, for example:
[{ "index": 0, "type": "double", "name": "HDFS1" }]
Note: index is the field index, type is the field type, and name is the field name.
-
Batch configuration in TEXT format, for example:
0,HDFS1,Double
1,HDFS2,String
-
Row delimiter: separates one field definition from the next. Default: line feed (\n). Semicolon (;) and period (.) are also supported.
-
Column delimiter: separates the values within a field definition, such as the field name and type. Default: comma (,).
-
Click Create Output Field, then enter the Column name and select the Type.
You can also manage added fields:
-
Click the edit icon in the Actions column to edit a field.
-
Click the delete icon in the Actions column to delete a field.
-
Click Confirm to save the HDFS input component configuration.
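Three of the parameter rules above (File Path resolution, Data Content Starting Line, and the TEXT batch format for output fields) can be illustrated with a short Python sketch. This is not Dataphin code. The function names, the host nn1.example.com, and the port 8020 are hypothetical. The sketch also assumes that each TEXT row has the order index,name,type shown in the example, and that type names are lowercased to match the JSON example.

```python
import json


def resolve_hdfs_uri(namenode, ipc_port, file_path):
    """Join the data source's NameNode and IPC port with the File Path value."""
    if not file_path.startswith("/"):
        raise ValueError("File Path must be absolute, for example /hadoop/input/file.txt")
    return f"hdfs://{namenode}:{ipc_port}{file_path}"


def data_start_line(lines_to_skip):
    """Data Content Starting Line is 1-based: to skip N lines, use N + 1."""
    if lines_to_skip < 0:
        raise ValueError("lines_to_skip must not be negative")
    return lines_to_skip + 1


def text_to_json_fields(text, row_delimiter="\n", column_delimiter=","):
    """Convert TEXT batch rows (index,name,type) into the JSON batch structure."""
    fields = []
    for row in text.split(row_delimiter):
        row = row.strip()
        if not row:
            continue  # ignore blank rows, such as a trailing line feed
        # Assumption: every row has exactly three values in this order.
        index, name, field_type = (part.strip() for part in row.split(column_delimiter))
        fields.append({
            "index": int(index),
            "type": field_type.lower(),  # assumption: lowercase, as in the JSON example
            "name": name,
        })
    return fields


if __name__ == "__main__":
    # Hypothetical NameNode host and IPC port, for illustration only.
    print(resolve_hdfs_uri("nn1.example.com", 8020, "/hadoop/input/file.txt"))
    print(data_start_line(1))  # value to enter when the file has one header line
    print(json.dumps(text_to_json_fields("0,HDFS1,Double\n1,HDFS2,String")))
```

To try one of the alternative row delimiters, pass it explicitly, for example text_to_json_fields("0,HDFS1,Double;1,HDFS2,String", row_delimiter=";").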