The Hive input component reads data from a Hive data source. Configure this component as the input when synchronizing Hive data to another data source.
Limitations
Supported Hive table formats: orc, parquet, text, rc, seq, and iceberg. The iceberg format requires Hive compute sources or data sources on E-MapReduce 5.x. ORC transactional tables and Kudu tables are not supported.
To integrate data from a Kudu table, use the Impala input component. For details, see Configure the Impala input component.
Prerequisites
- A Hive data source is created. For details, see Create a Hive data source.
- Your account has sync read permissions for the data source. For details, see Apply for data source permissions.
Procedure
- In the top navigation bar of the Dataphin homepage, choose Develop > Data Integration.
- On the Data Integration page, select a Project. In Dev-Prod mode, also select an environment.
- In the left-side navigation pane, click Offline Integration. In the Offline Integration list, click the offline pipeline you want to develop.
- In the upper-right corner of the page, click Component Library to open the component library panel.
- In the component library panel, select Input, find the Hive component, and drag it onto the canvas.
- Click the icon on the Hive component to open the Hive Input Configuration dialog box.
- In the Hive Input Configuration dialog box, configure the parameters.
Parameter
Description
Step name
The name of the Hive input component. Dataphin auto-generates a name, which you can customize. Requirements:
- Can contain only Chinese characters, letters, underscores (_), and numbers.
- Cannot exceed 64 characters.
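The two naming requirements above can be checked before a name is entered. The following is a minimal sketch, not Dataphin's own validation: the function name `is_valid_step_name` is hypothetical, "Chinese characters" is approximated by the CJK Unified Ideographs block, "letters" is assumed to mean ASCII letters, and an empty name is assumed to be invalid.

```python
import re

# Assumption: "Chinese characters" is approximated by the CJK Unified
# Ideographs block (U+4E00 to U+9FFF), and "letters" means ASCII letters.
# Dataphin's real check may accept a different character set.
_STEP_NAME_RE = re.compile(r"[\u4e00-\u9fffA-Za-z0-9_]{1,64}")


def is_valid_step_name(name: str) -> bool:
    """Return True if name uses only allowed characters and has 1 to 64 of them."""
    # fullmatch requires the whole string to match, so any disallowed
    # character or a length above 64 makes the check fail.
    return _STEP_NAME_RE.fullmatch(name) is not None
```

For example, `is_valid_step_name("hive_input_1")` passes, while a name that contains a hyphen or is longer than 64 characters does not.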
Data source
Select a Hive data source. The list shows all Hive data sources, regardless of whether you have sync read permissions for them. Click the icon to copy the data source name.
- If you lack sync read permissions, click Apply next to the data source to request access. For details, see Apply for data source permissions.
- If no Hive data source exists, click New Data Source. For details, see Create a Hive data source.
Table
Select the source table. Click the icon to copy the name of the selected table.
Note: If you select a Hudi table or a Paimon table, you can configure only the Partition parameter.
Partition
You can read data from a static partition, such as ds=20230101 or ds1=2023,ds2=01, or from a range partition, such as /*query*/ds >= 20230101 and ds <= 20230107.
Note: If the selected table is a Hudi table or a Paimon table, reading from a range partition is not supported.
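To make the two partition forms concrete, the sketch below classifies a Partition value as static or range. It is an illustration only and does not reflect how Dataphin parses the field: the function name `parse_partition` is hypothetical, and it assumes, based solely on the examples above, that a range expression is marked by the `/*query*/` prefix and that a static partition is a comma-separated list of `key=value` pairs.

```python
QUERY_PREFIX = "/*query*/"  # assumed marker for a range (filter) expression


def parse_partition(expr: str):
    """Classify a partition expression.

    Returns ("range", filter_text) or ("static", {key: value, ...}).
    """
    expr = expr.strip()
    if expr.startswith(QUERY_PREFIX):
        # Everything after the prefix is treated as the filter condition.
        return ("range", expr[len(QUERY_PREFIX):].strip())

    spec = {}
    for part in expr.split(","):
        key, sep, value = part.partition("=")
        key, value = key.strip(), value.strip()
        # Reject pieces such as "ds>=2023" that are not plain key=value pairs.
        if not sep or not key.isidentifier() or not value:
            raise ValueError(f"not a static partition spec: {part!r}")
        spec[key] = value
    return ("static", spec)
```

With this sketch, `parse_partition("ds1=2023,ds2=01")` yields a static spec with two keys, and an expression that starts with `/*query*/` is returned as a range filter with the prefix removed.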
Action on partition not found
The action to take when the specified partition does not exist:
- Fail task: The task fails.
- Succeed task, no data written: The task succeeds but writes no data to the target table.
File encoding
Select the encoding of the source files. Supported encodings are UTF-8 and GBK.
NULL value replacement
Applies only to textfile tables. Enter the string to replace with NULL. For example, entering \N replaces all \N values with NULL.
Compression format
Optional. Select the compression format for compressed files. ORC tables default to zlib; other formats have no default. Supported formats: zlib, hadoop-snappy, lz4, and none.
Field delimiter
Enter the field delimiter used in the table, typically set with the ROW FORMAT DELIMITED FIELDS TERMINATED BY statement. Defaults to \u0001.
Output fields
Lists all fields from the selected table that match the filter criteria. Remove fields that you do not need in downstream components.
Note: The data classification of output fields is visible only when using the Hadoop compute engine.
- To remove a single field: Click the icon in the Actions column to remove the field.
- To remove fields in bulk: Click Field Management. In the Field Management dialog box, select the fields to remove, click the left arrow icon to move them to the unselected list, and click OK.
- Click Confirm to save the component's configuration.
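The File encoding, Field delimiter, and NULL value replacement parameters together determine how one line of a textfile table is interpreted. The following sketch shows that interaction on a single raw line. It is a simplified model rather than the component's implementation: the function `parse_text_row` and its parameter names are hypothetical, and compression is left out.

```python
def parse_text_row(raw: bytes, encoding: str = "utf-8",
                   delimiter: str = "\u0001", null_token: str = "\\N"):
    """Decode one raw line and split it into fields.

    encoding   -- corresponds to File encoding (for example "utf-8" or "gbk")
    delimiter  -- corresponds to Field delimiter (\u0001 is the stated default)
    null_token -- corresponds to NULL value replacement (here the two
                  characters backslash and N)
    """
    line = raw.decode(encoding).rstrip("\r\n")
    # A field equal to the NULL token becomes None; all others stay strings.
    return [None if field == null_token else field
            for field in line.split(delimiter)]
```

For instance, a UTF-8 line whose second field is the two characters `\N` is returned with `None` in the second position, and the same call with `encoding="gbk"` handles a GBK-encoded file.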
