Configure the Hive input component

The Hive input component reads data from a Hive data source. Configure this component as the input when synchronizing Hive data to another data source.

Limitations

Supported Hive table formats: orc, parquet, text, rc, seq, and iceberg. The iceberg format requires Hive compute sources or data sources on E-MapReduce 5.x. ORC transactional tables and Kudu tables are not supported.

Note

To integrate data from a Kudu table, use the Impala input component. For more information, see Configure the Impala input component.
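
The limitations above can be expressed as a small pre-check. The following Python sketch is illustrative only: `check_hive_table` and its `transactional` and `on_emr5` flags are hypothetical names, not part of Dataphin, and the caller is assumed to already know the table's storage format.

```python
from typing import Optional

# Formats listed as supported in the Limitations section.
SUPPORTED_FORMATS = {"orc", "parquet", "text", "rc", "seq", "iceberg"}


def check_hive_table(fmt: str, *, transactional: bool = False,
                     on_emr5: bool = False) -> Optional[str]:
    """Return None if the table can be read, otherwise a reason string.

    `transactional` and `on_emr5` are hypothetical flags supplied by the
    caller; this sketch does not query Hive or Dataphin.
    """
    fmt = fmt.lower()
    if fmt == "kudu":
        return "Kudu tables are not supported; use the Impala input component"
    if fmt not in SUPPORTED_FORMATS:
        return f"unsupported table format: {fmt}"
    if fmt == "orc" and transactional:
        return "ORC transactional tables are not supported"
    if fmt == "iceberg" and not on_emr5:
        return ("iceberg requires a Hive compute source or data source "
                "on E-MapReduce 5.x")
    return None


if __name__ == "__main__":
    for fmt in ("parquet", "iceberg", "kudu"):
        print(fmt, "->", check_hive_table(fmt) or "ok")
```

Returning a reason string instead of raising keeps the check easy to surface in a validation report.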

Prerequisites

A Hive data source is created. For more information, see Create a Hive data source.
Your account has sync read permissions for the data source. To request them, see Apply for data source permissions.

Procedure

In the top navigation bar of the Dataphin homepage, choose Develop > Data Integration.
On the Data Integration page, select a Project. In Dev-Prod mode, also select an environment.
In the left-side navigation pane, click Offline Integration. In the Offline Integration list, click the offline pipeline you want to develop.
In the upper-right corner of the page, click Component Library to open the component library panel.
In the component library panel, select Input, find the Hive component, and drag it onto the canvas.
Click the icon on the Hive component to open the Hive Input Configuration dialog box.

In the Hive Input Configuration dialog box, configure the parameters.

Parameter	Description
Step name	The name of the Hive input component. Dataphin auto-generates a name, which you can change. The name can contain only Chinese characters, letters, underscores (_), and numbers, and cannot exceed 64 characters.
Data source	Select a Hive data source. The list shows all Hive data sources, including those for which you lack sync read permissions. Click the icon to copy the data source name. If you lack sync read permissions, click Apply next to the data source to request access. For more information, see Apply for data source permissions. If no Hive data source exists, click New Data Source. For more information, see Create a Hive data source.
Table	Select the source table. Click the icon to copy the name of the selected table. Note: If you select a Hudi table or a Paimon table, you can configure only the Partition parameter.
Partition	You can read data from a static partition, such as `ds=20230101` or `ds1=2023,ds2=01`, or from a range partition, such as `/*query*/ds>=20230101 and ds<=20230107`. Note: If the selected table is a Hudi table or a Paimon table, reading from a range partition is not supported.
Action on partition not found	Action when the specified partition does not exist: Fail task: The task fails. Succeed task, no data written: The task succeeds, but writes no data to the target table.
File encoding	Select the encoding of the source files. Supported encodings are UTF-8 and GBK.
NULL value replacement	Applies only to `textfile` tables. Enter the string that should be read as `NULL`. For example, if you enter `\N`, every `\N` value in the source is replaced with `NULL`.
Compression format	Optional. Select the compression format for compressed files. ORC tables default to zlib; other formats have no default. Supported formats: zlib, hadoop-snappy, lz4, and none.
Field delimiter	Enter the field delimiter used in the table, typically set with the `ROW FORMAT DELIMITED FIELDS TERMINATED BY` clause when the table is created. Defaults to `\u0001`.
Output fields	Lists all fields from the selected table that match the filter criteria. Remove the fields that downstream components do not need. Note: The data classification of output fields is visible only when you use the Hadoop compute engine. To remove a single field, click the icon in the Actions column. To remove fields in bulk, click Field Management; in the Field Management dialog box, select the fields to remove, click the left arrow icon to move them to the unselected list, and click OK.
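
The Step name and Partition rules in the table can be checked before you fill in the dialog box. The Python sketch below is a rough illustration, not Dataphin's own validation: all function names are hypothetical, "Chinese characters" is assumed to mean the CJK Unified Ideographs range U+4E00 to U+9FFF, an empty name is assumed to be invalid, and a partition value is treated as a range whenever it contains a `<` or `>` comparison.

```python
import re

# Letters, digits, underscores, and CJK ideographs (assumed range), 1 to 64 chars.
STEP_NAME_RE = re.compile(r"[A-Za-z0-9_\u4e00-\u9fff]{1,64}")

# One or more comma-separated key=value pairs, e.g. ds1=2023,ds2=01.
STATIC_PARTITION_RE = re.compile(r"\w+=[^,=\s<>]+(,\w+=[^,=\s<>]+)*")


def is_valid_step_name(name: str) -> bool:
    """Check the character set and the 64-character limit."""
    return STEP_NAME_RE.fullmatch(name) is not None


def classify_partition(expr: str) -> str:
    """Return 'static', 'range', or 'unknown' for a Partition value."""
    expr = expr.strip()
    if STATIC_PARTITION_RE.fullmatch(expr):
        return "static"
    if re.search(r"[<>]", expr):
        # Heuristic: a comparison operator marks a range expression.
        return "range"
    return "unknown"


def check_partition(expr: str, table_kind: str = "hive") -> str:
    """Reject range partitions for Hudi and Paimon tables."""
    kind = classify_partition(expr)
    if kind == "range" and table_kind.lower() in {"hudi", "paimon"}:
        raise ValueError(
            f"range partitions are not supported for {table_kind} tables")
    return kind


if __name__ == "__main__":
    print(is_valid_step_name("hive_input_1"))
    print(classify_partition("ds1=2023,ds2=01"))
    print(classify_partition("ds >=20230101 and ds <= 20230107"))
```

The classifier is deliberately loose; it only separates the two shapes shown in the table and does not parse the expression.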

Click Confirm to save the component's configuration.
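
To make the File encoding, NULL value replacement, and Field delimiter parameters concrete, the following Python sketch decodes one row of a `textfile` table the way those settings describe. It is an assumption-laden illustration rather than Dataphin's reader: `decode_text_row` is a hypothetical helper, the row is assumed to be a single newline-terminated record, and escaping of delimiters inside values is ignored.

```python
from typing import List, Optional


def decode_text_row(raw: bytes, *, encoding: str = "utf-8",
                    delimiter: str = "\u0001",
                    null_token: str = r"\N") -> List[Optional[str]]:
    """Split one raw text row into fields.

    encoding   -> File encoding (UTF-8 or GBK in the dialog box)
    delimiter  -> Field delimiter (defaults to \\u0001)
    null_token -> NULL value replacement string; matching fields become None
    """
    line = raw.decode(encoding).rstrip("\r\n")
    return [None if field == null_token else field
            for field in line.split(delimiter)]


if __name__ == "__main__":
    row = "1\u0001\\N\u0001abc\n".encode("utf-8")
    print(decode_text_row(row))
```

Choosing the wrong encoding for GBK files typically shows up as a decode error or garbled values, which is why the setting must match the source files.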