Configure the Hive input component

The Hive input component reads data from a Hive data source. Configure this component as the input when synchronizing Hive data to another data source.

Limitations

Supported Hive table formats: orc, parquet, text, rc, seq, and iceberg. The iceberg format requires Hive compute sources or data sources on E-MapReduce 5.x. ORC transactional tables and Kudu tables are not supported.

Note

To integrate data from a Kudu table, use the Impala input component instead. For more information, see Configure the Impala input component.

Prerequisites

Procedure

  1. In the top navigation bar of the Dataphin homepage, choose Develop > Data Integration.

  2. On the Data Integration page, select a Project. In Dev-Prod mode, also select an environment.

  3. In the left-side navigation pane, click Offline Integration. In the Offline Integration list, click the offline pipeline you want to develop.

  4. In the upper-right corner of the page, click Component Library to open the component library panel.

  5. In the component library panel, select Input, find the Hive component, and drag it onto the canvas.

  6. Click the image icon on the Hive component to open the Hive Input Configuration dialog box.

  7. In the Hive Input Configuration dialog box, configure the parameters.

    Parameter

    Description

    Step name

    The name of the Hive input component. Dataphin auto-generates a name that you can customize. Requirements:

    • Can contain only Chinese characters, letters, underscores (_), and numbers.

    • Cannot exceed 64 characters.

    Data source

    Select a Hive data source. The list shows all Hive data sources, whether or not you have sync read permission on them. Click the copy icon to copy the data source name.

    Table

    Select the source table. Click the copy icon to copy the name of the selected table.

    Note

    If you select a Hudi table or a Paimon table, you can only configure the Partition parameter.

    Partition

    You can read data from a static partition, such as ds=20230101 or ds1=2023,ds2=01, or a range partition, such as /*query*/ds >=20230101 and ds <= 20230107.

    Note

    If the selected table is a Hudi table or a Paimon table, reading from a range partition is not supported.

    Action on partition not found

    Action when the specified partition does not exist:

    • Fail task: The task fails.

    • Succeed task, no data written: The task succeeds, but writes no data to the target table.

    File encoding

    Select the encoding of the source files. Supported encodings are UTF-8 and GBK.

    NULL value replacement

    Applies only to textfile tables. Enter the string that should be read as NULL. For example, if you enter \N, every field whose value is the string \N is read as NULL.

    Compression format

    Optional. Select the compression format for compressed files. ORC tables default to zlib; other formats have no default. Supported formats: zlib, hadoop-snappy, lz4, and none.

    Field delimiter

    Enter the field delimiter used in the table. This is typically the delimiter set by the ROW FORMAT DELIMITED FIELDS TERMINATED BY clause when the table was created. The default is \u0001.

    Output fields

    Lists all fields from the selected table that match the filter criteria. Remove fields that you do not need in downstream components.

    Note

    The data classification of output fields is visible only when using the Hadoop compute engine.

    • To remove a single field: Click the icon in the Actions column for that field.

    • To remove fields in bulk: Click Field Management. In the Field Management dialog box, select the fields to remove, click the left arrow icon to move them to the unselected list, and click OK.


  8. Click Confirm to save the component's configuration.
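As an illustration of the Partition, File encoding, Field delimiter, and NULL value replacement parameters in step 7, the following Python sketch shows one way such values could be interpreted. It is not Dataphin code. The function names `classify_partition` and `decode_text_row` are hypothetical, and treating everything after the `/*query*/` marker as a filter condition is an assumption based on the example in the Partition description.

```python
RANGE_PREFIX = "/*query*/"  # marker used in the range partition example


def classify_partition(expr):
    """Return ("range", condition) or ("static", {key: value, ...})."""
    expr = expr.strip()
    if expr.startswith(RANGE_PREFIX):
        # Assumption: the text after the marker is a filter on partition columns.
        return "range", expr[len(RANGE_PREFIX):].strip()
    # Static partition: comma-separated key=value pairs, such as ds1=2023,ds2=01.
    spec = {}
    for pair in expr.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip() or not value.strip():
            raise ValueError(f"not a key=value pair: {pair!r}")
        spec[key.strip()] = value.strip()
    return "static", spec


def decode_text_row(raw, encoding="utf-8", delimiter="\u0001", null_token=r"\N"):
    """Decode one textfile row, split it into fields, and map null_token to None."""
    line = raw.decode(encoding).rstrip("\n")
    return [None if field == null_token else field
            for field in line.split(delimiter)]


print(classify_partition("ds1=2023,ds2=01"))
# ('static', {'ds1': '2023', 'ds2': '01'})
print(decode_text_row(b"1\x01\\N\x01abc\n"))
# ['1', None, 'abc']
```

The second call uses the default \u0001 delimiter and \N as the NULL string, matching the defaults and example described in the parameter table. For a GBK-encoded source file, pass `encoding="gbk"`.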