Configure the Hive output component

更新时间:
复制 MD 格式

The Hive output component writes data to a Hive data source. To synchronize data from other data sources to a Hive data source, you must configure the Hive output component to specify the target data source after you have configured the source.

Limitations

The Hive output component supports writing data to Hive tables in orc, parquet, text, Hudi, Iceberg, and Paimon formats. The Hudi format is available only for Hive compute engines or data sources on Cloudera Data Platform 7.x. The Iceberg and Paimon formats are available only for Hive compute engines or data sources on E-MapReduce 5.x. The component does not support writing to orc transactional tables or Kudu tables.

Note

To integrate data from Kudu tables, use the Impala output component. For more information, see Configure the Impala output component.

Prerequisites

  • A Hive data source is created. For more information, see Create a Hive data source.

  • To configure the Hive output component, your account must have sync write permissions on the data source. If you do not have the required permissions, apply for them. For more information, see Apply for data source permissions.

Procedure

  1. On the Dataphin homepage, in the top menu bar, choose Develop > Data Integration.

  2. In the top menu bar of the integration page, select a Project. If you are in Dev-Prod mode, also select an Environment.

  3. In the left navigation pane, click Offline Integration. In the Offline Integration list, click the offline pipeline that you want to develop to open its configuration page.

  4. In the upper-right corner of the page, click Component Library to open the component library panel.

  5. In the component library panel, click Output in the left navigation pane. Find the Hive component in the list on the right and drag it to the canvas.

  6. Click and drag the image icon from a source, transformation, or flow component and connect it to the Hive output component.

  7. Click the image icon on the Hive output component card to open the Hive Output Configuration dialog box.image

  8. In the Hive Output Configuration dialog box, configure the parameters.

    Required parameters vary depending on whether you are writing to a Hive, Hudi, or Paimon table.

    Hive table

    Parameter

    Description

    Basic Settings

    Step Name

    The name of the Hive output component. Dataphin generates a default name, which you can change. Naming rules:

    • Can contain only Chinese characters, letters, underscores (_), and digits.

    • Cannot exceed 64 characters.

    Data Source

    From the drop-down list, select a Hive data source. The list includes all Hive data sources, whether or not you have sync write permissions for them. Click the image icon to copy the current data source name.

    • If you do not have sync write permissions for a data source, click Apply next to it to request permissions. For more information, see Apply for data source permissions.

    • If you do not have a Hive data source, click Create Data Source to create one. For more information, see Create a Hive data source.

    Table

    Select the target Hive table. You can enter a keyword to search for a table or enter the exact table name and click Precise Search. After you select a table, the system automatically checks its status. Click the image icon to copy the selected table name.

    If the target table for data synchronization does not exist in the Hive data source, you can use the one-click table creation feature to create it. The steps are as follows:

    1. Click One-click table creation. Dataphin automatically generates a DDL statement for the target table, including the table name (defaults to the source table name) and field types (which are preliminarily converted from Dataphin field types).

    2. Set Data lake table format to None or Iceberg.

      Note

      You can select Iceberg only if the data lake table format is supported by the selected data source or the project's compute engine.

    3. Set Execution engine to Hive or Spark.

      Note

      You can select an execution engine only when Data lake table format is set to Iceberg. If the selected data source is configured for Spark, Spark is selected by default. Otherwise, only Hive is available.

    4. A DDL statement is automatically generated based on the selected data lake table format and execution engine. You can modify this statement and then click Create. After you create the table, Dataphin automatically sets it as the target table.

      Note
      • If a table with the same name already exists in the development environment, Dataphin reports an error.

      • If no matches are found, you can still perform the integration by manually entering a table name.

    Missing production table policy

    Specifies the policy for handling cases where the production table does not exist. You can select Do nothing or Automatically create. The default is Automatically create. If you select Do nothing, the task is published without creating the production table. If you select Automatically create, a table with the same name is created in the target environment when the task is published.

    • Do nothing: If the target table does not exist, a warning is displayed upon submission, but the task can still be published. You must manually create the target table in the production environment before you can run the task.

    • Automatically create: You must edit the DDL statement. By default, the DDL statement of the selected table is populated, which you can modify. The table name in the DDL statement must be the placeholder ${table_name}. This placeholder is replaced with the actual table name at runtime.

      If table creation fails, the publish check will fail. You can then correct the statement and republish. If the table already exists, the DDL statement is not executed.

    Note

    This parameter is available only in Dev-Prod mode projects.

    File encoding

    Select the encoding for the files stored in Hive. The options for file encoding are UTF-8 and GBK.

    Load policy

    The policy for writing data to a table in the target data source (a Hive data source). Load policies include Overwrite all data, Append data, and Overwrite only the data written by the integration task. The applicable scenarios are described as follows:

    • Overwrite all data: Deletes all existing data in the target table or partition, and then adds new data files prefixed with the table name.

    • Append data: Appends new data directly to the target table.

    • Overwrite only data written by the integration task: Deletes only the data files in the target table or partition that are prefixed with the table name. Data written by other methods, such as SQL queries, is not deleted.

    NULL replacement (Optional)

    This parameter applies only to target tables in the textfile format. Specify a string to be replaced with NULL. For example, if you enter \N, all occurrences of the string \N are replaced with NULL.

    Field delimiter (Optional)

    This parameter applies only to target tables in the textfile format. Specify the character that separates fields. If you do not specify a delimiter, the system uses \u0001 by default.

    Compression format (Optional)

    Select a compression format for the files. The available options depend on the storage format of the data in Hive:

    • If the storage format is orc, supported formats are zlib and snappy.

    • If the storage format is parquet, supported formats are snappy and gzip.

    • If the storage format is textfile, supported formats are gzip, bzip2, lzo, lzo_deflate, hadoop-snappy, and zlib.

    Field delimiter handling (Optional)

    This parameter applies only to output tables in the textfile format. If your data contains characters that are also used as the field delimiter, configure this policy to prevent write errors. You can choose to Keep, Remove, or Replace with.

    Row delimiter handling (Optional)

    This parameter applies only to output tables in the textfile format. The default row delimiter is \n. If your data contains newline characters such as \r or \n, configure this policy to prevent write errors. You can choose to Keep, Remove, or Replace with.

    Hadoop parameters (Optional)

    Specifies advanced write parameters to tune performance. You can specify different parameters for different table types. Use a JSON format: {"key1":"value1", "key2":"value2"}. For example, for an ORC-formatted output table with many fields, you can adjust the {"hive.exec.orc.default.buffer.size"} parameter based on available memory. If memory is sufficient, increasing this value can improve write performance. If memory is limited, decreasing it can reduce garbage collection (GC) time and improve write stability. The default is 16384Byte (16 KB), and the recommended maximum is 262144Byte (256 KB).

    Partition

    If the target table is partitioned, you must specify the partition information. For example, state_date=20190101. You can also use parameters for daily incremental writes, such as state_date=${bizdate}.

    Pre-execution statement (Optional)

    An SQL script to execute on the database before data is imported.

    For example, to maintain service continuity, you can perform an atomic swap. First, write the new data to a temporary table (Target_A). After the write is complete, rename the current production table (Service_B) to a backup (Temp_C), rename the temporary table (Target_A) to Service_B, and finally drop the backup table (Temp_C).

    Post-execution statement (Optional)

    An SQL script to execute on the database after data is imported.

    Field mapping

    Input Fields

    Displays the input fields from the upstream component.

    Output Fields

    Displays all fields of the selected target table.

    Important

    To prevent data errors, you must map an input field to every output field.

    Mapping relationship

    Based on the upstream input and the fields of the target table, you can manually select the field mapping. Mapping relationships include map by position and map by name.

    • Map by name: Maps fields with the same name.

    • Map by position: Maps fields based on their order. Use this when the source and target field names differ, but the data needs to be mapped between fields in the same position.

    Hudi table

    Parameter

    Description

    Basic Settings

    Step Name

    The name of the Hive output component. Dataphin generates a default name, which you can change. Naming rules:

    • Can contain only Chinese characters, letters, underscores (_), and digits.

    • Cannot exceed 64 characters.

    Data Source

    From the drop-down list, select a Hive data source. The list includes all Hive data sources, whether or not you have sync write permissions for them. Click the image icon to copy the current data source name.

    • If you do not have sync write permissions for a data source, click Apply next to it to request permissions. For more information, see Apply for data source permissions.

    • If you do not have a Hive data source, click Create Data Source to create one. For more information, see Create a Hive data source.

    Table

    Select the target Hudi table. You can enter a keyword to search for a table or enter the exact table name and click Precise Search. After you select a table, the system automatically checks its status. Click the image icon to copy the selected table name.

    If the target table for data synchronization does not exist in the Hive data source, you can use the one-click table creation feature to create it. The steps are as follows:

    1. Click One-click table creation. Dataphin automatically generates a DDL statement for the target table, including the table name (defaults to the source table name) and field types (which are preliminarily converted from Dataphin field types).

    2. Set Data lake table format to Hudi.

      • Hudi table type: Select MOR (merge on read) or COW (copy on write). The default is MOR (merge on read).

      • Primary key field (Optional): Enter the primary key fields. Separate multiple fields with a comma (,).

      • Extended properties (Optional): Enter Hudi configuration properties in the format k=v.

        Note
        • If a table with the same name already exists in the development environment, Dataphin reports an error when you click Create.

        • If no matches are found, you can still perform the integration by manually entering a table name.

    3. Set Execution engine to Hive or Spark.

      Note

      You can select an execution engine only when Data lake table format is set to Hudi. The default is Hive. If Spark is enabled for the selected data source, you can select Spark.

    4. A DDL statement is automatically generated based on the selected data lake table format and execution engine. You can modify this statement and then click Create.

    Missing production table policy

    Specifies the policy for handling cases where the production table does not exist. You can select Do nothing or Automatically create. The default is Automatically create. If you select Do nothing, the task is published without creating the production table. If you select Automatically create, a table with the same name is created in the target environment when the task is published.

    • Do nothing: If the target table does not exist, a warning is displayed upon submission, but the task can still be published. You must manually create the target table in the production environment before you can run the task.

    • Automatically create: You must edit the DDL statement. By default, the DDL statement of the selected table is populated, which you can modify. The table name in the DDL statement must be the placeholder ${table_name}. This placeholder is replaced with the actual table name at runtime.

      If table creation fails, the publish check will fail. You can then correct the statement and republish. If the table already exists, the DDL statement is not executed.

    Note

    This parameter is available only in Dev-Prod mode projects.

    Partition

    If the target table is partitioned, you must specify the partition information. For example, state_date=20190101. You can also use parameters for daily incremental writes, such as state_date=${bizdate}.

    Load policy

    The policy for writing data to the target table. Options include Overwrite data, Append data, and Update data.

    • Overwrite data: Overwrites existing data with new data.

    • Append data: Appends new data directly to the target table.

    • Update data: Updates existing records based on the primary key, or inserts new records if they do not exist (an upsert operation).

      Note

      Data written by other methods, such as SQL queries, is not deleted.

    BulkInsert

    Enables fast synchronization for high-volume, large-scale datasets, and is ideal for initial data ingestion.

    Note

    This parameter is available and enabled by default only when the load policy is set to Append data or Overwrite data.

    Batch Write

    Writes data to the target table in batches. If enabled, you must also configure the Batch Ratio.

    Batch Ratio

    The proportion of total JVM memory to use. The default is 0.2. You can enter a two-decimal value between 0.01 and 0.50.

    Hadoop parameters (Optional)

    Specifies advanced write parameters to tune performance. Use a JSON format: {"key1":"value1", "key2":"value2"}. For example, you can use the parameter {"hoodie.parquet.compression.codec":"snappy"} to change the compression format to snappy.

    Field mapping

    Input Fields

    Displays the input fields from the upstream component.

    Output Fields

    Displays all fields of the selected target table.

    Note

    For Hudi tables, you do not need to map all fields.

    Mapping relationship

    You can manually select field mappings based on the upstream input and the fields of the target table. The mapping relationships include map by position and map by name.

    • Map by name: Maps fields with the same name.

    • Map by position: Maps fields based on their order. Use this when the source and target field names differ, but the data needs to be mapped between fields in the same position.

    Paimon table

    Parameter

    Description

    Basic Settings

    Step Name

    The name of the Hive output component. Dataphin generates a default name, which you can change. Naming rules:

    • Can contain only Chinese characters, letters, underscores (_), and digits.

    • Cannot exceed 64 characters.

    Data Source

    From the drop-down list, select a Hive data source. The list includes all Hive data sources, whether or not you have sync write permissions for them. Click the image icon to copy the current data source name.

    • If you do not have sync write permissions for a data source, click Apply next to it to request permissions. For more information, see Apply for data source permissions.

    • If you do not have a Hive data source, click Create Data Source to create one. For more information, see Create a Hive data source.

    Table

    Select the target Paimon table. You can enter a keyword to search for a table or enter the exact table name and click Precise Search. After you select a table, the system automatically checks its status. Click the image icon to copy the selected table name.

    If the target table for data synchronization does not exist in the Hive data source, you can use the one-click table creation feature to create it. The steps are as follows:

    1. Click One-click table creation. Dataphin automatically generates a DDL statement for the target table, including the table name (defaults to the source table name) and field types (which are preliminarily converted from Dataphin field types).

    2. Set Data lake table format to None, Iceberg, or Paimon.

      Note

      You can select Iceberg only if the data lake table format is supported by the selected data source or the project's compute engine.

    3. Set Execution engine to Hive or Spark.

      Note

      You can select an execution engine only when Data lake table format is set to Iceberg or Paimon. If the selected data source is configured for Spark, Spark is selected by default. Otherwise, only Hive is available.

    4. For Paimon table type, you can select MOR (merge on read), COW (copy on write), or MOW (merge on write). The default is MOR.

      Note

      This option is available only when the data lake table format is set to Paimon.

    5. A DDL statement is automatically generated based on the selected data lake table format and execution engine. You can modify this statement and then click Create. After you create the table, Dataphin automatically sets it as the target table.

      Note
      • If a table with the same name already exists in the development environment, Dataphin reports an error when you click Create.

      • If no matches are found, you can still perform the integration by manually entering a table name.

    Missing production table policy

    Specifies the policy for handling cases where the production table does not exist. You can select Do nothing or Automatically create. The default is Automatically create. If you select Do nothing, the task is published without creating the production table. If you select Automatically create, a table with the same name is created in the target environment when the task is published.

    • Do nothing: If the target table does not exist, a warning is displayed upon submission, but the task can still be published. You must manually create the target table in the production environment before you can run the task.

    • Automatically create: You must edit the DDL statement. By default, the DDL statement of the selected table is populated, which you can modify. The table name in the DDL statement must be the placeholder ${table_name}. This placeholder is replaced with the actual table name at runtime.

      If table creation fails, the publish check will fail. You can then correct the statement and republish. If the table already exists, the DDL statement is not executed.

    Note

    This parameter is available only in Dev-Prod mode projects.

    Load policy

    The load policy determines how data is written to a table in the target data source (a Hive data source). Load policies include Append Data, Overwrite Data, and Update Data. The applicable scenarios are described as follows:

    • Append data: Appends new data directly to the target table.

    • Overwrite data: Overwrites existing data with new data.

    • Update data: Updates existing records based on the primary key, or inserts new records if they do not exist (an upsert operation).

    NULL replacement (Optional)

    This parameter applies only to target tables in the textfile format. Specify a string to be replaced with NULL. For example, if you enter \N, all occurrences of the string \N are replaced with NULL.

    Field delimiter (Optional)

    This parameter applies only to target tables in the textfile format. Specify the character that separates fields. If you do not specify a delimiter, the system uses \u0001 by default.

    Field delimiter handling (Optional)

    This parameter applies only to output tables in the textfile format. If your data contains characters that are also used as the field delimiter, configure this policy to prevent write errors. You can choose to Keep, Remove, or Replace with.

    Row delimiter handling (Optional)

    This parameter applies only to output tables in the textfile format. The default row delimiter is \n. If your data contains newline characters such as \r or \n, configure this policy to prevent write errors. You can choose to Keep, Remove, or Replace with.

    Hadoop parameters (Optional)

    Specifies advanced write parameters to tune performance. You can specify different parameters for different table types. Use a JSON format: {"key1":"value1", "key2":"value2"}. For example, for an ORC-formatted output table with many fields, you can adjust the {"hive.exec.orc.default.buffer.size"} parameter based on available memory. If memory is sufficient, increasing this value can improve write performance. If memory is limited, decreasing it can reduce garbage collection (GC) time and improve write stability. The default is 16384Byte (16 KB), and the recommended maximum is 262144Byte (256 KB).

    Partition

    If the target table is partitioned, you must specify the partition information. For example, state_date=20190101. You can also use parameters for daily incremental writes, such as state_date=${bizdate}.

    Field mapping

    Input Fields

    Displays the input fields from the upstream component.

    Output Fields

    Displays all fields of the selected target table.

    Important

    To prevent data errors, you must map an input field to every output field.

    Mapping relationship

    You can manually select field mappings based on the upstream input and the fields in the target table. The Mapping Relationship includes Map by Position and Map by Name.

    • Map by name: Maps fields with the same name.

    • Map by position: Maps fields based on their order. Use this when the source and target field names differ, but the data needs to be mapped between fields in the same position.

  9. Click Confirm to save the Hive output component configuration.