Create a Hive data source

更新时间: 2026-05-22 06:06:25

A Hive data source allows Dataphin to read business data from or write data to Hive. This topic describes the procedure.

Background information

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL query capabilities. Hive transforms HQL or SQL statements into programs for execution engines like MapReduce and Tez. To use Hive with Dataphin for data development or to write data to Hive, you must first create a Hive data source. For more information, see the Hive official website.

Limits

  • To use Hive external tables from OSS for offline integration with an Aliyun E-MapReduce 5.x Hadoop compute engine, you must perform additional configuration. For configuration instructions, see .

  • To use a Hive data source for integration, add Dataphin's IP addresses to the Hive network whitelist to grant access.

  • Ensure network connectivity between the Dataphin application and scheduling clusters and the following services: the Hive service, HDFS NameNode (including its web UI and IPC) and DataNodes, KDC Server, metadata database or Hive Metastore, and ZooKeeper.

  • To use a Hive data source, you must first purchase the Metadata Collection-Big Data Storage feature and create a registered scheduling cluster that can access it.

Permissions

Only users with the following roles can create data sources: a super administrator, a data source administrator, a domain architect, a project administrator, or a custom global role with the Create Data Source permission.

Procedure

  1. In the top navigation bar of the Dataphin homepage, choose management center > Data Source Management.

  2. On the Data Sources page, click + New Data Source.

  3. On the New Data Source page, in the Big Data Storage section, select Hive.

    If you have used Hive recently, you can also select it from the Recently Used section. Alternatively, you can enter Hive in the search box to quickly find it.

  4. On the New Hive Data Source page, configure the connection parameters for the data source.

    In the Hive data source configuration: the Integration parameters enable data integration, the Real-time Development parameters support real-time development, and the Metadata Database parameters are used to obtain metadata.

    Note

    Typically, you should configure production and development data sources separately to isolate the environments and prevent development activities from impacting production. However, Dataphin also allows you to use the same data source for both environments by entering identical parameter values.

    • Basic information

      Parameter

      Description

      Data source name

      Enter a name for the data source. The name must meet the following rules:

      • Can contain only Chinese characters, uppercase and lowercase letters, digits, underscores (_), and hyphens (-).

      • Cannot exceed 64 characters in length.

      Data source code

      After you configure the data source code, you can reference tables in Flink SQL tasks by using the format data_source_code.table_name or data_source_code.schema.table_name. To automatically access the data source for the corresponding environment, use the variable format ${data_source_code}.table or ${data_source_code}.schema.table. For more information, see Flink SQL task development methods.

      Important
      • The data source code cannot be modified after it is configured.

      • You can preview data on the object details page in the asset directory and asset checklist only after the data source code is configured.

      • In Flink SQL, only MySQL, Hologres, MaxCompute, Oracle, StarRocks, Hive, SelectDB, and GaussDB data warehouse service (DWS) data sources are currently supported.

      Version

      Select the version of your Hive data source. Dataphin supports the following versions:

      • CDH5.x Hive 1.1.0

      • Aliyun EMR3.x Hive 2.3.5

      • Aliyun EMR5.x Hive 3.1.x

      • CDH6.x Hive 2.1.1

      • FusionInsight 8.x Hive 3.1.0

      • CDP7.x Hive 3.1.3

      • AsiaInfo DP5.x Hive 3.1.0

      • Amazon EMR

      Data source description

      Enter a brief description of the data source, up to 128 characters.

      Data lake table format

      Select whether to enable or disable the data lake table format. It is disabled by default. If you enable it, you must select a table format.

      • When Version is set to CDP7.x Hive3.1.3, only Hudi is supported.

      • When Version is set to Aliyun EMR5.x Hive3.1.x, you can select Iceberg or Paimon.

      Note

      This parameter is configurable only when Version is set to CDP7.x Hive3.1.3 or Aliyun EMR5.x Hive3.1.x.

      Data source configuration

      Select the environment for your data source configuration:

      • If your business data sources are separated into production and development environments, select Production + development data source.

      • If your business data sources are not separated, select Production data source.

      Tags

      You can use tags to classify and label your data sources. For information on how to create tags, see Manage Data Source Tags.

    • Production/development data source

      The required configuration parameters vary depending on the Hive version.

      Other versions

      Hive configuration

      Metadata Acquisition Method: Supported methods include Metadata Database, and DLF. The required configuration varies depending on the selected method.

      Metadata database

      Note

      If you enable only Real-time Development, you must use the metadata database method for metadata acquisition.

      Parameter

      Description

      Database type

      Select the database type that corresponds to the metadata database used in your cluster. Dataphin supports MySQL. The supported versions for MySQL are MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.

      JDBC URL

      Enter the JDBC connection URL for the target database. The format of the connection URL is jdbc:mysql://host:port/dbname.

      Username, Password

      Enter the username and password for logging in to Hive.

      Hive JDBC URL

      Enter the JDBC connection URL for Hive. The following three types of connection URLs are supported:

      • Hive Server: The connection address format is jdbc:hive://host:port/dbname.

      • ZooKeeper: For example, jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

      • If Kerberos is enabled for Hive: The connection address format is jdbc:hive2://host:port/dbname;principal=hive/_HOST@xx.com.

      Note

      For E-MapReduce3.x, E-MapReduce5.x, and Cloudera Data Platform versions, if Kerberos authentication is enabled, you cannot specify multiple IP addresses in the JDBC URL.

      DLF

      Note

      The DLF acquisition method is supported only for the Aliyun EMR5.x Hive 3.1.x version.

      Parameter

      Description

      Endpoint

      Enter the endpoint for the region where the cluster's DLF data center is located. To obtain this information, see DLF Regions and Endpoints.

      AccessKey ID, AccessKey Secret

      Enter the AccessKey ID and AccessKey Secret of the account where the cluster is located.

      You can obtain the AccessKey ID and AccessKey Secret on the User Information Management page.

      hive-site.xml

      Upload the hive-site.xml configuration file for Hive.

      Note

      If Real-time Development is enabled, this configuration file is also used for real-time development.

      Hive JDBC URL

      Enter the JDBC connection URL for Hive. The following three types of connection URLs are supported:

      • Hive Server: The connection address format is jdbc:hive://host:port/dbname.

      • ZooKeeper: For example, jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

      • If Kerberos is enabled for Hive: The connection URL format is jdbc:hive2://host:port/dbname;principal=hive/_HOST@xx.com.

      Integration

      Note

      You can configure the integration module if you selected Integration in the Enabled Modules section.

      Integration

      Two storage modes are supported: HDFS and OSS-HDFS Cluster Storage. The required configuration information varies depending on the storage mode.

      HDFS

      Parameter

      Description

      NameNode

      The NameNode manages the metadata of the HDFS cluster. Click Add and configure the NameNode information in the Add NameNode dialog box.

      • Web UI Port: The port for accessing the web interfaces of the Hadoop cluster components.

      • IPC Port: The port for inter-process communication (IPC).

      In a CDH5 environment, the default values for Web UI Port and IPC Port are 50070 and 8020, respectively. You can specify the ports based on your actual configuration.

      Note

      The Test Connection can be completed successfully if either the Web UI Port or the IPC Port is entered correctly.

      Configuration files

      Upload the Hadoop configuration files hdfs-site.xml and core-site.xml. You can export these files from the Hadoop cluster.

      Note

      After you enable the data lake table format and select Hudi as the table format, you also need to upload the hive-site.xml configuration file if real-time development uses a Hive data source and the dp-hudi Connector.

      Enable Kerberos

      Kerberos is an authentication protocol based on symmetric key technology that provides identity verification for other services and supports Single Sign-On (SSO). SSO allows a client to access multiple services, such as HBase and HDFS, after a single authentication.

      If your Hadoop cluster uses Kerberos authentication, configure the following parameters.

      • Kerberos configuration method

        • KDC Server: The address of the KDC server, which facilitates Kerberos authentication.

          Note

          You can configure multiple KDC Server addresses, separated by commas (,).

        • Krb5 File Configuration: Upload a krb5 file for Kerberos authentication.

      • HDFS configuration

        • HDFS keytab file: Upload the keytab authentication file. You can obtain this file from the HDFS Server.

        • HDFS principal: Enter the principal corresponding to the HDFS keytab file, for example, xxx/hdfsclient@xxx.xxx.

      • Hive configuration

        • JDBC URL: Enter the JDBC connection URL for Hive. The following three types of URLs are supported:

          • The connection address of the Hive server is in the format jdbc:hive://host:port/dbname.

          • The connection address of ZooKeeper. For example, jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

          • The connection string for a Kerberos-enabled connection is jdbc:hive2://host:port/dbname;principal=hive/_HOST@xx.com.

          Note

          For E-MapReduce3.x, E-MapReduce5.x, and Cloudera Data Platform versions, if Kerberos authentication is enabled, you cannot specify multiple IP addresses in the JDBC URL.

        • Hive keytab file: Upload the keytab authentication file. You can obtain this file from the Hive Server.

        • Hive principal: Enter the principal corresponding to the Hive keytab file, for example, xxx/hdfsclient@xxx.xxx.

      Disable Kerberos

      If your Hadoop cluster does not use Kerberos authentication, configure the following parameters.

      • JDBC URL: Enter the JDBC connection URL for Hive. The following three types of URLs are supported:

        • The connection address of the Hive server. The format is jdbc:hive://host:port/dbname.

        • The connection address for ZooKeeper. For example, jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

        • The connection string for a service with Kerberos enabled is jdbc:hive2://host:port/dbname;principal=hive/_HOST@xx.com.

          Note

          For E-MapReduce3.x, E-MapReduce5.x, and Cloudera Data Platform versions, if Kerberos authentication is enabled, you cannot specify multiple IP addresses in the JDBC URL.

      • Username, Password: Enter the username and password for logging in to Hive.

        Note

        To ensure that tasks run properly, make sure the specified user has permission to access Hive data.

      OSS-HDFS

      Note

      OSS-HDFS cluster storage is supported only for the Aliyun EMR5.x Hive3.1.x version.

      Parameter

      Description

      Cluster storage

      You can check the cluster storage type in the following ways:

      • If the cluster is not created: You can view the cluster storage type on the creation page of an Aliyun E-MapReduce 5.x Hadoop cluster.

      • If the cluster is already created: You can view the cluster storage type on the details page of an Aliyun E-MapReduce 5.x Hadoop cluster.

      Cluster storage root directory

      Enter the root directory of the cluster storage. You can obtain this information from the Aliyun E-MapReduce 5.x Hadoop cluster details.

      Important

      If the specified path includes an endpoint, Dataphin uses that endpoint by default. If it does not, Dataphin uses the bucket-level endpoint configured in core-site.xml. If no bucket-level endpoint is configured, Dataphin uses the global endpoint from core-site.xml. For more information, see Configure an endpoint to access Alibaba Cloud OSS-HDFS (JindoFS service).

      Configuration files

      Upload the core-site.xml and hive-metastore-site.xml configuration files, which you can export from the Hadoop cluster.

      AccessKey ID, AccessKey Secret

      Enter the AccessKey ID and AccessKey Secret for accessing the cluster's OSS. Use an existing AccessKey pair or create a new one. For instructions, see Create an AccessKey pair.

      Important
      • The configuration entered here has a higher priority than the AccessKey configured in core-site.xml.

      • To mitigate the risk of an AccessKey leak, the AccessKey Secret is displayed only once upon creation and cannot be retrieved later. Be sure to store it securely.

      Enable Kerberos

      Kerberos is an authentication protocol based on symmetric key technology that provides identity verification for other services and supports SSO. SSO allows a client to access multiple services, such as HBase and HDFS, after a single authentication.

      If your Hadoop cluster uses Kerberos authentication, configure the following parameters.

      • Kerberos configuration method

        Krb5 File Configuration: Upload a krb5 file for Kerberos authentication.

      • Hive configuration

        • JDBC URL: Enter the JDBC connection URL for Hive. The following three types of URLs are supported:

          • The connection address of the Hive Server is in the format jdbc:hive://host:port/dbname.

          • The connection address for ZooKeeper. For example: jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

          • The connection string for a service with Kerberos enabled is in the format jdbc:hive2://host:port/dbname;principal=hive/_HOST@xx.com.

            Note

            For E-MapReduce3.x, E-MapReduce5.x, and Cloudera Data Platform versions, if Kerberos authentication is enabled, you cannot specify multiple IP addresses in the JDBC URL.

        • Hive keytab file: Upload the keytab authentication file. You can obtain this file from the Hive Server.

        • Hive principal: Enter the principal corresponding to the Hive keytab file, for example, xxx/hdfsclient@xxx.xxx.

      Disable Kerberos

      If your Hadoop cluster does not use Kerberos authentication, configure the following parameters.

      • JDBC URL: Enter the JDBC connection URL for Hive. The following three types of URLs are supported:

        • The connection address of the Hive server. The format is jdbc:hive://host:port/dbname.

        • The connection address for ZooKeeper. For example, jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.

        • The connection string for a Kerberos-enabled connection is in the format jdbc:hive2://host:port/dbname;principal=hive/_HOST@xx.com.

      • Username, Password: Enter the username and password for logging in to Hive.

        Note

        To ensure that tasks run properly, make sure the specified user has permission to access Hive data.

      Spark

      Note
      • Spark configuration is supported when the version is set to CDP7.x Hive3.1.3 and the data lake table format is set to Hudi.

      • Spark configuration is supported when the version is set to Aliyun EMR5.x Hive3.1.x and the data lake table format is set to Iceberg or Paimon.

      Parameter

      Description

      Spark

      Select Enable or Disable. This feature is enabled by default. If enabled, you must also configure the parameters below.

      Note

      You cannot disable Spark if Paimon is selected as the data lake table format.

      Service type

      Select Kyuubi or Livy. Kyuubi is selected by default.

      Spark JDBC URL

      Enter the JDBC connection URL in the following format: jdbc:hive2://{host}:{port}/{database name}.

      Authentication method

      Select No Authentication, LDAP, or Kerberos.

      Username, Password

      Enter the username and password for authentication. To ensure that tasks run properly, make sure this user has the required data permissions.

      Note

      If Authentication Method is set to No Authentication, you must enter a Username. If Authentication Method is set to LDAP, you must enter a Username and Password.

      Spark keytab file

      Upload the keytab authentication file. You can obtain this file from the Spark Server.

      Note

      This parameter is configurable when Authentication Method is set to Kerberos.

      Spark principal

      Enter the principal corresponding to the Spark keytab file, for example, xxx/hadoopclient@xxx.xxx.

      Note

      This parameter is configurable when Authentication Method is set to Kerberos.

      Real-time development

      Note

      Real-time development is supported only for CDH6.x Hive 2.1.1, CDP7.x Hive 3.1.3, and AsiaInfo DP5.x Hive 3.1.0. You can configure the real-time development module if you selected Real-time Development in the Enabled Modules section.

      Configuration File: Upload the hive-site.xml file. Flink SQL tasks will ignore the authentication information from the integration settings and use the Flink engine's authentication information to access the Hive data source.

      Amazon EMR

      Parameter

      Description

      Public DNS of the master node

      Dataphin uses the public DNS to resolve the VPC private DNS. Both Hive and Spark connect through the private DNS. Format: ec2-{public_ip}.{region}.compute.amazonaws.com.

      Key file (*.pem)

      The key pair used to access the master node's EC2 instance (the key pair specified when you created the EMR cluster).

      Configuration files

      You can upload the relevant cluster configuration files (core-site.xml, yarn-site.xml, hive-site.xml, hdfs-site.xml), or click Get Cluster Configurations (you must first enter the public DNS of the master node and upload the key file) to download the files from the master node.

      Database

      Enter the database name for the Amazon EMR compute engine.

      Engine type

      Select Hive or Spark. Hive is selected by default. If you select Hive, enter the Hive JDBC URL. If you select Spark, enter the Spark JDBC URL.

      • Hive JDBC URL: Enter the JDBC connection URL for Hive, or Auto-fetch the connection URL (you must first enter the public DNS of the master node and upload the key file). The Hive JDBC URL format is jdbc:hive2//host1:port1,host2:post2/. You do not need to enter the database name.

      • Spark JDBC URL: Enter the JDBC connection URL for Spark. The format is jdbc:hive2//{host:port}/{database name}.

      Username

      The designated username for Hive or Spark. Dataphin sets this as the JDBC username.

      Cluster storage

      Currently, only HDFS can be selected.

      Metadata acquisition method

      Amazon Glue: If you select Amazon Glue, you must also configure the Glue Region Code, Glue AccessKey ID, and Glue AccessKey Secret.

      • Glue region code: Enter the Region Code for Amazon Glue, for example, ap-northeast-3, us-east-1, or us-west-1.

      • Glue AccessKey ID, Glue AccessKey Secret: Enter the AccessKey ID and AccessKey Secret for accessing Amazon Glue.

  5. Select a Default Resource Group. This resource group is used to run tasks related to the current data source, such as database SQL queries, offline full database migrations, and data previews.

  6. Click Test connection, or click OK to save the configuration and create the Hive data source.

    When you click Test connection, the system verifies whether Dataphin can connect to the data source. If you click OK directly, the system automatically tests the connection for all selected clusters. You can create the data source even if the connection tests fail.

    Test Connection tests the connection for the Public Scheduling Cluster or Registered Scheduling Clusters that have been registered in Dataphin and are in normal use. The Public Scheduling Cluster is selected by default and cannot be deselected. If there are no resource groups under a Registered Scheduling Cluster, connection testing is not supported. You need to create a resource group first before testing the connection.

    • The selected clusters are only used to test network connectivity with the current data source and are not used for running related tasks later.

    • The test connection usually takes less than 2 minutes. If it times out, you can click the image icon to view the specific reason and retry.

    • Regardless of whether the test result is Connection Failed, Connection Successful, or Succeeded With Warning, the system will record the generation time of the final result.

      Note

      Only the test results for the Public Scheduling Cluster include three connection statuses: Succeeded With Warning, Connection Successful, and Connection Failed. The test results for Registered Scheduling Clusters in Dataphin only include two connection statuses: Connection Successful and Connection Failed.

    • When the test result is Connection Failed, you can click the image icon to view the specific failure reason.

    • When the test result is Succeeded With Warning, it means that the application cluster connection is successful but the scheduling cluster connection failed. The current data source cannot be used for data development and integration. You can click the image icon to view the log information.

    If the Hive metadata database has a public address, select a resource group from the Public Scheduling Cluster as the default resource group. When configuring integration tasks, run them on a registered scheduling cluster. If the Hive metadata database has an internal address, select a resource group from a Registered Scheduling Cluster as the default resource group. Then, configure a collection task in metadata and run integration tasks on the registered scheduling cluster.

    Important

    If the connection test fails, see Network Connectivity Solutions for troubleshooting instructions.

上一篇: Create a MaxCompute data source 下一篇: Add a Hologres data source
阿里云首页 智能数据建设与治理 Dataphin 相关技术圈