Create a Hive data source-Dataphin(Dataphin)-阿里云帮助中心

A Hive data source allows Dataphin to read business data from or write data to Hive. This topic describes the procedure.

Background information

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL query capabilities. Hive transforms HQL or SQL statements into programs for execution engines like MapReduce and Tez. To use Hive with Dataphin for data development or to write data to Hive, you must first create a Hive data source. For more information, see the Hive official website.

Limits

To use Hive external tables from OSS for offline integration with an Aliyun E-MapReduce 5.x Hadoop compute engine, you must perform additional configuration. For configuration instructions, see .
To use a Hive data source for integration, add Dataphin's IP addresses to the Hive network whitelist to grant access.
Ensure network connectivity between the Dataphin application and scheduling clusters and the following services: the Hive service, HDFS NameNode (including its web UI and IPC) and DataNodes, KDC Server, metadata database or Hive Metastore, and ZooKeeper.
To use a Hive data source, you must first purchase the Metadata Collection-Big Data Storage feature and create a registered scheduling cluster that can access it.

Permissions

Only users with the following roles can create data sources: a super administrator, a data source administrator, a domain architect, a project administrator, or a custom global role with the Create Data Source permission.

Procedure

In the top navigation bar of the Dataphin homepage, choose management center > Data Source Management.
On the Data Sources page, click + New Data Source.
On the New Data Source page, in the Big Data Storage section, select Hive.
If you have used Hive recently, you can also select it from the Recently Used section. Alternatively, you can enter Hive in the search box to quickly find it.

On the New Hive Data Source page, configure the connection parameters for the data source.

In the Hive data source configuration: the Integration parameters enable data integration, the Real-time Development parameters support real-time development, and the Metadata Database parameters are used to obtain metadata.

Note

Typically, you should configure production and development data sources separately to isolate the environments and prevent development activities from impacting production. However, Dataphin also allows you to use the same data source for both environments by entering identical parameter values.

Basic information

Parameter	Description
Data source name	Enter a name for the data source. The name must meet the following rules: Can contain only Chinese characters, uppercase and lowercase letters, digits, underscores (_), and hyphens (-). Cannot exceed 64 characters in length.
Data source code	After you configure the data source code, you can reference tables in Flink SQL tasks by using the format `data_source_code.table_name` or `data_source_code.schema.table_name`. To automatically access the data source for the corresponding environment, use the variable format `${data_source_code}.table` or `${data_source_code}.schema.table`. For more information, see Flink SQL task development methods. Important The data source code cannot be modified after it is configured. You can preview data on the object details page in the asset directory and asset checklist only after the data source code is configured. In Flink SQL, only MySQL, Hologres, MaxCompute, Oracle, StarRocks, Hive, SelectDB, and GaussDB data warehouse service (DWS) data sources are currently supported.
Version	Select the version of your Hive data source. Dataphin supports the following versions: CDH5.x Hive 1.1.0 Aliyun EMR3.x Hive 2.3.5 Aliyun EMR5.x Hive 3.1.x CDH6.x Hive 2.1.1 FusionInsight 8.x Hive 3.1.0 CDP7.x Hive 3.1.3 AsiaInfo DP5.x Hive 3.1.0 Amazon EMR
Data source description	Enter a brief description of the data source, up to 128 characters.
Data lake table format	Select whether to enable or disable the data lake table format. It is disabled by default. If you enable it, you must select a table format. When Version is set to CDP7.x Hive3.1.3, only Hudi is supported. When Version is set to Aliyun EMR5.x Hive3.1.x, you can select Iceberg or Paimon. Note This parameter is configurable only when Version is set to CDP7.x Hive3.1.3 or Aliyun EMR5.x Hive3.1.x.
Data source configuration	Select the environment for your data source configuration: If your business data sources are separated into production and development environments, select Production + development data source. If your business data sources are not separated, select Production data source.
Tags	You can use tags to classify and label your data sources. For information on how to create tags, see Manage Data Source Tags.

Production/development data source

The required configuration parameters vary depending on the Hive version.

Other versions

Hive configuration

Metadata Acquisition Method: Supported methods include Metadata Database, and DLF. The required configuration varies depending on the selected method.

Metadata database

Note

If you enable only Real-time Development, you must use the metadata database method for metadata acquisition.

Parameter	Description
Database type	Select the database type that corresponds to the metadata database used in your cluster. Dataphin supports MySQL. The supported versions for MySQL are MySQL 5.1.43, MySQL 5.6/5.7, and MySQL 8.
JDBC URL	Enter the JDBC connection URL for the target database. The format of the connection URL is `jdbc:mysql://host:port/dbname`.
Username, Password	Enter the username and password for logging in to Hive.
Hive JDBC URL	Enter the JDBC connection URL for Hive. The following three types of connection URLs are supported: Hive Server: The connection address format is `jdbc:hive://host:port/dbname`. ZooKeeper: For example, `jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2`. If Kerberos is enabled for Hive: The connection address format is `jdbc:hive2://host:port/dbname;principal=hive/_HOST@xx.com`. Note For E-MapReduce3.x, E-MapReduce5.x, and Cloudera Data Platform versions, if Kerberos authentication is enabled, you cannot specify multiple IP addresses in the JDBC URL.

DLF

Note

The DLF acquisition method is supported only for the Aliyun EMR5.x Hive 3.1.x version.

Parameter	Description
Endpoint	Enter the endpoint for the region where the cluster's DLF data center is located. To obtain this information, see DLF Regions and Endpoints.
AccessKey ID, AccessKey Secret	Enter the AccessKey ID and AccessKey Secret of the account where the cluster is located. You can obtain the AccessKey ID and AccessKey Secret on the User Information Management page.
hive-site.xml	Upload the hive-site.xml configuration file for Hive. Note If Real-time Development is enabled, this configuration file is also used for real-time development.
Hive JDBC URL	Enter the JDBC connection URL for Hive. The following three types of connection URLs are supported: Hive Server: The connection address format is `jdbc:hive://host:port/dbname`. ZooKeeper: For example, `jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2`. If Kerberos is enabled for Hive: The connection URL format is `jdbc:hive2://host:port/dbname;principal=hive/_HOST@xx.com`.

Integration

Note

You can configure the integration module if you selected Integration in the Enabled Modules section.

Integration

Two storage modes are supported: HDFS and OSS-HDFS Cluster Storage. The required configuration information varies depending on the storage mode.

HDFS

Parameter	Description
NameNode	The NameNode manages the metadata of the HDFS cluster. Click Add and configure the NameNode information in the Add NameNode dialog box. Web UI Port: The port for accessing the web interfaces of the Hadoop cluster components. IPC Port: The port for inter-process communication (IPC). In a CDH5 environment, the default values for Web UI Port and IPC Port are `50070` and `8020`, respectively. You can specify the ports based on your actual configuration. Note The Test Connection can be completed successfully if either the Web UI Port or the IPC Port is entered correctly.
Configuration files	Upload the Hadoop configuration files `hdfs-site.xml` and `core-site.xml`. You can export these files from the Hadoop cluster. Note After you enable the data lake table format and select Hudi as the table format, you also need to upload the `hive-site.xml` configuration file if real-time development uses a Hive data source and the dp-hudi Connector.
Enable Kerberos	Kerberos is an authentication protocol based on symmetric key technology that provides identity verification for other services and supports Single Sign-On (SSO). SSO allows a client to access multiple services, such as HBase and HDFS, after a single authentication. If your Hadoop cluster uses Kerberos authentication, configure the following parameters. Kerberos configuration method KDC Server: The address of the KDC server, which facilitates Kerberos authentication. Note You can configure multiple KDC Server addresses, separated by commas (,). Krb5 File Configuration: Upload a krb5 file for Kerberos authentication. HDFS configuration HDFS keytab file: Upload the keytab authentication file. You can obtain this file from the HDFS Server. HDFS principal: Enter the principal corresponding to the HDFS keytab file, for example, `xxx/hdfsclient@xxx.xxx`. Hive configuration JDBC URL: Enter the JDBC connection URL for Hive. The following three types of URLs are supported: The connection address of the Hive server is in the format `jdbc:hive://host:port/dbname`. The connection address of ZooKeeper. For example, `jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2`. The connection string for a Kerberos-enabled connection is `jdbc:hive2://host:port/dbname;principal=hive/_HOST@xx.com`. Note For E-MapReduce3.x, E-MapReduce5.x, and Cloudera Data Platform versions, if Kerberos authentication is enabled, you cannot specify multiple IP addresses in the JDBC URL. Hive keytab file: Upload the keytab authentication file. You can obtain this file from the Hive Server. Hive principal: Enter the principal corresponding to the Hive keytab file, for example, `xxx/hdfsclient@xxx.xxx`.
Disable Kerberos	If your Hadoop cluster does not use Kerberos authentication, configure the following parameters. JDBC URL: Enter the JDBC connection URL for Hive. The following three types of URLs are supported: The connection address of the Hive server. The format is `jdbc:hive://host:port/dbname`. The connection address for ZooKeeper. For example, `jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2`. The connection string for a service with Kerberos enabled is `jdbc:hive2://host:port/dbname;principal=hive/_HOST@xx.com`. Note For E-MapReduce3.x, E-MapReduce5.x, and Cloudera Data Platform versions, if Kerberos authentication is enabled, you cannot specify multiple IP addresses in the JDBC URL. Username, Password: Enter the username and password for logging in to Hive. Note To ensure that tasks run properly, make sure the specified user has permission to access Hive data.

OSS-HDFS

Note

OSS-HDFS cluster storage is supported only for the Aliyun EMR5.x Hive3.1.x version.

Parameter	Description
Cluster storage	You can check the cluster storage type in the following ways: If the cluster is not created: You can view the cluster storage type on the creation page of an Aliyun E-MapReduce 5.x Hadoop cluster. If the cluster is already created: You can view the cluster storage type on the details page of an Aliyun E-MapReduce 5.x Hadoop cluster.
Cluster storage root directory	Enter the root directory of the cluster storage. You can obtain this information from the Aliyun E-MapReduce 5.x Hadoop cluster details. Important If the specified path includes an endpoint, Dataphin uses that endpoint by default. If it does not, Dataphin uses the bucket-level endpoint configured in core-site.xml. If no bucket-level endpoint is configured, Dataphin uses the global endpoint from core-site.xml. For more information, see Configure an endpoint to access Alibaba Cloud OSS-HDFS (JindoFS service).
Configuration files	Upload the `core-site.xml` and `hive-metastore-site.xml` configuration files, which you can export from the Hadoop cluster.
AccessKey ID, AccessKey Secret	Enter the AccessKey ID and AccessKey Secret for accessing the cluster's OSS. Use an existing AccessKey pair or create a new one. For instructions, see Create an AccessKey pair. Important The configuration entered here has a higher priority than the AccessKey configured in core-site.xml. To mitigate the risk of an AccessKey leak, the AccessKey Secret is displayed only once upon creation and cannot be retrieved later. Be sure to store it securely.
Enable Kerberos	Kerberos is an authentication protocol based on symmetric key technology that provides identity verification for other services and supports SSO. SSO allows a client to access multiple services, such as HBase and HDFS, after a single authentication. If your Hadoop cluster uses Kerberos authentication, configure the following parameters. Kerberos configuration method Krb5 File Configuration: Upload a krb5 file for Kerberos authentication. Hive configuration JDBC URL: Enter the JDBC connection URL for Hive. The following three types of URLs are supported: The connection address of the Hive Server is in the format `jdbc:hive://host:port/dbname`. The connection address for ZooKeeper. For example: `jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2`. The connection string for a service with Kerberos enabled is in the format `jdbc:hive2://host:port/dbname;principal=hive/_HOST@xx.com`. Note For E-MapReduce3.x, E-MapReduce5.x, and Cloudera Data Platform versions, if Kerberos authentication is enabled, you cannot specify multiple IP addresses in the JDBC URL. Hive keytab file: Upload the keytab authentication file. You can obtain this file from the Hive Server. Hive principal: Enter the principal corresponding to the Hive keytab file, for example, `xxx/hdfsclient@xxx.xxx`.
Disable Kerberos	If your Hadoop cluster does not use Kerberos authentication, configure the following parameters. JDBC URL: Enter the JDBC connection URL for Hive. The following three types of URLs are supported: The connection address of the Hive server. The format is `jdbc:hive://host:port/dbname`. The connection address for ZooKeeper. For example, `jdbc:hive2://zk01:2181,zk02:2181,zk03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2`. The connection string for a Kerberos-enabled connection is in the format `jdbc:hive2://host:port/dbname;principal=hive/_HOST@xx.com`. Username, Password: Enter the username and password for logging in to Hive. Note To ensure that tasks run properly, make sure the specified user has permission to access Hive data.

Spark

Note

Spark configuration is supported when the version is set to CDP7.x Hive3.1.3 and the data lake table format is set to Hudi.
Spark configuration is supported when the version is set to Aliyun EMR5.x Hive3.1.x and the data lake table format is set to Iceberg or Paimon.

Parameter	Description
Spark	Select Enable or Disable. This feature is enabled by default. If enabled, you must also configure the parameters below. Note You cannot disable Spark if Paimon is selected as the data lake table format.
Service type	Select Kyuubi or Livy. Kyuubi is selected by default.
Spark JDBC URL	Enter the JDBC connection URL in the following format: `jdbc:hive2://{host}:{port}/{database name}`.
Authentication method	Select No Authentication, LDAP, or Kerberos.
Username, Password	Enter the username and password for authentication. To ensure that tasks run properly, make sure this user has the required data permissions. Note If Authentication Method is set to No Authentication, you must enter a Username. If Authentication Method is set to LDAP, you must enter a Username and Password.
Spark keytab file	Upload the keytab authentication file. You can obtain this file from the Spark Server. Note This parameter is configurable when Authentication Method is set to Kerberos.
Spark principal	Enter the principal corresponding to the Spark keytab file, for example, `xxx/hadoopclient@xxx.xxx`. Note This parameter is configurable when Authentication Method is set to Kerberos.

Real-time development

Note

Real-time development is supported only for CDH6.x Hive 2.1.1, CDP7.x Hive 3.1.3, and AsiaInfo DP5.x Hive 3.1.0. You can configure the real-time development module if you selected Real-time Development in the Enabled Modules section.

Configuration File: Upload the hive-site.xml file. Flink SQL tasks will ignore the authentication information from the integration settings and use the Flink engine's authentication information to access the Hive data source.

Amazon EMR

Parameter	Description
Public DNS of the master node	Dataphin uses the public DNS to resolve the VPC private DNS. Both Hive and Spark connect through the private DNS. Format: `ec2-{public_ip}.{region}.compute.amazonaws.com`.
*Key file (.pem)**	The key pair used to access the master node's EC2 instance (the key pair specified when you created the EMR cluster).
Configuration files	You can upload the relevant cluster configuration files (core-site.xml, yarn-site.xml, hive-site.xml, hdfs-site.xml), or click Get Cluster Configurations (you must first enter the public DNS of the master node and upload the key file) to download the files from the master node.
Database	Enter the database name for the Amazon EMR compute engine.
Engine type	Select Hive or Spark. Hive is selected by default. If you select Hive, enter the Hive JDBC URL. If you select Spark, enter the Spark JDBC URL. Hive JDBC URL: Enter the JDBC connection URL for Hive, or Auto-fetch the connection URL (you must first enter the public DNS of the master node and upload the key file). The Hive JDBC URL format is `jdbc:hive2//host1:port1,host2:post2/`. You do not need to enter the database name. Spark JDBC URL: Enter the JDBC connection URL for Spark. The format is `jdbc:hive2//{host:port}/{database name}`.
Username	The designated username for Hive or Spark. Dataphin sets this as the JDBC `username`.
Cluster storage	Currently, only HDFS can be selected.
Metadata acquisition method	Amazon Glue: If you select Amazon Glue, you must also configure the Glue Region Code, Glue AccessKey ID, and Glue AccessKey Secret. Glue region code: Enter the Region Code for Amazon Glue, for example, `ap-northeast-3`, `us-east-1`, or `us-west-1`. Glue AccessKey ID, Glue AccessKey Secret: Enter the AccessKey ID and AccessKey Secret for accessing Amazon Glue.

Select a Default Resource Group. This resource group is used to run tasks related to the current data source, such as database SQL queries, offline full database migrations, and data previews.
Click Test connection, or click OK to save the configuration and create the Hive data source.
When you click Test connection, the system verifies whether Dataphin can connect to the data source. If you click OK directly, the system automatically tests the connection for all selected clusters. You can create the data source even if the connection tests fail.
Test Connection tests the connection for the Public Scheduling Cluster or Registered Scheduling Clusters that have been registered in Dataphin and are in normal use. The Public Scheduling Cluster is selected by default and cannot be deselected. If there are no resource groups under a Registered Scheduling Cluster, connection testing is not supported. You need to create a resource group first before testing the connection.
- The selected clusters are only used to test network connectivity with the current data source and are not used for running related tasks later.
- The test connection usually takes less than 2 minutes. If it times out, you can click the icon to view the specific reason and retry.
- Regardless of whether the test result is Connection Failed, Connection Successful, or Succeeded With Warning, the system will record the generation time of the final result.
  Note
  Only the test results for the Public Scheduling Cluster include three connection statuses: Succeeded With Warning, Connection Successful, and Connection Failed. The test results for Registered Scheduling Clusters in Dataphin only include two connection statuses: Connection Successful and Connection Failed.
- When the test result is Connection Failed, you can click the icon to view the specific failure reason.
- When the test result is Succeeded With Warning, it means that the application cluster connection is successful but the scheduling cluster connection failed. The current data source cannot be used for data development and integration. You can click the icon to view the log information.
If the Hive metadata database has a public address, select a resource group from the Public Scheduling Cluster as the default resource group. When configuring integration tasks, run them on a registered scheduling cluster. If the Hive metadata database has an internal address, select a resource group from a Registered Scheduling Cluster as the default resource group. Then, configure a collection task in metadata and run integration tasks on the registered scheduling cluster.
Important
If the connection test fails, see Network Connectivity Solutions for troubleshooting instructions.

上一篇: Create a MaxCompute data source 下一篇: Add a Hologres data source