MaxCompute provides a Hive+HDFS external schema that can be mapped to an EMR cluster or a standard open source Hadoop cluster. This schema supports real-time metadata retrieval and data access. You can map a Hive instance at the database level, which allows MaxCompute to read data from Hive without using the computing capabilities of Hadoop.
Introduction
Hive is a common open source data warehouse solution for big data. Its metadata is usually stored in Hive Metastore (HMS), and its data is stored on HDFS. MaxCompute lets you create this type of external data source. By creating a federation between MaxCompute and Hive, you can access and aggregate data from open source big data systems within your data warehouse.
The Hive+HDFS schema does not support Kerberos authentication. It supports only Hadoop instances that do not require authentication and are connected through an external data source. In MaxCompute, you can use the MaxCompute permission system to manage user access to mapped data objects within the external schema based on object names.
Limitations
Data lakehouse 2.0 capabilities can be built only in the China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt) regions.
Hive 2.x is supported. Hive 3.x is also supported if it is compatible with version 2.x.
Pay-as-you-go SQL federated computing tasks for the Hive+HDFS mode are free of charge during the public preview period.
If the Hive+HDFS external data source is an E-MapReduce instance, only a self-managed RDS instance can be used for metadata. The DLF unified metadata service or the built-in MySQL database are not supported. Before you read data, confirm that the Hive component of the E-MapReduce instance is running properly.
Procedure
Step 1: Prepare test data
If you already have Hive data, you can skip this step.
Activate E-MapReduce. The VPC where MaxCompute and the Hadoop cluster are located must be in the same region.
-
Prepare Hive test data
-
登录E-MapReduce控制台,在左上角选择地域。
-
在My Clusters主页面,单击Create Cluster。
-
Software Configuration
在顶部菜单栏处,根据实际情况选择资源组。资源组默认显示账号全部资源。其他配置参数如下:
参数名
说明
Region
-
创建的集群会在对应的地域内,一旦创建便无法修改。
-
MaxCompute与Hadoop集群所在的VPC必须部署在同一地域。
Business Scenario
本示例中选择Custom Cluster。
Product Version
根据需求选择,推荐使用最新版本。
High Service Availability
默认不开启。打开High Service Availability开关后,EMR会把Master节点分布在不同的底层硬件上以降低故障风险。
Optional Services (Select One At Least)
本文实践选择
HADOOP-COMMON(3.2.1)、HDFS(3.2.1)、HIVE(3.1.3)以及YARN(3.2.1)。Collect Service Operational Logs
默认开启采集。
if you want to collect service operational logs, you can modify the collection status on the Basic Information tab of the cluster details page.
Metadata
本示例中使用Self-managed RDS。
云上创建RDS步骤参见创建RDS MySQL实例与配置数据库。
-
For an EMR cluster that is used in the production environment, we recommend that you use independent ApsaraDB RDS for MySQL. For example, you can use ApsaraDB RDS for MySQL 5.7 of Alibaba Cloud high-availability edition.
-
填写RDS连接信息hivemetastore-site.xml:
-
javax.jdo.option.ConnectionURL
JDBC元数据连接,例如:
jdbc:mysql://rds.host.name/hive_db_name -
javax.jdo.option.ConnectionUserName
元数据的数据库账号名。
-
javax.jdo.option.ConnectionPassword
元数据的数据库密码。
-
上述配置仅适用于本文实践,关于通过E-MapReduce控制台创建分析集群的详细操作请参见Quick start: Create and use a Data Lake cluster。
-
-
Hardware Configuration
参数名
说明
Billing Method
-
Subscription
-
Pay-as-you-go
在测试场景下,建议使用Pay-as-you-go,测试正常后可以释放该集群,再新建一个Subscription的生产集群正式使用。
Zone
集群创建后,无法直接更改可用区,请谨慎选择。
VPC
选择对应区域下的专有网络。如果没有,单击创建VPC前往新建。创建专有网络完成后,单击刷新,可以选择刚创建的VPC。
vSwitch
选择在对应专有网络下可用区的交换机,如果在这个可用区没有可用的交换机,则需要新创建一个。
Default Security Group
Important不支持使用ECS上创建的企业安全组。
如果已有在使用的安全组,可以直接选择使用或新建安全组。
Node Group
根据业务诉求,配置Master节点组、Core节点组或Task节点组信息。详情请参见Choose hardware and network。
本例中打开Master节点组下的Assign Public Network IP开关,其余使用默认值即可。
Cluster Scaling
配置弹性伸缩规则。
-
-
Basic Configuration
参数名
说明
Cluster Name
集群的名字,长度限制为1~64个字符,仅可使用中文、字母、数字、短划线(-)和下划线(_)。
Identity Credentials
-
Key Pair:如果希望在不输入密码的情况下进行身份验证,可以选择Key Pair方式,更多详情请参见Manage SSH key pairs。
-
Password:用于远程登录集群的Master节点。请记录该配置,登录集群时您需要输入该密码。
-
-
-
登录集群Master节点,进入Hive准备数据。
-
登录集群Master节点
-
登录E-MapReduce控制台,在左上角选择地域。
-
单击目标集群对应的Nodes。
-
在Nodes页签,单击节点组所在行的
图标,单击节点ID进入Instance详情页。 -
在节点ID的Instance页面,单击Connect。
-
在弹出的Remote connection对话框,使用默认的Workbench,单击Sign in now。
-
在登录实例窗口根据实际情况填写该实例信息。
Note在创建密钥对时,私钥会自动下载保存为.pem格式的私钥文件。认证方式可以选择SSH密钥认证,上传私钥文件,可以在不输入密码的情况下进行身份验证。
-
-
准备Hive测试数据。使用Workbench方式登录集群主节点后,在终端界面输入如下命令:
[user@emr-node ~]$ hive -- 创建库 hive> CREATE database IF NOT EXISTS myhive; -- 切换库 hive> USE myhive; -- 创建非分区表 hive> CREATE TABLE IF NOT EXISTS employees ( id INT, name STRING, age INT, department STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; -- 插入数据 hive> INSERT INTO employees VALUES(1, 'John', 25, 'Sales'), (2, 'Jane', 30, 'Marketing'), (3, 'Mike', 35, 'Engineering'), (4, 'Sarah', 28, 'HR'), (5, 'David', 32, 'Finance'), (6, 'Linda', 29, 'IT'), (7, 'Robert', 31, 'Operations'), (8, 'Emily', 27, 'Research'), (9, 'Michael', 33, 'Development'), (10, 'Chris', 26, 'Support'); -- 创建分区表 hive> CREATE TABLE employees_pt ( id INT, name STRING, age INT ) PARTITIONED BY (department STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; -- 插入数据 INSERT INTO employees_pt PARTITION (department='Sales') VALUES(1, 'John', 25),(2, 'Jane', 30),(3, 'Mike', 35); INSERT INTO employees_pt PARTITION (department='Marketing') VALUES(4, 'Sarah', 28),(5, 'David', 32); INSERT INTO employees_pt PARTITION (department='Engineering') VALUES(6, 'Linda', 29),(7, 'Robert', 31); INSERT INTO employees_pt PARTITION (department='HR') VALUES(8, 'Emily', 27),(9, 'Michael', 33),(10, 'Chris', 26);查询非分区表和分区表:
-- 查询非分区表 hive> SELECT * FROM employees; 1 John 25 Sales 2 Jane 30 Marketing 3 Mike 35 Engineering 4 Sarah 28 HR 5 David 32 Finance 6 Linda 29 IT 7 Robert 31 Operations 8 Emily 27 Research 9 Michael 33 Development 10 Chris 26 Support -- 查询分区表 hive> SELECT * FROM employees_pt; 6 Linda 29 Engineering 7 Robert 31 Engineering 8 Emily 27 HR 9 Michael 33 HR 10 Chris 26 HR 4 Sarah 28 Marketing 5 David 32 Marketing 1 John 25 Sales 2 Jane 30 Sales 3 Mike 35 Sales
-
-
Step 2: Create a Hive+HDFS external data source
-
Log in to the MaxCompute console and select a region in the upper-left corner.
-
In the left-side navigation pane, choose .
-
On the External Data Source page, click Create External Data Source.
-
In the Create External Data Source dialog box, configure the parameters. The following tables describe the parameters.
Parameter
Required
Description
External Data Source Type
Required
Select Hive+HDFS.
External Data Source Name
Required
Enter a custom name. Naming conventions:
Must start with a letter and can contain only lowercase letters, underscores, and digits.
Cannot exceed 128 characters.
For example,
hive_hdfs.Description
Optional
Enter a description as needed.
Network Connection Object
Required
The name of the network connection. Select or create a network connection from MaxCompute to the VPC network of the Alibaba Cloud E-MapReduce or Hadoop cluster.
ImportantThe VPC must be in the same region as the MaxCompute external data source and the project to which the external data source is attached.
For more information about the parameters, see the Create a network connection between MaxCompute and the target VPC network step in Network connection process.
Cluster Name
The name that represents the NameNode in a high-availability Hadoop cluster.
To obtain the cluster name of an EMR cluster, perform the following steps:
-
登录E-MapReduce控制台,在左上角选择地域。
-
单击目标集群ID,进入集群详情页面。
-
在Services页签,单击HDFS服务的Configure,进入Configure页面。
-
切换至hdfs-site.xml页签,在By Name栏搜索
dfs.nameservices,该配置项对应的值则为Cluster Name。
NameNode Address
The endpoints and port numbers of the active and standby NameNode services of the destination Hadoop cluster. The port number is usually 8020.
To obtain the NameNode address of an EMR cluster, perform the following steps:
-
登录E-MapReduce控制台,在左上角选择地域。
-
单击目标集群ID,进入集群详情页面。
-
在Services页签,单击HDFS服务的Status,进入Status页面。
-
在Components区域,单击NameNode前面的
图标,展开拓扑列表。 -
获取master-1-1节点对应的Private IP Address,NameNode Address格式即为
内网IP:8020。
HMS Service Address
The endpoints and port numbers of the Hive metadata services of the active and standby NameNodes in the destination Hadoop cluster. The port number is usually 9083.
-
登录E-MapReduce控制台,在左上角选择地域。
-
单击目标集群ID,进入集群详情页面。
-
在Services页签,单击Hive服务的Status,进入Status页面。
-
在Components区域,单击HiveRuntime前面的
图标,展开拓扑列表。 -
获取master-1-1节点对应的Private IP Address,HMS Service Address格式即为
内网IP:9083。
Fs.Defaultfs
Authentication Type
Only No authentication is supported.
Create vSwitch
MaxCompute accesses the data source over a VPC. By default, the reverse access 2.0 solution is used. This solution requires you to configure a vSwitch in a specific zone to establish the metadata access link.
The zones where vSwitches are available in each region are described in the UI prompts. In the VPC where the data source resides, select an existing vSwitch or create a new one that meets the zone requirements.
Foreign Server Supplemental Properties
Optional
Additional properties for the external data source. After you specify these properties, tasks that use this external data source can access the source system based on the defined behavior.
NoteFor information about the supported parameters, see subsequent updates to the official documentation. The specific parameters will be gradually made available as the product evolves.
-
Click OK to create the external data source.
Step 3: Create an external schema
-
Operations on external schemas must be performed in a MaxCompute project. Therefore, you must enable the schema-level feature for the project.
-
In the left-side navigation pane, choose .
-
On the Projects page, click Enable Schema in the Actions column for the target project.
-
Before you perform schema operations or query data from an external schema, you must enable the schema feature for the SQL syntax. To do this, add the
SET odps.namespace.schema=true;statement before your SQL statement.In a MaxCompute client, run the following command to create an external schema that references a Hive+HDFS external data source:
CREATE EXTERNAL SCHEMA IF NOT EXISTS <external_schema_name> WITH <hive_foreign_server_name> ON '<database_name>' ;The following table describes the parameters.
external_schema_name: The custom name of the external schema. In this example, the name istest_schema_hive.hive_foreign_server_name: The name of the external data source that you created. In this example, the name ishive_hdfs.database_name: The name of the Hive database. In this example, the name ismyhive.
Step 4: Use SQL to access the data source
List the names of Hive tables
Method 1:
SHOW tables IN <external_schema_name>; --external_schema_name is the name of the external schema.Method 2:
USE SCHEMA <external_schema_name>; --external_schema_name is the name of the external schema. SHOW tables;
Examples
Query the names of all tables in the external schema named test_schema.
USE SCHEMA test_schema; SHOW TABLES;Sample result:
ALIYUN$xxx@test.aliyunid.com:employees_pt ALIYUN$xxx@test.aliyunid.com:employeesQuery data from the
employeestable in thetest_schemaschema of thetestproject.SELECT * FROM test.test_schema.employees;Sample result:
+------------+------------+------------+------------+ | id | name | age | department | +------------+------------+------------+------------+ | 1 | John | 25 | Sales | | 2 | Jane | 30 | Marketing | | 3 | Mike | 35 | Engineering | | 4 | Sarah | 28 | HR | | 5 | David | 32 | Finance | | 6 | Linda | 29 | IT | | 7 | Robert | 31 | Operations | | 8 | Emily | 27 | Research | | 9 | Michael | 33 | Development | | 10 | Chris | 26 | Support | +------------+------------+------------+------------+Copy data from the federated external table
employeesto the data warehouse.MaxCompute does not allow you to run
CREATE TABLE ASoperations on an external schema. Therefore, you must run theUSE SCHEMA Default;statement to switch to the default schema before you run the subsequent statements:-- Copy data from the federated external table to the data warehouse. CREATE TABLE employees_copy AS SELECT * FROM test.test_schema.employees; -- Query the copied table data in the data warehouse. SELECT * FROM employees_copy;Sample result:
+------------+------------+------------+------------+ | id | name | age | department | +------------+------------+------------+------------+ | 1 | John | 25 | Sales | | 2 | Jane | 30 | Marketing | | 3 | Mike | 35 | Engineering | | 4 | Sarah | 28 | HR | | 5 | David | 32 | Finance | | 6 | Linda | 29 | IT | | 7 | Robert | 31 | Operations | | 8 | Emily | 27 | Research | | 9 | Michael | 33 | Development | | 10 | Chris | 26 | Support | +------------+------------+------------+------------+