Hive+HDFS external schema-MaxCompute(MaxCompute)-阿里云帮助中心

MaxCompute provides a Hive+HDFS external schema that can be mapped to an EMR cluster or a standard open source Hadoop cluster. This schema supports real-time metadata retrieval and data access. You can map a Hive instance at the database level, which allows MaxCompute to read data from Hive without using the computing capabilities of Hadoop.

Introduction

Hive is a common open source data warehouse solution for big data. Its metadata is usually stored in Hive Metastore (HMS), and its data is stored on HDFS. MaxCompute lets you create this type of external data source. By creating a federation between MaxCompute and Hive, you can access and aggregate data from open source big data systems within your data warehouse.

The Hive+HDFS schema does not support Kerberos authentication. It supports only Hadoop instances that do not require authentication and are connected through an external data source. In MaxCompute, you can use the MaxCompute permission system to manage user access to mapped data objects within the external schema based on object names.

Limitations

Data lakehouse 2.0 capabilities can be built only in the China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt) regions.
Hive 2.x is supported. Hive 3.x is also supported if it is compatible with version 2.x.
Pay-as-you-go SQL federated computing tasks for the Hive+HDFS mode are free of charge during the public preview period.
If the Hive+HDFS external data source is an E-MapReduce instance, only a self-managed RDS instance can be used for metadata. The DLF unified metadata service or the built-in MySQL database are not supported. Before you read data, confirm that the Hive component of the E-MapReduce instance is running properly.

Procedure

Step 1: Prepare test data

If you already have Hive data, you can skip this step.

Activate E-MapReduce. The VPC where MaxCompute and the Hadoop cluster are located must be in the same region.

Prepare Hive test data

登录E-MapReduce控制台，在左上角选择地域。

在My Clusters主页面，单击Create Cluster。

Software Configuration

在顶部菜单栏处，根据实际情况选择资源组。资源组默认显示账号全部资源。其他配置参数如下：

参数名	说明
Region	创建的集群会在对应的地域内，一旦创建便无法修改。 MaxCompute与Hadoop集群所在的VPC必须部署在同一地域。
Business Scenario	本示例中选择Custom Cluster。
Product Version	根据需求选择，推荐使用最新版本。
High Service Availability	默认不开启。打开High Service Availability开关后，EMR会把Master节点分布在不同的底层硬件上以降低故障风险。
Optional Services (Select One At Least)	本文实践选择`HADOOP-COMMON(3.2.1)`、`HDFS(3.2.1)`、`HIVE(3.1.3)`以及`YARN(3.2.1)`。
Collect Service Operational Logs	默认开启采集。 if you want to collect service operational logs, you can modify the collection status on the Basic Information tab of the cluster details page.
Metadata	本示例中使用Self-managed RDS。云上创建RDS步骤参见创建RDS MySQL实例与配置数据库。 For an EMR cluster that is used in the production environment, we recommend that you use independent ApsaraDB RDS for MySQL. For example, you can use ApsaraDB RDS for MySQL 5.7 of Alibaba Cloud high-availability edition. 填写RDS连接信息hivemetastore-site.xml： javax.jdo.option.ConnectionURL JDBC元数据连接，例如：`jdbc:mysql://rds.host.name/hive_db_name` javax.jdo.option.ConnectionUserName 元数据的数据库账号名。 javax.jdo.option.ConnectionPassword 元数据的数据库密码。

上述配置仅适用于本文实践，关于通过E-MapReduce控制台创建分析集群的详细操作请参见Quick start: Create and use a Data Lake cluster。

Hardware Configuration

参数名	说明
Billing Method	Subscription Pay-as-you-go 在测试场景下，建议使用Pay-as-you-go，测试正常后可以释放该集群，再新建一个Subscription的生产集群正式使用。
Zone	集群创建后，无法直接更改可用区，请谨慎选择。
VPC	选择对应区域下的专有网络。如果没有，单击创建VPC前往新建。创建专有网络完成后，单击刷新，可以选择刚创建的VPC。
vSwitch	选择在对应专有网络下可用区的交换机，如果在这个可用区没有可用的交换机，则需要新创建一个。
Default Security Group	Important 不支持使用ECS上创建的企业安全组。如果已有在使用的安全组，可以直接选择使用或新建安全组。
Node Group	根据业务诉求，配置Master节点组、Core节点组或Task节点组信息。详情请参见Choose hardware and network。本例中打开Master节点组下的Assign Public Network IP开关，其余使用默认值即可。
Cluster Scaling	配置弹性伸缩规则。

Basic Configuration

参数名	说明
Cluster Name	集群的名字，长度限制为1～64个字符，仅可使用中文、字母、数字、短划线（-）和下划线（_）。
Identity Credentials	Key Pair：如果希望在不输入密码的情况下进行身份验证，可以选择Key Pair方式，更多详情请参见Manage SSH key pairs。 Password：用于远程登录集群的Master节点。请记录该配置，登录集群时您需要输入该密码。

登录集群Master节点，进入Hive准备数据。

登录集群Master节点
1. 登录E-MapReduce控制台，在左上角选择地域。
2. 单击目标集群对应的Nodes。
3. 在Nodes页签，单击节点组所在行的图标，单击节点ID进入Instance详情页。
4. 在节点ID的Instance页面，单击Connect。
5. 在弹出的Remote connection对话框，使用默认的Workbench，单击Sign in now。
6. 在登录实例窗口根据实际情况填写该实例信息。
  
  Note
  在创建密钥对时，私钥会自动下载保存为.pem格式的私钥文件。认证方式可以选择SSH密钥认证，上传私钥文件，可以在不输入密码的情况下进行身份验证。

准备Hive测试数据。使用Workbench方式登录集群主节点后，在终端界面输入如下命令：

[user@emr-node ~]$ hive

-- 创建库
hive> CREATE database IF NOT EXISTS myhive;

-- 切换库
hive> USE myhive;

-- 创建非分区表
hive> CREATE TABLE IF NOT EXISTS employees (
        id INT,
        name STRING,
        age INT,
        department STRING
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;

-- 插入数据
hive> INSERT INTO employees VALUES(1, 'John', 25, 'Sales'),
    (2, 'Jane', 30, 'Marketing'),
    (3, 'Mike', 35, 'Engineering'),
    (4, 'Sarah', 28, 'HR'),
    (5, 'David', 32, 'Finance'),
    (6, 'Linda', 29, 'IT'),
    (7, 'Robert', 31, 'Operations'),
    (8, 'Emily', 27, 'Research'),
    (9, 'Michael', 33, 'Development'),
    (10, 'Chris', 26, 'Support');

-- 创建分区表
hive> CREATE TABLE employees_pt (
      id INT,
      name STRING,
      age INT
    )
  PARTITIONED BY (department STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 插入数据
INSERT INTO employees_pt PARTITION (department='Sales') VALUES(1, 'John', 25),(2, 'Jane', 30),(3, 'Mike', 35);
INSERT INTO employees_pt PARTITION (department='Marketing') VALUES(4, 'Sarah', 28),(5, 'David', 32);
INSERT INTO employees_pt PARTITION (department='Engineering') VALUES(6, 'Linda', 29),(7, 'Robert', 31);
INSERT INTO employees_pt PARTITION (department='HR') VALUES(8, 'Emily', 27),(9, 'Michael', 33),(10, 'Chris', 26);

查询非分区表和分区表：

-- 查询非分区表
hive> SELECT * FROM employees;

1       John    25      Sales
2       Jane    30      Marketing
3       Mike    35      Engineering
4       Sarah   28      HR
5       David   32      Finance
6       Linda   29      IT
7       Robert  31      Operations
8       Emily   27      Research
9       Michael 33      Development
10      Chris   26      Support

-- 查询分区表
hive> SELECT * FROM employees_pt;

6       Linda   29      Engineering
7       Robert  31      Engineering
8       Emily   27      HR
9       Michael 33      HR
10      Chris   26      HR
4       Sarah   28      Marketing
5       David   32      Marketing
1       John    25      Sales
2       Jane    30      Sales
3       Mike    35      Sales

Step 2: Create a Hive+HDFS external data source

Log in to the MaxCompute console and select a region in the upper-left corner.
In the left-side navigation pane, choose Manage Configurations > External Data Source .
On the External Data Source page, click Create External Data Source.

In the Create External Data Source dialog box, configure the parameters. The following tables describe the parameters.

Parameter	Required	Description
External Data Source Type	Required	Select Hive+HDFS.
External Data Source Name	Required	Enter a custom name. Naming conventions: Must start with a letter and can contain only lowercase letters, underscores, and digits. Cannot exceed 128 characters. For example, `hive_hdfs`.
Description	Optional	Enter a description as needed.
Network Connection Object	Required	The name of the network connection. Select or create a network connection from MaxCompute to the VPC network of the Alibaba Cloud E-MapReduce or Hadoop cluster. Important The VPC must be in the same region as the MaxCompute external data source and the project to which the external data source is attached. For more information about the parameters, see the Create a network connection between MaxCompute and the target VPC network step in Network connection process.
Cluster Name		The name that represents the NameNode in a high-availability Hadoop cluster. To obtain the cluster name of an EMR cluster, perform the following steps: 登录E-MapReduce控制台，在左上角选择地域。单击目标集群ID，进入集群详情页面。在Services页签，单击HDFS服务的Configure，进入Configure页面。切换至hdfs-site.xml页签，在By Name栏搜索`dfs.nameservices`，该配置项对应的值则为Cluster Name。
NameNode Address		The endpoints and port numbers of the active and standby NameNode services of the destination Hadoop cluster. The port number is usually 8020. To obtain the NameNode address of an EMR cluster, perform the following steps: 登录E-MapReduce控制台，在左上角选择地域。单击目标集群ID，进入集群详情页面。在Services页签，单击HDFS服务的Status，进入Status页面。在Components区域，单击NameNode前面的图标，展开拓扑列表。获取master-1-1节点对应的Private IP Address，NameNode Address格式即为`内网IP:8020`。
HMS Service Address		The endpoints and port numbers of the Hive metadata services of the active and standby NameNodes in the destination Hadoop cluster. The port number is usually 9083. 登录E-MapReduce控制台，在左上角选择地域。单击目标集群ID，进入集群详情页面。在Services页签，单击Hive服务的Status，进入Status页面。在Components区域，单击HiveRuntime前面的图标，展开拓扑列表。获取master-1-1节点对应的Private IP Address，HMS Service Address格式即为`内网IP:9083`。
Fs.Defaultfs
Authentication Type		Only No authentication is supported.
Create vSwitch		MaxCompute accesses the data source over a VPC. By default, the reverse access 2.0 solution is used. This solution requires you to configure a vSwitch in a specific zone to establish the metadata access link. The zones where vSwitches are available in each region are described in the UI prompts. In the VPC where the data source resides, select an existing vSwitch or create a new one that meets the zone requirements.
Foreign Server Supplemental Properties	Optional	Additional properties for the external data source. After you specify these properties, tasks that use this external data source can access the source system based on the defined behavior. Note For information about the supported parameters, see subsequent updates to the official documentation. The specific parameters will be gradually made available as the product evolves.

Click OK to create the external data source.

Step 3: Create an external schema

Operations on external schemas must be performed in a MaxCompute project. Therefore, you must enable the schema-level feature for the project.
1. In the left-side navigation pane, choose Manage Configurations > Projects.
2. On the Projects page, click Enable Schema in the Actions column for the target project.
Before you perform schema operations or query data from an external schema, you must enable the schema feature for the SQL syntax. To do this, add the SET odps.namespace.schema=true; statement before your SQL statement.
In a MaxCompute client, run the following command to create an external schema that references a Hive+HDFS external data source:
```
CREATE EXTERNAL SCHEMA IF NOT EXISTS <external_schema_name>
WITH  <hive_foreign_server_name>
ON '<database_name>' ;
```
The following table describes the parameters.
- external_schema_name: The custom name of the external schema. In this example, the name is test_schema_hive.
- hive_foreign_server_name: The name of the external data source that you created. In this example, the name is hive_hdfs.
- database_name: The name of the Hive database. In this example, the name is myhive.

Step 4: Use SQL to access the data source

List the names of Hive tables

Method 1:

SHOW tables IN <external_schema_name>; --external_schema_name is the name of the external schema.

Method 2:

USE SCHEMA <external_schema_name>; --external_schema_name is the name of the external schema.
SHOW tables;

Examples

Query the names of all tables in the external schema named test_schema.

USE SCHEMA test_schema;
SHOW TABLES;

Sample result:

ALIYUN$xxx@test.aliyunid.com:employees_pt
ALIYUN$xxx@test.aliyunid.com:employees

Query data from the employees table in the test_schema schema of the test project.

SELECT * FROM test.test_schema.employees;

Sample result:

+------------+------------+------------+------------+
| id         | name       | age        | department | 
+------------+------------+------------+------------+
| 1          | John       | 25         | Sales      | 
| 2          | Jane       | 30         | Marketing  | 
| 3          | Mike       | 35         | Engineering | 
| 4          | Sarah      | 28         | HR         | 
| 5          | David      | 32         | Finance    | 
| 6          | Linda      | 29         | IT         | 
| 7          | Robert     | 31         | Operations | 
| 8          | Emily      | 27         | Research   | 
| 9          | Michael    | 33         | Development | 
| 10         | Chris      | 26         | Support    | 
+------------+------------+------------+------------+

Copy data from the federated external table employees to the data warehouse.

MaxCompute does not allow you to run CREATE TABLE AS operations on an external schema. Therefore, you must run the USE SCHEMA Default; statement to switch to the default schema before you run the subsequent statements:

-- Copy data from the federated external table to the data warehouse.
CREATE TABLE employees_copy AS SELECT * FROM test.test_schema.employees;

-- Query the copied table data in the data warehouse.
SELECT * FROM employees_copy;

Sample result:

+------------+------------+------------+------------+
| id         | name       | age        | department | 
+------------+------------+------------+------------+
| 1          | John       | 25         | Sales      | 
| 2          | Jane       | 30         | Marketing  | 
| 3          | Mike       | 35         | Engineering | 
| 4          | Sarah      | 28         | HR         | 
| 5          | David      | 32         | Finance    | 
| 6          | Linda      | 29         | IT         | 
| 7          | Robert     | 31         | Operations | 
| 8          | Emily      | 27         | Research   | 
| 9          | Michael    | 33         | Development | 
| 10         | Chris      | 26         | Support    | 
+------------+------------+------------+------------+