How to import Hive data into an AnalyticDB for MySQL lakehouse-AnalyticDB(AnalyticDB)-阿里云帮助中心

Features

AnalyticDB for MySQL lets you migrate Hive metadata and data to OSS in a single click. You can also migrate multiple databases and tables in parallel.

Prerequisites

An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
A job resource group is created for the AnalyticDB for MySQL cluster.
A database account is created for the AnalyticDB for MySQL cluster.
- If you use an Alibaba Cloud account, you need to only create a privileged account.
- If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
Create an E-MapReduce cluster or a self-managed CDH cluster on ECS instances in the same region as the AnalyticDB for MySQL cluster. The cluster must meet the following requirements:

E-MapReduce cluster: The business scenario of the cluster is New Data Lake, the metadata type is Self-managed RDS or Built-in MySQL, and the Hive service is enabled. The Hive Storage Mode must be HDFS, which means the Data Lake Storage checkbox is not selected. For more information, see Create a cluster.
Important
- E-MapReduce clusters with DLF Unified Metadata as the metadata type do not support migrating Hive data to OSS.
- If the Data Lake Storage of your E-MapReduce cluster is Data Lake Storage, the data is already in OSS. You can use the metadata discovery feature to import the data into AnalyticDB for MySQL. For more information, see Import data by using metadata discovery.
A self-managed CDH cluster on ECS instances.

Create Hive databases and partitioned tables. For more information, see Basic Hive operations.

Billing

Migrating data to OSS by using the data migration feature of AnalyticDB for MySQL incurs the following fees.

The elastic resource fees for the ACUs of your AnalyticDB for MySQL cluster. For more information about billable items, see Billable items of Data Lakehouse Edition and Billable items of Enterprise Edition and Basic Edition.
The storage fees, GET request fees, and PUT and other request fees for OSS. For more information about billable items, see Billing overview.

Workflow

Step 1: Create a Hive data source.
Step 2: Create a data migration job.
Step 3: Start the data migration job.
Step 4: Analyze the data.
Step 5 (Optional): Manage the data migration job.

Create a Hive data source

Note

If you have already added a Hive data source, you can skip this step and create the data migration job directly. For more information, see Create a data migration job.

Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Data Ingestion > Data Sources.
In the upper-left corner, click Create Data Source.

On the Create Data Source page, configure the parameters. The following tables describe the parameters.

Alibaba Cloud Instance

Parameter	Description
Data Source Type	Select Hive.
Data Source Name	The system automatically generates a name based on the data source type and current time. You can edit this name as needed.
Data Source Description	A description of the data source, such as its use case and business constraints.
Deployment Mode	If you use the Hive service of E-MapReduce, select Alibaba Cloud Instance.
Instance	Select the E-MapReduce instance to which the data source points.
Hive Metastore URI	The connection address of the Hive Metastore. The format is `thrift://<IP address of the master node>:<Port number>`. The default port number is 9083. To view the IP address of the master node: Log on to the EMR on ECS console. On the Nodes tab, click the icon next to the emr-master node to view the private IP address of the master node.

Self-managed CDH on ECS

Parameter	Description
Data Source Type	Select Hive.
Data Source Name	The system automatically generates a name based on the data source type and current time. You can edit this name as needed.
Data Source Description	A description of the data source, such as its use case and business constraints.
Deployment Mode	If you have a self-managed CDH cluster on ECS instances, select Self-managed CDH on ECS.
Instance	Select the ECS instance to which the data source points.
Hive Metastore URI	The connection address of the Hive Metastore. This is the public IP address of the ECS instance that corresponds to the CDH master node. The format is `thrift://<IP address of the master node>:<Port number>`. The default port number is 9083.
Host Configuration Information	Enter each host-to-IP mapping on a new line. Example: 192.168.2.153 master.cdh 192.168.2.154 node1.cdh 192.168.2.155 node2.cdh

After you configure the parameters, click Create.

Create a data migration job

In the left-side navigation pane, click Data Migration.
In the upper-right corner, click Create Migration Job.

On the Create Migration Job page, configure parameters in the Source and Destination Settings, Database/Table Migration Settings, and Migration Settings sections.

Source and destination settings

Parameter	Description
Job Name	The name of the data migration job. The system automatically generates a name based on the data source type and current time. You can edit this name as needed.
Data Source	Select an existing Hive data source or create a new one.
Destination Type	Only Data Lake - OSS Storage is supported.
OSS Path	The storage path in OSS for the data of the AnalyticDB for MySQL Data Lakehouse Edition cluster. Important All buckets in the same region as the AnalyticDB for MySQL cluster are displayed. You can select any one of them. Plan the storage path with caution. You cannot change the storage path after it is created. We recommend that you select an empty directory and ensure that its OSS path does not have a prefix relationship with the OSS paths of other jobs. This prevents data from being overwritten. For example, if the OSS paths of two data migration jobs are `oss://adb_demo/test/sls1/` and `oss://adb_demo/test/`, the paths have a prefix relationship, which may cause data to be overwritten during data migration.

Database/table migration settings

Important

If a database or table name exists in both the whitelist and the blacklist, the blacklist takes precedence, and the service does not migrate the database or table.

Parameter	Description
Database/Table Migration Whitelist	The job migrates databases and tables that match an expression. Enter regular expressions for database and table names. Separate multiple expressions with commas (,).
Database/Table Migration Blacklist	The job does not migrate databases and tables that match an expression. Enter regular expressions for database and table names. Separate multiple expressions with commas (,).

Migration settings

Parameter	Description
Handling Same Named Destination Table	Specifies how to handle a destination table that already exists: Skip this table (do not migrate): Skips only the current table. The migration of other tables continues. Report error and pause migration: Pauses the migration job. The migration for the table with the same name and all other tables stops.
Job Resource Group	Specify the job resource group in which the job runs.
Required ACUs	Specify the number of ACUs for the job resource group. The minimum value is 4. The maximum value is the number of available computing resources in the job resource group. Allocate more ACUs to improve migration performance and job stability.
Parallel Tasks	The default value is 1. The maximum value is 8. If you increase this value, multiple migration tasks are started at the same time, and each task migrates one table. However, each task requires at least 4 ACUs. If the number of ACUs is insufficient, the migration tasks run sequentially.
Advanced Settings	Allows you to specify custom configurations for the job. To do so, contact technical support.

After you configure the parameters, click Submit.

Start the data migration job

On the Data Migration page, find the data migration job that you created and click Resume in the Actions column.
Click Search in the upper-right corner. When the status changes to Starting, the data migration job has started.

Data analysis

After the migration task is successful, you can also use the metadata discovery feature to import data from OSS to the Enterprise Edition, Basic Edition, or Data Lakehouse Edition, and then analyze the data imported to the Enterprise Edition, Basic Edition, or Data Lakehouse Edition in Spark JAR Job Development. For more information about Spark development, see Spark development editor and Spark offline application development.

In the left-side navigation pane, choose Job Development > Spark JAR Development.

Enter the sample statements in the default template and click Run Now.

-- Here is just an example of SparkSQL. Modify the content and run your spark program.

conf spark.driver.resourceSpec=medium;
conf spark.executor.instances=2;
conf spark.executor.resourceSpec=medium;
conf spark.app.name=Spark SQL Test;
conf spark.adb.connectors=oss;

-- Here are your sql statements
show tables from lakehouse20220413156_adbTest;

Optional: On the Applications tab, click Logs in the Actions column to view the Spark SQL run log.

Job management

On the Data Migration page, you can perform the following operations in the Actions column.

Actions	Description
Start	Start the data migration job.
View Details	View the detailed job configuration, number of migrated tables, and migration details.
Edit	Modify job configuration properties.
Pause	Pauses the current migration job.
Delete	Delete the current migration job.