Hive data import
AnalyticDB for MySQL supports migrating data from Hive to OSS using the data migration feature. This topic describes how to add a Hive data source, create and start a data migration job, analyze the data after migration, and manage the data migration job.
Features
AnalyticDB for MySQL lets you migrate Hive metadata and data to OSS in a single click. You can also migrate multiple databases and tables in parallel.
Prerequisites
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
A job resource group is created for the AnalyticDB for MySQL cluster.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you need to only create a privileged account.
If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
-
Create an E-MapReduce cluster or a self-managed CDH cluster on ECS instances in the same region as the AnalyticDB for MySQL cluster. The cluster must meet the following requirements:
-
E-MapReduce cluster: The business scenario of the cluster is New Data Lake, the metadata type is Self-managed RDS or Built-in MySQL, and the Hive service is enabled. The Hive Storage Mode must be HDFS, which means the Data Lake Storage checkbox is not selected. For more information, see Create a cluster.
Important-
E-MapReduce clusters with DLF Unified Metadata as the metadata type do not support migrating Hive data to OSS.
-
If the Data Lake Storage of your E-MapReduce cluster is Data Lake Storage, the data is already in OSS. You can use the metadata discovery feature to import the data into AnalyticDB for MySQL. For more information, see Import data by using metadata discovery.
-
-
A self-managed CDH cluster on ECS instances.
-
Create Hive databases and partitioned tables. For more information, see Basic Hive operations.
Billing
Migrating data to OSS by using the data migration feature of AnalyticDB for MySQL incurs the following fees.
-
The elastic resource fees for the ACUs of your AnalyticDB for MySQL cluster. For more information about billable items, see Billable items of Data Lakehouse Edition and Billable items of Enterprise Edition and Basic Edition.
-
The storage fees, GET request fees, and PUT and other request fees for OSS. For more information about billable items, see Billing overview.
Workflow
-
Step 1: Create a Hive data source.
-
Step 2: Create a data migration job.
-
Step 3: Start the data migration job.
-
Step 4: Analyze the data.
-
Step 5 (Optional): Manage the data migration job.
Create a Hive data source
If you have already added a Hive data source, you can skip this step and create the data migration job directly. For more information, see Create a data migration job.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
-
In the left-side navigation pane, choose Data Ingestion > Data Sources.
-
In the upper-left corner, click Create Data Source.
-
On the Create Data Source page, configure the parameters. The following tables describe the parameters.
Alibaba Cloud Instance
Parameter
Description
Data Source Type
Select Hive.
Data Source Name
The system automatically generates a name based on the data source type and current time. You can edit this name as needed.
Data Source Description
A description of the data source, such as its use case and business constraints.
Deployment Mode
If you use the Hive service of E-MapReduce, select Alibaba Cloud Instance.
Instance
Select the E-MapReduce instance to which the data source points.
Hive Metastore URI
The connection address of the Hive Metastore. The format is
thrift://<IP address of the master node>:<Port number>. The default port number is 9083.To view the IP address of the master node:
Log on to the EMR on ECS console. On the Nodes tab, click the
icon next to the emr-master node to view the private IP address of the master node.Self-managed CDH on ECS
Parameter
Description
Data Source Type
Select Hive.
Data Source Name
The system automatically generates a name based on the data source type and current time. You can edit this name as needed.
Data Source Description
A description of the data source, such as its use case and business constraints.
Deployment Mode
If you have a self-managed CDH cluster on ECS instances, select Self-managed CDH on ECS.
Instance
Select the ECS instance to which the data source points.
Hive Metastore URI
The connection address of the Hive Metastore. This is the public IP address of the ECS instance that corresponds to the CDH master node. The format is
thrift://<IP address of the master node>:<Port number>. The default port number is 9083.Host Configuration Information
Enter each host-to-IP mapping on a new line.
Example:
192.168.2.153 master.cdh
192.168.2.154 node1.cdh
192.168.2.155 node2.cdh
-
After you configure the parameters, click Create.
Create a data migration job
-
In the left-side navigation pane, click Data Migration.
-
In the upper-right corner, click Create Migration Job.
-
On the Create Migration Job page, configure parameters in the Source and Destination Settings, Database/Table Migration Settings, and Migration Settings sections.
Source and destination settings
Parameter
Description
Job Name
The name of the data migration job. The system automatically generates a name based on the data source type and current time. You can edit this name as needed.
Data Source
Select an existing Hive data source or create a new one.
Destination Type
Only Data Lake - OSS Storage is supported.
OSS Path
The storage path in OSS for the data of the AnalyticDB for MySQL Data Lakehouse Edition cluster.
Important-
All buckets in the same region as the AnalyticDB for MySQL cluster are displayed. You can select any one of them. Plan the storage path with caution. You cannot change the storage path after it is created.
-
We recommend that you select an empty directory and ensure that its OSS path does not have a prefix relationship with the OSS paths of other jobs. This prevents data from being overwritten. For example, if the OSS paths of two data migration jobs are
oss://adb_demo/test/sls1/andoss://adb_demo/test/, the paths have a prefix relationship, which may cause data to be overwritten during data migration.
Database/table migration settings
ImportantIf a database or table name exists in both the whitelist and the blacklist, the blacklist takes precedence, and the service does not migrate the database or table.
Parameter
Description
Database/Table Migration Whitelist
The job migrates databases and tables that match an expression. Enter regular expressions for database and table names. Separate multiple expressions with commas (,).
Database/Table Migration Blacklist
The job does not migrate databases and tables that match an expression. Enter regular expressions for database and table names. Separate multiple expressions with commas (,).
Migration settings
Parameter
Description
Handling Same Named Destination Table
Specifies how to handle a destination table that already exists:
-
Skip this table (do not migrate): Skips only the current table. The migration of other tables continues.
-
Report error and pause migration: Pauses the migration job. The migration for the table with the same name and all other tables stops.
Job Resource Group
Specify the job resource group in which the job runs.
Required ACUs
Specify the number of ACUs for the job resource group. The minimum value is 4. The maximum value is the number of available computing resources in the job resource group. Allocate more ACUs to improve migration performance and job stability.
Parallel Tasks
The default value is 1. The maximum value is 8.
If you increase this value, multiple migration tasks are started at the same time, and each task migrates one table. However, each task requires at least 4 ACUs. If the number of ACUs is insufficient, the migration tasks run sequentially.
Advanced Settings
Allows you to specify custom configurations for the job. To do so, contact technical support.
-
-
After you configure the parameters, click Submit.
Start the data migration job
-
On the Data Migration page, find the data migration job that you created and click Resume in the Actions column.
-
Click Search in the upper-right corner. When the status changes to Starting, the data migration job has started.
Data analysis
After the migration task is successful, you can also use the metadata discovery feature to import data from OSS to the Enterprise Edition, Basic Edition, or Data Lakehouse Edition, and then analyze the data imported to the Enterprise Edition, Basic Edition, or Data Lakehouse Edition in Spark JAR Job Development. For more information about Spark development, see Spark development editor and Spark offline application development.
-
-
In the left-side navigation pane, choose .
-
Enter the sample statements in the default template and click Run Now.
-- Here is just an example of SparkSQL. Modify the content and run your spark program. conf spark.driver.resourceSpec=medium; conf spark.executor.instances=2; conf spark.executor.resourceSpec=medium; conf spark.app.name=Spark SQL Test; conf spark.adb.connectors=oss; -- Here are your sql statements show tables from lakehouse20220413156_adbTest; -
Optional: On the Applications tab, click Logs in the Actions column to view the Spark SQL run log.
-
Job management
On the Data Migration page, you can perform the following operations in the Actions column.
|
Actions |
Description |
|
Start |
Start the data migration job. |
|
View Details |
View the detailed job configuration, number of migrated tables, and migration details. |
|
Edit |
Modify job configuration properties. |
|
Pause |
Pauses the current migration job. |
|
Delete |
Delete the current migration job. |