Hive data import

更新时间:
复制 MD 格式

AnalyticDB for MySQL supports migrating data from Hive to OSS using the data migration feature. This topic describes how to add a Hive data source, create and start a data migration job, analyze the data after migration, and manage the data migration job.

Features

AnalyticDB for MySQL lets you migrate Hive metadata and data to OSS in a single click. You can also migrate multiple databases and tables in parallel.

Prerequisites

  • E-MapReduce cluster: The business scenario of the cluster is New Data Lake, the metadata type is Self-managed RDS or Built-in MySQL, and the Hive service is enabled. The Hive Storage Mode must be HDFS, which means the Data Lake Storage checkbox is not selected. For more information, see Create a cluster.

    Important
    • E-MapReduce clusters with DLF Unified Metadata as the metadata type do not support migrating Hive data to OSS.

    • If the Data Lake Storage of your E-MapReduce cluster is Data Lake Storage, the data is already in OSS. You can use the metadata discovery feature to import the data into AnalyticDB for MySQL. For more information, see Import data by using metadata discovery.

  • A self-managed CDH cluster on ECS instances.

Billing

Migrating data to OSS by using the data migration feature of AnalyticDB for MySQL incurs the following fees.

Workflow

Create a Hive data source

Note

If you have already added a Hive data source, you can skip this step and create the data migration job directly. For more information, see Create a data migration job.

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, choose Data Ingestion > Data Sources.

  3. In the upper-left corner, click Create Data Source.

  4. On the Create Data Source page, configure the parameters. The following tables describe the parameters.

    Alibaba Cloud Instance

    Parameter

    Description

    Data Source Type

    Select Hive.

    Data Source Name

    The system automatically generates a name based on the data source type and current time. You can edit this name as needed.

    Data Source Description

    A description of the data source, such as its use case and business constraints.

    Deployment Mode

    If you use the Hive service of E-MapReduce, select Alibaba Cloud Instance.

    Instance

    Select the E-MapReduce instance to which the data source points.

    Hive Metastore URI

    The connection address of the Hive Metastore. The format is thrift://<IP address of the master node>:<Port number>. The default port number is 9083.

    To view the IP address of the master node:

    Log on to the EMR on ECS console. On the Nodes tab, click the 加号..png icon next to the emr-master node to view the private IP address of the master node.

    Self-managed CDH on ECS

    Parameter

    Description

    Data Source Type

    Select Hive.

    Data Source Name

    The system automatically generates a name based on the data source type and current time. You can edit this name as needed.

    Data Source Description

    A description of the data source, such as its use case and business constraints.

    Deployment Mode

    If you have a self-managed CDH cluster on ECS instances, select Self-managed CDH on ECS.

    Instance

    Select the ECS instance to which the data source points.

    Hive Metastore URI

    The connection address of the Hive Metastore. This is the public IP address of the ECS instance that corresponds to the CDH master node. The format is thrift://<IP address of the master node>:<Port number>. The default port number is 9083.

    Host Configuration Information

    Enter each host-to-IP mapping on a new line.

    Example:

    192.168.2.153 master.cdh

    192.168.2.154 node1.cdh

    192.168.2.155 node2.cdh

  5. After you configure the parameters, click Create.

Create a data migration job

  1. In the left-side navigation pane, click Data Migration.

  2. In the upper-right corner, click Create Migration Job.

  3. On the Create Migration Job page, configure parameters in the Source and Destination Settings, Database/Table Migration Settings, and Migration Settings sections.

    Source and destination settings

    Parameter

    Description

    Job Name

    The name of the data migration job. The system automatically generates a name based on the data source type and current time. You can edit this name as needed.

    Data Source

    Select an existing Hive data source or create a new one.

    Destination Type

    Only Data Lake - OSS Storage is supported.

    OSS Path

    The storage path in OSS for the data of the AnalyticDB for MySQL Data Lakehouse Edition cluster.

    Important
    • All buckets in the same region as the AnalyticDB for MySQL cluster are displayed. You can select any one of them. Plan the storage path with caution. You cannot change the storage path after it is created.

    • We recommend that you select an empty directory and ensure that its OSS path does not have a prefix relationship with the OSS paths of other jobs. This prevents data from being overwritten. For example, if the OSS paths of two data migration jobs are oss://adb_demo/test/sls1/ and oss://adb_demo/test/, the paths have a prefix relationship, which may cause data to be overwritten during data migration.

    Database/table migration settings

    Important

    If a database or table name exists in both the whitelist and the blacklist, the blacklist takes precedence, and the service does not migrate the database or table.

    Parameter

    Description

    Database/Table Migration Whitelist

    The job migrates databases and tables that match an expression. Enter regular expressions for database and table names. Separate multiple expressions with commas (,).

    Database/Table Migration Blacklist

    The job does not migrate databases and tables that match an expression. Enter regular expressions for database and table names. Separate multiple expressions with commas (,).

    Migration settings

    Parameter

    Description

    Handling Same Named Destination Table

    Specifies how to handle a destination table that already exists:

    • Skip this table (do not migrate): Skips only the current table. The migration of other tables continues.

    • Report error and pause migration: Pauses the migration job. The migration for the table with the same name and all other tables stops.

    Job Resource Group

    Specify the job resource group in which the job runs.

    Required ACUs

    Specify the number of ACUs for the job resource group. The minimum value is 4. The maximum value is the number of available computing resources in the job resource group. Allocate more ACUs to improve migration performance and job stability.

    Parallel Tasks

    The default value is 1. The maximum value is 8.

    If you increase this value, multiple migration tasks are started at the same time, and each task migrates one table. However, each task requires at least 4 ACUs. If the number of ACUs is insufficient, the migration tasks run sequentially.

    Advanced Settings

    Allows you to specify custom configurations for the job. To do so, contact technical support.

  4. After you configure the parameters, click Submit.

Start the data migration job

  1. On the Data Migration page, find the data migration job that you created and click Resume in the Actions column.

  2. Click Search in the upper-right corner. When the status changes to Starting, the data migration job has started.

Data analysis

After the migration task is successful, you can also use the metadata discovery feature to import data from OSS to the Enterprise Edition, Basic Edition, or Data Lakehouse Edition, and then analyze the data imported to the Enterprise Edition, Basic Edition, or Data Lakehouse Edition in Spark JAR Job Development. For more information about Spark development, see Spark development editor and Spark offline application development.

    1. In the left-side navigation pane, choose Job Development > Spark JAR Development.

    2. Enter the sample statements in the default template and click Run Now.

      -- Here is just an example of SparkSQL. Modify the content and run your spark program.
      
      conf spark.driver.resourceSpec=medium;
      conf spark.executor.instances=2;
      conf spark.executor.resourceSpec=medium;
      conf spark.app.name=Spark SQL Test;
      conf spark.adb.connectors=oss;
      
      -- Here are your sql statements
      show tables from lakehouse20220413156_adbTest;
    3. Optional: On the Applications tab, click Logs in the Actions column to view the Spark SQL run log.

Job management

On the Data Migration page, you can perform the following operations in the Actions column.

Actions

Description

Start

Start the data migration job.

View Details

View the detailed job configuration, number of migrated tables, and migration details.

Edit

Modify job configuration properties.

Pause

Pauses the current migration job.

Delete

Delete the current migration job.