Read Lindorm data using Spark SQL

更新时间:
复制 MD 格式

AnalyticDB for MySQL supports accessing Lindorm databases by using Spark SQL. This topic describes how to use Spark SQL to access data in Hive tables and wide tables in Lindorm.

Prerequisites

  • Create a single-zone Lindorm instance. For more information, see Create an instance.

    Important

    You must activate LindormTable, Lindorm Distributed Processing System (LDPS), and LindormDFS for the Lindorm instance. To activate LDPS, see Activate a service.

  • Activate Object Storage Service (OSS), and create a bucket and a directory. For more information, see Activate OSS, Create a bucket, and Manage directories.

  • An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.

  • A job resource group is created for the AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster.

  • A database account is created for the AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster.

Preparations

  1. In the Lindorm console, go to the Instance Details page to obtain the vSwitch ID.

  2. In the Lindorm console, go to the Access Control page. Click the Security Group tab to view the security group ID of your Lindorm instance. If no security group is added, see Add a security group.

Accessing Hive data in Lindorm

  1. Obtain the hdfs-site configuration information for your HDFS client.

    Note

    To obtain the hdfs-site configuration information, contact Lindorm technical support (DingTalk ID: s0s3eg3).

  2. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.

  3. In the left-side navigation pane, choose Job Development > SQL Development.

  4. In the SQLConsole window, select the Spark engine and a job resource group.

  5. In the SQL Console window, enter the following commands:

    SET spark.adb.connectors = oss,external_hive;
    SET spark.adb.eni.enabled=true;
    SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
    SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;
    SET spark.sql.catalog.lindorm_catalog = org.apache.spark.sql.hive.V2HiveExternalCatalog;
    SET spark.adb.eni.adbHostAlias.enabled=true;
    SET spark.sql.catalogImplementation = hive;
    -- HDFS configurations
    SET spark.sql.catalog.lindorm_catalog.spark.hadoop.hive.metastore.uris=thrift://ld-bp1ttz8833x9c****-proxy-ldps-hms.lindorm.aliyuncs.com:9083;
    SET spark.hadoop.dfs.nameservices=ld-bp1ttz8833x9c****;
    SET spark.hadoop.dfs.client.failover.proxy.provider.ld-bp1ttz8833x9c****=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider;
    SET spark.hadoop.dfs.ha.automatic-failover.enabled=true;
    SET spark.hadoop.dfs.ha.namenodes.ld-bp1ttz8833x9c****=nn1,nn2;
    SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c-master1-001.lindorm.rds.aliyuncs.com:8020;
    SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:8020;
    SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c****-master1-001.lindorm.rds.aliyuncs.com:50071;
    SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:50071;
    
    -- Create an OSS external database and table
    CREATE DATABASE external_oss_db location 'oss://testBucketName/warehouse/db';
    CREATE TABLE external_oss_db.test(id int, name string) using parquet location 'oss://testBucketName/warehouse/db/test';
    
    -- Insert data from the Hive table into the OSS external table
    INSERT INTO external_oss_db.test SELECT * FROM lindorm_catalog.spark_test.test;
    
    -- Read data from the OSS external table
    SELECT * FROM external_oss_db.test;

    Parameter

    Required

    Description

    spark.adb.connectors

    Yes

    The names of the built-in connectors of AnalyticDB for MySQL Spark that you want to enable. Separate multiple names with commas (,). Valid values: oss, hudi, delta, adb, odps, external_hive, jindo, and default.

    This example uses the oss,external_hive connectors.

    spark.adb.eni.enabled

    Yes

    Specifies whether to enable Elastic Network Interface (ENI).

    If you use external tables to access other external data sources, you must enable ENI. Valid values:

    • true: Enable

    • false: Disable

    In this example, set this parameter to true.

    spark.adb.eni.vswitchId

    Yes

    The vSwitch ID of the Lindorm instance that you obtained in the "Preparations" section.

    spark.adb.eni.securityGroupId

    Yes

    The security group ID of the Lindorm instance that you obtained in the "Preparations" section.

    spark.sql.catalog.lindorm_catalog

    Yes

    The implementation class for the data source catalog.

    Set this parameter to org.apache.spark.sql.hive.V2HiveExternalCatalog.

    Note

    You can customize the catalog_name part of the parameter name. This example uses lindorm_catalog.

    spark.adb.eni.adbHostAlias.enabled

    Yes

    Specifies whether to automatically write the domain name resolution information that AnalyticDB for MySQL requires to a mapping table of domain names and IP addresses. Valid values:

    • true: Enable.

    • false: Disable.

    If you use an ENI to read data from or write data to EMR Hive, you must set this parameter to true.

    HDFS configurations

    Yes

    The hdfs-site configuration information that you obtained in Step 1. The format is set spark.hadoop.<name>=<value>.

    Note

    Replace <name> and <value> with the actual values from your configuration information.

    location

    Yes

    The OSS storage path for the external database and external table. The format is oss://<bucketname/database/table>. This example uses oss://testBucketName/warehouse/db/test.

    Note

    Replace <bucketname/database/table> with your actual OSS path.

  6. Click Execute.

  7. After the Spark job completes successfully, go to the Spark JAR Development page. On the Applications tab, find the job and click Logs in the Actions column to view the results. For more information, see Spark development editor.

Accessing data in Lindorm wide tables

  1. Obtain the LindormTable VPC endpoint for access using the HBase Java API. For more information, see View connection endpoints.

  2. Download and decompress the JAR packages required to connect to the Lindorm instance, and then upload the extracted JAR packages to OSS. For more information, see Upload files.

  3. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.

  4. In the left-side navigation pane, choose Job Development > SQL Development.

  5. In the SQLConsole window, select the Spark engine and a job resource group.

  6. In the SQL Console window, enter the following commands:

    SET spark.adb.eni.enabled=true;
    SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
    SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;
    SET spark.sql.catalog.lindorm_table = com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog;
    SET spark.sql.catalog.lindorm_table.url=ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30020;
    SET spark.sql.catalog.lindorm_table.username=test;
    SET spark.sql.catalog.lindorm_table.password=password;
    add jar "oss://<bucketname>/hbase_jars/*";
    
    -- Create an OSS external database and table
    CREATE DATABASE IF NOT exists external_oss_db location 'oss://testBucketName/warehouse/db';
    CREATE TABLE IF NOT exists external_oss_db.lindorm_table_2(id string, name string) using parquet location 'oss://testBucketName/warehouse/db/lindorm_table_2';
    
    -- Insert data from the Lindorm wide table into the OSS external table
    INSERT INTO external_oss_db.lindorm_table_2 SELECT * FROM lindorm_table.spark_test.tb;
    
    -- Read data from the OSS external table
    SELECT * FROM external_oss_db.lindorm_table_2;

    Parameter description:

    Parameter

    Required

    Description

    spark.adb.eni.enabled

    Yes

    Specifies whether to enable Elastic Network Interface (ENI).

    If you use external tables to access other external data sources, you must enable ENI. Valid values:

    • true: Enable

    • false: Disable

    In this example, set this parameter to true.

    spark.adb.eni.vswitchId

    Yes

    The vSwitch ID of the Lindorm instance that you obtained in the "Preparations" section.

    spark.adb.eni.securityGroupId

    Yes

    The security group ID of the Lindorm instance that you obtained in the "Preparations" section.

    spark.sql.catalog.lindorm_table

    Yes

    The implementation class for the data source catalog.

    Set this parameter to com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog.

    Note

    You can customize the catalog_name part of the parameter name. This example uses lindorm_table.

    spark.sql.catalog.lindorm_table.url

    Yes

    The VPC endpoint for access using the HBase Java API that you obtained in Step 1.

    spark.sql.catalog.lindorm_table.username

    Yes

    The username and password for LindormTable.

    If you forget your username or password, you can change the password in the cluster management system of LindormTable. For more information, see Change a user password.

    spark.sql.catalog.lindorm_table.password

    Yes

    jar

    Yes

    The OSS path to the directory where you uploaded the JAR packages in Step 2. Use /* to include all JARs in the directory. Example: oss://testBucketName/hbase_jars/*.

    location

    Yes

    The OSS storage path for the external database and external table. The format is oss://<bucketname/database/table>. This example uses oss://testBucketName/warehouse/db/lindorm_table_2.

    Note

    Replace <bucketname/database/table> with your actual OSS path.

  7. Click Execute.

  8. After the Spark job completes successfully, go to the Spark JAR Development page. On the Applications tab, find the job and click Logs in the Actions column to view the results. For more information, see Spark development editor.