Read Lindorm data using Spark SQL
AnalyticDB for MySQL supports accessing Lindorm databases by using Spark SQL. This topic describes how to use Spark SQL to access data in Hive tables and wide tables in Lindorm.
Prerequisites
-
Create a single-zone Lindorm instance. For more information, see Create an instance.
ImportantYou must activate LindormTable, Lindorm Distributed Processing System (LDPS), and LindormDFS for the Lindorm instance. To activate LDPS, see Activate a service.
-
Activate Object Storage Service (OSS), and create a bucket and a directory. For more information, see Activate OSS, Create a bucket, and Manage directories.
-
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
A job resource group is created for the AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster.
A database account is created for the AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster.
Preparations
-
In the Lindorm console, go to the Instance Details page to obtain the vSwitch ID.
-
In the Lindorm console, go to the Access Control page. Click the Security Group tab to view the security group ID of your Lindorm instance. If no security group is added, see Add a security group.
Accessing Hive data in Lindorm
-
Obtain the
hdfs-siteconfiguration information for your HDFS client.NoteTo obtain the
hdfs-siteconfiguration information, contact Lindorm technical support (DingTalk ID: s0s3eg3). Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
-
In the left-side navigation pane, choose .
-
In the SQLConsole window, select the Spark engine and a job resource group.
-
In the SQL Console window, enter the following commands:
SET spark.adb.connectors = oss,external_hive; SET spark.adb.eni.enabled=true; SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****; SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****; SET spark.sql.catalog.lindorm_catalog = org.apache.spark.sql.hive.V2HiveExternalCatalog; SET spark.adb.eni.adbHostAlias.enabled=true; SET spark.sql.catalogImplementation = hive; -- HDFS configurations SET spark.sql.catalog.lindorm_catalog.spark.hadoop.hive.metastore.uris=thrift://ld-bp1ttz8833x9c****-proxy-ldps-hms.lindorm.aliyuncs.com:9083; SET spark.hadoop.dfs.nameservices=ld-bp1ttz8833x9c****; SET spark.hadoop.dfs.client.failover.proxy.provider.ld-bp1ttz8833x9c****=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider; SET spark.hadoop.dfs.ha.automatic-failover.enabled=true; SET spark.hadoop.dfs.ha.namenodes.ld-bp1ttz8833x9c****=nn1,nn2; SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c-master1-001.lindorm.rds.aliyuncs.com:8020; SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:8020; SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c****-master1-001.lindorm.rds.aliyuncs.com:50071; SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:50071; -- Create an OSS external database and table CREATE DATABASE external_oss_db location 'oss://testBucketName/warehouse/db'; CREATE TABLE external_oss_db.test(id int, name string) using parquet location 'oss://testBucketName/warehouse/db/test'; -- Insert data from the Hive table into the OSS external table INSERT INTO external_oss_db.test SELECT * FROM lindorm_catalog.spark_test.test; -- Read data from the OSS external table SELECT * FROM external_oss_db.test;Parameter
Required
Description
spark.adb.connectors
Yes
The names of the built-in connectors of AnalyticDB for MySQL Spark that you want to enable. Separate multiple names with commas (,). Valid values: oss, hudi, delta, adb, odps, external_hive, jindo, and default.
This example uses the
oss,external_hiveconnectors.spark.adb.eni.enabled
Yes
Specifies whether to enable Elastic Network Interface (ENI).
If you use external tables to access other external data sources, you must enable ENI. Valid values:
-
true: Enable
-
false: Disable
In this example, set this parameter to
true.spark.adb.eni.vswitchId
Yes
The vSwitch ID of the Lindorm instance that you obtained in the "Preparations" section.
spark.adb.eni.securityGroupId
Yes
The security group ID of the Lindorm instance that you obtained in the "Preparations" section.
spark.sql.catalog.lindorm_catalog
Yes
The implementation class for the data source catalog.
Set this parameter to
org.apache.spark.sql.hive.V2HiveExternalCatalog.NoteYou can customize the
catalog_namepart of the parameter name. This example useslindorm_catalog.spark.adb.eni.adbHostAlias.enabled
Yes
Specifies whether to automatically write the domain name resolution information that AnalyticDB for MySQL requires to a mapping table of domain names and IP addresses. Valid values:
-
true: Enable.
-
false: Disable.
If you use an ENI to read data from or write data to EMR Hive, you must set this parameter to true.
HDFS configurations
Yes
The
hdfs-siteconfiguration information that you obtained in Step 1. The format isset spark.hadoop.<name>=<value>.NoteReplace <name> and <value> with the actual values from your configuration information.
location
Yes
The OSS storage path for the external database and external table. The format is
oss://<bucketname/database/table>. This example usesoss://testBucketName/warehouse/db/test.NoteReplace
<bucketname/database/table>with your actual OSS path. -
-
Click Execute.
-
After the Spark job completes successfully, go to the Spark JAR Development page. On the Applications tab, find the job and click Logs in the Actions column to view the results. For more information, see Spark development editor.
Accessing data in Lindorm wide tables
-
Obtain the LindormTable VPC endpoint for access using the HBase Java API. For more information, see View connection endpoints.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
-
In the left-side navigation pane, choose .
-
In the SQLConsole window, select the Spark engine and a job resource group.
-
In the SQL Console window, enter the following commands:
SET spark.adb.eni.enabled=true; SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****; SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****; SET spark.sql.catalog.lindorm_table = com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog; SET spark.sql.catalog.lindorm_table.url=ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30020; SET spark.sql.catalog.lindorm_table.username=test; SET spark.sql.catalog.lindorm_table.password=password; add jar "oss://<bucketname>/hbase_jars/*"; -- Create an OSS external database and table CREATE DATABASE IF NOT exists external_oss_db location 'oss://testBucketName/warehouse/db'; CREATE TABLE IF NOT exists external_oss_db.lindorm_table_2(id string, name string) using parquet location 'oss://testBucketName/warehouse/db/lindorm_table_2'; -- Insert data from the Lindorm wide table into the OSS external table INSERT INTO external_oss_db.lindorm_table_2 SELECT * FROM lindorm_table.spark_test.tb; -- Read data from the OSS external table SELECT * FROM external_oss_db.lindorm_table_2;Parameter description:
Parameter
Required
Description
spark.adb.eni.enabled
Yes
Specifies whether to enable Elastic Network Interface (ENI).
If you use external tables to access other external data sources, you must enable ENI. Valid values:
-
true: Enable
-
false: Disable
In this example, set this parameter to
true.spark.adb.eni.vswitchId
Yes
The vSwitch ID of the Lindorm instance that you obtained in the "Preparations" section.
spark.adb.eni.securityGroupId
Yes
The security group ID of the Lindorm instance that you obtained in the "Preparations" section.
spark.sql.catalog.lindorm_table
Yes
The implementation class for the data source catalog.
Set this parameter to
com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog.NoteYou can customize the
catalog_namepart of the parameter name. This example useslindorm_table.spark.sql.catalog.lindorm_table.url
Yes
The VPC endpoint for access using the HBase Java API that you obtained in Step 1.
spark.sql.catalog.lindorm_table.username
Yes
The username and password for LindormTable.
If you forget your username or password, you can change the password in the cluster management system of LindormTable. For more information, see Change a user password.
spark.sql.catalog.lindorm_table.password
Yes
jar
Yes
The OSS path to the directory where you uploaded the JAR packages in Step 2. Use
/*to include all JARs in the directory. Example:oss://testBucketName/hbase_jars/*.location
Yes
The OSS storage path for the external database and external table. The format is
oss://<bucketname/database/table>. This example usesoss://testBucketName/warehouse/db/lindorm_table_2.NoteReplace
<bucketname/database/table>with your actual OSS path. -
-
Click Execute.
-
After the Spark job completes successfully, go to the Spark JAR Development page. On the Applications tab, find the job and click Logs in the Actions column to view the results. For more information, see Spark development editor.