Access MaxCompute data with Spark SQL-AnalyticDB(AnalyticDB)-阿里云帮助中心

Use Spark SQL in AnalyticDB for MySQL to read from and write to MaxCompute.

Prerequisites

An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
A database account is created for the AnalyticDB for MySQL cluster.
- If you use an Alibaba Cloud account, you need to only create a privileged account.
- If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
A job resource group has been created in the AnalyticDB for MySQL cluster.
MaxCompute and AnalyticDB for MySQL must be in the same region. For more information, see Activate MaxCompute.
A MaxCompute project has been created. For more information, see Create a MaxCompute Project.

Procedure

Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
In the SQL Console window, select the Spark engine and a job resource group.

Write a Spark SQL job.

SET spark.adb.version=3.3;
SET spark.adb.connectors=odps;
SET spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog;
SET spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions;
SET spark.hadoop.odps.project.name=spark_on_maxcompute;
SET spark.hadoop.odps.end.point=https://service.cn-hangzhou-vpc.maxcompute.aliyun-inc.com/api;
SET spark.hadoop.odps.access.id=yourAccessKeyID;
SET spark.hadoop.odps.access.key=yourAccessKeySecret;
SET spark.sql.catalog.odps.tableWriteProvider=tunnel;
SET spark.sql.catalog.odps.tableReadProvider=tunnel;

-- In this example, the project name is 'spark_on_maxcompute'. Make sure that this name is the same as the value of the spark.hadoop.odps.project.name parameter.
CREATE TABLE odps.spark_on_maxcompute.spark_test(k1 int, k2 string) partitioned BY (part int);
INSERT INTO odps.spark_on_maxcompute.spark_test PARTITION(part=1) VALUES(1, 'aaa');
SELECT * FROM odps.spark_on_maxcompute.spark_test WHERE part=1;

The following table describes the parameters.

Parameter	Description
spark.adb.version	The Spark version. Only versions `3.3` and `3.5` support accessing the MaxCompute data source. You can use either version.
spark.adb.connectors	The name of the data source. When you read data from or write data to MaxCompute, set this parameter to `odps`.
spark.sql.catalog.<catalog_name>	Spark SQL supports the configuration of data sources. The `catalog_name` in the parameter name is customizable. When you read data from or write data to MaxCompute, this parameter must be set to `org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog`. For example: `spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog`.
spark.sql.extensions	Spark SQL extension. Set this to `org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions` to read data from and write data to MaxCompute.
spark.hadoop.odps.project.name	MaxCompute project name. The example in this topic is `spark_on_maxcompute`. Note This is the MaxCompute project name, not the workspace name. You can log on to the MaxCompute console, switch the region in the upper-left corner, and in the left navigation bar, select Manage Configurations > Projects to view the specific MaxCompute project name.
spark.hadoop.odps.endpoint	The VPC endpoint of the region where MaxCompute is deployed.
spark.hadoop.odps.access.id	The AccessKey pair of an Alibaba Cloud account or a RAM user that has the permissions to access the MaxCompute project.
spark.hadoop.odps.access.key
spark.sql.catalog.odps.tableWriteProvider	The type of interface used to write data. When you write data to MaxCompute, the value must be `tunnel`. This indicates that data is written to MaxCompute by using Tunnel. Note If you only need to read MaxCompute data, you do not need to configure this parameter.
spark.sql.catalog.odps.tableReadProvider	The type of interface used to read data. When you read data from MaxCompute, set this parameter to `tunnel`, which indicates that data is read from MaxCompute by using Tunnel. Note If you only need to write data to MaxCompute, you do not need to configure this parameter.

Click Execute.
After the Spark job runs successfully, go to the Spark JAR Development page, click the Applications tab, and click Logs to view the table data. For more information, see Spark Development Editor.