MaxCompute open storage allows Spark to use a connector to call the Storage API and read data directly from MaxCompute. This approach simplifies the data reading process and improves access performance. Integrating Spark with MaxCompute data storage provides efficient, flexible, and powerful data processing and analysis capabilities.
Scope
When a third-party engine accesses MaxCompute:
You can read data from standard tables, partitioned tables, clustered tables, Delta Tables, and materialized views.
You cannot read data from MaxCompute foreign tables or logical views.
The connector does not support reading the JSON data type.
For open storage (pay-as-you-go), the default request concurrency limit is 1,000 per tenant. The transmission rate for each concurrent request is 10 MB/s.
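As a rough illustration of what these default limits imply, the arithmetic below estimates an upper bound on aggregate read throughput and a lower bound on transfer time. It is only a sketch: it assumes the 10 MB/s rate applies to every concurrent request and scales linearly, and the job concurrency (200) and table size (1 TB) are hypothetical values chosen for the example. Actual throughput depends on factors not covered here.

```shell
# Default pay-as-you-go limits taken from the text above.
max_concurrency=1000      # concurrent requests per tenant
rate_mb_per_s=10          # transmission rate per concurrent request, in MB/s

# Theoretical ceiling if every allowed request runs at the stated rate.
aggregate_mb_per_s=$((max_concurrency * rate_mb_per_s))
echo "Theoretical aggregate ceiling: ${aggregate_mb_per_s} MB/s"   # prints 10000 MB/s

# Hypothetical job: 200 concurrent requests reading a 1 TB (1,048,576 MB) table.
job_concurrency=200
table_size_mb=1048576
job_rate=$((job_concurrency * rate_mb_per_s))

# Ceiling division, so a partial second counts as a full second.
min_seconds=$(( (table_size_mb + job_rate - 1) / job_rate ))
echo "Minimum transfer time at ${job_rate} MB/s: ${min_seconds} s"  # prints 525 s
```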
Procedure
Purchase an exclusive resource group for Data Transmission Service (subscription) or activate open storage (pay-as-you-go) resources.
Deploy a Spark development environment.
Download a Spark package for a version from Spark 3.2.x to Spark 3.5.x, and then decompress the package to a local folder.
To build the Spark development environment on a Linux operating system, see Build a Linux development environment.
To build the Spark development environment on a Windows operating system, see Build a Windows development environment.
Download and compile the Spark connector. Currently, only Spark versions from 3.2.x to 3.5.x are supported. This topic uses Spark 3.3.1 as an example.
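This step relies on the git and mvn commands and on the SPARK_HOME environment variable. A minimal pre-flight check, assuming a POSIX shell, might look like the following. The missing_tools helper is a hypothetical name introduced for this example and is not part of the connector.

```shell
# Print the names of the given tools that are not found on PATH,
# separated by spaces. Empty output means every tool was found.
missing_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="${missing:+$missing }$tool"
    fi
  done
  printf '%s' "$missing"
}

# git is used to clone the repository and mvn to compile the connector.
result=$(missing_tools git mvn)
if [ -n "$result" ]; then
  echo "Missing tools: $result"
else
  echo "git and mvn found"
fi

# The compiled JAR package is copied to $SPARK_HOME/jars/, so SPARK_HOME must be set.
if [ -z "${SPARK_HOME:-}" ]; then
  echo "SPARK_HOME is not set"
fi
```

The check only reports problems and does not stop the shell, so it can be pasted into an interactive session before the build.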
Use the git clone command to download the Spark connector package, and then compile it with Maven. Git and Maven must be installed in your environment. Otherwise, the commands fail.
    ## Download the Spark connector.
    git clone https://github.com/aliyun/aliyun-maxcompute-data-collectors.git
    ## Switch to the spark-connector folder.
    cd aliyun-maxcompute-data-collectors/spark-connector
    ## Compile the connector.
    mvn clean package
    ## Location of the datasource JAR package:
    ## datasource/target/spark-odps-datasource-3.3.1-odps0.43.0.jar
    ## Copy the datasource JAR package to the $SPARK_HOME/jars/ folder.
    cp datasource/target/spark-odps-datasource-3.3.1-odps0.43.0.jar "$SPARK_HOME/jars/"
Configure MaxCompute account access information.
In the conf folder of your Spark installation, create a spark-defaults.conf file:
    cd $SPARK_HOME/conf
    vim spark-defaults.conf
Add the following account information to the spark-defaults.conf file:
    ## Configure the account in spark-defaults.conf.
    spark.hadoop.odps.project.name=doc_test
    spark.hadoop.odps.access.id=L********************
    spark.hadoop.odps.access.key=*******************
    spark.hadoop.odps.end.point=http://service.cn-beijing.maxcompute.aliyun.com/api
    spark.hadoop.odps.tunnel.quota.name=ot_xxxx_p#ot_xxxx
    ## Configure the MaxCompute catalog.
    spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog
    spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions
Access MaxCompute through the Spark connector.
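Before you start the client, it can be useful to confirm that spark-defaults.conf defines every key from the configuration step. The following is a minimal sketch, assuming a POSIX shell with awk. The missing_conf_keys helper is a hypothetical name introduced for this example, and the check only looks for the presence of each key, not for valid values.

```shell
# Print each required key that has no "key=value" or "key value" entry
# in the given file. Commented-out lines do not count as entries.
missing_conf_keys() {
  conf_file=$1
  for key in \
    spark.hadoop.odps.project.name \
    spark.hadoop.odps.access.id \
    spark.hadoop.odps.access.key \
    spark.hadoop.odps.end.point \
    spark.hadoop.odps.tunnel.quota.name \
    spark.sql.catalog.odps \
    spark.sql.extensions
  do
    awk -v k="$key" '
      {
        line = $0
        sub(/^[ \t]+/, "", line)        # ignore leading whitespace
        if (index(line, k) == 1) {      # literal prefix match, dots are not wildcards
          rest = substr(line, length(k) + 1)
          if (rest ~ /^[ \t]*=/ || rest ~ /^[ \t]+[^ \t]/) found = 1
        }
      }
      END { exit (found ? 0 : 1) }
    ' "$conf_file" 2>/dev/null || echo "$key"
  done
}

# Check the configuration file of the local Spark installation, if it exists.
conf="${SPARK_HOME:-.}/conf/spark-defaults.conf"
if [ -f "$conf" ]; then
  missing_keys=$(missing_conf_keys "$conf")
  if [ -z "$missing_keys" ]; then
    echo "All required keys are present in $conf"
  else
    printf 'Missing keys in %s:\n%s\n' "$conf" "$missing_keys"
  fi
else
  echo "No file at $conf"
fi
```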
Run the following commands in the bin folder of your Spark installation to start the Spark SQL client:
    cd $SPARK_HOME/bin
    ./spark-sql
Query the tables in the MaxCompute project:
    SHOW TABLES IN odps.doc_test;
doc_test is a sample MaxCompute project name. Replace it with the name of your MaxCompute project.
Create a table:
    CREATE TABLE odps.doc_test.mc_test_table (name STRING, num BIGINT);
Read data from the table:
    SELECT * FROM odps.doc_test.mc_test_table;
Create a partitioned table:
    CREATE TABLE odps.doc_test.mc_test_table_pt (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING);
Read data from the partitioned table:
    SELECT * FROM odps.doc_test.mc_test_table_pt;
For a table that already contains two rows in the partition pt1=2018, pt2=0601, output similar to the following is returned:
    test1   1   2018    0601
    test2   2   2018    0601
    Time taken: 1.312 seconds, Fetched 2 row(s)
Delete the table:
    DROP TABLE IF EXISTS odps.doc_test.mc_test_table;
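The same statements can also be run without the interactive client. The sketch below writes them to a script file and passes the file to spark-sql with the -f option of the Spark SQL CLI. It assumes that spark-sql is on your PATH (for example, after you add $SPARK_HOME/bin to PATH) and reuses doc_test as a sample project name. The project and sql_file variables are names chosen for this example.

```shell
# Sample project name. Replace it with the name of your MaxCompute project.
project="doc_test"
sql_file="$(mktemp)"

# Write the statements from the walkthrough into a script file.
cat > "$sql_file" <<EOF
SHOW TABLES IN odps.${project};
CREATE TABLE odps.${project}.mc_test_table (name STRING, num BIGINT);
SELECT * FROM odps.${project}.mc_test_table;
DROP TABLE IF EXISTS odps.${project}.mc_test_table;
EOF

# Run the script only if the Spark SQL CLI is available on PATH.
if command -v spark-sql >/dev/null 2>&1; then
  spark-sql -f "$sql_file"
else
  echo "spark-sql not found; statements written to $sql_file"
fi
```

Keeping the statements in a file makes the check repeatable. For a single statement, the -e option of spark-sql accepts the query string directly.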