Read from and write to OSS external tables by using Spark SQL
Use Spark SQL to read and write Object Storage Service (OSS) external tables in Parquet format on an AnalyticDB for MySQL Data Lakehouse Edition cluster.
When you run a Spark SQL statement, the system returns a success or failure message only — it does not return query results. To view data, check the application logs on the Spark JAR Development page after the statement completes.
Prerequisites
Before you begin, ensure that you have:
-
An AnalyticDB for MySQL Data Lakehouse Edition cluster
-
A job resource group created for the cluster. For more information, see Create a resource group
-
A database account created for the cluster:
-
Alibaba Cloud account: create a privileged account. For more information, see the Create a privileged account section of the Create a database account topic
-
Resource Access Management (RAM) user: create both a privileged account and a standard account, and associate the standard account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user
-
Workflow overview
This workflow covers the full lifecycle of OSS external tables in AnalyticDB for MySQL:
-
Open the SQL Development editor and select the Spark engine.
-
Create an external database backed by an OSS path.
-
Create external tables (non-partitioned or partitioned) in that database.
-
Write data using
INSERT INTOorINSERT OVERWRITEstatements. -
Query data and view results in the application logs.
-
(Optional) Drop the tables to remove metadata when they are no longer needed.
Step 1: Open the SQL Development editor
-
Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters.
-
On the Data Lakehouse Edition tab, find the cluster and click the cluster ID.
-
In the left-side navigation pane, choose Job Development > SQL Development.
-
On the SQLConsole tab, select the Spark engine and the job resource group.
Step 2: Create an external database and OSS external tables
Run the following statements in batch or interactive mode. For more information, see Spark SQL execution modes.
Create an external database
CREATE DATABASE IF NOT EXISTS test_db COMMENT 'demo db'
LOCATION 'oss://<bucket_name>/test' -- Replace with your OSS bucket path
WITH DBPROPERTIES(k1='v1', k2='v2');
Set dbproperties to 'auto.create.location'='true' to have the OSS path created automatically. If you omit this setting and the OSS path does not exist, create the path manually in your bucket before running the statement.
Create external tables
The database and all external tables must use the same bucket. Each external table's LOCATION path must be at least one directory level deeper than the database path and must be within the database path.
-- Non-partitioned table: use for flat datasets without partition columns
CREATE TABLE IF NOT EXISTS test_db.test_tbl(id INT, name STRING, age INT)
USING parquet
LOCATION 'oss://<bucket_name>/test/test_tbl/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
-- Partitioned table: use when queries filter on a specific column (here: location)
CREATE TABLE IF NOT EXISTS test_db.test_tbl_partitioned(id INT, name STRING, age INT)
USING parquet PARTITIONED BY (location STRING)
LOCATION 'oss://<bucket_name>/test/test_tbl_partitioned/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
Set tblproperties to 'auto.create.location'='true' to have the OSS path created automatically. If you omit this setting and the OSS path does not exist, create the path manually before running the statement.
Step 3: Write data to the OSS external tables
By default, multiple Spark jobs cannot write to different partitions of the same OSS external table concurrently. To enable concurrent multi-job partition writes, add the following parameter to each job:
The <bucket_name> value must match the bucket where the OSS external table is stored. Multiple Spark jobs can share the same OSS path value.
spark.hadoop.fs.aliyun.oss.upload.basedir=oss://<bucket_name>/test/upload;
Run the following statements in batch or interactive mode. For more information, see Spark SQL execution modes.
-- Append rows to a non-partitioned table
INSERT INTO test_db.test_tbl VALUES(1, 'adb', 10);
-- Overwrite all data in a non-partitioned table
INSERT OVERWRITE test_db.test_tbl VALUES(2, 'spark', 10);
-- Append rows to a specific (static) partition: use when you know the target partition at write time
INSERT INTO TABLE test_db.test_tbl_partitioned PARTITION(location='hangzhou') VALUES(1, 'adb', 10);
-- Overwrite all data in a specific (static) partition
INSERT OVERWRITE TABLE test_db.test_tbl_partitioned PARTITION(location='hangzhou') VALUES(1, 'adb', 10);
-- Overwrite data using a dynamic partition: Spark determines the partition value from the data row
INSERT OVERWRITE TABLE test_db.test_tbl_partitioned PARTITION(location) VALUES(1, 'adb', 10, 'beijing');
Use static partitions when you know the target partition at write time. Use dynamic partitions when the partition value is derived from the data itself.
Step 4: Query data
Run the following statements in batch or interactive mode. For more information, see Spark SQL execution modes.
SELECT * FROM test_db.test_tbl;
SELECT * FROM test_db.test_tbl_partitioned;
To view the query results, go to the Spark JAR Development page and click Logs in the Actions column for your application on the Applications tab. For more information, see the View information about a Spark application section of the Spark editor topic.
Step 5: (Optional) Drop the external tables
Run the following statements to remove the external table metadata:
DROP TABLE IF EXISTS test_db.test_tbl;
DROP TABLE IF EXISTS test_db.test_tbl_partitioned;
Dropping the tables removes only the metadata. The data files stored in OSS are not deleted.