Read from and write to OSS external tables by using Spark SQL

更新时间:
复制 MD 格式

Use Spark SQL to read and write Object Storage Service (OSS) external tables in Parquet format on an AnalyticDB for MySQL Data Lakehouse Edition cluster.

Note

When you run a Spark SQL statement, the system returns a success or failure message only — it does not return query results. To view data, check the application logs on the Spark JAR Development page after the statement completes.

Prerequisites

Before you begin, ensure that you have:

Workflow overview

This workflow covers the full lifecycle of OSS external tables in AnalyticDB for MySQL:

  1. Open the SQL Development editor and select the Spark engine.

  2. Create an external database backed by an OSS path.

  3. Create external tables (non-partitioned or partitioned) in that database.

  4. Write data using INSERT INTO or INSERT OVERWRITE statements.

  5. Query data and view results in the application logs.

  6. (Optional) Drop the tables to remove metadata when they are no longer needed.

Step 1: Open the SQL Development editor

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters.

  2. On the Data Lakehouse Edition tab, find the cluster and click the cluster ID.

  3. In the left-side navigation pane, choose Job Development > SQL Development.

  4. On the SQLConsole tab, select the Spark engine and the job resource group.

Step 2: Create an external database and OSS external tables

Note

Run the following statements in batch or interactive mode. For more information, see Spark SQL execution modes.

Create an external database

CREATE DATABASE IF NOT EXISTS test_db COMMENT 'demo db'
LOCATION 'oss://<bucket_name>/test'   -- Replace with your OSS bucket path
WITH DBPROPERTIES(k1='v1', k2='v2');
Note

Set dbproperties to 'auto.create.location'='true' to have the OSS path created automatically. If you omit this setting and the OSS path does not exist, create the path manually in your bucket before running the statement.

Create external tables

The database and all external tables must use the same bucket. Each external table's LOCATION path must be at least one directory level deeper than the database path and must be within the database path.

-- Non-partitioned table: use for flat datasets without partition columns
CREATE TABLE IF NOT EXISTS test_db.test_tbl(id INT, name STRING, age INT)
USING parquet
LOCATION 'oss://<bucket_name>/test/test_tbl/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');

-- Partitioned table: use when queries filter on a specific column (here: location)
CREATE TABLE IF NOT EXISTS test_db.test_tbl_partitioned(id INT, name STRING, age INT)
USING parquet PARTITIONED BY (location STRING)
LOCATION 'oss://<bucket_name>/test/test_tbl_partitioned/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
Note

Set tblproperties to 'auto.create.location'='true' to have the OSS path created automatically. If you omit this setting and the OSS path does not exist, create the path manually before running the statement.

Step 3: Write data to the OSS external tables

Important

By default, multiple Spark jobs cannot write to different partitions of the same OSS external table concurrently. To enable concurrent multi-job partition writes, add the following parameter to each job:

The <bucket_name> value must match the bucket where the OSS external table is stored. Multiple Spark jobs can share the same OSS path value.

spark.hadoop.fs.aliyun.oss.upload.basedir=oss://<bucket_name>/test/upload;
Note

Run the following statements in batch or interactive mode. For more information, see Spark SQL execution modes.

-- Append rows to a non-partitioned table
INSERT INTO test_db.test_tbl VALUES(1, 'adb', 10);

-- Overwrite all data in a non-partitioned table
INSERT OVERWRITE test_db.test_tbl VALUES(2, 'spark', 10);

-- Append rows to a specific (static) partition: use when you know the target partition at write time
INSERT INTO TABLE test_db.test_tbl_partitioned PARTITION(location='hangzhou') VALUES(1, 'adb', 10);

-- Overwrite all data in a specific (static) partition
INSERT OVERWRITE TABLE test_db.test_tbl_partitioned PARTITION(location='hangzhou') VALUES(1, 'adb', 10);

-- Overwrite data using a dynamic partition: Spark determines the partition value from the data row
INSERT OVERWRITE TABLE test_db.test_tbl_partitioned PARTITION(location) VALUES(1, 'adb', 10, 'beijing');

Use static partitions when you know the target partition at write time. Use dynamic partitions when the partition value is derived from the data itself.

Step 4: Query data

Note

Run the following statements in batch or interactive mode. For more information, see Spark SQL execution modes.

SELECT * FROM test_db.test_tbl;
SELECT * FROM test_db.test_tbl_partitioned;

To view the query results, go to the Spark JAR Development page and click Logs in the Actions column for your application on the Applications tab. For more information, see the View information about a Spark application section of the Spark editor topic.

Step 5: (Optional) Drop the external tables

Run the following statements to remove the external table metadata:

DROP TABLE IF EXISTS test_db.test_tbl;
DROP TABLE IF EXISTS test_db.test_tbl_partitioned;

Dropping the tables removes only the metadata. The data files stored in OSS are not deleted.