MaxCompute supports schemaless queries on Parquet foreign tables in OSS. You can export the parsed dataset to OSS or write it to an internal table. You can also embed the dataset as a subquery in SQL operations for flexible data lake processing.
Background information
When you use Spark for ad hoc exploration of structured data, a fixed data warehouse model is not required. If you must define table schemas and manually map file fields before you can read from or write to OSS, managing partitions becomes cumbersome and inflexible.
Using the LOAD instruction, MaxCompute automatically parses the Parquet file format and reads the data as a schema-aware dataset. You can then select specific columns directly from this dataset for data processing similar to table operations. You can use the UNLOAD command to export results directly to OSS, or use the CREATE TABLE AS command to import them into an internal table. You can also use this dataset as a subquery in other SQL statements, which enables flexible MaxCompute operations on data lake data.
Applicability
Schemaless Query does not support treating subdirectories in an OSS bucket as partitions.
Syntax
SELECT *, <col_name>, <table_alias>.<col_name>
FROM
LOCATION '<location_path>'
('key'='value' [, 'key1'='value1', ...])
[AS <table_alias>];
Parameter descriptions
|
Parameter |
Required |
Description |
|
* |
Yes |
Queries all fields in the Parquet file. |
|
col_name |
Yes |
Queries a field by its known column name in the Parquet file. The DECIMAL type only supports |
|
table_alias.col_name |
Yes |
Queries a field by its known column name using the full path of table alias and column name. |
|
table_alias |
No |
Custom table alias. |
|
location_path |
Yes |
|
|
key&value |
Yes |
Parameters and their values for the query statement. See the following table for details. |
Key-value parameter table
|
Key |
Required |
Description |
Value |
Default |
|
file_format |
Yes |
Specifies the file format at the location. Only Parquet is supported. Other formats return an error. |
parquet |
parquet |
|
rolearn |
No |
Specifies the RoleARN required to access the location.
Note
If you do not specify a RoleARN in the SQL statement, the system uses the ARN of the |
|
acs:ram::1234****:role/aliyunodpsdefaultrole |
|
file_pattern_blacklist |
No |
Specifies a blacklist of files to exclude. Files whose names match the pattern are not read. |
|
None |
|
file_pattern_whitelist |
No |
Specifies a whitelist of files to include. Only files whose names match the pattern are read. |
|
|
Examples
Example 1: Read OSS data using blacklist and whitelist parameters
-
Prepare data.
-
Log in to the OSS console.
-
In the left-side navigation pane, click Buckets.
-
On the Buckets page, click Create Bucket.
-
Create the OSS bucket directory
object-table-test/schema/. -
Prepare a Parquet file for reading and whitelist validation. Run the following Python code locally to create the file.
import pandas as pd # Sample data data = [ {'id': 3, 'name': 'Charlie', 'age': 35}, {'id': 4, 'name': 'David', 'age': 40}, {'id': 5, 'name': 'Eve', 'age': 28} ] df = pd.DataFrame(data) df['id'] = df['id'].astype('int32') df['name'] = df['name'].astype('str') df['age'] = df['age'].astype('int32') output_filename = 'sample_data.parquet' df.to_parquet(output_filename, index=False, engine='pyarrow') -
Upload the Parquet file to the
object-table-test/schema/OSS bucket directory.-
Log in to the OSS console.
-
In the
object-table-test/schema/bucket directory, click Upload File.
-
-
Prepare a CSV file in the
object-table-test/schema/OSS bucket directory to validate the blacklist parameter.
-
-
Read the Parquet file.
Log on to the MaxCompute client (odpscmd) and execute the following SQL commands.
-
Add
test_oss.csvto the blacklist and read the Parquet file from OSS.SELECT id, name, age FROM location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' ( 'file_format'='parquet', 'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole', 'file_pattern_blacklist'='.*test_oss.*' ); -
Add
sample_datato the whitelist and read the Parquet file from OSS.SELECT id, name, age FROM location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' ( 'file_format'='parquet', 'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole', 'file_pattern_whitelist'='.*sample_data.*' );
In both cases, the
sample_datafile is read. The following result is returned:+------------+------------+------------+ | id | name | age | +------------+------------+------------+ | 3 | Charlie | 35 | | 4 | David | 40 | | 5 | Eve | 28 | +------------+------------+------------+ -
Example 2: Read data written by Spark
-
Create the OSS bucket directory
object-table-test/spark/. -
Generate Parquet data using Serverless Spark. For more information, see Create an SQL job. You can skip this step if Parquet files that are written by Spark already exist in the OSS directory.
-
Log on to the E-MapReduce console and select a region in the upper-left corner.
-
In the left-side navigation pane, choose .
-
On the Spark page, click the name of your target workspace to open it. Alternatively, click Create Workspace. After the new workspace is created, click its name to open it.
-
In the left-side navigation pane, choose Data Development, and then create a Spark SQL file to run the following SQL statements:
CREATE TABLE example_table_parquet04 ( id STRING, name STRING, age STRING, salary DOUBLE, is_active BOOLEAN, created_at TIMESTAMP, details STRUCT<department:STRING, position:STRING> ) USING PARQUET; INSERT INTO example_table_parquet04 VALUES ('1', 'Alice', '30', 5000.50, TRUE, TIMESTAMP '2024-01-01 10:00:00', STRUCT('HR', 'Manager')), ('2', 'Bob', '25', 6000.75, FALSE, TIMESTAMP '2024-02-01 11:00:00', STRUCT('Engineering', 'Developer')), ('3', 'Charlie','35', 7000.00, TRUE, TIMESTAMP '2024-03-01 12:00:00', STRUCT('Marketing', 'Analyst')), ('4', 'David', '40', 8000.25, FALSE, TIMESTAMP '2024-04-01 13:00:00', STRUCT('Sales', 'Representative')), ('5', 'Eve', '28', 5500.50, TRUE, TIMESTAMP '2024-05-01 14:00:00', STRUCT('Support', 'Technician')); SELECT * FROM example_table_parquet04;
-
-
Log on to the OSS console and view the generated data files at the destination path.
-
Log on to the MaxCompute client, add
_SUCCESSto the blacklist, and read the Parquet file from OSS.SELECT * FROM location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/spark/example_table_parquet04/' ( 'file_format'='parquet', 'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole', 'file_pattern_blacklist'='.*_SUCCESS.*' );The following result is returned:
+----+---------+-----+------------+-----------+---------------------+----------------------------------------------+ | id | name | age | salary | is_active | created_at | details | +----+---------+-----+------------+-----------+---------------------+----------------------------------------------+ | 1 | Alice | 30 | 5000.5 | true | 2024-01-01 10:00:00 | {department:HR, position:Manager} | | 2 | Bob | 25 | 6000.75 | false | 2024-02-01 11:00:00 | {department:Engineering, position:Developer} | | 3 | Charlie | 35 | 7000.0 | true | 2024-03-01 12:00:00 | {department:Marketing, position:Analyst} | | 4 | David | 40 | 8000.25 | false | 2024-04-01 13:00:00 | {department:Sales, position:Representative} | | 5 | Eve | 28 | 5500.5 | true | 2024-05-01 14:00:00 | {department:Support, position:Technician} | +----+---------+-----+------------+-----------+---------------------+----------------------------------------------+
For more information about Schemaless Query operations, see Read Parquet data from a data lake using Schemaless Query.
Example 3: Use Schemaless Query in a subquery
-
Prepare data.
Log on to the OSS console and upload the test file part-00001.snappy.parquet to the OSS bucket directory
object-table-test/schema/. For more information, see Upload OSS files. -
Log on to the MaxCompute client and create an internal table to store the data that is automatically discovered from OSS.
CREATE TABLE ow_test ( id INT, name STRING, age INT ); -
Use Schemaless Query to read data from OSS.
SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' ( 'file_format'='parquet' );The following result is returned:
+------------+------------+------------+ | id | name | age | +------------+------------+------------+ | 3 | Charlie | 35 | | 4 | David | 40 | | 5 | Eve | 28 | +------------+------------+------------+ -
Use the OSS data as a subquery in an outer SQL statement to query the ow_test table.
INSERT OVERWRITE TABLE ow_test SELECT id,name,age FROM ( SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' ( 'file_format'='parquet' ) ); SELECT * FROM ow_test;The following result is returned:
+------------+------------+------------+ | id | name | age | +------------+------------+------------+ | 3 | Charlie | 35 | | 4 | David | 40 | | 5 | Eve | 28 | +------------+------------+------------+
Example 4: Save Schemaless Query results to an internal data warehouse table
-
Prepare data.
Log on to the OSS console and upload the test file part-00001.snappy.parquet to the OSS bucket directory
object-table-test/schema/. For more information, see Upload OSS files. -
Log on to the MaxCompute client and use Schemaless Query to read data from OSS.
SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' ( 'file_format'='parquet' );The following result is returned:
+------------+------------+------------+ | id | name | age | +------------+------------+------------+ | 3 | Charlie | 35 | | 4 | David | 40 | | 5 | Eve | 28 | +------------+------------+------------+ -
Use the
CREATE TABLE ASstatement to copy the data that is automatically discovered from OSS into an internal table and then query the table to view the result.CREATE TABLE ow_test_2 AS SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' ( 'file_format'='parquet' ); -- Query the result table ow_test_2 SELECT * FROM ow_test_2;The following result is returned:
+------------+------------+------------+ | id | name | age | +------------+------------+------------+ | 3 | Charlie | 35 | | 4 | David | 40 | | 5 | Eve | 28 | +------------+------------+------------+
Example 5: UNLOAD Schemaless Query results back to the data lake
-
Prepare data.
Log on to the OSS console and upload the test file part-00001.snappy.parquet to the OSS bucket directory
object-table-test/schema/. For more information, see Upload OSS files. -
Log on to the MaxCompute client and use Schemaless Query to read data from OSS.
SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' ( 'file_format'='parquet' );The following result is returned:
+------------+------------+------------+ | id | name | age | +------------+------------+------------+ | 3 | Charlie | 35 | | 4 | David | 40 | | 5 | Eve | 28 | +------------+------------+------------+ -
Use the UNLOAD command to export the automatically discovered results to OSS. For more information about UNLOAD operations, see UNLOAD.
UNLOAD FROM ( SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' ('file_format'='parquet') ) INTO LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/unload/ow_test_3/' ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' WITH SERDEPROPERTIES ('odps.external.data.enable.extension'='true') STORED AS PARQUET;View the generated file in the OSS directory: After the command is successfully executed, the Parquet file
20250715070650775g0etr15glr2_M1_1_0_0-0_TableSink1.parquetis generated in the OSS pathunload/ow_test_3/. The file size is 0.449 KB and the storage class is Standard.