Feature: Schemaless Query-MaxCompute(MaxCompute)-阿里云帮助中心

MaxCompute supports schemaless queries on Parquet foreign tables in OSS. You can export the parsed dataset to OSS or write it to an internal table. You can also embed the dataset as a subquery in SQL operations for flexible data lake processing.

Background information

When you use Spark for ad hoc exploration of structured data, a fixed data warehouse model is not required. If you must define table schemas and manually map file fields before you can read from or write to OSS, managing partitions becomes cumbersome and inflexible.

Using the LOAD instruction, MaxCompute automatically parses the Parquet file format and reads the data as a schema-aware dataset. You can then select specific columns directly from this dataset for data processing similar to table operations. You can use the UNLOAD command to export results directly to OSS, or use the CREATE TABLE AS command to import them into an internal table. You can also use this dataset as a subquery in other SQL statements, which enables flexible MaxCompute operations on data lake data.

Applicability

Schemaless Query does not support treating subdirectories in an OSS bucket as partitions.

Syntax

SELECT *, <col_name>, <table_alias>.<col_name> 
FROM 
 LOCATION '<location_path>'  
 ('key'='value' [, 'key1'='value1', ...]) 
 [AS <table_alias>];

Parameter descriptions

Parameter	Required	Description
*	Yes	Queries all fields in the Parquet file.
col_name	Yes	Queries a field by its known column name in the Parquet file. The DECIMAL type only supports `precision <=38 && scale<=18`.
table_alias.col_name	Yes	Queries a field by its known column name using the full path of table alias and column name.
table_alias	No	Custom table alias.
location_path	Yes	The location of the Parquet file. It must be an OSS directory with the format `oss://oss_endpoint/bucket_name/path/`. The next level under path supports partition directories in the format `partition_name=partition_value`. Schemaless Query does not support treating subdirectories in the location as partitions, so partition pruning is not supported.
key&value	Yes	Parameters and their values for the query statement. See the following table for details.

Key-value parameter table

Key	Required	Description	Value	Default
file_format	Yes	Specifies the file format at the location. Only Parquet is supported. Other formats return an error.	parquet	parquet
rolearn	No	Specifies the RoleARN required to access the location. Log in to the RAM console. In the left navigation bar, select Identities > Roles. In the Basic Information section, you can obtain the ARN. Note If you do not specify a RoleARN in the SQL statement, the system uses the ARN of the `AliyunODPSDefaultRole` role by default.	`acs:ram::xxxxxx:role/aliyunodpsdefaultrole`	acs:ram::1234****:role/aliyunodpsdefaultrole
file_pattern_blacklist	No	Specifies a blacklist of files to exclude. Files whose names match the pattern are not read.	`"._SUCCESS$,.\\.hive_staging.*"`	None
file_pattern_whitelist	No	Specifies a whitelist of files to include. Only files whose names match the pattern are read.	`"._20250124_..parquet"`	`.*`

Examples

Example 1: Read OSS data using blacklist and whitelist parameters

Prepare data.
1. Log in to the OSS console.
2. In the left-side navigation pane, click Buckets.
3. On the Buckets page, click Create Bucket.
4. Create the OSS bucket directory object-table-test/schema/.
5. Prepare a Parquet file for reading and whitelist validation. Run the following Python code locally to create the file.
```
import pandas as pd
# Sample data
data = [
    {'id': 3, 'name': 'Charlie', 'age': 35},
    {'id': 4, 'name': 'David', 'age': 40},
    {'id': 5, 'name': 'Eve', 'age': 28}
]
df = pd.DataFrame(data)
df['id'] = df['id'].astype('int32')
df['name'] = df['name'].astype('str')
df['age'] = df['age'].astype('int32')
output_filename = 'sample_data.parquet'
df.to_parquet(output_filename, index=False, engine='pyarrow')
```
6. Upload the Parquet file to the object-table-test/schema/ OSS bucket directory.
  1. Log in to the OSS console.
  2. In the object-table-test/schema/ bucket directory, click Upload File.
7. Prepare a CSV file in the object-table-test/schema/ OSS bucket directory to validate the blacklist parameter.

Read the Parquet file.

Log on to the MaxCompute client (odpscmd) and execute the following SQL commands.

Add test_oss.csv to the blacklist and read the Parquet file from OSS.

SELECT id, name, age 
FROM location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
(
'file_format'='parquet',
'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
'file_pattern_blacklist'='.*test_oss.*'
);

Add sample_data to the whitelist and read the Parquet file from OSS.

SELECT id, name, age 
FROM location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
(
'file_format'='parquet',
'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
'file_pattern_whitelist'='.*sample_data.*'
);

In both cases, the sample_data file is read. The following result is returned:

+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 28         |
+------------+------------+------------+

Example 2: Read data written by Spark

Create the OSS bucket directory object-table-test/spark/.

Generate Parquet data using Serverless Spark. For more information, see Create an SQL job. You can skip this step if Parquet files that are written by Spark already exist in the OSS directory.

Log on to the E-MapReduce console and select a region in the upper-left corner.
In the left-side navigation pane, choose EMR Serverless > Spark.
On the Spark page, click the name of your target workspace to open it. Alternatively, click Create Workspace. After the new workspace is created, click its name to open it.

In the left-side navigation pane, choose Data Development, and then create a Spark SQL file to run the following SQL statements:

CREATE TABLE example_table_parquet04 (
    id STRING,
    name STRING,
    age STRING,
    salary DOUBLE,
    is_active BOOLEAN,
    created_at TIMESTAMP,
    details STRUCT<department:STRING, position:STRING>
)
USING PARQUET;
INSERT INTO example_table_parquet04 VALUES
('1', 'Alice', '30', 5000.50, TRUE, TIMESTAMP '2024-01-01 10:00:00', STRUCT('HR', 'Manager')),
('2', 'Bob', '25', 6000.75, FALSE, TIMESTAMP '2024-02-01 11:00:00', STRUCT('Engineering', 'Developer')),
('3', 'Charlie','35', 7000.00, TRUE, TIMESTAMP '2024-03-01 12:00:00', STRUCT('Marketing', 'Analyst')),
('4', 'David', '40', 8000.25, FALSE, TIMESTAMP '2024-04-01 13:00:00', STRUCT('Sales', 'Representative')),
('5', 'Eve', '28', 5500.50, TRUE, TIMESTAMP '2024-05-01 14:00:00', STRUCT('Support', 'Technician'));
SELECT * FROM example_table_parquet04;

Log on to the OSS console and view the generated data files at the destination path.

Log on to the MaxCompute client, add _SUCCESS to the blacklist, and read the Parquet file from OSS.

SELECT  *
FROM location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/spark/example_table_parquet04/'
(
'file_format'='parquet',
'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
'file_pattern_blacklist'='.*_SUCCESS.*'
);

The following result is returned:

+----+---------+-----+------------+-----------+---------------------+----------------------------------------------+
| id | name    | age | salary     | is_active | created_at          | details                                      |
+----+---------+-----+------------+-----------+---------------------+----------------------------------------------+
| 1  | Alice   | 30  | 5000.5     | true      | 2024-01-01 10:00:00 | {department:HR, position:Manager}            |
| 2  | Bob     | 25  | 6000.75    | false     | 2024-02-01 11:00:00 | {department:Engineering, position:Developer} |
| 3  | Charlie | 35  | 7000.0     | true      | 2024-03-01 12:00:00 | {department:Marketing, position:Analyst}     |
| 4  | David   | 40  | 8000.25    | false     | 2024-04-01 13:00:00 | {department:Sales, position:Representative}  |
| 5  | Eve     | 28  | 5500.5     | true      | 2024-05-01 14:00:00 | {department:Support, position:Technician}    |
+----+---------+-----+------------+-----------+---------------------+----------------------------------------------+

For more information about Schemaless Query operations, see Read Parquet data from a data lake using Schemaless Query.

Example 3: Use Schemaless Query in a subquery

Prepare data.

Log on to the OSS console and upload the test file part-00001.snappy.parquet to the OSS bucket directory object-table-test/schema/. For more information, see Upload OSS files.
Log on to the MaxCompute client and create an internal table to store the data that is automatically discovered from OSS.
```
CREATE TABLE ow_test (
    id INT,
    name STRING,
    age INT
);
```

Use Schemaless Query to read data from OSS.

SELECT * FROM 
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
(
  'file_format'='parquet'
);

The following result is returned:

+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 28         |
+------------+------------+------------+

Use the OSS data as a subquery in an outer SQL statement to query the ow_test table.

INSERT OVERWRITE TABLE ow_test
SELECT id,name,age FROM 
(
    SELECT * FROM 
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
    (
      'file_format'='parquet'
    )
);
SELECT * FROM ow_test;

The following result is returned:

+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 28         |
+------------+------------+------------+

Example 4: Save Schemaless Query results to an internal data warehouse table

Prepare data.

Log on to the OSS console and upload the test file part-00001.snappy.parquet to the OSS bucket directory object-table-test/schema/. For more information, see Upload OSS files.

Log on to the MaxCompute client and use Schemaless Query to read data from OSS.

SELECT * FROM 
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
(
  'file_format'='parquet'
);

The following result is returned:

+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 28         |
+------------+------------+------------+

Use the CREATE TABLE AS statement to copy the data that is automatically discovered from OSS into an internal table and then query the table to view the result.

CREATE TABLE ow_test_2 AS 
  SELECT * FROM 
  LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
  (
    'file_format'='parquet'
   );
-- Query the result table ow_test_2
SELECT * FROM ow_test_2;

The following result is returned:

+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 28         |
+------------+------------+------------+

Example 5: UNLOAD Schemaless Query results back to the data lake

Prepare data.

Log on to the OSS console and upload the test file part-00001.snappy.parquet to the OSS bucket directory object-table-test/schema/. For more information, see Upload OSS files.

Log on to the MaxCompute client and use Schemaless Query to read data from OSS.

SELECT * FROM 
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
(
  'file_format'='parquet'
);

The following result is returned:

+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 28         |
+------------+------------+------------+

Use the UNLOAD command to export the automatically discovered results to OSS. For more information about UNLOAD operations, see UNLOAD.

UNLOAD FROM (
  SELECT * FROM 
  LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
  ('file_format'='parquet')
) 
INTO 
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/unload/ow_test_3/'
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('odps.external.data.enable.extension'='true')
STORED AS PARQUET;

View the generated file in the OSS directory: After the command is successfully executed, the Parquet file 20250715070650775g0etr15glr2_M1_1_0_0-0_TableSink1.parquet is generated in the OSS path unload/ow_test_3/. The file size is 0.449 KB and the storage class is Standard.