Feature: Schemaless Query

更新时间:
复制 MD 格式

MaxCompute supports schemaless queries on Parquet foreign tables in OSS. You can export the parsed dataset to OSS or write it to an internal table. You can also embed the dataset as a subquery in SQL operations for flexible data lake processing.

Background information

When you use Spark for ad hoc exploration of structured data, a fixed data warehouse model is not required. If you must define table schemas and manually map file fields before you can read from or write to OSS, managing partitions becomes cumbersome and inflexible.

Using the LOAD instruction, MaxCompute automatically parses the Parquet file format and reads the data as a schema-aware dataset. You can then select specific columns directly from this dataset for data processing similar to table operations. You can use the UNLOAD command to export results directly to OSS, or use the CREATE TABLE AS command to import them into an internal table. You can also use this dataset as a subquery in other SQL statements, which enables flexible MaxCompute operations on data lake data.

Applicability

Schemaless Query does not support treating subdirectories in an OSS bucket as partitions.

Syntax

SELECT *, <col_name>, <table_alias>.<col_name> 
FROM 
 LOCATION '<location_path>'  
 ('key'='value' [, 'key1'='value1', ...]) 
 [AS <table_alias>];

Parameter descriptions

Parameter

Required

Description

*

Yes

Queries all fields in the Parquet file.

col_name

Yes

Queries a field by its known column name in the Parquet file.

The DECIMAL type only supports precision <=38 && scale<=18.

table_alias.col_name

Yes

Queries a field by its known column name using the full path of table alias and column name.

table_alias

No

Custom table alias.

location_path

Yes

  • The location of the Parquet file. It must be an OSS directory with the format oss://oss_endpoint/bucket_name/path/.

  • The next level under path supports partition directories in the format partition_name=partition_value.

  • Schemaless Query does not support treating subdirectories in the location as partitions, so partition pruning is not supported.

key&value

Yes

Parameters and their values for the query statement. See the following table for details.

Key-value parameter table

Key

Required

Description

Value

Default

file_format

Yes

Specifies the file format at the location. Only Parquet is supported. Other formats return an error.

parquet

parquet

rolearn

No

Specifies the RoleARN required to access the location.

  • 登录RAM控制台

  • 在左侧导航栏选择Identities > Roles

  • Basic Information区域,可以获取ARN信息。

Note

If you do not specify a RoleARN in the SQL statement, the system uses the ARN of the AliyunODPSDefaultRole role by default.

acs:ram::xxxxxx:role/aliyunodpsdefaultrole

acs:ram::1234****:role/aliyunodpsdefaultrole

file_pattern_blacklist

No

Specifies a blacklist of files to exclude. Files whose names match the pattern are not read.

".*_SUCCESS$,.*\\.hive_staging.*"

None

file_pattern_whitelist

No

Specifies a whitelist of files to include. Only files whose names match the pattern are read.

".*_20250124_.*.parquet"

.*

Examples

Example 1: Read OSS data using blacklist and whitelist parameters

  1. Prepare data.

    1. Log in to the OSS console.

    2. In the left-side navigation pane, click Buckets.

    3. On the Buckets page, click Create Bucket.

    4. Create the OSS bucket directory object-table-test/schema/.

    5. Prepare a Parquet file for reading and whitelist validation. Run the following Python code locally to create the file.

      import pandas as pd
      # Sample data
      data = [
          {'id': 3, 'name': 'Charlie', 'age': 35},
          {'id': 4, 'name': 'David', 'age': 40},
          {'id': 5, 'name': 'Eve', 'age': 28}
      ]
      df = pd.DataFrame(data)
      df['id'] = df['id'].astype('int32')
      df['name'] = df['name'].astype('str')
      df['age'] = df['age'].astype('int32')
      output_filename = 'sample_data.parquet'
      df.to_parquet(output_filename, index=False, engine='pyarrow')
      
    6. Upload the Parquet file to the object-table-test/schema/ OSS bucket directory.

      1. Log in to the OSS console.

      2. In the object-table-test/schema/ bucket directory, click Upload File.

    7. Prepare a CSV file in the object-table-test/schema/ OSS bucket directory to validate the blacklist parameter.

  2. Read the Parquet file.

    Log on to the MaxCompute client (odpscmd) and execute the following SQL commands.

    • Add test_oss.csv to the blacklist and read the Parquet file from OSS.

      SELECT id, name, age 
      FROM location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
      (
      'file_format'='parquet',
      'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
      'file_pattern_blacklist'='.*test_oss.*'
      );
    • Add sample_data to the whitelist and read the Parquet file from OSS.

      SELECT id, name, age 
      FROM location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
      (
      'file_format'='parquet',
      'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
      'file_pattern_whitelist'='.*sample_data.*'
      );

    In both cases, the sample_data file is read. The following result is returned:

    +------------+------------+------------+
    | id         | name       | age        |
    +------------+------------+------------+
    | 3          | Charlie    | 35         |
    | 4          | David      | 40         |
    | 5          | Eve        | 28         |
    +------------+------------+------------+

Example 2: Read data written by Spark

  1. Create the OSS bucket directory object-table-test/spark/.

  2. Generate Parquet data using Serverless Spark. For more information, see Create an SQL job. You can skip this step if Parquet files that are written by Spark already exist in the OSS directory.

    1. Log on to the E-MapReduce console and select a region in the upper-left corner.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, click the name of your target workspace to open it. Alternatively, click Create Workspace. After the new workspace is created, click its name to open it.

    4. In the left-side navigation pane, choose Data Development, and then create a Spark SQL file to run the following SQL statements:

      CREATE TABLE example_table_parquet04 (
          id STRING,
          name STRING,
          age STRING,
          salary DOUBLE,
          is_active BOOLEAN,
          created_at TIMESTAMP,
          details STRUCT<department:STRING, position:STRING>
      )
      USING PARQUET;
      INSERT INTO example_table_parquet04 VALUES
      ('1', 'Alice', '30', 5000.50, TRUE, TIMESTAMP '2024-01-01 10:00:00', STRUCT('HR', 'Manager')),
      ('2', 'Bob', '25', 6000.75, FALSE, TIMESTAMP '2024-02-01 11:00:00', STRUCT('Engineering', 'Developer')),
      ('3', 'Charlie','35', 7000.00, TRUE, TIMESTAMP '2024-03-01 12:00:00', STRUCT('Marketing', 'Analyst')),
      ('4', 'David', '40', 8000.25, FALSE, TIMESTAMP '2024-04-01 13:00:00', STRUCT('Sales', 'Representative')),
      ('5', 'Eve', '28', 5500.50, TRUE, TIMESTAMP '2024-05-01 14:00:00', STRUCT('Support', 'Technician'));
      SELECT * FROM example_table_parquet04;
  3. Log on to the OSS console and view the generated data files at the destination path.

  4. Log on to the MaxCompute client, add _SUCCESS to the blacklist, and read the Parquet file from OSS.

    SELECT  *
    FROM location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/spark/example_table_parquet04/'
    (
    'file_format'='parquet',
    'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
    'file_pattern_blacklist'='.*_SUCCESS.*'
    );

    The following result is returned:

    +----+---------+-----+------------+-----------+---------------------+----------------------------------------------+
    | id | name    | age | salary     | is_active | created_at          | details                                      |
    +----+---------+-----+------------+-----------+---------------------+----------------------------------------------+
    | 1  | Alice   | 30  | 5000.5     | true      | 2024-01-01 10:00:00 | {department:HR, position:Manager}            |
    | 2  | Bob     | 25  | 6000.75    | false     | 2024-02-01 11:00:00 | {department:Engineering, position:Developer} |
    | 3  | Charlie | 35  | 7000.0     | true      | 2024-03-01 12:00:00 | {department:Marketing, position:Analyst}     |
    | 4  | David   | 40  | 8000.25    | false     | 2024-04-01 13:00:00 | {department:Sales, position:Representative}  |
    | 5  | Eve     | 28  | 5500.5     | true      | 2024-05-01 14:00:00 | {department:Support, position:Technician}    |
    +----+---------+-----+------------+-----------+---------------------+----------------------------------------------+

For more information about Schemaless Query operations, see Read Parquet data from a data lake using Schemaless Query.

Example 3: Use Schemaless Query in a subquery

  1. Prepare data.

    Log on to the OSS console and upload the test file part-00001.snappy.parquet to the OSS bucket directory object-table-test/schema/. For more information, see Upload OSS files.

  2. Log on to the MaxCompute client and create an internal table to store the data that is automatically discovered from OSS.

    CREATE TABLE ow_test (
        id INT,
        name STRING,
        age INT
    );
  3. Use Schemaless Query to read data from OSS.

    SELECT * FROM 
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
    (
      'file_format'='parquet'
    );

    The following result is returned:

    +------------+------------+------------+
    | id         | name       | age        |
    +------------+------------+------------+
    | 3          | Charlie    | 35         |
    | 4          | David      | 40         |
    | 5          | Eve        | 28         |
    +------------+------------+------------+
  4. Use the OSS data as a subquery in an outer SQL statement to query the ow_test table.

    INSERT OVERWRITE TABLE ow_test
    SELECT id,name,age FROM 
    (
        SELECT * FROM 
        LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
        (
          'file_format'='parquet'
        )
    );
    SELECT * FROM ow_test;

    The following result is returned:

    +------------+------------+------------+
    | id         | name       | age        |
    +------------+------------+------------+
    | 3          | Charlie    | 35         |
    | 4          | David      | 40         |
    | 5          | Eve        | 28         |
    +------------+------------+------------+

Example 4: Save Schemaless Query results to an internal data warehouse table

  1. Prepare data.

    Log on to the OSS console and upload the test file part-00001.snappy.parquet to the OSS bucket directory object-table-test/schema/. For more information, see Upload OSS files.

  2. Log on to the MaxCompute client and use Schemaless Query to read data from OSS.

    SELECT * FROM 
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
    (
      'file_format'='parquet'
    );

    The following result is returned:

    +------------+------------+------------+
    | id         | name       | age        |
    +------------+------------+------------+
    | 3          | Charlie    | 35         |
    | 4          | David      | 40         |
    | 5          | Eve        | 28         |
    +------------+------------+------------+
  3. Use the CREATE TABLE AS statement to copy the data that is automatically discovered from OSS into an internal table and then query the table to view the result.

    CREATE TABLE ow_test_2 AS 
      SELECT * FROM 
      LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
      (
        'file_format'='parquet'
       );
    -- Query the result table ow_test_2
    SELECT * FROM ow_test_2;

    The following result is returned:

    +------------+------------+------------+
    | id         | name       | age        |
    +------------+------------+------------+
    | 3          | Charlie    | 35         |
    | 4          | David      | 40         |
    | 5          | Eve        | 28         |
    +------------+------------+------------+

Example 5: UNLOAD Schemaless Query results back to the data lake

  1. Prepare data.

    Log on to the OSS console and upload the test file part-00001.snappy.parquet to the OSS bucket directory object-table-test/schema/. For more information, see Upload OSS files.

  2. Log on to the MaxCompute client and use Schemaless Query to read data from OSS.

    SELECT * FROM 
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
    (
      'file_format'='parquet'
    );

    The following result is returned:

    +------------+------------+------------+
    | id         | name       | age        |
    +------------+------------+------------+
    | 3          | Charlie    | 35         |
    | 4          | David      | 40         |
    | 5          | Eve        | 28         |
    +------------+------------+------------+
  3. Use the UNLOAD command to export the automatically discovered results to OSS. For more information about UNLOAD operations, see UNLOAD.

    UNLOAD FROM (
      SELECT * FROM 
      LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
      ('file_format'='parquet')
    ) 
    INTO 
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/unload/ow_test_3/'
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    WITH SERDEPROPERTIES ('odps.external.data.enable.extension'='true')
    STORED AS PARQUET;

    View the generated file in the OSS directory: After the command is successfully executed, the Parquet file 20250715070650775g0etr15glr2_M1_1_0_0-0_TableSink1.parquet is generated in the OSS path unload/ow_test_3/. The file size is 0.449 KB and the storage class is Standard.