This feature is in invitational preview. To enable it, submit a ticket.
Data Discovery automatically scans Object Storage Service (OSS) buckets, detects file formats and schemas, and registers the data as foreign tables in MaxCompute. After the data is registered, you can query it directly with SQL or MaxFrame, with no manual DDL required. MaxCompute applies enterprise-grade access control, data masking, and row-level permissions to all discovered data.
How it works
Each discovery task run performs the following steps:
-
Scans the OSS path you specify at the configured frequency (every 5 minutes to every 7 days).
-
Detects the file format and infers the schema from each data file.
-
Maps the folder hierarchy to table names and partition columns using the discovery rule.
-
Registers matching tables, partitions, and schemas as foreign tables in the target MaxCompute project and schema.
After each run, you can query the registered foreign tables with SQL immediately.
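As a rough illustration of querying a registered foreign table, the following Python sketch composes a SQL statement for one partition. It assumes the three-part `project.schema.table` naming that applies when schema-level syntax is enabled. The helper `build_partition_query` and the names `my_project` and `my_schema` are hypothetical. The table and partition values come from the example path in this document.

```python
import re

# Conservative patterns so that only plain names and values end up in the SQL text.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_PARTITION_VALUE = re.compile(r"[A-Za-z0-9_.:-]+")


def build_partition_query(project, schema, table, partition=None, limit=10):
    """Compose a SELECT over a discovered foreign table (hypothetical helper).

    `partition` maps partition column names to string values, for example
    {"dt": "2025-09-16", "hh": "01"}.
    """
    partition = partition or {}
    for name in (project, schema, table, *partition.keys()):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    for value in partition.values():
        if not _PARTITION_VALUE.fullmatch(value):
            raise ValueError(f"unexpected partition value: {value!r}")

    sql = f"SELECT * FROM {project}.{schema}.{table}"
    if partition:
        predicates = " AND ".join(f"{col} = '{val}'" for col, val in partition.items())
        sql += f" WHERE {predicates}"
    return f"{sql} LIMIT {int(limit)};"


# my_project and my_schema are placeholders, not names from the product.
query = build_partition_query(
    "my_project", "my_schema", "ods_vehicle_gps_raw",
    {"dt": "2025-09-16", "hh": "01"},
)
print(query)
```

The resulting statement can be submitted through whichever SQL client you already use with MaxCompute. The sketch only builds the text and does not connect to any service.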
Use cases
Automated log analysis: Application logs that are continuously written to OSS in JSON or CSV format and partitioned by date are automatically detected and registered as foreign tables. Analysts can query each new partition with SQL after the next discovery run registers it, with no manual ingestion pipeline needed.
Specifications
| Dimension | Details |
|---|---|
| Supported data source | OSS |
| Supported file formats | Parquet (uncompressed, SNAPPY, ZSTD, GZIP); ORC (uncompressed, SNAPPY, ZLIB); JSON (uncompressed, BZIP2, GZIP, LZ4, DEFLATE); CSV (uncompressed, SNAPPY, GZIP) |
| Discovery frequency | 5 minutes / 15 minutes / 60 minutes / 1 day / 7 days |
| Path-to-table mapping rule | oss://<LOCATION path>/<foreign table>/<partition (optional)>/<file> |
| Task limit | 100 per Alibaba Cloud account |
| Available regions | China (Beijing), China (Shenzhen) |
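The path-to-table mapping rule in the table above can be illustrated with a small Python sketch. It is not the service's implementation. The function `map_object_path` and the `MappedObject` type are hypothetical, and the sketch assumes that partition folders use `key=value` names, as in the `dt=2025-09-16/hh=01` sample path used in this document.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MappedObject:
    table: str                         # first folder under the Location path
    partitions: List[Tuple[str, str]]  # (column, value) pairs from key=value folders
    file_name: str                     # last path segment


def map_object_path(location: str, object_path: str) -> MappedObject:
    """Split an object path according to
    oss://<LOCATION path>/<foreign table>/<partition (optional)>/<file>."""
    if not location.endswith("/"):
        location += "/"
    if not object_path.startswith(location):
        raise ValueError("object is not under the Location path")

    segments = object_path[len(location):].split("/")
    if len(segments) < 2 or not all(segments):
        raise ValueError("expected <foreign table>/.../<file> under Location")

    table, *folders, file_name = segments
    partitions = []
    for folder in folders:
        # Assumption: every folder between the table and the file is key=value.
        key, sep, value = folder.partition("=")
        if not sep or not key:
            raise ValueError(f"not a key=value partition folder: {folder!r}")
        partitions.append((key, value))
    return MappedObject(table, partitions, file_name)


mapped = map_object_path(
    "oss://maxlake/",
    "oss://maxlake/ods_vehicle_gps_raw/dt=2025-09-16/hh=01/vin1_2025-09-16_01.parquet",
)
print(mapped.table)
print([column for column, _ in mapped.partitions])
print(mapped.file_name)
```

For the sample path, the sketch yields the table `ods_vehicle_gps_raw` and the partition columns `dt` and `hh`, which matches the result described for that path in this document.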
Limitations
-
Regions: Data Discovery is available only in the China (Beijing) and China (Shenzhen) regions.
-
Permissions: Only an Alibaba Cloud account owner or a user with the tenant-level Datascan_Admin role can create and manage discovery tasks.
| Role | Permissions |
|---|---|
| Datascan_Admin | List, view, create, update, and delete data discovery tasks. |
-
Schema conflicts: If a newly discovered foreign table has the same name as an existing user-created table in the target schema, the task skips creating that foreign table.
-
Task deletion: Deleting a task does not delete the foreign tables already registered. However, their schemas are no longer updated based on changes in the data lake.
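The schema-conflict limitation above can be sketched in a few lines of Python. This is only an illustration of the documented rule, not the service's code. The function `plan_registration` and the table name `ods_orders` are hypothetical, and names are compared exactly because the document does not describe case handling.

```python
def plan_registration(discovered_tables, existing_user_tables):
    """Return (to_register, skipped) following the schema-conflict rule."""
    existing = set(existing_user_tables)
    to_register, skipped = [], []
    for name in discovered_tables:
        # A discovered table that collides with a user-created table is skipped.
        if name in existing:
            skipped.append(name)
        else:
            to_register.append(name)
    return to_register, skipped


to_register, skipped = plan_registration(
    ["ods_vehicle_gps_raw", "ods_orders"],  # names found under the Location path
    ["ods_orders"],                         # user-created tables already in the schema
)
print(to_register)
print(skipped)
```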
Prerequisites
Before you begin, ensure that you have:
-
An OSS bucket in the same region and under the same Alibaba Cloud account as your MaxCompute project
-
A data lake connection configured as the access credential for the OSS bucket
-
A MaxCompute project with schema-level syntax enabled
-
The tenant-level Datascan_Admin role (for Resource Access Management (RAM) users)
Grant the Datascan_Admin role
An Alibaba Cloud account owner or a user with the tenant-level Super_Administrator or Admin role can grant the Datascan_Admin role.
-
Log on to the MaxCompute console and select a region in the top-left corner.
-
In the left navigation pane, choose Manage Configurations > Tenants.
-
On the Tenants page, click the Roles tab.
-
Select Datascan_Admin, then click New Authorization in the Actions column.
-
In the Newly Added Authorization dialog box, add the users to authorize and click OK.
For more information about tenant-level roles, see Grant permissions to a tenant-level role.
Create a discovery task
-
Log on to the MaxCompute console and select a region in the top-left corner.
-
In the left navigation pane, choose MaxLake > Data Discovery.
-
On the Data Discovery page, click Create a data discovery task.
-
In the Create Task dialog box, configure the following parameters and click Create.
Basic configuration
| Parameter | Description |
|---|---|
| Task Name | A unique name for the task within the tenant. |
| Task Description | An optional description of the task. |
| Task cycle | How often the task scans for new data: 5 minutes / 15 minutes / 60 minutes / 1 day / 7 days. |
Lake Data Configuration
| Parameter | Description |
|---|---|
| Connection | Select a data lake connection as the access credential for OSS. |
| Location | The OSS path to scan. Format: oss://<Bucket name>/<OSS path>/. The bucket must be in the same region and under the same Alibaba Cloud account as the discovery task. |
| Discovery Format | The file format to detect: Parquet, ORC, JSON, or CSV. |
Catalog Configuration
| Parameter | Description |
|---|---|
| Project | Select a MaxCompute project with schema-level syntax enabled. |
| Schema | Select a schema. If a newly discovered foreign table has the same name as an existing user-created table in the schema, the task does not create the foreign table. |
How the path-to-table mapping rule works
The task maps the folder hierarchy under the Location path to foreign tables and partition columns:
oss://<LOCATION path>/<foreign table>/<partition (optional)>/<file>
Example: If Location is set to oss://maxlake/ and a file exists at:
oss://maxlake/ods_vehicle_gps_raw/dt=2025-09-16/hh=01/vin1_2025-09-16_01.parquet
The task creates the following:
| Result | Value |
|---|---|
| Foreign table | ods_vehicle_gps_raw |
| Partition columns | dt, hh |
| Schema | Inferred from vin1_2025-09-16_01.parquet |
CSV format notes
-
The first row of a CSV file is used as column names. The task automatically sets skip.header.line.count=1 on the foreign table to skip the header during reads.
-
The default quote character is a double quotation mark ("). Fields containing a line break, a double quotation mark (escaped as ""), or a comma must be enclosed in double quotation marks.
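To make the CSV quoting rules concrete, the following sketch uses Python's standard `csv` module to write a file with a header row and a field that contains a comma, a double quotation mark, and a line break. It only illustrates the quoting convention described in the CSV format notes. It assumes that the convention matches what the `csv` module emits with `"` as the quote character, and the column names `vin` and `note` are made up.

```python
import csv
import io

rows = [
    ["vin", "note"],                         # header row: supplies the column names
    ["vin1", "plain text"],                  # no special characters, left unquoted
    ["vin2", 'says "hello", then\nbreaks'],  # comma, double quote, and line break
]

# Write with the double quotation mark as the quote character. Embedded quotes
# are doubled ("") and only fields that need quoting are enclosed.
buffer = io.StringIO()
writer = csv.writer(buffer, quotechar='"', quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Read it back: the first row is treated as the header, the rest as data,
# which mirrors skipping one header line on reads.
parsed = list(csv.reader(io.StringIO(text, newline="")))
header, data = parsed[0], parsed[1:]
print(header)
print(data)
```

A file written this way keeps the multi-line field intact when it is read back, because the whole field is enclosed in double quotation marks.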
View discovery results
-
Log on to the MaxCompute console and select a region in the top-left corner.
-
In the left navigation pane, choose MaxLake > Data Discovery.
-
Find the target task and click Browse Results in the Operation column.
-
On the details page, review the following sections:
-
Basic Information: The task name, Discovery Configuration, and Recently Found Time.
-
Recently discovered results: Discovered foreign tables, including Table Name and Table partition. Click a table to query its schema and data using SQL.
-
Historical Discovery Record: The run history, showing the discovery time and the number of tables discovered in each run. Task logs are retained for the most recent 2,000 runs or the last 180 days. Logs outside these limits are deleted.
Manage discovery tasks
-
Log on to the MaxCompute console and select a region in the top-left corner.
-
In the left navigation pane, choose MaxLake > Data Discovery.
-
On the Data Discovery page, use the following controls in the task list:
| Action | How to |
|---|---|
| Pause or resume a task | Click the Scheduling switch in the Status column. |
| Trigger an immediate run | Click Trigger once immediately in the Operation column. |
| Edit task name, description, or schedule | Click Edit in the Operation column. |
| Delete a task | Click Delete in the Operation column. Registered foreign tables are not deleted, but their schemas are no longer updated. |