MaxCompute supports external schemas for Data Lake Formation (DLF) and Object Storage Service (OSS). You can map a DLF Legacy Catalog and configure a connection to OSS. This lets you access metadata and data in real time.
Introduction
A DLF and OSS data source is a common data lake architecture that uses OSS as the data lake storage service and DLF Legacy (hereinafter referred to as DLF) as the metadata management service. MaxCompute lets you create this type of external data source. The combination of MaxCompute, DLF, and OSS creates a data lakehouse. This provides more flexible and efficient data management and processing.
To access metadata in DLF and data in OSS, you must specify a RAM role that has permissions to access both DLF and OSS. This role is used by the MaxCompute external schema. After you map the data source, you can use the MaxCompute permission system to manage the mapped data objects in the external schema based on object names.
A DLF and OSS external schema maps to your own OSS at the catalog.database level. You must ensure that the metadata in DLF is synchronized with the metadata of the data in OSS.
Usage notes
Region restrictions: The data lakehouse 2.0 feature is available only in the China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt) regions.
MaxCompute, DLF, and OSS must be deployed in the same region.
Version restrictions: When you run the
SHOW tables IN <external_schema_name>command to access the data source:If you run the command in the MaxCompute client (odpscmd), the client version must be 0.46.8 or later. You can run the
SHOW VERSION;command to view the current version.If you run the command in DataWorks DataStudio, the MaxCompute CMD version integrated into the DataWorks resource group must be 0.46.8 or later.
Procedure
Step 1: Enable services and grant permissions
-
Grant permissions
When you build a data lakehouse with MaxCompute, DLF, and OSS, you must grant permissions. This is because the MaxCompute project account does not have permissions to access DLF or OSS by default. You can grant permissions in one of the following ways:
-
One-click authorization: Use this method if the account used to create the MaxCompute project is the same as the account used to deploy DLF and OSS. Click Authorize DLF and OSS to grant the required permissions.
Custom authorization: Use this method regardless of whether the account used to create the MaxCompute project is the same as the account used to deploy DLF and OSS. For more information, see Custom authorization for DLF.
-
Step 2: Create a DLF+OSS external data source
-
Log in to the MaxCompute console and select a region in the upper-left corner.
-
In the left-side navigation pane, choose .
-
On the External Data Source page, click Create External Data Source.
-
In the Create External Data Source dialog box, configure the parameters. The following tables describe the parameters.
Parameter
Required
Description
External Data Source Type
Required
Select DLF+OSS.
External Data Source Name
Required
Enter a custom name. The name must follow these rules:
It must start with a letter and can contain only lowercase letters, underscores (_), and digits.
It cannot exceed 128 characters in length.
Example:
dlf_oss_schema.Description
Optional
Enter a description as needed.
Region
Required
The current region is used by default.
DLF endpoint
Required
The DLF Endpoint of the current region is used by default.
OSS endpoint
Required
The OSS Endpoint of the current region is used by default.
Role ARN
Required
The Alibaba Cloud Resource Name (ARN) of the RAM role. This role must have permissions to access both the DLF and OSS services.
-
Log in to the RAM console.
-
In the left navigation bar, select .
-
In the Basic Information section, you can obtain the ARN.
Example:
acs:ram::124****:role/aliyunodpsdefaultrole.Foreign Server Supplemental Properties
Optional
Specifies additional properties for the external data source. After you specify these properties, tasks that use this data source can access the source system based on the defined behavior.
NoteFor information about the supported parameters, see the official documentation. More parameters will be supported as the product evolves.
-
Click OK to create the external data source.
Step 3: Create a DLF data catalog and a database
-
Create a DLF data catalog
-
登录数据湖构建(DLF)控制台,在左上角选择地域。
-
在左侧导航栏,选择。
-
在Metadata页面,单击Catalog List页签。
-
在New Catalog页面,自定义Catalog ID,Location填写OSS Bucket目录。
Example Catalog ID:
external_schema_dlf_catalogExample Location:
oss://external-schema/dlf-oss
-
-
Create a database
-
在Metadata页面,单击Database页签。
-
在新建的Catalog List下单击Create Database。配置如下参数:
Parameter
Required
Description
Catalog
Required
Example:
external_schema_dlf_catalog.Database Name:
Required
Enter a custom database name. The name must be 1 to 128 characters in length, start with a letter, and can contain letters, digits, and underscores (_). Example:
dlf_oss_db.Database Description:
Optional
Enter a custom description.
Select Path:
Required
The storage location of the database. Example:
oss://external-schema/dlf-oss/database.
-
Step 4: Create an external schema
-
You can perform external schema operations only in a MaxCompute project that supports schemas.
-
In the left-side navigation pane, choose .
-
On the Projects page, click Enable Schema in the Actions column for the target project.
-
Before you perform schema operations or query data from an external schema, you must enable schema support for SQL syntax. To do this, add the
SET odps.namespace.schema=true;statement before your SQL statements.In the MaxCompute client, run the following command to create an external schema that references the DLF and OSS external data source:
CREATE EXTERNAL SCHEMA IF NOT EXISTS <external_schema_name> with <dlfoss_foreign_server_name> ON '<dlf_calaog_id>.<database_name>';The following table describes the parameters.
external_schema_name: The name of the external schema. Example:test_schema.dlfoss_foreign_server_name: The name of the external data source that you created. The project to which the external schema belongs must be in the same region as the external data source. Example:dlf_oss_schema.dlf_calaog_id: The ID of the DLF data catalog. Example:external_schema_dlf_catalogdatabase_name: The name of the database in the specified DLF data catalog. Example:dlf_oss_db.
Step 5: Use SQL to access the data source
You can list, view, and delete the created external schema. The commands and methods are the same as those for internal schemas. For more information, see Schema operations.
List the names of tables in DLF and OSS
Method 1:
SHOW tables IN <external_schema_name>; -- external_schema_name is the name of the external schema.Method 2:
USE SCHEMA <external_schema_name>; -- external_schema_name is the name of the external schema. SHOW tables;
Examples
Query the names of all tables in the test_schema external schema.
USE SCHEMA test_schema; SHOW TABLES;Sample result:
ALIYUN$xxx@test.aliyunid.com:hellocsv ALIYUN$xxx@test.aliyunid.com:t1 ALIYUN$xxx@test.aliyunid.com:t2 ALIYUN$xxx@test.aliyunid.com:t3Query data from the
hellocsvtable in thetest_schemaschema of thetest_projectproject.SELECT * FROM test_project.test_schema.hellocsv;Copy data from the federated foreign table
hellocsvto a data warehouse.MaxCompute does not allow you to run
CREATE TABLE ASoperations on external schema objects. You must switch to the default schema before you run the following statements:-- Copy data from the federated foreign table to the data warehouse. CREATE TABLE hellocsv_copy AS SELECT * FROM test_project.test_schema.hellocsv; -- Query the copied data in the data warehouse. SELECT * FROM hellocsv_copy;
Viewing an external data source
On the MaxCompute External Data Source page, you can view the external data sources you created. The following table describes the information displayed for each data source.
Parameter | Description |
Data Source Name | The name of the external data source. |
Type | The type of the external data source. The following types are supported: DLF+OSS, Paimon_DLF, Hive+HDFS, Filesystem Catalog, and Hologres. |
Network Connection | If an external data source uses a network connection, the network connection name is displayed. Currently, only the Hive+HDFS type supports network connections. |
Owner Account | The account that created the external data source. External schemas use this data source to access the source system, and the access permissions are derived from the identity specified by the creator.
|
Mounted Projects | The number of projects to which the external data source is mounted.
|
Creation Time | The time when the external data source was created. |
Update Time | The time when the editable properties of the external data source were last modified. |
Actions > Details | You can view the properties of the external data source. On the details page, click Edit to modify the Modifying information such as permissions may invalidate authorization relationships configured in the project due to changes in the scope of visible objects. |
Actions > Delete | Deletes the external data source. Important Deleting an external data source revokes access permissions for all dependent tasks. MaxCompute also deletes the configured mounted relationships between this external data source and all projects. |