DLF+OSS external schema

更新时间:
复制 MD 格式

MaxCompute supports external schemas for Data Lake Formation (DLF) and Object Storage Service (OSS). You can map a DLF Legacy Catalog and configure a connection to OSS. This lets you access metadata and data in real time.

Introduction

A DLF and OSS data source is a common data lake architecture that uses OSS as the data lake storage service and DLF Legacy (hereinafter referred to as DLF) as the metadata management service. MaxCompute lets you create this type of external data source. The combination of MaxCompute, DLF, and OSS creates a data lakehouse. This provides more flexible and efficient data management and processing.

To access metadata in DLF and data in OSS, you must specify a RAM role that has permissions to access both DLF and OSS. This role is used by the MaxCompute external schema. After you map the data source, you can use the MaxCompute permission system to manage the mapped data objects in the external schema based on object names.

A DLF and OSS external schema maps to your own OSS at the catalog.database level. You must ensure that the metadata in DLF is synchronized with the metadata of the data in OSS.

Usage notes

  • Region restrictions: The data lakehouse 2.0 feature is available only in the China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt) regions.

  • MaxCompute, DLF, and OSS must be deployed in the same region.

  • Version restrictions: When you run the SHOW tables IN <external_schema_name> command to access the data source:

    • If you run the command in the MaxCompute client (odpscmd), the client version must be 0.46.8 or later. You can run the SHOW VERSION; command to view the current version.

    • If you run the command in DataWorks DataStudio, the MaxCompute CMD version integrated into the DataWorks resource group must be 0.46.8 or later.

Procedure

Step 1: Enable services and grant permissions

  1. Enable Data Lake Formation (DLF)

  2. Activate OSS

  3. Grant permissions

    When you build a data lakehouse with MaxCompute, DLF, and OSS, you must grant permissions. This is because the MaxCompute project account does not have permissions to access DLF or OSS by default. You can grant permissions in one of the following ways:

    • One-click authorization: Use this method if the account used to create the MaxCompute project is the same as the account used to deploy DLF and OSS. Click Authorize DLF and OSS to grant the required permissions.

    • Custom authorization: Use this method regardless of whether the account used to create the MaxCompute project is the same as the account used to deploy DLF and OSS. For more information, see Custom authorization for DLF.

Step 2: Create a DLF+OSS external data source

  1. Log in to the MaxCompute console and select a region in the upper-left corner.

  2. In the left-side navigation pane, choose Manage Configurations > External Data Source .

  3. On the External Data Source page, click Create External Data Source.

  4. In the Create External Data Source dialog box, configure the parameters. The following tables describe the parameters.

    Parameter

    Required

    Description

    External Data Source Type

    Required

    Select DLF+OSS.

    External Data Source Name

    Required

    Enter a custom name. The name must follow these rules:

    • It must start with a letter and can contain only lowercase letters, underscores (_), and digits.

    • It cannot exceed 128 characters in length.

    Example: dlf_oss_schema.

    Description

    Optional

    Enter a description as needed.

    Region

    Required

    The current region is used by default.

    DLF endpoint

    Required

    The DLF Endpoint of the current region is used by default.

    OSS endpoint

    Required

    The OSS Endpoint of the current region is used by default.

    Role ARN

    Required

    The Alibaba Cloud Resource Name (ARN) of the RAM role. This role must have permissions to access both the DLF and OSS services.

    1. Log in to the RAM console.

    2. In the left navigation bar, select Identities > Roles.

    3. In the Basic Information section, you can obtain the ARN.

    Example: acs:ram::124****:role/aliyunodpsdefaultrole.

    Foreign Server Supplemental Properties

    Optional

    Specifies additional properties for the external data source. After you specify these properties, tasks that use this data source can access the source system based on the defined behavior.

    Note

    For information about the supported parameters, see the official documentation. More parameters will be supported as the product evolves.

  5. Click OK to create the external data source.

Step 3: Create a DLF data catalog and a database

  1. Create a DLF data catalog

    1. 登录数据湖构建(DLF)控制台,在左上角选择地域。

    2. 在左侧导航栏,选择Metadata > Metadata

    3. Metadata页面,单击Catalog List页签。

    4. New Catalog页面,自定义Catalog IDLocation填写OSS Bucket目录。

      1. Example Catalog ID: external_schema_dlf_catalog

      2. Example Location: oss://external-schema/dlf-oss

  2. Create a database

    1. Metadata页面,单击Database页签。

    2. 在新建的Catalog List下单击Create Database。配置如下参数:

      Parameter

      Required

      Description

      Catalog

      Required

      Example: external_schema_dlf_catalog.

      Database Name:

      Required

      Enter a custom database name. The name must be 1 to 128 characters in length, start with a letter, and can contain letters, digits, and underscores (_). Example: dlf_oss_db.

      Database Description:

      Optional

      Enter a custom description.

      Select Path:

      Required

      The storage location of the database. Example: oss://external-schema/dlf-oss/database.

Step 4: Create an external schema

  1. You can perform external schema operations only in a MaxCompute project that supports schemas.

    1. In the left-side navigation pane, choose Manage Configurations > Projects.

    2. On the Projects page, click Enable Schema in the Actions column for the target project.

  2. Before you perform schema operations or query data from an external schema, you must enable schema support for SQL syntax. To do this, add the SET odps.namespace.schema=true; statement before your SQL statements.

  3. In the MaxCompute client, run the following command to create an external schema that references the DLF and OSS external data source:

    CREATE EXTERNAL SCHEMA IF NOT EXISTS <external_schema_name>
    with <dlfoss_foreign_server_name>
    ON '<dlf_calaog_id>.<database_name>';

    The following table describes the parameters.

    • external_schema_name: The name of the external schema. Example: test_schema.

    • dlfoss_foreign_server_name: The name of the external data source that you created. The project to which the external schema belongs must be in the same region as the external data source. Example: dlf_oss_schema.

    • dlf_calaog_id: The ID of the DLF data catalog. Example: external_schema_dlf_catalog

    • database_name: The name of the database in the specified DLF data catalog. Example: dlf_oss_db.

Step 5: Use SQL to access the data source

You can list, view, and delete the created external schema. The commands and methods are the same as those for internal schemas. For more information, see Schema operations.

  • List the names of tables in DLF and OSS

    • Method 1:

      SHOW tables IN <external_schema_name>; -- external_schema_name is the name of the external schema.
    • Method 2:

      USE SCHEMA <external_schema_name>; -- external_schema_name is the name of the external schema.
      SHOW tables;
  • Examples

    • Query the names of all tables in the test_schema external schema.

      USE SCHEMA test_schema;
      SHOW TABLES;

      Sample result:

      ALIYUN$xxx@test.aliyunid.com:hellocsv
      ALIYUN$xxx@test.aliyunid.com:t1
      ALIYUN$xxx@test.aliyunid.com:t2
      ALIYUN$xxx@test.aliyunid.com:t3
    • Query data from the hellocsv table in the test_schema schema of the test_project project.

      SELECT * FROM test_project.test_schema.hellocsv;
    • Copy data from the federated foreign table hellocsv to a data warehouse.

      MaxCompute does not allow you to run CREATE TABLE AS operations on external schema objects. You must switch to the default schema before you run the following statements:

      -- Copy data from the federated foreign table to the data warehouse.
      CREATE TABLE hellocsv_copy AS SELECT * FROM test_project.test_schema.hellocsv;
      
      -- Query the copied data in the data warehouse.
      SELECT * FROM hellocsv_copy;

Viewing an external data source

On the MaxCompute External Data Source page, you can view the external data sources you created. The following table describes the information displayed for each data source.

Parameter

Description

Data Source Name

The name of the external data source.

Type

The type of the external data source. The following types are supported: DLF+OSS, Paimon_DLF, Hive+HDFS, Filesystem Catalog, and Hologres.

Network Connection

If an external data source uses a network connection, the network connection name is displayed. Currently, only the Hive+HDFS type supports network connections.

Owner Account

The account that created the external data source. External schemas use this data source to access the source system, and the access permissions are derived from the identity specified by the creator.

  • The creator can mount the external data source to specific projects to specify which projects can use the external data source. A project administrator can then use policies to grant users in the project permissions to use the external data source.

  • If project-level tenant resource access control is not enabled, any user who creates an external project or external schema can use this external data source. Access to the external system is still based on the RAM role permissions specified by the creator in the external data source.

Mounted Projects

The number of projects to which the external data source is mounted.

  • If project-level tenant resource access control is enabled for a project, the project cannot access the external system through this external data source unless a tenant administrator mounts the data source to the project.

  • If project-level tenant resource access control is not enabled for a project, the system does not check the mounted relationship between the project and the referenced external data source.

Creation Time

The time when the external data source was created.

Update Time

The time when the editable properties of the external data source were last modified.

Actions > Details

You can view the properties of the external data source. On the details page, click Edit to modify the basic information of the data source.

Modifying information such as permissions may invalidate authorization relationships configured in the project due to changes in the scope of visible objects.

Actions > Delete

Deletes the external data source.

Important

Deleting an external data source revokes access permissions for all dependent tasks. MaxCompute also deletes the configured mounted relationships between this external data source and all projects.