Paimon-DLF external project

更新时间:
复制 MD 格式

MaxCompute provides external projects that map to a Data Lake Formation (DLF) catalog for real-time access to metadata and data. These projects delegate permission management to DLF and support metadata access and read/write operations on data in DLF-managed Object Storage Service (OSS). This approach enables you to map an entire data lake at the catalog level, facilitating Paimon-based cross-engine collaboration. The Paimon_DLF external project is currently in invitational preview.

Limitations

  • Table format: Only Paimon tables stored in OSS that are fully managed by DLF are supported.

  • Write limitations: Dynamic bucket tables and cross-partition tables do not support INSERT INTO or INSERT OVERWRITE.

  • Query limitations: You cannot use SHOW TABLES on cross-partition tables that have fixed buckets and whose primary key does not include the partition key.

  • External projects: You cannot perform schema evolution or partition operations.

  • Other limitations:

    • You cannot perform resource or function operations.

    • You cannot use these tables as a data source for DataWorks Data Integration.

  • Data type mappings

Procedure

Step 1: Grant permissions

Grant permissions to a RAM user

If you are a RAM user, attach the following permission policies. For information about how to add permissions, see Manage permissions for a RAM user.

  • AliyunRAMFullAccess: If a RAM user does not have this permission, an Alibaba Cloud account must grant it.

  • AliyunMaxComputeFullAccess: Required to create external data sources and external projects.

  • AliyunDLFReadOnlyAccess: Required to create external projects. This permission policy is necessary because creating a Paimon_DLF external project requires the List permission for the DLF catalog.

Authorize and activate DLF

  1. Authorize cloud resource access

    The first time you use Data Lake Formation (DLF), you must authorize it to access the required cloud resources.

    Log on to the Data Lake Formation (DLF) console, and in the upper-left corner, select a region.

    To the right of Permissions to access cloud resources are granted., click Authorize.

  2. Activate Data Lake Formation

    Select a region. MaxCompute and DLF must be deployed in the same region. Before you use DLF, you must activate the service to ensure its availability in the target region.

    To the right of DLF is activated., click Activate.

Grant DLF permissions to MaxCompute

MaxCompute accesses Paimon_DLF data through an external data source and an external project. Before you proceed, you must grant the necessary DLF data permissions to the MaxCompute access account.

The "Use task executor identity" option passes the task executor's identity to DLF. Both MaxCompute and DLF then enforce permissions based on this identity. Therefore, you must first create a service-linked role for MaxCompute and grant it the necessary permissions. Follow these steps:

  1. 登录RAM控制台

  2. 在左侧导航栏选择Identities > Roles

  3. Roles页面,单击Create Role

  4. Create Role页面的右上角,单击Create Service Linked Role

  5. Create Service Linked Role页面,选择Select ServiceAliyunServiceRoleForMaxComputeLakehouse,单击Create Service Linked Role

    If you are prompted that the role already exists, it means the role has already been authorized and you can ignore the prompt.

Permission model

  • Control plane: These permissions are checked when you perform operations in the MaxCompute console, primarily during the creation of a MaxCompute external project and its binding to a DLF catalog.

    These permissions are managed by RAM. Your Alibaba Cloud account must log on to the RAM console to configure the required permissions. For more information, see Manage permissions for a RAM user.

  • Data plane: These permissions are checked when you read from or write to Paimon tables after you log on to the external project. This typically occurs when you create, maintain, or use schemas, tables, and other resources within the bound DLF catalog by executing SQL statements in MaxCompute.

    These permissions are managed by the DLF console. Your Alibaba Cloud account must log on to the DLF console to configure the required permissions. For more information, see Grant permissions to a RAM user.

Step 2: Create a Paimon_DLF external data source

  1. Log in to the MaxCompute console and select a region in the upper-left corner.

  2. In the left-side navigation pane, choose Manage Configurations > External Data Source .

  3. On the External Data Source page, click Create External Data Source.

  4. In the Create External Data Source dialog box, configure the parameters. The following tables describe the parameters.

    Parameter

    Required

    Description

    External Data Source Type

    Yes

    Select Paimon_DLF.

    External Data Source Name

    Yes

    Enter a custom name. The name must meet the following requirements:

    • It must start with a letter and can contain only lowercase letters, digits, and underscores (_).

    • It cannot exceed 128 characters in length.

    For example, you can enter paimon_dlf.

    Description

    No

    Enter a description as needed.

    Region

    Yes

    The current region is selected by default.

    Authentication and Authorization

    Yes

    The default value is an Alibaba Cloud RAM role.

    Service-linked Role

    Yes

    The role is generated by default.

    Endpoint

    Yes

    The endpoint is automatically generated. For the China (Hangzhou) region, the endpoint is cn-hangzhou-intranet.dlf.aliyuncs.com.

    Foreign Server Supplemental Properties

    No

    These properties define how tasks that use this data source access the source system.

    Note

    For information about the supported parameters, see future updates to the official documentation. The available parameters will be expanded as the product evolves.

  5. Click OK to create the external data source.

  6. On the External Data Source page, find the data source and click Details in the Actions column to view its details.

Step 3: Create an external project

  1. Log in to the MaxCompute console and select a region in the upper-left corner.

  2. In the left-side navigation pane, choose Manage Configurations > Projects.

  3. On the External Project tab, click Create Project.

  4. In the Create Project dialog box, configure the settings and click OK.

    Parameter

    Required

    Description

    Project Type

    Yes

    The default value is External Project.

    Region

    Yes

    The current region is selected by default and cannot be changed.

    Project Name (Globally Unique)

    Yes

    The name must be 3 to 28 characters in length, start with a letter, and contain only letters, digits, and underscores (_).

    MaxCompute Foreign Server Type

    No

    The default value is Paimon_DLF.

    MaxCompute Foreign Server

    No

    • Use Existing: Displays a list of existing external data sources.

    • Create Foreign Server: Allows you to create and use a new external data source.

    MaxCompute Foreign Server Name

    Yes

    • If you chose to use an existing data source, select its name from the drop-down list.

    • If you chose to create a new data source, its name is automatically used.

    Data Catalog

    Yes

    The DLF data catalog.

    Billing Method

    Yes

    Select Subscription or Pay-as-you-go.

    Default Quota

    Yes

    Select an existing quota.

    Description

    No

    Enter a custom project description.

Step 4: Use SQL to access the data source

Important

Deleting an external project does not delete the underlying data because the project is only a mapping to the data source.

However, unlike with standard external tables, running a DROP TABLE or DROP SCHEMA command in an external project sends the request to the peer service. This permanently deletes the corresponding table or database. Use DROP operations with caution.

  1. Choose a connection tool to log on to the external project.

    The following example shows how to use an SQL node for data development in a DataWorks workspace (new version).

    1. Log on to the DataWorks console, and select a region in the upper-left corner.

    2. In the left-side navigation pane, select Workspace.

    3. On the Workspaces page, click Create Workspace or click the name of an existing workspace.

    4. On the Workspace Details page, click Computing Resource in the left-side navigation pane.

    5. On the Computing Resource page, click Associate Computing Resources, and select MaxCompute.

    6. In the Associate MaxCompute Computing Resource dialog box, fill in the Basic Information.

      For the MaxCompute project, select the external project.

  2. List the schemas in the external project.

    -- Enable schema syntax at the session level.
    SET odps.namespace.schema=true;
    SHOW schemas;
    -- Sample result. 
    ID = 20250919****am4qb
    default
    system
    OK
  3. List the tables in a schema within the external project.

    If you need to access tables from a production external project in the current environment, add use external_project_pro; before all SQL statements.

    -- schema_name is the schema name returned by SHOW SCHEMAS in the external project.
    SET odps.namespace.schema=true;
    USE schema <schema_name>;
    SHOW tables;
    -- Result
    ID = 20250919****am4qb
    acs:ram::<uid>:root  emp
    OK
  4. Create a new schema in the external project.

    -- In this example, schema_name is schema_demo_test.
    CREATE schema <schema_name>;
  5. Use the new schema.

    USE schema <schema_name>;
  6. Create a table and insert data into the schema.

    If you are logged on as a RAM user, this operation requires permissions on the DLF catalog. For information about how to grant permissions, see Data authorization management.

    • Command format:

      -- Create a table.
      CREATE TABLE [IF NOT EXISTS] <table_name> 
      (
        <col_name> <data_type>,
        ...
      )
      [COMMENT <table_comment>]
      [PARTITIONED BY (<col_name> <data_type>, ...)] 
      ;
      -- Insert data.
      INSERT {INTO|OVERWRITE} TABLE <table_name> [PARTITION (<pt_spec>)] [(<col_name> [,<col_name> ...)]]
      <select_statement>
      FROM <from_statement>
    • Example:

      Note

      For low-precision TIMESTAMP types from the source Paimon table, data is truncated during write operations as follows: precision 0–3 is truncated to 3 decimal places, 4–6 to 6 decimal places, and 7–9 to 9 decimal places.

      CREATE TABLE schema_table(id int,name string);
      INSERT INTO schema_table VALUES (101,'Zhang San'),(102,'Li Si');
      -- Query the schema_table table.
      SELECT * FROM schema_table;
      -- Result
      +------------+------------+
      | id         | name       | 
      +------------+------------+
      | 101        | Zhang San  | 
      | 102        | Li Si      | 
      +------------+------------+
  7. Switch to the existing default schema.

    use schema default;
    SHOW tables;
    -- Sample result
    ID = 20250919*******yg5
    acs:ram::<uid>:root	emp
    acs:ram::<uid>:root	emp_detail
    acs:ram::<uid>:root	test_table
    OK
    -- Read data from the table.
    SELECT * FROM test_table;
    -- Result
    +------------+------------+
    | id         | name       | 
    +------------+------------+
    | 101        | Zhang San  | 
    | 102        | Li Si      | 
    +------------+------------+
    -- Write data to the table and query to verify the write operation.
    INSERT INTO test_table VALUES (103,'Wang Wu');
    SELECT * FROM test_table;
    -- Result
    +------------+------------+
    | id         | name       | 
    +------------+------------+
    | 101        | Zhang San  | 
    | 102        | Li Si      | 
    | 103        | Wang Wu    | 
    +------------+------------+

Passing Paimon table properties

Apache Paimon includes core configuration options. When creating a Paimon table in an external project, specify these options in the TBLPROPERTIES clause of your CREATE TABLE statement.

Configuration: Add parameters prefixed with mcfed. to the TBLPROPERTIES list. The parameter names following the prefix must match the native Paimon parameter names.

Example

Create a Paimon external table with buckets, a primary key, and partitions

  1. Create the table and configure its external parameters

    -- Switch to your external project. You can skip this step if you are already in it.
    use <your external project>;
    -- Enable schema syntax at the session level.
    SET odps.namespace.schema=true;
    -- Select the schema to use.
    use schema <your schema>;
    CREATE TABLE oss_extable_bucket_pk_pt_bucket
    (
        id BIGINT,
        name STRING,
        dt STRING
    )tblproperties (
        'mcfed.bucket'='3', -- Number of buckets
        'mcfed.bucket-key'='id', -- Bucket key. Optional if a primary key is specified.
        "mcfed.primary-key"="dt,id", -- Primary key
        "mcfed.partition"="dt" -- Partition field
        );
  2. Insert data into the external table

    INSERT INTO oss_extable_bucket_pk_pt_bucket PARTITION (dt='2025-06-18') 
      VALUES (1, 'Alice'),(2, 'Bob');
    INSERT INTO oss_extable_bucket_pk_pt_bucket PARTITION (dt='2025-06-19') 
      VALUES (3, 'Charlie'),(4, 'David'),(5, 'Eva');
  3. Query the external table

    SELECT * FROM oss_extable_bucket_pk_pt_bucket;
    -- Result:
    +------------+---------+------------+
    | id         | name    | dt         |
    +------------+---------+------------+
    | 1          | Alice   | 2025-06-18 |
    | 2          | Bob     | 2025-06-18 |
    | 4          | David   | 2025-06-19 |
    | 3          | Charlie | 2025-06-19 |
    | 5          | Eva     | 2025-06-19 |
    +------------+---------+------------+
    
  4. Log on to the Data Lake Formation (DLF) console, and in the upper-left corner, select a region.

    View the table details in the catalog:

    The table name is oss_extable_bucket_pk_pt_bucket, the table format is Paimon primary key table, the table type is Managed, the file format is parquet, and it is in the default database. Key settings in Table Properties include bucket: 3 and manifest.merge-min-count: 10. The field list includes id (BIGINT, not null, primary key, not a partition field), name (STRING, nullable, not a primary key, not a partition field), and dt (STRING, not null, primary key, partition field).

FAQ

Issue 1: Error creating an external project

Problem: When you attempt to create an external project, you receive the error message: "You are not authorized to perform this action."

In the error details, the error code is NoMCPermission.

Solution:

  1. If you are using a RAM user, ensure that the AliyunMaxComputeFullAccess policy is attached to the RAM user. For more information, see Step 1.

  2. This operation requires you to select a catalog from Data Lake Formation (DLF). Ensure you have the required permissions for DLF operations. For more information, see Step 1.

Issue 2: Error running SHOW TABLES in default schema

Problem: When you run the SHOW TABLES command in the default schema of an external project, you receive the following error message: "Forbidden:User acs: ram: :<uid>:user/** doesn't have privilege LIST on DATABASE default".

Solution:

  1. Log on to the Data Lake Formation (DLF) console, and in the upper-left corner, select a region.

  2. In the navigation pane on the left, choose System & Security

  3. On the Access Control > Users tab, refresh the page to check if the current RAM user exists.

  4. If the user exists, go to the Access Control > Role tab to grant permissions to the RAM user.

Issue 3: "invalid database operations on two-tier" error

Problem: When you run the SHOW SCHEMAS command in an external project, you receive the error message: "invalid database operations on two-tier".

Solution:

  1. First, ensure that you are using an external project.

    1. If you are using a DataWorks SQL node, click Debug Configuration in the right-side pane and verify that the bound compute resource is an external project.

    2. If you are using DataWorks SQL analysis, click Data Source in the upper-right corner and verify that the bound data source is an external project.

    3. If you are using MaxCompute SQL analysis, click Run Configuration in the right-side pane and verify that the project selected in the compute resource section is an external project.

    4. If you are using odpscmd, check the value of the project_name parameter in the odps_config.ini file to ensure that it is set to the name of the external project.

  2. After you confirm that you are using an external project, select and run the following SQL statements together.

    -- Enable schema syntax at the session level.
    SET odps.namespace.schema=true;
    SHOW schemas;

Issue 4: SHOW SCHEMAS returns only default schema

Problem: When you run the SHOW SCHEMAS command in an external project, the command returns only 'default'.

Solution:

  1. First, ensure that you are using an external project.

    1. If you are using a DataWorks SQL node, click Debug Configuration in the right-side pane and verify that the bound compute resource is an external project.

    2. If you are using DataWorks SQL analysis, click Data Source in the upper-right corner and verify that the bound data source is an external project.

    3. If you are using MaxCompute SQL analysis, click Run Configuration in the right-side pane and verify that the project selected in the compute resource section is an external project.

    4. If you are using odpscmd, check the value of the project_name parameter in the odps_config.ini file to ensure that it is set to the name of the external project.

  2. After you confirm that you are using an external project, log on to the Data Lake Formation (DLF) console and check whether other databases exist in the catalog that is bound to the external project.

  3. If you have completed the verifications in Steps 1 and 2, but the SHOW SCHEMAS command still does not return the expected databases, you can submit a ticket to report the issue.

Issue 5: "Dynamic bucket" error

Problem: When you run the SHOW TABLES command in an external project, you receive the following error message: ODPS-0110005: com.aliyun.odps.meta.exception.MetaException: com.aliyun.odps.common.table.na.NativeException: common/table/jni/utils/jni_helper.cpp(79): UnretryableException: Common table connector exception - ExceptionType: java.lang.IllegalArgumentException - Message: You should use dynamic bucket (bucket = -1) mode in cross partition update case (Primary key constraint [sending_time, symbol, sequence_number] not include all partition fields [pt]).[pt])

Solution:

This error occurs because MaxCompute does not support cross-partition primary key tables with fixed buckets.

Temporary workaround: Delete the unsupported cross-partition PK tables in the Data Lake Formation (DLF) console.

Issue 6: "Can't set default schema..." error

Problem: When you run SHOW TABLES or SHOW SCHEMAS, you receive the following error message: "FAILED: Can't set default schema if odps.namespace.schema is false."

Solution:

  1. First, ensure that you have added the session flag SET odps.namespace.schema=true; before your SQL statements to enable schema syntax.

  2. If the error persists after you enable the flag, verify the order of your statements. To read a table from another external project, use the following sequence of commands:

    USE external_project;
    SET odps.namespace.schema=true;
    USE schema schema_name;
    SHOW tables;

Issue 7: "Failed to connect to...dlf.aliyuncs.com" error

Problem: When you try to read data from a table in a Paimon_DLF external project, you receive an error message similar to the following: "Caused by: java.net.ConnectException: Failed to connect to cn-hangzhou-intranet.dlf.aliyuncs.com/xx.xx.xx.xx:80"

This error indicates that you need to add the address cn-hangzhou-intranet.dlf.aliyuncs.com to the allowlist for the internal project where the job is running.

Solution:

  1. Join the MaxCompute developer community DingTalk group by using the application link or by searching for the group ID 11782920 in DingTalk. Contact the MaxCompute technical support team and request that they add the address from the error message to the allowlist.

  2. Once the address is added to the allowlist, include the following parameters before your read operations.

    set odps.security.outbound.intranetlist=cn-hangzhou-intranet.dlf.aliyuncs.com:80;
    set odps.internet.access.list=cn-hangzhou-intranet.dlf.aliyuncs.com:80;

Data type mapping

For details on MaxCompute data types, see Data types (V1.0) and Data types (V2.0).

Paimon data type

MaxCompute 2.0 data type

Read/write support

Description

TINYINT

TINYINT

Supported

8-bit signed integer.

SMALLINT

SMALLINT

Supported

16-bit signed integer.

INT

INT

Supported

32-bit signed integer.

BIGINT

BIGINT

Supported

64-bit signed integer.

BINARY(MAX_LENGTH)

BINARY

Supported

Binary data type. The current maximum length is 8 MB.

FLOAT

FLOAT

Supported

32-bit binary floating-point number.

DOUBLE

DOUBLE

Supported

64-bit binary floating-point number.

DECIMAL(precision,scale)

DECIMAL(precision,scale)

Supported

Exact decimal numeric type. The default is decimal(38,18). You can customize the precision and scale values.

  • precision: The maximum number of digits. The valid values are 1 to 38.

  • scale: The number of digits in the fractional part. The valid values are 0 to 18.

VARCHAR(n)

VARCHAR(n)

Supported

Variable-length character type. n specifies the length and ranges from 1 to 65,535.

CHAR(n)

CHAR(n)

Supported

Fixed-length character type. n specifies the length and ranges from 1 to 255.

VARCHAR(MAX_LENGTH)

STRING

Supported

String type. The current maximum length is 8 MB.

DATE

DATE

Supported

Date type. The format is yyyy-mm-dd.

TIME, TIME(p)

Not supported

Not supported

The Paimon TIME data type represents a time without a time zone, consisting of hours, minutes, and seconds, with nanosecond precision.

TIME(p) specifies the fractional-second precision from 0 to 9. The default value is 0.

No corresponding type exists in MaxCompute.

TIMESTAMP, TIMESTAMP(p)

TIMESTAMP_NTZ

Supported

A timestamp type without a time zone that is accurate to the nanosecond.

To read this data type, you must disable the native JNI bridge by running the following command: SET odps.sql.common.table.jni.disable.native=true;

TIMESTAMP WITH LOCAL TIME_ZONE(9)

TIMESTAMP

Supported

  • A timestamp type that is accurate to the nanosecond. The format is yyyy-mm-dd hh:mm:ss.xxxxxxxxx.

  • For low-precision TIMESTAMP types in the Paimon source table, MaxCompute truncates values during writes. For precision 0-3, data is truncated to 3 digits. For precision 4-6, data is truncated to 6 digits. For precision 7-9, data is truncated to 9 digits.

TIMESTAMP WITH LOCAL TIME_ZONE(9)

DATETIME

Not supported

A timestamp type that is accurate to the nanosecond.

The format is yyyy-mm-dd hh:mm:ss.xxxxxxxxx.

BOOLEAN

BOOLEAN

Supported

A BOOLEAN type.

ARRAY

ARRAY

Supported

A complex type.

MAP

MAP

Supported

A complex type.

ROW

STRUCT

Supported

A complex type.

MULTISET<t>

Not supported

Not supported

No corresponding type exists in MaxCompute.

VARBINARY, VARBINARY(n), BYTES

BINARY

Supported

A data type of variable-length binary strings.