Configure a Paimon Catalog data source in DataWorks

DataWorks supports configuring a Paimon Catalog data source to collect and govern metadata for Paimon tables that do not originate from Data Lake Formation (DLF). This specialized data source helps you unify the governance of Paimon data lake assets in Data Map. This topic describes how to configure this data source.

Introduction

With the growing adoption of the lakehouse architecture in enterprises, open table formats such as Paimon, Iceberg, and Delta Lake have become the cornerstones for building real-time data warehouses and enabling unified batch and stream processing. In the Flink stream processing ecosystem, Paimon Catalog is widely used because of its native compatibility with Flink.

DataWorks is deeply integrated with Data Lake Formation and supports unified management of and access to data lake tables through DLF data sources. However, not every Paimon catalog is managed by DLF. For example, a user might define a Paimon Catalog using the Flink engine, with the metadata and data stored in Alibaba Cloud OSS.

Existing data source systems cannot effectively discover or deeply manage this type of native, non-DLF-managed lake format metadata. To address this, DataWorks introduces the Paimon Catalog data source to support metadata collection and governance for native data lake formats. This feature fills the management gap for self-declared catalogs, making end-to-end lakehouse data visible, manageable, and usable.
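
To make the self-declared catalog scenario concrete, the following Python sketch renders the kind of Flink SQL statement a user might run to define a filesystem-backed Paimon catalog on OSS. The option keys `type`, `metastore`, `warehouse`, and `fs.oss.endpoint` follow the Apache Paimon documentation for Flink. The helper name `render_create_catalog`, the bucket path, and the endpoint value are illustrative assumptions, and credentials are deliberately left out.

```python
from typing import Dict, Optional


def render_create_catalog(
    name: str,
    warehouse: str,
    options: Optional[Dict[str, str]] = None,
) -> str:
    """Render a Flink SQL CREATE CATALOG statement for a Paimon catalog.

    The catalog uses the filesystem metastore, so both metadata and data
    live under the given warehouse path.
    """
    props = {
        "type": "paimon",
        "metastore": "filesystem",
        "warehouse": warehouse,
    }
    props.update(options or {})

    def quote(value: str) -> str:
        # Escape single quotes so the value stays a valid SQL string literal.
        return value.replace("'", "''")

    body = ",\n".join(
        f"    '{quote(key)}' = '{quote(value)}'" for key, value in props.items()
    )
    # Backticks allow catalog names that contain hyphens, such as paimon-catalog.
    return f"CREATE CATALOG `{name}` WITH (\n{body}\n);"


if __name__ == "__main__":
    # Hypothetical bucket path and endpoint, for illustration only.
    print(
        render_create_catalog(
            "paimon-catalog",
            "oss://bucket/path/warehouse",
            {"fs.oss.endpoint": "oss-cn-hangzhou-internal.aliyuncs.com"},
        )
    )
```

A catalog declared this way is the kind of non-DLF catalog that the Paimon Catalog data source is meant to cover. The warehouse path passed here is the same path you later enter in the Warehouse parameter.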

Limitations

Network connectivity: Only a serverless resource group is supported.
Scenarios: Paimon Catalog is currently used only for metadata collection and governance. It does not support data integration or synchronization tasks. To read from and write to Paimon tables for data synchronization, use other data sources, such as DLF or OSS.

Procedure

1. Go to the Data Sources page

Log on to the DataWorks console and switch to the target region. In the left navigation bar, click Workspace, and then click Manage in the Actions column of the target workspace to go to the management page.
On the workspace management page, click Data Sources in the left navigation bar to go to the Data Sources page.

2. Add a Paimon Catalog data source

On the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, search for and select Paimon Catalog.

3. Configure parameters

Configure the following parameters:

Parameter	Description
Data Source Name	Specify a custom data source name, such as `paimon_finance`.
Catalog	The name of the catalog for the connection, such as `paimon-catalog`. We recommend that you set the catalog name to be the same as the one used by the computing engine to ensure accurate metadata mapping.
MetaStore	The storage type of the catalog. Currently, only Filesystem is supported.
Filesystem	The file storage type. Currently, only OSS is supported.
Access Mode	Two modes are available. RAM Role Authorization Mode: accesses the OSS path of the catalog by using RAM role authorization. For configuration instructions, see Configure a data source by using the RAM Role Authorization Mode. Alibaba Cloud RAM User: uses the currently logged-on account as the identity for accessing OSS.
Region	Select a bucket in the same region as the workspace for optimal performance. For cross-region data sources, establish a VPC peering connection. For details, see Connect to a data source in a different region under the same Alibaba Cloud account. Alternatively, connect by using a public endpoint.
Endpoint	For information about endpoint configuration, see Access domain names and data centers.
Warehouse	Required. The storage path of the Paimon Catalog in OSS. Enter the full path, for example `oss://bucket/path/warehouse`. Make sure that the path is correct. Otherwise, metadata collection fails. To select a path from a list instead of typing it, click the folder icon to the right of the input box.
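
The constraints in the table above can be checked before you fill in the form. The following Python sketch is only an illustration: the dictionary keys and the function name `validate_paimon_source` are hypothetical and do not correspond to any DataWorks API. It assumes that a "full path" means the `oss://` scheme followed by a bucket name.

```python
from urllib.parse import urlparse

# Values taken from the parameter table: only these are currently supported.
SUPPORTED_METASTORES = {"Filesystem"}
SUPPORTED_FILESYSTEMS = {"OSS"}


def validate_paimon_source(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors = []

    if not config.get("name"):
        errors.append("Data Source Name is empty")
    if not config.get("catalog"):
        errors.append("Catalog is empty")
    if config.get("metastore") not in SUPPORTED_METASTORES:
        errors.append("MetaStore must be Filesystem")
    if config.get("filesystem") not in SUPPORTED_FILESYSTEMS:
        errors.append("Filesystem must be OSS")

    # Assumption: a full warehouse path has the oss:// scheme and a bucket name.
    parsed = urlparse(config.get("warehouse", ""))
    if parsed.scheme != "oss" or not parsed.netloc:
        errors.append("Warehouse must be a full path such as oss://bucket/path/warehouse")

    return errors


if __name__ == "__main__":
    candidate = {
        "name": "paimon_finance",
        "catalog": "paimon-catalog",
        "metastore": "Filesystem",
        "filesystem": "OSS",
        "warehouse": "oss://bucket/path/warehouse",
    }
    print(validate_paimon_source(candidate))
```

A check like this only catches formatting mistakes. Whether the path actually contains a Paimon warehouse is confirmed by the connectivity test and by metadata collection.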

4. Test connectivity

After you configure the data source, run a connectivity test to verify the connection between the data source and the resource group.

If Connected is displayed, the configuration is correct.
If Connection failed is displayed, a diagnostic tool opens to help you troubleshoot. Common causes include incorrect credentials and network connectivity issues, such as an unconfigured IP address whitelist or a missing NAT gateway.
In standard mode, you must ensure that both the development environment and the production environment are Connected. Otherwise, errors will occur during subsequent operations such as metadata collection.
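
The standard-mode rule above can be expressed as a small check. This Python sketch is illustrative only: the function name `blocking_environments` and the status dictionary are hypothetical, and the status strings are assumed to match the labels shown by the connectivity test.

```python
# In standard mode, both environments must report Connected.
REQUIRED_ENVIRONMENTS = ("development", "production")


def blocking_environments(statuses: dict) -> list:
    """Return the environments that are not Connected, in a fixed order."""
    return [
        env for env in REQUIRED_ENVIRONMENTS
        if statuses.get(env) != "Connected"
    ]


if __name__ == "__main__":
    # Hypothetical test results for a standard-mode workspace.
    results = {"development": "Connected", "production": "Connection failed"}
    blocked = blocking_environments(results)
    if blocked:
        print("Fix connectivity before collecting metadata:", ", ".join(blocked))
    else:
        print("Both environments are connected.")
```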

Next steps

After adding the data source, go to the Data Map module to collect metadata. You can then view and govern this metadata.