Data management

Data Lake Formation (DLF) provides table management APIs compatible with the Apache Paimon REST Catalog. The file storage structure is fully compatible with Paimon, allowing any Paimon-compatible engine or application to create, update, query, and delete tables managed by DLF.
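Because DLF exposes a Paimon REST Catalog interface, a Paimon client generally connects by passing REST catalog options. The following is a minimal sketch in Python that only assembles such options. `metastore`, `uri`, and `warehouse` are standard Paimon catalog option keys. The endpoint URL, the catalog name, the use of `warehouse` to carry the DLF catalog name, and the helper `build_rest_catalog_options` are assumptions made for illustration. Authentication options are omitted because they depend on the deployment.

```python
from typing import Dict


def build_rest_catalog_options(endpoint: str, catalog_name: str) -> Dict[str, str]:
    """Assemble Paimon catalog options for a REST catalog.

    Hypothetical helper: the endpoint and catalog name are placeholders,
    not values taken from the DLF documentation.
    """
    if not endpoint.startswith(("http://", "https://")):
        raise ValueError("endpoint must be an HTTP(S) URL")
    return {
        "metastore": "rest",        # select the REST catalog implementation
        "uri": endpoint,            # address of the REST catalog server
        "warehouse": catalog_name,  # assumed here to be the DLF catalog name
    }


# Placeholder values; replace them with the real endpoint and catalog.
options = build_rest_catalog_options("https://dlf.example.invalid", "my_catalog")
print(options)
```

The resulting dictionary can be handed to whichever Paimon-compatible client is in use, for example as catalog properties in an engine's catalog definition.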

The main data hierarchy in a catalog is as follows:

  1. Catalog: The top-level container for metadata. A catalog organizes resources in a hierarchy that enforces isolation and permission boundaries across services and users. It also governs data lake storage and table operations.

  2. Database: A logical grouping within a catalog that provides finer-grained organization and access control.

  3. Table: DLF supports multiple table types for unified management and seamless compatibility with different compute engines and formats. To enable encryption at rest, submit a ticket.

  4. View: Views are persisted in DLF and support multiple SQL dialects, so the same view can be configured for use across different compute engines.

  5. Function: Functions are persisted in DLF. Currently supported types are Flink JAR functions (Java and Python) and Java Lambda functions that run on the Spark engine.
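The hierarchy above can be pictured as nested containers. The following Python sketch is an illustrative data model only, not the DLF API. The class names, the `catalog.database.object` naming scheme, and the sample dialect keys are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class View:
    name: str
    # One query text per SQL dialect; the dialect names are illustrative.
    queries: Dict[str, str] = field(default_factory=dict)


@dataclass
class Database:
    name: str
    tables: Dict[str, str] = field(default_factory=dict)     # table name -> table type
    views: Dict[str, View] = field(default_factory=dict)     # view name -> view
    functions: Dict[str, str] = field(default_factory=dict)  # function name -> kind


@dataclass
class Catalog:
    name: str
    databases: Dict[str, Database] = field(default_factory=dict)

    def qualified_names(self) -> Iterator[str]:
        """Yield catalog.database.object for every table, view, and function."""
        for db in self.databases.values():
            for group in (db.tables, db.views, db.functions):
                for obj_name in group:
                    yield f"{self.name}.{db.name}.{obj_name}"


sales = Database(
    name="sales",
    tables={"orders": "paimon"},
    views={
        "recent_orders": View(
            "recent_orders",
            {"flink": "SELECT * FROM orders", "spark": "SELECT * FROM orders"},
        )
    },
    functions={"mask_email": "java"},
)
catalog = Catalog(name="demo", databases={"sales": sales})
names = list(catalog.qualified_names())
print(names)
```

Storing one query text per dialect on the view mirrors the idea that a single persisted view can serve several engines.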

Unified metadata service

DLF provides a unified metadata management service. A single catalog centrally manages metadata for tables, views, and functions, breaking down silos between compute engines so that Alibaba Cloud's big data and AI engines can access omni-modal data without per-engine configuration.

Multi-modal data support

DLF manages both structured and unstructured data through a single catalog, with native support for major open table formats and AI data types:

  • Data lake formats: Full support for Apache Paimon, Apache Iceberg, and their ecosystem components.

  • AI and vector data: Support for the Lance format for high-performance AI retrieval and training.

  • Unstructured data: Manage non-tabular datasets such as images and videos using Object Tables, providing interoperability between storage and compute.

  • Standard file formats: Support for Hive-compatible format tables, including Parquet, CSV, and ORC.

Unified access control

DLF centralizes security management: permissions are defined once in DLF instead of being configured separately in each compute engine.

  • Granular authorization: Control access at four levels—catalog, database, table, and column.

  • Multi-engine synchronization: Authorize an operation once and the policy takes effect across all connected compute engines.

This ensures consistent data access, strengthens security, and significantly simplifies cross-engine operations and maintenance (O&M).
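To make the four authorization levels concrete, the following Python sketch checks a request against a set of grants. It is not the DLF permission model. It assumes that a grant at a coarser level (for example, a database) also covers everything beneath it, which the text above does not state. The names `covers` and `is_allowed` are hypothetical.

```python
from typing import Optional, Set, Tuple

# A resource is (catalog, database, table, column); trailing parts may be None.
Resource = Tuple[str, Optional[str], Optional[str], Optional[str]]


def covers(grant: Resource, target: Resource) -> bool:
    """Return True if the grant applies to the target resource.

    Assumption: None in a grant means "everything below this level".
    """
    for granted, wanted in zip(grant, target):
        if granted is None:
            return True   # the grant stops at a coarser level and covers the rest
        if granted != wanted:
            return False
    return True


def is_allowed(grants: Set[Resource], target: Resource) -> bool:
    """Check a request against every grant held by the principal."""
    return any(covers(g, target) for g in grants)


grants: Set[Resource] = {
    ("lake", "sales", None, None),      # database-level grant
    ("lake", "hr", "staff", "name"),    # column-level grant
}

print(is_allowed(grants, ("lake", "sales", "orders", "amount")))
print(is_allowed(grants, ("lake", "hr", "staff", "salary")))
```

Because the check takes no engine as input, the same decision applies regardless of which engine issues the request, which is the point of authorizing once in a central service.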