Enable lake-stream integration

更新时间:
复制 MD 格式

This topic describes how to enable the lake-stream integration feature.

Prerequisites

Service activation

Version limitations

  • DLF-Legacy is not supported.

Permission requirements

  • Only the cluster owner can enable the lake-stream integration service.

  • The account used to enable the service must have the following permissions:

    • The Editor permission in Flink, which lets you create and manage lake-stream integration jobs. For more information, see Grant permissions in the development console.

    • The AliyunDLFFullAccess permission in DLF to create databases and tables, and to perform read and write operations on tables. For more information, see Authorize users.

Enable lake-stream integration

Step 1: Enable lake-stream integration for the cluster

  1. Log on to the Realtime Compute for Apache Flink console.

  2. On the Streaming Storage Fluss tab, click Console in the Actions column to access the Fluss cluster.

  3. On the Cluster Overview page, click the image toggle in the lower-right corner to enable the service and associate it with Flink.

    image

  4. Click Confirm. After the service is enabled, you can view the sync task in the associated Flink project.

    Important
    • The sync task requires 0.5 CU by default to run. Make sure the project has enough resources to start the task.

    • The task runs continuously and incurs fees. To stop the task and the associated fees, disable lake-stream integration.

  5. (Optional) If the service fails to start or an unexpected error occurs, navigate to the Operation Center > Job O&M page of the associated Flink workspace and locate the JAR job whose name starts with tiering-fluss.

    • If you cannot locate the job, the service submission may have failed or you may have insufficient permissions.

    • If the job exists but its status is abnormal, such as FAILED or CANCELED, check the job logs to identify the cause of the error.

Step 2: Enable lake-stream integration for a table

Table not created

In the Flink workspace, register a Fluss Catalog. Then, in Data Studio > Data Query, create a table.

CREATE TABLE `my-catalog`.`fluss`.`datalake_orders` (
  shop_id BIGINT,
  user_id BIGINT,
  num_orders INT,
  total_amount INT,
  PRIMARY KEY (shop_id, user_id) NOT ENFORCED
) WITH (
  'bucket.num' = '4',
  'table.datalake.enabled' = 'true',
  'table.datalake.freshness' = '30min'
);

Additional parameters for lake-stream integration

Parameter

Description

Default value

Notes

table.datalake.enabled

Specifies whether to enable lake-stream integration.

false

If you do not add this parameter, you can enable the feature manually in the console.

table.datalake.freshness

Defines the maximum allowed latency (data freshness) of the Paimon table data relative to the source Fluss table.

3min

The Streaming Storage Fluss service automatically synchronizes data based on this configuration. A smaller value improves the real-time performance of the data lakehouse. If your business is not sensitive to latency, you can increase this value to reduce resource consumption.

Important

This parameter cannot be modified later.

paimon.*

Any configuration item prefixed with paimon. is used as a configuration for the underlying Paimon lake table in Fluss.

None

Lake-stream integration in Fluss uses the default Paimon parameters to create the underlying Paimon lake table. To configure other parameters for the Paimon lake table, such as setting the format to ORC, you can set 'paimon.file.format' = 'orc'.

For more information about parameter settings, see Paimon Configuration.

Important

Creating Paimon tables with deletion vectors enabled is not supported. Do not specify the 'paimon.deletion-vectors.enabled' = 'true' parameter.

For existing tables

  • Enable using SQL: Enable the feature using the ALTER TABLE syntax.

    ALTER TABLE datalake_orders SET ('table.datalake.enabled' = 'false');
  • Enable in the console

    In the navigation pane on the left of the Fluss console, click Data Management. Click the name of the target database. In the list of tables that appears, click the image toggle for the desired table to enable lake-stream integration.

    image

Query data synchronization

After you enable lake-stream integration for a data table, the system automatically synchronizes the table's metadata to DLF. In DLF, under the associated catalog, you can view a database that has the same name as the database in Fluss. This database contains the same table schema and metadata as the source table and is used for unified queries and management.

Before synchronization, make sure that no table with the same name exists in DLF. Otherwise, the synchronization may fail or cause conflicts.

image

Data consistency in lake-stream integration

  • After you enable lake-stream integration for a table, a Flink sync job continuously writes data from Streaming Storage Fluss to a Paimon-based data lake. The table schema and metadata are managed by DLF and are retained for long-term use.

  • If you delete a table in Streaming Storage Fluss for which lake-stream integration is enabled, the synchronized data lake table (Paimon table) in DLF is not deleted. You must manually delete the table in DLF.

  • For primary key tables, data changes are processed in changelog mode:

    • When a record is deleted, the system does not physically remove the historical data. Instead, it appends a DELETE change log record.

    • When a Union Read query is executed, the Flink engine automatically merges real-time data from Streaming Storage Fluss with historical data from Paimon. The engine then removes duplicates and merges states based on the primary key and change logs.

    • Logically deleted records do not appear in the final result. This ensures the correctness and consistency of the query results.

Disable lake-stream integration

Disable lake-stream integration for a data table

  • In the console: In the navigation pane on the left of the Fluss console, click Data Management. Disable the lake-stream integration feature for each required table.

  • Using SQL: Disable the feature using the ALTER TABLE syntax.

    ALTER TABLE datalake_orders SET ('table.datalake.enabled' = 'false');

Disable lake-stream integration for the cluster

On the Cluster Overview page, click the image toggle in the lower-right corner to disable the lake-stream integration service.

Note
  • You cannot disable the service at the cluster level if lake-stream integration is still enabled for any tables.

  • When you disable the lake-stream integration service, the Flink sync job also stops.