Data partitioning

更新时间:
复制 MD 格式

This topic describes three strategies for managing partitioned tables.

Partitioned tables

In Fluss, a partitioned table organizes data based on one or more partition keys. This effectively improves query performance and manageability for large datasets. Partitions divide data into segments, each corresponding to a specific partition key value.

Fluss supports the following three management strategies for partitioned tables:

  • Manual partition management: You can manually create new partitions or delete existing ones. For more information, see Manage Partitions.

  • Automatic partition management: The system automatically creates partitions based on the automatic partitioning rules configured when you create the table. It also automatically cleans up expired partitions to prevent unlimited data growth. For more information, see Automatic Partitioning.

  • Dynamic partition creation: Fluss dynamically creates partitions based on the data written to the table. For more information, see Dynamic Partitioning.

These three strategies are orthogonal and can coexist on the same table.

Multi-field partitioned tables

Partitioned tables, including both primary key tables and log tables, support configuring partition keys based on multiple fields. This approach allows you to partition data by combinations of field values, enabling more granular data organization, management, and query optimization.

For example, in an Order primary key table, you can define the partition key as (date, region). Data is then stored in partitions corresponding to specific combinations, such as date=2025-04-05, region='US'. In streaming queries, you can filter data, for example, by region='US', to leverage partition pruning and improve read performance through partition pushdown.

Key advantages of partitioned tables

  • Improved query performance: Querying specific partitions reduces the amount of data the system reads, which shortens query execution time.

  • Clearer data organization: Partitions help organize data logically, making it easier to manage and query.

  • Better scalability: Dividing large datasets into smaller, more manageable chunks improves system scalability.

Limitations

  • The partition key type must be STRING.

  • Tables with a single partition key support both automatic partition creation and cleanup. Tables with multiple partition keys only support automatic cleanup of expired partitions.

  • If the table is a primary key table, the partition key must be a subset of the primary key.

  • You can configure automatic partitioning rules only at table creation; they cannot be modified afterward.

Automatic partitioning

Example

You configure automatic partitioning rules using table options. The following example shows how to use Flink SQL to create a table named site_access that supports automatic partitioning.

CREATE TABLE site_access(
  event_day STRING,  -- Partition key
  site_id INT,       -- Site ID
  city_code STRING,  -- City code
  user_name STRING,  -- User name
  pv BIGINT,         -- Page views
  PRIMARY KEY(event_day, site_id) NOT ENFORCED 
) PARTITIONED BY (event_day) WITH (
  'table.auto-partition.enabled' = 'true',            -- Enable automatic partitioning
  'table.auto-partition.time-unit' = 'YEAR',          -- Set the time granularity to YEAR
  'table.auto-partition.num-precreate' = '5',         -- Pre-create 5 partitions
  'table.auto-partition.num-retention' = '2',         -- Retain 2 historical partitions
  'table.auto-partition.time-zone' = 'Asia/Shanghai'  -- Set the time zone to Asia/Shanghai
);

In this example, when an automatic partitioning task runs (Fluss performs this periodically in the background on all tables), the system pre-creates five partitions and retains two historical partitions using YEAR as the time granularity. The time zone is set to Asia/Shanghai.

Table parameters

Parameter

Type

Required

Default

Description

table.auto-partition.enabled

Boolean

No

false

Specifies whether to enable automatic partitioning for the table. The default value is false. When enabled, partitions are created automatically.

table.auto-partition.key

String

No

None

For tables with multiple partition keys, this parameter specifies which key is used for time-based automatic partitioning. The system compares this key's value to the current time to create and remove partitions. Required for multi-key tables; not used for single-key tables.

table.auto-partition.time-unit

ENUM

No

DAY

The time granularity for automatically creating partitions. The default value is 'DAY'. Valid values are 'HOUR', 'DAY', 'MONTH', 'QUARTER', and 'YEAR'. If the value is 'HOUR', partitions are created in the yyyyMMddHH format. If 'DAY', the format is yyyyMMdd. If 'MONTH', the format is yyyyMM. If 'QUARTER', the format is yyyyQ. If 'YEAR', the format is yyyy.

table.auto-partition.num-precreate

Integer

No

2

The number of partitions to pre-create during each automatic partitioning check. For example, if the check time is 2024-11-11 and this value is 3, the partitions 20241111, 20241112, and 20241113 are pre-created. If a partition already exists, its creation is skipped. The default value is 2. For example, if 'table.auto-partition.time-unit' is 'DAY' (the default), one pre-created partition is for today and the other is for tomorrow. Pre-creation is not supported for tables with multiple partition keys; this parameter is ignored for them.

table.auto-partition.num-retention

Integer

No

7

The number of historical partitions to retain during each automatic partitioning check. For example, if the check time is 2024-11-11, the time unit is DAY, and this value is 3, the historical partitions 20241108, 20241109, and 20241110 are retained. The system deletes partitions older than 20241108. The default value is 7.

table.auto-partition.time-zone

String

No

System time zone

The time zone for automatic partitioning. Defaults to the system time zone.

Partition generation rules

The table.auto-partition.time-unit for an automatically partitioned table can be set to HOUR, DAY, MONTH, QUARTER, or YEAR. Automatic partitioning creates partitions in the following formats.

Time unit

Format

Example

HOUR

yyyyMMddHH

2024091922

DAY

yyyyMMdd

20240919

MONTH

yyyyMM

202409

QUARTER

yyyyQ

20241

YEAR

yyyy

2024

Fluss cluster configuration

The following table describes the configuration options related to automatic partitioning for a Fluss cluster.

Parameter

Type

Default

Description

auto-partition.check.interval

Duration

10 minutes

The interval at which Fluss checks tables with automatic partitioning enabled. During each check, Fluss creates or deletes partitions as needed based on the table's rules.

Dynamic partitioning

Dynamic partitioning is a client-side feature, enabled by default, that automatically creates partitions as data is written to a table. This is valuable when the full set of partitions is unknown in advance, eliminating the need for manual creation. It is also particularly useful when you work with multi-field partitions because automatic partitioning currently supports partition creation for single-field keys only.

Note

The number of dynamically created partitions is also limited by the max.partition.num and max.bucket.num parameters configured on the Fluss cluster.

  • The default value of max.partition.num is 1000. This represents the maximum number of partitions that can be created for each partitioned table.

  • The default value of max.bucket.num is 128000. This represents the maximum total number of buckets supported per table, which is calculated as (number of partitions) × (number of buckets per partition).

Client options

Parameter

Type

Required

Default

Description

client.writer.dynamic-create-partition.enabled

Boolean

No

true

Specifies whether to enable dynamic partition creation for the client writer. If enabled, the client automatically creates a new partition if it does not exist at write time.

Dynamic partitioning example

Step 1: Create a sample table

In the left navigation bar, choose Data Development > Data Query, create a new, blank data query script, and enter the following SQL to create a partitioned table with the dt column as the partition key.

CREATE TABLE IF NOT EXISTS `fluss-demo`.fluss.table_enable_dynamic_partition
(
    id    BIGINT
    ,name STRING
    ,dt   STRING
)
PARTITIONED BY (dt)
WITH (
    'client.writer.dynamic-create-partition.enabled' = 'true'
)
;

Verify the current partitions of the table:

SHOW PARTITIONS `fluss-demo`.fluss.table_enable_dynamic_partition;

The query result indicates that the table's partition list is empty.

Step 2: Dynamically insert data

Run the following SQL statement to insert data into a non-existent partition:

INSERT INTO `fluss-demo`.fluss.table_enable_dynamic_partition values(1, 'hello', '20250701');

Step 3: Verify partition creation

Run the SHOW PARTITIONS command again to verify that the partition has been dynamically created.

The query result shows dt=20250701 in the partition name column, confirming that the partition was dynamically created.