Lake-mode columnar tables are Lindorm's analytical storage format built on the open Apache Iceberg standard. Data is persisted as columnar files on the instance's underlying file system, and OLAP resource groups and ETL resource groups within the same instance share a single copy of this lake-mode data, enabling cooperative multi-engine reads and writes without data copying.
Lake-mode columnar tables are wire-compatible with the open Iceberg protocol and interoperate with Apache Spark, Apache Flink, and similar engines — a natural fit for a multi-engine, lake-warehouse unified architecture. Compared with warehouse-mode columnar tables, lake-mode columnar tables prioritize data sharing and ecosystem compatibility, while warehouse-mode columnar tables offer a more complete SQL surface and higher ad-hoc query performance.
After connecting to an OLAP resource group, switch to the lake-mode columnar data source:
SET CATALOG lindorm_columnar;
Lake-mode columnar table
When to use
-
Append-only detail analytics — raw logs, behavioral events, instrumentation traces — that never need to be updated.
-
Multi-engine offline data warehouse fact tables — Flink/Spark batch-loads the data, the OLAP resource group runs the analytical queries.
-
Lake-warehouse unified scenarios that need to interoperate with external compute engines such as Apache Spark and Apache Flink.
Data model
Lake-mode columnar tables are built on Apache Iceberg. Data versions are managed through Iceberg's snapshot mechanism, so reads and writes never block each other. Data is organized by partition, and queries skip irrelevant partitions through partition pruning. Multiple engines within the same instance — OLAP resource groups, ETL resource groups, Flink, and Spark — read and write the same underlying data directly, with no migration or synchronization required.
How to use
-- Example: user access log table (partitioned by year and month)
-- Partition columns year and month are placed at the end of the column list
CREATE TABLE access_log (
user_id BIGINT COMMENT "User ID",
path VARCHAR(512) COMMENT "Request path",
status_code INT COMMENT "HTTP status code",
latency_ms INT COMMENT "Response latency in milliseconds",
region VARCHAR(32) COMMENT "Region",
year STRING NOT NULL COMMENT "Year",
month STRING NOT NULL COMMENT "Month"
)
ENGINE = iceberg
PARTITION BY (year, month);
-- Insert a single row
INSERT INTO access_log VALUES (1001, '/api/v1/query', 200, 32, 'east', '2024', '03');
-- Insert query results into the table (cross-catalog supported)
INSERT INTO access_log (user_id, path, status_code, latency_ms, region, year, month)
SELECT user_id, path, status_code, latency_ms, region, year, month
FROM default_catalog.ods.raw_access_log
WHERE year = '2024' AND month = '03';
-- Query (with partition pruning)
SELECT * FROM access_log WHERE year = '2024' AND month = '03' LIMIT 100;
-- Aggregation
SELECT month, COUNT(*) AS pv, COUNT(DISTINCT user_id) AS uv
FROM access_log
WHERE year = '2024'
GROUP BY month
ORDER BY month;
-- Cross-catalog join (lake-mode columnar joined with warehouse-mode columnar)
SELECT a.user_id, a.path, u.user_level
FROM lindorm_columnar.db01.access_log a
JOIN default_catalog.dw.user_profile u ON a.user_id = u.user_id
WHERE a.year = '2024' AND a.month = '03';
-- BUCKET exact query (primary key tables only): routes the query to specified buckets, skipping file scans on other buckets
-- bucket_id starts from 0 and must not exceed the total number of buckets specified at table creation; multiple buckets can be specified (comma-separated)
SELECT * FROM lindorm_columnar.db01.orders BUCKET(0);
SELECT * FROM lindorm_columnar.db01.orders BUCKET(1, 2, 3);
Caveats
-
Partition columns must be at the end of the column list. Place partition columns after all non-partition columns, otherwise
CREATE TABLEfails. -
No automatic deduplication on write. Inserting the same content multiple times produces multiple independent rows.
-
CREATE TABLE AS SELECT(CTAS) is not supported. Create the table first, then load it withINSERT INTO ... SELECT. -
TIMESTAMPis not supported. UseDATETIMEorDATEfor time fields. -
Cross-engine writes are visible in real time. Data written by an external engine becomes visible to the OLAP resource group immediately, with no extra refresh step.
Data distribution
Lake-mode columnar tables control physical data distribution through the PARTITION BY clause. A well-designed partition layout enables effective partition pruning at query time, significantly reducing the amount of data scanned.
Identity partitioning
Identity partitioning partitions data by the raw value of a column. It works best on low-cardinality discrete columns such as region, year, month, or status. When the WHERE clause includes equality predicates on the partition column, the engine automatically skips unrelated partitions, significantly reducing I/O.
-- Partition by year and month
PARTITION BY (year, month)
-- Partition by region
PARTITION BY (region)
Avoid choosing partition columns with too many distinct values — excessive small partitions inflate metadata pressure and slow query planning.
Data types
|
Type |
Description |
|
|
Boolean value. |
|
|
32-bit integer. |
|
|
64-bit integer. |
|
|
Single-precision floating point. |
|
|
Double-precision floating point. |
|
|
Exact decimal — recommended for financial and transactional data. |
|
|
Date in |
|
|
Date and time. |
|
|
Variable-length string. |
|
|
Binary data. |
|
|
Array of elements of the same type. |
|
|
Key-value map. |
Compute features
SQL analytical operators
Lake-mode columnar tables support standard SQL analytical syntax, covering the full range of compute operations expected in an OLAP workload.
Projection and filter
SELECT projects the columns you choose. Column names can be aliased with AS and used in arithmetic expressions or function calls. The WHERE clause supports compound predicates:
|
Predicate type |
Example |
|
Equality filter |
|
|
Range filter |
|
|
Set membership |
|
|
Pattern match |
|
|
NULL test |
|
|
Regex match |
|
Predicates are pushed down to the storage layer whenever possible to cut the volume of data read.
Joins
The following join types are supported:
|
Join type |
Description |
|
|
Returns matching rows from both tables. |
|
|
Keeps every row of the left table; right-side columns are NULL when no match exists. |
|
|
Keeps every row of the right table; left-side columns are NULL when no match exists. |
|
|
Keeps every row of both tables. |
|
|
Cartesian product. |
|
|
Returns rows of the left table that have a match in the right table; right-side columns are not returned. |
|
|
Returns rows of the left table that have no match in the right table. |
Cross-catalog joins are supported, letting you correlate lake-mode columnar data with warehouse-mode columnar or wide-table engine data:
-- Lake-mode columnar joined with warehouse-mode columnar
SELECT a.user_id, a.path, u.user_level
FROM lindorm_columnar.db01.access_log a
INNER JOIN default_catalog.dw.user_profile u ON a.user_id = u.user_id
WHERE a.year = '2024' AND a.month = '03'
AND u.user_level IN ('gold', 'platinum');
Set operations
|
Operator |
Description |
|
|
Combines results from multiple queries, keeping duplicates — the cheaper option. |
|
|
Combines results from multiple queries and removes duplicates. |
|
|
Returns rows present in both result sets. |
|
|
Returns rows present in the first result set but not the second. |
Aggregation (GROUP BY)
Standard aggregate functions are supported (COUNT, SUM, AVG, MIN, MAX, COUNT DISTINCT), along with HAVING for second-pass filtering on aggregated results.
The following extended grouping syntaxes compute multiple dimensional rollups in a single scan:
-- GROUPING SETS: explicit list of dimension combinations
SELECT region, month, COUNT(*) AS pv
FROM access_log
GROUP BY GROUPING SETS ((region, month), (region), ());
-- ROLLUP: hierarchical rollup from finest to grand total
SELECT region, month, COUNT(*) AS pv
FROM access_log
GROUP BY ROLLUP (region, month);
-- CUBE: all combinations across the listed dimensions
SELECT region, month, COUNT(*) AS pv
FROM access_log
GROUP BY CUBE (region, month);
Sorting and pagination
ORDER BY sorts on multiple columns (ASC / DESC), and pairs with LIMIT and OFFSET for pagination.
SELECT user_id, path, latency_ms
FROM access_log
WHERE year = '2024' AND month = '03'
ORDER BY latency_ms DESC, user_id ASC
LIMIT 20 OFFSET 40;
Subqueries
The following subquery forms are supported. The optimizer rewrites them as equivalent joins where possible:
|
Subquery form |
Example |
|
Scalar subquery |
|
|
|
|
|
|
|
Window functions
Window functions compute aggregations over a row group while preserving the original rows — a core OLAP capability. Use the OVER (PARTITION BY ... ORDER BY ... ROWS/RANGE ...) clause to define the window.
|
Category |
Common functions |
|
Ranking |
|
|
Aggregation |
|
|
Offset access |
|
|
First / last value |
|
-- Cumulative page views per user, plus monthly access ranking
SELECT
user_id,
path,
year,
month,
COUNT(*) OVER (
PARTITION BY user_id
ORDER BY month
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_pv,
RANK() OVER (
PARTITION BY month
ORDER BY COUNT(*) DESC
) AS monthly_rank
FROM access_log;
-- Latency delta between consecutive requests of the same user
SELECT
user_id,
path,
latency_ms,
latency_ms - LAG(latency_ms, 1) OVER (PARTITION BY user_id ORDER BY year, month) AS latency_diff
FROM access_log;
CASE WHEN expressions
Both searched and simple CASE WHEN forms are supported, and can be used in SELECT, WHERE, ORDER BY, aggregate functions, and elsewhere.
SELECT
user_id,
path,
latency_ms,
CASE
WHEN latency_ms < 100 THEN 'fast'
WHEN latency_ms < 500 THEN 'normal'
WHEN latency_ms < 2000 THEN 'slow'
ELSE 'timeout'
END AS latency_tier,
SUM(CASE WHEN status_code = 200 THEN 1 ELSE 0 END) AS success_cnt
FROM access_log
GROUP BY user_id, path, latency_ms;
Metadata access
Because lake-mode columnar tables are built on Iceberg, every table maintains complete metadata — snapshot history, partition state, data file inventory, and more. Append $<metadata-table-name> to a table name to query this metadata directly, with no extra configuration.
|
Metadata table |
Description |
|
|
Per-partition statistics for the current state — row count, file count, data size, etc. |
|
|
All snapshots of the table, including snapshot ID, commit time, and operation type. |
|
|
History of metadata changes on the table. |
|
|
Overview of manifest files associated with each snapshot. |
|
|
Path, format, size, and other details for every data file in the current snapshot. |
|
|
References on the table — branches and tags. |
-- Per-partition statistics
SELECT * FROM lindorm_columnar.db01.access_log$partitions;
-- Snapshot list (write history)
SELECT * FROM lindorm_columnar.db01.access_log$snapshots
ORDER BY committed_at DESC LIMIT 10;
-- Metadata change history
SELECT * FROM lindorm_columnar.db01.access_log$history;
-- Data file details for the current snapshot
SELECT * FROM lindorm_columnar.db01.access_log$files;