New columnstore index-Lindorm(Lindorm)-阿里云帮助中心

Indexes accelerate data queries. A columnstore index improves data analytics for large datasets in wide tables. Common scenarios include device statistics for the Internet of Vehicles and the Internet of Things (IoT), data analytics for e-commerce, and order statistics for logistics. This topic describes the basics of the new columnstore index to help you get started.

Important

The new columnstore index is currently in invitational preview. To use it, you can contact Lindorm technical support (DingTalk ID: s0s3eg3) to request access.

Core advantages

Compared to the previous version of the columnstore index, the new version has the following core advantages:

Feature	Previous version	New version
Synchronization latency	15 minutes	Real-time (second-level)

Architecture

Scenarios

Internet of Vehicles: Real-time analysis of vehicle trajectories
E-commerce: Real-time order statistics
Logistics: Real-time monitoring of shipping orders
IoT: Real-time aggregation of device data

Getting started

Prerequisites

The following engines must be enabled and meet the version requirements:

Engine	Version requirements	Purpose
LindormTable	2.8.6 or later	Source data storage
LTS	3.9.1 or later	Real-time log subscription
Columnstore engine	3.10.15 or later	Index data storage
Compute engine	-	Executes analytical queries

Example

To perform efficient, parallel data analytics on a large data table named my_tbl, you can create a columnstore index for the table.

Sample table schema:

+------------+-------------+---------+----------------+
| TABLE_NAME | COLUMN_NAME |  TYPE   | IS_PRIMARY_KEY |
+------------+-------------+---------+----------------+
| my_tbl     | pk0         | INT     | true           |
| my_tbl     | pk1         | VARCHAR | true           |
| my_tbl     | pt_d        | VARCHAR | true           |
| my_tbl     | col0        | INT     | false          |
| my_tbl     | col1        | VARCHAR | false          |
+------------+-------------+---------+----------------+

pk0: A data identifier with high cardinality. This makes it suitable for use as a bucket partition key.
pt_d: The date when the data is generated. Because data is usually analyzed by day, this field is suitable for use as an enumeration partition key.

Procedure:

Connect to and use LindormTable using Lindorm-cli.
Refer to the following sample code to create a table and a columnstore index, view the index status, and use the columnstore index to run analytical queries.

-- 1. Create and switch to a database
CREATE DATABASE my_db;
USE my_db;

-- 2. Create a wide table
CREATE TABLE my_tbl (
  pk0 INT,
  pk1 VARCHAR,
  pt_d VARCHAR,
  col0 INT,
  col1 VARCHAR,
  PRIMARY KEY (pk0, pk1, pt_d)
);

-- 3. Create a columnstore index
CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(pk0, pk1, pt_d, col0, col1)
PARTITION BY ENUMERABLE (pt_d, bucket(16, pk0))
WITH (
  `lindorm_columnar.user.index.database` = 'my_db',
  `lindorm_columnar.user.index.table` = 'my_tbl',
  `lindorm_columnar.user.index.type` = 'LCE'
);

-- 4. View the index status
SHOW INDEX FROM my_tbl;

-- 5. Use the columnstore index for an analytical query
SELECT /*+ _use_ldps_(cg_name), _columnar_index_ */ 
  pk1, SUM(col0) 
FROM my_db.my_tbl 
WHERE pt_d = '2024-01-01'
GROUP BY pk1;

Create an index

Syntax

CREATE INDEX index_name USING COLUMNAR ON table_name(column_list)
PARTITION BY ENUMERABLE (partition_expression)
WITH (properties);

Parameters:

Parameter	Description
`index_name`	The name of the index.
`table_name`	The name of the wide table.
`column_list`	The index columns. For more information, see Index columns.
`partition_expression`	The partition expression. For more information, see Partition configuration.
`properties`	The index properties. For more information, see Property configuration.

Index columns

The index columns must include all primary key fields of the wide table.

Supported data types: TINYINT, SMALLINT, INTEGER, BIGINT, LONG, FLOAT, DOUBLE, VARCHAR, BINARY, VARBINARY, BOOLEAN, DECIMAL, JSON, DATE, and TIMESTAMP.

The JSON data type is stored as VARCHAR in the column store.

Partition configuration

The partition expression specifies that index data is partitioned using an enumeration algorithm to improve retrieval efficiency during queries. A partition expression includes an enumeration partition expression and a bucket partition expression.

Important

The partition keys in the partition expression must be primary key columns of the wide table.

Enumeration partition expression

You can specify zero or more expressions. Separate multiple expressions with commas (,).

For an enumeration partition expression, you typically select primary key fields such as city or date. The index data is built based on different partition values. This lets you efficiently locate data using partition filter conditions during queries.

Bucket partition expression

You must specify at least one. The syntax is bucket(bucket_num, column_name):

bucket_num: The number of bucket partitions.
column_name: The bucket partition field, which is used to calculate the bucket partition number.

The bucket partition number is calculated as follows: bucket_index = hash(column_name) % bucket_num

The bucket partition field must be a primary key field of the wide table and must have sufficient discrete features to prevent data skew.

Partition quantity planning

The enumeration partition expression and the bucket partition expression together determine the number of partitions for the index data. The data volume of each partition must be between 50 MB and 512 MB.

Example: The primary key of the wide table is (uid, dt), and the daily data volume is about 50 GB. You can configure the partition as follows:

PARTITION BY ENUMERABLE (dt, bucket(200, uid))

Note

When you design a partition policy, avoid using high-cardinality fields (fields with many unique values) as enumeration partition keys. This can create numerous partitions and small files, which adversely affects the storage system.

Property configuration

Property	Required	Description
`lindorm_columnar.user.index.type`	Yes	Must be set to `'LCE'`
`lindorm_columnar.user.index.database`	Yes	The name of the destination database in the columnstore engine.
`lindorm_columnar.user.index.table`	Yes	The name of the destination table in the columnstore engine.
`lindorm_columnar.user.syncer.skip.fullsync`	No	`'true'`: Synchronizes only incremental data.

Important

lindorm_columnar.user.index.type = 'LCE' is required. If this property is not included, the index is created using the process for the previous version.

Data synchronization

Mode	Configuration	Description
Full + Incremental	Default	Synchronizes both historical data and incremental data.
Incremental only	`lindorm_columnar.user.syncer.skip.fullsync`=`'true'`	Skips historical data and synchronizes only incremental data.

Example of incremental synchronization only:

CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(pk0, pk1, pt_d, col0, col1)
PARTITION BY ENUMERABLE (pt_d, bucket(128, pk0))
WITH (
  `lindorm_columnar.user.index.database` = 'my_db',
  `lindorm_columnar.user.index.table` = 'my_tbl',
  `lindorm_columnar.user.index.type` = 'LCE',
  `lindorm_columnar.user.syncer.skip.fullsync` = 'true'
);

Expand JSON fields

You can use the json_extract function to expand JSON fields into separate columns. This facilitates efficient queries and analysis in the columnstore engine.

Supported functions

Function	Return type	Description
`json_extract_string`	STRING	Extracts a string value.
`json_extract_long`	LONG	Extracts a long integer value.
`json_extract_double`	DOUBLE	Extracts a double-precision floating-point value.
`json_extract_boolean`	BOOLEAN	Extracts a Boolean value.

Note

If the specified path does not exist in the JSON field, or if the data type does not match the function, null is returned.

Example

Assume that the wide table my_tbl contains a JSON column named json_col with the following structure:

{
  "a": {
    "b": {
      "c": "hello,world",
      "d": 123
    },
    "e": false
  },
  "f": 3.14
}

You can expand a.b.c, a.b.d, a.e, and f into separate columns:

CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(
  pk0, pk1, pt_d,
  json_extract_string(json_col, '$.a.b.c'),
  json_extract_long(json_col, '$.a.b.d'),
  json_extract_boolean(json_col, '$.a.e'),
  json_extract_double(json_col, '$.f')
)
PARTITION BY ENUMERABLE (pt_d, bucket(16, pk0))
WITH (
  `lindorm_columnar.user.index.database` = 'my_db',
  `lindorm_columnar.user.index.table` = 'my_tbl',
  `lindorm_columnar.user.index.type` = 'LCE'
);

Column name mapping rule

After a JSON field is expanded, the column name in the columnstore engine uses the format <JSON_column_name>.<field_path>. The data type is determined by the json_extract function. The columns generated in the preceding example are as follows:

Column name	Type	Description
`json_col.a.b.c`	STRING	Stores the value of `a.b.c` from the `json_col` field.
`json_col.a.b.d`	LONG	Stores the value of `a.b.d` from the `json_col` field.
`json_col.a.e`	BOOLEAN	Stores the value of `a.e` from the `json_col` field.
`json_col.f`	DOUBLE	Stores the value of `f` from the `json_col` field.

Retain original JSON content

By default, when you use json_extract to expand fields, only the expanded columns are synchronized. The original JSON column is not synchronized to the columnstore engine. To retain the original JSON content, you can set the following property:

`lindorm_columnar.user.syncer.lci.json.syncOriginalJsonContent` = 'true'

Example:

You can retain the original json_col column while expanding its fields. This lets you query the complete JSON content later.

CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(
  pk0, pk1, pt_d, json_col,
  json_extract_string(json_col, '$.a.b.c'),
  json_extract_long(json_col, '$.a.b.d')
)
PARTITION BY ENUMERABLE (pt_d, bucket(16, pk0))
WITH (
  `lindorm_columnar.user.index.database` = 'my_db',
  `lindorm_columnar.user.index.table` = 'my_tbl',
  `lindorm_columnar.user.index.type` = 'LCE',
  `lindorm_columnar.user.syncer.lci.json.syncOriginalJsonContent` = 'true'
);

Query an index

Syntax

You can use a HINT to specify that the query should use the columnstore index instead of the raw data from the wide table.

SELECT /*+ _use_ldps_(cg_name), _columnar_index_ */ ...
FROM index_database.index_table
WHERE ...;

In the syntax, cg_name is the name of the OLAP resource group for the compute engine.

Examples

-- Aggregate query
SELECT /*+ _use_ldps_(cg_name), _columnar_index_ */ 
  pk1, COUNT(*), SUM(col0)
FROM my_db.my_tbl
WHERE pt_d = '2024-01-01'
GROUP BY pk1;

-- Detail query
SELECT /*+ _use_ldps_(cg_name), _columnar_index_ */ *
FROM my_db.my_tbl
WHERE pt_d = '2024-01-01' AND pk0 = 12345
LIMIT 100;

Manage indexes

View an index

SHOW INDEX FROM table_name;

The output includes information such as the index status and synchronization progress.

Modify an index

Add columns:

ALTER INDEX index_name ON table_name ADD COLUMNS (col1, col2);

Delete an index

DROP INDEX index_name ON table_name;

FAQ

Q: Can a partition expression contain non-primary key fields?
A: No, it cannot. All fields in a partition expression must be primary key fields.
Q: Can a bucket partition expression contain a complex partition expression?
A: No, it cannot. A bucket partition expression includes only bucket_num and the bucket partition field.
Q: What are the effects of having too many or too few partitions?
A: Too many partitions can lead to data bloat and affect query efficiency. Ensure that the data volume of a single partition is greater than 50 MB and that bucket_num in the bucket partition expression is less than 1024. Too few partitions can affect read and write throughput or cause data skew. Ensure that the data volume of a single partition is less than 512 MB.
Q: Can I create multiple columnstore indexes for the same wide table?
A: No, you cannot. You can create only one columnstore index for a wide table.
Q: If data in a wide table is deleted after its TTL expires, is the corresponding columnstore index data automatically deleted?
A: No, it is not.
Q: Can I delete index columns?
A: No, you cannot.