New columnstore index

更新时间:
复制 MD 格式

Indexes accelerate data queries. A columnstore index improves data analytics for large datasets in wide tables. Common scenarios include device statistics for the Internet of Vehicles and the Internet of Things (IoT), data analytics for e-commerce, and order statistics for logistics. This topic describes the basics of the new columnstore index to help you get started.

Important

The new columnstore index is currently in invitational preview. To use it, you can contact Lindorm technical support (DingTalk ID: s0s3eg3) to request access.

Core advantages

Compared to the previous version of the columnstore index, the new version has the following core advantages:

Feature

Previous version

New version

Synchronization latency

15 minutes

Real-time (second-level)

Architecture

image

Scenarios

  • Internet of Vehicles: Real-time analysis of vehicle trajectories

  • E-commerce: Real-time order statistics

  • Logistics: Real-time monitoring of shipping orders

  • IoT: Real-time aggregation of device data

Getting started

Prerequisites

The following engines must be enabled and meet the version requirements:

Engine

Version requirements

Purpose

LindormTable

2.8.6 or later

Source data storage

LTS

3.9.1 or later

Real-time log subscription

Columnstore engine

3.10.15 or later

Index data storage

Compute engine

-

Executes analytical queries

Example

To perform efficient, parallel data analytics on a large data table named my_tbl, you can create a columnstore index for the table.

Sample table schema:

+------------+-------------+---------+----------------+
| TABLE_NAME | COLUMN_NAME |  TYPE   | IS_PRIMARY_KEY |
+------------+-------------+---------+----------------+
| my_tbl     | pk0         | INT     | true           |
| my_tbl     | pk1         | VARCHAR | true           |
| my_tbl     | pt_d        | VARCHAR | true           |
| my_tbl     | col0        | INT     | false          |
| my_tbl     | col1        | VARCHAR | false          |
+------------+-------------+---------+----------------+
  • pk0: A data identifier with high cardinality. This makes it suitable for use as a bucket partition key.

  • pt_d: The date when the data is generated. Because data is usually analyzed by day, this field is suitable for use as an enumeration partition key.

Procedure:

  1. Connect to and use LindormTable using Lindorm-cli.

  2. Refer to the following sample code to create a table and a columnstore index, view the index status, and use the columnstore index to run analytical queries.

-- 1. Create and switch to a database
CREATE DATABASE my_db;
USE my_db;

-- 2. Create a wide table
CREATE TABLE my_tbl (
  pk0 INT,
  pk1 VARCHAR,
  pt_d VARCHAR,
  col0 INT,
  col1 VARCHAR,
  PRIMARY KEY (pk0, pk1, pt_d)
);

-- 3. Create a columnstore index
CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(pk0, pk1, pt_d, col0, col1)
PARTITION BY ENUMERABLE (pt_d, bucket(16, pk0))
WITH (
  `lindorm_columnar.user.index.database` = 'my_db',
  `lindorm_columnar.user.index.table` = 'my_tbl',
  `lindorm_columnar.user.index.type` = 'LCE'
);

-- 4. View the index status
SHOW INDEX FROM my_tbl;

-- 5. Use the columnstore index for an analytical query
SELECT /*+ _use_ldps_(cg_name), _columnar_index_ */ 
  pk1, SUM(col0) 
FROM my_db.my_tbl 
WHERE pt_d = '2024-01-01'
GROUP BY pk1;

Create an index

Syntax

CREATE INDEX index_name USING COLUMNAR ON table_name(column_list)
PARTITION BY ENUMERABLE (partition_expression)
WITH (properties);

Parameters:

Parameter

Description

index_name

The name of the index.

table_name

The name of the wide table.

column_list

The index columns. For more information, see Index columns.

partition_expression

The partition expression. For more information, see Partition configuration.

properties

The index properties. For more information, see Property configuration.

Index columns

The index columns must include all primary key fields of the wide table.

Supported data types: TINYINT, SMALLINT, INTEGER, BIGINT, LONG, FLOAT, DOUBLE, VARCHAR, BINARY, VARBINARY, BOOLEAN, DECIMAL, JSON, DATE, and TIMESTAMP.

The JSON data type is stored as VARCHAR in the column store.

Partition configuration

The partition expression specifies that index data is partitioned using an enumeration algorithm to improve retrieval efficiency during queries. A partition expression includes an enumeration partition expression and a bucket partition expression.

Important

The partition keys in the partition expression must be primary key columns of the wide table.

Enumeration partition expression

You can specify zero or more expressions. Separate multiple expressions with commas (,).

For an enumeration partition expression, you typically select primary key fields such as city or date. The index data is built based on different partition values. This lets you efficiently locate data using partition filter conditions during queries.

Bucket partition expression

You must specify at least one. The syntax is bucket(bucket_num, column_name):

  • bucket_num: The number of bucket partitions.

  • column_name: The bucket partition field, which is used to calculate the bucket partition number.

The bucket partition number is calculated as follows: bucket_index = hash(column_name) % bucket_num

The bucket partition field must be a primary key field of the wide table and must have sufficient discrete features to prevent data skew.

Partition quantity planning

The enumeration partition expression and the bucket partition expression together determine the number of partitions for the index data. The data volume of each partition must be between 50 MB and 512 MB.

Example: The primary key of the wide table is (uid, dt), and the daily data volume is about 50 GB. You can configure the partition as follows:

PARTITION BY ENUMERABLE (dt, bucket(200, uid))
Note

When you design a partition policy, avoid using high-cardinality fields (fields with many unique values) as enumeration partition keys. This can create numerous partitions and small files, which adversely affects the storage system.

Property configuration

Property

Required

Description

lindorm_columnar.user.index.type

Yes

Must be set to 'LCE'

lindorm_columnar.user.index.database

Yes

The name of the destination database in the columnstore engine.

lindorm_columnar.user.index.table

Yes

The name of the destination table in the columnstore engine.

lindorm_columnar.user.syncer.skip.fullsync

No

'true': Synchronizes only incremental data.

Important

lindorm_columnar.user.index.type = 'LCE' is required. If this property is not included, the index is created using the process for the previous version.

Data synchronization

Mode

Configuration

Description

Full + Incremental

Default

Synchronizes both historical data and incremental data.

Incremental only

lindorm_columnar.user.syncer.skip.fullsync='true'

Skips historical data and synchronizes only incremental data.

Example of incremental synchronization only:

CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(pk0, pk1, pt_d, col0, col1)
PARTITION BY ENUMERABLE (pt_d, bucket(128, pk0))
WITH (
  `lindorm_columnar.user.index.database` = 'my_db',
  `lindorm_columnar.user.index.table` = 'my_tbl',
  `lindorm_columnar.user.index.type` = 'LCE',
  `lindorm_columnar.user.syncer.skip.fullsync` = 'true'
);

Expand JSON fields

You can use the json_extract function to expand JSON fields into separate columns. This facilitates efficient queries and analysis in the columnstore engine.

Supported functions

Function

Return type

Description

json_extract_string

STRING

Extracts a string value.

json_extract_long

LONG

Extracts a long integer value.

json_extract_double

DOUBLE

Extracts a double-precision floating-point value.

json_extract_boolean

BOOLEAN

Extracts a Boolean value.

Note

If the specified path does not exist in the JSON field, or if the data type does not match the function, null is returned.

Example

Assume that the wide table my_tbl contains a JSON column named json_col with the following structure:

{
  "a": {
    "b": {
      "c": "hello,world",
      "d": 123
    },
    "e": false
  },
  "f": 3.14
}

You can expand a.b.c, a.b.d, a.e, and f into separate columns:

CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(
  pk0, pk1, pt_d,
  json_extract_string(json_col, '$.a.b.c'),
  json_extract_long(json_col, '$.a.b.d'),
  json_extract_boolean(json_col, '$.a.e'),
  json_extract_double(json_col, '$.f')
)
PARTITION BY ENUMERABLE (pt_d, bucket(16, pk0))
WITH (
  `lindorm_columnar.user.index.database` = 'my_db',
  `lindorm_columnar.user.index.table` = 'my_tbl',
  `lindorm_columnar.user.index.type` = 'LCE'
);

Column name mapping rule

After a JSON field is expanded, the column name in the columnstore engine uses the format <JSON_column_name>.<field_path>. The data type is determined by the json_extract function. The columns generated in the preceding example are as follows:

Column name

Type

Description

json_col.a.b.c

STRING

Stores the value of a.b.c from the json_col field.

json_col.a.b.d

LONG

Stores the value of a.b.d from the json_col field.

json_col.a.e

BOOLEAN

Stores the value of a.e from the json_col field.

json_col.f

DOUBLE

Stores the value of f from the json_col field.

Retain original JSON content

By default, when you use json_extract to expand fields, only the expanded columns are synchronized. The original JSON column is not synchronized to the columnstore engine. To retain the original JSON content, you can set the following property:

`lindorm_columnar.user.syncer.lci.json.syncOriginalJsonContent` = 'true'

Example:

You can retain the original json_col column while expanding its fields. This lets you query the complete JSON content later.

CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(
  pk0, pk1, pt_d, json_col,
  json_extract_string(json_col, '$.a.b.c'),
  json_extract_long(json_col, '$.a.b.d')
)
PARTITION BY ENUMERABLE (pt_d, bucket(16, pk0))
WITH (
  `lindorm_columnar.user.index.database` = 'my_db',
  `lindorm_columnar.user.index.table` = 'my_tbl',
  `lindorm_columnar.user.index.type` = 'LCE',
  `lindorm_columnar.user.syncer.lci.json.syncOriginalJsonContent` = 'true'
);

Query an index

Syntax

You can use a HINT to specify that the query should use the columnstore index instead of the raw data from the wide table.

SELECT /*+ _use_ldps_(cg_name), _columnar_index_ */ ...
FROM index_database.index_table
WHERE ...;

In the syntax, cg_name is the name of the OLAP resource group for the compute engine.

Examples

-- Aggregate query
SELECT /*+ _use_ldps_(cg_name), _columnar_index_ */ 
  pk1, COUNT(*), SUM(col0)
FROM my_db.my_tbl
WHERE pt_d = '2024-01-01'
GROUP BY pk1;

-- Detail query
SELECT /*+ _use_ldps_(cg_name), _columnar_index_ */ *
FROM my_db.my_tbl
WHERE pt_d = '2024-01-01' AND pk0 = 12345
LIMIT 100;

Manage indexes

View an index

SHOW INDEX FROM table_name;

The output includes information such as the index status and synchronization progress.

Modify an index

Add columns:

ALTER INDEX index_name ON table_name ADD COLUMNS (col1, col2);

Delete an index

DROP INDEX index_name ON table_name;

FAQ

  • Q: Can a partition expression contain non-primary key fields?

    A: No, it cannot. All fields in a partition expression must be primary key fields.

  • Q: Can a bucket partition expression contain a complex partition expression?

    A: No, it cannot. A bucket partition expression includes only bucket_num and the bucket partition field.

  • Q: What are the effects of having too many or too few partitions?

    A: Too many partitions can lead to data bloat and affect query efficiency. Ensure that the data volume of a single partition is greater than 50 MB and that bucket_num in the bucket partition expression is less than 1024. Too few partitions can affect read and write throughput or cause data skew. Ensure that the data volume of a single partition is less than 512 MB.

  • Q: Can I create multiple columnstore indexes for the same wide table?

    A: No, you cannot. You can create only one columnstore index for a wide table.

  • Q: If data in a wide table is deleted after its TTL expires, is the corresponding columnstore index data automatically deleted?

    A: No, it is not.

  • Q: Can I delete index columns?

    A: No, you cannot.