Indexes accelerate data queries. A columnstore index improves data analytics for large datasets in wide tables. Common scenarios include device statistics for the Internet of Vehicles and the Internet of Things (IoT), data analytics for e-commerce, and order statistics for logistics. This topic describes the basics of the new columnstore index to help you get started.
The new columnstore index is currently in invitational preview. To use it, you can contact Lindorm technical support (DingTalk ID: s0s3eg3) to request access.
Core advantages
Compared to the previous version of the columnstore index, the new version has the following core advantages:
Feature | Previous version | New version |
Synchronization latency | 15 minutes | Real-time (second-level) |
Architecture
Scenarios
Internet of Vehicles: Real-time analysis of vehicle trajectories
E-commerce: Real-time order statistics
Logistics: Real-time monitoring of shipping orders
IoT: Real-time aggregation of device data
Getting started
Prerequisites
The following engines must be enabled and meet the version requirements:
Engine | Version requirements | Purpose |
2.8.6 or later | Source data storage | |
3.9.1 or later | Real-time log subscription | |
3.10.15 or later | Index data storage | |
- | Executes analytical queries |
Example
To perform efficient, parallel data analytics on a large data table named my_tbl, you can create a columnstore index for the table.
Sample table schema:
+------------+-------------+---------+----------------+
| TABLE_NAME | COLUMN_NAME | TYPE | IS_PRIMARY_KEY |
+------------+-------------+---------+----------------+
| my_tbl | pk0 | INT | true |
| my_tbl | pk1 | VARCHAR | true |
| my_tbl | pt_d | VARCHAR | true |
| my_tbl | col0 | INT | false |
| my_tbl | col1 | VARCHAR | false |
+------------+-------------+---------+----------------+pk0: A data identifier with high cardinality. This makes it suitable for use as a bucket partition key.pt_d: The date when the data is generated. Because data is usually analyzed by day, this field is suitable for use as an enumeration partition key.
Procedure:
Refer to the following sample code to create a table and a columnstore index, view the index status, and use the columnstore index to run analytical queries.
-- 1. Create and switch to a database
CREATE DATABASE my_db;
USE my_db;
-- 2. Create a wide table
CREATE TABLE my_tbl (
pk0 INT,
pk1 VARCHAR,
pt_d VARCHAR,
col0 INT,
col1 VARCHAR,
PRIMARY KEY (pk0, pk1, pt_d)
);
-- 3. Create a columnstore index
CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(pk0, pk1, pt_d, col0, col1)
PARTITION BY ENUMERABLE (pt_d, bucket(16, pk0))
WITH (
`lindorm_columnar.user.index.database` = 'my_db',
`lindorm_columnar.user.index.table` = 'my_tbl',
`lindorm_columnar.user.index.type` = 'LCE'
);
-- 4. View the index status
SHOW INDEX FROM my_tbl;
-- 5. Use the columnstore index for an analytical query
SELECT /*+ _use_ldps_(cg_name), _columnar_index_ */
pk1, SUM(col0)
FROM my_db.my_tbl
WHERE pt_d = '2024-01-01'
GROUP BY pk1;Create an index
Syntax
CREATE INDEX index_name USING COLUMNAR ON table_name(column_list)
PARTITION BY ENUMERABLE (partition_expression)
WITH (properties);Parameters:
Parameter | Description |
| The name of the index. |
| The name of the wide table. |
| The index columns. For more information, see Index columns. |
| The partition expression. For more information, see Partition configuration. |
| The index properties. For more information, see Property configuration. |
Index columns
The index columns must include all primary key fields of the wide table.
Supported data types: TINYINT, SMALLINT, INTEGER, BIGINT, LONG, FLOAT, DOUBLE, VARCHAR, BINARY, VARBINARY, BOOLEAN, DECIMAL, JSON, DATE, and TIMESTAMP.
The JSON data type is stored as VARCHAR in the column store.
Partition configuration
The partition expression specifies that index data is partitioned using an enumeration algorithm to improve retrieval efficiency during queries. A partition expression includes an enumeration partition expression and a bucket partition expression.
The partition keys in the partition expression must be primary key columns of the wide table.
Enumeration partition expression
You can specify zero or more expressions. Separate multiple expressions with commas (,).
For an enumeration partition expression, you typically select primary key fields such as city or date. The index data is built based on different partition values. This lets you efficiently locate data using partition filter conditions during queries.
Bucket partition expression
You must specify at least one. The syntax is bucket(bucket_num, column_name):
bucket_num: The number of bucket partitions.column_name: The bucket partition field, which is used to calculate the bucket partition number.
The bucket partition number is calculated as follows: bucket_index = hash(column_name) % bucket_num
The bucket partition field must be a primary key field of the wide table and must have sufficient discrete features to prevent data skew.
Partition quantity planning
The enumeration partition expression and the bucket partition expression together determine the number of partitions for the index data. The data volume of each partition must be between 50 MB and 512 MB.
Example: The primary key of the wide table is (uid, dt), and the daily data volume is about 50 GB. You can configure the partition as follows:
PARTITION BY ENUMERABLE (dt, bucket(200, uid))When you design a partition policy, avoid using high-cardinality fields (fields with many unique values) as enumeration partition keys. This can create numerous partitions and small files, which adversely affects the storage system.
Property configuration
Property | Required | Description |
| Yes | Must be set to |
| Yes | The name of the destination database in the columnstore engine. |
| Yes | The name of the destination table in the columnstore engine. |
| No |
|
lindorm_columnar.user.index.type = 'LCE' is required. If this property is not included, the index is created using the process for the previous version.
Data synchronization
Mode | Configuration | Description |
Full + Incremental | Default | Synchronizes both historical data and incremental data. |
Incremental only |
| Skips historical data and synchronizes only incremental data. |
Example of incremental synchronization only:
CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(pk0, pk1, pt_d, col0, col1)
PARTITION BY ENUMERABLE (pt_d, bucket(128, pk0))
WITH (
`lindorm_columnar.user.index.database` = 'my_db',
`lindorm_columnar.user.index.table` = 'my_tbl',
`lindorm_columnar.user.index.type` = 'LCE',
`lindorm_columnar.user.syncer.skip.fullsync` = 'true'
);Expand JSON fields
You can use the json_extract function to expand JSON fields into separate columns. This facilitates efficient queries and analysis in the columnstore engine.
Supported functions
Function | Return type | Description |
| STRING | Extracts a string value. |
| LONG | Extracts a long integer value. |
| DOUBLE | Extracts a double-precision floating-point value. |
| BOOLEAN | Extracts a Boolean value. |
If the specified path does not exist in the JSON field, or if the data type does not match the function, null is returned.
Example
Assume that the wide table my_tbl contains a JSON column named json_col with the following structure:
{
"a": {
"b": {
"c": "hello,world",
"d": 123
},
"e": false
},
"f": 3.14
}You can expand a.b.c, a.b.d, a.e, and f into separate columns:
CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(
pk0, pk1, pt_d,
json_extract_string(json_col, '$.a.b.c'),
json_extract_long(json_col, '$.a.b.d'),
json_extract_boolean(json_col, '$.a.e'),
json_extract_double(json_col, '$.f')
)
PARTITION BY ENUMERABLE (pt_d, bucket(16, pk0))
WITH (
`lindorm_columnar.user.index.database` = 'my_db',
`lindorm_columnar.user.index.table` = 'my_tbl',
`lindorm_columnar.user.index.type` = 'LCE'
);Column name mapping rule
After a JSON field is expanded, the column name in the columnstore engine uses the format <JSON_column_name>.<field_path>. The data type is determined by the json_extract function. The columns generated in the preceding example are as follows:
Column name | Type | Description |
| STRING | Stores the value of |
| LONG | Stores the value of |
| BOOLEAN | Stores the value of |
| DOUBLE | Stores the value of |
Retain original JSON content
By default, when you use json_extract to expand fields, only the expanded columns are synchronized. The original JSON column is not synchronized to the columnstore engine. To retain the original JSON content, you can set the following property:
`lindorm_columnar.user.syncer.lci.json.syncOriginalJsonContent` = 'true'Example:
You can retain the original json_col column while expanding its fields. This lets you query the complete JSON content later.
CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(
pk0, pk1, pt_d, json_col,
json_extract_string(json_col, '$.a.b.c'),
json_extract_long(json_col, '$.a.b.d')
)
PARTITION BY ENUMERABLE (pt_d, bucket(16, pk0))
WITH (
`lindorm_columnar.user.index.database` = 'my_db',
`lindorm_columnar.user.index.table` = 'my_tbl',
`lindorm_columnar.user.index.type` = 'LCE',
`lindorm_columnar.user.syncer.lci.json.syncOriginalJsonContent` = 'true'
);Query an index
Syntax
You can use a HINT to specify that the query should use the columnstore index instead of the raw data from the wide table.
SELECT /*+ _use_ldps_(cg_name), _columnar_index_ */ ...
FROM index_database.index_table
WHERE ...;In the syntax, cg_name is the name of the OLAP resource group for the compute engine.
Examples
-- Aggregate query
SELECT /*+ _use_ldps_(cg_name), _columnar_index_ */
pk1, COUNT(*), SUM(col0)
FROM my_db.my_tbl
WHERE pt_d = '2024-01-01'
GROUP BY pk1;
-- Detail query
SELECT /*+ _use_ldps_(cg_name), _columnar_index_ */ *
FROM my_db.my_tbl
WHERE pt_d = '2024-01-01' AND pk0 = 12345
LIMIT 100;Manage indexes
View an index
SHOW INDEX FROM table_name;The output includes information such as the index status and synchronization progress.
Modify an index
Add columns:
ALTER INDEX index_name ON table_name ADD COLUMNS (col1, col2);Delete an index
DROP INDEX index_name ON table_name;FAQ
Q: Can a partition expression contain non-primary key fields?
A: No, it cannot. All fields in a partition expression must be primary key fields.
Q: Can a bucket partition expression contain a complex partition expression?
A: No, it cannot. A bucket partition expression includes only
bucket_numand thebucketpartition field.Q: What are the effects of having too many or too few partitions?
A: Too many partitions can lead to data bloat and affect query efficiency. Ensure that the data volume of a single partition is greater than 50 MB and that
bucket_numin thebucketpartition expression is less than 1024. Too few partitions can affect read and write throughput or cause data skew. Ensure that the data volume of a single partition is less than 512 MB.Q: Can I create multiple columnstore indexes for the same wide table?
A: No, you cannot. You can create only one columnstore index for a wide table.
Q: If data in a wide table is deleted after its TTL expires, is the corresponding columnstore index data automatically deleted?
A: No, it is not.
Q: Can I delete index columns?
A: No, you cannot.