Wide table engine data

更新时间:
复制 MD 格式

OLAP resource groups support various query patterns for wide table data, including primary key lookups, non-primary key filtering, analytics, full-text search, and vector retrieval. They can also leverage various indexes in the wide table engine to automatically accelerate queries.

Prerequisites

  • You have activated Lindorm (Activate Service) and enabled an OLAP resource group.

  • You have created a Lindorm instance (Create Instance) and it contains wide table data.

Manage wide tables

OLAP resource groups support full lifecycle management for databases and tables in the wide table engine.

Create a database

CREATE DATABASE <db_name>;

Example:

CREATE DATABASE mydb;

Drop a database

DROP DATABASE <db_name>;

Example:

DROP DATABASE mydb;

Create a wide table

For detailed SQL syntax, see CREATE TABLE. OLAP resource groups support a subset of common data types.

CREATE TABLE <table_name> (
  <col_name> <data_type> [NOT NULL],
  ...
  PRIMARY KEY(<pk_col1> [, <pk_col2>, ...])
);
  • Primary key columns must be declared as NOT NULL. A primary key consists of one or more columns that uniquely identify a row.

  • OLAP resource groups support the following data types:

Type

Description

BOOLEAN

Boolean value

TINYINT

8-bit integer

SMALLINT

16-bit integer

INT / INTEGER

32-bit integer

BIGINT

64-bit integer

FLOAT

Single-precision floating-point number

DOUBLE

Double-precision floating-point number

VARCHAR / VARCHAR(N)

Variable-length string

BINARY / VARBINARY

Binary data

Example:

USE mydb;

-- User event log table (composite primary key)
CREATE TABLE user_event (
  user_id    VARCHAR NOT NULL,
  event_id   BIGINT  NOT NULL,
  event_type TINYINT,
  amount     DOUBLE,
  region     VARCHAR,
  PRIMARY KEY(user_id, event_id)
);

-- Orders table (single-column primary key)
CREATE TABLE orders (
  order_id VARCHAR NOT NULL,
  user_id  VARCHAR,
  status   TINYINT,
  total    DOUBLE,
  PRIMARY KEY(order_id)
);

Drop a wide table

DROP TABLE <table_name>;

Example:

DROP TABLE orders;

Write data

OLAP resource groups support writing data to the wide table engine using the INSERT INTO statement. The wide table engine does not support UPDATE or DELETE operations. To update data, you must use the native APIs of the wide table engine, such as the HBase or Phoenix APIs.

Write a single row

INSERT INTO <table_name> VALUES (<val1>, <val2>, ...);

Write multiple rows in a batch

INSERT INTO <table_name> VALUES
  (<val1>, <val2>, ...),
  (<val1>, <val2>, ...),
  ...;

Example:

USE mydb;

-- Write a single row
INSERT INTO orders VALUES ('u001', 'o10001', 100.0, 'SHIPPED', 'Beijing', '2024-01-01');

-- Write multiple rows in a batch
INSERT INTO orders VALUES
  ('u001', 'o10002', 200.0, 'SHIPPED',   'Shanghai', '2024-01-02'),
  ('u002', 'o20001', 50.0,  'PENDING',   'Beijing',  '2024-01-03'),
  ('u002', 'o20002', 300.0, 'DELIVERED', 'Shenzhen', '2024-01-04');

Manage indexes

The wide table engine supports multiple index types. You can create and manage these indexes with SQL through an OLAP resource group to accelerate queries.

Columnar index

A columnar index synchronizes wide table data to columnar storage for efficient scanning by the OLAP resource group. It is suited for large-scale aggregation and analytics scenarios.

CREATE INDEX <index_name>
USING COLUMNAR
ON <table_name>(*)
PARTITION BY ENUMERABLE (
    [<pk_col> [, ...],]
    bucket(<bucket_num>, <pk_col>)
)
WITH (
    `lindorm_columnar.user.index.database` = '<db_name>',
    `lindorm_columnar.user.index.table` = '<table_name>'
);

Syntax description:

Parameter

Description

index_name

The name of the index.

table_name

The name of the target wide table.

PARTITION BY ENUMERABLE

Specifies the partitioning strategy for the index data. It consists of standard partition expressions and bucket partition expressions. All columns in these expressions must be primary key columns of the wide table.

Standard partition expression

Optional. Specifies zero or more low-cardinality primary key columns, such as dates or cities, separated by commas. Index data is built for each partition value, which allows queries to use partition predicates to efficiently locate data. Avoid using high-cardinality columns, such as user IDs, as standard partition keys to prevent creating a large number of small partitions.

bucket(<bucket_num>, <pk_col>)

Required. You must specify at least one bucket expression. bucket_num is the number of buckets, and pk_col is a primary key column. The system hashes the column value and calculates the remainder modulo bucket_num to determine the bucket ID. Choose a bucket column with sufficient distribution to prevent data skew.

lindorm_columnar.user.index.database

The destination database for the columnar index data.

lindorm_columnar.user.index.table

The destination table for the columnar index data.

Note The standard partition and bucket partition expressions together determine the number of index partitions. We recommend that you keep the data size of each partition between 50 MB and 512 MB. For example, if a wide table ingests 50 GB of data per day and its primary key is (id, dt), you can configure the partition as PARTITION BY ENUMERABLE (dt, bucket(200, id)).

Example:

-- Create a columnar index for the 'test' table.
-- The wide table primary key is (p2, createtime).
-- The index uses 'createtime' for standard partitioning and hash-buckets 'p2' into 128 buckets.
CREATE INDEX columnar_idx
USING COLUMNAR
ON test(*)
PARTITION BY ENUMERABLE (
    createtime,
    bucket(128, p2)
)
WITH (
    `lindorm_columnar.user.index.database` = 'mydb',
    `lindorm_columnar.user.index.table` = 'test'
);

Secondary index

A secondary index builds an independent index structure on specified columns to accelerate point lookups and range queries on non-primary key columns. It supports a covering column to avoid additional table access.

CREATE INDEX <index_name>
ON <table_name>(<col_name> [ASC|DESC] [, ...])
[INCLUDE(<col_name> [, ...])];

Syntax description:

Parameter

Description

index_name

The name of the index.

table_name

The name of the target wide table.

col_name [ASC\|DESC]

The indexed columns and their sort order. You can specify multiple columns. Query predicates must match the prefix of the indexed columns.

INCLUDE

Specifies the covering columns. These additional columns are stored in the index, which allows queries to return their values without table access.

Example:

-- Create a secondary index on the c1 column of the 'test' table, covering columns c3 and c4 to avoid table access.
CREATE INDEX secondary_idx ON test(c1 DESC) INCLUDE(c3, c4);

Search index

A search index is based on an inverted index and supports full-text search (MATCH) and exact queries on structured fields. The text field type is used for full-text search with tokenization, while the string field type is used for exact matches and supports comparison operators such as = and LIKE.

CREATE INDEX <index_name>
USING SEARCH
ON <table_name>(<col_spec> [, ...]);

Syntax description:

Parameter

Description

index_name

The name of the index.

table_name

The name of the target wide table.

col_spec

The index configuration for a column, in the format col_name(type=<field_type>[, analyzer=<analyzer>][, indexed=true]). Use '*' to cover all columns.

type

Use text for tokenized full-text search, and string for exact matching.

analyzer

Analyzer, which is valid only when type=text. Common values are ik (Chinese tokenization) and standard (English tokenization).

indexed

Specifies whether to create a full-text index on the column. If set to trueMATCH searches are supported.

Example:

-- Create a search index for the 'test' table. All columns support structured queries, 
-- and column c3 is also enabled for Chinese tokenization and full-text search.
CREATE INDEX search_idx USING SEARCH ON test('*', c3(type=text, analyzer=ik, indexed=true));

View indexes

SHOW INDEX FROM <table_name>;

Example:

SHOW INDEX FROM test;

Drop an index

DROP INDEX <index_name> ON <table_name>;

Example:

DROP INDEX search_idx ON test;

Query examples

OLAP resource groups integrate the wide table engine with a massively parallel processing (MPP) query engine, supporting primary key lookups, non-primary key filtering, aggregate analytics, full-text search, and vector retrieval. You can build a columnar index, secondary index, or search index on a wide table. During query planning, the OLAP resource group automatically selects the appropriate index to accelerate queries by several to dozens of times, without changing your SQL.

Primary key queries

Primary key queries filter data by primary key columns, supporting exact-match lookups on the full primary key and range scans on a primary key prefix. The wide table engine is natively optimized for these queries, enabling millisecond-level lookups.

-- Create a table.
CREATE TABLE orders (
  user_id     VARCHAR NOT NULL,
  order_id    VARCHAR NOT NULL,
  amount      DOUBLE,
  status      VARCHAR,
  region      VARCHAR,
  create_time VARCHAR,
  PRIMARY KEY(user_id, order_id)
);

-- Exact-match lookup on the full primary key.
SELECT * FROM orders WHERE user_id = 'u001' AND order_id = 'o10001';

-- Range scan on a primary key prefix.
SELECT * FROM orders WHERE user_id = 'u001' AND order_id >= 'o10001';

-- Partial primary key scan.
SELECT * FROM orders WHERE user_id >= 'u001';

Non-primary key queries

Non-primary key queries filter data on non-primary key columns and perform a full table scan by default. You can create a secondary index or a search index based on the predicate type. The OLAP resource group automatically uses the index to accelerate the query without any changes to your SQL.

-- Create a table.
CREATE TABLE orders (
  user_id     VARCHAR NOT NULL,
  order_id    VARCHAR NOT NULL,
  amount      DOUBLE,
  status      VARCHAR,
  region      VARCHAR,
  create_time VARCHAR,
  PRIMARY KEY(user_id, order_id)
);

Non-primary key query with a secondary index

A secondary index is suited for structured filtering scenarios such as equality, range, and pattern matching queries.

-- Create a secondary index on the 'status' column.
-- Cover the 'amount' and 'region' columns to avoid table access.
CREATE INDEX idx_status ON orders(status) INCLUDE(amount, region);

-- Equality filter (hits the secondary index).
SELECT user_id, order_id, amount, region FROM orders WHERE status = 'SHIPPED';

-- Range filter (hits the secondary index).
SELECT user_id, order_id, amount FROM orders WHERE status >= 'S' AND status < 'T';

-- Pattern matching (hits the secondary index).
SELECT user_id, order_id, amount, region FROM orders WHERE status LIKE 'SH%';

Non-primary key query with a search index

A search index is suited for exact matches on string type fields, as well as for full-text search (MATCH) on text type fields that use tokenization.

-- Create a search index on the 'region' column (string type, supports equality and LIKE).
CREATE INDEX idx_search USING SEARCH ON orders(region(type=string, indexed=true));

-- Equality filter (hits the search index).
SELECT * FROM orders WHERE region = 'Beijing';

-- LIKE filter (hits the search index, supported for string type).
SELECT * FROM orders WHERE region LIKE 'Bei%';

-- MATCH full-text search (hits the search index).
SELECT * FROM orders WHERE region MATCH 'Beijing Shanghai';

Full-text search

After you create a search index on text fields in a wide table, you can use the MATCH syntax for full-text search on both regular and JSON fields. The OLAP resource group automatically pushes search predicates down to the inverted index for efficient retrieval.

The MATCH expression supports the following matching rules:

Syntax

Description

word

Contains the word.

+word

Must contain the word.

-word

Must not contain the word.

"phrase"

Contains the exact phrase (no tokenization).

+(word1 word2)

Must contain at least one of the words in parentheses.

-- Create a table with text and JSON fields.
CREATE TABLE orders (
  user_id     VARCHAR NOT NULL,
  order_id    VARCHAR NOT NULL,
  amount      DOUBLE,
  status      VARCHAR,
  region      VARCHAR,
  create_time VARCHAR,
  user_json   VARCHAR,
  PRIMARY KEY(user_id, order_id)
);

-- Create a search index on the 'region' column (string type for exact matching and LIKE).
-- Enable Chinese tokenization and full-text search on the 'user_json' column.
CREATE INDEX search_idx USING SEARCH ON orders(
  region(type=string, indexed=true),
  user_json(type=text, analyzer=ik, indexed=true)
);

-- Contains 'Beijing' or 'Shanghai'.
SELECT * FROM orders WHERE region MATCH 'Beijing Shanghai';

-- Must contain 'Beijing' and must not contain 'Shanghai'.
SELECT * FROM orders WHERE region MATCH '+Beijing -Shanghai';

-- Contains the exact phrase 'Beijing Chaoyang'.
SELECT * FROM orders WHERE region MATCH '"Beijing Chaoyang"';

-- Must contain 'Beijing', and must also contain either 'Chaoyang' or 'Haidian'.
SELECT * FROM orders WHERE region MATCH '+Beijing +(Chaoyang Haidian)';

-- The JSON field contains 'Beijing' (use MATCH directly on the column; the search index covers the entire column content).
SELECT * FROM orders WHERE user_json MATCH 'Beijing';

Complex analytics

OLAP resource groups support complex analytical operations on wide table data, such as large-scale aggregations, multi-table joins, and window functions. After you create a columnar index for a wide table, analytical queries read from columnar storage, which significantly reduces I/O. The OLAP resource group automatically routes queries to the index without any changes to your SQL.

-- Create a table with 'region' as a primary key prefix for easy partitioning.
CREATE TABLE orders (
  region      VARCHAR NOT NULL,
  user_id     VARCHAR NOT NULL,
  order_id    VARCHAR NOT NULL,
  amount      DOUBLE,
  status      VARCHAR,
  create_time VARCHAR,
  PRIMARY KEY(region, user_id, order_id)
);

-- Create a columnar index, using 'region' for standard partitioning and hash-bucketing 'user_id' (128 buckets).
-- Note: Columns in PARTITION BY ENUMERABLE must be primary key columns.
CREATE INDEX columnar_idx
USING COLUMNAR
ON orders(*)
PARTITION BY ENUMERABLE (
    region,
    bucket(128, user_id)
)
WITH (
    `lindorm_columnar.user.index.database` = 'mydb',
    `lindorm_columnar.user.index.table` = 'orders'
);

-- Large-scale aggregate analysis (hits the columnar index).
SELECT
    region,
    COUNT(*)       AS order_cnt,
    SUM(amount)    AS gmv,
    AVG(amount)    AS aov
FROM orders
WHERE region IN ('Shanghai', 'Beijing')
GROUP BY region
ORDER BY gmv DESC;

-- UNION query
(SELECT user_id, order_id, amount, status, region FROM orders WHERE region = 'Beijing')
UNION ALL
(SELECT user_id, order_id, amount, status, region FROM orders WHERE status LIKE 'SH%');

Vector retrieval

If a vector index has been created for a wide table, you can use an OLAP resource group to retrieve data based on vector similarity.

Query format

SELECT /*+ _l_force_vector_index_(${PROPERTIES}) */ ${SELECT_LIST}[,_vector_score_(${EMBEDDING_COLUMN}, ${TARGET_VECTOR})]
FROM ${TABLE_NAME}
WHERE ${CONDITION}
ORDER BY ${DISTANCE_FUNCTION}(${EMBEDDING_COLUMN}, ${TARGET_VECTOR})
LIMIT ${K}

Syntax notes

  1. Variables:

    Parameter

    Description

    ${SELECT_LIST}

    The columns to return from the query.

    _vector_score_(${EMBEDDING_COLUMN}, ${TARGET_VECTOR})

    Returns the similarity score. The EMBEDDING_COLUMN and TARGET_VECTOR parameters must match the parameters used by the distance function.

    ${TABLE_NAME}

    The target table.

    ${CONDITION}

    The scalar filter conditions.

    ${DISTANCE_FUNCTION}

    The similarity distance function. Supported functions are l2_distance, cosine_similarity, and innerproduct_distance.

    ${EMBEDDING_COLUMN}

    The column on which the vector index is built.

    ${TARGET_VECTOR}

    The target vector, expressed as a string representation of a floating-point array.

    ${PROPERTIES}

    Parameters for vector retrieval, such as k=10 and lvector.min_score=0.

  2. Distance functions: l2_distance, cosine_similarity, and innerproduct_distance.

  3. Hint: Use /*+ _l_force_vector_index_(...) */ to query the vector index and configure vector query parameters. The following parameters are supported:

    Parameter

    Required

    Description

    k

    Yes

    The number of most similar results to return. Ensure that the k parameter is consistent with the LIMIT clause.

    lvector.min_score

    No

    The similarity threshold. The scores of returned vectors must be greater than this value. The scores of returned vectors are in the range of [0, 1]. The value range is [0, +inf]. The default value is 0.

    lvector.ef_search

    No

    Specifies the length of the dynamic list during an index query for the HNSW algorithm. Value range: [1, 1000]. The default value is 100.

    lvector.nprobe

    No

    The number of cluster units to probe. Adjust this value to meet your recall requirements. A larger value improves recall but reduces performance. Valid values: [1, method.parameters.nlist]. No default value. Important: This parameter applies only to the IVFPQ algorithm.

    lvector.reorder_factor

    No

    Performs reordering by using the original vectors. The distances calculated by the ivfpq algorithm are quantized, which results in some precision loss. Reordering with the original vectors is used to compensate for this. The number of vectors to reorder is k * reorder_factor. This process is typically used to improve recall precision but increases performance overhead. Value range: [1, 200]. The default value is 10. Important - This parameter applies only to the ivfpq algorithm. - If the value of k is small, set this parameter to 5. If k is greater than 100, set this parameter to 1.

    lvector.client_refactor

    No

    Specifies whether to skip re-ranking within each shard and instead re-rank at an upper layer to improve performance. Valid values are true (yes) and false (no, default).

    lvector.filter_type

    No

    The hybrid search mode. Valid values: - pre_filter: Filters structured data and then searches vector data. - post_filter: Searches vector data and then filters structured data.

Example wide table schema

-- The following wide table already exists, and a vector index has been created on the 'item_embedding' column.
+----------------+---------------------+------+-------+---------+---------------+---------+
| Field          | Type                | Null | Key   | Default | Extra         | Comment |
+----------------+---------------------+------+-------+---------+---------------+---------+
| user_id        | VARCHAR(1073741824) | No   | true  | NULL    | partition key |         |
| order_id       | VARCHAR(1073741824) | No   | true  | NULL    | partition key |         |
| amount         | DOUBLE              | Yes  | false | NULL    |               |         |
| status         | VARCHAR(1073741824) | Yes  | false | NULL    |               |         |
| region         | VARCHAR(1073741824) | Yes  | false | NULL    |               |         |
| create_time    | VARCHAR(1073741824) | Yes  | false | NULL    |               |         |
| item_embedding | VARCHAR(1073741824) | Yes  | false | NULL    |               |         |
+----------------+---------------------+------+-------+---------+---------------+---------+

Query data

-- Retrieve the top 10 most similar product orders based on the input feature vector.
SELECT /*+ _l_force_vector_index_(k=10, lvector.min_score=0) */
    user_id,
    order_id
FROM orders
WHERE region='Beijing'
ORDER BY l2_distance(item_embedding, '[0.077, 0.941, 0.917, 0.347, 0.331, 0.309, 0.127, 0.990]')
LIMIT 10;

-- Retrieve the top 10 most similar product orders based on the input feature vector and return their similarity scores.
SELECT /*+ _l_force_vector_index_(k=10, lvector.min_score=0) */
    user_id,
    order_id,
    _vector_score_(item_embedding, '[0.077, 0.941, 0.917, 0.347, 0.331, 0.309, 0.127, 0.990]')
FROM orders
WHERE region='Beijing'
ORDER BY l2_distance(item_embedding, '[0.077, 0.941, 0.917, 0.347, 0.331, 0.309, 0.127, 0.990]')
LIMIT 10;

Parameters

The OLAP resource group automatically selects the appropriate index to accelerate queries. To force or disable the use of a specific index, add a hint to your query.

Index type

Hint

Description

secondary index

/*+ _l_force_secondary_index_*/

Forces the query to use the secondary index.

/*+ _l_disable_secondary_index_*/

Disables the secondary index.

search index

/*+ _l_force_search_index_*/

Forces the query to use the search index.

/*+ _l_disable_search_index_*/

Disables the search index.

columnar index

/*+ _l_force_columnar_index_*/

Forces the query to use the columnar index.

/*+ _l_disable_columnar_index_*/

Disables the columnar index.

vector index

/*+ _l_force_vector_index_*/

Forces the query to use the vector index for similarity retrieval.

/*+ _l_disable_vector_index_*/

Disables the vector index.