Discretize primary keys using a hash algorithm-Lindorm(Lindorm)-阿里云帮助中心

The primary key hash discretization feature in Lindorm uses a hash function to distribute data across different shards. This enables distributed storage and queries and prevents issues such as data skew and uneven loads. This topic describes how to use the primary key hash discretization feature.

Background information

LindormTable is a distributed storage engine that uses range partitioning. Typically, data written to LindormTable is evenly distributed across different shards.

In practice, uneven traffic distribution can cause problems. For example, some data might be accessed much more frequently than other data. This can lead to uneven data distribution across shards and create hot spots. These hot spots prevent you from taking full advantage of the distributed system. You can resolve this data distribution issue by adjusting the data access method. However, if the issue is caused by primary keys that cannot be easily discretized, you can use the primary key hash discretization feature. Examples of such keys include auto-incrementing primary keys or time-based columns in the HH:mm:ss format.

The primary key hash discretization feature evenly distributes this data across different shards. This improves system scalability. It also supports distributed queries on top of range partitions, which improves query efficiency.

If your business scenario matches these cases, you can apply a hash algorithm to a specified primary key column when you create a table or a secondary index to discretize the data.

Prerequisites

LindormTable must be version 2.5.3 or later. For more information about how to view or upgrade the current version, see LindormTable version guide and Perform a minor version update.

Note

We recommend that you use the latest version of LindormTable.

Usage notes

When you apply a hash algorithm to a specified column in a primary key or an index, the hash function expression must be placed first. For example, PRIMARY KEY(p1, hash32(p1), p2) is incorrect. The correct usage is PRIMARY KEY(hash32(p1), p1, p2).
When you apply a hash algorithm to a column in a primary key or an index, you must also specify that column as a primary key column or an index column.
Primary key columns with a specified hash algorithm cannot be modified.
After you enable the primary key hash discretization feature, you cannot use the bulk load method to import data.

DDL

Create a table

When you create a table, you can set the primary key columns in the `PRIMARY KEY` clause and apply a hash algorithm to the specified columns. The 8-bit, 32-bit, and 64-bit hash algorithms are supported. Based on the characteristics of your primary keys, you can apply different hash algorithms to one or more primary key columns.

Note

The more bits the hash algorithm uses, the more memory the underlying storage layer consumes, and the higher the storage cost. For example, if you use a 32-bit hash algorithm, the underlying storage layer consumes an additional 4 bytes for each key-value pair.

Examples:

-- Apply a hash algorithm to one primary key column
CREATE TABLE t1 (
    p1 bigint,
    p2 integer,
    c1 integer,
    c2 varchar,
    PRIMARY KEY(hash32(p1), p1, p2)
);

-- Apply a hash algorithm to multiple primary key columns
CREATE TABLE t2 (
    p1 bigint,
    p2 integer,
    c1 integer,
    c2 varchar,
    PRIMARY KEY(hash8(p1, p2), p1, p2) 
);

hash32(p1) indicates that a 32-bit hash algorithm is applied to the p1 column. hash8(p1, p2) indicates that an 8-bit hash algorithm is applied to both the p1 and p2 columns.

Create a secondary index

The primary key hash discretization feature is supported for index tables.

Note

If you apply a hash algorithm to a column when you create a primary table but do not apply a hash algorithm to that column when you create a secondary index, the data in the index table is not discretized by default.

CREATE INDEX idx ON t1 (hash64(c1, c2), c1, c2);

View hash column details

You can view the table schema to see details about the hash columns.

DESCRIBE table t1;

Result:

+--------------+------------+-------------+---------+----------------+------------+
| TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME |  TYPE   | IS_PRIMARY_KEY | SORT_ORDER |
+--------------+------------+-------------+---------+----------------+------------+
| test1        | t1         | hash32[p1]  | INT     | true           | ASC        |
| test1        | t1         | p1          | BIGINT  | true           | ASC        |
| test1        | t1         | p2          | INT     | true           | ASC        |
| test1        | t1         | c1          | INT     | false          | none       |
| test1        | t1         | c2          | VARCHAR | false          | none       |
+--------------+------------+-------------+---------+----------------+------------+

DML

Write data

When you write data, you do not need to add hash-related parameters to the SQL statement. The system automatically generates and populates the hash value based on the data that is written.

UPSERT INTO t1(p1, p2, c1, c2) VALUES(1, 1, 1, 'a');

Query data

When you query data, you do not need to add hash-related parameters to the SQL statement. The system automatically calculates the hash value based on the query conditions. However, note the following points:

You must specify values for all primary key columns that use a hash algorithm. If a column value is missing, the system cannot calculate the hash value. This prevents the system from locating the correct data shard and results in a full table scan.

The following examples use table t1:

-- Recommended
SELECT * FROM t1 WHERE p1=1 AND p2=1; -- The system queries the shard where hash(p1)=hash32(1).

-- Not recommended
SELECT * FROM t1 WHERE p2=1; -- The value of the primary key column p1 is not specified. The system cannot locate the shard that corresponds to the hash value and performs a full table scan.

For primary key columns that use a hash algorithm, the query condition must be an equality query, such as p1=2. Range queries on hash columns, such as p1>1, are not supported.
The following examples use table t1:
```
-- Correct
SELECT * FROM t1 WHERE p1=2 AND p2>1;

-- Incorrect
SELECT * FROM t1 WHERE p2=1 AND p1>2 AND p1<8; -- This query is not supported because a range query is set for the primary key column p1.
```

For more information about how to view the execution details of a SELECT statement, see EXPLAIN.

FAQ

Why does the query result display a hash column with a `null` value?

This issue is fixed in LindormTable 2.6.0 and the SQL engine 2.6.2.1. You can perform a minor version update to version 2.6.0 or later.

Note

A minor version update may restart the instance. We recommend that you perform this operation during off-peak hours.

Background information

Prerequisites

Usage notes

DDL

Create a table

Create a secondary index

View hash column details

DML

Write data

Query data

FAQ

Why does the query result display a hash column with a null value?

Why does the query result display a hash column with a `null` value?