Best practices for wide table design

更新时间:
复制 MD 格式

LindormTable provides petabytes of storage capacity. It partitions data by primary key range and distributes it evenly across nodes. Lindorm also supports SQL and indexes, offering an experience similar to a relational database. However, its underlying architecture is a distributed NoSQL database built on the LSM-Tree structure, which differs from relational databases. Understanding these principles is essential for using Lindorm effectively and avoiding performance issues and hotspots caused by improper schema design.

Data model

LindormTable is a row-oriented storage engine. The following SQL statement creates a sample table named orders:

CREATE TABLE orders (
    channel VARCHAR NOT NULL,       -- Payment channel, such as Alipay or WeChat
    id VARCHAR NOT NULL,            -- Order ID
    ts TIMESTAMP NOT NULL,          -- Order event time
    status VARCHAR,                 -- Order status
    location VARCHAR,               -- Location where the order was placed
    PRIMARY KEY(channel, id, ts)    -- Composite primary key consisting of channel, id, and ts
) WITH (DYNAMIC_COLUMNS='true');    -- Enables dynamic columns. You can write to non-primary key columns without predefining them.

This statement creates the data model shown in the following table:

Primary key

Non-primary key

channel

id

ts

status

location

...

alipay

a0001

1705786502000

0

shanghai

...

alipay

a0002

1705786502001

1

beijing

...

...

...

...

...

...

...

unionpay

u0001

1705786502056

0

hangzhou

...

unionpay

u0002

1705786502068

1

nanjing

...

...

...

...

...

...

...

wechat

w0001

1705786502056

0

shanghai

...

wechat

w0002

1705786502068

0

shanghai

...

...

...

...

...

...

...

Primary key

A row's columns are divided into primary key and non-primary key columns. A primary key can be composed of multiple columns and has the following characteristics:

  • Immutable schema: The primary key is determined when you create a wide table. After table creation, you cannot add, delete, or reorder primary key columns, nor can you change their data types. Therefore, you must plan your primary key before creating the table. Primary key design is critical to query performance.

  • Uniqueness: All primary key columns combine to form a row key, which is unique within a table. The complete primary key uniquely identifies a row in the table.

  • The primary key is the clustered index: In a Lindorm wide table, data is stored in the order of the primary key. For example, in the sample table orders, rows with the same value in the channel column are stored together. If the values in the channel column are the same, the rows are then sorted by the values in the id column. If the values in the id column are also the same, the rows are sorted by the values in the ts column.

    This clustered index property enables more efficient queries. Similar to MySQL, Lindorm follows the left-prefix matching principle. You should design your primary key with your most common equality-based query conditions as the leftmost columns to narrow down the search range. If a query includes a range condition on a primary key column, the left-prefix match stops at that point. Subsequent primary key columns with equality checks cannot leverage the sorted storage order and will require extensive filtering.

    Consider the following query on the orders table:

    -- This query reads all data where channel='alipay' and id > 'a0089' to find rows matching ts=1705786502068.
    SELECT * FROM orders WHERE channel='alipay' AND id > 'a0089' AND ts = 1705786502068;  

    If you do not specify the leftmost primary key column in a query, it results in a full table scan, even if other primary key columns are specified.

    -- This is a full table scan query.
    SELECT * FROM orders WHERE id = 'a0089';

    In the statement above, the query condition specifies a primary key id, but this key is in the middle of the composite primary key. Therefore, the system still needs to scan all rows to find the row that matches id= 'a0089', which is very inefficient. If this is a main query, we recommend that you place the id column as the leftmost column of the primary key when you design the table, or create an index table for the id column to speed up queries.

Non-primary key

Lindorm allows you to dynamically define non-primary key columns. This means you do not need to define them in the schema. You can write to any non-primary key column by name, similar to how HBase works. For more information, see Dynamic columns. Lindorm also supports using wildcards to define non-primary key columns. For example, if you define *_str as a STRING type, you can write to any column with a name ending in _str. For more information, see Wildcard columns.

Lindorm supports updates to non-primary key columns. When writing data, you do not need to specify all non-primary key columns, but you must specify at least one. Lindorm does not support writing a row that contains only primary key values. Because data in a wide table is sorted by the primary key, using a non-primary key column as a filter without an index causes a full table scan. Lindorm rejects such queries by default. To query non-primary key columns efficiently, limit the primary key range or create a secondary index on the non-primary key column.

-- This request scans the entire table because it filters only on a non-primary key column without a primary key constraint.
SELECT * FROM table WHERE location = 'shanghai'; 

-- This query scans all rows where channel='alipay' and then filters for location = 'shanghai'.
SELECT * FROM table WHERE location = 'shanghai' and channel='alipay'; 

Data distribution

Lindorm is a distributed database. It automatically partitions data by primary key range and distributes these partitions, called regions, across the nodes in an instance. The following figure illustrates this process.

image

In Lindorm, a partition is called a region. Each region stores a continuous range of rows from a table. All regions are arranged sequentially by their primary key range to form the complete table space. For example, if you write a row with the key {alipay,a100999,1705786502068}, it is stored in Region_3. When the data in Region_3 grows beyond a threshold (8 GB by default), or if the system detects a read/write hotspot, it triggers region splitting. The original region is split into two new regions, covering the lower and upper halves of the original key range. The system then balances the load by moving one of the new regions to a different node.

image

The system does not guarantee that rows with the same prefix are stored in the same region. For example, the system cannot guarantee that rows where channel=alipay are stored in a single Region. If a large number of rows meet the channel=alipay condition, the rows are distributed across multiple Regions. For example, in the preceding figure, rows where channel=alipay are distributed across Region_1, Region_2, and Region_3. During the actual write process, you do not need to worry about writing too much data to specific ranges or use PARTITION BY to specify partitions when you create a table. The system automatically selects appropriate split points without manual intervention.

Pre-split region

Since Lindorm supports automatic partitioning, you do not need to define partition ranges when you create a table. The system automatically splits regions as data is written. However, if you have specific data distribution requirements, you can specify a number of pre-split regions at table creation. The system then distributes these regions across multiple machines to spread out read and write loads from the beginning. Note that this initial number of regions is just a starting point; the system still performs region splitting as data grows. For information on how to specify pre-split regions, see CREATE TABLE.

If you expect a high initial write volume or plan to use Bulkload to import data, we recommend setting up appropriate pre-split regions to prevent a single node from being overloaded at the start.

  • If you write data using SQL or the HBase API, you can set the number of pre-split regions to Number of nodes × 4. More initial partitions are not always better, so you should set the number based on your specific needs.

  • If you use Bulkload to import data, we recommend setting the number of pre-split regions to Data size (GB) / 8. This helps distribute the imported data evenly across regions and prevents immediate region splitting.

  • If you know the distribution characteristics of your primary key, you can specify split keys based on your data. For more information, see Set start keys for multiple regions.

Ensure that the ranges of your pre-split regions match your data write patterns. Otherwise, data skew can still occur, where writes are concentrated in a few regions, leading to poor performance despite having many pre-split regions.

Hotspot

In a distributed system, the goal is to distribute requests evenly across all regions so the system can scale out and handle more data and traffic. A hotspot breaks this model and can turn a single node into a performance bottleneck.

  • Single-row hotspot: Frequently reading or writing a single row can create a single-row hotspot. Since requests for the same row are always routed to the same node, that node's capacity becomes the upper limit for the entire system's performance. Scaling out does not solve this problem; you can only scale up the single node. Therefore, design your wide table to avoid single-row hotspots.

  • Ranged hotspot: A ranged hotspot can occur if you frequently read from and write to a very small range of data. Because Lindorm partitions data based on primary key ranges, data within a small range is likely to be stored in the same region. As a result, all requests might be sent to the same server. Although Lindorm provides a hotspot self-healing feature that can identify ranged hotspots and distribute the data from a hotspot region to different nodes by automatically splitting the region, you can still select a more distributed primary key to avoid requests that target a small range. For example, in the sample table orders, you can apply a hash function to the id column and store the result to prevent hotspot issues as much as possible.

  • Incremental write hotspot: If your primary key is based on an incremental value, new writes will always target the end of your key space. Because Lindorm partitions data by key range, even after a region splits, all subsequent writes will go to only one of the new regions (the one covering the end of the range), preventing the load from being distributed. For example, in the orders table, the id is incremental. If there is a high volume of orders for channel=alipay, even if Region_3 splits into Region_3_a and Region_3_b, only Region_3_b will handle the new write traffic. The system cannot resolve this issue. Therefore, avoid incremental primary keys, and never use an incremental value as the first column of your primary key.

    Important

    If you are migrating data from a system like Cassandra that uses hash-based partitioning, be aware of this difference. In a hash-partitioned system, an incremental key is hashed and distributed evenly. If you migrate that same schema to Lindorm, where the incremental key is used directly for range partitioning, it will create a severe hotspot.

Primary key design

The design of the primary key is crucial in a wide table, as it directly influences data distribution and sorting. A well-designed primary key is essential for allocating resources efficiently and using the wide table correctly. For more recommendations, see How to design a wide table primary key.

Key points for primary key design are as follows:

  • Keep the primary key as short as possible. It only needs to ensure row uniqueness. Do not include large data like JSON strings or web content in the primary key.

  • The first column of the primary key should be as distributed as possible. For incremental values such as an order ID, you can use the hash(id)+id or reverse(id) method.

  • The first column of the primary key cannot be an incremental value, as this will cause a severe hotspot. If an incremental field is unavoidable, place it as the second or third column of the primary key to ensure the leftmost part of the key is well-distributed. For example, in the orders table, the first column is channel. Writes for different channel values with incremental id values are distributed to different regions, which mitigates the risk of a single write hotspot.

  • The primary key is critical for query performance (when no index is used). Therefore, design your primary key based on your main query patterns.

Index design

When querying by the primary key alone does not meet your performance requirements, you can create a secondary index on the columns you need to query. For more information, see Secondary index.

A secondary index is an index table where the indexed columns form the primary key. Therefore, the design principles for primary keys also apply to indexed columns. Columns with low cardinality or incremental values are poor candidates for indexing.

Queries on a secondary index follow the left-prefix matching principle. For example, if a composite index is created on columns a, b, and c, a query that specifies column b or c but not the leftmost column a (such as SELECT * FROM tablename WHERE b=xx;) will still perform a full table scan.

When you create a secondary index, data is written to both the base table and the index table. A single write operation can involve multiple steps, such as reading from the base table and then writing to the index table, which can impact the write performance of the base table. To maintain good performance, limit the number of secondary indexes. Additionally, the columns in a secondary index cannot be modified. If an existing index no longer meets your query needs, you must drop it and create a new one. This makes secondary indexes best suited for scenarios with relatively stable query patterns. If you have many columns and need to perform complex queries with various combinations of conditions, we recommend using Lindorm search index and columnar index.

The following table compares three types of indexes:

Index type

Scenario

Dependency

Real-time visibility

Secondary index

For applications with stable query patterns.

None.

Yes.

Search index

Online query scenarios with many indexed columns and complex combinations of query conditions.

Requires activating the search index feature and provisioning LindormSearch nodes.

Data must be synchronized to the search engine, so there is some indexing latency. For more information, see Introduction to search index.

Columnar index

Scenarios that require analytical queries on specific columns.

Requires activating the columnar index feature and provisioning LDFS nodes.

Data needs to be converted to a columnar format, which introduces indexing latency. For more information, see Columnar index.

Data writes

Follow good primary key design principles to ensure data is evenly distributed across nodes for optimal performance. Batch writes are more efficient than single-row writes because they reduce the number of RPCs and allow the server to process rows in bulk, achieving higher throughput. However, do not make batches too large. An excessively large batch can cause node out-of-memory (OOM) errors or full garbage collection (GC), which can affect service stability. We recommend keeping the size of each batch under 2 MB.

Lindorm supports both the HBase API and SQL for data access and writes, but these two methods should not be mixed. If you create a table using SQL and define data types for each column (like INT or LONG), you may not be able to read data correctly with SQL if you later write to those columns using the HBase API. Tables created with SQL should only be accessed with SQL. Tables created with the HBase API can be accessed via SQL, and you can add a schema using column mapping. For more information, see Use SQL to access HBase tables.

Data queries

Point queries and range queries

Lindorm wide tables have two common query types: point queries and range queries.

Point query

A point query specifies the full primary key of a row. For example:

SELECT * FROM orders WHERE channel='alipay' AND id='a0001' AND ts=1705786502000;
SELECT * FROM orders WHERE channel='alipay' AND id='a0001' AND ts IN (1705786502000, 1705786502222, 1705786502333);
SELECT * FROM orders WHERE channel='alipay' AND id IN ('a0001', 'a0002', 'a0003') AND ts IN (1705786502000, 1705786502222, 1705786502333);
SELECT * FROM orders WHERE channel IN ('alipay', 'wechat', 'unionpay') AND id IN ('a0001', 'a0002', 'a0003') AND ts IN (1705786502000, 1705786502222, 1705786502333);

In the preceding example statements, the first SQL statement is a single-row point query, and the other SQL statements are multi-row point queries. The second SQL statement specifies multiple conditions and returns multiple rows of data from the Lindorm wide table. In the third SQL statement, three values are specified in the IN clause for each of the primary key columns id and ts. Therefore, 3 × 3 = 9 rows of data are queried. Similarly, the fourth SQL statement queries 3 × 3 × 3 = 27 rows of data. The more IN clauses you combine in a point query, the more likely a Cartesian product is created, and the more row query requests are sent to the server. When you perform a multi-row point query in a Lindorm wide table, all results are returned at once. The more rows that are queried, the more rows the server needs to process, and the larger the returned dataset. This increases the risk of an OutOfMemory (OOM) error or a Full Garbage Collection (GC) on the server. Therefore, when you perform point queries, you must control the number of rows queried by reducing the number of values in the IN clauses and the number of IN clause combinations.

Lindorm has a limit on the number of batch point queries. By default, you can perform a maximum of 2,000 point query requests at a time. If this limit is exceeded, the Multi Get Plan query too many rows in one select error is reported. If you need to perform large-scale batch point queries and have confirmed that memory will not become a bottleneck, you can contact Lindorm technical support (DingTalk ID: s0s3eg3) to increase this limit.

Range query

A range query does not specify the full primary key, using only a partial key or none at all. For example:

SELECT * FROM orders;
SELECT * FROM orders WHERE channel='alipay';
SELECT * FROM orders WHERE channel='alipay' AND id='1705786502001';
SELECT * FROM orders WHERE channel='alipay' AND id IN ('a0001', 'a0002', 'a0003');
SELECT * FROM orders WHERE channel IN ('alipay', 'wechat', 'unionpay') AND id IN ('a0001', 'a0002', 'a0003');

All of the preceding statements are range queries because none of them specify the complete primary key.

The results of a range query are streamed back to the client. This means that even a full table scan can return all its data without causing node OOM issues. Therefore, there is no limit on the number of rows a single range query can return. You do not need to add limit or offset clauses for pagination or cursors.

ORDER BY queries

Data in a wide table is sorted by its primary key. Without an ORDER BY clause, a SELECT statement returns results in primary key order. However, if a query uses an index, the results are returned in the order of the index key by default. The order of columns in the index key determines the return order.

The columns in an ORDER BY clause follow the left-prefix matching principle. For efficient sorting, you cannot skip the leftmost primary key fields and use a middle primary key field in the ORDER BY statement. When you use a middle primary key or a non-primary key to sort data in an ORDER BY clause, a large number of sorting computations are involved. Therefore, we recommend that you place the primary key that you want to sort by on the far left, or create a secondary index for the non-primary key that requires sorting. In addition, if you want to sort query results in descending order (ORDER BY DESC), we recommend that you add the DESC keyword when you create the table. This ensures that data is stored in descending order for optimal query performance. For more information about using ORDER BY, see How to use ORDER BY in large result sets.

Count queries

Lindorm is a NoSQL database based on the LSM-Tree storage architecture. Because it stores multiple versions of data and uses delete markers for deletions, the underlying storage does not maintain a real-time row count in the metadata. Obtaining an exact row count requires a full table scan, which can be time-consuming for large tables. If you only need an estimated row count, see How to count the number of rows in a table.

If a COUNT request includes conditions, the query time depends on the amount of table data that must be scanned to meet these conditions. For example, SELECT count(*) FROM table WHERE channel = 'alipay' reads all rows that satisfy channel = 'alipay' to perform the count.

Data deletion

When you delete a row in a Lindorm wide table, the row is not immediately removed. Instead, a delete marker is created. The row is permanently deleted only after the system performs a COMPACTION operation. Before the data is permanently deleted, both the deleted data and the delete marker are read during queries and then filtered out. Therefore, a large number of delete operations increases the number of delete markers and degrades query performance. We recommend that you set a Time To Live (TTL) to have data expire automatically instead of deleting it directly. If your workload involves a large number of delete operations, you can shorten the COMPACTION period to quickly remove the deleted data and delete markers. For information about how to set the TTL and the COMPACTION period, see ALTER TABLE.

If a secondary index is used, when you update data in the primary table, the old data in the secondary index table is deleted and the new data is written. As a result, this process generates a large number of delete markers in the secondary index table. A high volume of updates to the primary table can degrade the query performance of the index table. You can shorten the COMPACTION operation cycle to reduce the impact of these delete markers.

Important

Shortening the COMPACTION operation cycle increases the system load. Configure this setting appropriately.

More information

  • LindormTable is built on the LSM-Tree storage architecture. If you are unfamiliar with features like multi-versioning, timestamps, and TTL, you may encounter unexpected results from write and query operations due to improper use. For help with such issues, see Common causes for unexpected query results.

  • When using wide tables, continuously monitor your instance's CPU, network, and disk usage to prevent performance degradation from insufficient resources. Also, note the number of files on the node, region sizes, and any compaction backlogs to ensure they do not exceed Lindorm's limits, which can affect write and query performance. For information about Lindorm's limits, see Quotas and limits, and for key metrics to monitor, see Best practices for monitoring and alerting.

  • For other common errors and their solutions, see Lindorm SQL FAQ.