Best practices for table group setup

更新时间:
复制 MD 格式

When you build tables in Hologres, choosing the right table group and shard count is one of the most consequential decisions you make before production. An inappropriate shard count causes one of two problems: too many shards introduce shuffle and startup overhead that degrades query latency; too few shards limit query concurrency and write throughput. The goal is to match your shard count to your data volume and workload — not to maximize parallelism.

Recommended shard counts

The right shard count depends on several factors: data volume, query frequency, data access patterns (point query vs. online analytical processing (OLAP)), write throughput, and the number of tables in the table group. Use the following table as a starting point, then adjust based on your actual workload.

Total data size Recommended shard count Recommended instance specifications
Less than 40 million rows 10–20 More than 32 CPU cores
40 million–400 million rows 20–40 More than 64 CPU cores
400 million–4 billion rows 40–80 More than 128 CPU cores
4 billion–40 billion rows 80–240 More than 256 CPU cores. Consider creating a dedicated table group.
40 billion–400 billion rows 160–400 More than 512 CPU cores. Use multiple table groups.
These ranges are guidelines, not strict requirements. A table with a small data volume can reside in a high-shard table group, and a large table can reside in a single-shard table group. Select a shard count that balances high concurrency, compute efficiency, data concentration, and minimal shuffle overhead.

A table group should contain no more than 10,000 tables, including partitioned child tables but not foreign tables. Exceeding this limit causes metadata accumulation and slower Data Definition Language (DDL) execution.

To check the current shard count of your table group, run:

SELECT * FROM hologres.hg_table_group_properties;

-- Sample output
 tablegroup_name | property_key  | property_value
-----------------+---------------+----------------
 test_tg_default | is_default_tg | 1
 test_tg_default | shard_count   | 40
 test_tg_default | tg_version    | 1
 test_tg_default | table_num     | 1
(4 rows)

Choose a table group configuration

Work through the following decisions in order. Stop at the first step where the conditions are met.

Step 1: Does the default table group fit your data volume?

Compare the current shard count against the recommended ranges above. If it fits your current and projected data volume, use the default table group — provided that all three conditions apply:

  • The total data volume across all tables is controllable and predictable.

  • The usage pattern is not expected to change significantly.

  • Tables in the group need to perform efficient local join operations.

After upgrading or downgrading instance specifications, the shard count of the default table group does not change. Verify the shard count after any specification change.

If the default table group does not fit, move to the next step.

Step 2: Do you need a new table group?

Create a new table group if any of the following conditions apply:

  • Data volume mismatch: The existing shard count is too large for the table's data volume (producing many small files with high I/O overhead) or too small (limiting query concurrency).

  • Write load isolation: Many tables in the existing table group require concurrent writes, causing high instance load. Isolating tables in a dedicated table group makes writes and queries more independent.

  • Table correlation: A set of tables shares a unique write or query pattern and has local join requirements among themselves, but has little correlation with tables in the existing table group. Local join requires that the join key is the distribution key and that all joined tables are in the same table group.

  • Instance scaling: If the instance has been scaled in or out by more than five times, the original shard count may no longer be appropriate. The new shard count must be greater than the number of compute nodes and less than 60% of the total CPU cores.

If none of the above apply, continue using the existing table group.

Plan multiple table groups

Before stress testing and production, define the role of each table group and assign tables accordingly. Consider the following factors:

Data volume

Match the shard count to the amount of data stored in the table group. High-shard table groups suit large tables; low-shard table groups suit small and medium tables.

Write throughput

Shard count has a direct, positive effect on write throughput. Each shard has a maximum write capacity: at 100% CPU utilization, a single shard writes at 3,000–5,000 records per second (RPS) for 1 KB records.

To estimate the required shard count from a write throughput target: a shard operating at one-third CPU utilization sustains 1,000 RPS for 1 KB records. If your target is 60,000 RPS with 1 KB records, you need at least 60 shards (60,000 / 1,000 = 60). The shard count can be fine-tuned.

Shards also serve read requests, so available CPU for writes is always less than 100%. Factor in read load when estimating the shard count.

Table group load

If a table group will contain many tables that are frequently accessed concurrently, a small shard count may not support the required query concurrency. Account for the expected number of tables and their access patterns when setting the shard count.

FAQ

I have a 512-core instance for real-time OLAP on an event table with 20–40 billion rows. How do I configure the table group and shard count?

This is a single-workload scenario, so one table group is sufficient. The default shard count for a 512-core instance is 160. If the event table has hundreds of columns, increase the shard count to 200 or more to support higher OLAP concurrency.

I have a 256-core instance with many column-oriented tables, each containing tens of millions of rows, and I need millisecond-level OLAP with group-by and filter queries. How should I configure the shard count?

This is also a single-workload scenario; one table group is sufficient. The default shard count for a 256-core instance is 120. For tables with tens of millions of rows, 10–20 shards is more appropriate — more shards increase shuffle overhead and make millisecond-level analysis harder to achieve. Change the default table group's shard count to a value between 16 and 40, then validate with a stress test.

How do I check whether a slow query is caused by an inappropriate shard count?

Run EXPLAIN ANALYZE on the slow query and examine the output for the following indicators:

  • Shard count too large: High query startup overhead (check the start query cost line) or high shuffle overhead (check Max_GetNext_Time of Redistribution Motion). In Hologres V0.10 and later, these metrics are also available in slow query logs.

  • Shard count too small: CPU utilization stays below 100% during long-running computations, data scan overhead is high (check Max_GetNext_Time of Scan Node), or write performance is poor. Compare the actual RPS with the 3,000–5,000 RPS per shard baseline to assess whether the shard count is insufficient.

Point query QPS is not meeting my target. Is the shard count the cause?

Before adjusting the shard count, rule out other causes: the query may be analytical rather than a true point query, indexes may not be in use, shards may not be split, or the CPU may be at 100% utilization. If none of those apply and a single SQL statement is already performing optimally, increase the shard count to raise backend concurrency for point queries.

How do I detect and fix data skew?

Hologres provides the internal field hg_shard_id, which identifies the shard where each row resides. Run the following query to check for skew:

SELECT hg_shard_id, COUNT(1) FROM <table_name> GROUP BY hg_shard_id ORDER BY COUNT(1) DESC;

If one shard holds significantly more data than the others, data skew exists. Adjust the distribution key to achieve a more even distribution.