Eliminate data redistribution
In AnalyticDB for PostgreSQL's massively parallel processing (MPP) architecture, data is distributed across compute nodes using hash, random, or replicated distribution. During query execution, nodes exchange data through redistribution or broadcast operations. This exchange accounts for a significant share of query execution costs and consumes substantial network resources. Choosing the right distribution strategy at table design time reduces or eliminates this overhead.
How it works
When two joined tables have their matching rows on different compute nodes, the database must redistribute one or both tables before the join can proceed. Joins where matching rows already reside on the same node—called collocated joins—require no redistribution and are the most efficient.
Three distribution modes are available:
-
Hash distribution: Rows are assigned to nodes based on a hash of the distribution column. Rows with identical distribution column values always land on the same node, enabling collocated joins.
-
Replicated distribution: A full copy of the table is stored on every compute node. Any join with a replicated table is automatically collocated.
-
Random distribution: Rows are distributed evenly across nodes with no particular affinity. Joins with randomly distributed tables typically require redistribution.
Choose a distribution strategy
Follow these steps in order when designing your tables. Apply the first rule that fits.
Step 1: Identify your most costly queries
Before choosing distribution columns, analyze your query patterns:
-
Identify the queries with the highest execution costs. Cost is measured by execution time and compute resource consumption, weighted by query frequency and impact on concurrent workloads.
-
For each high-cost query, identify the tables it references and how those tables are joined and aggregated.
-
Prioritize distribution decisions for the tables that appear most frequently in high-cost joins.
Step 2: Use hash distribution for fact and dimension tables in star schemas
In star schema designs, a fact table joins multiple dimension tables. Hash distribution on a shared join column eliminates redistribution for those joins.
-
Select the primary dimension table to collocate with the fact table. The largest dimension table is the one that contributes the most rows to common join operations after WHERE clause filtering—not necessarily the table with the most total rows. When WHERE clauses are present, only a subset of rows participates in the join, so filter selectivity determines actual join volume.
-
Set the distribution column to the shared join key. Use the column present in both the fact table and the primary dimension table as the distribution column for both. This makes their join collocated.
-
For remaining dimension tables, set the distribution column to the primary key or foreign key that matches the actual join condition for each table.
Step 3: Use replicated distribution when collocated joins are not feasible
If a dimension table cannot share a distribution column with the fact table—for example, when the join key is not present in the dimension table—use replicated distribution. Every node holds a full copy of the table, so any join with it is always collocated at no redistribution cost.
Step 4: Use random distribution for tables without join requirements
For tables that have no suitable distribution column or do not participate in joins, use random distribution. Rows are spread evenly across nodes, avoiding data skew.
Quick reference
| Scenario | Distribution strategy |
|---|---|
| Fact table and its primary join dimension table | Hash on the shared join column |
| Secondary dimension tables | Hash on the primary key or foreign key used in the join |
| Dimension table that cannot share a join key with the fact table | Replicated |
| Table with no suitable distribution column or no join involvement | Random |
Distribution column guidelines
Recommended:
-
Use a column that appears in join conditions across high-cost queries.
-
For hash distribution, choose a high-cardinality column to distribute rows evenly across nodes.
A fact table can have only one distribution column, so choose the dimension table to collocate based on join frequency and filtered row volume.