Incremental refresh best practices for Dynamic Tables-Hologres(Hologres)-阿里云帮助中心

Use cases for incremental refresh

A Dynamic Table supports two refresh modes:

Full refresh: Recalculates all data each time the table is refreshed.
Incremental refresh: Calculates only new or changed data and merges it with the existing full results.

The source table has tens of millions of rows or more.
Minute-level data freshness is required.
Data changes per refresh interval are low, typically 0.1% to 5%.

This refers to the overall data change rate across the entire computation pipeline. A small change in a source table, such as an update to a dimension table, can trigger large-scale recalculation of intermediate or final results.

Table creation and initial refresh

The first refresh of an incremental Dynamic Table processes all historical data and persists intermediate results, consuming significantly more memory than a standard SQL query. For large tables, follow these steps:

Disable automatic refresh initially. Run the initial full refresh manually, then add new data and run a second manual refresh to test the incremental process. Enable automatic refresh after verification.
Use a serverless resource for the initial refresh. This prevents the refresh from affecting other tasks on your local resource. After the refresh succeeds, check the refresh history for actual resource consumption. If your local resource is sufficient, switch to it for subsequent automatic refreshes.

The following example demonstrates these steps.

-- 1. Define the source table
CREATE TABLE orders (
  order_id BIGINT NOT NULL,
  user_id BIGINT,
  amount FLOAT,
  PRIMARY KEY (order_id)
);

-- 2. Generate initial data
INSERT INTO orders
SELECT i, i % 100000, RANDOM() * 1000
FROM GENERATE_SERIES(1, 1000000) AS i;

-- 3. Create an incremental Dynamic Table (automatic refresh is disabled initially)
CREATE DYNAMIC TABLE order_agg
WITH (
  freshness = '1 minutes',
  auto_refresh_mode = 'incremental',
  auto_refresh_enable = 'false',     -- Enable automatic refresh after successful manual verification
  computing_resource = 'serverless'  -- Prioritize using a serverless resource
) AS
SELECT
  user_id,
  COUNT(1) AS order_count,
  SUM(amount) AS amount_sum
FROM orders
GROUP BY user_id;

-- 4. Manually run the first refresh and verify the result
REFRESH TABLE order_agg WITH (refresh_mode = 'incremental');
SELECT * FROM order_agg WHERE user_id = 1;

-- 5. Insert incremental data
INSERT INTO orders
SELECT i, i % 100000, RANDOM() * 1000
FROM GENERATE_SERIES(1000001, 1010001) AS i;

-- 6. Manually run an incremental refresh and verify the result
REFRESH TABLE order_agg WITH (refresh_mode = 'incremental');
SELECT * FROM order_agg WHERE user_id = 1;

-- 7. After verification, enable automatic refresh
ALTER TABLE order_agg SET (auto_refresh_enable = 'true');

GUCs for Dynamic Table refresh

GUC (Grand Unified Configuration) parameters affect incremental Dynamic Table refreshes the same way they affect standard OLAP statements. However, you must apply them using one of the following methods:

-- Configure or reset a GUC parameter for automatic refreshes
ALTER TABLE order_agg SET (refresh_guc_hg_experimental_xxx = xxx);
ALTER TABLE order_agg RESET (refresh_guc_hg_experimental_xxx);

-- Set a GUC parameter for a manual refresh
REFRESH TABLE order_agg WITH (refresh_mode = 'incremental', refresh_guc_hg_experimental_xxx = xxx);

Tuning incremental refresh

Append-only optimization

If a base table has only append writes (no UPDATE or DELETE operations), set the appendonly property to optimize incremental refresh performance. The improvement is especially significant for queries with MIN/MAX aggregations.

After this property is set, the table no longer supports update or delete operations. Set this property before you create the Dynamic Table.

-- Enable Append-Only optimization (must be executed before creating the Dynamic Table)
CALL set_table_property('orders', 'mutate_type', 'appendonly');

-- To restore the default behavior (first delete all associated incremental Dynamic Tables)
CALL set_table_property('orders', 'mutate_type', 'none');

Optimize Group By/Join key order

Incremental Dynamic Tables persist intermediate results for aggregations and joins. During each refresh, the system scans these results using Group By/Join keys to filter files, reducing the amount of intermediate data read.

When you have multiple Group By/Join keys, reorder them in the SQL statement to improve filtering efficiency. Place these fields first:

Sequential fields, such as timestamps or auto-incrementing IDs.
Fields with high cardinality.

Example: calculate per-minute sales by product category and sales platform for a single day.

-- The minute is a highly sequential time field, so place it first.
-- The product category has higher cardinality than the sales platform, so place it second.
-- The sales platform has only a few values, so place it last.
CREATE TABLE order_details (
  order_id BIGINT,
  platform BIGINT,
  product_id BIGINT,
  category_id BIGINT,
  amount FLOAT,
  create_at TIMESTAMP
);

CREATE DYNAMIC TABLE order_detail_summary
WITH (
  freshness = '1 minutes',
  auto_refresh_mode = 'incremental',
  auto_refresh_enable = 'false',
  computing_resource = 'serverless'
) AS
SELECT
  to_char(create_at, 'HH24:MI') AS minute,
  category_id,
  platform,
  SUM(amount)
FROM order_details
GROUP BY minute, category_id, platform;

Dimension table joins

Use dimension table joins instead of multi-table joins in these scenarios:

One table is updated infrequently and not written to in real time.
One table is updated frequently, but the join fields do not change.

Example: join an orders table with a customers table using a dimension table join when customers changes infrequently:

CREATE TABLE orders (
  order_id BIGINT NOT NULL,
  user_id BIGINT,
  amount FLOAT,
  PRIMARY KEY (order_id)
);

INSERT INTO orders
SELECT i, i % 100000, RANDOM() * 1000
FROM GENERATE_SERIES(1, 1000000) AS i;

CREATE TABLE customers (
  user_id BIGINT NOT NULL,
  user_name TEXT,
  PRIMARY KEY (user_id)
);

INSERT INTO customers
SELECT i, 'user_' || i
FROM GENERATE_SERIES(1, 100000) AS i;

CREATE DYNAMIC TABLE order_customer
WITH (
  freshness = '1 minutes',
  auto_refresh_mode = 'incremental',
  auto_refresh_enable = 'false',
  computing_resource = 'serverless'
) AS
SELECT
  o.order_id,
  o.user_id,
  c.user_name,
  o.amount
FROM orders o
-- FOR SYSTEM_TIME AS OF PROCTIME() identifies this as a dimension table join
LEFT JOIN customers FOR SYSTEM_TIME AS OF PROCTIME() c
  ON o.user_id = c.user_id;

In a dimension table join, changes to the dimension table affect only subsequent data — historical results are not retroactively updated. Dimension table join semantics are documented in CREATE DYNAMIC TABLE. To correct historical joined data, run a REFRESH OVERWRITE:

REFRESH OVERWRITE TABLE order_customer WITH (refresh_mode = 'incremental');

In practice, combine dimension table joins with periodic REFRESH OVERWRITE operations to tolerate small amounts of missing dimension data. REFRESH OVERWRITE is resource-intensive — do not run it frequently. If dimension table data timeliness is critical, use a multi-table join instead.

Tune dual-stream joins (multi-table joins)

Avoid small tables in dual-stream joins

Avoid using small tables with frequently changing data in dual-stream joins. A single row in a small table can match many rows in a large table, so changing a few dozen rows can force massive updates to intermediate and final results, negating the benefits of incremental refresh.

If this happens, use a dimension table join when its semantics are acceptable — for example, when the updated fields in the small table are not used in the join. Alternatively, join the small table at query time.

Example: a three-table join of orders, lineitems, and categories.

CREATE TABLE orders (
  order_id BIGINT NOT NULL,
  user_id BIGINT,
  PRIMARY KEY (order_id)
);
INSERT INTO orders SELECT i, i%100000 FROM GENERATE_SERIES(1, 1000000) AS i;

CREATE TABLE lineitems (
  lineitem_id BIGINT NOT NULL,
  order_id BIGINT,
  product_id BIGINT,
  category_id BIGINT,
  PRIMARY KEY (lineitem_id)
);
INSERT INTO lineitems SELECT i, i%1000000, i%100000, i%1000 FROM GENERATE_SERIES(1, 10000000) AS i;

CREATE TABLE categories (
  category_id BIGINT NOT NULL,
  category_name TEXT,
  PRIMARY KEY (category_id)
);
INSERT INTO categories SELECT i, 'category_' || i FROM GENERATE_SERIES(1, 1000) AS i;

CREATE DYNAMIC TABLE order_detail
WITH (
  freshness = '1 minutes',
  auto_refresh_mode = 'incremental',
  auto_refresh_enable = 'false',
  computing_resource = 'serverless'
) AS
SELECT
  o.order_id,
  o.user_id,
  l.lineitem_id,
  l.product_id,
  l.category_id,
  c.category_name
FROM orders o
LEFT JOIN lineitems l ON o.order_id = l.order_id
LEFT JOIN categories c ON l.category_id = c.category_id;

The categories table has only 1,000 rows, but if 500 rows change, 5 million related rows in the result table also change — negating the advantage of incremental refresh.

Because the join only uses category_name, which is rarely updated, you can use a dimension table join for categories to prevent this issue.

If the small table data changes frequently and must be reflected in query results, join the small table at query time through a view:

CREATE DYNAMIC TABLE order_detail_dt
WITH (
  freshness = '1 minutes',
  auto_refresh_mode = 'incremental',
  auto_refresh_enable = 'false',
  computing_resource = 'serverless'
) AS
SELECT
  o.order_id,
  o.user_id,
  l.lineitem_id,
  l.product_id,
  l.category_id
FROM orders o
LEFT JOIN lineitems l ON o.order_id = l.order_id;

CREATE VIEW order_detail AS
SELECT
  dt.*,
  c.category_name
FROM order_detail_dt dt
LEFT JOIN categories c ON dt.category_id = c.category_id;

Tune the execution plan

Join execution order significantly impacts performance, and the auto-generated plan is not always optimal. Optimize the order by adjusting your SQL or using a Plan Hint:

Multi-table joins: Join small tables first to reduce intermediate state size.
Large-small table joins: Make the small table the Build Side and the large table the Probe Side to avoid OOM errors during the initial refresh.

Example: Optimize a three-table join between line items, products, and categories.

Step 1: View and analyze the execution plan

-- 1. Prepare the data
-- lineitems table (large)
CREATE TABLE lineitems (
  lineitem_id BIGINT NOT NULL,
  order_id BIGINT,
  product_id BIGINT,
  PRIMARY KEY (lineitem_id)
);

-- products table (medium)
CREATE TABLE products (
  product_id BIGINT NOT NULL,
  product_name TEXT,
  category_id BIGINT,
  PRIMARY KEY (product_id)
);

-- categories table (small)
CREATE TABLE categories (
  category_id BIGINT NOT NULL,
  category_name TEXT,
  PRIMARY KEY (category_id)
);

-- 2. Create the Dynamic Table
CREATE DYNAMIC TABLE lineitem_detail
WITH (
  freshness = '1 minutes',
  auto_refresh_mode = 'incremental',
  auto_refresh_enable = 'false',
  computing_resource = 'serverless'
) AS
SELECT
  l.order_id,
  l.lineitem_id,
  p.product_id,
  p.product_name,
  c.category_id,
  c.category_name
FROM lineitems l
LEFT JOIN products p ON l.product_id = p.product_id
LEFT JOIN categories c ON p.category_id = c.category_id;

-- 3. View the execution plan
EXPLAIN REFRESH TABLE lineitem_detail WITH (REFRESH_MODE = 'incremental');

The execution plan output is as follows:

QUERY PLAN
->  Gather  (id=100005, fragmentId=8, dagId=8)
       ->  Hybrid DML  (id=13, fragmentId=7, dagId=7)
             ->  Redistribution  (id=100004, fragmentId=7, dagId=7)
                   Hash Key: lineitems.lineitem_id, products.product_id, categories.category_id, (row_key_pb_encoding(lineitems.product_id, products.category_id))
                   ->  Project  (id=11, fragmentId=5, dagId=5)
                         ->  Incremental Hash Join Node  (id=10, fragmentId=5, dagId=5)
                               Hash Cond: (products.category_id = categories.category_id)
                               ->  Redistribution  (id=100003, fragmentId=5, dagId=5)
                                     Hash Key: products.category_id
                                     ->  Incremental Hash Join Node  (id=6, fragmentId=3, dagId=3)
                                           Hash Cond: (lineitems.product_id = products.product_id)
                                           ->  Redistribution  (id=100002, fragmentId=3, dagId=3)
                                                 Hash Key: lineitems.product_id
                                                 ->  Local Gather  (id=2, fragmentId=1, dagId=1)
                                                       ->  NiagaraScanNode on public.lineitems  (id=1, fragmentId=2, dagId=1)
                                           ->  Local Gather  (id=5, fragmentId=3, dagId=3)
                                                 ->  NiagaraScanNode on public.products  (id=4, fragmentId=4, dagId=3)
                               ->  Local Gather  (id=9, fragmentId=5, dagId=5)
                                     ->  NiagaraScanNode on public.categories  (id=8, fragmentId=6, dagId=5)
 RefreshMode: INCREMENTAL

The execution plan shows the following join order:

lineitems (Probe Side) is joined with products (Build Side) to compute the intermediate result lineitem_product.
lineitem_product (Probe Side) is joined with categories (Build Side) to compute the final result.

Step 2: Delay the join with the large table

The intermediate result lineitem_product is large, creating high state storage overhead. Instead, join products and categories first to produce a smaller intermediate result and reduce overall cost.

DROP TABLE lineitem_detail;
CREATE DYNAMIC TABLE lineitem_detail_opt
WITH (
  freshness = '1 minutes',
  auto_refresh_mode = 'incremental',
  auto_refresh_enable = 'false',
  computing_resource = 'serverless'
) AS
WITH pc AS (
  SELECT
    p.product_id,
    p.product_name,
    c.category_id,
    c.category_name
  FROM products p
  LEFT JOIN categories c ON p.category_id = c.category_id
)
SELECT
  l.order_id,
  l.lineitem_id,
  pc.product_id,
  pc.product_name,
  pc.category_id,
  pc.category_name
FROM lineitems l
LEFT JOIN pc ON l.product_id = pc.product_id;

Verify the new execution plan confirms that products and categories are joined first:

EXPLAIN REFRESH TABLE lineitem_detail_opt WITH (REFRESH_MODE = 'incremental');

The execution plan output is as follows:

QUERY PLAN
 ->  Gather  (id=100006, fragmentId=9, dagId=9)
       ->  Hybrid DML  (id=16, fragmentId=8, dagId=8)
             ->  Redistribution  (id=100005, fragmentId=8, dagId=8)
                   Hash Key: lineitems.lineitem_id, products.product_id, categories.category_id, (row_key_pb_encoding(lineitems.product_id, products.category_id))
                   ->  Project  (id=14, fragmentId=7, dagId=7)
                         ->  Project  (id=13, fragmentId=7, dagId=7)
                               ->  Incremental Hash Join Node  (id=12, fragmentId=7, dagId=7)
                                     Hash Cond: (products.product_id = lineitems.product_id)
                                     ->  Redistribution  (id=100003, fragmentId=7, dagId=7)
                                           Hash Key: products.product_id
                                           ->  Project  (id=7, fragmentId=3, dagId=3)
                                                 ->  Incremental Hash Join Node  (id=6, fragmentId=3, dagId=3)
                                                       Hash Cond: (products.category_id = categories.category_id)
                                                       ->  Redistribution  (id=100002, fragmentId=3, dagId=3)
                                                             Hash Key: products.category_id
                                                             ->  Local Gather  (id=2, fragmentId=1, dagId=1)
                                                                   ->  NiagaraScanNode on public.products  (id=1, fragmentId=2, dagId=1)
                                                       ->  Local Gather  (id=5, fragmentId=3, dagId=3)
                                                             ->  NiagaraScanNode on public.categories  (id=4, fragmentId=4, dagId=3)
                                     ->  Redistribution  (id=100004, fragmentId=7, dagId=7)
                                           Hash Key: lineitems.product_id
                                           ->  Local Gather  (id=9, fragmentId=5, dagId=5)
                                                 ->  NiagaraScanNode on public.lineitems  (id=8, fragmentId=6, dagId=5)
 RefreshMode: INCREMENTAL

Step 3: Use a Plan Hint to specify the Build and Probe sides

In the optimized plan, lineitems is still the Build Side. Using a large table as the Build Side can cause OOM errors. Use a Plan Hint to swap the sides:

DROP TABLE lineitem_detail_opt;
CREATE DYNAMIC TABLE lineitem_detail
WITH (
  freshness = '1 minutes',
  auto_refresh_mode = 'incremental',
  auto_refresh_enable = 'false',
  computing_resource = 'serverless'
) AS
WITH pc AS (
  SELECT
    p.product_id,
    p.product_name,
    c.category_id,
    c.category_name
  FROM products p
  LEFT JOIN categories c ON p.category_id = c.category_id
)
SELECT /*+HINT Leading(l pc) */
  l.order_id,
  l.lineitem_id,
  pc.product_id,
  pc.product_name,
  pc.category_id,
  pc.category_name
FROM lineitems l
LEFT JOIN pc ON l.product_id = pc.product_id;

View the final execution plan:

EXPLAIN REFRESH TABLE lineitem_detail WITH (REFRESH_MODE = 'incremental');

The execution plan output is as follows:

QUERY PLAN
Gather  (cost=0.00..15.01 rows=4 width=56)
   ->  Sink Table  (cost=0.00..15.01 rows=4 width=56)
         ->  Redistribution  (cost=0.00..15.01 rows=4 width=56)
               Hash Key: lineitems.lineitem_id, products.product_id, categories.category_id, (pg_catalog.row_key_pb_encoding(lineitems.product_id, products.category_id))
               ->  Project  (cost=0.00..15.01 rows=4 width=56)
                     ->  Hash Left Join  (cost=0.00..15.01 rows=4 width=64)
                           Hash Cond: (lineitems.product_id = products.product_id)
                           ->  Redistribution  (cost=0.00..5.00 rows=1 width=24)
                                 Hash Key: lineitems.product_id
                                 ->  Local Gather  (cost=0.00..5.00 rows=1 width=24)
                                       ->  Seq Scan on lineitems  (cost=0.00..5.00 rows=1 width=24)
                           ->  Hash  (cost=10.00..10.00 rows=2 width=40)
                                 ->  Redistribution  (cost=0.00..10.00 rows=2 width=40)
                                       Hash Key: products.product_id
                                       ->  Hash Left Join  (cost=0.00..10.00 rows=2 width=40)
                                             Hash Cond: (products.category_id = categories.category_id)
                                             ->  Redistribution  (cost=0.00..5.00 rows=1 width=24)
                                                   Hash Key: products.category_id
                                                   ->  Local Gather  (cost=0.00..5.00 rows=1 width=24)
                                                         ->  Seq Scan on products  (cost=0.00..5.00 rows=1 width=24)
                                             ->  Hash  (cost=5.00..5.00 rows=1 width=16)
                                                   ->  Local Gather  (cost=0.00..5.00 rows=1 width=16)
                                                         ->  Seq Scan on categories  (cost=0.00..5.00 rows=1 width=16)
 Optimizer: HQO version 4.1.0

The final plan is ideal: products and categories are joined first to minimize intermediate state, then joined with lineitems as the Probe Side to prevent high memory usage.

Test environment: 32 Compute Units (CUs). Data specifications:

lineitems: 100,000,000 rows.
products: 500,000 rows.
categories: 10,000 rows.

The test results are as follows:

	Initial refresh (full)			Incremental refresh (1% increment)			State table size
	E2E time	Memory	CPU Time	E2E time	Memory	CPU Time	State table size
Before optimization	52.8 s	27 GB	361 s	1254 ms	412 MB	8404 ms	5247 MB
After optimization	49.9 s	22 GB	255 s	911 ms	389 MB	6886 ms	2003 MB

Modify a Dynamic Table definition online

To modify a Dynamic Table definition without disrupting queries, use ALTER DYNAMIC TABLE table_name AS SQL for minor changes. This overwrites the table during the next refresh with the new definition. This method has restrictions — refer to the ALTER DYNAMIC TABLE documentation for details.

If ALTER DYNAMIC TABLE conditions are not met and no other Dynamic Tables depend on the table, use a view-based rebuild. Example for the orders_agg Dynamic Table:

-- 1. Create a Dynamic Table with the new definition. The name must be different from the old one.
CREATE DYNAMIC TABLE orders_agg_dt
WITH (
  ...
) AS ...;

-- 2. Wait for the first refresh to complete and confirm that the new Dynamic Table contains data.
SELECT * FROM orders_agg_dt LIMIT 1;

-- 3. In a transaction, create a view to atomically replace the original Dynamic Table.
BEGIN;
DROP TABLE orders_agg;
CREATE VIEW orders_agg AS SELECT * FROM orders_agg_dt;
END;

Use cases for incremental refresh

Table creation and initial refresh

GUCs for Dynamic Table refresh

Tuning incremental refresh

Append-only optimization

Optimize Group By/Join key order

Dimension table joins

Tune dual-stream joins (multi-table joins)

Avoid small tables in dual-stream joins

Tune the execution plan

Modify a Dynamic Table definition online

Related documents