Synchronized materialized views

更新时间:
复制 MD 格式

You can use a materialized view to accelerate queries and layer your data warehouse. With a synchronized materialized view, all data changes in the base table are automatically synchronized to the materialized view. You do not need to manually run a refresh command to keep the materialized view updated. Synchronized materialized views have low management and update costs, making them ideal for transparently accelerating single-table aggregation queries in real-time scenarios.

Background

In StarRocks, a synchronized materialized view can only be created on a single base table in the Default Catalog. It acts as a special type of query acceleration index.

Key concepts

Concept

Description

base table

The table from which the materialized view derives its data.

For a synchronized materialized view, the base table must be a single internal table in the Default Catalog. StarRocks supports creating synchronized materialized views on detail tables and aggregate tables.

refresh

Data in a synchronized materialized view is automatically updated when data is loaded into the base table. No manual refresh command is required.

query rewrite

When you query a base table that has an associated materialized view, the system automatically determines if it can use the view's pre-computed results. If so, it reads data directly from the view, avoiding redundant computations and saving both time and system resources.

Synchronized materialized views support query rewrite for certain aggregate operators. See Aggregate function matching for details.

Materialized view comparison

Materialized view

Single-table aggregation

Multi-table join

Query rewrite

Refresh strategy

Base table

asynchronous materialized view

Yes

Yes

Yes

  • Asynchronous

  • Manual

Supports multiple base tables from:

  • Default Catalog

  • External Catalog

  • Existing asynchronous materialized views

  • Existing views

synchronized materialized view (Rollup)

Only for certain aggregate functions

No

Yes

Synchronous with data loading

Supports only a single base table in the Default Catalog

Aggregate function matching

When a query on a base table can be accelerated by a synchronized materialized view, it is automatically rewritten to use the view's pre-computed results. The following table maps the aggregate functions in the original query to those used to build the synchronized materialized view. You can select the appropriate aggregate functions to build the synchronized materialized view based on your business scenario.

Aggregate function in query

Aggregate function in materialized view

sum

sum

min

min

max

max

count

count

bitmap_union, bitmap_union_count, count(distinct)

bitmap_union

hll_raw_agg, hll_union_agg, ndv, approx_count_distinct

hll_union

Limitations

  • Synchronized materialized views support WHERE clauses as of v3.1.8.

  • Disaggregated storage and compute instances do not currently support synchronized materialized views.

Prerequisites

Before you create a synchronized materialized view, verify that it is needed to accelerate queries in your data warehouse.

The following example is based on the sales_records table, which contains the transaction ID (record_id), seller ID (seller_id), store ID (store_id), sale date (sale_date), and sale amount (sale_amt) for each transaction.

  1. Create the table and load data into it.

    CREATE DATABASE IF NOT EXISTS load_test;
    USE load_test;
    CREATE TABLE sales_records(
        record_id INT,
        seller_id INT,
        store_id INT,
        sale_date DATE,
        sale_amt BIGINT
    ) DISTRIBUTED BY HASH(record_id);
    INSERT INTO sales_records
    VALUES
        (001,01,1,"2022-03-13",8573),
        (002,02,2,"2022-03-14",6948),
        (003,01,1,"2022-03-14",4319),
        (004,03,3,"2022-03-15",8734),
        (005,03,3,"2022-03-16",4212),
        (006,02,2,"2022-03-17",9515);
  2. Measure query time.

    This business scenario requires frequent analysis of sales figures from different stores, so queries heavily use the sum() function, consuming significant system resources.

    SELECT store_id, SUM(sale_amt)
    FROM load_test.sales_records
    GROUP BY store_id;

    The output shows that the query takes 25 ms.

    The result set has three rows. The sum(sale_amt) values for store_id 1, 2, and 3 are 12892, 16463, and 12946, respectively.

  3. Run EXPLAIN to view the Query Profile for the query.

    EXPLAIN SELECT store_id, SUM(sale_amt)
    FROM load_test.sales_records
    GROUP BY store_id;

    In the output, the rollup field shows sales_records (the base table), which means the query was not accelerated by a materialized view.

Create a synchronized materialized view

Run the CREATE MATERIALIZED VIEW statement to create a synchronized materialized view for a specific query.

Note
  • Requirements for using aggregate functions:

    • The query must include a GROUP BY clause, and the SELECT list must include at least one grouping column.

    • A single aggregate function across multiple columns is not supported, such as SUM(a+b).

    • Multiple aggregate functions on the same column are not supported, such as SELECT SUM(a), MIN(a) FROM table.

  • The CREATE MATERIALIZED VIEW statement for a synchronized materialized view does not support JOIN operations.

  • Before dropping a column from a base table using ALTER TABLE DROP COLUMN, ensure it is not used by any synchronized materialized view. Otherwise, the operation will fail. To drop the column, you must first drop all referencing synchronized materialized views.

  • Creating too many synchronized materialized views for a table slows down data loading. When you load data into the base table, the synchronized materialized views and the base table are updated synchronously. If a base table has n synchronized materialized views, the data loading speed is similar to loading data into n tables.

  • Do not create multiple synchronized materialized views simultaneously. Wait for the current creation task to complete before starting the next one.

Based on the query above, the following example creates a synchronized materialized view for the sales_records table to group by store and sum the sales amount for each store.

CREATE MATERIALIZED VIEW store_amt AS
SELECT store_id, SUM(sale_amt)
FROM load_test.sales_records
GROUP BY store_id;

Check the build status

Creating a synchronized materialized view is an asynchronous operation. A successful return from the CREATE MATERIALIZED VIEW statement indicates that the creation task has been submitted successfully. Run SHOW ALTER MATERIALIZED VIEW to check the build status of synchronized materialized views in the current database.

USE load_test;
SHOW ALTER MATERIALIZED VIEW;

In the output, RollupIndexName shows the name of the synchronized materialized view (store_amt). A State of FINISHED indicates that the materialized view has been successfully created on the sales_records table.

Query a synchronized materialized view

A synchronized materialized view is essentially an index on the base table, not a physical table. To query it directly, you must use the [_SYNC_MV_] hint.

Note

Any aliases that you specify for the columns are ignored.

SELECT * FROM <mv_name> [_SYNC_MV_];-- Do not omit the brackets [] in the hint.

In this example, run the following command to query the synchronized materialized view.

USE load_test;
SELECT * FROM store_amt [_SYNC_MV_];

Accelerate queries with query rewrite

The new synchronized materialized view pre-computes and stores the results for the preceding query. Subsequent queries can read these results directly for faster execution. After the view is created, run the same query again to measure the performance improvement.

SELECT store_id, SUM(sale_amt)
FROM load_test.sales_records
GROUP BY store_id;

The output shows that the query now takes 17 ms, a significant reduction.

The query returns the same three rows as before: the sum(sale_amt) is 12892 for store_id 1, 16463 for store_id 2, and 12946 for store_id 3.

Verify materialized view hits

You can run EXPLAIN again to check whether the query hits the synchronized materialized view.

EXPLAIN SELECT store_id, SUM(sale_amt)
FROM load_test.sales_records
GROUP BY store_id;

In the output from the Query Profile, the rollup field now shows store_amt, confirming that the query used the synchronized materialized view.

View materialized view schema

You can run the DESC <tbl_name> ALL command to view the schema of a table and all its synchronized materialized views.

USE load_test;
DESC sales_records ALL;

Example output:

The result includes two index definitions. The first, for sales_records (type DUP_KEYS), lists the base table's five fields: record_id(int), seller_id(int), store_id(int), sale_date(date), and sale_amt(bigint). The second, for store_amt (type AGG_KEYS), lists the materialized view's two fields: store_id(int) and mv_sum_sale_amt(bigint). The Extra value for mv_sum_sale_amt is SUM, which indicates that the SUM aggregate function is used.

Drop a synchronized materialized view

You may need to drop a synchronized materialized view in the following situations:

  • Creation error: If an error occurs during creation, you must drop the failed view.

  • Performance degradation: A large number of synchronized materialized views can slow data loading. You may also have redundant views that are safe to remove.

  • Low usage: If query frequency is low and your workload can tolerate higher query latency, consider dropping it.

Drop a view during creation

To drop a synchronized materialized view that is being created, you can cancel the creation task. First, get the JobID of the creation task by following the instructions in Check the build status. Then, run the CANCEL ALTER command to cancel the task.

CANCEL ALTER TABLE ROLLUP FROM sales_records (12090);

Drop a created view

You can run the DROP MATERIALIZED VIEW <mv_name> command to drop an existing synchronized materialized view.

USE load_test;
DROP MATERIALIZED VIEW store_amt;

Best practices

Exact deduplication

The following example uses the advertiser_view_record table, which contains ad click details: click date (click_time), advertiser (advertiser), channel (channel), and user ID (user_id).

CREATE TABLE advertiser_view_record (
    click_time DATE,
    advertiser VARCHAR(10),
    channel VARCHAR(10),
    user_id INT
) DISTRIBUTED BY HASH(click_time);

In this scenario, you frequently run the following statement to query unique visitors (UV) for ad clicks.

SELECT advertiser, channel, count(distinct user_id)
FROM advertiser_view_record
GROUP BY advertiser, channel;

To accelerate exact deduplication queries, create a synchronized materialized view on the detail table and use the bitmap_union() function for pre-aggregation.

CREATE MATERIALIZED VIEW advertiser_uv AS
SELECT advertiser, channel, bitmap_union(to_bitmap(user_id)) AS user_bitmap
FROM advertiser_view_record
GROUP BY advertiser, channel;

After the synchronized materialized view is created, subsequent COUNT(DISTINCT user_id) queries are automatically rewritten to use the pre-aggregated bitmap data, allowing the query to efficiently hit the materialized view.

Approximate deduplication

Using the advertiser_view_record table as an example, if you need to perform approximate deduplication for UV queries, you can create a synchronized materialized view on the detail table and use the hll_union() function to pre-aggregate data.

CREATE MATERIALIZED VIEW advertiser_uv2 AS
SELECT advertiser, channel, hll_union(hll_hash(user_id)) AS user_hll
FROM advertiser_view_record
GROUP BY advertiser, channel;

Creating a prefix index

For example, the base table tableA contains columns k1, k2, and k3, where k1 and k2 are sort keys. If your workload includes queries with a WHERE clause on k3 (for example, WHERE k3 = x) and you want to use a prefix index to accelerate them, you can create a synchronized materialized view with k3 as its first column. This effectively makes k3 a prefix index, improving retrieval speed and reducing query response time.

CREATE MATERIALIZED VIEW k3_as_key AS
SELECT k3, k2, k1
FROM tableA;