Build full-text search indexes on Paimon tables with Tantivy

更新时间:
复制 MD 格式

DLF lets you create Tantivy-based full-text indexes on text columns of Paimon append-only tables and query them through PyPaimon or Spark SQL.

Overview

Full-text search uses the Tantivy engine to provide inverted-index-based search on STRING columns in Paimon append-only tables. Common use cases include log search, document retrieval, and content filtering.

Feature

Description

Index engine

Tantivy (implemented in Rust, called through JNI/Python bindings)

Supported column types

STRING (VARCHAR / CHAR)

Search types

Keyword matching, boolean queries (AND/OR)

Prerequisites

Verify that your environment meets the following requirements:

  • PyPaimon: Install the pypaimon and tantivy packages. The runtime environment must be Linux x86_64 (glibc >= 2.34, for example, Ubuntu 22.04 or later).

    1. Install pypaimon-1.5.dev20260608.tar.gz:

      pip install pypaimon-1.5.dev20260603.tar.gz
    2. Install tantivy-0.25.1-manylinux_x86_64.tar.gz:

      tar xzf tantivy-0.25.1-manylinux_x86_64.tar.gz
      pip install tantivy --no-index --find-links .
    3. (Optional) For DuckDB output, also install:

      pip install duckdb
  • Spark SQL: Add the following properties to your Spark configuration to load the JAR paimon-ali-emr-spark-3.5-fulltext-sample.jar.

    spark.emr.serverless.excludedModules                paimon
    spark.emr.serverless.user.defined.jars              oss://<your-bucket>/paimon-ali-emr-spark-3.5-fulltext-sample.jar

Create a full-text-indexed table

Create with Flink SQL

CREATE TABLE articles (
    id INT,
    title STRING,
    content STRING
) WITH (
    'row-tracking.enabled' = 'true',
    'data-evolution.enabled' = 'true',
    'morax.tantivy-fulltext-index.enabled' = 'true',
    'global-index.tantivy-fulltext.index-column' = 'content'
);

Create with Spark SQL

CREATE TABLE articles (
    id INT,
    title STRING,
    content STRING
) TBLPROPERTIES (
    'row-tracking.enabled' = 'true',
    'data-evolution.enabled' = 'true',
    'morax.tantivy-fulltext-index.enabled' = 'true',
    'global-index.tantivy-fulltext.index-column' = 'content'
);

Table property reference

Property

Description

row-tracking.enabled = true

Enables row-level tracking, which links index entries to data rows.

data-evolution.enabled = true

Enables data evolution, which allows incremental index building and ongoing index maintenance.

morax.tantivy-fulltext-index.enabled = true

Enables automatic full-text index scheduling.

global-index.tantivy-fulltext.index-column

Name of the column to index. Must be of type STRING.

Insert data and trigger index building

The following INSERT statement works with both Flink SQL and Spark SQL:

INSERT INTO articles VALUES
    (1, 'lake storage', 'apache paimon is a lake storage format for big data'),
    (2, 'stream engine', 'flink is a stream processing engine for real time analytics'),
    (3, 'data evolution', 'paimon supports data evolution and row tracking features'),
    (4, 'query engine', 'spark sql can query paimon tables directly with high performance'),
    (5, 'distributed', 'ray data provides distributed data processing capabilities');

After you insert data:

  • DLF automatically schedules the full-text index build.

  • Index building is asynchronous. Once complete, queries automatically use the index for faster searches.

  • To check the index build progress, go to the target Catalog > target table > Files tab in the DLF console. After the build completes, you can see the full-text index file (for example, tantivy-global-index-<UUID>.index).

Run full-text searches

Option 1: Search with PyPaimon

PyPaimon lets you run full-text searches without a Flink or Spark cluster.

Initialize the catalog

Set up a catalog connection to DLF before running searches. For the full parameter reference, see PyPaimon and Ray Data.

from pypaimon import CatalogFactory

CATALOG_OPTIONS = {
    "metastore": "rest",
    "uri": "http://<DLF-ENDPOINT>",  # For VPC access, use http://<REGION>-vpc.dlf.aliyuncs.com
    "warehouse": "<YOUR-CATALOG>",
    "token.provider": "dlf",
    "dlf.region": "<REGION-ID>",
    "dlf.access-key-id": "<ACCESS-KEY-ID>",
    "dlf.access-key-secret": "<ACCESS-KEY-SECRET>",
    "dlf.oss-endpoint": "<OSS-ENDPOINT>",
}

catalog = CatalogFactory.create(CATALOG_OPTIONS)

Basic usage

Run a full-text search to find matching rows:

table = catalog.get_table('default.articles')

builder = table.new_full_text_search_builder()
builder.with_text_column('content')
builder.with_query_text('paimon')
builder.with_limit(3)
result = builder.execute_local()

Read the matching rows from the table by using the search results:

read_builder = table.new_read_builder()
read_builder = read_builder.with_projection(['id', 'title', 'content'])
scan = read_builder.new_scan().with_global_index_result(result)
splits = scan.plan().splits()
table_read = read_builder.new_read()
df = table_read.to_pandas(splits)
print(df)

Sample output:

id

content

1

apache paimon is a lake storage format for big data

3

paimon supports data evolution and row tracking features

4

spark sql can query paimon tables directly with high performance

Retrieve relevance scores

# Continues from the builder and result variables above
score_fn = result.score_getter()
for row_id in result.results():
    print(f"row_id={row_id}, score={score_fn(row_id)}")

Sample output:

row_id=0, score=1.0508
row_id=2, score=1.1567
row_id=3, score=1.0508

Boolean queries

Use the AND operator to require all keywords:

builder = table.new_full_text_search_builder()
builder.with_text_column('content')
builder.with_query_operator('and')
builder.with_query_text('data processing')
builder.with_limit(10)
result = builder.execute_local()

Sample output:

id

content

5

ray data provides distributed data processing capabilities

Use the OR operator (default) to match any keyword:

builder = table.new_full_text_search_builder()
builder.with_text_column('content')
builder.with_query_operator('or')
builder.with_query_text('spark flink')
builder.with_limit(10)
result = builder.execute_local()

Output formats

PyPaimon supports multiple output formats:

# PyArrow Table
arrow_table = table_read.to_arrow(splits)
print(arrow_table)
# DuckDB
conn = table_read.to_duckdb(splits, 'articles')
print(conn.execute('SELECT * FROM articles').fetchdf())

Option 2: Search with Spark SQL

-- Search for articles containing "paimon" and return the top 3 results
SELECT * FROM full_text_search('articles', 'content', 'paimon', 3);

Sample output:

id

content

1

apache paimon is a lake storage format for big data

3

paimon supports data evolution and row tracking features

4

spark sql can query paimon tables directly with high performance

Combine full-text search with projection and filtering:

SELECT id, title FROM full_text_search('default.articles', 'content', 'data', 10)
  WHERE id > 1;

The full_text_search function accepts the following parameters:

Parameter

Type

Description

table_name

STRING

Table name. Supports db.table and catalog.db.table formats.

column_name

STRING

Text column to search. Must be of type STRING.

query_text

STRING

Search keywords.

limit

INT

Maximum number of results to return.

Limitations

Item

Description

Table type

Only append-only tables are supported.

Column type

Only STRING (VARCHAR / CHAR) columns can be indexed.

Indexed columns

Each table supports a full-text index on only one STRING column.

Index building

Indexes are built asynchronously through Flink batch jobs. Wait for the build to complete before searching the data.

Search precision

Full-text search uses inverted-index keyword matching, not semantic search. For semantic search, see Vector search.

NULL values

Rows where the indexed column is NULL are excluded from both index building and search results.

Tokenizer

The default Tantivy tokenizer is optimized for English text. Chinese tokenization and custom tokenizers are not yet supported.