DLF lets you create Tantivy-based full-text indexes on text columns of Paimon append-only tables and query them through PyPaimon or Spark SQL.
Overview
Full-text search uses the Tantivy engine to provide inverted-index-based search on STRING columns in Paimon append-only tables. Common use cases include log search, document retrieval, and content filtering.
|
Feature |
Description |
|
Index engine |
Tantivy (implemented in Rust, called through JNI/Python bindings) |
|
Supported column types |
STRING (VARCHAR / CHAR) |
|
Search types |
Keyword matching, boolean queries (AND/OR) |
Prerequisites
Verify that your environment meets the following requirements:
-
PyPaimon: Install the pypaimon and tantivy packages. The runtime environment must be Linux x86_64 (glibc >= 2.34, for example, Ubuntu 22.04 or later).
-
Install pypaimon-1.5.dev20260608.tar.gz:
pip install pypaimon-1.5.dev20260603.tar.gz -
Install tantivy-0.25.1-manylinux_x86_64.tar.gz:
tar xzf tantivy-0.25.1-manylinux_x86_64.tar.gz pip install tantivy --no-index --find-links . -
(Optional) For DuckDB output, also install:
pip install duckdb
-
-
Spark SQL: Add the following properties to your Spark configuration to load the JAR paimon-ali-emr-spark-3.5-fulltext-sample.jar.
spark.emr.serverless.excludedModules paimon spark.emr.serverless.user.defined.jars oss://<your-bucket>/paimon-ali-emr-spark-3.5-fulltext-sample.jar
Create a full-text-indexed table
Create with Flink SQL
CREATE TABLE articles (
id INT,
title STRING,
content STRING
) WITH (
'row-tracking.enabled' = 'true',
'data-evolution.enabled' = 'true',
'morax.tantivy-fulltext-index.enabled' = 'true',
'global-index.tantivy-fulltext.index-column' = 'content'
);
Create with Spark SQL
CREATE TABLE articles (
id INT,
title STRING,
content STRING
) TBLPROPERTIES (
'row-tracking.enabled' = 'true',
'data-evolution.enabled' = 'true',
'morax.tantivy-fulltext-index.enabled' = 'true',
'global-index.tantivy-fulltext.index-column' = 'content'
);
Table property reference
|
Property |
Description |
|
|
Enables row-level tracking, which links index entries to data rows. |
|
|
Enables data evolution, which allows incremental index building and ongoing index maintenance. |
|
|
Enables automatic full-text index scheduling. |
|
|
Name of the column to index. Must be of type STRING. |
Insert data and trigger index building
The following INSERT statement works with both Flink SQL and Spark SQL:
INSERT INTO articles VALUES
(1, 'lake storage', 'apache paimon is a lake storage format for big data'),
(2, 'stream engine', 'flink is a stream processing engine for real time analytics'),
(3, 'data evolution', 'paimon supports data evolution and row tracking features'),
(4, 'query engine', 'spark sql can query paimon tables directly with high performance'),
(5, 'distributed', 'ray data provides distributed data processing capabilities');
After you insert data:
-
DLF automatically schedules the full-text index build.
-
Index building is asynchronous. Once complete, queries automatically use the index for faster searches.
-
To check the index build progress, go to the target Catalog > target table > Files tab in the DLF console. After the build completes, you can see the full-text index file (for example,
tantivy-global-index-<UUID>.index).
Run full-text searches
Option 1: Search with PyPaimon
PyPaimon lets you run full-text searches without a Flink or Spark cluster.
Initialize the catalog
Set up a catalog connection to DLF before running searches. For the full parameter reference, see PyPaimon and Ray Data.
from pypaimon import CatalogFactory
CATALOG_OPTIONS = {
"metastore": "rest",
"uri": "http://<DLF-ENDPOINT>", # For VPC access, use http://<REGION>-vpc.dlf.aliyuncs.com
"warehouse": "<YOUR-CATALOG>",
"token.provider": "dlf",
"dlf.region": "<REGION-ID>",
"dlf.access-key-id": "<ACCESS-KEY-ID>",
"dlf.access-key-secret": "<ACCESS-KEY-SECRET>",
"dlf.oss-endpoint": "<OSS-ENDPOINT>",
}
catalog = CatalogFactory.create(CATALOG_OPTIONS)
Basic usage
Run a full-text search to find matching rows:
table = catalog.get_table('default.articles')
builder = table.new_full_text_search_builder()
builder.with_text_column('content')
builder.with_query_text('paimon')
builder.with_limit(3)
result = builder.execute_local()
Read the matching rows from the table by using the search results:
read_builder = table.new_read_builder()
read_builder = read_builder.with_projection(['id', 'title', 'content'])
scan = read_builder.new_scan().with_global_index_result(result)
splits = scan.plan().splits()
table_read = read_builder.new_read()
df = table_read.to_pandas(splits)
print(df)
Sample output:
|
id |
content |
|
1 |
apache paimon is a lake storage format for big data |
|
3 |
paimon supports data evolution and row tracking features |
|
4 |
spark sql can query paimon tables directly with high performance |
Retrieve relevance scores
# Continues from the builder and result variables above
score_fn = result.score_getter()
for row_id in result.results():
print(f"row_id={row_id}, score={score_fn(row_id)}")
Sample output:
row_id=0, score=1.0508
row_id=2, score=1.1567
row_id=3, score=1.0508
Boolean queries
Use the AND operator to require all keywords:
builder = table.new_full_text_search_builder()
builder.with_text_column('content')
builder.with_query_operator('and')
builder.with_query_text('data processing')
builder.with_limit(10)
result = builder.execute_local()
Sample output:
|
id |
content |
|
5 |
ray data provides distributed data processing capabilities |
Use the OR operator (default) to match any keyword:
builder = table.new_full_text_search_builder()
builder.with_text_column('content')
builder.with_query_operator('or')
builder.with_query_text('spark flink')
builder.with_limit(10)
result = builder.execute_local()
Output formats
PyPaimon supports multiple output formats:
# PyArrow Table
arrow_table = table_read.to_arrow(splits)
print(arrow_table)# DuckDB
conn = table_read.to_duckdb(splits, 'articles')
print(conn.execute('SELECT * FROM articles').fetchdf())
Option 2: Search with Spark SQL
-- Search for articles containing "paimon" and return the top 3 results
SELECT * FROM full_text_search('articles', 'content', 'paimon', 3);
Sample output:
|
id |
content |
|
1 |
apache paimon is a lake storage format for big data |
|
3 |
paimon supports data evolution and row tracking features |
|
4 |
spark sql can query paimon tables directly with high performance |
Combine full-text search with projection and filtering:
SELECT id, title FROM full_text_search('default.articles', 'content', 'data', 10)
WHERE id > 1;
The full_text_search function accepts the following parameters:
|
Parameter |
Type |
Description |
|
table_name |
STRING |
Table name. Supports |
|
column_name |
STRING |
Text column to search. Must be of type STRING. |
|
query_text |
STRING |
Search keywords. |
|
limit |
INT |
Maximum number of results to return. |
Limitations
|
Item |
Description |
|
Table type |
Only append-only tables are supported. |
|
Column type |
Only STRING (VARCHAR / CHAR) columns can be indexed. |
|
Indexed columns |
Each table supports a full-text index on only one STRING column. |
|
Index building |
Indexes are built asynchronously through Flink batch jobs. Wait for the build to complete before searching the data. |
|
Search precision |
Full-text search uses inverted-index keyword matching, not semantic search. For semantic search, see Vector search. |
|
NULL values |
Rows where the indexed column is NULL are excluded from both index building and search results. |
|
Tokenizer |
The default Tantivy tokenizer is optimized for English text. Chinese tokenization and custom tokenizers are not yet supported. |