Full-text search for DLF Paimon tables - Data Lake Formation (DLF) - Alibaba Cloud Help Center

Overview

Full-text search uses the Tantivy engine to provide inverted-index-based search on STRING columns in Paimon append-only tables. Common use cases include log search, document retrieval, and content filtering.

Feature	Description
Index engine	Tantivy (implemented in Rust, called through JNI/Python bindings)
Supported column types	STRING (VARCHAR / CHAR)
Search types	Keyword matching, boolean queries (AND/OR)
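To make the idea of inverted-index keyword matching concrete, the following is a minimal pure-Python sketch. It is not how Tantivy or DLF implements the index. The function names `build_inverted_index` and `search` are hypothetical, and tokenization is simplified to lowercase whitespace splitting.

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Map each lowercase token to the set of row positions that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        if text is None:
            # Rows with no text contribute nothing to the index.
            continue
        for token in text.lower().split():
            index[token].add(row_id)
    return index

def search(index, query, operator="or"):
    """Return sorted row positions matching any ("or") or all ("and") query tokens."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    if not postings:
        return []
    if operator == "and":
        matched = set.intersection(*postings)
    else:
        matched = set.union(*postings)
    return sorted(matched)

rows = [
    "apache paimon is a lake storage format for big data",
    "flink is a stream processing engine for real time analytics",
    None,
    "ray data provides distributed data processing capabilities",
]
index = build_inverted_index(rows)
print(search(index, "paimon"))                  # [0]
print(search(index, "data processing", "and"))  # [3]
print(search(index, "data processing", "or"))   # [0, 1, 3]
```

Because matching is based on exact tokens, a query for `storage` does not find a row that only mentions `warehouse`. That is the practical difference between keyword search and semantic search.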

Limitations

Item	Description
Table type	Only append-only tables are supported.
Column type	Only STRING (VARCHAR / CHAR) columns can be indexed.
Indexed columns	Each table supports a full-text index on only one STRING column.
Index building	Indexes are built asynchronously by DLF. Wait for the build to complete before searching the data.
Search precision	Full-text search uses inverted-index keyword matching, not semantic search. For semantic search, see Vector search.
NULL values	Rows where the indexed column is NULL are excluded from both index building and search results.
Tokenizer	The following tokenizers are supported: `default`, `simple`, `whitespace`, `raw`, `ngram`, and `jieba`. DLF uses `default` for automatic index builds.
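The tokenizer determines which terms end up in the index, and therefore which queries can match. The sketch below only approximates what tokenizers with these names commonly do. The exact behavior of the DLF and Tantivy tokenizers may differ, and all function names here are hypothetical.

```python
import re

def whitespace_tokens(text):
    """Split on runs of whitespace only; case and punctuation are kept."""
    return text.split()

def simple_tokens(text):
    """Split into runs of word characters and lowercase each token (an assumption)."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def raw_tokens(text):
    """Keep the whole value as a single token."""
    return [text]

def ngram_tokens(text, n=3):
    """Emit every run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sample = "Apache Paimon-1.5"
print(whitespace_tokens(sample))  # ['Apache', 'Paimon-1.5']
print(simple_tokens(sample))      # ['apache', 'paimon', '1', '5']
print(raw_tokens(sample))         # ['Apache Paimon-1.5']
print(ngram_tokens("paimon"))     # ['pai', 'aim', 'imo', 'mon']
```

Under these assumptions, a query for `paimon` would match the sample value with the word-splitting tokenizer but not with the whitespace or raw variants, because those keep `Paimon-1.5` or the whole string as one term.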

Prerequisites

Verify that your environment meets the following requirements:

PyPaimon: pypaimon-1.5.dev20260727 is not yet available on public PyPI. Download the package from this document. The runtime environment must be Linux x86_64 (glibc >= 2.28).
1. Install pypaimon-1.5.dev20260727.tar.gz with full-text dependencies:
```
pip install 'pypaimon-1.5.dev20260727.tar.gz[full-text]'
```
2. (Optional) For DuckDB output, also install:
```
pip install duckdb
```

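Spark SQL: Use Paimon Ali 1-ali-29.1. Upload the three JARs referenced below to OSS, then set the following Spark configuration, which excludes the built-in paimon module and loads the uploaded JARs: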

spark.emr.serverless.excludedModules              paimon
spark.emr.serverless.user.defined.jars            oss://<your-bucket>/paimon-ali-emr-spark-3.5-1-ali-29.1.jar,oss://<your-bucket>/paimon-full-text-1-ali-29.1.jar,oss://<your-bucket>/paimon-full-text-index-0.1.0.jar

Create a full-text-indexed table

Create with Spark SQL

CREATE TABLE articles (
    id INT,
    title STRING,
    content STRING
) TBLPROPERTIES (
    'row-tracking.enabled' = 'true',
    'data-evolution.enabled' = 'true',
    'morax.full-text-index.enabled' = 'true',
    'global-index.full-text.index-column' = 'content'
);

Table property reference

Property	Description
`row-tracking.enabled = true`	Enables row-level tracking, which links index entries to data rows.
`data-evolution.enabled = true`	Enables data evolution, which allows incremental index building and ongoing index maintenance.
`morax.full-text-index.enabled = true`	Enables automatic full-text index scheduling.
`global-index.full-text.index-column`	Name of the column to index. Must be of type STRING.

Insert data and trigger index building

The following INSERT statement works with both Flink SQL and Spark SQL:

INSERT INTO articles VALUES
    (1, 'lake storage', 'apache paimon is a lake storage format for big data'),
    (2, 'stream engine', 'flink is a stream processing engine for real time analytics'),
    (3, 'data evolution', 'paimon supports data evolution and row tracking features'),
    (4, 'query engine', 'spark sql can query paimon tables directly with high performance'),
    (5, 'distributed', 'ray data provides distributed data processing capabilities');

After you insert data:

DLF automatically schedules the full-text index build.
Index building is asynchronous. Once complete, queries automatically use the index for faster searches.
To check the index build progress, go to the target Catalog > target table > Files tab in the DLF console. After the build completes, you can see the full-text index file (for example, tantivy-global-index-<UUID>.index).
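Because the build is asynchronous, a script that searches right after writing may need to wait. The following is a minimal polling sketch. `wait_for_results` is a hypothetical helper, not part of PyPaimon, and `run_search` stands for any callable you supply that runs a search and returns the matching rows. An empty result can also mean that nothing matches the query, so treat this only as a heuristic.

```python
import time

def wait_for_results(run_search, timeout_s=600.0, poll_interval_s=30.0):
    """Call run_search() until it returns a non-empty result or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        rows = run_search()
        if rows:
            return rows
        if time.monotonic() >= deadline:
            raise TimeoutError("search returned no rows before the timeout")
        time.sleep(poll_interval_s)

# Stand-in for a real search: empty for the first two calls, then one row.
attempts = []

def fake_search():
    attempts.append(1)
    return [] if len(attempts) < 3 else [{"id": 1}]

rows = wait_for_results(fake_search, timeout_s=5.0, poll_interval_s=0.0)
print(len(attempts), rows)  # 3 [{'id': 1}]
```

For a definitive check, rely on the Files tab in the DLF console as described above.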

Set the full-text index check interval

morax.full-text-index.check-interval sets how often DLF checks whether a full-text index build task needs to be submitted. The default is 1h. To get newly written data into the full-text index sooner, shorten the interval. For example:

ALTER TABLE articles SET TBLPROPERTIES (
    'morax.full-text-index.check-interval' = '5min'
);

Note

This property does not consume CUs by itself. CUs are consumed only when DLF detects new data to index and submits a build task. A shorter check interval can trigger index tasks more often and increase CU consumption. Choose a value based on your write frequency, index freshness requirements, and CU budget. Setting the interval to 5min only means DLF checks every 5 minutes. It does not guarantee that the index finishes building within 5 minutes.
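The trade-off in the note can be estimated with simple arithmetic. The sketch below assumes that a build task starts as soon as a check detects new data, and that you have measured a typical build duration yourself. The helper names are hypothetical, and the parser handles only the `s`, `min`, and `h` suffixes for illustration; it does not define which formats DLF accepts.

```python
import re

_UNIT_SECONDS = {"s": 1, "min": 60, "h": 3600}

def parse_interval(value):
    """Convert strings such as '5min' or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)\s*(s|min|h)", value.strip())
    if match is None:
        raise ValueError(f"unsupported interval: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]

def worst_case_delay(check_interval, build_seconds):
    """Upper bound in seconds: a full wait for the next check plus the build itself."""
    return parse_interval(check_interval) + build_seconds

def checks_per_day(check_interval):
    """Number of checks in 24 hours; each one may submit a build task."""
    return 86400 // parse_interval(check_interval)

# Example with an assumed build duration of 120 seconds.
print(worst_case_delay("1h", 120), checks_per_day("1h"))      # 3720 24
print(worst_case_delay("5min", 120), checks_per_day("5min"))  # 420 288
```

With these example numbers, moving from 1h to 5min cuts the worst-case delay from about 62 minutes to 7 minutes, while the number of checks that can submit a build task rises from 24 to 288 per day.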

Run full-text searches

Option 1: Search with DLF Data Discovery

On the AI Center > Data Discovery page in the DLF console, select the target Catalog and run full-text searches directly. For example, search the content column for records that contain paimon and return the top 10 results:

SELECT *
FROM full_text_search(
    'default.articles',
    'content',
    'paimon',
    10
);

In Data Discovery, pass the search text directly as the third parameter, not the JSON expression that PyPaimon and Spark SQL use. The table name supports the db.table and catalog.db.table formats. If the full-text index has not finished building, the query may return no results. For more information, see Data discovery.

Option 2: Search with PyPaimon

PyPaimon lets you run full-text searches without a Spark cluster.

Initialize the catalog

Set up a catalog connection to DLF before running searches. For the full parameter reference, see PyPaimon and Ray Data.

from pypaimon import CatalogFactory

CATALOG_OPTIONS = {
    "metastore": "rest",
    "uri": "http://<DLF-ENDPOINT>",  # For VPC access, use http://<REGION>-vpc.dlf.aliyuncs.com
    "warehouse": "<YOUR-CATALOG>",
    "token.provider": "dlf",
    "dlf.region": "<REGION-ID>",
    "dlf.access-key-id": "<ACCESS-KEY-ID>",
    "dlf.access-key-secret": "<ACCESS-KEY-SECRET>",
    "dlf.oss-endpoint": "<OSS-ENDPOINT>",
}

catalog = CatalogFactory.create(CATALOG_OPTIONS)
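To keep the AccessKey pair out of source code, the options can be assembled from environment variables. This is a sketch only: the environment variable names and the `catalog_options_from_env` helper are hypothetical, while the option keys are the ones shown above.

```python
import os

# Hypothetical environment variable names; use whatever your deployment defines.
_ENV_TO_OPTION = {
    "DLF_URI": "uri",
    "DLF_CATALOG": "warehouse",
    "DLF_REGION": "dlf.region",
    "DLF_ACCESS_KEY_ID": "dlf.access-key-id",
    "DLF_ACCESS_KEY_SECRET": "dlf.access-key-secret",
    "DLF_OSS_ENDPOINT": "dlf.oss-endpoint",
}

def catalog_options_from_env(env=None):
    """Build the catalog options dict, failing early if any variable is unset."""
    env = os.environ if env is None else env
    missing = sorted(name for name in _ENV_TO_OPTION if not env.get(name))
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    options = {"metastore": "rest", "token.provider": "dlf"}
    for name, option in _ENV_TO_OPTION.items():
        options[option] = env[name]
    return options

# Demonstration with placeholder values instead of real credentials.
demo_env = {name: f"<{name}>" for name in _ENV_TO_OPTION}
options = catalog_options_from_env(demo_env)
print(options["warehouse"])  # <DLF_CATALOG>

# With the real variables set:
# catalog = CatalogFactory.create(catalog_options_from_env())
```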

Basic usage

Run a full-text search to find matching rows:

table = catalog.get_table('default.articles')

builder = table.new_full_text_search_builder()
builder.with_query(
    'content',
    '{"match":{"query":"paimon"}}'
)
builder.with_limit(3)
result = builder.execute_local()

Read the matching rows from the table by using the search results:

read_builder = table.new_read_builder()
read_builder = read_builder.with_projection(['id', 'title', 'content'])
scan = read_builder.new_scan().with_global_index_result(result)
splits = scan.plan().splits()
table_read = read_builder.new_read()
df = table_read.to_pandas(splits)
print(df)

Sample output:

id	title	content
1	lake storage	apache paimon is a lake storage format for big data
3	data evolution	paimon supports data evolution and row tracking features
4	query engine	spark sql can query paimon tables directly with high performance

Retrieve relevance scores

# Continues from the builder and result variables above
score_fn = result.score_getter()
for row_id in result.results():
    print(f"row_id={row_id}, score={score_fn(row_id)}")

Sample output:

row_id=0, score=1.0508
row_id=2, score=1.1567
row_id=3, score=1.0508
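The scores can be used to order matches by relevance. The sketch below assumes that a higher score indicates a stronger match. `rank_by_score` is a hypothetical helper, and the dictionary stands in for the search result by reusing the sample values above.

```python
def rank_by_score(row_ids, score_fn):
    """Return (row_id, score) pairs ordered from highest to lowest score."""
    pairs = [(row_id, score_fn(row_id)) for row_id in row_ids]
    # sorted() is stable, so rows with equal scores keep their original order.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Stand-in data copied from the sample output above.
sample_scores = {0: 1.0508, 2: 1.1567, 3: 1.0508}
ranked = rank_by_score(sample_scores.keys(), sample_scores.get)
print(ranked)  # [(2, 1.1567), (0, 1.0508), (3, 1.0508)]

# With a real search result:
# ranked = rank_by_score(result.results(), result.score_getter())
```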

Boolean queries

Use the AND operator to require all keywords:

builder = table.new_full_text_search_builder()
builder.with_query(
    'content',
    '{"match":{"query":"data processing","operator":"And"}}'
)
builder.with_limit(10)
result = builder.execute_local()

Sample output:

id	content
5	ray data provides distributed data processing capabilities

Use the OR operator (default) to match any keyword:

builder = table.new_full_text_search_builder()
builder.with_query(
    'content',
    '{"match":{"query":"spark flink","operator":"Or"}}'
)
builder.with_limit(10)
result = builder.execute_local()
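Hand-written JSON strings break easily when the search text contains quotation marks. The following sketch builds the same query expressions with the standard `json` module. `match_query` is a hypothetical helper, and it accepts only the `And` and `Or` spellings shown in the examples above.

```python
import json

def match_query(text, operator=None):
    """Return a JSON match expression such as {"match":{"query":"paimon"}}."""
    body = {"query": text}
    if operator is not None:
        if operator not in ("And", "Or"):
            raise ValueError(f"unsupported operator: {operator!r}")
        body["operator"] = operator
    # Compact separators reproduce the formatting used in this document.
    return json.dumps({"match": body}, separators=(",", ":"), ensure_ascii=False)

print(match_query("paimon"))
# {"match":{"query":"paimon"}}
print(match_query("data processing", "And"))
# {"match":{"query":"data processing","operator":"And"}}

# Usage with the builder shown above:
# builder.with_query('content', match_query('spark flink', 'Or'))
```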

Output formats

PyPaimon supports multiple output formats:

# PyArrow Table
arrow_table = table_read.to_arrow(splits)
print(arrow_table)

# DuckDB
conn = table_read.to_duckdb(splits, 'articles')
print(conn.execute('SELECT * FROM articles').fetchdf())

Option 3: Search with Spark SQL

-- Search for articles containing "paimon" and return the top 3 results
SELECT *
FROM full_text_search(
    'articles',
    'content',
    '{"match":{"query":"paimon"}}',
    3
);

Sample output:

id	content
1	apache paimon is a lake storage format for big data
3	paimon supports data evolution and row tracking features
4	spark sql can query paimon tables directly with high performance

Combine full-text search with projection and filtering:

SELECT id, title
FROM full_text_search(
    'default.articles',
    'content',
    '{"match":{"query":"data"}}',
    10
)
WHERE id > 1;

The full_text_search function accepts the following parameters:

Parameter	Type	Description
table_name	STRING	Table name. Supports `db.table` and `catalog.db.table` formats.
column_name	STRING	Text column to search. Must be of type STRING.
query	STRING	A JSON-formatted full-text search expression.
limit	INT	Maximum number of results to return.