Text search finds documents that contain specific terms or phrases and ranks results by relevance. Unlike vector search, which maximizes recall through semantic similarity, text search delivers controllable, explainable exact matches with deterministic keyword hits and filtering.
In generative AI applications, especially Retrieval-Augmented Generation (RAG), text search and vector search complement each other. Combined, they balance broad semantic reach with precise lexical accuracy, improving recall and explainability to provide accurate context for large language models.
Core text search features in SelectDB
Rich text operators
SelectDB provides full-text search operators that cover retrieval patterns from basic keyword matching to complex phrase queries.
Key operators include:
-
MATCH_ANY/MATCH_ALL: Matches any term (OR logic) or all terms (AND logic), ideal for general keyword searches. -
MATCH_PHRASE: Performs exact phrase matching, supporting custom term proximity (slop) and order control for proximity queries. -
MATCH_PHRASE_PREFIX: Matches the prefix of a phrase, useful for autocomplete and incremental search features. -
MATCH_REGEXP: Enables pattern-based text retrieval using a regular expression.
These operators can be used independently or combined within the SEARCH() function to build complex logical queries. For example:
-- Exact phrase search
SELECT * FROM docs WHERE content MATCH_PHRASE 'inverted index';
-- Prefix search
SELECT * FROM docs WHERE content MATCH_PHRASE_PREFIX 'data ware';
Custom tokenization
Tokenization directly impacts search precision and recall. SelectDB lets you build flexible tokenization pipelines by combining character filters, a tokenizer, and token filters to suit specific use cases.
Typical use cases include:
-
Custom character filtering: Pre-process text before tokenization by replacing, removing, or standardizing characters.
-
Choose a tokenizer: Select from various types, such as
standard,ngram,edge_ngram,keyword, andicu, to handle different languages and text structures. -
Apply token filters: Use filters like
lowercase,word_delimiter, andascii_foldingto normalize and refine the resulting tokens.
-- Example: Define a custom analyzer
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS keyword_lowercase
PROPERTIES (
"tokenizer" = "keyword",
"token_filter" = "asciifolding, lowercase"
);
-- Use the custom analyzer when creating a table
CREATE TABLE docs (
id BIGINT,
content TEXT,
INDEX idx_content (content) USING INVERTED PROPERTIES(
"analyzer" = "keyword_lowercase",
"support_phrase" = "true"
)
);
BM25 relevance scoring
SelectDB implements the BM25 (Best Matching 25) algorithm to score and rank full-text search results by relevance.
-
A probabilistic model based on term frequency (TF), inverse document frequency (IDF), and document length
-
Robust performance for both long and short documents
-
The weighting strategy can be adjusted using the
k1andbparameters
SELECT id, title, score() AS relevance
FROM docs
WHERE content MATCH_ANY 'real-time OLAP analysis'
ORDER BY relevance DESC
LIMIT 10;
SEARCH function: unified query entry (4.0+)
The SEARCH() function provides a unified syntax for text retrieval. It supports multi-column searches and Boolean logic to simplify complex queries:
SELECT id, title, score() AS relevance
FROM docs
WHERE SEARCH('title:Machine AND tags:ANY(database sql)')
ORDER BY relevance DESC
LIMIT 20;
Quick start
Step 1: Create a table with an inverted index
CREATE TABLE docs (
id BIGINT,
title STRING,
content STRING,
category STRING,
tags ARRAY<STRING>,
created_at DATETIME,
-- Text search index
INDEX idx_title(title) USING INVERTED PROPERTIES ("parser" = "chinese"),
INDEX idx_content(content) USING INVERTED PROPERTIES ("parser" = "chinese", "support_phrase" = "true"),
INDEX idx_category(category) USING INVERTED,
INDEX idx_tags(tags) USING INVERTED
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;
Step 2: Run text queries
-- Simple keyword search
SELECT * FROM docs WHERE content MATCH_ANY ' SelectDB';
-- Phrase search
SELECT * FROM docs WHERE content MATCH_PHRASE 'full-text search';
-- Use SEARCH for Boolean queries
SELECT * FROM docs
WHERE SEARCH('title:apache AND (category:database OR tags:ANY(sql nosql))');
-- Sort by relevance
SELECT id, title, score() AS relevance
FROM docs
WHERE content MATCH_ANY 'real-time analysis OLAP'
ORDER BY relevance DESC
LIMIT 10;
Hybrid search: text + vector
In RAG applications, combine text search with vector similarity for comprehensive retrieval:
-- Hybrid retrieval: Semantic similarity + keyword filtering
SELECT id, title, score() AS text_relevance
FROM docs
WHERE
-- Vector filter for semantic similarity
cosine_distance(embedding, [0.1, 0.2, ...]) < 0.3
-- Text filter for keyword constraints
AND SEARCH('title:search AND content:engine AND category:technology')
ORDER BY text_relevance DESC
LIMIT 10;
Managing inverted indexes
For a detailed introduction to inverted indexes, see Inverted Index. The following examples show common operations.
Create an index
-- Create an index when creating a table
CREATE TABLE t (
content STRING,
INDEX idx(content) USING INVERTED PROPERTIES ("parser" = "chinese")
);
-- Create an index on an existing table
CREATE INDEX idx_content ON docs(content) USING INVERTED PROPERTIES ("parser" = "chinese");
-- Build the index for existing data
BUILD INDEX idx_content ON docs;
Drop an index
DROP INDEX idx_content ON docs;
View an index
SHOW CREATE TABLE docs;
SHOW INDEX FROM docs;