Text search-ApsaraDB for SelectDB(SelectDB)-阿里云帮助中心

Text search finds documents that contain specific terms or phrases and ranks results by relevance. Unlike vector search, which maximizes recall through semantic similarity, text search delivers controllable, explainable exact matches with deterministic keyword hits and filtering.

In generative AI applications, especially Retrieval-Augmented Generation (RAG), text search and vector search complement each other. Combined, they balance broad semantic reach with precise lexical accuracy, improving recall and explainability to provide accurate context for large language models.

Core text search features in SelectDB

Rich text operators

SelectDB provides full-text search operators that cover retrieval patterns from basic keyword matching to complex phrase queries.

Key operators include:

MATCH_ANY / MATCH_ALL: Matches any term (OR logic) or all terms (AND logic), ideal for general keyword searches.
MATCH_PHRASE: Performs exact phrase matching, supporting custom term proximity (slop) and order control for proximity queries.
MATCH_PHRASE_PREFIX: Matches the prefix of a phrase, useful for autocomplete and incremental search features.
MATCH_REGEXP: Enables pattern-based text retrieval using a regular expression.

These operators can be used independently or combined within the SEARCH() function to build complex logical queries. For example:

-- Exact phrase search
SELECT * FROM docs WHERE content MATCH_PHRASE 'inverted index';

-- Prefix search
SELECT * FROM docs WHERE content MATCH_PHRASE_PREFIX 'data ware';

See all operators →

Custom tokenization

Tokenization directly impacts search precision and recall. SelectDB lets you build flexible tokenization pipelines by combining character filters, a tokenizer, and token filters to suit specific use cases.

Typical use cases include:

Custom character filtering: Pre-process text before tokenization by replacing, removing, or standardizing characters.
Choose a tokenizer: Select from various types, such as standard, ngram, edge_ngram, keyword, and icu, to handle different languages and text structures.
Apply token filters: Use filters like lowercase, word_delimiter, and ascii_folding to normalize and refine the resulting tokens.

-- Example: Define a custom analyzer
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS keyword_lowercase
PROPERTIES (
  "tokenizer" = "keyword",
  "token_filter" = "asciifolding, lowercase"
);

-- Use the custom analyzer when creating a table
CREATE TABLE docs (
    id BIGINT,
    content TEXT,
    INDEX idx_content (content) USING INVERTED PROPERTIES(
        "analyzer" = "keyword_lowercase",
        "support_phrase" = "true"
    )
);

Learn about custom tokenization →

BM25 relevance scoring

SelectDB implements the BM25 (Best Matching 25) algorithm to score and rank full-text search results by relevance.

A probabilistic model based on term frequency (TF), inverse document frequency (IDF), and document length
Robust performance for both long and short documents
The weighting strategy can be adjusted using the k1 and b parameters

SELECT id, title, score() AS relevance
FROM docs
WHERE content MATCH_ANY 'real-time OLAP analysis'
ORDER BY relevance DESC
LIMIT 10;

Learn more about scoring mechanisms →

SEARCH function: unified query entry (4.0+)

The SEARCH() function provides a unified syntax for text retrieval. It supports multi-column searches and Boolean logic to simplify complex queries:

SELECT id, title, score() AS relevance
FROM docs
WHERE SEARCH('title:Machine AND tags:ANY(database sql)')
ORDER BY relevance DESC
LIMIT 20;

Full SEARCH function guide →

Quick start

Step 1: Create a table with an inverted index

CREATE TABLE docs (
  id BIGINT,
  title STRING,
  content STRING,
  category STRING,
  tags ARRAY<STRING>,
  created_at DATETIME,
  -- Text search index
  INDEX idx_title(title) USING INVERTED PROPERTIES ("parser" = "chinese"),
  INDEX idx_content(content) USING INVERTED PROPERTIES ("parser" = "chinese", "support_phrase" = "true"),
  INDEX idx_category(category) USING INVERTED,
  INDEX idx_tags(tags) USING INVERTED
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;

Step 2: Run text queries

-- Simple keyword search
SELECT * FROM docs WHERE content MATCH_ANY ' SelectDB';

-- Phrase search
SELECT * FROM docs WHERE content MATCH_PHRASE 'full-text search';

-- Use SEARCH for Boolean queries
SELECT * FROM docs
WHERE SEARCH('title:apache AND (category:database OR tags:ANY(sql nosql))');

-- Sort by relevance
SELECT id, title, score() AS relevance
FROM docs
WHERE content MATCH_ANY 'real-time analysis OLAP'
ORDER BY relevance DESC
LIMIT 10;

Hybrid search: text + vector

In RAG applications, combine text search with vector similarity for comprehensive retrieval:

-- Hybrid retrieval: Semantic similarity + keyword filtering
SELECT id, title, score() AS text_relevance
FROM docs
WHERE
  -- Vector filter for semantic similarity
  cosine_distance(embedding, [0.1, 0.2, ...]) < 0.3
  -- Text filter for keyword constraints
  AND SEARCH('title:search AND content:engine AND category:technology')
ORDER BY text_relevance DESC
LIMIT 10;

Managing inverted indexes

For a detailed introduction to inverted indexes, see Inverted Index. The following examples show common operations.

Create an index

-- Create an index when creating a table
CREATE TABLE t (
  content STRING,
  INDEX idx(content) USING INVERTED PROPERTIES ("parser" = "chinese")
);

-- Create an index on an existing table
CREATE INDEX idx_content ON docs(content) USING INVERTED PROPERTIES ("parser" = "chinese");

-- Build the index for existing data
BUILD INDEX idx_content ON docs;

Drop an index

DROP INDEX idx_content ON docs;

View an index

SHOW CREATE TABLE docs;
SHOW INDEX FROM docs;