Full-text search is an information retrieval technology that lets you quickly and accurately find information relevant to a user's query from large amounts of text data. PolarDB for PostgreSQL provides a range of full-text search features.
Background
Unlike traditional keyword search, full-text search finds information relevant to a user's query quickly and accurately from massive amounts of text data. It processes the entire content of a document, not just specific fields or tags.
A full-text search system typically includes the following steps:
Text preprocessing: This step tokenizes text, removes stop words, and performs stemming to improve search efficiency and accuracy.
Index creation: An index is created for the processed text. This process typically uses an inverted index structure to record the position of each word and the documents in which it appears.
Query processing: When a user enters a query, the system analyzes the query statement and converts it into a suitable format for searching.
Result sorting: The search results are sorted based on a relevance algorithm to return the results that best match the user's query.
Scenarios
Full-text search is used in a wide range of scenarios. Some typical scenarios include the following:
Document management systems: You can quickly find internal company documents, reports, and clauses to improve work efficiency.
Online search engines: Full-text search technology helps users quickly find the information they need.
Academic research: In academic databases and literature management tools, researchers can use full-text search to quickly locate relevant literature and data.
E-commerce: E-commerce platforms use full-text search to help customers quickly find the products they need and improve the shopping experience.
Social media: Users can search for posts, comments, and images by keyword, which makes it easier to retrieve information.
Legal document retrieval: Lawyers and legal professionals use full-text search tools to quickly find relevant cases and legal clauses.
Medical record management: Hospitals can use full-text search technology to quickly find patient information and history in medical records and reports.
Customer support: Online customer service systems use full-text search to help customers quickly find answers to frequently asked questions (FAQs) and support documents.
Content management systems: Websites and blogs use full-text search to help visitors quickly find relevant articles and materials.
Libraries and information retrieval: Library information retrieval systems use full-text search technology to make it easy for readers to find books and articles.
Function overview
Tokenization
A PolarDB for PostgreSQL full-text index lets you preprocess documents and store an index for fast subsequent searches. Preprocessing includes the following steps:
Parsing documents into tokens. This step identifies different types of tokens, such as numbers, words, compound words, and email addresses, so that they can be processed in different ways.
Converting tokens into lexemes. A lexeme is a normalized string, similar to a token, that unifies different forms of the same word into a single representation. For example, normalization usually includes converting uppercase letters to lowercase and often involves removing suffixes, such as 's' or 'es' in English. This process allows searches to recognize different variations of the same word and eliminates the need to enter all possible variants.
Storing preprocessed documents for search optimization. For example, each document can be represented as an ordered array of normalized lexemes. In addition to lexemes, position information is usually stored for proximity ranking. As a result, a document with a dense area of query words ranks higher than a document with scattered query words.
tsvector
tsvector is a data type provided by PolarDB for PostgreSQL for full-text search. This data type is used to efficiently store processed text for fast searching and matching. A tsvector stores a document's lexemes and their positions in the text.
pg_bigm
pg_bigm is an extension for PolarDB for PostgreSQL that provides fuzzy search capabilities, especially for approximate string matching. This extension is ideal for applications that handle large amounts of text data, such as search engines and content management systems. Its core concept is to use n-grams to improve the efficiency and accuracy of text searches.
The pg_bigm extension is effective for prefix and suffix fuzzy queries, such as like '%xxxx%'.
pg_trgm
pg_trgm is an extension for PolarDB for PostgreSQL that provides support for trigrams. A trigram is a group of three consecutive characters taken from a string. This method is useful for similarity searches, fuzzy matching, and text similarity queries. pg_trgm primarily improves query efficiency on large text datasets by creating indexes and query operators. It is commonly used in scenarios such as full-text search, auto-completion, and spelling correction.
The pg_trgm extension is effective for prefix and suffix fuzzy queries, such as like '%xxxx%'.
Chinese tokenization
In the Chinese language, words are the smallest morphemic units. Unlike English, Chinese is written without spaces between words. Because of this characteristic, it is difficult to obtain tokenization results that match Chinese semantics when you use the default full-text search engine in PostgreSQL.
For Chinese tokenization, PolarDB for PostgreSQL supports two full-text search extensions: pg_jieba and Zhparser.
pg_jieba
Jieba is one of the most popular Chinese tokenization libraries. It can accurately identify and tokenize words in Chinese sentences. The pg_jieba extension integrates Jieba's tokenization capabilities into the database. This allows for efficient tokenization of Chinese text to support full-text search features.
Zhparser
Simple Chinese Word Segmentation (SCWS) is an open source Chinese tokenization engine based on a word frequency dictionary. It can accurately segment a block of Chinese text into words.
The Zhparser extension is a Chinese tokenization plug-in developed based on SCWS. It is compatible with existing full-text search features in PostgreSQL and provides a rich set of configuration options and a custom dictionary feature.
Indexes
PolarDB for PostgreSQL supports multiple index structures for full-text search.
GIN index
Generalized Inverted Index (GIN) is an index type in PostgreSQL that supports full-text search. Using a GIN index, you can efficiently perform full-text searches, especially when you handle large-scale text data. A GIN index allows for fast query operations, especially when you process complex text queries that use tsvector and tsquery. GIN indexes also support other data types, such as JSONB.
RUM index
RUM is a PostgreSQL extension that provides the RUM index type for full-text search and other indexing purposes. This index is designed to improve the performance of full-text search, especially in scenarios that require document ranking.
A RUM index is an inverted index similar to the built-in Generalized Inverted Index (GIN). The main difference is that a RUM index can store additional information. This allows for faster results when you perform sorting or other operations. For example, in a full-text search, a RUM index can store the position of words in a document. You can then use this position information to calculate relevance rankings during a query.
Query processing
tsquery
tsquery is a feature for full-text search that is specifically designed for querying text data. It allows users to create complex search conditions to find information quickly and effectively in large-scale text data. PolarDB for PostgreSQL also provides the to_tsquery method to convert text into a tsquery. You can perform a full-text search query by combining it with tsvector and full-text search operators.
tsquery supports the @@ (contains) operator and the Boolean operators & (AND), | (OR), and ! (NOT). This makes it easy to build search queries with combined conditions.
Sorting
ts_rank
ts_rank is a PostgreSQL function used for full-text search. It is mainly used to calculate a relevance score between a document and a query. This score can be used to evaluate the importance or relevance of a document for a specific query.