Glossary

This page defines the key terms used in Alibaba Cloud OpenSearch Retrieval Engine Edition. Understanding these concepts helps you read configuration guides and API references without getting blocked by unfamiliar terminology.

Data-related terms

Term | Description
MaxCompute data source | The source for full data. Raw data is stored in MaxCompute by partition and loaded during full indexing.
API data source | The source for incremental data. Data is updated by calling API operations in real time.
document | The basic unit of structured data in a search index. A document contains one or more fields and must have a primary key field. Think of a document as a row in a database table. Retrieval Engine Edition identifies documents by primary key value. If a new document shares the same primary key as an existing one, the existing document is overwritten.
field | A named attribute of a document, consisting of a field name and a field value. Think of a field as a column in a database table.
multi-value field | A field that holds multiple independent values, for example, a tags field containing ["cloud", "search", "analytics"].
primary key | The field that uniquely identifies a document within an index.
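
The overwrite-by-primary-key behavior described for documents can be illustrated with a minimal sketch. It assumes a toy in-memory dictionary as the index and a primary key field named id. The function name add_document and the sample field values are hypothetical and are not part of the Retrieval Engine Edition API.

```python
# Toy in-memory index keyed by primary key value (illustration only,
# not the Retrieval Engine Edition API).
def add_document(index: dict, doc: dict, primary_key: str = "id") -> None:
    """Insert doc, overwriting any existing document with the same primary key."""
    if primary_key not in doc:
        raise ValueError(f"document is missing primary key field {primary_key!r}")
    index[doc[primary_key]] = doc


index = {}
# "tags" is a multi-value field: it holds several independent values.
add_document(index, {"id": "1", "title": "fast cloud search", "tags": ["cloud", "search"]})
# Same primary key, so this document replaces the previous one.
add_document(index, {"id": "1", "title": "fast index builder", "tags": ["index"]})

print(len(index))           # 1
print(index["1"]["title"])  # fast index builder
```

Because the second document reuses the primary key value "1", only one document remains in the index afterward.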

Retrieval Engine Edition terms

Online search roles

A cluster is a search service consisting of QRS workers and Searcher workers that work together to handle query requests.

Role | Description
Query Result Searcher (QRS) worker | Handles online search. QRS workers parse incoming query requests, distribute them to Searcher workers, and merge the results before returning them to the caller.
Searcher worker | Handles online search. Searcher workers load index data into memory and serve search queries.
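
The division of labor between the two roles follows a scatter-gather pattern, which the following sketch illustrates. It is a simplified model under stated assumptions: each Searcher is represented by a plain list of documents, relevance is a naive term count, and the function names searcher_search and qrs_search are hypothetical. The real workers communicate over the network and use the engine's own query syntax and ranking.

```python
import heapq


def searcher_search(partition, term, top_k):
    """Hypothetical Searcher: score the documents held in its own partition."""
    hits = [(doc["text"].split().count(term), doc["id"]) for doc in partition]
    hits = [hit for hit in hits if hit[0] > 0]
    return heapq.nlargest(top_k, hits)


def qrs_search(partitions, query, top_k=3):
    """Hypothetical QRS: parse the query, fan out to Searchers, merge results."""
    term = query.strip().lower()                                       # parse
    partial = [searcher_search(p, term, top_k) for p in partitions]    # distribute
    merged = heapq.nlargest(top_k, (h for hits in partial for h in hits))  # merge
    return [doc_id for _, doc_id in merged]


partitions = [
    [{"id": 1, "text": "fast cloud search"}],
    [{"id": 2, "text": "fast index builder"}, {"id": 3, "text": "fast fast merge"}],
]
print(qrs_search(partitions, "fast", top_k=2))  # [3, 2]
```

Each Searcher returns only its local top results, so the QRS merges small lists instead of scanning every document itself.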

Offline indexing roles

Processor, Builder, and Merger together form the offline indexing pipeline.

Role | Description
Processor | Parses raw data during offline indexing.
Builder | Builds indexes from raw data during offline indexing.
Merger | Merges and sorts indexes during offline indexing.

Indexing types

Type | Description
Full indexing | Indexes all data in a MaxCompute data source. The output is a full index with full index versions.
Incremental indexing | When data is updated in real time, the offline indexing pipeline generates new indexes and applies them to online clusters automatically.
Real-time indexing | Data pushed via API operations takes effect immediately. Real-time indexes are generated in the memory of Searcher workers.

Index types

Inverted index

An inverted index maps terms to the documents in which they appear. Inverted indexes power query clauses and make full-text search efficient.

For example, given two documents:

  • Document 1: "fast cloud search"

  • Document 2: "fast index builder"

The inverted index looks like this:

Term | Documents
fast | 1, 2
cloud | 1
search | 1
index | 2
builder | 2
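
The table above can be reproduced with a short sketch. It assumes whitespace tokenization and a plain dictionary as the index structure. The function name build_inverted_index is hypothetical, and the engine's actual on-disk format is not shown here.

```python
from collections import defaultdict

documents = {1: "fast cloud search", 2: "fast index builder"}


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # dict.fromkeys drops repeated terms while keeping first-seen order.
        for term in dict.fromkeys(docs[doc_id].split()):
            index[term].append(doc_id)
    return dict(index)


inverted = build_inverted_index(documents)
for term, doc_ids in inverted.items():
    print(term, doc_ids)
# fast [1, 2]
# cloud [1]
# search [1]
# index [2]
# builder [2]
```

A query for a term is then a single dictionary lookup, for example inverted["fast"], rather than a scan over every document.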

Forward index

A forward index maps documents to their fields. Forward indexes are used in FILTER clauses. They are less efficient than inverted indexes but support field-level lookups.

For example:

Document | Fields
doc1 | id, type, create_time, ...
doc2 | id, type, create_time, ...
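
As a rough illustration of how a filter uses a forward index, the sketch below looks up each candidate document's field values and keeps those that satisfy a condition. The field values and the function name filter_docs are hypothetical, and the predicate stands in for a FILTER clause rather than reproducing its syntax.

```python
# Forward index: document -> field values (the values are made up).
forward_index = {
    "doc1": {"id": 1, "type": "article", "create_time": 1700000000},
    "doc2": {"id": 2, "type": "video", "create_time": 1710000000},
}


def filter_docs(candidates, forward_index, predicate):
    """Keep the candidate documents whose field values satisfy predicate."""
    return [doc for doc in candidates if predicate(forward_index[doc])]


matched = filter_docs(
    ["doc1", "doc2"],
    forward_index,
    lambda fields: fields["type"] == "article",
)
print(matched)  # ['doc1']
```

Each candidate requires its own field lookup, which is why filtering is typically applied to documents that an inverted index lookup has already narrowed down.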

Summary index

A summary index stores the field values displayed in search result summaries. Query it by primary key or document ID to retrieve the display content for a given result. Retrieval Engine Edition paginates search results using summary index data.
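
One way to picture this is a lookup table from document ID to display fields, consulted only for the slice of results on the requested page. The sketch below assumes 1-based page numbers and a ranked list of document IDs. The function name fetch_page and the sample data are hypothetical.

```python
# Summary index: document ID -> fields shown in the result list (made-up data).
summary_index = {
    1: {"title": "fast cloud search"},
    2: {"title": "fast index builder"},
    3: {"title": "fast merge"},
}


def fetch_page(ranked_doc_ids, summary_index, page, page_size):
    """Return the display fields for one page of ranked results (page is 1-based)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    return [summary_index[doc_id] for doc_id in ranked_doc_ids[start:start + page_size]]


print(fetch_page([3, 1, 2], summary_index, page=2, page_size=2))
# [{'title': 'fast index builder'}]
```

Only the documents on the requested page are read from the summary index, so display content is not loaded for results the caller never sees.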

Tokenization

Tokenization splits document text into individual searchable units called terms.

For TEXT-type fields, the system tokenizes sentences into meaningful terms. For example, the Chinese string 浙江大学 is tokenized into two terms: 浙江 and 大学.

A term is a single token or a set of tokens produced after tokenization. Terms are the atomic units used in inverted index lookups.
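
The 浙江大学 example can be reproduced with a toy dictionary-based tokenizer. The sketch uses forward maximum matching, a common textbook technique, and is not the analyzer that Retrieval Engine Edition actually uses. The dictionary contents, the max_len parameter, and the function name tokenize are assumptions made for illustration.

```python
def tokenize(text, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position."""
    terms = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in dictionary:
                terms.append(candidate)
                i += size
                break
    return terms


print(tokenize("浙江大学", {"浙江", "大学"}))  # ['浙江', '大学']
```

The result depends on the dictionary: if it also contained 浙江大学 as a single word, this sketch would return one term instead of two.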

Data changes triggered by FSM

The finite-state machine (FSM) coordinates system state transitions. Each FSM-triggered change has a type, a rule for whether it can run multiple times (recurring), and a description of what it does.

For the same resource scope (cluster, index, or zone), non-recurring changes can only run once per instance. Recurring changes follow their own concurrency rules as described below.

Change type | Recurring | Description
Service discovery | Yes | Points the IP address of a Retrieval Engine Edition instance to a domain name, enabling service calls. For the same cluster, all historical changes are terminated before the latest change runs.
ha3_biz_apend | No | Adds a biz. Runs once per instance. Triggered automatically by the system. The change continues until the index table is added to the instance and the index is built.
update_biz_depend_index_fsm | No | Updates the index that a biz depends on. Runs once per instance. Triggered automatically by the system. The change continues until the index table is added and the index is built.
Online deployment | Yes | For the same cluster, all historical changes are terminated before the latest change runs.
multi_biz_activate | No | Initializes a Retrieval Engine Edition instance. Runs once per instance. The change continues until the index table is added and the index is built.
Index creation | Yes | For the same index, all historical changes are terminated before the latest change runs.
Automatically triggered full indexing | Yes | Triggered automatically when new data partitions are detected. The latest change and historical changes can run concurrently.
Manually triggered full indexing | Yes | The latest change and historical changes can run concurrently.
Configuration push | Yes | All historical changes are terminated before the latest change runs.
Online resources | Yes | For the same zone, all historical changes are terminated before the latest change runs.
Index rollback | Yes | The latest change and historical changes can run concurrently.

FSM stands for finite-state machine, a mathematical model that represents a finite set of states and the transitions between them.
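
The three behaviors in the table (run once, terminate older changes, run concurrently) can be modeled with a small tracker. This is a sketch of the rules as stated above, not the system's actual scheduler. The class ChangeTracker, the policy constants, and the scope names such as index-a are hypothetical, and only a few change types are included.

```python
TERMINATE_OLDER = "terminate_older"   # newest change stops the older ones
CONCURRENT = "concurrent"             # newest change runs alongside older ones
RUN_ONCE = "run_once"                 # non-recurring: runs once per instance

# Policies for a subset of the change types in the table.
POLICIES = {
    "Index creation": TERMINATE_OLDER,
    "Configuration push": TERMINATE_OLDER,
    "Manually triggered full indexing": CONCURRENT,
    "Index rollback": CONCURRENT,
    "multi_biz_activate": RUN_ONCE,
}


class ChangeTracker:
    def __init__(self):
        self.running = {}      # (change_type, scope) -> IDs of running changes
        self.ran_once = set()  # keys of non-recurring changes already started
        self.terminated = []   # IDs of changes stopped by a newer change
        self._next_id = 1

    def submit(self, change_type, scope):
        """Start a change and return its ID, or None if it may not run again."""
        policy = POLICIES[change_type]
        key = (change_type, scope)
        if policy == RUN_ONCE:
            if key in self.ran_once:
                return None
            self.ran_once.add(key)
        active = self.running.setdefault(key, [])
        if policy == TERMINATE_OLDER:
            self.terminated.extend(active)
            active.clear()
        change_id = self._next_id
        self._next_id += 1
        active.append(change_id)
        return change_id


tracker = ChangeTracker()
tracker.submit("Index creation", "index-a")   # change 1
tracker.submit("Index creation", "index-a")   # change 2 terminates change 1
tracker.submit("Index rollback", "index-a")   # change 3
tracker.submit("Index rollback", "index-a")   # change 4 runs alongside change 3
print(tracker.terminated)                                # [1]
print(tracker.running[("Index rollback", "index-a")])    # [3, 4]
```

Keying the running changes by both change type and scope mirrors the "for the same cluster", "for the same index", and "for the same zone" wording in the table.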