what are text analyzers,how to use text analyzers-OpenSearch(Open Search)-阿里云帮助中心

Analyzer overview

The following table summarizes all available analyzers. "Dedicated only" means the analyzer requires a dedicated application instance.

Analyzer	Language / domain	Supported field types	Dedicated only
N-gram	Any	SHORT_TEXT	Yes
Keyword	Any	LITERAL, INT, LITERAL_ARRAY, INT_ARRAY	No
General analyzer for Chinese	Chinese	TEXT, SHORT_TEXT	No
E-commerce analyzer for Chinese	Chinese / E-commerce	TEXT, SHORT_TEXT	No
Single-character analyzer for Chinese	Chinese	TEXT, SHORT_TEXT	No
Character analyzer for Chinese	Chinese	TEXT, SHORT_TEXT	Yes
Fuzzy analyzer	Chinese / Pinyin	SHORT_TEXT	No
Full pinyin spelling analyzer	Pinyin	SHORT_TEXT	No
Abbreviated pinyin analyzer	Pinyin	SHORT_TEXT	No
Word stemming analyzer for English	English	TEXT, SHORT_TEXT	No
Unstemmed word analyzer for English	English	TEXT, SHORT_TEXT	No
Fine-grained analyzer for English	English	TEXT, SHORT_TEXT	Yes
General analyzer for Thai	Thai	TEXT, SHORT_TEXT	Yes
E-commerce analyzer for Thai	Thai / E-commerce	TEXT, SHORT_TEXT	Yes
General analyzer for Vietnamese	Vietnamese	TEXT, SHORT_TEXT	Yes
General analyzer for Indonesian	Indonesian	TEXT, SHORT_TEXT	Yes
General analyzer for Korean	Korean	TEXT, SHORT_TEXT	Yes
E-commerce analyzer for Korean	Korean / E-commerce	TEXT, SHORT_TEXT	Yes
General analyzer for Japanese	Japanese	TEXT, SHORT_TEXT	Yes
E-commerce analyzer for Japanese	Japanese / E-commerce	TEXT, SHORT_TEXT	Yes
Simple analyzer	Any	TEXT, SHORT_TEXT	No
Numeric analyzer	Numeric	INT, TIMESTAMP	No
Geo-location analyzer	Geographic	geo_point	No
IT content analyzer	IT industry	TEXT, SHORT_TEXT	No
General E-commerce analysis	E-commerce	TEXT	Yes (E-commerce Enhanced)
General analysis for the gaming industry	Gaming	TEXT, SHORT_TEXT	Yes (Gaming Enhanced)
General analyzer for English E-commerce	English / E-commerce	TEXT	Yes (E-commerce Enhanced)
Custom text analyzer	Any	TEXT, SHORT_TEXT	No

N-gram analyzer

Tokenizes text into sequences of N consecutive characters. Supports 2-gram and 3-gram tokenization. Use this analyzer for non-semantic search scenarios where you need character-level substring matching.

Important

Available only for dedicated applications. The field type must be SHORT_TEXT.

Example — 2-gram

Input: Open Search

Tokens: op, pe, en, n , s, se, ea, ar, rc, ch

Example — 3-gram

Input: Open Search

Tokens: ope, pen, en , n s, se, sea, ear, arc, rch

Keyword analyzer

Does not tokenize the field value. The entire value is treated as a single token. Use this analyzer for exact match scenarios — tags, identifiers, string codes, and numeric values that must not be split.

Supported field types: LITERAL, INT, LITERAL_ARRAY, INT_ARRAY

Example

Input: chrysanthemum tea

The document is retrieved only when the query is exactly chrysanthemum tea.

General analyzer for Chinese

A general-purpose semantic analyzer for Chinese text. Tokenizes content into meaningful search units based on Chinese language semantics. Suitable for most industries.

Supported field types: TEXT, SHORT_TEXT

Example

Input: juhuacha

The document is retrieved by queries juhuacha, juhua, cha, or huacha.

E-commerce analyzer for Chinese

A semantic analyzer optimized for Chinese e-commerce product names and descriptions. Produces finer-grained tokens than the general analyzer for common product terminology.

Supported field types: TEXT, SHORT_TEXT

Example

Input: Dabao SOD lotion

The document is retrieved by queries Dabao, sod, sod lotion, SOD lotion, or lotion.

Single-character analyzer for Chinese

Tokenizes Chinese text into individual characters and words. Use this analyzer for non-semantic Chinese search scenarios — author names, store names, or any field where individual character recall matters more than semantic grouping.

Supported field types: TEXT, SHORT_TEXT

This analyzer treats numbers and English words as single tokens. A search for he does not retrieve a document containing hello world. Use the fuzzy analyzer if you need partial word matching on numbers or English text.

If a search result summary is configured for a TEXT field, some extended tokens such as huacha are not highlighted.

Example

Input: juhuacha

The document is retrieved by queries juhuacha, juhua, cha, huacha, hua, or jucha.

Character analyzer for Chinese

Tokenizes text into individual Chinese characters, numbers, English letters, and punctuation marks. Suitable for non-semantic search scenarios that require maximum granularity.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

Example

Input: kaifang sousuo OpenSearch123.

The document is retrieved by searching for any single character: kai, fang, sou, suo, O, p, e, n, S, e, a, r, c, h, or .

Fuzzy analyzer

Supports searches by pinyin, single characters, and letters, including prefix and suffix matching for numbers, letters, and pinyin. Chinese text does not support prefix or suffix matching. The field length is limited to 100 bytes.

Supported field types: SHORT_TEXT only

For details, see Fuzzy searches.

Example — Chinese and pinyin

Input: chrysanthemum tea

The document is retrieved by queries including chrysanthemum tea, chrysanthemum, tea, flower tea, flower, ju, juhua, juhuacha, j, jh, or jhc.

Example — Prefix and suffix matching

Input: 138****5678

Use ^138 to match phone numbers starting with 138.
Use 5678$ to match phone numbers ending with 5678.

Example — English letter combinations

Input: OpenSearch

The document is retrieved by any single letter or combination of letters from the word.

Full pinyin spelling analyzer

Searches Chinese characters in SHORT_TEXT fields using their full pinyin spelling or the first letter of each syllable. Suitable for searches by full pinyin or abbreviated pinyin — for example, movie names or author names.

To search by full pinyin, enter the complete pinyin of the Chinese characters. Partial syllable spelling is not supported.

Supported field types: SHORT_TEXT only

Example

Input: Da Nei Mi Tan 007

The document is retrieved by d, dn, dnm, dnmt, dnmt007, da, danei, daneimi, or daneimitan.

The document is not retrieved by an or anei.

Abbreviated pinyin analyzer

Retrieves Chinese characters in SHORT_TEXT fields using the first letter of each pinyin syllable. Use this analyzer for scenarios that require searches by pinyin initials — people's names, movie titles, and similar short content.

Supported field types: SHORT_TEXT

Example

Input: Da Nei Mi Tan 007

The document is retrieved by d, dn, dnm, dnmt, dnmt0, dnmt007, m, mt, mt007, or 007.

Word stemming analyzer for English

An English semantic analyzer that stems each token to its root form and handles pluralization. Use this analyzer when you want searches like analyzing or analyzers to match documents containing analyze.

Supported field types: TEXT, SHORT_TEXT

This analyzer does not support query analysis configurations. Consecutive Chinese characters are treated as a single token.

Example

Input: English tokenizer english analyzer

The document is retrieved by English tokenizer, english, analyz, analyzer, analyzers, analyze, analyzed, or analyzing.

Unstemmed word analyzer for English

Tokenizes English text based on spaces and punctuation marks without applying stemming. Use this analyzer for non-semantic English search scenarios — book titles, author names, or any field where exact word matching is required.

Supported field types: TEXT, SHORT_TEXT

This analyzer does not support query analysis configurations. Consecutive Chinese characters are treated as a single token.

Example

Input: English tokenizer english analyzer

The document is retrieved by English tokenizer, english, or analyzer.

Fine-grained analyzer for English

Tokenizes English text based on semantics with finer granularity than the standard English analyzer. Suitable for general industry applications where compound words need to be split into their component terms.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

Example

Input: dataprocess

Tokenization result: data process

The document is retrieved by dataprocess, data process, data, or process.

General analyzer for Thai

A general-purpose analyzer that tokenizes Thai text into search units. Suitable for general industry applications.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

Example

Input: แหล่งดึงดูดนักท่องเที่ยว

Tokenization result: แหล่ง ดึง ดูด นักท่องเที่ยว

The document is retrieved by นักท่องเที่ยว or แหล่งดึงดูดนักท่องเที่ยว.

E-commerce analyzer for Thai

A semantic analyzer designed for Thai-language e-commerce scenarios.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

Example

Input: หน้าจอโทรศัพท์

Tokenization result: หน้าจอ โทรศัพท์

The document is retrieved by หน้าจอโทรศัพท์, หน้าจอ, or โทรศัพท์.

General analyzer for Vietnamese

A general-purpose analyzer for Vietnamese text in general industry applications.

Gives you full control over tokenization. Use this analyzer for special scenarios where no built-in analyzer meets your requirements. When pushing documents or performing searches, use the tab character (\t) to separate tokens. Field content and search queries must be tokenized the same way — otherwise, documents cannot be retrieved.

Supported field types: TEXT, SHORT_TEXT

This analyzer does not support query analysis configurations.

Example

Field value: chrysanthemum\tflower tea\thao

The document is retrieved by chrysanthemum, flower tea, chrysanthemum\tflower tea, flower tea\thao, chrysanthemum\thao, or chrysanthemum\tflower tea\thao.

Numeric analyzer

Supports searches based on time intervals or numerical ranges. Use this analyzer on INT and TIMESTAMP fields where range queries are required.

Supported field types: INT, TIMESTAMP

Example

query=default:'OpenSearch' AND index:[number1,number2]

In this example, index is the name of the index field configured with the numeric analyzer.

Geo-location analyzer

Supports geographic location range queries.

Supported field types: geo_point only

Example

query=spatial_index:'circle(116.5806 39.99624, 1000)'

This query finds points within a circle to locate nearby places within a few kilometers.

IT content analyzer

An industry-specific analyzer designed for IT content. It tokenizes IT-related terms differently than the general analyzer — for example, handling programming language names and technical abbreviations with higher precision.

Supported field types: TEXT, SHORT_TEXT

Example

Original content: C++ array usage notes

General analysis: C++ array usage notes

IT content analysis: C++ array usage notes

General E-commerce analysis

An industry-specific analyzer for e-commerce, powered by the natural language processing (NLP) technology of Alibaba DAMO Academy. It resolves common pain points in e-commerce search, including product attribute parsing and brand recognition.

Supported field types: TEXT only

Important

Available only for dedicated applications using the E-commerce Enhanced specification.

Example

Input: Small Gold Tube Concealer Cream

E-commerce analysis output: Small Gold Tube, Concealer, Cream

General analysis for the gaming industry

An industry-specific analyzer designed for gaming content.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications enhanced for the gaming industry.

Example

Input: Genshin equipment

Tokenization result: Genshin, equipment

The document is retrieved by Genshin equipment, Genshin, or equipment.

General analyzer for English E-commerce

A semantic analyzer for English text in e-commerce scenarios.

Supported field types: TEXT only

Important

Available only for dedicated applications of the Industry-specific Enhanced Edition for E-commerce.

Custom text analyzer

Combines an industry-specific analyzer — such as a general analyzer, an e-commerce analyzer, or a person name analyzer — with custom intervention entries. Use this analyzer to extend built-in tokenization behavior with domain-specific vocabulary.

Supported field types: TEXT, SHORT_TEXT only

For configuration details, see Custom text analyzers.

Test analyzers

Test the tokenization results of industry-specific and custom analyzers in the OpenSearch console. Navigate to Search Algorithm Center > Retrieval Configuration > Analyzer Management, then click the Analysis Test tab.

Choose the right analyzer

Use the following guidance to select the right analyzer for your scenario.

Scenario	Recommended analyzer
Semantic Chinese search (most industries)	General analyzer for Chinese
Non-semantic Chinese search or short text with high recall needs	Single-character analyzer for Chinese
Chinese search requiring maximum character granularity	Character analyzer for Chinese
Pinyin search (full or abbreviated)	Fuzzy analyzer
Search by pinyin initials only	Abbreviated pinyin analyzer
Search by full pinyin spelling	Full pinyin spelling analyzer
English semantic search with stemming	Word stemming analyzer for English
English exact-word search (titles, names)	Unstemmed word analyzer for English
English compound word splitting	Fine-grained analyzer for English
E-commerce product search (Chinese)	E-commerce analyzer for Chinese or General E-commerce analysis
IT industry content	IT content analyzer
Numeric or time range queries	Numeric analyzer
Geographic proximity queries	Geo-location analyzer
Tags, identifiers, or non-tokenized strings	Keyword analyzer
Character-level substring matching	N-gram analyzer
Custom tokenization logic	Simple analyzer

Combining analyzers for better recall and precision

For Chinese search, combine the general analyzer with the single-character analyzer to retrieve documents containing individual characters while ranking exact phrases higher. For example:

query=title_index:'juhuacha' OR sws_title_index:'juhuacha'

Pair this query with the sort expression text_relevance(title)×5+field_proximity(sws_title). This combination retrieves documents that contain the individual characters for juhuacha even when separated, while ranking documents with the exact phrase juhuacha higher.

Usage notes

Supported field types for index fields

Supported	Unsupported
INT, INT_ARRAY, TEXT, SHORT_TEXT, LITERAL, LITERAL_ARRAY, TIMESTAMP, GEO_POINT	FLOAT, FLOAT_ARRAY, DOUBLE, DOUBLE_ARRAY

Additional constraints:

The default primary key of the primary table in the application schema is set as an index field named id. This configuration cannot be modified.
If a search result summary is configured for a TEXT field, some extended search unit phrases (such as huacha derived from juhuacha) are not highlighted in search results.
The single-character Chinese analyzer treats numbers and English words as single tokens. A search for he does not retrieve a document containing hello world. Use the fuzzy analyzer for partial word matching on numbers or English text.