Text analyzers

OpenSearch Industry Algorithm Edition provides built-in text analyzers for different languages, industries, and search scenarios. Each analyzer controls how field content is tokenized at index time and matched at query time.

Analyzer overview

The following table summarizes all available analyzers. "Dedicated only" means the analyzer requires a dedicated application instance.

| Analyzer | Language / domain | Supported field types | Dedicated only |
| --- | --- | --- | --- |
| N-gram | Any | SHORT_TEXT | Yes |
| Keyword | Any | LITERAL, INT, LITERAL_ARRAY, INT_ARRAY | No |
| General analyzer for Chinese | Chinese | TEXT, SHORT_TEXT | No |
| E-commerce analyzer for Chinese | Chinese / E-commerce | TEXT, SHORT_TEXT | No |
| Single-character analyzer for Chinese | Chinese | TEXT, SHORT_TEXT | No |
| Character analyzer for Chinese | Chinese | TEXT, SHORT_TEXT | Yes |
| Fuzzy analyzer | Chinese / Pinyin | SHORT_TEXT | No |
| Full pinyin spelling analyzer | Pinyin | SHORT_TEXT | No |
| Abbreviated pinyin analyzer | Pinyin | SHORT_TEXT | No |
| Word stemming analyzer for English | English | TEXT, SHORT_TEXT | No |
| Unstemmed word analyzer for English | English | TEXT, SHORT_TEXT | No |
| Fine-grained analyzer for English | English | TEXT, SHORT_TEXT | Yes |
| General analyzer for Thai | Thai | TEXT, SHORT_TEXT | Yes |
| E-commerce analyzer for Thai | Thai / E-commerce | TEXT, SHORT_TEXT | Yes |
| General analyzer for Vietnamese | Vietnamese | TEXT, SHORT_TEXT | Yes |
| General analyzer for Indonesian | Indonesian | TEXT, SHORT_TEXT | Yes |
| General analyzer for Korean | Korean | TEXT, SHORT_TEXT | Yes |
| E-commerce analyzer for Korean | Korean / E-commerce | TEXT, SHORT_TEXT | Yes |
| General analyzer for Japanese | Japanese | TEXT, SHORT_TEXT | Yes |
| E-commerce analyzer for Japanese | Japanese / E-commerce | TEXT, SHORT_TEXT | Yes |
| Simple analyzer | Any | TEXT, SHORT_TEXT | No |
| Numeric analyzer | Numeric | INT, TIMESTAMP | No |
| Geo-location analyzer | Geographic | geo_point | No |
| IT content analyzer | IT industry | TEXT, SHORT_TEXT | No |
| General E-commerce analysis | E-commerce | TEXT | Yes (E-commerce Enhanced) |
| General analysis for the gaming industry | Gaming | TEXT, SHORT_TEXT | Yes (Gaming Enhanced) |
| General analyzer for English E-commerce | English / E-commerce | TEXT | Yes (E-commerce Enhanced) |
| Custom text analyzer | Any | TEXT, SHORT_TEXT | No |
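
The field-type constraints in the overview table can be checked before a schema is submitted. The following Python sketch copies a few rows of the table into a lookup. The dictionary and the function name check_analyzer are illustrative assumptions and are not part of any OpenSearch SDK.

```python
# Supported field types for a few analyzers, copied from the overview table.
# This excerpt is illustrative; extend it with the remaining rows as needed.
SUPPORTED_FIELD_TYPES = {
    "N-gram": {"SHORT_TEXT"},
    "Keyword": {"LITERAL", "INT", "LITERAL_ARRAY", "INT_ARRAY"},
    "General analyzer for Chinese": {"TEXT", "SHORT_TEXT"},
    "Fuzzy analyzer": {"SHORT_TEXT"},
    "Numeric analyzer": {"INT", "TIMESTAMP"},
}


def check_analyzer(analyzer: str, field_type: str) -> bool:
    """Return True if the table lists field_type as supported for analyzer."""
    if analyzer not in SUPPORTED_FIELD_TYPES:
        raise KeyError(f"analyzer not in this excerpt: {analyzer}")
    return field_type.upper() in SUPPORTED_FIELD_TYPES[analyzer]


if __name__ == "__main__":
    print(check_analyzer("Fuzzy analyzer", "SHORT_TEXT"))
    print(check_analyzer("Fuzzy analyzer", "TEXT"))
```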

N-gram analyzer

Tokenizes text into sequences of N consecutive characters. Supports 2-gram and 3-gram tokenization. Use this analyzer for non-semantic search scenarios where you need character-level substring matching.

Important

Available only for dedicated applications. The field type must be SHORT_TEXT.

Example — 2-gram

Input: Open Search

Tokens (quoted so that spaces are visible): "op", "pe", "en", "n ", " s", "se", "ea", "ar", "rc", "ch"

Example — 3-gram

Input: Open Search

Tokens (quoted so that spaces are visible): "ope", "pen", "en ", "n s", " se", "sea", "ear", "arc", "rch"
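
The two examples can be reproduced with a sliding window over the characters of the input. The sketch below is a minimal Python model of character N-gram tokenization, not the service's implementation. The function name char_ngrams is hypothetical, and the lowercasing step is an assumption inferred from the lowercase tokens in the examples.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in text.

    Lowercasing is an assumption based on the example tokens above.
    Spaces are kept, so some tokens start or end with a space.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    lowered = text.lower()
    return [lowered[i:i + n] for i in range(len(lowered) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("Open Search", 2))
    print(char_ngrams("Open Search", 3))
```

An input of length L yields L - N + 1 tokens, so the 11-character input above produces 10 tokens for 2-gram and 9 tokens for 3-gram.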

Keyword analyzer

Does not tokenize the field value. The entire value is treated as a single token. Use this analyzer for exact match scenarios — tags, identifiers, string codes, and numeric values that must not be split.

Supported field types: LITERAL, INT, LITERAL_ARRAY, INT_ARRAY

Example

Input: chrysanthemum tea

The document is retrieved only when the query is exactly chrysanthemum tea.

General analyzer for Chinese

A general-purpose semantic analyzer for Chinese text. Tokenizes content into meaningful search units based on Chinese language semantics. Suitable for most industries.

Supported field types: TEXT, SHORT_TEXT

Example

Input: 菊花茶

The document is retrieved by queries 菊花茶, 菊花, 茶, or 花茶.

E-commerce analyzer for Chinese

A semantic analyzer optimized for Chinese e-commerce product names and descriptions. Produces finer-grained tokens than the general analyzer for common product terminology.

Supported field types: TEXT, SHORT_TEXT

Example

Input: Dabao SOD lotion

The document is retrieved by queries Dabao, sod, sod lotion, SOD lotion, or lotion.

Single-character analyzer for Chinese

Tokenizes Chinese text into individual characters and words. Use this analyzer for non-semantic Chinese search scenarios — author names, store names, or any field where individual character recall matters more than semantic grouping.

Supported field types: TEXT, SHORT_TEXT

This analyzer treats numbers and English words as single tokens. A search for he does not retrieve a document containing hello world. Use the fuzzy analyzer if you need partial word matching on numbers or English text.
If a search result summary is configured for a TEXT field, some extended tokens such as 花茶 are not highlighted.

Example

Input: 菊花茶

The document is retrieved by queries 菊花茶, 菊花, 茶, 花茶, 花, or 菊茶.

Character analyzer for Chinese

Tokenizes text into individual Chinese characters, numbers, English letters, and punctuation marks. Suitable for non-semantic search scenarios that require maximum granularity.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

Example

Input: 开放搜索OpenSearch123.

The document is retrieved by searching for any single character: 开, 放, 搜, 索, O, p, e, n, S, e, a, r, c, h, or the period (.).

Fuzzy analyzer

Supports searches by pinyin, single characters, and letters, including prefix and suffix matching for numbers, letters, and pinyin. Chinese text does not support prefix or suffix matching. The field length is limited to 100 bytes.

Supported field types: SHORT_TEXT only

For details, see Fuzzy searches.

Example — Chinese and pinyin

Input: 菊花茶 (chrysanthemum tea; pinyin: ju hua cha)

The document is retrieved by queries including 菊花茶, 菊花, 茶, 花茶, 花, ju, juhua, juhuacha, j, jh, or jhc.

Example — Prefix and suffix matching

Input: 138****5678

  • Use ^138 to match phone numbers starting with 138.

  • Use 5678$ to match phone numbers ending with 5678.
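
The anchor syntax above can be modeled locally to sanity-check queries before sending them. The following Python sketch only mimics the described semantics of ^ and $ for letters and digits. The function name anchored_match and the unmasked phone number are hypothetical, and the real matching is performed by the service.

```python
def anchored_match(value: str, query: str) -> bool:
    """Model the prefix (^) and suffix ($) anchors of the fuzzy analyzer.

    A leading ^ requires the value to start with the rest of the query.
    A trailing $ requires the value to end with the rest of the query.
    Without an anchor, the query may appear anywhere in the value.
    """
    if query.startswith("^"):
        return value.startswith(query[1:])
    if query.endswith("$"):
        return value.endswith(query[:-1])
    return query in value


if __name__ == "__main__":
    phone = "13800005678"  # hypothetical value standing in for 138****5678
    print(anchored_match(phone, "^138"))
    print(anchored_match(phone, "5678$"))
    print(anchored_match(phone, "^5678"))
```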

Example — English letter combinations

Input: OpenSearch

The document is retrieved by any single letter or combination of letters from the word.

Full pinyin spelling analyzer

Searches Chinese characters in SHORT_TEXT fields using their full pinyin spelling or the first letter of each syllable. Suitable for searches by full pinyin or abbreviated pinyin — for example, movie names or author names.

To search by full pinyin, enter the complete pinyin of the Chinese characters. Partial syllable spelling is not supported.

Supported field types: SHORT_TEXT only

Example

Input: Chinese text whose pinyin is Da Nei Mi Tan, followed by 007

The document is retrieved by d, dn, dnm, dnmt, dnmt007, da, danei, daneimi, or daneimitan.

The document is not retrieved by an or anei.
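
The rule behind this example is that a query must align with syllable boundaries, starting from the first syllable. The Python sketch below generates the matching terms under that assumption. The function name pinyin_search_terms is hypothetical, the syllables are supplied by hand rather than derived from Chinese characters, and the treatment of the trailing digits is inferred from the single dnmt007 entry in the example.

```python
def pinyin_search_terms(syllables: list[str], suffix: str = "") -> set[str]:
    """Build the query strings that match under a syllable-aligned model.

    Full-pinyin terms are the concatenations of the first k syllables.
    Abbreviated terms are the first k syllable initials.
    A non-pinyin suffix (such as digits) is appended to the full initials.
    """
    full = ["".join(syllables[:k]) for k in range(1, len(syllables) + 1)]
    initials = "".join(s[0] for s in syllables)
    abbreviated = [initials[:k] for k in range(1, len(initials) + 1)]
    terms = set(full) | set(abbreviated)
    if suffix:
        terms.add(initials + suffix)
    return terms


if __name__ == "__main__":
    terms = pinyin_search_terms(["da", "nei", "mi", "tan"], "007")
    print(sorted(terms))
    print("an" in terms, "anei" in terms)
```

Because every generated term starts at the first syllable and ends on a syllable boundary, partial spellings such as an or anei never appear in the set.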

Abbreviated pinyin analyzer

Retrieves Chinese characters in SHORT_TEXT fields using the first letter of each pinyin syllable. Use this analyzer for scenarios that require searches by pinyin initials — people's names, movie titles, and similar short content.

Supported field types: SHORT_TEXT

Example

Input: Chinese text whose pinyin is Da Nei Mi Tan, followed by 007

The document is retrieved by d, dn, dnm, dnmt, dnmt0, dnmt007, m, mt, mt007, or 007.

Word stemming analyzer for English

An English semantic analyzer that stems each token to its root form and handles pluralization. Use this analyzer when you want searches like analyzing or analyzers to match documents containing analyze.

Supported field types: TEXT, SHORT_TEXT

This analyzer does not support query analysis configurations. Consecutive Chinese characters are treated as a single token.

Example

Input: 英文分词器 english analyzer

The document is retrieved by 英文分词器, english, analyz, analyzer, analyzers, analyze, analyzed, or analyzing.
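
Stemming works because the same reduction is applied to both the indexed text and the query. The Python sketch below illustrates the idea with a deliberately tiny suffix stripper. It is a toy, not the algorithm the service uses, and the names toy_stem and stem_match and the suffix list are assumptions chosen to cover this one example.

```python
# Toy suffix list, checked in order. Real stemmers use far richer rules.
SUFFIXES = ("ers", "ing", "er", "ed", "es", "e", "s")


def toy_stem(word: str) -> str:
    """Strip one known suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def stem_match(document: str, query: str) -> bool:
    """Return True if the stemmed query equals a stemmed document token."""
    doc_stems = {toy_stem(token) for token in document.split()}
    return toy_stem(query) in doc_stems


if __name__ == "__main__":
    doc = "英文分词器 english analyzer"
    for q in ["analyz", "analyzers", "analyzing", "english", "英文分词器"]:
        print(q, stem_match(doc, q))
```

The run of Chinese characters passes through unchanged as a single token, which mirrors the note that consecutive Chinese characters are treated as one token.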

Unstemmed word analyzer for English

Tokenizes English text based on spaces and punctuation marks without applying stemming. Use this analyzer for non-semantic English search scenarios — book titles, author names, or any field where exact word matching is required.

Supported field types: TEXT, SHORT_TEXT

This analyzer does not support query analysis configurations. Consecutive Chinese characters are treated as a single token.

Example

Input: 英文分词器 english analyzer

The document is retrieved by 英文分词器, english, or analyzer.

Fine-grained analyzer for English

Tokenizes English text based on semantics with finer granularity than the standard English analyzer. Suitable for general industry applications where compound words need to be split into their component terms.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

Example

Input: dataprocess

Tokenization result: data process

The document is retrieved by dataprocess, data process, data, or process.

General analyzer for Thai

A general-purpose analyzer that tokenizes Thai text into search units. Suitable for general industry applications.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

Example

Input: แหล่งดึงดูดนักท่องเที่ยว

Tokenization result: แหล่ง ดึง ดูด นักท่องเที่ยว

The document is retrieved by นักท่องเที่ยว or แหล่งดึงดูดนักท่องเที่ยว.

E-commerce analyzer for Thai

A semantic analyzer designed for Thai-language e-commerce scenarios.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

Example

Input: หน้าจอโทรศัพท์

Tokenization result: หน้าจอ โทรศัพท์

The document is retrieved by หน้าจอโทรศัพท์, หน้าจอ, or โทรศัพท์.

General analyzer for Vietnamese

A general-purpose analyzer for Vietnamese text in general industry applications.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

General analyzer for Indonesian

A general-purpose analyzer for Indonesian text in general industry applications.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

General analyzer for Korean

A general-purpose analyzer for Korean text in general industry applications.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

Example

Input: 인제군의교육

Tokenization result: 인제군 의 교육

The document is retrieved by 인제군의교육, 인제군, or 교육.

E-commerce analyzer for Korean

A semantic analyzer designed for Korean text in e-commerce scenarios.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

Example

Input: 스포츠캐주얼신발

Tokenization result: 스포츠 캐주얼 신발

The document is retrieved by 스포츠, 캐주얼, or 신발.

General analyzer for Japanese

A general-purpose analyzer for Japanese text in general industry applications.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

Example

Input: メキシコアグーチ

Tokenization result: メキシコ アグーチ

The document is retrieved by メキシコ or アグーチ.

E-commerce analyzer for Japanese

A semantic analyzer designed for Japanese text in e-commerce scenarios.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications.

Example

Input: ラウンドネックスーツ

Tokenization result: ラウンド ネック スーツ

The document is retrieved by ラウンド, ネック, or スーツ.

Simple analyzer

Gives you full control over tokenization. Use this analyzer for special scenarios where no built-in analyzer meets your requirements. When pushing documents or performing searches, use the tab character (\t) to separate tokens. Field content and search queries must be tokenized the same way — otherwise, documents cannot be retrieved.

Supported field types: TEXT, SHORT_TEXT

This analyzer does not support query analysis configurations.

Example

Field value: chrysanthemum\tflower tea\thao

The document is retrieved by chrysanthemum, flower tea, chrysanthemum\tflower tea, flower tea\thao, chrysanthemum\thao, or chrysanthemum\tflower tea\thao.
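
Because the simple analyzer leaves tokenization to the caller, both the pushed field value and the query have to be built from the same token list. The Python sketch below shows one way to do that and models retrieval as "every query token is a document token", which is an assumption consistent with the example above. The function names join_tokens and is_retrieved are hypothetical.

```python
def join_tokens(tokens: list[str]) -> str:
    """Join pre-tokenized terms with tab characters for the simple analyzer."""
    for token in tokens:
        if "\t" in token:
            raise ValueError("a token must not contain a tab character")
    return "\t".join(tokens)


def is_retrieved(field_value: str, query: str) -> bool:
    """Assumed matching model: all tab-separated query tokens occur in the field."""
    doc_tokens = set(field_value.split("\t"))
    return all(token in doc_tokens for token in query.split("\t"))


if __name__ == "__main__":
    field_value = join_tokens(["chrysanthemum", "flower tea", "hao"])
    print(is_retrieved(field_value, join_tokens(["chrysanthemum", "hao"])))
    print(is_retrieved(field_value, "flower"))  # "flower" alone is not a token
```

The second call returns False because flower tea was pushed as one token, so a query tokenized differently cannot match it.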

Numeric analyzer

Supports searches based on time intervals or numerical ranges. Use this analyzer on INT and TIMESTAMP fields where range queries are required.

Supported field types: INT, TIMESTAMP

Example

query=default:'OpenSearch' AND index:[number1,number2]

In this example, index is the name of the index field configured with the numeric analyzer.
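
A range clause of this form is easy to assemble programmatically. The Python sketch below builds only the query string shown above. The function name range_clause and the index name price_index are hypothetical, and the sketch does not send any request.

```python
def range_clause(index_name: str, low: int, high: int) -> str:
    """Build a clause of the form index:[low,high] for a numeric index."""
    if low > high:
        raise ValueError("low must not exceed high")
    return f"{index_name}:[{low},{high}]"


if __name__ == "__main__":
    # price_index is a hypothetical index configured with the numeric analyzer.
    clause = range_clause("price_index", 10, 100)
    print(f"query=default:'OpenSearch' AND {clause}")
```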

Geo-location analyzer

Supports geographic location range queries.

Supported field types: geo_point only

Example

query=spatial_index:'circle(116.5806 39.99624, 1000)'

This query matches points inside a circle centered at longitude 116.5806, latitude 39.99624, with a radius of 1000 meters. Use it to locate places near a given point.
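
The circle clause can be generated from coordinates with basic range checks. The Python sketch below reproduces the example string above. The function name circle_clause is hypothetical, the longitude-first order follows the example (116.5806 is outside the valid latitude range), and the radius is passed through unchanged.

```python
def circle_clause(index_name: str, lon: float, lat: float, radius: int) -> str:
    """Build index:'circle(lon lat, radius)' with longitude before latitude."""
    if not -180 <= lon <= 180:
        raise ValueError("longitude must be between -180 and 180")
    if not -90 <= lat <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if radius <= 0:
        raise ValueError("radius must be positive")
    return f"{index_name}:'circle({lon} {lat}, {radius})'"


if __name__ == "__main__":
    print("query=" + circle_clause("spatial_index", 116.5806, 39.99624, 1000))
```

The latitude check catches the most common mistake, which is passing the coordinates in latitude-first order.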

IT content analyzer

An industry-specific analyzer designed for IT content. It tokenizes IT-related terms differently than the general analyzer — for example, handling programming language names and technical abbreviations with higher precision.

Supported field types: TEXT, SHORT_TEXT

Example

Original content: C++ array usage notes

General analysis: C, +, +, array, usage, notes

IT content analysis: C++, array, usage, notes

General E-commerce analysis

An industry-specific analyzer for e-commerce, powered by the natural language processing (NLP) technology of Alibaba DAMO Academy. It resolves common pain points in e-commerce search, including product attribute parsing and brand recognition.

Supported field types: TEXT only

Important

Available only for dedicated applications using the E-commerce Enhanced specification.

Example

Input: Small Gold Tube Concealer Cream

E-commerce analysis output: Small Gold Tube, Concealer, Cream

General analysis for the gaming industry

An industry-specific analyzer designed for gaming content.

Supported field types: TEXT, SHORT_TEXT

Important

Available only for dedicated applications using the Gaming Enhanced specification.

Example

Input: Genshin equipment

Tokenization result: Genshin, equipment

The document is retrieved by Genshin equipment, Genshin, or equipment.

General analyzer for English E-commerce

A semantic analyzer for English text in e-commerce scenarios.

Supported field types: TEXT only

Important

Available only for dedicated applications using the E-commerce Enhanced specification.

Custom text analyzer

Combines an industry-specific analyzer — such as a general analyzer, an e-commerce analyzer, or a person name analyzer — with custom intervention entries. Use this analyzer to extend built-in tokenization behavior with domain-specific vocabulary.

Supported field types: TEXT, SHORT_TEXT only

For configuration details, see Custom text analyzers.

Test analyzers

Test the tokenization results of industry-specific and custom analyzers in the OpenSearch console. Navigate to Search Algorithm Center > Retrieval Configuration > Analyzer Management, then click the Analysis Test tab.

Choose the right analyzer

Use the following guidance to select the right analyzer for your scenario.

| Scenario | Recommended analyzer |
| --- | --- |
| Semantic Chinese search (most industries) | General analyzer for Chinese |
| Non-semantic Chinese search or short text with high recall needs | Single-character analyzer for Chinese |
| Chinese search requiring maximum character granularity | Character analyzer for Chinese |
| Pinyin search (full or abbreviated) | Fuzzy analyzer |
| Search by pinyin initials only | Abbreviated pinyin analyzer |
| Search by full pinyin spelling | Full pinyin spelling analyzer |
| English semantic search with stemming | Word stemming analyzer for English |
| English exact-word search (titles, names) | Unstemmed word analyzer for English |
| English compound word splitting | Fine-grained analyzer for English |
| E-commerce product search (Chinese) | E-commerce analyzer for Chinese or General E-commerce analysis |
| IT industry content | IT content analyzer |
| Numeric or time range queries | Numeric analyzer |
| Geographic proximity queries | Geo-location analyzer |
| Tags, identifiers, or non-tokenized strings | Keyword analyzer |
| Character-level substring matching | N-gram analyzer |
| Custom tokenization logic | Simple analyzer |

Combining analyzers for better recall and precision

For Chinese search, combine the general analyzer with the single-character analyzer to retrieve documents containing individual characters while ranking exact phrases higher. For example:

query=title_index:'菊花茶' OR sws_title_index:'菊花茶'

Pair this query with the sort expression text_relevance(title)*5+field_proximity(sws_title). This combination retrieves documents that contain the individual characters of 菊花茶 even when they are separated, while ranking documents that contain the exact phrase 菊花茶 higher.
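
The combined query follows a fixed pattern: the same keyword is searched in both indexes and the clauses are joined with OR. The Python sketch below builds that string. The function name combined_query is hypothetical, and because the escaping rules for quotes inside a keyword are not covered here, the sketch rejects such keywords instead of guessing.

```python
def combined_query(keyword: str, semantic_index: str, single_char_index: str) -> str:
    """Search one keyword in a semantic index and a single-character index."""
    if "'" in keyword:
        # Escaping rules are not covered in this document, so refuse to guess.
        raise ValueError("keyword must not contain a single quote")
    return f"{semantic_index}:'{keyword}' OR {single_char_index}:'{keyword}'"


if __name__ == "__main__":
    print("query=" + combined_query("菊花茶", "title_index", "sws_title_index"))
```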

Usage notes

Supported field types for index fields

| Supported | Unsupported |
| --- | --- |
| INT, INT_ARRAY, TEXT, SHORT_TEXT, LITERAL, LITERAL_ARRAY, TIMESTAMP, GEO_POINT | FLOAT, FLOAT_ARRAY, DOUBLE, DOUBLE_ARRAY |

Additional constraints:

  • The default primary key of the primary table in the application schema is set as an index field named id. This configuration cannot be modified.

  • If a search result summary is configured for a TEXT field, some extended search unit phrases (such as 花茶 derived from 菊花茶) are not highlighted in search results.

  • The single-character Chinese analyzer treats numbers and English words as single tokens. A search for he does not retrieve a document containing hello world. Use the fuzzy analyzer for partial word matching on numbers or English text.