Fuzzy search

Introduction to fuzzy search

Fuzzy search lets a search engine loosely match a user's query against document content to find relevant results. It is useful when the user's search intent is unclear. A fuzzy match can occur in two ways: the query is the full Pinyin or the Pinyin initials of content in the document, or the query text itself appears in the document. Because fuzzy search cannot interpret user intent precisely, the results may include a large amount of irrelevant content. Use it with caution, and only where it suits your scenario.

Notes

  • The field type for a fuzzy analyzer must be SHORT_TEXT.

  • For queries on a field that uses the fuzzy analyzer, generally enclose the search term in single quotation marks. Use double quotation marks only in the specific cases described in this document (Pinyin, prefix, suffix, and phrase queries).

Scenarios

Fuzzy search is mainly used when a user's search intent is unclear, or when you have a small amount of data and want to retrieve more results. The main scenarios are as follows.

Pinyin search

Introduction: Pinyin search lets you search for Chinese data in a document using full Pinyin or Pinyin initials.

Example:

Document content: 开放搜索 (Pinyin: kai fang sou suo; "OpenSearch" in English)
The following queries all retrieve this document:
"kai", "kaifang", "sousuo", "kaifangsousuo", "k", "kf", "ss", "kfss"

Notes:

  • You must use double quotation marks for Pinyin search queries.

  • Double quotation marks require the matched content to be contiguous in the document. Pinyin queries always need this because a user who enters Pinyin has a specific intent. For example, a user who searches for "kfss" (开放搜索) expects the four corresponding characters to appear together.
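
The matching behavior in the example above can be pictured with a toy model. The following Python sketch is not how OpenSearch implements Pinyin search. It assumes a hypothetical hand-written table (`PINYIN`) and a hypothetical helper (`pinyin_keys`) that, for every contiguous run of characters, emits the concatenated full Pinyin and the concatenated initials. Under that assumption, all eight example queries match the text 开放搜索, while a reordered query such as "sskf" does not.

```python
# Hypothetical four-entry table for illustration only; a real system
# would rely on a complete Pinyin dictionary.
PINYIN = {"开": "kai", "放": "fang", "搜": "sou", "索": "suo"}


def pinyin_keys(text):
    """Return full-Pinyin and initial keys for every contiguous run of characters."""
    keys = set()
    for start in range(len(text)):
        full, initials = "", ""
        for ch in text[start:]:
            syllable = PINYIN.get(ch)
            if syllable is None:
                break  # a run may not cross a character without a Pinyin entry
            full += syllable
            initials += syllable[0]
            keys.add(full)      # e.g. "kaifang"
            keys.add(initials)  # e.g. "kf"
    return keys


KEYS = pinyin_keys("开放搜索")
EXAMPLE_QUERIES = ["kai", "kaifang", "sousuo", "kaifangsousuo", "k", "kf", "ss", "kfss"]
ALL_MATCH = all(q in KEYS for q in EXAMPLE_QUERIES)
```

Because the keys are built only from contiguous runs, the model also reflects why a Pinyin query is treated as an ordered, adjacent match.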

Prefix search

Introduction: Prefix search retrieves content that starts with a specified prefix.

Example:

# The prefix identifier for fuzzy search is '^'. To search for phone numbers that start with 138,
# write the query as "^138". Note: Use double quotation marks for the query.

Notes:

  • Prefix matching for Chinese characters is not supported.

  • For prefix matching, you must enclose the query in double quotation marks.

Suffix search

Introduction: Suffix search retrieves content that ends with a specified suffix.

Example:

# The suffix identifier for fuzzy search is '$'. To search for phone numbers that end with 9527,
# write the query as "9527$". Note: Use double quotation marks for the query.

Notes:

  • Suffix matching for Chinese characters is not supported.

  • For suffix matching, you must enclose the query in double quotation marks.
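
The prefix and suffix rules above can be collected into a small client-side helper. This is a minimal sketch, assuming a hypothetical function name (`affix_query`) and the `query=default:"..."` clause form used elsewhere in this document. It only builds the clause string, rejects terms that contain anything other than ASCII letters and digits (which covers Chinese characters), and does not handle URL encoding or sending the request.

```python
import re

# Prefix and suffix search are documented for letters, digits, and Pinyin only.
_ALNUM = re.compile(r"[A-Za-z0-9]+")


def affix_query(index, term, kind):
    """Build a prefix ("^term") or suffix ("term$") clause in double quotes."""
    if not _ALNUM.fullmatch(term):
        raise ValueError("prefix/suffix terms must contain only letters and digits")
    if kind == "prefix":
        body = "^" + term
    elif kind == "suffix":
        body = term + "$"
    else:
        raise ValueError("kind must be 'prefix' or 'suffix'")
    # Double quotation marks are required for both kinds of query.
    return 'query={}:"{}"'.format(index, body)


PREFIX_CLAUSE = affix_query("default", "138", "prefix")   # phone numbers starting with 138
SUFFIX_CLAUSE = affix_query("default", "9527", "suffix")  # phone numbers ending with 9527
```

Validating the term before building the clause surfaces an unsupported Chinese prefix or suffix as an error in the client rather than as an empty result set.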

Single-character or single-letter search

Introduction: Fuzzy search supports single-character or single-letter searches. This is mainly used to increase document recall. However, the results may not be very accurate.

Example:

# The document content is: '开放搜索 open search'
# Either query=default:'放' or query=default:'o' retrieves the document.

Phrase query

Introduction: A phrase query uses double quotation marks to enforce the order of terms. A phrase query can contain only consecutive letters and numbers.

Example:

# 1. query=default:"OpenSearch"
# This only retrieves documents that contain "xxxOpenSearchxxx". It does not retrieve documents like "xxxSearchOpenxxx".

# 2. query=default:"HuaweiP"
# This cannot retrieve a document like "HuaweiP20", because it violates the rule that "Only consecutive letters and numbers can be used in a phrase query".
# For this scenario, use single quotation marks for the query.

Notes:

  • You must use double quotation marks for phrase queries.

  • Phrase queries retrieve more accurate results and reduce the number of retrieved documents. However, they consume more system resources. In such scenarios, consider using the chinese analyzer instead.

  • Fuzzy search is intended for scenarios where the search intent is unclear or you want to retrieve more results from a small dataset. Therefore, you should use single quotation marks for all queries except for Pinyin, prefix, suffix, and phrase queries.
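
The quoting rule in the notes above can be sketched as a small helper. The function names (`quote_term`, `query_clause`) and the `kind` labels are hypothetical, chosen only for illustration. The sketch selects double quotation marks for Pinyin, prefix, suffix, and phrase queries and single quotation marks for everything else. It does not add the '^' or '$' identifiers and does not attempt to escape quotation marks inside the term.

```python
# Query kinds that this document says must use double quotation marks.
DOUBLE_QUOTED_KINDS = {"pinyin", "prefix", "suffix", "phrase"}


def quote_term(term, kind="fuzzy"):
    """Wrap a search term in the quotation marks that match its query kind."""
    if "'" in term or '"' in term:
        raise ValueError("escaping quotation marks is outside the scope of this sketch")
    quote = '"' if kind in DOUBLE_QUOTED_KINDS else "'"
    return quote + term + quote


def query_clause(index, term, kind="fuzzy"):
    """Build a clause such as query=default:'o' or query=default:"kfss"."""
    return "query={}:{}".format(index, quote_term(term, kind))


PINYIN_CLAUSE = query_clause("default", "kfss", "pinyin")
SINGLE_LETTER_CLAUSE = query_clause("default", "o")
```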

Limits

When you create an application, you must set the field that requires fuzzy search to the short_text type and assign a fuzzy analyzer to it. By default, fuzzy search results are sorted by the position of the matched term within the field. For example, assume the title field of an application requires fuzzy search. If Document 1 contains "开放搜索" and Document 2 contains "I like to use 开放搜索", a search for "kfss" ranks Document 1 higher than Document 2 by default, because the match occurs earlier in the field.

Fuzzy search works well when the search intent is unclear, but note the following limits:

  • Prefix and suffix searches are supported only for English letters, numbers, and Pinyin. Chinese characters are not supported.

  • Punctuation marks in a short_text field are filtered out.

  • After punctuation is filtered out, the content of a short_text field is limited to 100 bytes. Any content exceeding this limit is discarded.

  • A short_text field supports drop-down suggestions.

  • Indexes created from a short_text field cannot use query analysis.

  • If a short_text field uses only a fuzzy analyzer and is not indexed by any other analyzer, full-width characters in the field are converted to half-width characters in the retrieved summary. To avoid this, you can create a new index that uses the chinese analyzer.

  • Highlighting is not supported for English letters, numbers, or Pinyin.
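
The punctuation filtering and the 100-byte limit listed above can be approximated on the client side to predict what part of a value remains searchable. This is a rough sketch under stated assumptions, not the service's actual behavior: it assumes UTF-8 encoding, treats every character in a Unicode punctuation category (P*) as punctuation, and drops a trailing partial character when the byte limit falls inside a multi-byte character. The helper name `normalize_short_text` is hypothetical.

```python
import unicodedata

MAX_BYTES = 100  # limit stated in this document, applied after punctuation is removed


def normalize_short_text(value, encoding="utf-8"):
    """Approximate the documented punctuation filtering and byte-limit truncation."""
    # Assumption: "punctuation" means any character in a Unicode P* category.
    kept = "".join(
        ch for ch in value if not unicodedata.category(ch).startswith("P")
    )
    # Assumption: the limit is counted in UTF-8 bytes; errors="ignore" discards
    # a trailing character that was cut in the middle.
    return kept.encode(encoding)[:MAX_BYTES].decode(encoding, errors="ignore")


CLEANED = normalize_short_text("Hello, world!")
TRUNCATED = normalize_short_text("开" * 40)  # 120 bytes in UTF-8 before truncation
```

Under these assumptions a value made only of Chinese characters keeps roughly 33 characters, which is worth bearing in mind when choosing which field to index with the fuzzy analyzer.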