Custom analyzers

Overview

Analysis is a critical component of a search engine because the terms it produces directly affect which documents a query retrieves. Because business scenarios vary, the same phrase can carry different meanings in different contexts, which leads to different analysis requirements. To address this, OpenSearch provides both general-purpose and domain-specific analyzers, such as an e-commerce analyzer.

OpenSearch lets you create a custom analyzer by combining a built-in analyzer with intervention entries. You can then apply the custom analyzer to the index fields of your application to control the analysis results at both indexing and query time, which helps improve the quality of search results.

Intervention entries

You can control the behavior of intervention entries by enabling or disabling the secondary analysis feature.

Enabling secondary analysis allows the analyzer to further segment the result of an intervention entry. Disabling it ensures that the intervention result is treated as a final output and is not segmented further.

For example, suppose the entry is 'Open Search' and the analyzer is 'General Chinese'.

With secondary analysis enabled, the analysis result shows that "Open Search" is segmented into the two entries Open and Search.

With secondary analysis disabled, the analysis result shows that 'OpenSearch' is treated as a single entry, OpenSearch, and is not segmented further.
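
The effect of the toggle can be sketched with a toy model. This is only an illustration under stated assumptions, not how OpenSearch implements analysis: `toy_base_analyze`, `VOCABULARY`, and `analyze` are hypothetical names, the base analyzer is reduced to a two-word vocabulary, and an intervention entry is assumed to apply only when the text matches its query exactly.

```python
from typing import Dict, List

# Hypothetical vocabulary for a toy base analyzer; a real built-in analyzer is far richer.
VOCABULARY = ["open", "search"]


def toy_base_analyze(text: str) -> List[str]:
    """Split text on known vocabulary words; other characters become single-character terms."""
    terms: List[str] = []
    i = 0
    while i < len(text):
        for word in VOCABULARY:
            if text.startswith(word, i):
                terms.append(word)
                i += len(word)
                break
        else:
            terms.append(text[i])
            i += 1
    return terms


def analyze(text: str, interventions: Dict[str, str], secondary_analysis: bool) -> List[str]:
    """Use the intervention entry if the text matches a query, else the base analyzer."""
    if text not in interventions:
        return toy_base_analyze(text)
    terms = interventions[text].split()
    if not secondary_analysis:
        # Secondary analysis disabled: the intervention result is final.
        return terms
    # Secondary analysis enabled: each intervention term is segmented again.
    result: List[str] = []
    for term in terms:
        result.extend(toy_base_analyze(term))
    return result


interventions = {"opensearch": "opensearch"}
print(analyze("opensearch", interventions, secondary_analysis=True))   # ['open', 'search']
print(analyze("opensearch", interventions, secondary_analysis=False))  # ['opensearch']
```

In this model, disabling secondary analysis is the only way to keep a term whole when the base analyzer would otherwise split it.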

Usage notes

  • A custom analyzer includes all entries from its base analyzer, plus your custom intervention entries. The intervention entries override the base analyzer's default entries.

  • You can create a maximum of 20 custom analyzers.

  • Each custom analyzer can contain a maximum of 1,000 intervention entries.

  • An entry's query cannot exceed 10 characters (Chinese or English), and its value cannot exceed 32 characters.

  • Entries cannot contain uppercase letters (A-Z), full-width characters (\uff01 to \uff5e), or Chinese punctuation.

  • For semantic segmentation intervention entries, the query and the value must be identical after spaces are removed from the value. For example:

    invalidkey=>wrong key
    validkey=>valid key

    The first entry is invalid because its query invalidkey and value wrong key (which becomes wrongkey after removing spaces) are not identical.

  • When adding an intervention entry, use spaces to separate multiple terms in the analysis result. Do not use commas. Commas are treated as part of the term content, not as delimiters, which will cause validation to fail and prevent the entry from being saved. For example:

    applehuawei=>apple huawei (Correct, separated by a space)
    applehuawei=>apple,huawei (Incorrect, the comma is treated as a content character)

    In the second entry, the value is separated by a comma. The comma is treated as a literal character, making the value apple,huawei after space removal, which does not match the query applehuawei. Therefore, validation fails.

  • The query of an entry cannot contain spaces. For example:

    invalid key=>an invalid key
    validkey=>a valid key

    The first entry is invalid because its query, invalid key, contains a space.

  • The query of an entry cannot be a substring of the value of another entry within the same intervention dictionary. For example:

    customanalyzer=>custom analyzer
    analyzer
    token

    The second entry, with the query analyzer, is invalid because its query is part of the value (custom analyzer) of the first entry. The third entry, token, is valid.
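
The rules above can be checked locally before you enter entries in the console. The following is a minimal sketch, not an official validator: all function and constant names are hypothetical, the `query=>value` text form is taken from the examples above, every entry is assumed to be a semantic segmentation entry, and the set of Chinese punctuation marks is an illustrative, incomplete assumption.

```python
from typing import List, Tuple

MAX_QUERY_LEN = 10
MAX_VALUE_LEN = 32
# Assumption: an illustrative, incomplete set of Chinese punctuation marks.
CHINESE_PUNCTUATION = "。、“”‘’《》【】"


def has_forbidden_char(text: str) -> bool:
    """True if text has an uppercase A-Z, a full-width character, or listed punctuation."""
    return any(
        "A" <= ch <= "Z" or "\uff01" <= ch <= "\uff5e" or ch in CHINESE_PUNCTUATION
        for ch in text
    )


def parse_entry(line: str) -> Tuple[str, str]:
    """Split 'query=>value'; a bare entry such as 'token' uses the query as its value."""
    query, separator, value = line.partition("=>")
    return (query, value if separator else query)


def validate_entry(query: str, value: str) -> List[str]:
    """Return the rule violations for a single entry (empty list means it passes)."""
    errors: List[str] = []
    if " " in query:
        errors.append("query contains a space")
    if len(query) > MAX_QUERY_LEN:
        errors.append("query exceeds 10 characters")
    if len(value) > MAX_VALUE_LEN:
        errors.append("value exceeds 32 characters")
    if has_forbidden_char(query) or has_forbidden_char(value):
        errors.append("entry contains a forbidden character")
    if value.replace(" ", "") != query:
        errors.append("value without spaces does not match query")
    return errors


def find_substring_conflicts(entries: List[Tuple[str, str]]) -> List[str]:
    """Return queries that appear inside the value of another entry in the same dictionary."""
    conflicts: List[str] = []
    for i, (query, _) in enumerate(entries):
        for j, (_, other_value) in enumerate(entries):
            if i != j and query in other_value:
                conflicts.append(query)
                break
    return conflicts


print(validate_entry(*parse_entry("applepie=>apple pie")))  # []
print(validate_entry(*parse_entry("applepie=>apple,pie")))
dictionary = [parse_entry(line) for line in ["customanalyzer=>custom analyzer", "analyzer", "token"]]
print(find_substring_conflicts(dictionary))  # ['analyzer']
```

The per-entry checks and the dictionary-level substring check are kept separate because the latter depends on every other entry in the same intervention dictionary.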

Procedure

Workflow overview

Create a custom analyzer -> Perform an offline modification -> Reindex -> Verify the results

Steps

1. In the left-side navigation pane of the OpenSearch console, go to Search Algorithm Center > Retrieval Configuration > Analyzer Management, and then click Create.

2. Create an analyzer, define a name, and select a type:

In the dialog box that appears, for Analyzer Type, select Built-in Analyzer or Custom Model Analyzer, and from the Analyzer dropdown list, select a specific analyzer, such as Chinese - General Analysis.

3. Add an intervention entry: enter the query (for example, "糯米", which means glutinous rice) and the analysis result, and choose whether to enable secondary analysis:

On the Entry Management page, click Add. In the Add Intervention Entry dialog box, enter a Query and the Analysis Result, enable Secondary Analysis, and click Save. Note: Separate words with spaces. Example: "糯米" ==> "糯 米".

4. Test the analysis to verify the result of the intervention entry:

In the analyzer list on the Analyzer Management page, find the target custom analyzer and click Analysis Test in the Actions column.

  • 4.1. In the test text, enter "糯米" (glutinous rice).

After testing with the test_zw analyzer, the analysis result shows that "糯米" is split into the two entries 糯 and 米.

  • 4.2. Compare the analysis results from multiple custom analyzers.

The comparison results show that test_zw segments "糯米" into 糯 and 米, whereas kevintest2 keeps 糯米 as a complete entry.

5. After testing the analysis effect, return to Basic Configuration under Retrieval Configuration to apply the changes to your application:

In the Analysis Configuration area, click the Offline Modification button.

Note: An offline modification generates a new offline application version based on your current configuration. Your live application is not affected.

6. In the list of index fields, select your custom analyzer from the Analysis Method column for the relevant index field.

7. Reindex the application. The changes take effect after reindexing is complete.

Demonstrating the custom analyzer's effect

For example, for a document that contains "糯米" (glutinous rice), the "Chinese - General" analyzer can produce unexpected results: a search for the single character "米" (rice) fails to retrieve documents that contain compound words with that character, such as the words for "glutinous rice", "millet", or "rice". After you add the "test_zw" custom analyzer, modify the application structure, and reindex as described in the procedure above, the indexed terms are consistent with your custom analysis.
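
Why the intervention changes retrieval can be illustrated with a toy inverted index. This is a simplified sketch under assumptions, not the OpenSearch indexing pipeline: `build_index`, `general_analyze`, and `custom_analyze` are hypothetical names, the general analyzer is assumed to keep "糯米" as a single term, and a query is assumed to match only documents indexed under exactly the same term.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def build_index(docs: Dict[int, str], analyze: Callable[[str], List[str]]) -> Dict[str, Set[int]]:
    """Map each term produced by the analyzer to the IDs of documents that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def general_analyze(text: str) -> List[str]:
    """Assumption: the general analyzer keeps the whole word as one term."""
    return [text]


def custom_analyze(text: str) -> List[str]:
    """Toy custom analyzer with the intervention entry "糯米" ==> "糯 米"."""
    interventions = {"糯米": ["糯", "米"]}
    return interventions.get(text, [text])


docs = {1: "糯米"}
general_index = build_index(docs, general_analyze)
custom_index = build_index(docs, custom_analyze)

# Searching for the single term "米" only matches when "米" was indexed as a term.
print(sorted(general_index.get("米", set())))  # []
print(sorted(custom_index.get("米", set())))   # [1]
```

The second index exists only after the documents are analyzed again, which is why the new entry has no effect until reindexing completes.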

Additional notes

  • You can add entries to a custom analyzer that is already in use by an application, but you must reindex for the new entries to take effect. To apply the changes to specific documents, re-upload them. This action reindexes only those documents, applying the new intervention entries.

  • The query of an entry cannot exceed 10 characters.

  • The query of an entry cannot contain uppercase letters, full-width characters, or Chinese punctuation.

  • The analysis result of an entry cannot contain uppercase letters, full-width characters, or Chinese punctuation.

  • If you disable the secondary analysis toggle, OpenSearch strictly adheres to your intervention results and performs no further segmentation. If enabled, the results of your intervention may be segmented further.

  • To use the Industry - E-commerce General Analyzer as the base for a custom analyzer, your application must be of the Industry-specific Enhanced Edition.

  • You cannot delete a custom analyzer that is in use.