The AliNLP tokenization plug-in, also known as analysis-aliws, is a built-in plug-in of Alibaba Cloud Elasticsearch. After you install this plug-in on your Elasticsearch cluster, an analyzer and a tokenizer are integrated into the cluster to implement document analysis and retrieval. The plug-in allows you to upload a tailored dictionary file to it. After the upload, the system performs a rolling update for your Elasticsearch cluster to apply the dictionary file.
Introduction
After the analysis-aliws plug-in is installed, the following analyzer and tokenizer are integrated into your Elasticsearch cluster by default. You can use the analyzer and tokenizer to search for documents. You can also upload a tailored dictionary file to the plug-in.
Analyzer: aliws. This analyzer filters function words, function word phrases, and symbols out of the tokenization results, as the output in Test the analyzer shows. It also converts all uppercase English characters to lowercase.
Tokenizer: aliws_tokenizer
For more information, see Use the aliws analyzer to search for a document and Configure dictionaries.
If you fail to get the expected results by using the analysis-aliws plug-in, troubleshoot the failure based on the instructions in Test the analyzer and Test the tokenizer.
You can customize a tokenizer. For more information, see Customize a tokenizer.
Prerequisites
The analysis-aliws plug-in is installed. It is not installed by default. For more information about how to install the plug-in, see Install and remove a built-in plug-in.
Limits
The memory size of data nodes in your Elasticsearch cluster must be 8 GiB or higher. If the memory size of data nodes in your cluster does not meet the requirements, upgrade the configuration of your cluster. For more information, see Upgrade the configuration of a cluster.
Elasticsearch V5.X clusters, Elasticsearch V8.X clusters, and Elasticsearch Kernel-enhanced Edition clusters do not support the analysis-aliws plug-in. You can check whether your cluster supports the plug-in in the Elasticsearch console.
Use the aliws analyzer to search for a document
Log on to the Kibana console of your Elasticsearch cluster.
For instructions, see Log on to the Kibana console.
Note: Examples here use Elasticsearch V6.7.0. Operations may vary slightly for other versions.
In the left navigation menu, choose .
In the Console, run one of the following commands to create an index.
Command for an Elasticsearch cluster of a version earlier than V7.0
PUT /index
{
  "mappings": {
    "fulltext": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "aliws"
        }
      }
    }
  }
}
Command for an Elasticsearch cluster of V7.0 or later
PUT /index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "aliws"
      }
    }
  }
}
In this example, an index named index is created. In a version earlier than V7.0, the mapping type of the index is fulltext. In V7.0 or later, the type is _doc. The index contains a field named content of the text type, and this field uses the aliws analyzer.
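The two create-index requests differ only in whether the field definitions are nested under a type name. The following Python sketch illustrates that difference, assuming you already know the major version of your cluster. The function name build_index_body and its parameters are hypothetical and are not part of any Elasticsearch client.

```python
import json


def build_index_body(major_version, field="content", analyzer="aliws",
                     type_name="fulltext"):
    """Return a create-index request body for the given major version.

    Hypothetical helper for illustration only.
    """
    properties = {field: {"type": "text", "analyzer": analyzer}}
    if major_version < 7:
        # Versions earlier than V7.0 nest the properties under a mapping type.
        return {"mappings": {type_name: {"properties": properties}}}
    # V7.0 and later place the properties directly under "mappings".
    return {"mappings": {"properties": properties}}


if __name__ == "__main__":
    print(json.dumps(build_index_body(6), indent=2))
    print(json.dumps(build_index_body(7), indent=2))
```

The printed bodies can be pasted after PUT /index in the Kibana console.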
If the command is successfully run, the following result is returned:
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "index"
}
Run the following command to add a document:
Important: The following command is suitable only for an Elasticsearch cluster of a version earlier than V7.0. For an Elasticsearch cluster of V7.0 or later, you must change fulltext to _doc.
POST /index/fulltext/1
{
  "content": "I like go to school."
}
The preceding command adds a document whose ID is 1 and sets the value of the content field in the document to "I like go to school."
If the command is successfully run, the following result is returned:
{
  "_index": "index",
  "_type": "fulltext",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
Run the following command to search for the document:
Important: The following command is suitable only for an Elasticsearch cluster of a version earlier than V7.0. For an Elasticsearch cluster of V7.0 or later, you must change fulltext to _doc.
GET /index/fulltext/_search
{
  "query": {
    "match": {
      "content": "school"
    }
  }
}
The preceding command searches all documents of the fulltext type and returns the documents whose content field, as analyzed by the aliws analyzer, contains school.
If the command is successfully run, the following result is returned:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "content": "I like go to school."
        }
      }
    ]
  }
}
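If you process the search response in a script, note that hits.total is a plain number in versions earlier than V7.0, as in the preceding result, whereas V7.0 and later return an object that contains value and relation fields. The following Python sketch handles both shapes. The helper names total_hits and matched_contents are hypothetical, and the sample responses are trimmed to the fields that the helpers read.

```python
def total_hits(response):
    """Return the hit count from a search response of either shape."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # V7.0 or later: {"value": 1, "relation": "eq"}
        return total["value"]
    # Earlier than V7.0: a plain integer
    return total


def matched_contents(response, field="content"):
    """Return the value of one source field for every returned hit."""
    return [hit["_source"][field] for hit in response["hits"]["hits"]]


# Trimmed sample responses for illustration.
hit = {"_id": "1", "_source": {"content": "I like go to school."}}
response_v6 = {"hits": {"total": 1, "hits": [hit]}}
response_v7 = {"hits": {"total": {"value": 1, "relation": "eq"}, "hits": [hit]}}

if __name__ == "__main__":
    print(total_hits(response_v6), matched_contents(response_v6))
    print(total_hits(response_v7), matched_contents(response_v7))
```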
If you fail to get the expected results by using the analysis-aliws plug-in, troubleshoot the failure based on the instructions in Test the analyzer and Test the tokenizer.
Configure dictionaries
The analysis-aliws plug-in allows you to upload a tailored dictionary file named aliws_ext_dict.txt. After you upload a tailored dictionary file, all the nodes in your Elasticsearch cluster automatically load the file. In this case, the system does not restart the cluster.
After the analysis-aliws plug-in is installed, no dictionary file is provided. You must manually upload a tailored dictionary file.
Before you upload a tailored dictionary file, you must make sure that the dictionary file meets the following requirements:
Name: aliws_ext_dict.txt.
Encoding format: UTF-8.
Content: Each line must contain one word with no leading or trailing whitespace. Use UNIX or Linux line endings (\n). If you create the file in Windows, you must convert its line endings to the UNIX format (for example, with the dos2unix tool) before uploading.
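The requirements above can be checked before you upload the file. The following Python sketch normalizes the raw bytes of a dictionary file: it verifies that the content decodes as UTF-8, trims each line, and rewrites the line endings as \n. The function name normalize_dictionary is hypothetical. Dropping blank lines, removing a byte order mark, and ending the file with a newline are assumptions made for this sketch, not documented requirements of the plug-in.

```python
def normalize_dictionary(raw: bytes) -> bytes:
    """Return dictionary content as UTF-8 with one trimmed word per line.

    Raises UnicodeDecodeError if the input is not valid UTF-8.
    """
    # "utf-8-sig" decodes plain UTF-8 and also drops a leading byte order
    # mark if one is present (assumption: the mark is not wanted).
    text = raw.decode("utf-8-sig")
    words = []
    for line in text.splitlines():  # splits on \r\n, \r, and \n
        word = line.strip()         # no leading or trailing whitespace
        if word:                    # assumption: blank lines are dropped
            words.append(word)
    if not words:
        return b""
    return ("\n".join(words) + "\n").encode("utf-8")


if __name__ == "__main__":
    sample = "  cloud computing \r\nelasticsearch\r\n\r\n".encode("utf-8")
    # Write in binary mode so that Python does not translate \n on Windows.
    with open("aliws_ext_dict.txt", "wb") as f:
        f.write(normalize_dictionary(sample))
```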
Log on to the Alibaba Cloud Elasticsearch console.
In the left navigation menu, choose Elasticsearch Clusters.
Navigate to the target cluster.
In the top navigation bar, select the resource group to which the cluster belongs and the region where the cluster resides.
On the Elasticsearch Clusters page, find the cluster and click its ID.
In the left-side navigation pane, choose .
On the Built-in Plug-ins tab, find the analysis-aliws plug-in and click Configure Dictionary in the Actions column.
At the bottom of the Configure Dictionary panel, click Configure.
Select a method to upload the dictionary file. Then, upload the dictionary file based on the following instructions.
TXT File: Click Upload TXT File and select the dictionary file from your local machine.
Add OSS File: Enter the Bucket Name and File Name, and then click Add.
Make sure that the bucket that you specify resides in the same region as the Elasticsearch cluster. If the content of the dictionary that is stored in Object Storage Service (OSS) changes, you must upload the dictionary file again.
Note: The analysis-aliws plug-in supports only one dictionary file. To update the file, click the X icon next to the aliws_ext_dict.txt file to delete it, and then upload a new file.
Click Save.
The system does not restart your cluster but performs a rolling update to make the uploaded dictionary file take effect. The update requires about 10 minutes.
Note: If you want to download the uploaded dictionary file, click the icon that corresponds to the file.
Test the analyzer
Run the following command to test the aliws analyzer:
GET _analyze
{
"text": "I like go to school.",
"analyzer": "aliws"
}
If the command is successfully run, the following result is returned:
{
"tokens" : [
{
"token" : "i",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "like",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 2
},
{
"token" : "go",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 4
},
{
"token" : "school",
"start_offset" : 13,
"end_offset" : 19,
"type" : "word",
"position" : 8
}
]
}
Test the tokenizer
Run the following command to test the aliws_tokenizer tokenizer:
GET _analyze
{
"text": "I like go to school.",
"tokenizer": "aliws_tokenizer"
}
If the command is successfully run, the following result is returned:
{
"tokens" : [
{
"token" : "I",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : " ",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "like",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 2
},
{
"token" : " ",
"start_offset" : 6,
"end_offset" : 7,
"type" : "word",
"position" : 3
},
{
"token" : "go",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 4
},
{
"token" : " ",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 5
},
{
"token" : "to",
"start_offset" : 10,
"end_offset" : 12,
"type" : "word",
"position" : 6
},
{
"token" : " ",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 7
},
{
"token" : "school",
"start_offset" : 13,
"end_offset" : 19,
"type" : "word",
"position" : 8
},
{
"token" : ".",
"start_offset" : 19,
"end_offset" : 20,
"type" : "word",
"position" : 9
}
]
}
Customize a tokenizer
After the analysis-aliws plug-in tokenizes data, the following filters process the resulting tokens: stemmer, lowercase, porter_stem, and stop. To use these filters in your own analyzer, specify aliws_tokenizer, the tokenizer of the analysis-aliws plug-in, as the tokenizer of a custom analyzer and add filter configurations based on your business requirements. The following code provides an example. You can use the stopwords field to configure stopwords.
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": [ " ", ",", ".", " ", "a", "of" ]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "aliws_tokenizer",
          "filter": [ "lowercase", "porter_stem", "my_stop" ]
        }
      }
    }
  }
}
Note: If you do not require a filter, you can delete the filter configuration.
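To get a feel for what the filter chain in the preceding example does to the output of aliws_tokenizer, the following Python sketch applies a lowercase step and a stop step to the token list shown in Test the tokenizer. This is a simplified illustration, not the plug-in's implementation: the porter_stem step is omitted, token positions and offsets are ignored, and the function name apply_filters is hypothetical.

```python
def apply_filters(tokens, stopwords):
    """Lowercase every token, then drop the tokens listed as stopwords."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in stopwords]


# Tokens that aliws_tokenizer returns for "I like go to school."
tokens = ["I", " ", "like", " ", "go", " ", "to", " ", "school", "."]

# Stopwords taken from the my_stop filter in the preceding example.
my_stop = {" ", ",", ".", "a", "of"}

if __name__ == "__main__":
    print(apply_filters(tokens, my_stop))
    # ['i', 'like', 'go', 'to', 'school']
```

Because "to" is not in the my_stop list, it stays in the result. Add it to the stopwords field if you want the custom analyzer to drop it.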
You can also configure synonyms for a custom analyzer that uses aliws_tokenizer. The configuration method is the same as that used for the analysis-ik plug-in. For more information, see Use synonyms.
FAQ
The aliws analyzer may remove the last letter of a word. For example, tokenizing iPhone and Chinese produces Iphon and chines, without the final e.
Cause: The aliws analyzer applies a stemming filter after tokenization, which removes the final e.
Solution: Run the following command to create an index with a custom analyzer named my_custom_analyzer that uses aliws_tokenizer and has no filter configuration:
PUT my-index1
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "aliws_tokenizer"
        }
      }
    }
  }
}
Verification: Run the following command to check whether the tokenization results meet expectations:
GET my-index1/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": ["iphone"]
}
References
For information about the plug-ins provided by Alibaba Cloud Elasticsearch, see Overview of plug-ins.
For information about how to call an API operation to install a built-in plug-in, see InstallSystemPlugin.
For information about how to call an API operation to update the dictionary file of the analysis-aliws plug-in, see UpdateAliwsDict.
For information about how to call an API operation to obtain the plug-ins that are installed on an Elasticsearch cluster, see ListPlugins.