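The tokenization configuration module of Alibaba Cloud Elasticsearch (ES for short) provides a synonym configuration feature. With it, you can upload a custom synonym dictionary file that is applied to the ES synonym library. New indexes use the updated dictionary for searches.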

Precautions

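  • Uploading a synonym dictionary to Alibaba Cloud Elasticsearch does not restart any nodes. The dictionary is distributed to the nodes in the background, and the time it takes to take effect depends on the number of nodes.
  • Suppose an existing index, index-aliyun, uses the synonym file aliyun.txt. If you change the content of aliyun.txt and upload it again, the existing index does not dynamically load the updated synonym dictionary. We recommend that you rebuild the index after the dictionary file changes. Otherwise, only newly added data may use the new dictionary.
  • A synonym dictionary file must contain one synonym expression per line and be saved as a UTF-8 encoded .txt file. For example: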
    西红柿,番茄 =>西红柿,番茄
    社保,公积金 =>社保,公积金
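
The file requirements above (UTF-8 encoding, one synonym expression per line) can be checked locally before uploading. The following is a minimal sketch, not part of the Alibaba Cloud console or of Elasticsearch; the function name check_synonym_file and its rules (at most one "=>" per line, no empty terms) are assumptions made for illustration.

```python
def check_synonym_file(data: bytes):
    """Return a list of (line_number, message) for lines that look malformed.

    Hypothetical helper: it only checks the rules stated in this document,
    it does not reproduce the parser that Elasticsearch uses.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Line number 0 marks a problem with the file as a whole.
        return [(0, "not valid UTF-8: " + exc.reason)]

    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are skipped
        if line.count("=>") > 1:
            problems.append((number, "more than one '=>'"))
            continue
        for side in line.split("=>"):
            terms = [term.strip() for term in side.split(",")]
            if any(term == "" for term in terms):
                problems.append((number, "empty term"))
                break
    return problems


sample = "西红柿,番茄 => 西红柿,番茄\n社保,公积金 => 社保,公积金\n"
print(check_synonym_file(sample.encode("utf-8")))
```

An empty list means that no problems were found by these simple checks. A trailing comma or a missing right-hand side is reported as an empty term.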

Synonym configuration description

You can configure synonyms with a filter, as shown in the following sample code.
PUT /test_index
{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "synonym": {
                        "tokenizer": "whitespace",
                        "filter": ["synonym"]
                    }
                },
                "filter": {
                    "synonym": {
                        "type": "synonym",
                        "synonyms_path": "analysis/synonym.txt",
                        "tokenizer": "whitespace"
                    }
                }
            }
        }
    }
}
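  • filter: configures a synonym filter that references the path analysis/synonym.txt (the path is relative to the config directory).
  • tokenizer: the tokenizer used to tokenize the synonyms. The default is the whitespace tokenizer. Other settings are:
    • ignore_case: the default value is false.
    • expand: the default value is true.
The synonym filter currently supports two synonym formats: Solr and WordNet.
  • Solr synonyms
    A sample file in this format looks as follows.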
    # Blank lines and lines starting with pound are comments.
    # Explicit mappings match any token sequence on the LHS of "=>"
    # and replace with all alternatives on the RHS.  These types of mappings
    # ignore the expand parameter in the schema.
    # Examples:
    i-pod, i pod => ipod
    sea biscuit, sea biscit => seabiscuit
    # Equivalent synonyms may be separated with commas and give
    # no explicit mapping.  In this case the mapping behavior will
    # be taken from the expand parameter in the schema.  This allows
    # the same synonym file to be used in different synonym handling strategies.
    # Examples:
    ipod, i-pod, i pod
    foozball , foosball
    universe , cosmos
    lol, laughing out loud
    # If expand==true, "ipod, i-pod, i pod" is equivalent
    # to the explicit mapping:
    ipod, i-pod, i pod => ipod, i-pod, i pod
    # If expand==false, "ipod, i-pod, i pod" is equivalent
    # to the explicit mapping:
    ipod, i-pod, i pod => ipod
    # Multiple synonym mapping entries are merged.
    foo => foo bar
    foo => baz
    # is equivalent to
    foo => foo bar, baz
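    You can also define synonyms for the filter directly in the configuration (note that synonyms is used instead of synonyms_path), as shown in the following example.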
    PUT /test_index
    {
        "settings": {
            "index" : {
                "analysis" : {
                    "filter" : {
                        "synonym" : {
                            "type" : "synonym",
                            "synonyms" : [
                                "i-pod, i pod => ipod",
                                "begin, start"
                            ]
                        }
                    }
                }
            }
        }
    }
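    Note: We recommend that you use synonyms_path to define large synonym sets in a file, because defining them inline with synonyms increases the cluster size.
  • WordNet synonyms
    A sample declaration in this format looks as follows.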
    PUT /test_index
    {
        "settings": {
            "index" : {
                "analysis" : {
                    "filter" : {
                        "synonym" : {
                            "type" : "synonym",
                            "format" : "wordnet",
                            "synonyms" : [
                                "s(100000001,1,'abstain',v,1,0).",
                                "s(100000001,2,'refrain',v,1,0).",
                                "s(100000001,3,'desist',v,1,0)."
                            ]
                        }
                    }
                }
            }
        }
    }

    The preceding example uses synonyms to define WordNet synonyms. You can also use synonyms_path to define WordNet synonyms in a text file.
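
The comments in the Solr sample above describe how the expand parameter changes the meaning of a comma-separated line and how repeated mappings are merged. The following is a minimal sketch of those two rules only. The function names expand_rule and merge_rules are hypothetical, and the code is an illustration of the documented behavior, not the parser that Elasticsearch uses.

```python
def expand_rule(rule, expand=True):
    """Turn one Solr-format line into {input phrase: [output phrases]}."""
    if "=>" in rule:
        # Explicit mapping: the expand parameter is ignored.
        left, right = rule.split("=>", 1)
        inputs = [t.strip() for t in left.split(",") if t.strip()]
        outputs = [t.strip() for t in right.split(",") if t.strip()]
    else:
        # Equivalent synonyms: behavior depends on expand.
        inputs = [t.strip() for t in rule.split(",") if t.strip()]
        outputs = inputs if expand else inputs[:1]
    return {term: list(outputs) for term in inputs}


def merge_rules(rules, expand=True):
    """Merge several lines; outputs for the same input are accumulated."""
    merged = {}
    for rule in rules:
        for term, outputs in expand_rule(rule, expand).items():
            bucket = merged.setdefault(term, [])
            for out in outputs:
                if out not in bucket:
                    bucket.append(out)
    return merged


print(expand_rule("ipod, i-pod, i pod", expand=True))
print(expand_rule("ipod, i-pod, i pod", expand=False))
print(merge_rules(["foo => foo bar", "foo => baz"]))
```

With expand=False every input maps to the first term only, which corresponds to the explicit mapping ipod, i-pod, i pod => ipod in the sample file.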

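Synonym configuration steps

  1. Upload the synonym dictionary file in the Alibaba Cloud ES console, save it, and wait until it takes effect.
  2. When you create the index and configure its settings, set "synonyms_path": "analysis/your_dict_name.txt". Then configure the mapping for the index and assign the synonym analyzer to the target fields.
  3. Verify the synonyms, then upload test data and run a test search.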
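
Step 2 can be scripted when you manage several dictionaries. The following is a minimal sketch that only builds the request body for the create-index call. The function name synonym_index_settings is hypothetical, and the analyzer and filter names mirror the earlier test_index sample. Sending the body to the cluster is left out on purpose.

```python
import json


def synonym_index_settings(dict_name, tokenizer="whitespace"):
    """Build a create-index body that points a synonym filter at a dictionary file.

    dict_name is the uploaded file name, for example "your_dict_name.txt".
    The "analysis/" prefix follows the path convention used in this document.
    """
    return {
        "settings": {
            "index": {
                "analysis": {
                    "analyzer": {
                        "synonym": {
                            "tokenizer": tokenizer,
                            "filter": ["synonym"],
                        }
                    },
                    "filter": {
                        "synonym": {
                            "type": "synonym",
                            "synonyms_path": "analysis/" + dict_name,
                        }
                    },
                }
            }
        }
    }


body = synonym_index_settings("your_dict_name.txt")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after a PUT /<index name> line in the Kibana Console.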

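Example 1

The following example configures synonyms with a filter. Perform these steps:

  1. On the ES cluster configuration page, click Synonym Configuration to the right of the tokenization configuration.
  2. On the Synonym Configuration page, click Upload File, select the synonym dictionary that you want to upload (a .txt file created according to the rules in the synonym configuration description), and then click Save.
  3. Wait until the Alibaba Cloud Elasticsearch instance applies the change and its status is shown as normal. You can then use the dictionary.
    This example uses aliyun_synonyms.txt as the test file. The file contains the synonym rule begin, start.
  4. Configure and test the synonyms.
    1. Log on to the Kibana console.
    2. Run the following command in the Console to create an index.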
      PUT /aliyun-index-test
      {
        "settings": {
          "index": {
            "analysis": {
              "analyzer": {
                "by_smart": {
                  "type": "custom",
                  "tokenizer": "ik_smart",
                  "filter": ["by_tfr", "by_sfr"],
                  "char_filter": ["by_cfr"]
                },
                "by_max_word": {
                  "type": "custom",
                  "tokenizer": "ik_max_word",
                  "filter": ["by_tfr", "by_sfr"],
                  "char_filter": ["by_cfr"]
                }
              },
              "filter": {
                "by_tfr": {
                  "type": "stop",
                  "stopwords": [" "]
                },
                "by_sfr": {
                  "type": "synonym",
                  "synonyms_path": "analysis/aliyun_synonyms.txt"
                }
              },
              "char_filter": {
                "by_cfr": {
                  "type": "mapping",
                  "mappings": ["| => |"]
                }
              }
            }
          }
        }
      }
    3. Run the following command to configure the synonym field title.
      PUT /aliyun-index-test/_mapping/doc
      {
      "properties": {
       "title": {
         "type": "text",
         "analyzer": "by_max_word",
         "search_analyzer": "by_smart"
       }
      }
      }
    4. Run the following command to verify the synonyms.
      GET /aliyun-index-test/_analyze
      {
      "analyzer": "by_smart",
      "text":"begin"
      }
      If the command succeeds, the following result is returned.
      {
      "tokens": [
       {
         "token": "begin",
         "start_offset": 0,
         "end_offset": 5,
         "type": "ENGLISH",
         "position": 0
       },
       {
         "token": "start",
         "start_offset": 0,
         "end_offset": 5,
         "type": "SYNONYM",
         "position": 0
       }
      ]
      }
    5. Run the following commands to add data for the next test.
      PUT /aliyun-index-test/doc/1
      {
      "title": "Shall I begin?"
      }
      PUT /aliyun-index-test/doc/2
      {
      "title": "I start work at nine."
      }
    6. Run the following command to test the query.
      GET /aliyun-index-test/_search
      {
       "query" : { "match" : { "title" : "begin" }},
       "highlight" : {
           "pre_tags" : ["<red>", "<blue>"],
           "post_tags" : ["</red>", "</blue>"],
           "fields" : {
               "title" : {}
           }
       }
      }
      If the command succeeds, the following result is returned.
      {
      "took": 11,
      "timed_out": false,
      "_shards": {
       "total": 5,
       "successful": 5,
       "failed": 0
      },
      "hits": {
       "total": 2,
       "max_score": 0.41048482,
       "hits": [
         {
           "_index": "aliyun-index-test",
           "_type": "doc",
           "_id": "2",
           "_score": 0.41048482,
           "_source": {
             "title": "I start work at nine."
           },
           "highlight": {
             "title": [
               "I <red>start</red> work at nine."
             ]
           }
         },
         {
           "_index": "aliyun-index-test",
           "_type": "doc",
           "_id": "1",
           "_score": 0.39556286,
           "_source": {
             "title": "Shall I begin?"
           },
           "highlight": {
             "title": [
               "Shall I <red>begin</red>?"
             ]
           }
         }
       ]
      }
      }
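
The _analyze response in step 4 shows the synonym at the same position as the original token, which is why the query for begin in step 6 also returns the document that contains start. The following is a minimal sketch for checking this in a saved response. The function name tokens_by_position is hypothetical, and the sample dictionary is copied from the response shown in step 4.

```python
from collections import defaultdict


def tokens_by_position(analyze_response):
    """Group (token, type) pairs of an _analyze response by token position."""
    grouped = defaultdict(list)
    for tok in analyze_response["tokens"]:
        grouped[tok["position"]].append((tok["token"], tok["type"]))
    return dict(grouped)


# Sample copied from the _analyze result in step 4.
response = {
    "tokens": [
        {"token": "begin", "start_offset": 0, "end_offset": 5,
         "type": "ENGLISH", "position": 0},
        {"token": "start", "start_offset": 0, "end_offset": 5,
         "type": "SYNONYM", "position": 0},
    ]
}

grouped = tokens_by_position(response)
print(grouped)
```

A position that holds more than one token, with at least one of type SYNONYM, indicates that the synonym rule was applied to that word.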

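Example 2

The following example defines the synonyms inline and uses the IK tokenizer. Perform these steps:
  1. Log on to the Kibana console and run the following command in the Console.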
    PUT /my_index
    {
     "settings": {
         "analysis": {
             "analyzer": {
                 "my_synonyms": {
                     "filter": [
                         "lowercase",
                         "my_synonym_filter"
                     ],
                     "tokenizer": "ik_smart"
                 }
             },
             "filter": {
                 "my_synonym_filter": {
                     "synonyms": [
                         "begin,start"
                     ],
                     "type": "synonym"
                 }
             }
         }
     }
    }
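    The preceding command works as follows:
    1. It defines a synonym filter named my_synonym_filter and configures the synonym list for it.
    2. It defines an analyzer named my_synonyms that uses the ik_smart tokenizer.
    3. After ik_smart tokenizes the text, all letters are converted to lowercase and the synonyms are looked up.
  2. Run the following command to configure the synonym field title.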
    PUT /my_index/_mapping/doc
    {
    "properties": {
     "title": {
       "type": "text",
       "analyzer": "my_synonyms"
     }
    }
    }
  3. Run the following command to verify the synonyms.
    GET /my_index/_analyze
    {
     "analyzer":"my_synonyms",
     "text":"Shall I begin?"
    }
    If the command succeeds, the following data is returned.
    {
    "tokens": [
     {
       "token": "shall",
       "start_offset": 0,
       "end_offset": 5,
       "type": "ENGLISH",
       "position": 0
     },
     {
       "token": "i",
       "start_offset": 6,
       "end_offset": 7,
       "type": "ENGLISH",
       "position": 1
     },
     {
       "token": "begin",
       "start_offset": 8,
       "end_offset": 13,
       "type": "ENGLISH",
       "position": 2
     },
     {
       "token": "start",
       "start_offset": 8,
       "end_offset": 13,
       "type": "SYNONYM",
       "position": 2
     }
    ]
    }
  4. Run the following commands to add data for the next test.
    PUT /my_index/doc/1
    {
    "title": "Shall I begin?"
    }
    PUT /my_index/doc/2
    {
    "title": "I start work at nine."
    }
  5. Run the following command to test the query.
    GET /my_index/_search
    {
    "query" : { "match" : { "title" : "begin" }},
    "highlight" : {
      "pre_tags" : ["<red>", "<blue>"],
      "post_tags" : ["</red>", "</blue>"],
      "fields" : {
          "title" : {}
      }
    }
    }
    If the command succeeds, the following data is returned.
    {
    "took": 11,
    "timed_out": false,
    "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
    },
    "hits": {
     "total": 2,
     "max_score": 0.41913947,
     "hits": [
       {
         "_index": "my_index",
         "_type": "doc",
         "_id": "2",
         "_score": 0.41913947,
         "_source": {
           "title": "I start work at nine."
         },
         "highlight": {
           "title": [
             "I <red>start</red> work at nine."
           ]
         }
       },
       {
         "_index": "my_index",
         "_type": "doc",
         "_id": "1",
         "_score": 0.39556286,
         "_source": {
           "title": "Shall I begin?"
         },
         "highlight": {
           "title": [
             "Shall I <red>begin</red>?"
           ]
         }
       }
     ]
    }
    }
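
The lowercase-then-synonym chain used by the my_synonyms analyzer can be imitated in a few lines to see why a search for begin matches both documents. The following is a toy sketch under stated assumptions: a regular expression stands in for the ik_smart tokenizer, the SYNONYMS table is the expanded form of the rule begin,start, and the names analyze and matches are hypothetical. It does not reproduce Elasticsearch scoring or highlighting.

```python
import re

# Expanded form of the rule "begin,start" (expand defaults to true).
SYNONYMS = {
    "begin": ["begin", "start"],
    "start": ["begin", "start"],
}


def analyze(text):
    """Toy analyzer: split on non-letters, lowercase, then expand synonyms.

    Returns (position, term) pairs; synonyms share the position of the
    original word, as in the _analyze output above.
    """
    terms = []
    for position, word in enumerate(re.findall(r"[A-Za-z]+", text)):
        word = word.lower()
        for term in SYNONYMS.get(word, [word]):
            terms.append((position, term))
    return terms


def matches(query, document):
    """A document matches if it shares at least one analyzed term with the query."""
    query_terms = {term for _, term in analyze(query)}
    doc_terms = {term for _, term in analyze(document)}
    return bool(query_terms & doc_terms)


for title in ["Shall I begin?", "I start work at nine."]:
    print(title, matches("begin", title))
```

Both titles match because the synonym filter adds start next to begin (and the reverse) when the text is analyzed.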

Parts of this document are based on the official Elasticsearch documentation. For details, see Synonym Token Filter and Using Synonyms.