The AliNLP tokenizer, analysis-aliws, is a built-in plug-in shipped with Alibaba Cloud Elasticsearch. With the analysis-aliws plug-in, Elasticsearch integrates the corresponding analyzer and tokenizer, which you can use to analyze and search documents.

Install the AliNLP tokenizer

Note: Before you install the AliNLP tokenizer, make sure that your instance has a memory size of 8 GB or higher. If it does not, first upgrade the instance to 8 GB or higher. For more information, see Upgrade the configuration of a cluster.
Log on to the Alibaba Cloud Elasticsearch console, click the instance ID, and choose Plug-in Configuration > Built-in Plug-in List. In the built-in plug-in list, install the analysis-aliws plug-in. For more information, see the installation procedure for the analysis-aliws plug-in.
Note: The analysis-aliws plug-in is not installed by default.

Use the AliNLP tokenizer

After the AliNLP tokenizer is installed, Alibaba Cloud Elasticsearch integrates the following analyzer and tokenizer by default:
  • Analyzer: aliws (does not return function words, function phrases, or symbols).
  • Tokenizer: aliws_tokenizer
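If the filtering that the aliws analyzer applies does not fit your needs, combining aliws_tokenizer with token filters of your own in a custom analyzer is a standard Elasticsearch pattern. The following is only a sketch: the index name my_index, the analyzer name my_aliws, and the lowercase filter are illustrative choices, not part of the plug-in.

    PUT /my_index
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_aliws": {
                        "type": "custom",
                        "tokenizer": "aliws_tokenizer",
                        "filter": ["lowercase"]
                    }
                }
            }
        }
    }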

You can use this analyzer and tokenizer to search documents. The procedure is as follows.

  1. Create an index.
    PUT /index
    {
        "mappings": {
            "fulltext": {
                "properties": {
                    "content": {
                        "type": "text",
                        "analyzer": "aliws"
                    }
                }
            }
        }
    }

    The preceding code creates an index named index with the type fulltext. The type contains a content property of type text, with the aliws analyzer applied.

    If the request succeeds, the following result is returned:
    {
      "acknowledged": true,
      "shards_acknowledged": true,
      "index": "index"
    }
  2. Add a document.
    POST /index/fulltext/1
    {
      "content": "I like go to school."
    }

    The preceding code creates a document with the ID 1 and sets its content field to I like go to school.

    If the request succeeds, the following result is returned:
    {
      "_index": "index",
      "_type": "fulltext",
      "_id": "1",
      "_version": 1,
      "result": "created",
      "_shards": {
        "total": 2,
        "successful": 2,
        "failed": 0
      },
      "_seq_no": 0,
      "_primary_term": 1
    }
  3. Search.
    GET /index/fulltext/_search
    {
      "query": {
        "match": {
          "content": "school"
        }
      }
    }

    The preceding code uses the aliws analyzer to search all documents of the fulltext type for a content field that contains school.

    If the request succeeds, the following result is returned:
    {
      "took": 5,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 0.2876821,
        "hits": [
          {
            "_index": "index",
            "_type": "fulltext",
            "_id": "1",
            "_score": 0.2876821,
            "_source": {
              "content": "I like go to school."
            }
          }
        ]
      }
    }
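A client typically consumes such a search response by iterating over hits.hits. The following is a minimal sketch in Python that parses an abridged copy of the response shown above; no live cluster is involved, and only the fields read below are kept.

```python
import json

# Abridged copy of the search response above; only the fields used here are kept.
raw = """
{
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "index",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {"content": "I like go to school."}
      }
    ]
  }
}
"""

response = json.loads(raw)
for hit in response["hits"]["hits"]:
    # Each hit carries the document ID, its relevance score, and the stored source.
    print(hit["_id"], hit["_score"], hit["_source"]["content"])
```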
Note: If the results you obtain when you use the analysis-aliws plug-in do not meet your expectations, you can troubleshoot with the analyzer test and tokenizer test below.

Analyzer test

GET _analyze
{
  "text": "I like go to school.",
  "analyzer": "aliws"
}
The following result is returned:
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "like",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "go",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "school",
      "start_offset" : 13,
      "end_offset" : 19,
      "type" : "word",
      "position" : 8
    }
  ]
}

Tokenizer test

GET _analyze
{
  "text": "I like go to school.",
  "tokenizer": "aliws_tokenizer"
}
The following result is returned:
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : " ",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "like",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : " ",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "go",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : " ",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "to",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : " ",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "school",
      "start_offset" : 13,
      "end_offset" : 19,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : ".",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "word",
      "position" : 9
    }
  ]
}
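Comparing the two results above: aliws_tokenizer keeps every token, including whitespace, punctuation, and the function word to, whereas the aliws analyzer drops those and lowercases what remains. The relationship can be checked in pure Python; the token lists below are copied from the two test responses above, and the function-word set is only an assumption for this example (the plug-in's actual stopword list is internal and not documented here).

```python
# Token streams copied from the two _analyze test responses above.
tokenizer_tokens = ["I", " ", "like", " ", "go", " ", "to", " ", "school", "."]
analyzer_tokens = ["i", "like", "go", "school"]

# Illustrative approximation of what the aliws analyzer filters out:
# whitespace, punctuation, and function words, followed by lowercasing.
FUNCTION_WORDS = {"to"}  # assumption for this example only
PUNCTUATION = {".", ","}

approximated = [
    token.lower()
    for token in tokenizer_tokens
    if token.strip() and token not in PUNCTUATION and token.lower() not in FUNCTION_WORDS
]

print(approximated)  # ['i', 'like', 'go', 'school']
```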