distinct clause - OpenSearch - Alibaba Cloud Help Center

Add a distinct clause to a query to control result diversity — limiting how many documents from the same group appear in ranked results. This prevents a single user, brand, or company from dominating an entire results page.

Example use cases:

Skew correction: Your results for a query contain too many documents from the same company. Set dist_key to company_id and dist_count to 2 so that at most two documents per company appear in each extraction round.
Deduplication: You want only one representative document per category. Set dist_count to 1 and dist_times to 1.

Syntax

"distinct": {
  "default": {
    "dist_key": "field",
    "dist_count": number,
    "dist_times": number,
    "dist_filter": "filter_expression",
    "reserved": boolean,
    "max_item_count": number,
    "grade": []
  },
  "rank": {
    "dist_key": "field",
    "dist_count": number,
    "dist_times": number,
    "dist_filter": "filter_expression",
    "reserved": boolean,
    "max_item_count": number,
    "grade": []
  },
  "rerank": {
    "dist_key": "field",
    "dist_count": number,
    "dist_times": number,
    "dist_filter": "filter_expression",
    "reserved": boolean,
    "max_item_count": number,
    "grade": []
  }
}

OpenSearch Retrieval Engine Edition applies dispersing in two phases: the rough sort phase and the fine sort phase. Use the default, rank, and rerank rule keys to control which rule applies to which phase.

Rules specified	Rough sort phase	Fine sort phase
`default` only	`default`	`default`
`rank` only	`rank`	—
`rerank` only	—	`rerank`
`default` + `rank`	`rank`	`default`
`default` + `rerank`	`default`	`rerank`
`rank` + `rerank`	`rank`	`rerank`
`default` + `rank` + `rerank`	`rank`	`rerank`

At least one of default, rank, or rerank must be specified.
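One way to read the table above is as a fallback rule: each phase prefers its own key (rank for the rough sort phase, rerank for the fine sort phase) and otherwise falls back to default. The Python sketch below encodes that reading. The function name resolve_phase_rules is hypothetical and not part of OpenSearch; it only reproduces the table.

```python
ALLOWED_RULE_KEYS = {"default", "rank", "rerank"}


def resolve_phase_rules(specified):
    """Return (rough_sort_rule, fine_sort_rule) for the given rule keys.

    `specified` is any iterable of rule keys present in the distinct clause.
    None corresponds to an empty cell in the table (no rule for that phase).
    """
    keys = set(specified)
    unknown = keys - ALLOWED_RULE_KEYS
    if unknown:
        raise ValueError(f"unknown rule keys: {sorted(unknown)}")
    if not keys:
        raise ValueError("at least one of default, rank, or rerank must be specified")

    # Each phase uses its own key when present, otherwise `default` if given.
    fallback = "default" if "default" in keys else None
    rough = "rank" if "rank" in keys else fallback
    fine = "rerank" if "rerank" in keys else fallback
    return rough, fine


if __name__ == "__main__":
    for combo in (["default"], ["rank"], ["default", "rank"], ["default", "rank", "rerank"]):
        print(combo, "->", resolve_phase_rules(combo))
```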

Parameters

Parameter	Required	Default	Description
`dist_key`	Yes	—	The attribute field to group documents by for dispersing.
`dist_count`	No	`1`	Number of documents to extract per group in each round.
`dist_times`	No	`1`	Number of extraction rounds to perform.
`dist_filter`	No	None (all documents take part in dispersing)	A filter expression. Documents matching the filter are excluded from dispersing. In the fine sort phase, filtered documents are sorted together with the extracted documents.
`reserved`	No	`true`	Specifies whether to retain documents not extracted by the distinct clause. Set to `false` to discard them. When set to `false`, the `total` and `viewtotal` values in the response may be inaccurate.
`max_item_count`	No	—	Maximum number of documents retained in the DISTINCT calculation, computed as `max(max_item_count, hit)`. For example, if 10 results appear per page and up to 100 pages are returned, set this to `1000`.
`grade`	No	One grade	Threshold values (separated by `|`) that classify documents into relevance grades based on rough sort scores. Documents within each grade are sorted in the same order as the rough sort phase.

grade examples

grade:3.0 — two grades: score < 3.0 (grade 1), score >= 3.0 (grade 2)
grade:3.0|5.0 — three grades: score < 3.0 (grade 1), 3.0 <= score < 5.0 (grade 2), score >= 5.0 (grade 3)
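The threshold semantics in the two examples above can be checked with a short Python sketch. The helper names parse_grade and grade_of are hypothetical, and the sketch assumes each threshold is an inclusive lower bound for the next grade, as the examples state. It illustrates the classification only, not the engine's implementation.

```python
from bisect import bisect_right


def parse_grade(spec):
    """Parse a threshold string such as "3.0|5.0" into an ascending list of floats."""
    return sorted(float(part) for part in spec.split("|") if part.strip())


def grade_of(score, thresholds):
    """Return the 1-based grade of a rough sort score.

    A score equal to a threshold belongs to the higher grade
    (score >= 3.0 is grade 2 when the only threshold is 3.0).
    An empty threshold list puts every score in grade 1.
    """
    return bisect_right(thresholds, score) + 1


if __name__ == "__main__":
    thresholds = parse_grade("3.0|5.0")
    for score in (2.9, 3.0, 4.9, 5.0):
        print(score, "-> grade", grade_of(score, thresholds))
```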

Example

The following example performs 10 rounds of extraction grouped by company_id, extracting up to 2 documents per company in each round. Because reserved defaults to true, documents that are not extracted are retained and ranked after the extracted ones.

"distinct": {
  "default": {
    "dist_key": "company_id",
    "dist_count": 2,
    "dist_times": 10
  }
}
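If you assemble the clause in application code, a few of the rules from this page can be checked before the query is sent. The following Python sketch builds the same clause as the example above. The helper build_distinct_clause is hypothetical (it is not an OpenSearch SDK function), and it only checks the constraints stated on this page: at least one rule key, a required dist_key, and known parameter names.

```python
import json

RULE_KEYS = {"default", "rank", "rerank"}
PARAM_KEYS = {
    "dist_key", "dist_count", "dist_times", "dist_filter",
    "reserved", "max_item_count", "grade",
}


def build_distinct_clause(**rules):
    """Return {"distinct": {...}} after basic checks on each rule."""
    if not rules:
        raise ValueError("at least one of default, rank, or rerank must be specified")
    for name, rule in rules.items():
        if name not in RULE_KEYS:
            raise ValueError(f"unknown rule key: {name}")
        if not rule.get("dist_key"):
            raise ValueError(f"{name}: dist_key is required")
        extra = set(rule) - PARAM_KEYS
        if extra:
            raise ValueError(f"{name}: unknown parameters {sorted(extra)}")
    return {"distinct": dict(rules)}


if __name__ == "__main__":
    clause = build_distinct_clause(
        default={"dist_key": "company_id", "dist_count": 2, "dist_times": 10}
    )
    # Serialize for inclusion in a query body.
    print(json.dumps(clause, indent=2))
```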

How dist_count and dist_times interact

dist_count controls how many documents are extracted per group per round; dist_times controls how many rounds of extraction run. Together they determine the final document order.

Consider six documents where name is the distinct key:

doc1: id:1, name:a
doc2: id:2, name:a
doc3: id:3, name:a
doc4: id:4, name:b
doc5: id:5, name:c
doc6: id:6, name:c

Case 1 — dist_count:2, dist_times:1

Extract 2 documents per group, 1 round:

"distinct": {
  "default": {
    "dist_key": "name",
    "dist_count": 2,
    "dist_times": 1
  }
}

Result order: doc1, doc2, doc4, doc5, doc6

Round 1 extracts up to 2 documents from each group: 2 from group a (doc1, doc2), 1 from group b (doc4), 2 from group c (doc5, doc6). doc3 is not extracted.

Case 2 — dist_count:1, dist_times:2

Extract 1 document per group, 2 rounds:

"distinct": {
  "default": {
    "dist_key": "name",
    "dist_count": 1,
    "dist_times": 2
  }
}

Result order: doc1, doc4, doc5, doc2, doc6

Round 1 extracts 1 from each group: doc1 (a), doc4 (b), doc5 (c). Round 2 extracts the next 1 from each group: doc2 (a), doc6 (c). doc3 is not extracted.

Case 3 — dist_count:1, dist_times:1

Extract 1 document per group, 1 round:

"distinct": {
  "default": {
    "dist_key": "name",
    "dist_count": 1,
    "dist_times": 1
  }
}

Result order: doc1, doc4, doc5

Only 1 round runs. One document is extracted per group: doc1 (a), doc4 (b), doc5 (c). The remaining documents are not extracted.
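The three cases above can be reproduced with a small Python simulation. This is a sketch under stated assumptions, not the engine's algorithm: it assumes the input list is already in ranked order, that each round walks the remaining documents in that order, and that a round takes at most dist_count documents per group. The function name simulate_distinct is hypothetical. It returns the extracted documents and the remaining ones separately, so it makes no claim about where non-extracted documents end up.

```python
from collections import Counter


def simulate_distinct(docs, dist_key, dist_count=1, dist_times=1):
    """Return (extracted, remaining) for `docs` given in ranked order."""
    remaining = list(docs)
    extracted = []
    for _ in range(dist_times):
        taken = Counter()  # documents taken per group in this round
        leftover = []
        for doc in remaining:
            group = doc[dist_key]
            if taken[group] < dist_count:
                taken[group] += 1
                extracted.append(doc)
            else:
                leftover.append(doc)
        remaining = leftover
    return extracted, remaining


DOCS = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "a"},
    {"id": 3, "name": "a"},
    {"id": 4, "name": "b"},
    {"id": 5, "name": "c"},
    {"id": 6, "name": "c"},
]


def ids(docs):
    """Return only the id values, for compact printing."""
    return [doc["id"] for doc in docs]


if __name__ == "__main__":
    for count, times in ((2, 1), (1, 2), (1, 1)):
        extracted, remaining = simulate_distinct(DOCS, "name", count, times)
        print(f"dist_count={count}, dist_times={times}:",
              "extracted", ids(extracted), "remaining", ids(remaining))
```

Comparing Case 1 and Case 2 in the output shows that both settings extract the same five documents, but in a different order: more rounds with a smaller dist_count interleave the groups more evenly.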

Fix inaccurate total counts with the distinct uniq plug-in

When reserved is set to false, the total and viewtotal values in the response may be inaccurate, which can cause errors in pagination or any logic that depends on those values.

The distinct uniq plug-in corrects this by computing accurate counts. To activate it, set duniqfield to the distinct field in the kvpairs clause of the query.

The plug-in only works when dist_times is 1, dist_count is 1, and reserved is false. The duniqfield value must match the dist_key value. For performance reasons, the plug-in returns at most 5,000 results per query.

{
  "distinct": {
    "default": {
      "dist_key": "company_id",
      "dist_count": 1,
      "dist_times": 1,
      "reserved": false
    }
  },
  "kvpairs": {
    "duniqfield": "company_id"
  }
}
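The plug-in's preconditions are easy to miss, so a client-side check can help. The Python sketch below is a hypothetical helper (duniq_problems is not an OpenSearch API) that inspects a query shaped like the example above and lists any precondition from this section that is not met. It assumes the conditions apply to every rule present in the distinct clause, and it uses the defaults from the parameter table when a value is omitted.

```python
def duniq_problems(query):
    """Return a list of unmet distinct uniq plug-in preconditions (empty if none)."""
    problems = []
    field = query.get("kvpairs", {}).get("duniqfield")
    if field is None:
        problems.append("kvpairs.duniqfield is not set")

    rules = query.get("distinct", {})
    if not rules:
        problems.append("no distinct rule is specified")

    for name, rule in rules.items():
        # Defaults follow the parameter table: dist_count=1, dist_times=1, reserved=true.
        if rule.get("dist_times", 1) != 1:
            problems.append(f"{name}: dist_times must be 1")
        if rule.get("dist_count", 1) != 1:
            problems.append(f"{name}: dist_count must be 1")
        if rule.get("reserved", True) is not False:
            problems.append(f"{name}: reserved must be false")
        if field is not None and rule.get("dist_key") != field:
            problems.append(f"{name}: dist_key must match duniqfield")
    return problems


if __name__ == "__main__":
    query = {
        "distinct": {
            "default": {
                "dist_key": "company_id",
                "dist_count": 1,
                "dist_times": 1,
                "reserved": False,
            }
        },
        "kvpairs": {"duniqfield": "company_id"},
    }
    print(duniq_problems(query))
```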

Usage notes

Fields specified in a distinct clause must be attribute fields defined in schema.json.
Only INT and LITERAL field types are supported. ARRAY is not supported.