distinct clause

Add a distinct clause to a query to control result diversity by limiting how many documents from the same group appear in the ranked results. This prevents a single user, brand, or company from dominating an entire results page.

Example use cases:

  • Skew correction: Your results for a query contain too many documents from the same company. Set dist_key to company_id and dist_count to 2 so that at most two documents per company appear in each extraction round.

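  • Deduplication: You want only one representative document per category. Set dist_key to the category field, dist_count to 1, and dist_times to 1. To discard the remaining documents instead of ranking them lower, also set reserved to false.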

Syntax

"distinct": {
  "default": {
    "dist_key": "field",
    "dist_count": number,
    "dist_times": number,
    "dist_filter": "filter_expression",
    "reserved": boolean,
    "max_item_count": number,
    "grade": []
  },
  "rank": {
    "dist_key": "field",
    "dist_count": number,
    "dist_times": number,
    "dist_filter": "filter_expression",
    "reserved": boolean,
    "max_item_count": number,
    "grade": []
  },
  "rerank": {
    "dist_key": "field",
    "dist_count": number,
    "dist_times": number,
    "dist_filter": "filter_expression",
    "reserved": boolean,
    "max_item_count": number,
    "grade": []
  }
}

OpenSearch Retrieval Engine Edition applies dispersing in two phases: the rough sort phase and the fine sort phase. Use the default, rank, and rerank rule keys to control which rule applies to which phase.

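| Rules specified         | Rough sort phase | Fine sort phase |
|-------------------------|------------------|-----------------|
| default only            | default          | default         |
| rank only               | rank             | -               |
| rerank only             | -                | rerank          |
| default + rank          | rank             | default         |
| default + rerank        | default          | rerank          |
| rank + rerank           | rank             | rerank          |
| default + rank + rerank | rank             | rerank          |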

At least one of default, rank, or rerank must be specified.
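
The table above reads as a simple precedence rule: the rough sort phase prefers rank and falls back to default, and the fine sort phase prefers rerank and falls back to default. The following Python sketch restates that rule on the client side. The function name resolve_phase_rules is hypothetical and is not part of any OpenSearch SDK. It only mirrors the table and does not call the engine.

```python
from typing import Optional, Tuple


def resolve_phase_rules(distinct: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (rough_sort_rule, fine_sort_rule) for a distinct clause.

    Hypothetical helper that mirrors the rule table in this topic.
    None means that no rule applies to that phase.
    """
    keys = set(distinct) & {"default", "rank", "rerank"}
    if not keys:
        raise ValueError("specify at least one of default, rank, or rerank")

    # Rough sort phase: rank takes precedence over default.
    rough = "rank" if "rank" in keys else ("default" if "default" in keys else None)
    # Fine sort phase: rerank takes precedence over default.
    fine = "rerank" if "rerank" in keys else ("default" if "default" in keys else None)
    return rough, fine


print(resolve_phase_rules({"default": {}, "rank": {}}))  # ('rank', 'default')
```
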

Parameters

| Parameter      | Required | Default                   | Description |
|----------------|----------|---------------------------|-------------|
| dist_key       | Yes      | -                         | The attribute field to group documents by for dispersing. |
| dist_count     | No       | 1                         | Number of documents to extract per group in each round. |
| dist_times     | No       | 1                         | Number of extraction rounds to perform. |
| dist_filter    | No       | All documents participate | A filter expression. Documents matching the filter are excluded from dispersing. In the fine sort phase, filtered documents are sorted together with the extracted documents. |
| reserved       | No       | true                      | Specifies whether to retain documents not extracted by the distinct clause. Set to false to discard them. When set to false, the total and viewtotal values in the response may be inaccurate. |
| max_item_count | No       | -                         | Maximum number of documents retained in the DISTINCT calculation, computed as max(max_item_count, hit). For example, if 10 results appear per page and up to 100 pages are returned, set this to 1000. |
| grade          | No       | One grade                 | Threshold values, separated by vertical bar characters, that classify documents into relevance grades based on rough sort scores. Documents within each grade are sorted in the same order as the rough sort phase. |

grade examples

  • grade:3.0 defines two grades: score < 3.0 (grade 1), score >= 3.0 (grade 2)

  • grade:3.0|5.0 defines three grades: score < 3.0 (grade 1), 3.0 <= score < 5.0 (grade 2), score >= 5.0 (grade 3)
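
The grade boundaries above can be checked with a short Python sketch. It assumes that thresholds are given as a string separated by vertical bars, as in the examples, and that an empty string means a single grade. The helper names parse_grade and grade_of are hypothetical and exist only for illustration. The engine performs this classification internally.

```python
from bisect import bisect_right
from typing import List


def parse_grade(spec: str) -> List[float]:
    """Parse a threshold string such as "3.0|5.0" into a sorted list."""
    return sorted(float(part) for part in spec.split("|") if part.strip())


def grade_of(score: float, thresholds: List[float]) -> int:
    """Return the 1-based grade of a rough sort score.

    A score equal to a threshold belongs to the higher grade,
    which matches "score >= 3.0 (grade 2)" in the examples.
    """
    return bisect_right(thresholds, score) + 1


thresholds = parse_grade("3.0|5.0")
print([grade_of(s, thresholds) for s in (2.9, 3.0, 4.9, 5.0)])  # [1, 2, 2, 3]
```
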

Example

The following example performs 10 rounds of extraction based on company_id, extracting up to 2 documents per company in each round. Because reserved defaults to true, documents that are not extracted are retained and assigned lower ranks.

"distinct": {
  "default": {
    "dist_key": "company_id",
    "dist_count": 2,
    "dist_times": 10
  }
}

How dist_count and dist_times interact

dist_count controls how many documents are extracted per group per round; dist_times controls how many rounds of extraction run. Together they determine the final document order.

Consider six documents where name is the distinct key:

doc1: id:1, name:a
doc2: id:2, name:a
doc3: id:3, name:a
doc4: id:4, name:b
doc5: id:5, name:c
doc6: id:6, name:c

Case 1: dist_count:2, dist_times:1

Extract 2 documents per group, 1 round:

"distinct": {
  "default": {
    "dist_key": "name",
    "dist_count": 2,
    "dist_times": 1
  }
}

Result order: doc1, doc2, doc4, doc5, doc6

Round 1 extracts up to 2 documents from each group: 2 from group a (doc1, doc2), 1 from group b (doc4), 2 from group c (doc5, doc6). doc3 is not extracted.

Case 2: dist_count:1, dist_times:2

Extract 1 document per group, 2 rounds:

"distinct": {
  "default": {
    "dist_key": "name",
    "dist_count": 1,
    "dist_times": 2
  }
}

Result order: doc1, doc4, doc5, doc2, doc6

Round 1 extracts 1 from each group: doc1 (a), doc4 (b), doc5 (c). Round 2 extracts the next 1 from each group: doc2 (a), doc6 (c). doc3 is not extracted.

Case 3: dist_count:1, dist_times:1

Extract 1 document per group, 1 round:

"distinct": {
  "default": {
    "dist_key": "name",
    "dist_count": 1,
    "dist_times": 1
  }
}

Result order: doc1, doc4, doc5

Only 1 round runs. One document is extracted per group: doc1 (a), doc4 (b), doc5 (c). The remaining documents are not extracted.
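
The three cases can be reproduced with a small Python simulation. This is a minimal sketch, not the engine's implementation. It assumes that the input list is already in ranked order and that, within a round, groups are visited in the order in which they first appear. The function name disperse is hypothetical.

```python
from typing import Dict, List, Tuple


def disperse(docs: List[dict], key: str, dist_count: int = 1,
             dist_times: int = 1) -> Tuple[List[dict], List[dict]]:
    """Simulate round-based extraction; return (extracted, not_extracted)."""
    # Group documents by the distinct key, keeping ranked order inside each group.
    groups: Dict[object, List[dict]] = {}
    for doc in docs:
        groups.setdefault(doc[key], []).append(doc)

    taken = {k: 0 for k in groups}  # documents already extracted per group
    extracted: List[dict] = []
    for _ in range(dist_times):
        for k, members in groups.items():
            # Take up to dist_count not-yet-extracted documents from this group.
            batch = members[taken[k]:taken[k] + dist_count]
            extracted.extend(batch)
            taken[k] += len(batch)

    picked = {id(d) for d in extracted}
    not_extracted = [d for d in docs if id(d) not in picked]
    return extracted, not_extracted


docs = [
    {"id": 1, "name": "a"}, {"id": 2, "name": "a"}, {"id": 3, "name": "a"},
    {"id": 4, "name": "b"}, {"id": 5, "name": "c"}, {"id": 6, "name": "c"},
]

for count, times in ((2, 1), (1, 2), (1, 1)):
    extracted, rest = disperse(docs, "name", dist_count=count, dist_times=times)
    print(count, times, [d["id"] for d in extracted], [d["id"] for d in rest])
# 2 1 [1, 2, 4, 5, 6] [3]
# 1 2 [1, 4, 5, 2, 6] [3]
# 1 1 [1, 4, 5] [2, 3, 6]
```

In this sketch, a group never contributes more than dist_count × dist_times documents. The second list holds the documents that the distinct clause did not extract, which are the ones affected by the reserved parameter.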

Fix inaccurate total counts with the distinct uniq plug-in

When reserved is set to false, the total and viewtotal values in the response may be inaccurate, which can cause errors in pagination or any logic that depends on those values.

The distinct uniq plug-in corrects this by computing accurate counts. To activate it, add duniqfield:<field> to a kvpairs clause.

The plug-in only works when dist_times is 1, dist_count is 1, and reserved is false. The duniqfield value must match the dist_key value. For performance reasons, the plug-in returns at most 5,000 results per query.

{
  "distinct": {
    "default": {
      "dist_key": "company_id",
      "dist_count": 1,
      "dist_times": 1,
      "reserved": false
    }
  },
  "kvpairs": {
    "duniqfield": "company_id"
  }
}
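
Before sending a query, the plug-in conditions listed above can be checked on the client side. The following Python sketch is one way to do that. The function name duniq_problems is hypothetical, and the choice to check every rule under distinct (default, rank, and rerank) is an assumption, because this topic only shows the plug-in with a default rule. Missing parameters fall back to the defaults from the Parameters table.

```python
from typing import List


def duniq_problems(query: dict) -> List[str]:
    """Return the reasons why the distinct uniq plug-in would not apply."""
    field = query.get("kvpairs", {}).get("duniqfield")
    if field is None:
        return ["kvpairs.duniqfield is not set"]

    problems = []
    # Assumption: every rule in the distinct clause must satisfy the conditions.
    for name, rule in query.get("distinct", {}).items():
        if rule.get("dist_count", 1) != 1:
            problems.append(f"{name}: dist_count must be 1")
        if rule.get("dist_times", 1) != 1:
            problems.append(f"{name}: dist_times must be 1")
        if rule.get("reserved", True) is not False:
            problems.append(f"{name}: reserved must be false")
        if rule.get("dist_key") != field:
            problems.append(f"{name}: dist_key must match duniqfield")
    return problems


query = {
    "distinct": {
        "default": {"dist_key": "company_id", "dist_count": 1,
                    "dist_times": 1, "reserved": False}
    },
    "kvpairs": {"duniqfield": "company_id"},
}
print(duniq_problems(query))  # []
```
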

Usage notes

  • Fields specified in a distinct clause must be attribute fields defined in schema.json.

  • Only INT and LITERAL field types are supported. ARRAY is not supported.