Forward index compression

Forward index attributes can consume significant storage space, especially when documents share many duplicate values or use floating-point fields. OpenSearch Retrieval Engine Edition provides three compression techniques to reduce forward index storage size. Enable one or more based on your field types and data distribution.

Multi-value attribute deduplication

When documents share many identical attribute values, storing each duplicate separately wastes space. Multi-value attribute deduplication removes duplicate values from the index before storage, reducing the size of generated indexes.

Applies to: multi-value attributes and single-value fields with the STRING data type.

Trade-off: deduplication requires additional memory during build and merge operations due to the dictionaries used internally. Skip this option if your duplication rate is low.
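The idea and its memory trade-off can be illustrated with a minimal sketch. This is not the engine's implementation: the function name `build_deduplicated_store` and the in-memory layout are hypothetical and chosen only to show why a dictionary is needed while the index is built.

```python
from typing import Dict, List, Tuple


def build_deduplicated_store(
    docs: List[Tuple[int, ...]],
) -> Tuple[List[Tuple[int, ...]], List[int]]:
    """Store each distinct multi-value entry once; give every document a slot index."""
    # This dictionary is the extra memory cost paid during build and merge.
    slot_of: Dict[Tuple[int, ...], int] = {}
    unique_values: List[Tuple[int, ...]] = []
    doc_slots: List[int] = []
    for values in docs:
        slot = slot_of.get(values)
        if slot is None:
            # First time this value is seen: store it and remember its slot.
            slot = len(unique_values)
            slot_of[values] = slot
            unique_values.append(values)
        doc_slots.append(slot)
    return unique_values, doc_slots


docs = [(1, 2), (3,), (1, 2), (1, 2), (3,)]
unique_values, doc_slots = build_deduplicated_store(docs)
print(unique_values)  # [(1, 2), (3,)]
print(doc_slots)      # [0, 1, 0, 0, 1]
```

Five documents end up sharing two stored values. When most values are distinct, the dictionary grows almost as large as the data itself while saving little, which is why a low duplication rate makes this option unattractive.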

Equal-value compression

After an index is globally sorted by a field, the offset values of single-value and multi-value attributes often contain runs of consecutive duplicates. Equal-value compression stores these runs using fewer bits, reducing the size of the corresponding offset files.

Applies to: offset files of single-value attributes and multi-value attributes. For multi-value attributes and STRING-type single-value attributes, combine this with multi-value attribute deduplication for greater space savings.
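The source does not describe the exact encoding, so the sketch below uses plain run-length encoding as a stand-in to show why consecutive duplicates compress well. The function names `encode_runs` and `decode_runs` are hypothetical.

```python
from typing import List, Tuple


def encode_runs(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for value in values:
        if runs and runs[-1][0] == value:
            # Same value as the previous entry: extend the current run.
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


def decode_runs(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (value, run_length) pairs back into the original sequence."""
    out: List[int] = []
    for value, length in runs:
        out.extend([value] * length)
    return out


offsets = [0, 0, 0, 8, 8, 20, 20, 20, 20]
runs = encode_runs(offsets)
print(runs)  # [(0, 3), (8, 2), (20, 4)]
assert decode_runs(runs) == offsets
```

Nine entries shrink to three pairs here. The benefit depends on run length, which is why deduplicating values first tends to help: deduplicated documents point at the same offset, so equal offsets become more common.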

Self-adaptive storage for offset files

Each multi-value attribute has its own offset file. Storing every offset in 8 bytes is costly, and the full 8 bytes are only needed once the data that the offsets address exceeds 4 GB, the limit of a 4-byte offset. OpenSearch therefore uses 4 bytes per offset automatically while that size is under 4 GB. No configuration is required.
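A minimal sketch of width selection, assuming offsets are stored as fixed-width unsigned integers of 4 or 8 bytes and that the threshold is 2^32 bytes. The helper names `offset_width_bytes` and `pack_offsets` are hypothetical, and the on-disk layout is illustrative only.

```python
import struct
from typing import List

FOUR_GIB = 1 << 32  # offsets below this value fit in an unsigned 32-bit integer


def offset_width_bytes(data_size: int) -> int:
    """Return 4 when every offset into the data fits in 32 bits, else 8."""
    return 4 if data_size < FOUR_GIB else 8


def pack_offsets(offsets: List[int], data_size: int) -> bytes:
    """Serialize offsets little-endian, using the narrowest safe width."""
    fmt = "<I" if offset_width_bytes(data_size) == 4 else "<Q"
    return b"".join(struct.pack(fmt, off) for off in offsets)


small = pack_offsets([0, 16, 48], data_size=1024)
large = pack_offsets([0, 16, 48], data_size=5 * FOUR_GIB)
print(len(small), len(large))  # 12 24
```

The same three offsets take half the space when the narrow width is safe, and the choice needs no user input because it follows from the size alone.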

Configure compression

Set compress_type on each field in your schema configuration to enable compression. The default value is an empty string (no compression).

{
  "fields": [
    {
      "field_name": "category",
      "field_type": "INTEGER",
      "multi_value": true,
      "compress_type": "uniq|equal"
    },
    {
      "field_name": "price",
      "field_type": "INTEGER",
      "user_defined_param": {
        "key": "hello"
      }
    }
  ]
}

compress_type values

| Value | Effect | Applicable field types | Combinable with |
| --- | --- | --- | --- |
| uniq | Multi-value attribute deduplication | Multi-value attributes; single-value STRING fields | equal |
| equal | Equal-value compression | Single-value and multi-value attributes (offset files) | uniq |
| patch_compress | Patch file-based compression | | |
| block_fp | Floating-point block compression | Multi-value FLOAT attributes | |
| fp16 | Half-precision floating-point compression | Single-value and multi-value FLOAT attributes | |
| int8#N | INT8 quantization; N defines the value range (−N to +N) | Single-value and multi-value FLOAT attributes | |
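To give a feel for what fp16 and int8#N trade away, here is a rough sketch. The fp16 part uses the standard IEEE 754 half-precision encoding from Python's `struct` module. The int8 part assumes a simple linear mapping of the range −N to +N onto the integers −127 to 127 with clamping; the source does not specify the engine's actual mapping, so treat `quantize_int8` and `dequantize_int8` as hypothetical.

```python
import struct


def to_fp16_bytes(value: float) -> bytes:
    """Encode a float as 2 bytes in IEEE 754 half precision."""
    return struct.pack("<e", value)


def from_fp16_bytes(raw: bytes) -> float:
    """Decode 2 bytes of IEEE 754 half precision back to a float."""
    return struct.unpack("<e", raw)[0]


def quantize_int8(value: float, n: float) -> int:
    """Assumed mapping: clamp to [-n, n], then scale linearly to [-127, 127]."""
    clamped = max(-n, min(n, value))
    return round(clamped / n * 127)


def dequantize_int8(code: int, n: float) -> float:
    """Inverse of the assumed linear mapping."""
    return code / 127 * n


print(len(to_fp16_bytes(0.5)), from_fp16_bytes(to_fp16_bytes(0.5)))  # 2 0.5
print(quantize_int8(2.0, 2.0), quantize_int8(5.0, 2.0))              # 127 127
print(quantize_int8(-2.0, 2.0), quantize_int8(0.0, 2.0))             # -127 0
```

Both methods shrink a 4-byte float (to 2 bytes and 1 byte respectively) at the cost of precision, and under the assumed int8 mapping any value outside −N to +N is clamped to the nearest bound.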

To combine compression methods, separate values with a vertical bar (|). For example, "compress_type": "uniq|equal".

Do not combine fp16 or int8#N with uniq|equal on a single-value FLOAT attribute.
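The two rules above can be checked before a schema is submitted. The sketch below is a hypothetical client-side helper, not part of OpenSearch. It reads the restriction as "fp16 or int8#N must not appear together with uniq or equal", which is an interpretation of the sentence above.

```python
from typing import Iterable


def build_compress_type(methods: Iterable[str]) -> str:
    """Join compression method names with the vertical bar separator."""
    return "|".join(methods)


def violates_float_rule(compress_type: str, field_type: str, multi_value: bool) -> bool:
    """Flag fp16 or int8#N used with uniq or equal on a single-value FLOAT field."""
    if field_type != "FLOAT" or multi_value:
        return False  # the restriction only concerns single-value FLOAT attributes
    methods = [m for m in compress_type.split("|") if m]
    has_quantizer = any(m == "fp16" or m.startswith("int8#") for m in methods)
    has_uniq_or_equal = any(m in ("uniq", "equal") for m in methods)
    return has_quantizer and has_uniq_or_equal


print(build_compress_type(["uniq", "equal"]))            # uniq|equal
print(violates_float_rule("fp16|equal", "FLOAT", False))  # True
print(violates_float_rule("fp16", "FLOAT", False))        # False
```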

Field parameters

| Parameter | Description | Default |
| --- | --- | --- |
| field_name | Name of the field | |
| field_type | Data type of the field | |
| multi_value | Whether the field holds multiple values per document | false |
| compress_type | Compression method(s) for attribute storage | "" (no compression) |
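As a quick way to see which compression each field ends up with, the following sketch reads a schema fragment and fills in the documented defaults. The function `fields_with_defaults` is a hypothetical helper, and the schema text is a shortened variant of the example above.

```python
import json
from typing import Any, Dict, List

# Defaults taken from the parameter table: multi_value=false, compress_type="".
DEFAULTS: Dict[str, Any] = {"multi_value": False, "compress_type": ""}


def fields_with_defaults(schema_text: str) -> List[Dict[str, Any]]:
    """Parse a schema and fill in omitted parameters with their defaults."""
    schema = json.loads(schema_text)
    return [{**DEFAULTS, **field} for field in schema["fields"]]


schema_text = """
{
  "fields": [
    {"field_name": "category", "field_type": "INTEGER",
     "multi_value": true, "compress_type": "uniq|equal"},
    {"field_name": "price", "field_type": "INTEGER"}
  ]
}
"""

for field in fields_with_defaults(schema_text):
    # An empty compress_type means no compression is enabled.
    methods = field["compress_type"].split("|") if field["compress_type"] else []
    print(field["field_name"], field["multi_value"], methods)
# category True ['uniq', 'equal']
# price False []
```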