Vector indexes

Vector indexes store embeddings alongside your document data and retrieve the top-k most similar items by vector distance. Use vector indexes to build semantic search, recommendation, and similarity-matching features on OpenSearch Retrieval Engine Edition.

Each vector index pairs a builder (constructs the index at ingest time) with a matching searcher (runs queries). The builder you choose determines the retrieval mode:

  • Approximate nearest neighbor (ANN): fast, scalable search for large production workloads (> 10,000 documents). Uses QcBuilder + QcSearcher.

  • Exact brute-force search: lossless recall for small datasets (< 10,000 documents). Uses LinearBuilder + LinearSearcher.

Choose a builder

| Scenario | Builder | Searcher | Notes |
|---|---|---|---|
| Large-scale production (> 10,000 documents) | QcBuilder | QcSearcher | ANN search via quantized clustering; CPU-based |
| Small datasets (< 10,000 documents) | LinearBuilder | LinearSearcher | Exact, brute-force search; lossless recall |
| Dataset starts small but may grow | QcBuilder | QcSearcher | Set linear_build_threshold to auto-switch to LinearBuilder below the threshold |

The searcher_name value must correspond to the builder_name value: QcSearcher pairs with QcBuilder, and LinearSearcher pairs with LinearBuilder. For GPU-accelerated search, contact technical support.
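
The selection rules above can be sketched as a small helper. This is an illustration only: choose_builder and its arguments are hypothetical names, not part of the product, and treating exactly 10,000 documents as "large" is an assumption, because the table only covers the cases above and below that count.

```python
def choose_builder(doc_count, may_grow=False, threshold=10_000):
    """Return the builder/searcher parameters suggested by the scenario table.

    Hypothetical helper for illustration; the threshold of 10,000 comes from
    the table, and the handling of doc_count == threshold is an assumption.
    """
    if doc_count < threshold and not may_grow:
        # Small, stable dataset: exact brute-force search.
        return {"builder_name": "LinearBuilder", "searcher_name": "LinearSearcher"}

    # Large or growing dataset: ANN search via quantized clustering.
    params = {"builder_name": "QcBuilder", "searcher_name": "QcSearcher"}
    if may_grow:
        # Parameter values are written as strings, as in the schema samples.
        params["linear_build_threshold"] = str(threshold)
    return params


print(choose_builder(500))
print(choose_builder(500, may_grow=True))
print(choose_builder(2_000_000))
```

The returned dictionary is meant to be merged into the parameters object of the vector index.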

Configure a vector index

All vector indexes use index_type: CUSTOMIZED and indexer: aitheta2_indexer. List the fields in index_fields in the following order:

  1. Primary key field

  2. Category field (only if using categories)

  3. Vector field
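
As a sketch of the ordering rule, the following helper assembles index_fields in the order listed above. build_index_fields is a hypothetical name used only for illustration; the boost value of 1 mirrors the schema samples in this topic.

```python
import json


def build_index_fields(primary_key, vector, category=None):
    """Assemble index_fields as: primary key, category (optional), vector.

    Hypothetical helper; field names are whatever your schema defines.
    """
    names = [primary_key]
    if category is not None:
        names.append(category)  # the category field sits between pk and vector
    names.append(vector)
    return [{"boost": 1, "field_name": name} for name in names]


print(json.dumps(build_index_fields("id", "vector_field"), indent=2))
print(json.dumps(build_index_fields("id", "vector_field", "category_id"), indent=2))
```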

Without categories

{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    }
  ]
}

With categories

Add a category_id field to enable category-scoped vector search. Without category-based indexing, a post-retrieval filter on a category field can return empty results because the filter is applied after the top-k candidates are selected. Category-based indexing partitions the search space upfront, which helps ensure that the filter returns results.
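
To see why a post-retrieval filter can come back empty, consider the toy example below. It uses plain brute-force search over four made-up documents and is not how the engine is implemented; it only contrasts filtering after top-k selection with restricting the search space to one category first.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(docs, query, k):
    """Return the k documents closest to the query (brute force)."""
    return sorted(docs, key=lambda d: squared_euclidean(d["vector"], query))[:k]


# Made-up data: three category-1 documents near the query, one far category-2 document.
docs = [
    {"id": 1, "category_id": 1, "vector": [0.0, 0.0]},
    {"id": 2, "category_id": 1, "vector": [0.1, 0.0]},
    {"id": 3, "category_id": 1, "vector": [0.0, 0.2]},
    {"id": 4, "category_id": 2, "vector": [5.0, 5.0]},
]
query = [0.0, 0.0]

# Filter applied after the top-2 candidates are selected: category 2 is lost.
post_filtered = [d for d in top_k(docs, query, 2) if d["category_id"] == 2]

# Search space restricted to category 2 before selecting the top 2.
category_scoped = top_k([d for d in docs if d["category_id"] == 2], query, 2)

print([d["id"] for d in post_filtered])
print([d["id"] for d in category_scoped])
```

The schema that follows adds category_id to index_fields so that the search space is partitioned by category.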

{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field",
      "category_id"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "field_name": "category_id",
          "boost": 1
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field",
    "category_id"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    },
    {
      "field_name": "category_id",
      "field_type": "INTEGER"
    }
  ]
}
Important

If you configure a vector index as an administrator, remove the escape characters from the values of build_index_params and search_index_params.
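
The escape characters exist because build_index_params and search_index_params hold JSON text nested inside a JSON string. The sketch below shows both forms using Python's standard json module. It assumes that "removing the escape characters" means supplying the inner JSON text as-is; the two keys are copied from the samples above and the rest is illustrative.

```python
import json

# Inner settings as a normal object (subset of the sample above).
build_index_params = {
    "proxima.qc.builder.quantizer_class": "Int8QuantizerConverter",
    "proxima.qc.builder.thread_count": 10,
}

# In the schema file, the value is a string that itself contains JSON.
inner_text = json.dumps(build_index_params, separators=(",", ":"))
schema_fragment = json.dumps({"build_index_params": inner_text})

# The serialized schema contains \" sequences, as in the samples.
print(schema_fragment)

# Parsing the schema once yields the unescaped text.
unescaped = json.loads(schema_fragment)["build_index_params"]
print(unescaped)
```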

Parameter reference

Field configuration

| Parameter | Description |
|---|---|
| field_name | Fields used to build the vector index. All fields must be of the RAW data type. Requires at least two fields: one INTEGER primary key (or its hash value) and one vector field. Add an INTEGER category field for category-based indexing. List the fields in index_fields in this order: primary key, category (if present), vector field. |
| index_name | Name of the vector index. |
| index_type | Type of vector index. Set to CUSTOMIZED. |
| indexer | Plug-in used to build the index. Set to aitheta2_indexer. |

Builder and searcher parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| dimension | integer | | Number of vector dimensions. Required. |
| builder_name | string | | Builder type. Use QcBuilder for large datasets; LinearBuilder for datasets with fewer than 10,000 documents. |
| searcher_name | string | | Searcher type. Must match builder_name: QcSearcher pairs with QcBuilder; LinearSearcher pairs with LinearBuilder. |
| distance_type | string | | Distance metric. InnerProduct for inner product similarity; SquaredEuclidean for normalized data. |
| major_order | string | row | Storage layout. col (column store) delivers better performance but requires dimension to be a power of 2. row uses row store. |
| embedding_delimiter | string | , | Delimiter used to separate values in the vector field. |
| linear_build_threshold | integer | 10000 | Document count below which the system automatically uses LinearBuilder and LinearSearcher, regardless of builder_name. |
| min_scan_doc_cnt | integer | 10000 | Minimum candidate set size for ANN search. If both min_scan_doc_cnt and proxima.qc.searcher.scan_ratio are set, the larger effective value applies. |
| enable_recall_report | boolean | false | Whether to report the recall rate. |
| is_embedding_saved | boolean | false | Whether to save the original vectors. Set to true when INT8 or FP16 quantization is enabled together with real-time retrieval; otherwise incremental vectors cannot be rebuilt in batches. |
| enable_rt_build | boolean | true | Whether to support real-time indexing. |
| ignore_invalid_doc | boolean | true | Whether to skip documents with invalid vector data. |
| build_index_params | JSON string | | Advanced builder parameters. See Quantized clustering configurations. |
| search_index_params | JSON string | | Advanced searcher parameters. See HNSW (Hierarchical Navigable Small World) configurations. |
| rt_index_params | JSON string | | Real-time indexing parameters. Applies only when enable_rt_build is true. Example: {"proxima.oswg.streamer.segment_size": 2048}. |
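
Several constraints in this table can be checked before you submit a schema. The following is a minimal sketch of such a check; validate_parameters is a hypothetical helper, and it covers only three rules stated above: dimension is required, the searcher must pair with the builder, and major_order col requires a power-of-2 dimension.

```python
PAIRS = {"QcBuilder": "QcSearcher", "LinearBuilder": "LinearSearcher"}


def validate_parameters(params):
    """Return a list of problems found in a vector index parameters object."""
    problems = []

    dimension = params.get("dimension")
    if dimension is None:
        problems.append("dimension is required")

    builder = params.get("builder_name")
    searcher = params.get("searcher_name")
    if builder in PAIRS and searcher != PAIRS[builder]:
        problems.append(f"{builder} must be paired with {PAIRS[builder]}")

    # major_order defaults to row; only col has the power-of-2 requirement.
    if dimension is not None and params.get("major_order", "row") == "col":
        dim = int(dimension)  # values are strings in the schema samples
        if dim <= 0 or (dim & (dim - 1)) != 0:
            problems.append("major_order col requires dimension to be a power of 2")

    return problems


sample = {
    "dimension": "128",
    "builder_name": "QcBuilder",
    "searcher_name": "QcSearcher",
    "major_order": "col",
}
print(validate_parameters(sample))
print(validate_parameters({**sample, "dimension": "100"}))
```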

Tune candidate set size

The min_scan_doc_cnt parameter and the proxima.qc.searcher.scan_ratio parameter in search_index_params both control the minimum candidate set size for ANN retrieval. When both are set, the system uses whichever value produces the larger candidate set.

Use the following formulas as a starting point for top-k retrieval:

  • min_scan_doc_cnt = max(10000, 100 * topk)

  • proxima.qc.searcher.scan_ratio = max(10000, 100 * topk) / total_doc_cnt

Then adjust based on your latency requirements, recall targets, and document count. Setting either value too high increases latency and degrades throughput.

For most use cases, setting only proxima.qc.searcher.scan_ratio is sufficient. Use min_scan_doc_cnt primarily for real-time and multi-category scenarios where the ratio-based calculation may not produce enough candidates.
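
The starting-point formulas can be computed as follows. This is a sketch under stated assumptions: the function names are hypothetical, capping scan_ratio at 1.0 is an added safeguard for small tables, and effective_candidates models "whichever value produces the larger candidate set" as a simple maximum, which may differ from the engine's exact calculation.

```python
def starting_values(topk, total_doc_cnt):
    """Starting values for min_scan_doc_cnt and scan_ratio from the formulas above."""
    target = max(10_000, 100 * topk)
    return {
        "min_scan_doc_cnt": target,
        # Assumption: a ratio above 1.0 is not meaningful, so cap it.
        "scan_ratio": min(1.0, target / total_doc_cnt),
    }


def effective_candidates(min_scan_doc_cnt, scan_ratio, total_doc_cnt):
    """Approximate candidate set size when both settings are present."""
    return max(min_scan_doc_cnt, round(scan_ratio * total_doc_cnt))


# Example: top-200 retrieval over 5,000,000 documents.
print(starting_values(topk=200, total_doc_cnt=5_000_000))

# Settings from the sample schema (20000 and 0.01) applied to 1,000,000 documents.
print(effective_candidates(20_000, 0.01, 1_000_000))
```

With the sample schema's settings, min_scan_doc_cnt dominates until the table grows past 2,000,000 documents, at which point the ratio yields the larger candidate set.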

What's next