Vector indexes store embeddings alongside your document data and retrieve the top-k most similar items by vector distance. Use vector indexes to build semantic search, recommendation, and similarity-matching features on OpenSearch Retrieval Engine Edition.
Each vector index pairs a builder (constructs the index at ingest time) with a matching searcher (runs queries). The builder you choose determines the retrieval mode:
Approximate nearest neighbor (ANN): fast, scalable search for large production workloads (> 10,000 documents). Uses QcBuilder + QcSearcher.
Exact brute-force search: lossless recall for small datasets (< 10,000 documents). Uses LinearBuilder + LinearSearcher.
Choose a builder
| Scenario | Builder | Searcher | Notes |
|---|---|---|---|
| Large-scale production (> 10,000 documents) | QcBuilder | QcSearcher | ANN search via quantized clustering; CPU-based |
| Small datasets (< 10,000 documents) | LinearBuilder | LinearSearcher | Exact, brute-force search; lossless recall |
| Dataset starts small but may grow | QcBuilder | QcSearcher | Set linear_build_threshold to auto-switch to LinearBuilder below the threshold |
The searcher_name value must match the builder_name value. For GPU-accelerated search, contact technical support.
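The selection rules in the table can be sketched as a small helper. This is an illustration only, not part of the product: the function name effective_pair is hypothetical, and it models just the documented rule that a document count below linear_build_threshold falls back to LinearBuilder and LinearSearcher.

```python
# Hypothetical helper: models the documented builder/searcher pairing only.
PAIRS = {"QcBuilder": "QcSearcher", "LinearBuilder": "LinearSearcher"}


def effective_pair(builder_name, doc_count, linear_build_threshold=10000):
    """Return the (builder, searcher) pair expected for doc_count documents."""
    if builder_name not in PAIRS:
        raise ValueError(f"unknown builder: {builder_name}")
    # Below the threshold the linear pair is used regardless of builder_name.
    if doc_count < linear_build_threshold:
        builder_name = "LinearBuilder"
    return builder_name, PAIRS[builder_name]


print(effective_pair("QcBuilder", 3000, linear_build_threshold=5000))
print(effective_pair("QcBuilder", 250000, linear_build_threshold=5000))
```

With a threshold of 5000, the first call resolves to the linear pair and the second keeps the Qc pair.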
Configure a vector index
All vector indexes use index_type: CUSTOMIZED and indexer: aitheta2_indexer. The field order in index_fields must match the order in fields:
Primary key field
Category field (only if using categories)
Vector field
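The ordering rule above can be checked before a schema is submitted. The following is a minimal sketch under stated assumptions: check_index_field_order is a hypothetical helper, and it only compares the field_name entries of index_fields against the expected sequence.

```python
def check_index_field_order(index_fields, pk, vector, category=None):
    """Return True if index_fields lists primary key, optional category, then vector."""
    names = [field["field_name"] for field in index_fields]
    expected = [pk] + ([category] if category else []) + [vector]
    return names == expected


index_fields = [
    {"boost": 1, "field_name": "id"},
    {"boost": 1, "field_name": "category_id"},
    {"boost": 1, "field_name": "vector_field"},
]
print(check_index_field_order(index_fields, "id", "vector_field", "category_id"))
```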
Without categories
{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    }
  ]
}
With categories
Add a category_id field to enable category-scoped vector search. Without category-based indexing, a post-retrieval filter on a category field can return empty results because the filter is applied after the top-k candidates are selected. Category-based indexing partitions the search space upfront, which helps ensure that the filter returns results.
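The difference between filtering after retrieval and scoping the search by category can be seen in a toy example. This sketch uses exact brute-force search over made-up data and hypothetical function names. It illustrates the ordering of the two steps only, not how the engine implements category-based indexing.

```python
# Toy data: (doc_id, category_id, 2-d vector). All values are made up for illustration.
DOCS = [
    (1, 10, (0.0, 0.1)),
    (2, 10, (0.1, 0.0)),
    (3, 10, (0.1, 0.1)),
    (4, 20, (0.9, 0.9)),
    (5, 20, (1.0, 0.8)),
]


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(docs, query, k):
    """Exact top-k by squared Euclidean distance (smaller is closer)."""
    return sorted(docs, key=lambda d: squared_euclidean(d[2], query))[:k]


def post_filter_search(docs, query, k, category_id):
    # Select top-k over all documents first, then drop other categories.
    return [d[0] for d in top_k(docs, query, k) if d[1] == category_id]


def category_scoped_search(docs, query, k, category_id):
    # Restrict the search space to the category first, then select top-k.
    scoped = [d for d in docs if d[1] == category_id]
    return [d[0] for d in top_k(scoped, query, k)]


query = (0.0, 0.0)
print(post_filter_search(DOCS, query, k=2, category_id=20))      # []
print(category_scoped_search(DOCS, query, k=2, category_id=20))  # [4, 5]
```

Both of the two nearest documents belong to category 10, so the post-retrieval filter for category 20 leaves nothing, while the scoped search still returns the two closest documents in category 20.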
{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field",
      "category_id"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "field_name": "category_id",
          "boost": 1
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field",
    "category_id"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    },
    {
      "field_name": "category_id",
      "field_type": "INTEGER"
    }
  ]
}
If you configure a vector index as an administrator, remove the escape characters from the values of build_index_params and search_index_params.
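One way to obtain the unescaped form is to decode the escaped value as a JSON string. The sketch below uses Python's standard json module. Where and how an administrator enters the unescaped value is not covered here, so treat this only as a way to produce and verify the text.

```python
import json

# The value as it appears inside the schema file (the backslashes are part of the text).
escaped = r'"{\"proxima.qc.searcher.scan_ratio\":0.01}"'

# Decoding the JSON string literal yields the same content without escape characters.
unescaped = json.loads(escaped)
print(unescaped)  # {"proxima.qc.searcher.scan_ratio":0.01}

# The unescaped text is itself valid JSON and can be parsed to confirm it is well formed.
params = json.loads(unescaped)
print(params["proxima.qc.searcher.scan_ratio"])  # 0.01
```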
Parameter reference
Field configuration
| Parameter | Description |
|---|---|
| field_name | Fields used to build the vector index. All fields must be of the RAW data type. Requires at least two fields: one INTEGER primary key (or its hash value) and one vector field. Add an INTEGER category field for category-based indexing. Field order in index_fields must match fields: primary key → category (if present) → vector field. |
| index_name | Name of the vector index. |
| index_type | Type of vector index. Set to CUSTOMIZED. |
| indexer | Plug-in used to build the index. Set to aitheta2_indexer. |
Builder and searcher parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| dimension | integer | - | Number of vector dimensions. Required. |
| builder_name | string | - | Builder type. Use QcBuilder for large datasets; LinearBuilder for datasets with fewer than 10,000 documents. |
| searcher_name | string | - | Searcher type. Must match builder_name: QcSearcher pairs with QcBuilder; LinearSearcher pairs with LinearBuilder. |
| distance_type | string | - | Distance metric. InnerProduct for inner product similarity; SquaredEuclidean for squared Euclidean distance on normalized data. |
| major_order | string | row | Storage layout. col (column store) delivers better performance but requires dimension to be a power of 2. row uses row store. |
| embedding_delimiter | string | , | Delimiter used to separate values in the vector field. |
| linear_build_threshold | integer | 10000 | Document count below which the system automatically uses LinearBuilder and LinearSearcher, regardless of builder_name. |
| min_scan_doc_cnt | integer | 10000 | Minimum candidate set size for ANN search. If both min_scan_doc_cnt and proxima.qc.searcher.scan_ratio are set, the larger effective value applies. |
| enable_recall_report | boolean | false | Whether to report the recall rate. |
| is_embedding_saved | boolean | false | Whether to save the original vectors. Set to true when INT8 or FP16 quantization is enabled together with real-time retrieval; otherwise incremental vectors cannot be rebuilt in batches. |
| enable_rt_build | boolean | true | Whether to support real-time indexing. |
| ignore_invalid_doc | boolean | true | Whether to skip documents with invalid vector data. |
| build_index_params | JSON string | - | Advanced builder parameters. See Quantized clustering configurations. |
| search_index_params | JSON string | - | Advanced searcher parameters. See HNSW (Hierarchical Navigable Small World) configurations. |
| rt_index_params | JSON string | - | Real-time indexing parameters. Applies only when enable_rt_build is true. Example: {"proxima.oswg.streamer.segment_size": 2048}. |
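Several constraints in the table can be checked mechanically before a schema is deployed. The following is a minimal sketch, assuming the parameters object has already been loaded into a Python dict. check_parameters is a hypothetical helper that covers only three of the documented rules: dimension is required, searcher_name must pair with builder_name, and major_order col requires a power-of-2 dimension.

```python
PAIRS = {"QcBuilder": "QcSearcher", "LinearBuilder": "LinearSearcher"}


def check_parameters(params):
    """Return a list of human-readable problems found in a 'parameters' dict."""
    problems = []
    dimension = params.get("dimension")
    if dimension is None:
        problems.append("dimension is required")
    builder = params.get("builder_name")
    searcher = params.get("searcher_name")
    if builder not in PAIRS:
        problems.append(f"unknown builder_name: {builder!r}")
    elif PAIRS[builder] != searcher:
        problems.append(
            f"searcher_name {searcher!r} does not match builder_name {builder!r}"
        )
    # major_order defaults to row; col requires a power-of-2 dimension.
    if params.get("major_order", "row") == "col" and dimension is not None:
        d = int(dimension)  # schema values are strings, for example "128"
        if d <= 0 or (d & (d - 1)) != 0:
            problems.append("major_order col requires dimension to be a power of 2")
    return problems


example = {
    "dimension": "128",
    "builder_name": "QcBuilder",
    "searcher_name": "QcSearcher",
    "major_order": "col",
}
print(check_parameters(example))  # []
```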
Tune candidate set size
The min_scan_doc_cnt parameter and the proxima.qc.searcher.scan_ratio parameter in search_index_params both control the minimum candidate set size for ANN retrieval. When both are set, the system uses whichever value produces the larger candidate set.
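The interaction of the two settings can be expressed as a short calculation. This is a sketch under assumptions: effective_candidates is a hypothetical name, and rounding the ratio-based count to the nearest integer is a choice made here, not documented engine behavior.

```python
def effective_candidates(total_doc_cnt, min_scan_doc_cnt=None, scan_ratio=None):
    """Candidate count implied by the two settings; the larger one applies."""
    by_count = min_scan_doc_cnt or 0
    # Assumption: the ratio-based count is rounded to the nearest integer.
    by_ratio = round(scan_ratio * total_doc_cnt) if scan_ratio else 0
    return max(by_count, by_ratio)


# With 5,000,000 documents, a ratio of 0.01 outweighs min_scan_doc_cnt=20000.
print(effective_candidates(5_000_000, min_scan_doc_cnt=20000, scan_ratio=0.01))  # 50000
# With 1,000,000 documents, min_scan_doc_cnt=20000 is the larger value.
print(effective_candidates(1_000_000, min_scan_doc_cnt=20000, scan_ratio=0.01))  # 20000
```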
Use the following formulas as a starting point for top-k retrieval:
min_scan_doc_cnt = max(10000, 100 * topk)
proxima.qc.searcher.scan_ratio = max(10000, 100 * topk) / total_doc_cnt
Then adjust based on your latency requirements, recall targets, and document count. Setting either value too high increases latency and degrades throughput.
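As a worked example, the starting-point formulas above can be evaluated for a given top-k and document count. The helper name starting_values is hypothetical, and capping the ratio at 1.0 is an assumption added here for tables with fewer documents than the computed candidate count.

```python
def starting_values(topk, total_doc_cnt):
    """Starting-point settings derived from the formulas above."""
    min_scan_doc_cnt = max(10000, 100 * topk)
    # Assumption: cap the ratio at 1.0, since it cannot cover more than every document.
    scan_ratio = min(1.0, min_scan_doc_cnt / total_doc_cnt)
    return min_scan_doc_cnt, scan_ratio


print(starting_values(topk=50, total_doc_cnt=2_000_000))   # (10000, 0.005)
print(starting_values(topk=500, total_doc_cnt=2_000_000))  # (50000, 0.025)
```

For small top-k values the floor of 10,000 candidates dominates; once top-k exceeds 100, the 100 * topk term takes over.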
For most use cases, setting only proxima.qc.searcher.scan_ratio is sufficient. Use min_scan_doc_cnt primarily for real-time and multi-category scenarios where the ratio-based calculation may not produce enough candidates.