Vector indexes - OpenSearch - Alibaba Cloud Help Center

Vector indexes store embeddings alongside your document data and retrieve the top-k most similar items by vector distance. Use vector indexes to build semantic search, recommendation, and similarity-matching features on OpenSearch Retrieval Engine Edition.

Each vector index pairs a builder (constructs the index at ingest time) with a matching searcher (runs queries). The builder you choose determines the retrieval mode:

Approximate nearest neighbor (ANN): fast, scalable search for large production workloads (> 10,000 documents). Uses QcBuilder + QcSearcher.
Exact brute-force search: lossless recall for small datasets (< 10,000 documents). Uses LinearBuilder + LinearSearcher.

Choose a builder

Scenario	Builder	Searcher	Notes
Large-scale production (> 10,000 documents)	`QcBuilder`	`QcSearcher`	ANN search via quantized clustering; CPU-based
Small datasets (< 10,000 documents)	`LinearBuilder`	`LinearSearcher`	Exact, brute-force search; lossless recall
Dataset starts small but may grow	`QcBuilder`	`QcSearcher`	Set `linear_build_threshold` to auto-switch to `LinearBuilder` below the threshold

The searcher_name value must match the builder_name value. For GPU-accelerated search, contact technical support.
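The pairing rule and the `linear_build_threshold` fallback can be sketched in a few lines. This is a minimal illustration only: `effective_pair` is a hypothetical helper, not part of any OpenSearch SDK, and it assumes the fallback applies strictly below the threshold.

```python
# Hypothetical helper for illustration; not an OpenSearch API.
BUILDER_TO_SEARCHER = {
    "QcBuilder": "QcSearcher",
    "LinearBuilder": "LinearSearcher",
}


def effective_pair(builder_name, doc_count, linear_build_threshold=10000):
    """Return the (builder, searcher) pair expected to be in effect."""
    if builder_name not in BUILDER_TO_SEARCHER:
        raise ValueError(f"unknown builder: {builder_name}")
    # Assumption: below the threshold the linear pair is used,
    # regardless of the configured builder_name.
    if doc_count < linear_build_threshold:
        builder_name = "LinearBuilder"
    return builder_name, BUILDER_TO_SEARCHER[builder_name]


print(effective_pair("QcBuilder", 3_000, linear_build_threshold=5000))
print(effective_pair("QcBuilder", 2_000_000, linear_build_threshold=5000))
```

With a threshold of 5000, the first call resolves to the linear pair and the second keeps the quantized clustering pair.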

Configure a vector index

All vector indexes use index_type: CUSTOMIZED and indexer: aitheta2_indexer. The field order in index_fields must match the order in fields:

Primary key field
Category field (only if using categories)
Vector field
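The ordering rule above can be checked before a schema is submitted. The sketch below is illustrative: `expected_order` and `index_fields_in_order` are hypothetical helper names, and the check only looks at the `field_name` entries of `index_fields`.

```python
# Hypothetical pre-submission check; not an OpenSearch API.
def expected_order(primary_key, vector_field, category_field=None):
    """Primary key first, optional category second, vector field last."""
    order = [primary_key]
    if category_field is not None:
        order.append(category_field)
    order.append(vector_field)
    return order


def index_fields_in_order(index_fields, primary_key, vector_field,
                          category_field=None):
    """Return True if index_fields lists the fields in the required order."""
    actual = [entry["field_name"] for entry in index_fields]
    return actual == expected_order(primary_key, vector_field, category_field)


index_fields = [
    {"boost": 1, "field_name": "id"},
    {"boost": 1, "field_name": "category_id"},
    {"boost": 1, "field_name": "vector_field"},
]
print(index_fields_in_order(index_fields, "id", "vector_field", "category_id"))
```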

Without categories

{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    }
  ]
}

With categories

Add a category_id field to enable category-scoped vector search. Without category-based indexing, a post-retrieval filter on a category field can return empty results because the filter is applied after the top-k candidates are selected. Category-based indexing partitions the search space upfront, which helps ensure that the filter returns results.

{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field",
      "category_id"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "field_name": "category_id",
          "boost": 1
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field",
    "category_id"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "field_name": "category_id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    }
  ]
}
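The difference between post-retrieval filtering and category-scoped search can be seen on toy data. The sketch below uses a plain brute-force search in Python rather than the engine itself; the documents, function names, and query are all made up for illustration.

```python
# Toy illustration only; this is not how the engine is implemented.
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# (id, category_id, vector)
docs = [
    (1, 10, [0.0, 0.0]),
    (2, 10, [0.1, 0.0]),
    (3, 10, [0.0, 0.1]),
    (4, 20, [5.0, 5.0]),
    (5, 20, [6.0, 5.0]),
]


def top_k(candidates, query, k):
    """Return the k candidates closest to the query."""
    return sorted(candidates, key=lambda d: squared_euclidean(d[2], query))[:k]


def post_filter_search(query, k, category_id):
    """Select the global top-k first, then filter by category."""
    return [d[0] for d in top_k(docs, query, k) if d[1] == category_id]


def category_scoped_search(query, k, category_id):
    """Restrict the search space to one category, then select the top-k."""
    scoped = [d for d in docs if d[1] == category_id]
    return [d[0] for d in top_k(scoped, query, k)]


query = [0.0, 0.0]
print(post_filter_search(query, 2, 20))      # [] (both global hits are in category 10)
print(category_scoped_search(query, 2, 20))  # [4, 5]
```

The post-filter variant returns nothing because the two nearest documents overall belong to category 10, while the scoped variant searches only within category 20.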

Important

If you configure a vector index as an administrator, remove the escape characters (the backslashes in `\"`) from the values of build_index_params and search_index_params.
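The escaped form in the samples above is what results when a JSON object is serialized to a string and that string is then embedded in another JSON document. The sketch below reproduces both forms with Python's standard `json` module; the shortened parameter set is for illustration only.

```python
import json

# Advanced parameters as ordinary dictionaries (subset, for illustration).
build_index_params = {
    "proxima.qc.builder.quantizer_class": "Int8QuantizerConverter",
    "proxima.qc.builder.thread_count": 10,
}
search_index_params = {"proxima.qc.searcher.scan_ratio": 0.01}

# Each advanced parameter is stored as a JSON string inside "parameters".
parameters = {
    "builder_name": "QcBuilder",
    "searcher_name": "QcSearcher",
    "dimension": "128",
    "build_index_params": json.dumps(build_index_params, separators=(",", ":")),
    "search_index_params": json.dumps(search_index_params, separators=(",", ":")),
}

# Escaped form: the inner quotes appear as \" once the outer document is serialized.
escaped = json.dumps(parameters, indent=2)
print(escaped)

# Unescaped form: the inner JSON string on its own, without backslashes.
unescaped = parameters["search_index_params"]
print(unescaped)
```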

Parameter reference

Field configuration

Parameter	Description
`field_name`	Fields used to build the vector index. All fields must be of the RAW data type. Requires at least two fields: one INTEGER primary key (or its hash value) and one vector field. Add an INTEGER category field for category-based indexing. Field order in `index_fields` must match `fields`: primary key → category (if present) → vector field.
`index_name`	Name of the vector index.
`index_type`	Type of vector index. Set to `CUSTOMIZED`.
`indexer`	Plug-in used to build the index. Set to `aitheta2_indexer`.

Builder and searcher parameters

Parameter	Type	Default	Description
`dimension`	integer	—	Number of vector dimensions. Required.
`builder_name`	string	—	Builder type. Use `QcBuilder` for large datasets; `LinearBuilder` for datasets with fewer than 10,000 documents.
`searcher_name`	string	—	Searcher type. Must match `builder_name`: `QcSearcher` pairs with `QcBuilder`; `LinearSearcher` pairs with `LinearBuilder`.
`distance_type`	string	—	Distance metric. `InnerProduct` for inner product similarity; `SquaredEuclidean` for squared Euclidean distance, typically used with normalized vectors.
`major_order`	string	`row`	Storage layout. `col` (column store) delivers better performance but requires `dimension` to be a power of 2. `row` uses row store.
`embedding_delimiter`	string	`,`	Delimiter used to separate values in the vector field.
`linear_build_threshold`	integer	`10000`	Document count below which the system automatically uses `LinearBuilder` and `LinearSearcher`, regardless of `builder_name`.
`min_scan_doc_cnt`	integer	`10000`	Minimum candidate set size for ANN search. If both `min_scan_doc_cnt` and `proxima.qc.searcher.scan_ratio` are set, the larger effective value applies.
`enable_recall_report`	boolean	`false`	Whether to report the recall rate.
`is_embedding_saved`	boolean	`false`	Whether to save the original vectors. Set to `true` when INT8 or FP16 quantization is enabled together with real-time retrieval; otherwise incremental vectors cannot be rebuilt in batches.
`enable_rt_build`	boolean	`true`	Whether to support real-time indexing.
`ignore_invalid_doc`	boolean	`true`	Whether to skip documents with invalid vector data.
`build_index_params`	JSON string	—	Advanced builder parameters. See Quantized clustering configurations.
`search_index_params`	JSON string	—	Advanced searcher parameters. See HNSW (Hierarchical Navigable Small World) configurations.
`rt_index_params`	JSON string	—	Real-time indexing parameters. Applies only when `enable_rt_build` is `true`. Example: `{"proxima.oswg.streamer.segment_size": 2048}`.
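Several constraints in the table above can be checked mechanically before a configuration is submitted. The following is a minimal sketch, not an official validator: `validate_parameters` is a hypothetical helper, and it assumes that the presence of `proxima.qc.builder.quantizer_class` in `build_index_params` means quantization is enabled.

```python
import json

# Hypothetical validator for illustration; not an OpenSearch API.
PAIRS = {"QcBuilder": "QcSearcher", "LinearBuilder": "LinearSearcher"}


def as_bool(value, default):
    """Parse the string booleans used in the parameters block."""
    if value is None:
        return default
    return str(value).lower() == "true"


def validate_parameters(params):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    builder = params.get("builder_name")
    if builder not in PAIRS:
        problems.append(f"unknown builder_name: {builder!r}")
    elif PAIRS[builder] != params.get("searcher_name"):
        problems.append(f"{builder} must be paired with {PAIRS[builder]}")

    if "dimension" not in params:
        problems.append("dimension is required")
    else:
        dim = int(params["dimension"])
        is_power_of_two = dim > 0 and (dim & (dim - 1)) == 0
        if params.get("major_order", "row") == "col" and not is_power_of_two:
            problems.append("major_order 'col' requires a power-of-2 dimension")

    # Assumption: a quantizer_class entry means quantization is enabled.
    build = json.loads(params.get("build_index_params", "{}"))
    quantized = "proxima.qc.builder.quantizer_class" in build
    if (quantized and as_bool(params.get("enable_rt_build"), True)
            and not as_bool(params.get("is_embedding_saved"), False)):
        problems.append("set is_embedding_saved to true when quantization "
                        "and real-time build are both enabled")
    return problems


good = {
    "builder_name": "QcBuilder",
    "searcher_name": "QcSearcher",
    "dimension": "128",
    "major_order": "col",
    "enable_rt_build": "false",
    "is_embedding_saved": "false",
    "build_index_params": '{"proxima.qc.builder.quantizer_class":"Int8QuantizerConverter"}',
}
print(validate_parameters(good))
print(validate_parameters(dict(good, dimension="100")))
```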

Tune candidate set size

The min_scan_doc_cnt parameter and the proxima.qc.searcher.scan_ratio parameter in search_index_params both control the minimum candidate set size for ANN retrieval. When both are set, the system uses whichever value produces the larger candidate set.
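The "larger value wins" rule can be written out as follows. This is a sketch under stated assumptions: `effective_candidate_count` is a hypothetical helper, the ratio is assumed to be applied to the total document count, and the result is capped at the total because no more documents than exist can be scanned.

```python
# Hypothetical helper for illustration; not an OpenSearch API.
def effective_candidate_count(total_doc_cnt, min_scan_doc_cnt=None,
                              scan_ratio=None):
    """Return the minimum candidate set size implied by the two settings."""
    sizes = []
    if min_scan_doc_cnt is not None:
        sizes.append(min_scan_doc_cnt)
    if scan_ratio is not None:
        # Assumption: the ratio is relative to the total document count.
        sizes.append(round(scan_ratio * total_doc_cnt))
    if not sizes:
        raise ValueError("set min_scan_doc_cnt, scan_ratio, or both")
    # When both are set, the larger candidate set applies.
    return min(max(sizes), total_doc_cnt)


# Values from the sample configuration: min_scan_doc_cnt=20000, scan_ratio=0.01.
print(effective_candidate_count(1_000_000, 20000, 0.01))
print(effective_candidate_count(10_000_000, 20000, 0.01))
```

For one million documents the absolute floor of 20000 dominates; for ten million documents the ratio produces the larger candidate set.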

Use the following formulas as a starting point for top-k retrieval:

min_scan_doc_cnt = max(10000, 100 * topk)
proxima.qc.searcher.scan_ratio = max(10000, 100 * topk) / total_doc_cnt

Then adjust based on your latency requirements, recall targets, and document count. Setting either value too high increases latency and degrades throughput.
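The two starting-point formulas translate directly into code. The helper name `starting_values` is hypothetical, and the numbers it returns are initial values to tune from, not recommendations for any particular workload.

```python
# Hypothetical helper for illustration; not an OpenSearch API.
def starting_values(topk, total_doc_cnt):
    """Compute starting values for min_scan_doc_cnt and scan_ratio."""
    min_scan_doc_cnt = max(10000, 100 * topk)
    scan_ratio = min_scan_doc_cnt / total_doc_cnt
    return min_scan_doc_cnt, scan_ratio


for topk in (10, 500):
    count, ratio = starting_values(topk, total_doc_cnt=1_000_000)
    print(f"topk={topk}: min_scan_doc_cnt={count}, scan_ratio={ratio}")
```

For collections smaller than the computed candidate count the ratio exceeds 1, which is one more reason such collections fall under the `LinearBuilder` guidance.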

For most use cases, setting only proxima.qc.searcher.scan_ratio is sufficient. Use min_scan_doc_cnt primarily for real-time and multi-category scenarios where the ratio-based calculation may not produce enough candidates.