What are vector indexes

Vector indexes represent items or content as vectors and support top-K similarity retrieval based on vector distance.

Introduction to vector indexes

Vector retrieval represents items or content as vectors and builds a vector index over them. At query time, you supply one or more user or item vectors, and the top-K items or content closest to them by vector distance are returned.
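To make the idea of top-K retrieval by vector distance concrete, here is a minimal brute-force sketch in Python. It is an illustration only, not how OpenSearch implements retrieval: the function names and sample data are hypothetical, and it scans every vector, which is only practical for small datasets.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: Sequence[float],
          items: Dict[int, Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = ((item_id, squared_euclidean(query, vec))
              for item_id, vec in items.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical sample data: item id -> 2-dimensional vector.
sample_items = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [5.0, 5.0]}
print([item_id for item_id, _ in top_k([0.9, 0.9], sample_items, 2)])  # prints [2, 1]
```

A vector index exists to avoid this full scan on large datasets, trading some recall for lower latency.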

Vector index configuration

Vector configuration without categories

```
{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    }
  ]
}
```

Vector configuration with categories

```
{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field",
      "category_id"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "field_name": "category_id",
          "boost": 1
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field",
    "category_id"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    },
    {
      "field_name": "category_id",
      "field_type": "INTEGER"
    }
  ]
}
```
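Several rules from the field descriptions in this topic can be checked mechanically before a configuration is submitted: the order of index_fields, the builder and searcher pairing, and the power-of-2 dimension required by column-major storage. The following Python sketch shows one way to do that. The function name check_vector_index and its arguments are hypothetical, and the checks cover only rules stated in this topic, not everything the service validates.

```python
from typing import List, Optional

# Builder/searcher pairs named in the field descriptions.
SEARCHER_FOR_BUILDER = {"QcBuilder": "QcSearcher", "LinearBuilder": "LinearSearcher"}


def check_vector_index(index: dict, pk: str, vector: str,
                       category: Optional[str] = None) -> List[str]:
    """Return human-readable problems found in one vector index entry."""
    problems = []
    if index.get("index_type") != "CUSTOMIZED":
        problems.append("index_type must be CUSTOMIZED")
    if index.get("indexer") != "aitheta2_indexer":
        problems.append("indexer must be aitheta2_indexer")

    # Required order: primary key, category (if applicable), vector.
    expected = [pk] + ([category] if category else []) + [vector]
    actual = [f.get("field_name") for f in index.get("index_fields", [])]
    if actual != expected:
        problems.append(f"index_fields order is {actual}, expected {expected}")

    params = index.get("parameters", {})
    builder = params.get("builder_name")
    searcher = params.get("searcher_name")
    if builder in SEARCHER_FOR_BUILDER and searcher != SEARCHER_FOR_BUILDER[builder]:
        problems.append(f"{builder} requires {SEARCHER_FOR_BUILDER[builder]}")

    # Column-major storage requires a power-of-2 dimension.
    if params.get("major_order") == "col":
        dim = int(params.get("dimension", "0"))
        if dim <= 0 or (dim & (dim - 1)) != 0:
            problems.append("major_order col requires a power-of-2 dimension")
    return problems


# Trimmed copy of the category-aware "embedding" index from the example.
example = {
    "index_type": "CUSTOMIZED",
    "indexer": "aitheta2_indexer",
    "index_fields": [{"field_name": "id"}, {"field_name": "category_id"},
                     {"field_name": "vector_field"}],
    "parameters": {"builder_name": "QcBuilder", "searcher_name": "QcSearcher",
                   "major_order": "col", "dimension": "128"},
}
print(check_vector_index(example, "id", "vector_field", "category_id"))  # prints []
```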

Important

Add a category field to the index to support vector retrieval by category. For example, an image can belong to multiple categories. If you build the index without categories and only filter the retrieved vectors by category afterward, the search may return no results, for example when none of the top-K vectors belongs to the requested category.
When you configure a vector index in administrator mode, you must unescape the content of the build_index_params and search_index_params parameters.
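In the JSON examples above, build_index_params and search_index_params are JSON objects stored as escaped strings. A standard JSON library can convert between the two forms. The following Python sketch assumes that the unescaped form is the plain JSON object; the helper names are hypothetical, and the exact input that administrator mode expects is not specified in this topic.

```python
import json

# Subset of the QC builder parameters from the example configuration.
build_params = {
    "proxima.qc.builder.quantizer_class": "Int8QuantizerConverter",
    "proxima.qc.builder.thread_count": 10,
}


def to_param_string(params: dict) -> str:
    """Serialize parameters to compact JSON text, used as a string value."""
    return json.dumps(params, separators=(",", ":"))


def from_param_string(value: str) -> dict:
    """Parse the string value back into a plain (unescaped) JSON object."""
    return json.loads(value)


param_string = to_param_string(build_params)

# Embedding the string in an outer JSON document escapes the inner quotes,
# which yields the \" sequences seen in the example configuration.
schema_fragment = json.dumps({"build_index_params": param_string})
print(schema_fragment)

# Unescaping is the reverse: read the string value, then parse it as JSON.
restored = from_param_string(json.loads(schema_fragment)["build_index_params"])
print(restored == build_params)  # prints True
```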

Field descriptions

field_name: The name of a field used to build the vector index. The index_fields array must contain at least two fields: a primary key (or its hash value) and a vector field. The primary key value must be an integer. To build an index by category, also add a category field with an integer value. List the fields in index_fields in this order: primary key, category (if applicable), vector.
index_name: The name of the vector index.
index_type: The index type. Must be set to CUSTOMIZED.
indexer: The plugin for building the vector index. Only aitheta2_indexer is supported.
parameters: The build and query parameters for the vector index.
- dimension: The number of dimensions of the vectors.
- embedding_delimiter: The separator for vector elements. Default: comma (,).
- distance_type: The distance metric. Supported values:
  - InnerProduct (inner product)
  - SquaredEuclidean (squared Euclidean distance): Use this type for normalized data.
- major_order: The data storage format. Supported values:
  - col (column store): For better performance, use this format. The dimension must be a power of 2.
  - row (row store): This is the default format.
- builder_name: The index builder type. Recommended types are listed below. For more parameters, contact us.
  - QcBuilder
  - LinearBuilder (linear build): This builder is recommended when the number of documents is less than 10,000.
- searcher_name: The index searcher type. Must correspond to the builder_name value. For GPU support, contact us.
  - QcSearcher: A CPU-based searcher that corresponds to QcBuilder.
  - LinearSearcher: A CPU-based brute-force searcher that corresponds to LinearBuilder.
- build_index_params: The index build parameters, which correspond to the builder_name value. For more information, see Quantized Clustering (QC) configuration.
- search_index_params: The index search parameters, which correspond to the searcher_name value. For more information, see search_index_params.
- linear_build_threshold: The threshold for linear builds. If the document count is below this value, LinearBuilder and LinearSearcher are used. Default: 10,000. Linear builds save memory and provide lossless retrieval but perform poorly on large datasets.
- min_scan_doc_cnt: The minimum number of documents in the candidate set for retrieval. Default: 10,000. This parameter works similarly to proxima.qc.searcher.scan_ratio. If both are configured, the larger value is used.
  - A larger value for scan_ratio or min_scan_doc_cnt is not always better. A value that is too large can significantly degrade performance and increase latency.
  - Generally, to retrieve the top-K vectors, the recommended value for min_scan_doc_cnt is max(10000, 100 * topk). The recommended value for scan_ratio is max(10000, 100 * topk) / total_doc_cnt. The specific values depend on your data size, recall rate, and performance requirements.
  - These two similar parameters exist to meet the requirements of real-time and multi-category scenarios. In most cases, you only need to configure scan_ratio.
- enable_recall_report: Whether to report recall rate metrics. Default: false.
- is_embedding_saved: Whether to save the original vectors. Default: false. Must be set to true if you enable INT8/FP16 quantization and real-time retrieval. Otherwise, batch incremental builds fail.
- enable_rt_build: Whether to enable real-time indexing. Default: true.
- ignore_invalid_doc: Whether to ignore invalid vector data. Default: true.
- rt_index_params: Real-time index parameters. Applicable when enable_rt_build is true. Example:
```
{
  "proxima.oswg.streamer.segment_size": 2048
}
```
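The sizing guidance for min_scan_doc_cnt and scan_ratio above can be turned into a small calculation. The following Python sketch applies the recommended formulas as starting points only. The function names are hypothetical, and capping the ratio at 1.0 is an assumption made here so that small datasets do not produce a ratio above 100%.

```python
def recommended_min_scan_doc_cnt(topk: int) -> int:
    """Recommended min_scan_doc_cnt: max(10000, 100 * topk)."""
    if topk <= 0:
        raise ValueError("topk must be positive")
    return max(10000, 100 * topk)


def recommended_scan_ratio(topk: int, total_doc_cnt: int) -> float:
    """Recommended scan_ratio: max(10000, 100 * topk) / total_doc_cnt."""
    if total_doc_cnt <= 0:
        raise ValueError("total_doc_cnt must be positive")
    ratio = recommended_min_scan_doc_cnt(topk) / total_doc_cnt
    # Assumption: a ratio above 1.0 is not meaningful, so cap it.
    return min(1.0, ratio)


# Example: top-500 retrieval over 5,000,000 documents.
print(recommended_min_scan_doc_cnt(500))       # prints 50000
print(recommended_scan_ratio(500, 5_000_000))  # prints 0.01
```

Tune the resulting values against your own data size, recall rate, and latency requirements.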