Introduction to vector indexes
Vector retrieval represents items or content as vectors and builds a vector index. This index lets you input one or more user or item vectors to retrieve the top-K items or content based on vector distance.
Vector index configuration
Vector configuration without categories
{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    }
  ]
}
Vector configuration with categories
{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field",
      "category_id"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "field_name": "category_id",
          "boost": 1
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field",
    "category_id"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    },
    {
      "field_name": "category_id",
      "field_type": "INTEGER"
    }
  ]
}
You can use categories to support vector retrieval by category. For example, an image can have multiple categories. If you build the index without categories and filter by category only after retrieval, every retrieved vector may be filtered out, and the search may return no results.
When you configure a vector index in administrator mode, you must unescape the content of the build_index_params and search_index_params parameters.
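In the samples above, build_index_params and search_index_params are JSON objects serialized into strings, which is why their inner quotes appear escaped. The following is a minimal Python sketch of converting between the two forms with the standard json module. The variable names are illustrative, and the exact form that administrator mode expects is an assumption based on the note above.

```python
import json

# A fragment of the schema file. The value of search_index_params is a
# string that itself contains JSON, so its quotes are escaped in the file.
schema_fragment = r'{"parameters": {"search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}"}}'

parameters = json.loads(schema_fragment)["parameters"]

# Unescaped form: parse the inner string to get a plain JSON object.
search_params = json.loads(parameters["search_index_params"])
print(search_params)

# Escaped form: serialize the object back into a compact string, then
# serialize that string again to see it as it appears in the schema file.
escaped = json.dumps(search_params, separators=(",", ":"))
print(json.dumps(escaped))
```

The same round trip applies to build_index_params, which also contains a nested object (proxima.qc.builder.optimizer_params).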
Field descriptions
field_name: Specifies the fields for building the vector index. You must include at least two fields: a primary key (or its hash value) and a vector field. The primary key value must be an integer. To build an index by category, you must also add a category field with an integer value. The fields in the index_fields array must be in the following order: primary key, category (if applicable), and vector.
index_name: Specifies the name of the vector index.
index_type: Specifies the index type. This parameter must be set to CUSTOMIZED.
indexer: Specifies the plugin for building the vector index. Currently, only aitheta2_indexer is supported.
parameters: Specifies the build and query parameters for the vector index.
dimension: Specifies the dimension of the vectors.
embedding_delimiter: Specifies the separator for elements in the vector. The default value is a comma (,).
distance_type: Specifies the distance metric. The following values are supported:
InnerProduct (inner product)
SquaredEuclidean (squared Euclidean distance): Use this type for normalized data.
major_order: Specifies the data storage format. The following values are supported:
col (column store): For better performance, use this format. The dimension must be a power of 2.
row (row store): This is the default format.
builder_name: Specifies the index builder type. We recommend one of the following types. For information about more parameters, contact us.
QcBuilder (quantized clustering build)
LinearBuilder (linear build): This builder is recommended when the number of documents is fewer than 10,000.
searcher_name: Specifies the index searcher type. The value of this parameter must correspond to the value of the builder_name parameter. If you require GPU support, contact us.
QcSearcher: A CPU-based searcher that corresponds to QcBuilder.
LinearSearcher: A CPU-based brute-force searcher that corresponds to LinearBuilder.
build_index_params: Specifies the index build parameters. These parameters correspond to the value of the builder_name parameter. For more information, see Quantized Clustering (QC) configuration.
search_index_params: Specifies the index search parameters. These parameters correspond to the value of the searcher_name parameter. For more information, see search_index_params.
linear_build_threshold: Specifies the threshold for a linear build. If the number of documents is below this threshold, LinearBuilder is used for building and LinearSearcher is used for searching. The default value is 10,000. A linear build saves memory and provides lossless retrieval. However, its performance is very poor for large datasets.
min_scan_doc_cnt: Specifies the minimum number of documents in the candidate set for retrieval. The default value is 10,000. This parameter is similar to proxima.qc.searcher.scan_ratio. If both parameters are configured, the larger value is used.
A larger value for scan_ratio or min_scan_doc_cnt is not always better. A value that is too large can significantly impact performance and increase latency.
Generally, to retrieve the top-K vectors, the recommended value for min_scan_doc_cnt is max(10000, 100 * topk). The recommended value for scan_ratio is max(10000, 100 * topk) / total_doc_cnt. The specific values depend on your data size, recall rate, and performance requirements.
These two similar parameters exist to meet the requirements of real-time and multi-category scenarios. In most cases, you only need to configure scan_ratio.
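The rule of thumb above for min_scan_doc_cnt and scan_ratio can be sketched as a small Python helper. The function name and the cap of scan_ratio at 1.0 are assumptions made for illustration. Treat the results as starting values to tune against your own recall and latency measurements.

```python
from typing import Tuple


def recommended_scan_settings(topk: int, total_doc_cnt: int) -> Tuple[int, float]:
    """Return (min_scan_doc_cnt, scan_ratio) following the rule of thumb."""
    if topk <= 0 or total_doc_cnt <= 0:
        raise ValueError("topk and total_doc_cnt must be positive")

    # min_scan_doc_cnt = max(10000, 100 * topk)
    min_scan_doc_cnt = max(10000, 100 * topk)

    # scan_ratio = max(10000, 100 * topk) / total_doc_cnt
    # Assumption: cap at 1.0, because a ratio above 1 would ask the
    # searcher to scan more documents than the index contains.
    scan_ratio = min(1.0, min_scan_doc_cnt / total_doc_cnt)
    return min_scan_doc_cnt, scan_ratio


# Top 200 results from an index of 10 million documents.
print(recommended_scan_settings(200, 10_000_000))  # (20000, 0.002)
```

For small top-K values the floor of 10,000 documents dominates, so the ratio depends only on the index size.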
enable_recall_report: Specifies whether to report recall rate metrics. The default value is false.
is_embedding_saved: Specifies whether to save the original vectors. The default value is false. If you enable INT8/FP16 quantization and real-time retrieval, you must set this parameter to true. Otherwise, batch incremental builds fail.
enable_rt_build: Specifies whether to support real-time indexing. The default value is true.
ignore_invalid_doc: Specifies whether to ignore problematic vector data. The default value is true.
rt_index_params: Specifies the real-time index parameters. You can configure this parameter if enable_rt_build is set to true. For example:
{ "proxima.oswg.streamer.segment_size": 2048 }
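Several constraints from the field descriptions above can be checked before you submit a schema. The following Python sketch covers a few of them: the index type and indexer, the number of index fields, the power-of-2 dimension for column store, the builder and searcher pairing, and whether the nested parameter strings parse as JSON. The function check_vector_index and its messages are hypothetical and are not part of the product. The sketch does not cover every rule.

```python
import json

# Pairing taken from the searcher_name description above.
BUILDER_TO_SEARCHER = {
    "QcBuilder": "QcSearcher",
    "LinearBuilder": "LinearSearcher",
}


def check_vector_index(index: dict) -> list:
    """Return a list of problems found in one vector index definition."""
    problems = []
    params = index.get("parameters", {})

    if index.get("index_type") != "CUSTOMIZED":
        problems.append("index_type must be CUSTOMIZED")
    if index.get("indexer") != "aitheta2_indexer":
        problems.append("indexer must be aitheta2_indexer")

    # Primary key and vector are required; the category field is optional.
    if len(index.get("index_fields", [])) not in (2, 3):
        problems.append("index_fields must list 2 or 3 fields")

    # Parameter values are strings in the samples, so convert first.
    try:
        dimension = int(params.get("dimension", ""))
    except (TypeError, ValueError):
        dimension = 0
    if dimension <= 0:
        problems.append("dimension must be a positive integer")
    elif params.get("major_order", "row") == "col" and dimension & (dimension - 1):
        problems.append("major_order col requires a power-of-2 dimension")

    expected = BUILDER_TO_SEARCHER.get(params.get("builder_name"))
    if expected is not None and params.get("searcher_name") != expected:
        problems.append("searcher_name does not match builder_name")

    # These two values are JSON objects serialized into strings.
    for key in ("build_index_params", "search_index_params"):
        if key in params:
            try:
                json.loads(params[key])
            except (TypeError, ValueError):
                problems.append(key + " is not a valid JSON string")

    return problems
```

An empty list means that none of the checked rules is violated. It does not guarantee that the index builds successfully.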