Dynamic vector quantization

Dynamic vector quantization compresses FP32 (32-bit floating-point) vectors to INT8 (8-bit integer) at insert time. In the benchmarks on this page, it reduces index size by roughly two thirds and increases query throughput by 48% or more, at the cost of some recall rate. Enable it by setting a quantization policy when you create a collection.

How it works

When you insert a document into a quantization-enabled collection, DashVector automatically converts each FP32 vector component to an INT8 value. Subsequent queries run against the compressed index, which fits more data in memory and processes queries faster. When you fetch a document by ID or include vectors in a search result, the returned vector is a dequantized approximation, not the original value.
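
The following is a minimal sketch of what FP32-to-INT8 quantization and dequantization can look like. It assumes a simple symmetric, per-vector scale, which is an illustrative choice and not necessarily the scheme DashVector uses internally. The function names `quantize_int8` and `dequantize_int8` are hypothetical, and the sketch uses only the Python standard library.

```python
import random
from array import array

def quantize_int8(vec):
    """Map each float component to an integer code in [-127, 127].

    Assumption: one scale per vector, derived from its largest magnitude.
    """
    max_abs = max((abs(x) for x in vec), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = array('b', (max(-127, min(127, round(x / scale))) for x in vec))
    return codes, scale

def dequantize_int8(codes, scale):
    """Rebuild an approximate float vector from the INT8 codes."""
    return [c * scale for c in codes]

random.seed(0)
original = array('f', (random.random() for _ in range(768)))  # FP32 storage
codes, scale = quantize_int8(original)
restored = dequantize_int8(codes, scale)

# Raw vector storage shrinks from 4 bytes to 1 byte per component.
print(len(original) * original.itemsize, len(codes) * codes.itemsize)

# The restored vector is close to the original but not identical.
max_error = max(abs(a - b) for a, b in zip(original, restored))
print(max_error <= scale / 2 + 1e-9)
```

The rounding step is what makes the returned vector an approximation: each component can differ from the inserted value by up to half a quantization step.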

When to use quantization

Quantization works well when your vectors are high-dimensional and your workload is throughput-sensitive. It is not suitable for all datasets: recall rate can drop significantly depending on the embedding model and data distribution. Test against your specific dataset before enabling it in production.

Important

Collections with quantization enabled do not support sparse vectors. To use both features together, contact us by joining the DingTalk group (ID: 25130022704).

Prerequisites

Before you begin, ensure that you have:

Enable quantization on a collection

Set quantize_type in extra_params when calling create().

Replace YOUR_API_KEY with your API key and YOUR_CLUSTER_ENDPOINT with your cluster endpoint. Find the endpoint on the Cluster Detail page in the DashVector console.

import dashvector
import numpy as np

client = dashvector.Client(
    api_key='YOUR_API_KEY',
    endpoint='YOUR_CLUSTER_ENDPOINT'
)
assert client

# Create a collection with INT8 quantization enabled.
ret = client.create(
    'quantize_demo',
    dimension=768,
    extra_params={
        'quantize_type': 'DT_VECTOR_INT8'
    }
)
print(ret)

collection = client.get('quantize_demo')

# Insert a document. DashVector quantizes the vector automatically on write.
collection.insert(('1', np.random.rand(768).astype('float32')))

# Fetch by ID. The returned vector is a dequantized approximation, not the original.
doc = collection.fetch('1')

# Query with vectors returned. Same as fetch: the returned vector is approximate.
docs = collection.query(
    vector=np.random.rand(768).astype('float32'),
    include_vector=True
)

Note

When you fetch a document by ID or query with include_vector=True, the returned vector is a dequantized approximation, not the original inserted value. For details, see Obtain documents and Search for documents.

Parameters

To set the quantization policy, pass the quantize_type key in extra_params (type Dict[str, str]) when you create a collection.

| Value | Description |
| --- | --- |
| DT_VECTOR_INT8 | Quantizes each FP32 vector component to INT8, reducing index size to approximately one third of the original. |

Performance and recall rate

The figures below are measured on a P.large cluster with cosine distance and TopK 100.

1 million 768-dimensional vectors

| Quantization policy | Index size ratio | QPS (queries per second) | Recall rate |
| --- | --- | --- | --- |
| None | 100% | 495.6 | 99.05% |
| DT_VECTOR_INT8 | 33.33% | 733.8 (+48%) | 94.67% |

On this dataset, DT_VECTOR_INT8 reduces index size by two thirds, increases QPS by 48%, and reduces recall rate by 4.38 percentage points.
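
As a quick check of the arithmetic, the summary figures can be recomputed from the two table rows. This is only a sketch; the dictionary keys are illustrative names, and the values are copied from the table above.

```python
# Values copied from the benchmark table (1 million 768-dimensional vectors).
baseline = {"index_ratio": 100.0, "qps": 495.6, "recall": 99.05}
int8 = {"index_ratio": 33.33, "qps": 733.8, "recall": 94.67}

# Relative QPS gain, in percent.
qps_gain_pct = (int8["qps"] / baseline["qps"] - 1) * 100

# Recall drop, in percentage points (not a relative percentage).
recall_drop_pp = baseline["recall"] - int8["recall"]

# Index size reduction, in percent of the original index.
size_reduction_pct = baseline["index_ratio"] - int8["index_ratio"]

print(f"QPS gain: {qps_gain_pct:.1f}%")
print(f"Recall drop: {recall_drop_pp:.2f} percentage points")
print(f"Index size reduction: {size_reduction_pct:.2f}%")
```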

Note

Results are based on the Cohere Wikipedia dataset and are for reference only. Actual results vary with your dataset and data distribution.

Benchmark across datasets

| Dataset | Index size ratio | Recall rate | QPS increase |
| --- | --- | --- | --- |
| Cohere 10M 768-dim Cosine | 33% | 95.28% | 170% |
| GIST 1M 960-dim L2 | 35% | 99.54% | 134% |
| OpenAI 5M 1536-dim Cosine | 34% | 67.34% | 189% |
| Deep1B 10M 96-dim Cosine | 52% | 99.97% | 135% |
| Internal dataset 8M 512-dim Cosine | 38% | 99.92% | 152% |

Recall rate on the OpenAI 1536-dimensional dataset (67.34%) is notably lower than on the other datasets. If your workload uses high-dimensional embeddings, test recall carefully before enabling quantization in production.

Apply in production

Quantization is not suitable for all datasets. Before rolling it out to production:

  1. Create two collections with the same data: one with DT_VECTOR_INT8 enabled and one without.

  2. Run representative queries against both and compare recall rate and QPS.

  3. Enable quantization in production only if the recall rate meets your requirements.
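
One way to carry out the comparison in step 2 is to treat the results from the non-quantized collection as the baseline and measure how many of its document IDs the quantized collection also returns. The sketch below assumes you have already collected the result IDs per query from both collections. The helper names `recall_at_k` and `mean_recall`, the sample IDs, and the 0.95 threshold are all hypothetical.

```python
def recall_at_k(baseline_ids, candidate_ids):
    """Fraction of baseline result IDs that also appear in the candidate results."""
    baseline = set(baseline_ids)
    if not baseline:
        raise ValueError("baseline_ids must not be empty")
    return len(baseline & set(candidate_ids)) / len(baseline)

def mean_recall(baseline_results, candidate_results):
    """Average recall over a list of queries (one ID list per query)."""
    pairs = list(zip(baseline_results, candidate_results))
    if not pairs:
        raise ValueError("no queries to compare")
    return sum(recall_at_k(b, c) for b, c in pairs) / len(pairs)

# Illustrative data: top-4 IDs for two queries from each collection.
baseline_results = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]
quantized_results = [["a", "b", "c", "x"], ["e", "f", "g", "h"]]

recall = mean_recall(baseline_results, quantized_results)
print(f"Mean recall vs. baseline: {recall:.3f}")

# Hypothetical acceptance threshold; choose one that fits your requirements.
REQUIRED_RECALL = 0.95
print("Meets requirement:", recall >= REQUIRED_RECALL)
```

This measures recall relative to the non-quantized collection rather than to exact ground truth, so it shows the loss introduced by quantization alone. Use the same queries and the same TopK for both collections so that the numbers are comparable.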

Next steps