Dynamic vector quantization compresses FP32 (32-bit floating-point) vectors to INT8 (8-bit integer) at insert time. In the benchmarks on this page, it reduces index size by roughly two thirds and increases query throughput by 48% or more, at the cost of a drop in recall rate that is modest on most datasets but varies with the data. Enable it by setting a quantization policy when you create a collection.
How it works
When you insert a document into a quantization-enabled collection, DashVector automatically converts each FP32 vector component to an INT8 value. Subsequent queries run against the compressed index, which fits more data in memory and processes queries faster. When you fetch a document by ID or include vectors in a search result, the returned vector is a dequantized approximation, not the original value.
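To make the idea of a dequantized approximation concrete, here is a minimal sketch of symmetric scalar quantization using only the Python standard library. It is an illustration under stated assumptions, not DashVector's actual algorithm, which this page does not describe: the function names `quantize_int8` and `dequantize_int8` and the choice of one scale per vector are hypothetical.

```python
def quantize_int8(vec):
    """Map floats to integer codes in [-127, 127] using one scale per vector.

    Assumption: a symmetric per-vector scale. DashVector's real scheme may differ.
    """
    max_abs = max((abs(x) for x in vec), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


original = [0.12, -0.53, 0.98, 0.004]
codes, scale = quantize_int8(original)
restored = dequantize_int8(codes, scale)

# The round trip is lossy: each component can move by up to half a step.
max_error = max(abs(a - b) for a, b in zip(original, restored))
print(codes)
print(restored)
print(max_error <= scale / 2 + 1e-12)
```

In this sketch every restored component lies within half a quantization step of the original, which is why a fetched vector is close to, but not identical to, the vector that was inserted.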
When to use quantization
Quantization works well when your dataset has a high-dimensional embedding space and your workload is throughput-sensitive. It is not suitable for all datasets — recall rate can drop significantly depending on the embedding model and data distribution. Test against your specific dataset before enabling it in production.
Collections with quantization enabled do not support sparse vectors. To use both features together, contact us by joining the DingTalk group (ID: 25130022704).
Prerequisites
Before you begin, ensure that you have:
A cluster (Create a cluster)
An API key (Manage API keys)
The latest DashVector SDK installed (Install DashVector SDK)
Enable quantization on a collection
Set quantize_type in extra_params when calling create().
Replace YOUR_API_KEY with your API key and YOUR_CLUSTER_ENDPOINT with your cluster endpoint. Find the endpoint on the Cluster Detail page in the DashVector console.
import dashvector
import numpy as np

client = dashvector.Client(
    api_key='YOUR_API_KEY',
    endpoint='YOUR_CLUSTER_ENDPOINT'
)
assert client

# Create a collection with INT8 quantization enabled.
ret = client.create(
    'quantize_demo',
    dimension=768,
    extra_params={
        'quantize_type': 'DT_VECTOR_INT8'
    }
)
print(ret)

collection = client.get('quantize_demo')

# Insert a document. DashVector quantizes the vector automatically on write.
collection.insert(('1', np.random.rand(768).astype('float32')))

# Fetch by ID. The returned vector is a dequantized approximation, not the original.
doc = collection.fetch('1')

# Query with vectors returned. Same as fetch: the returned vector is approximate.
docs = collection.query(
    vector=np.random.rand(768).astype('float32'),
    include_vector=True
)

When you fetch a document by ID or query with include_vector=True, the returned vector is a dequantized approximation, not the original inserted value. For details, see Obtain documents and Search for documents.
Parameters
Use the quantize_type field of extra_params: Dict[str, str] to set the quantization policy when creating a collection.
| Value | Description |
|---|---|
| DT_VECTOR_INT8 | Quantizes each FP32 vector component to INT8, reducing index size to approximately one third of the original. |
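As a rough back-of-the-envelope check, the sketch below computes the raw storage for vector components only, using 1 million 768-dimensional vectors as an example size. The helper `raw_vector_bytes` is hypothetical. The only facts it relies on are that an FP32 component takes 4 bytes and an INT8 component takes 1 byte.

```python
BYTES_FP32 = 4  # one 32-bit float component
BYTES_INT8 = 1  # one 8-bit integer component


def raw_vector_bytes(num_vectors, dimension, bytes_per_component):
    """Bytes needed for the vector components alone, excluding any index data."""
    return num_vectors * dimension * bytes_per_component


fp32_bytes = raw_vector_bytes(1_000_000, 768, BYTES_FP32)
int8_bytes = raw_vector_bytes(1_000_000, 768, BYTES_INT8)

print(fp32_bytes, int8_bytes)
print(int8_bytes / fp32_bytes)
```

The component payload alone shrinks to one quarter, while the index as a whole shrinks to about one third. One plausible reading, stated here as an assumption rather than a documented fact, is that the index also holds data that quantization does not compress.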
Performance and recall rate
The figures below are measured on a P.large cluster with cosine distance and TopK 100.
1 million 768-dimensional vectors
| Quantization policy | Index size ratio | QPS (queries per second) | Recall rate |
|---|---|---|---|
| None | 100% | 495.6 | 99.05% |
| DT_VECTOR_INT8 | 33.33% | 733.8 (+48%) | 94.67% |
On this dataset, DT_VECTOR_INT8 reduces index size by two thirds, increases QPS by 48%, and reduces recall rate by 4.38 percentage points.
Results are based on the Cohere Wikipedia dataset and are for reference only. Actual results vary with your dataset and data distribution.
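The summary figures for this dataset follow directly from the two table rows. The short sketch below reproduces the arithmetic so you can apply the same calculation to your own measurements. The dictionary keys are illustrative names, not fields of any DashVector API.

```python
# Values copied from the 1 million 768-dimensional vectors table.
baseline = {"index_size_pct": 100.0, "qps": 495.6, "recall_pct": 99.05}
int8 = {"index_size_pct": 33.33, "qps": 733.8, "recall_pct": 94.67}

# Index size reduction, in percent of the unquantized index.
size_reduction_pct = baseline["index_size_pct"] - int8["index_size_pct"]

# Relative QPS gain, in percent.
qps_gain_pct = (int8["qps"] / baseline["qps"] - 1) * 100

# Recall drop, in percentage points (an absolute difference, not a ratio).
recall_drop_points = baseline["recall_pct"] - int8["recall_pct"]

print(round(size_reduction_pct, 2))
print(round(qps_gain_pct, 1))
print(round(recall_drop_points, 2))
```

Note that the recall change is reported in percentage points, while the QPS change is a relative percentage.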
Benchmark across datasets
| Dataset | Index size ratio | Recall rate | QPS increase |
|---|---|---|---|
| Cohere 10M 768-dim Cosine | 33% | 95.28% | 170% |
| GIST 1M 960-dim L2 | 35% | 99.54% | 134% |
| OpenAI 5M 1536-dim Cosine | 34% | 67.34% | 189% |
| Deep1B 10M 96-dim Cosine | 52% | 99.97% | 135% |
| Internal dataset 8M 512-dim Cosine | 38% | 99.92% | 152% |
The OpenAI 1536-dimensional result (67.34% recall) is notably lower than the results for the other datasets. If your workload uses high-dimensional embeddings, test recall carefully before enabling quantization in production.
Apply in production
Quantization is not suitable for all datasets. Before rolling it out to production:
Create two collections with the same data: one with DT_VECTOR_INT8 and one without.
Run representative queries against both and compare recall rate and QPS.
Enable quantization in production only if the recall rate meets your requirements.
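One way to carry out the comparison step is to collect the document IDs that each collection returns for the same set of queries and measure their overlap. The sketch below is a minimal example of that calculation. The functions `recall_at_k` and `mean_recall` and the sample IDs are hypothetical, and the sketch treats the unquantized collection's results as the reference, which is itself an approximation if that collection uses approximate search.

```python
def recall_at_k(reference_ids, candidate_ids):
    """Fraction of reference IDs that also appear in the candidate results."""
    reference = set(reference_ids)
    if not reference:
        return 1.0
    return len(reference & set(candidate_ids)) / len(reference)


def mean_recall(reference_runs, candidate_runs):
    """Average recall over paired per-query result lists."""
    if len(reference_runs) != len(candidate_runs):
        raise ValueError("both collections must be queried with the same queries")
    scores = [recall_at_k(r, c) for r, c in zip(reference_runs, candidate_runs)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical document IDs returned for two queries by each collection.
unquantized = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]
quantized = [["a", "b", "c", "x"], ["e", "f", "g", "h"]]

print(mean_recall(unquantized, quantized))
```

In practice, the ID lists would come from running the same query vectors with the same TopK against both collections. Use enough queries for the average to be representative of your workload.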