QLean model quantization (v0.1.0)
Overview
This quantization tool enables easy and efficient model quantization and dequantization. Users can switch between modes by simply specifying different recipes. It supports the following features:
FP8 → BF16 dequantization: Restores FP8 model weights to BF16 precision for debugging or further processing.
W8A8-INT8 quantization: Applies weight quantization (W8) and activation quantization (A8) to a model to produce a standard INT8 quantized model.
Mixed-precision quantization: Preserves high precision for specific layers of a W8A8 model based on its architecture to achieve a better balance between precision and performance.
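To make the W8A8 terminology above concrete, the following is a minimal, generic sketch of symmetric per-tensor INT8 quantization and dequantization. It is not QLean's implementation, because this document does not describe QLean's scaling granularity or calibration. The function names `quantize_int8` and `dequantize_int8` and the sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Generic illustration only; QLean's actual scheme is an unknown here.
    """
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from INT8 codes and their scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]  # hypothetical weight values
q_weights, w_scale = quantize_int8(weights)
restored = dequantize_int8(q_weights, w_scale)
print(q_weights, w_scale)
```

In this scheme each restored value differs from the original by at most half a quantization step (`scale / 2`). That rounding error is the source of the accuracy loss that mixed-precision quantization tries to limit.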
Supported models
Model series | Model | Original precision | Dequantization (FP8 → BF16) | Quantization (→ W8A8 INT8)
DeepSeek | DeepSeek-V3.1 | FP8 | Y | Y
DeepSeek | DeepSeek-V3.1-Terminus | FP8 | Y | Y
DeepSeek | DeepSeek-V3.2-Exp | FP8 | Y | Y
DeepSeek | DeepSeek-R1 | FP8 | Y | Y
DeepSeek | DeepSeek-R1-0528 | FP8 | Y | Y
Qwen3-MOE | Qwen3-235B-A22B | BF16 | NA | Y (mixed-precision quantization)
Qwen3-MOE | Qwen3-235B-A22B-Instruct-2507 | BF16 | NA | Y (mixed-precision quantization)
Qwen3-MOE | Qwen3-Coder-480B-A35B-Instruct | BF16 | NA | Y (mixed-precision quantization)
Qwen3-Next | Qwen3-Next-80B-A3B-Instruct | BF16 | NA | Y
Qwen3-VL | Qwen3-VL-235B-A22B-Instruct | BF16 | NA | Y
Qwen3-VL | Qwen3-VL-30B-A3B-Instruct | BF16 | NA | Y
Qwen3-VL | Qwen3-VL-32B/8B/4B/2B-Instruct | BF16 | NA | Y
Usage
The tool is released as a wheel package. You can find it on PyPI and install it with the pip install command.
Installation
# Install dependencies
pip install triton_kernel==1.0.0+ppu2.0.0.oe
# Install qlean
pip install qlean==0.1.0+ppu2.0.0.oe
Parameters
--model_name: The name of the model. Required. The format must match the model ID on Hugging Face, such as "deepseek-ai/DeepSeek-V3.2-Exp".
--model_path: The local path to the original model. Required. For example, you can specify the path to a Hugging Face model's local cache.
--save_path: The path to save the dequantized or quantized model. Required.
--mix_path: The path to save the mixed-precision quantized model. Optional.
--recipe: The path to the quantization recipe. Optional. Must be a path to a YAML file.
For models listed in the "Supported models" table, you can omit the --recipe parameter. The system automatically loads the corresponding recipe.
For other models, you must create and provide a recipe file. For details on creating a recipe, see the "Writing a recipe" section.
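When qlean is driven from a script, the parameters above can be assembled programmatically. The following is a small sketch that uses only the flags documented here. The helper name `build_qlean_command` is hypothetical and the paths are placeholders.

```python
import shlex


def build_qlean_command(model_name, model_path, save_path, mix_path=None, recipe=None):
    """Assemble a qlean argument list from the documented parameters.

    Hypothetical helper; only the flag names come from this document.
    """
    cmd = [
        "qlean",
        "--model_name", model_name,
        "--model_path", model_path,
        "--save_path", save_path,
    ]
    # Optional parameters are added only when the caller supplies them.
    if mix_path is not None:
        cmd += ["--mix_path", mix_path]
    if recipe is not None:
        cmd += ["--recipe", recipe]
    return cmd


cmd = build_qlean_command(
    "deepseek-ai/DeepSeek-R1-0528",
    "/path/to/DeepSeek-R1-0528/",
    "/path/to/DeepSeek-R1-0528-INT8/",
)
print(shlex.join(cmd))  # shell-quoted form, convenient for logging
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where qlean is installed.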
Features
You can switch between features by using different recipes. The following examples show how to use each feature.
W8A8-INT8 quantization
The original model can be in FP8 or BF16 format. The system automatically selects an appropriate strategy based on the precision of the original model. If the model is in FP8 format, the system switches to a more precise quantization mode. If the model is in BF16 format, the system uses the general quantization mode.
# Replace /path/to/.../ with your actual path
# Original model is FP8
qlean --model_name deepseek-ai/DeepSeek-R1-0528 --model_path /path/to/DeepSeek-R1-0528/ --save_path /path/to/DeepSeek-R1-0528-INT8/
# Original model is BF16
qlean --model_name Qwen/Qwen3-VL-30B-A3B-Instruct --model_path /path/to/Qwen3-VL-30B-A3B-Instruct/ --save_path /path/to/Qwen3-VL-30B-A3B-Instruct-INT8/
FP8 → BF16 dequantization
# Replace /path/to/.../ with your actual path
qlean --model_name deepseek-ai/DeepSeek-V3.2 --model_path /path/to/DeepSeek-V3.2/ --save_path /path/to/DeepSeek-V3.2-BF16/ --recipe /path/to/dequant.yaml
# Contents of dequant.yaml
---
dequant_stage:
  dequant_modifiers:
    dequantModifier: {}
...
Mixed-precision quantization
After W8A8-INT8 quantization, some models may show a drop in accuracy for specific use cases. This can happen because certain layers in the model are sensitive to precision. You can use the mixed-precision quantization feature to replace these specific layers in the W8A8-quantized model with their BF16 versions. In this case, save_path points to the W8A8 model, and mix_path specifies the save path for the mixed-precision model.
# Replace /path/to/.../ with your actual path
qlean --model_name Qwen/Qwen3-235B-A22B-Instruct-2507 --model_path /path/to/Qwen3-235B-A22B-Instruct-2507/ --save_path /path/to/Qwen3-235B-A22B-Instruct-2507-INT8/ --mix_path /path/to/Qwen3-235B-A22B-Instruct-2507-MIX/
Writing a recipe
To perform W8A8-INT8 quantization on a model not listed in the "Supported models" table, create a YAML recipe file based on the following example. You only need to modify the ignore section.
---
quant_stage:
  quant_modifiers:
    generalDay0Modifier:
      ignore: ["module_to_ignore"] # Modify this based on your requirements
      scheme: W8A8
...
When you create the ignore list in your quantization configuration, add all quantization-sensitive modules. Modules that should typically not be quantized include re:.*lm_head, re:.*embed_tokens, and re:.*mlp.gate$. The modules to ignore vary by model architecture, so adjust the list as needed.
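The ignore entries above look like regular expressions carrying a `re:` prefix. The sketch below shows how such patterns could select module names. It assumes that the prefix marks a regular expression matched from the start of the full module name, and that entries without the prefix are compared literally. This document does not confirm either assumption, and the module names are hypothetical examples.

```python
import re


def is_ignored(module_name, ignore):
    """Return True if module_name is selected by any entry of the ignore list.

    Assumption: "re:" entries are regexes anchored at the start of the name
    (re.match); other entries must equal the module name exactly.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


ignore = ["re:.*lm_head", "re:.*embed_tokens", "re:.*mlp.gate$"]
module_names = [  # hypothetical names for illustration
    "lm_head",
    "model.embed_tokens",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.self_attn.q_proj",
]
skipped = [name for name in module_names if is_ignored(name, ignore)]
print(skipped)
```

Under these assumptions, the trailing `$` in `re:.*mlp.gate$` keeps the MoE router (`mlp.gate`) in higher precision while still letting `mlp.gate_proj` be quantized.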
To perform mixed-precision quantization on a model that is not listed in the "Supported models" table, add a mixedPrecisionModifier section to the preceding recipe. The layers parameter specifies the IDs of the layers to replace with their BF16 versions, and num_layers specifies the total number of layers in the model.
---
quant_stage:
  quant_modifiers:
    generalDay0Modifier:
      ignore: ["re:.*lm_head", "re:.*embed_tokens", "re:.*mlp.gate$"] # Modify this based on your requirements
      scheme: W8A8
    mixedPrecisionModifier:
      layers: ['88', '89', '92-93'] # Modify this based on your requirements
      num_layers: 94 # Modify this based on your requirements
...
After creating the YAML file, pass its path using the --recipe parameter.
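The layers list in the mixed-precision recipe mixes single IDs with ranges such as '92-93'. The sketch below shows one plausible reading of that notation. It assumes that ranges are inclusive and that layer IDs are zero-based, so valid IDs run from 0 to num_layers - 1. Neither assumption is confirmed here, and `expand_layers` is a hypothetical helper rather than part of QLean.

```python
def expand_layers(layers, num_layers):
    """Expand entries like '88' or '92-93' into a sorted list of layer IDs.

    Assumptions: ranges are inclusive and IDs are zero-based.
    """
    ids = set()
    for item in layers:
        start, sep, end = item.partition("-")
        low = int(start)
        high = int(end) if sep else low
        if low > high or high >= num_layers:
            raise ValueError(f"invalid layer spec {item!r} for {num_layers} layers")
        ids.update(range(low, high + 1))
    return sorted(ids)


bf16_layers = expand_layers(["88", "89", "92-93"], num_layers=94)
print(bf16_layers)
```

Checking a layer list this way before starting a long quantization run can catch an entry that falls outside the model's layer range.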
# Replace /name/of/.../ and /path/to/.../ with your actual names and paths
qlean --model_name /name/of/model/ --model_path /path/to/original_model/ --save_path /path/to/result_model/ --recipe /path/to/your/recipe/
Example: W8A8-INT8 quantization of DeepSeek-V3.2
Step 1: Create a deepseek-v3.2-recipe.yaml file.
---
quant_stage:
  quant_modifiers:
    generalDay0Modifier:
      ignore: ["re:.*lm_head", "re:.*embed_tokens", "re:.*mlp.gate$"]
      scheme: W8A8
...
Step 2: Run the quantization command, specifying the model name, the original model path, the save path for the quantized model, and the recipe file.
qlean --model_name deepseek-ai/DeepSeek-V3.2 --model_path /path/to/DeepSeek-V3.2/ --save_path /path/to/DeepSeek-V3.2-INT8/ --recipe /path/to/deepseek-v3.2-recipe.yaml
Test data
Tests show that models converted using the QLean tool perform as expected in terms of inference accuracy and performance. The details are as follows:
SGLang inference accuracy
Test image: pytorch2.8.0-ubuntu24.04-cuda12.9-sglang0.5.5-py312
Model | Dataset | Original data type | Original accuracy | QLean data type | Accuracy after QLean | Accuracy ratio
deepseek-ai/DeepSeek-V3.2 | ifeval | W8A8-INT8 | 88.5 | W8A8-INT8 | 88.1 | 99.55%
deepseek-ai/DeepSeek-V3.2 | gsm8k | W8A8-INT8 | 95.3 | W8A8-INT8 | 96.3 | 101.05%
deepseek-ai/DeepSeek-V3.2 | ceval | W8A8-INT8 | 92.1 | W8A8-INT8 | 91.0 | 98.81%
deepseek-ai/DeepSeek-R1-0528 | ifeval | BF16 | 80.2 | BF16 | 81.8 | 102.00%
deepseek-ai/DeepSeek-R1-0528 | gsm8k | BF16 | 96.1 | BF16 | 96.7 | 100.62%
deepseek-ai/DeepSeek-R1-0528 | ceval | BF16 | 89.6 | BF16 | 90.1 | 100.56%
deepseek-ai/DeepSeek-R1-0528 | ifeval | W8A8-INT8 | 81.7 | W8A8-INT8 | 81.5 | 99.76%
deepseek-ai/DeepSeek-R1-0528 | gsm8k | W8A8-INT8 | 96.3 | W8A8-INT8 | 96.4 | 100.10%
deepseek-ai/DeepSeek-R1-0528 | ceval | W8A8-INT8 | 89.8 | W8A8-INT8 | 90.2 | 100.45%
Qwen/Qwen3-235B-A22B-Instruct-2507 | ifeval | BF16 | 87.2 | W8A8-INT8 mixed-precision quantization | 88.1 | 101.03%
Qwen/Qwen3-235B-A22B-Instruct-2507 | gsm8k | BF16 | 96.6 | W8A8-INT8 mixed-precision quantization | 96.4 | 99.79%
Qwen/Qwen3-235B-A22B-Instruct-2507 | ceval | BF16 | 91.8 | W8A8-INT8 mixed-precision quantization | 91.0 | 99.13%
Note: The Zhenwu 810E does not support FP8. For models originally in FP8 format, their W8A8-quantized version is used as the comparison baseline.
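The accuracy ratio column is consistent with dividing the score after QLean by the original score. The short check below reproduces the DeepSeek-V3.2 rows from the table above. The function name `accuracy_ratio` is hypothetical, and the two-decimal rounding is inferred from the published values.

```python
def accuracy_ratio(original, converted):
    """Score after conversion as a percentage of the original score."""
    return round(converted / original * 100, 2)


# (original accuracy, accuracy after QLean) for deepseek-ai/DeepSeek-V3.2
scores = {
    "ifeval": (88.5, 88.1),
    "gsm8k": (95.3, 96.3),
    "ceval": (92.1, 91.0),
}
ratios = {name: accuracy_ratio(orig, conv) for name, (orig, conv) in scores.items()}
print(ratios)
```

A ratio above 100% means the converted model scored slightly higher than the baseline on that dataset. The document does not say whether differences of this size exceed run-to-run variation.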
vLLM inference accuracy
Test image: pytorch2.9.0-ubuntu24.04-cuda12.9-vllm0.12.0-py312
Model | Dataset | Original data type | Original accuracy (Zhenwu 810E) | QLean data type | Accuracy after QLean | Accuracy ratio
deepseek-ai/DeepSeek-V3.2 | ifeval | W8A8-INT8 | 86.5 | W8A8-INT8 | 88.5 | 102.3%
deepseek-ai/DeepSeek-V3.2 | gsm8k | W8A8-INT8 | 95.6 | W8A8-INT8 | 96.0 | 100.4%
deepseek-ai/DeepSeek-V3.2 | ceval | W8A8-INT8 | 91.4 | W8A8-INT8 | 91.3 | 99.9%
deepseek-ai/DeepSeek-R1-0528 | ifeval | BF16 | 79.1 | BF16 | 80.5 | 101.8%
deepseek-ai/DeepSeek-R1-0528 | gsm8k | BF16 | 96.2 | BF16 | 96.7 | 100.5%
deepseek-ai/DeepSeek-R1-0528 | ceval | BF16 | 89.6 | BF16 | 91.3 | 101.9%
deepseek-ai/DeepSeek-R1-0528 | ifeval | W8A8-INT8 | 79.8 | W8A8-INT8 | 79.6 | 99.7%
deepseek-ai/DeepSeek-R1-0528 | gsm8k | W8A8-INT8 | 96.2 | W8A8-INT8 | 96.7 | 100.5%
deepseek-ai/DeepSeek-R1-0528 | ceval | W8A8-INT8 | 90.1 | W8A8-INT8 | 90.0 | 99.9%
Qwen/Qwen3-235B-A22B-Instruct-2507 | ifeval | BF16 | 88.5 | W8A8-INT8 mixed-precision quantization | 87.2 | 98.5%
Qwen/Qwen3-235B-A22B-Instruct-2507 | gsm8k | BF16 | 96.8 | W8A8-INT8 mixed-precision quantization | 96.5 | 99.7%
Qwen/Qwen3-235B-A22B-Instruct-2507 | ceval | BF16 | 90.6 | W8A8-INT8 mixed-precision quantization | 91.2 | 100.7%
Qwen/Qwen3-235B-A22B-Instruct-2507 | mmlu_pro | BF16 | 79.5 | W8A8-INT8 mixed-precision quantization | 79.2 | 99.6%
Qwen/Qwen3-Next-80B-A3B-Instruct | ifeval | BF16 | 87.2 | W8A8-INT8 | 87.4 | 100.2%
Qwen/Qwen3-Next-80B-A3B-Instruct | gsm8k | BF16 | 96.5 | W8A8-INT8 | 96.2 | 99.7%
Qwen/Qwen3-Next-80B-A3B-Instruct | ceval | BF16 | 90.6 | W8A8-INT8 | 90.2 | 99.6%
Qwen/Qwen3-Next-80B-A3B-Instruct | mmlu_pro | BF16 | 81.2 | W8A8-INT8 | 81.3 | 100.1%
Qwen/Qwen3-VL-30B-A3B-Instruct | mvbench | BF16 | 66.1 | W8A8-INT8 | 65.8 | 99.5%
Qwen/Qwen3-VL-30B-A3B-Instruct | ifeval | BF16 | 84.8 | W8A8-INT8 | 85.2 | 100.5%
Qwen/Qwen3-VL-30B-A3B-Instruct | gsm8k | BF16 | 96.0 | W8A8-INT8 | 96.0 | 100.0%
Qwen/Qwen3-VL-30B-A3B-Instruct | ceval | BF16 | 84.9 | W8A8-INT8 | 84.4 | 99.4%
Note: Since the Zhenwu 810E does not support FP8, this test uses W8A8 weights as the baseline for models with original FP8 weights.
SGLang performance
Test image: pytorch2.8.0-ubuntu24.04-cuda12.9-sglang0.5.5-py312
Model | Input length | Output length | Original type | Original TP | Original concurrency | Original requests | Original total throughput (tokens/s/card) | Original output throughput (tokens/s/card) | Original TTFT (ms) | Original TPOT (ms) | QLean type | QLean TP | QLean concurrency | QLean requests | QLean total throughput (tokens/s/card) | QLean output throughput (tokens/s/card) | QLean TTFT (ms) | QLean TPOT (ms) | Ratio
deepseek-ai/DeepSeek-V3.2 | 800-1000 | 300-500 | W8A8-INT8 | 8 | 32 | 320 | 56.286 | 17.762 | 1479.319 | 100.041 | W8A8-INT8 | 8 | 30 | 300 | 55.364 | 17.444 | 1364.088 | 100.625 | 98.4%
deepseek-ai/DeepSeek-R1-0528 | 2800-5200 | 1050-1950 | BF16 | 16 | 13 | 130 | 76.078 | 20.745 | 668.408 | 36.507 | BF16 | 16 | 13 | 130 | 76.063 | 20.741 | 670.283 | 36.53 | 100.0%
deepseek-ai/DeepSeek-R1-0528 | 2800-5200 | 1050-1950 | W8A8-INT8 | 8 | 13 | 130 | 132.563 | 36.148 | 639.606 | 42.055 | W8A8-INT8 | 8 | 13 | 130 | 132.123 | 36.028 | 638.907 | 42.18 | 99.7%
Qwen/Qwen3-235B-A22B-Instruct-2507 | 2800-5200 | 1050-1950 | BF16 | 8 | 106 | 1060 | 489.726 | 133.54 | 2780.778 | 92.187 | W8A8-INT8 mixed-precision quantization | 4 | 95 | 950 | 832.724 | 227.071 | 2337.891 | 97.278 | 170.0%
Qwen/Qwen3-VL-30B-A3B-Instruct | 2800-5200 | 1050-1950 | BF16 | 2 | 72 | 720 | 2615.439 | 713.185 | 431.293 | 47.78 | W8A8-INT8 | 2 | 98 | 980 | 3371.56 | 919.365 | 430.125 | 50.481 | 128.9%
Note: Since the Zhenwu 810E does not support FP8, the W8A8 weights from this test serve as the baseline for models with original FP8 weights.
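The table does not state which metric the Ratio column is based on. The check below suggests that it is the per-card total throughput of the QLean-converted weights divided by that of the original weights. This is an inference from the published numbers rather than a documented definition, and `throughput_ratio` is a hypothetical name.

```python
def throughput_ratio(original_tps, converted_tps):
    """Converted throughput as a percentage of the original, one decimal place."""
    return round(converted_tps / original_tps * 100, 1)


# deepseek-ai/DeepSeek-V3.2 on SGLang (tokens/s/card), published ratio: 98.4%
from_total = throughput_ratio(56.286, 55.364)   # total throughput columns
from_output = throughput_ratio(17.762, 17.444)  # output throughput columns

# Qwen/Qwen3-235B-A22B-Instruct-2507 on SGLang, published ratio: 170.0%
qwen_total = throughput_ratio(489.726, 832.724)
print(from_total, from_output, qwen_total)
```

Only the total-throughput calculation reproduces the published 98.4% for DeepSeek-V3.2. Because the figures are per card, rows where TP changes (for example, from 8 to 4 for Qwen3-235B-A22B-Instruct-2507) compare throughput per card rather than per deployment.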
vLLM performance
Test image: pytorch2.9.0-ubuntu24.04-cuda12.9-vllm0.12.0-py312
Model | Input length | Output length | Original type | Original TP | Original concurrency | Original requests | Original total throughput (tokens/s/card) | Original output throughput (tokens/s/card) | Original TTFT (ms) | Original TPOT (ms) | QLean type | QLean TP | QLean concurrency | QLean requests | QLean total throughput (tokens/s/card) | QLean output throughput (tokens/s/card) | QLean TTFT (ms) | QLean TPOT (ms) | Ratio
deepseek-ai/DeepSeek-V3.2 | 800-1000 | 300-500 | W8A8-INT8 | 8 | 39 | 390 | 72.259 | 22.777 | 804.091 | 100.271 | W8A8-INT8 | 8 | 40 | 400 | 72.506 | 22.845 | 947.792 | 98.14 | 100.3%
deepseek-ai/DeepSeek-R1-0528 | 2800-5200 | 1050-1950 | BF16 | 16 | 28 | 280 | 107.755 | 29.382 | 2371.914 | 55.35 | BF16 | 16 | 26 | 260 | 105.566 | 28.786 | 589.644 | 53.167 | 98.0%
deepseek-ai/DeepSeek-R1-0528 | 2800-5200 | 1050-1950 | W8A8-INT8 | 8 | 28 | 280 | 194.585 | 53.059 | 2652.877 | 61.402 | W8A8-INT8 | 8 | 26 | 260 | 191.23 | 52.145 | 686.914 | 58.728 | 98.3%
Qwen/Qwen3-235B-A22B-Instruct-2507 | 2800-5200 | 1050-1950 | BF16 | 8 | 120 | 1200 | 507.971 | 138.515 | 2690.231 | 100.796 | W8A8-INT8 mixed-precision quantization | 4 | 104 | 1040 | 899.99 | 245.412 | 956.795 | 98.988 | 177.2%
Qwen/Qwen3-VL-30B-A3B-Instruct | 2800-5200 | 1050-1950 | BF16 | 2 | 82 | 820 | 2822.723 | 769.657 | 321.365 | 50.33 | W8A8-INT8 | 2 | 104 | 1040 | 3628.596 | 988.915 | 287.425 | 49.591 | 128.5%
Note: The Zhenwu 810E does not support FP8. For models with original FP8 weights, W8A8 weights are used for the comparison.
Known issues
Due to its new model architecture, the Qwen3-Next series still requires performance optimization on vLLM after W8A8-INT8 quantization and is temporarily incompatible with newer versions of SGLang.
The framework does not currently support the dequantized DeepSeek-V3.2 model.