QLean model quantization (v0.1.0)

Overview

QLean is a tool for model quantization and dequantization. You switch between modes by specifying a different recipe. It supports the following features:

FP8 → BF16 dequantization: Restores FP8 model weights to BF16 precision for debugging or further processing.
W8A8-INT8 quantization: Applies weight quantization (W8) and activation quantization (A8) to a model to produce a standard INT8 quantized model.
Mixed-precision quantization: Preserves high precision for specific layers of a W8A8 model based on its architecture to achieve a better balance between precision and performance.

Model series	Model	Original precision	Dequantization (FP8 → BF16)	Quantization (→ W8A8 INT8)
DeepSeek	DeepSeek-V3.1	FP8	Y	Y
	DeepSeek-V3.1-Terminus	FP8	Y	Y
	DeepSeek-V3.2-Exp	FP8	Y	Y
	DeepSeek-R1	FP8	Y	Y
	DeepSeek-R1-0528	FP8	Y	Y
Qwen3-MOE	Qwen3-235B-A22B	BF16	NA	Y (mixed-precision quantization)
	Qwen3-235B-A22B-Instruct-2507	BF16	NA	Y (mixed-precision quantization)
	Qwen3-Coder-480B-A35B-Instruct	BF16	NA	Y (mixed-precision quantization)
Qwen3-Next	Qwen3-Next-80B-A3B-Instruct	BF16	NA	Y
Qwen3-VL	Qwen3-VL-235B-A22B-Instruct	BF16	NA	Y
	Qwen3-VL-30B-A3B-Instruct	BF16	NA	Y
	Qwen3-VL-32B/8B/4B/2B-Instruct	BF16	NA	Y

Usage

The tool is released as a wheel package. You can find it on PyPI and install it with the pip install command.

Installation

```
# Install dependencies
pip install triton_kernel==1.0.0+ppu2.0.0.oe

# Install qlean
pip install qlean==0.1.0+ppu2.0.0.oe
```
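
To confirm that both wheels landed in the active environment, a quick check with the standard library's importlib.metadata can help. This is a minimal sketch: the helper name installed_version is hypothetical, and it assumes the installed distribution names match the names passed to pip install above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution names are the ones used in the pip commands.
for name in ("triton_kernel", "qlean"):
    print(name, installed_version(name) or "not installed")
```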

Parameters

--model_name: The name of the model. Required. The format must match the model ID on Hugging Face, such as "deepseek-ai/DeepSeek-V3.2-Exp".
--model_path: The local path to the original model. Required. For example, you can specify the path to a Hugging Face model's local cache.
--save_path: The path to save the dequantized or quantized model. Required.
--mix_path: The path to save the mixed-precision quantized model. Optional.
--recipe: The path to the quantization recipe. Optional. Must be a path to a YAML file.
- For models listed in the supported models table in the Overview section, you can omit --recipe. The system automatically loads the corresponding recipe.
- For other models, you must create and provide a recipe file. For details, see the "Writing a recipe" section.
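
When QLean is driven from a script, the parameters above can be assembled programmatically. The following is a minimal sketch, not part of QLean itself: build_qlean_command is a hypothetical helper, the paths are placeholders, and only the flags documented above are used.

```python
import shlex


def build_qlean_command(model_name, model_path, save_path, mix_path=None, recipe=None):
    """Assemble a qlean argument list from the documented parameters."""
    cmd = [
        "qlean",
        "--model_name", model_name,
        "--model_path", model_path,
        "--save_path", save_path,
    ]
    # --mix_path and --recipe are optional, so add them only when given.
    if mix_path is not None:
        cmd += ["--mix_path", mix_path]
    if recipe is not None:
        cmd += ["--recipe", recipe]
    return cmd


cmd = build_qlean_command(
    "deepseek-ai/DeepSeek-R1-0528",
    "/path/to/DeepSeek-R1-0528/",
    "/path/to/DeepSeek-R1-0528-INT8/",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where qlean is installed.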

Features

You can switch between features by using different recipes. The following examples show how to use each feature.

W8A8-INT8 quantization

The original model can be in FP8 or BF16 format. The system automatically selects an appropriate strategy based on the precision of the original model. If the model is in FP8 format, the system switches to a more precise quantization mode. If the model is in BF16 format, the system uses the general quantization mode.

```
# Replace /path/to/.../ with your actual path

# Original model is FP8
qlean --model_name deepseek-ai/DeepSeek-R1-0528 --model_path /path/to/DeepSeek-R1-0528/ --save_path /path/to/DeepSeek-R1-0528-INT8/

# Original model is BF16
qlean --model_name Qwen/Qwen3-VL-30B-A3B-Instruct --model_path /path/to/Qwen3-VL-30B-A3B-Instruct/ --save_path /path/to/Qwen3-VL-30B-A3B-Instruct-INT8/
```
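
This page does not describe how QLean detects the source precision. To check in advance which case a local checkpoint falls into, one option is to inspect its config.json. The sketch below assumes the Hugging Face convention in which FP8 checkpoints carry a quantization_config entry with quant_method set to "fp8"; the helper name is_fp8_checkpoint is hypothetical and is not QLean's own logic.

```python
import json
from pathlib import Path


def is_fp8_checkpoint(model_path):
    """Return True if config.json declares an FP8 quantization_config.

    Assumption: the checkpoint follows the Hugging Face layout, with a
    config.json at the top level of the model directory.
    """
    config = json.loads((Path(model_path) / "config.json").read_text())
    quant = config.get("quantization_config") or {}
    return quant.get("quant_method") == "fp8"
```

For example, is_fp8_checkpoint("/path/to/DeepSeek-R1-0528/") would be expected to return True for an FP8 checkpoint that follows this convention, and False for a BF16 checkpoint without a quantization_config entry.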

FP8 → BF16 dequantization

```
# Replace /path/to/.../ with your actual path
qlean --model_name deepseek-ai/DeepSeek-V3.2 --model_path /path/to/DeepSeek-V3.2/ --save_path /path/to/DeepSeek-V3.2-BF16/ --recipe /path/to/dequant.yaml
```

```
# Contents of dequant.yaml
---
dequant_stage:
  dequant_modifiers:
    dequantModifier: {}
...
```
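
This page does not describe how QLean performs the conversion internally. For intuition only, FP8 checkpoints in the DeepSeek style store one scale per weight block, and dequantization multiplies each block by its scale. The sketch below assumes that block-wise layout and uses plain Python lists with ordinary floats in place of real FP8 and BF16 tensors; dequantize_blockwise is a hypothetical name.

```python
def dequantize_blockwise(weight, scales, block_size=128):
    """Multiply each (block_size x block_size) tile of `weight` by its scale.

    weight: 2-D list of numbers standing in for FP8 values.
    scales: 2-D list with one scale per tile (assumed layout).
    """
    out = []
    for i, row in enumerate(weight):
        scale_row = scales[i // block_size]
        out.append([v * scale_row[j // block_size] for j, v in enumerate(row)])
    return out


# Toy 2x4 matrix split into two 2x2 blocks, one scale per block.
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
s = [[0.5, 2.0]]
print(dequantize_blockwise(w, s, block_size=2))
# [[0.5, 1.0, 6.0, 8.0], [2.5, 3.0, 14.0, 16.0]]
```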

Mixed-precision quantization

After W8A8-INT8 quantization, some models lose accuracy in specific use cases because certain layers are sensitive to precision. The mixed-precision quantization feature replaces those layers in the W8A8-quantized model with their BF16 versions. In this case, --save_path points to the W8A8 model, and --mix_path specifies where the mixed-precision model is saved.

```
# Replace /path/to/.../ with your actual path
qlean --model_name Qwen/Qwen3-235B-A22B-Instruct-2507 --model_path /path/to/Qwen3-235B-A22B-Instruct-2507/ --save_path /path/to/Qwen3-235B-A22B-Instruct-2507-INT8/ --mix_path /path/to/Qwen3-235B-A22B-Instruct-2507-MIX/
```

Writing a recipe

To perform W8A8-INT8 quantization on a model that is not in the supported models table in the Overview section, create a YAML recipe file based on the following example. You only need to modify the ignore list.

```
---
quant_stage:
  quant_modifiers:
    generalDay0Modifier:
      ignore: ["module_to_ignore"] # Modify this based on your requirements
      scheme: W8A8
...
```

When you build the ignore list, include every module that is sensitive to quantization. Modules that typically should not be quantized include re:.*lm_head, re:.*embed_tokens, and re:.*mlp.gate$. The modules to ignore vary by model architecture, so adjust the list as needed.
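
To preview which modules a given ignore list would cover, the patterns can be tried against module names before running QLean. The sketch below is an illustration under two assumptions that this page does not confirm: entries prefixed with re: are regular expressions matched from the start of the module name, and other entries are exact names. The module names are hypothetical examples in Hugging Face style.

```python
import re


def is_ignored(module_name, ignore):
    """Return True if module_name is covered by any entry in the ignore list."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Assumption: the text after "re:" is a regular expression.
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


ignore = ["re:.*lm_head", "re:.*embed_tokens", "re:.*mlp.gate$"]
for name in [
    "lm_head",
    "model.embed_tokens",
    "model.layers.3.mlp.gate",
    "model.layers.3.mlp.gate_proj",
    "model.layers.3.self_attn.q_proj",
]:
    print(name, is_ignored(name, ignore))
```

Under these assumptions, the trailing $ in re:.*mlp.gate$ is what keeps the pattern from also matching model.layers.3.mlp.gate_proj.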

To perform mixed-precision quantization on a model that is not in the supported models table in the Overview section, add a mixedPrecisionModifier section to the preceding recipe. The layers parameter lists the IDs of the layers to replace with their BF16 versions, either as single IDs or as ranges such as '92-93', and num_layers is the total number of layers in the model.

```
---
quant_stage:
  quant_modifiers:
    generalDay0Modifier:
      ignore: ["re:.*lm_head", "re:.*embed_tokens", "re:.*mlp.gate$"] # Modify this based on your requirements
      scheme: W8A8
    mixedPrecisionModifier:
      layers: ['88', '89', '92-93']  # Modify this based on your requirements
      num_layers: 94                 # Modify this based on your requirements
...
```
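
Before running a long quantization job, it can be useful to sanity-check the layers list against num_layers. The sketch below expands the entries into explicit layer IDs. It assumes that a range such as '92-93' is inclusive and that layer IDs are zero-based, neither of which this page states; expand_layers is a hypothetical helper.

```python
def expand_layers(layers, num_layers):
    """Expand entries like '88' or '92-93' into a sorted list of layer IDs."""
    ids = set()
    for spec in layers:
        start, _, end = spec.partition("-")
        lo, hi = int(start), int(end or start)
        if lo > hi:
            raise ValueError(f"invalid range: {spec!r}")
        # Assumption: ranges include both endpoints.
        ids.update(range(lo, hi + 1))
    # Assumption: valid IDs run from 0 to num_layers - 1.
    out_of_range = sorted(i for i in ids if not 0 <= i < num_layers)
    if out_of_range:
        raise ValueError(f"layer IDs outside 0..{num_layers - 1}: {out_of_range}")
    return sorted(ids)


print(expand_layers(['88', '89', '92-93'], 94))  # [88, 89, 92, 93]
```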

After creating the YAML file, pass its path using the --recipe parameter.

```
# Replace /name/of/.../ and /path/to/... with your actual names and paths
qlean --model_name /name/of/model/ --model_path /path/to/original_model/ --save_path /path/to/result_model/ --recipe /path/to/your/recipe.yaml
```

Example: W8A8-INT8 quantization of DeepSeek-V3.2

Step 1: Create a deepseek-v3.2-recipe.yaml file.

```
---
quant_stage:
  quant_modifiers:
    generalDay0Modifier:
      ignore: ["re:.*lm_head", "re:.*embed_tokens", "re:.*mlp.gate$"]
      scheme: W8A8
...
```

Step 2: Run the quantization command, specifying the model name, the original model path, the save path for the quantized model, and the recipe file.

```
qlean --model_name deepseek-ai/DeepSeek-V3.2 --model_path /path/to/DeepSeek-V3.2/ --save_path /path/to/DeepSeek-V3.2-INT8/ --recipe /path/to/deepseek-v3.2-recipe.yaml
```

Test data

Tests show that models converted using the QLean tool perform as expected in terms of inference accuracy and performance. The details are as follows:

SGLang inference accuracy

Test image: pytorch2.8.0-ubuntu24.04-cuda12.9-sglang0.5.5-py312

Model	Dataset	Original data type	Original accuracy	QLean data type	QLean accuracy	Accuracy ratio
deepseek-ai/DeepSeek-V3.2	ifeval	W8A8-INT8	88.5	W8A8-INT8	88.1	99.55%
	gsm8k	W8A8-INT8	95.3	W8A8-INT8	96.3	101.05%
	ceval	W8A8-INT8	92.1	W8A8-INT8	91.0	98.81%
deepseek-ai/DeepSeek-R1-0528	ifeval	BF16	80.2	BF16	81.8	102.00%
	gsm8k	BF16	96.1	BF16	96.7	100.62%
	ceval	BF16	89.6	BF16	90.1	100.56%
deepseek-ai/DeepSeek-R1-0528	ifeval	W8A8-INT8	81.7	W8A8-INT8	81.5	99.76%
	gsm8k	W8A8-INT8	96.3	W8A8-INT8	96.4	100.10%
	ceval	W8A8-INT8	89.8	W8A8-INT8	90.2	100.45%
Qwen/Qwen3-235B-A22B-Instruct-2507	ifeval	BF16	87.2	W8A8-INT8 mixed quantization	88.1	101.03%
	gsm8k	BF16	96.6	W8A8-INT8 mixed quantization	96.4	99.79%
	ceval	BF16	91.8	W8A8-INT8 mixed quantization	91.0	99.13%

Note: The Zhenwu 810E does not support FP8. For models originally in FP8 format, their W8A8-quantized version is used as the comparison baseline.
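
The accuracy ratio column is consistent with dividing the QLean score by the original score. As a minimal sketch of that arithmetic (accuracy_ratio is a hypothetical helper, and the inputs are the DeepSeek-V3.2 rows from the table above):

```python
def accuracy_ratio(original, converted, digits=2):
    """Return converted / original as a percentage string."""
    return f"{converted / original * 100:.{digits}f}%"


print(accuracy_ratio(88.5, 88.1))  # ifeval: 99.55%
print(accuracy_ratio(95.3, 96.3))  # gsm8k: 101.05%
print(accuracy_ratio(92.1, 91.0))  # ceval: 98.81%
```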

vLLM inference accuracy

Test image: pytorch2.9.0-ubuntu24.04-cuda12.9-vllm0.12.0-py312

Model	Dataset	Original data type (Zhenwu 810E)	Original accuracy	QLean data type	QLean accuracy	Accuracy ratio
deepseek-ai/DeepSeek-V3.2	ifeval	W8A8-INT8	86.5	W8A8-INT8	88.5	102.3%
	gsm8k	W8A8-INT8	95.6	W8A8-INT8	96.0	100.4%
	ceval	W8A8-INT8	91.4	W8A8-INT8	91.3	99.9%
deepseek-ai/DeepSeek-R1-0528	ifeval	BF16	79.1	BF16	80.5	101.8%
	gsm8k	BF16	96.2	BF16	96.7	100.5%
	ceval	BF16	89.6	BF16	91.3	101.9%
deepseek-ai/DeepSeek-R1-0528	ifeval	W8A8-INT8	79.8	W8A8-INT8	79.6	99.7%
	gsm8k	W8A8-INT8	96.2	W8A8-INT8	96.7	100.5%
	ceval	W8A8-INT8	90.1	W8A8-INT8	90.0	99.9%
Qwen/Qwen3-235B-A22B-Instruct-2507	ifeval	BF16	88.5	W8A8-INT8 mixed quantization	87.2	98.5%
	gsm8k	BF16	96.8	W8A8-INT8 mixed quantization	96.5	99.7%
	ceval	BF16	90.6	W8A8-INT8 mixed quantization	91.2	100.7%
	mmlu_pro	BF16	79.5	W8A8-INT8 mixed quantization	79.2	99.6%
Qwen/Qwen3-Next-80B-A3B-Instruct	ifeval	BF16	87.2	W8A8-INT8	87.4	100.2%
	gsm8k	BF16	96.5	W8A8-INT8	96.2	99.7%
	ceval	BF16	90.6	W8A8-INT8	90.2	99.6%
	mmlu_pro	BF16	81.2	W8A8-INT8	81.3	100.1%
Qwen/Qwen3-VL-30B-A3B-Instruct	mvbench	BF16	66.1	W8A8-INT8	65.8	99.5%
	ifeval	BF16	84.8	W8A8-INT8	85.2	100.5%
	gsm8k	BF16	96.0	W8A8-INT8	96.0	100.0%
	ceval	BF16	84.9	W8A8-INT8	84.4	99.4%

Note: Since the Zhenwu 810E does not support FP8, this test uses W8A8 weights as the baseline for models with original FP8 weights.

SGLang performance

Test image: pytorch2.8.0-ubuntu24.04-cuda12.9-sglang0.5.5-py312

Model	input_len	output_len	Original type	Original TP	Original concurrency	Original requests	Original total throughput (tokens/s/card)	Original output throughput (tokens/s/card)	Original TTFT (ms)	Original TPOT (ms)	QLean type	QLean TP	QLean concurrency	QLean requests	QLean total throughput (tokens/s/card)	QLean output throughput (tokens/s/card)	QLean TTFT (ms)	QLean TPOT (ms)	Ratio
deepseek-ai/DeepSeek-V3.2	800-1000	300-500	W8A8-INT8	8	32	320	56.286	17.762	1479.319	100.041	W8A8-INT8	8	30	300	55.364	17.444	1364.088	100.625	98.4%
deepseek-ai/DeepSeek-R1-0528	2800-5200	1050-1950	BF16	16	13	130	76.078	20.745	668.408	36.507	BF16	16	13	130	76.063	20.741	670.283	36.53	100.0%
deepseek-ai/DeepSeek-R1-0528	2800-5200	1050-1950	W8A8-INT8	8	13	130	132.563	36.148	639.606	42.055	W8A8-INT8	8	13	130	132.123	36.028	638.907	42.18	99.7%
Qwen/Qwen3-235B-A22B-Instruct-2507	2800-5200	1050-1950	BF16	8	106	1060	489.726	133.54	2780.778	92.187	W8A8-INT8 mixed-precision quantization	4	95	950	832.724	227.071	2337.891	97.278	170.0%
Qwen/Qwen3-VL-30B-A3B-Instruct	2800-5200	1050-1950	BF16	2	72	720	2615.439	713.185	431.293	47.78	W8A8-INT8	2	98	980	3371.56	919.365	430.125	50.481	128.9%

Note: Since the Zhenwu 810E does not support FP8, this test uses W8A8 weights as the baseline for models with original FP8 weights.
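
The Ratio column is not defined on this page, but the published values are consistent with dividing the QLean total throughput per card by the original total throughput per card. The sketch below recomputes it for the five SGLang rows; throughput_ratio is a hypothetical helper and the numbers are copied from the table above.

```python
def throughput_ratio(original, converted):
    """Return converted / original as a percentage string with one decimal."""
    return f"{converted / original * 100:.1f}%"


# (model, original total throughput, QLean total throughput, published ratio)
rows = [
    ("DeepSeek-V3.2 (W8A8-INT8)", 56.286, 55.364, "98.4%"),
    ("DeepSeek-R1-0528 (BF16)", 76.078, 76.063, "100.0%"),
    ("DeepSeek-R1-0528 (W8A8-INT8)", 132.563, 132.123, "99.7%"),
    ("Qwen3-235B-A22B-Instruct-2507", 489.726, 832.724, "170.0%"),
    ("Qwen3-VL-30B-A3B-Instruct", 2615.439, 3371.56, "128.9%"),
]

for model, original, converted, published in rows:
    print(model, throughput_ratio(original, converted), "published:", published)
```

Because throughput is reported per card, the Qwen3-235B-A22B-Instruct-2507 row compares a TP 8 BF16 deployment with a TP 4 mixed-precision deployment on a per-card basis.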

vLLM performance

Test image: pytorch2.9.0-ubuntu24.04-cuda12.9-vllm0.12.0-py312

Model	input_len	output_len	Original type	Original TP	Original concurrency	Original requests	Original total throughput (tokens/s/card)	Original output throughput (tokens/s/card)	Original TTFT (ms)	Original TPOT (ms)	QLean type	QLean TP	QLean concurrency	QLean requests	QLean total throughput (tokens/s/card)	QLean output throughput (tokens/s/card)	QLean TTFT (ms)	QLean TPOT (ms)	Ratio
deepseek-ai/DeepSeek-V3.2	800-1000	300-500	W8A8-INT8	8	39	390	72.259	22.777	804.091	100.271	W8A8-INT8	8	40	400	72.506	22.845	947.792	98.14	100.3%
deepseek-ai/DeepSeek-R1-0528	2800-5200	1050-1950	BF16	16	28	280	107.755	29.382	2371.914	55.35	BF16	16	26	260	105.566	28.786	589.644	53.167	98.0%
deepseek-ai/DeepSeek-R1-0528	2800-5200	1050-1950	W8A8-INT8	8	28	280	194.585	53.059	2652.877	61.402	W8A8-INT8	8	26	260	191.23	52.145	686.914	58.728	98.3%
Qwen/Qwen3-235B-A22B-Instruct-2507	2800-5200	1050-1950	BF16	8	120	1200	507.971	138.515	2690.231	100.796	W8A8-INT8 mixed quantization	4	104	1040	899.99	245.412	956.795	98.988	177.2%
Qwen/Qwen3-VL-30B-A3B-Instruct	2800-5200	1050-1950	BF16	2	82	820	2822.723	769.657	321.365	50.33	W8A8-INT8	2	104	1040	3628.596	988.915	287.425	49.591	128.5%

Note: The Zhenwu 810E does not support FP8. For models with original FP8 weights, W8A8 weights are used for the comparison.

Known issues

Because the Qwen3-Next series uses a new model architecture, its W8A8-INT8 quantized models still require performance optimization on vLLM and are temporarily incompatible with later versions of SGLang.
The framework does not currently support the dequantized DeepSeek-V3.2 model.