QLean model quantization (v0.1.0)

Overview

QLean is a tool for quantizing and dequantizing model weights. You switch between modes by specifying a different recipe. It supports the following features:

  • FP8 → BF16 dequantization: Restores FP8 model weights to BF16 precision for debugging or further processing.

  • W8A8-INT8 quantization: Applies weight quantization (W8) and activation quantization (A8) to a model to produce a standard INT8 quantized model.

  • Mixed-precision quantization: Preserves high precision for specific layers of a W8A8 model based on its architecture to achieve a better balance between precision and performance.
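As background for the W8A8-INT8 feature above, the following is a minimal sketch of symmetric 8-bit quantization in plain Python. It illustrates the general idea only (map floats to integer codes with one shared scale, then map back) and is not QLean's actual algorithm; the function names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: return (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


weights = [0.02, -1.27, 0.64, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
for w, c, r in zip(weights, codes, restored):
    print(f"original={w:+.4f} code={c:+4d} restored={r:+.4f}")
```

The rounding error of each value is at most half a scale step, which is why layers with a few very large values (a large scale) tend to lose more precision than others.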

Supported models

| Series | Model | Original precision | Dequantization (FP8 → BF16) | Quantization (→ W8A8 INT8) |
| --- | --- | --- | --- | --- |
| DeepSeek | DeepSeek-V3.1 | FP8 | Y | Y |
| DeepSeek | DeepSeek-V3.1-Terminus | FP8 | Y | Y |
| DeepSeek | DeepSeek-V3.2-Exp | FP8 | Y | Y |
| DeepSeek | DeepSeek-R1 | FP8 | Y | Y |
| DeepSeek | DeepSeek-R1-0528 | FP8 | Y | Y |
| Qwen3-MOE | Qwen3-235B-A22B | BF16 | NA | Y (mixed-precision quantization) |
| Qwen3-MOE | Qwen3-235B-A22B-Instruct-2507 | BF16 | NA | Y (mixed-precision quantization) |
| Qwen3-MOE | Qwen3-Coder-480B-A35B-Instruct | BF16 | NA | Y (mixed-precision quantization) |
| Qwen3-Next | Qwen3-Next-80B-A3B-Instruct | BF16 | NA | Y |
| Qwen3-VL | Qwen3-VL-235B-A22B-Instruct | BF16 | NA | Y |
| Qwen3-VL | Qwen3-VL-30B-A3B-Instruct | BF16 | NA | Y |
| Qwen3-VL | Qwen3-VL-32B/8B/4B/2B-Instruct | BF16 | NA | Y |

Usage

The tool is released as a wheel package. You can find it on PyPI and install it with the pip install command.

Installation

# Install dependencies
pip install triton_kernel==1.0.0+ppu2.0.0.oe

# Install qlean
pip install qlean==0.1.0+ppu2.0.0.oe

Parameters

  • --model_name: The name of the model. Required. The format must match the model ID on Hugging Face, such as "deepseek-ai/DeepSeek-V3.2-Exp".

  • --model_path: The local path to the original model. Required. For example, you can specify the path to a Hugging Face model's local cache.

  • --save_path: The path to save the dequantized or quantized model. Required.

  • --mix_path: The path to save the mixed-precision quantized model. Optional.

  • --recipe: The path to the quantization recipe. Optional. Must be a path to a YAML file.

    • For models listed in the "Supported models" table, you can omit the --recipe parameter. The system automatically loads the corresponding recipe.

    • For other models, you must create and provide a recipe file. For details on creating a recipe, see the "Writing a recipe" section.

Features

You can switch between features by using different recipes. The following examples show how to use each feature.

  1. W8A8-INT8 quantization

    The original model can be in FP8 or BF16 format. The system automatically selects an appropriate strategy based on the precision of the original model. If the model is in FP8 format, the system switches to a more precise quantization mode. If the model is in BF16 format, the system uses the general quantization mode.

    # Replace /path/to/.../ with your actual path
    
    # Original model is FP8
    qlean --model_name deepseek-ai/DeepSeek-R1-0528 --model_path /path/to/DeepSeek-R1-0528/ --save_path /path/to/DeepSeek-R1-0528-INT8/
    
    # Original model is BF16
    qlean --model_name Qwen/Qwen3-VL-30B-A3B-Instruct --model_path /path/to/Qwen3-VL-30B-A3B-Instruct/ --save_path /path/to/Qwen3-VL-30B-A3B-Instruct-INT8/

  2. FP8 → BF16 dequantization

    # Replace /path/to/.../ with your actual path
    qlean --model_name deepseek-ai/DeepSeek-V3.2 --model_path /path/to/DeepSeek-V3.2/ --save_path /path/to/DeepSeek-V3.2-BF16/ --recipe /path/to/dequant.yaml

    # Contents of dequant.yaml
    ---
    dequant_stage:
      dequant_modifiers:
        dequantModifier: {}
    ...

  3. Mixed-precision quantization

    After W8A8-INT8 quantization, some models may show a drop in accuracy for specific use cases, because certain layers are sensitive to precision. The mixed-precision quantization feature replaces these layers in the W8A8-quantized model with their BF16 versions. In this case, --save_path points to the W8A8 model, and --mix_path specifies where to save the mixed-precision model.

    # Replace /path/to/.../ with your actual path
    qlean --model_name Qwen/Qwen3-235B-A22B-Instruct-2507 --model_path /path/to/Qwen3-235B-A22B-Instruct-2507/ --save_path /path/to/Qwen3-235B-A22B-Instruct-2507-INT8/ --mix_path /path/to/Qwen3-235B-A22B-Instruct-2507-MIX/
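
Feature 1 above chooses its quantization mode from the precision of the original model. If you want to check in advance which case a local checkpoint falls into, the sketch below reads the Hugging Face style config.json in the model directory. It is only an illustration: the helper name `original_precision` is hypothetical, the keys it reads (`quantization_config.quant_method`, `torch_dtype` or `dtype`) are an assumption about how the checkpoint is described, and QLean's own detection may work differently.

```python
import json
from pathlib import Path


def original_precision(model_path):
    """Guess the checkpoint precision from <model_path>/config.json."""
    config = json.loads((Path(model_path) / "config.json").read_text(encoding="utf-8"))
    # FP8 checkpoints are assumed to declare a quantization_config block.
    quant = config.get("quantization_config") or {}
    if quant.get("quant_method") == "fp8":
        return "FP8"
    # Otherwise fall back to the declared tensor dtype, if any.
    dtype = str(config.get("torch_dtype") or config.get("dtype") or "").lower()
    return "BF16" if dtype == "bfloat16" else "unknown"
```

For example, `original_precision("/path/to/DeepSeek-R1-0528/")` would be expected to return "FP8" for a checkpoint whose config.json declares an FP8 quantization method.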

Writing a recipe

To perform W8A8-INT8 quantization on a model that is not listed in the "Supported models" table, create a YAML recipe file based on the following example. You only need to modify the ignore list.

---
quant_stage:
  quant_modifiers:
    generalDay0Modifier:
      ignore: ["module_to_ignore"] # Modify this based on your requirements
      scheme: W8A8
...

When you create the ignore list in your quantization configuration, you should add all quantization-sensitive modules. Modules that should typically not be quantized include re:.*lm_head, re:.*embed_tokens, and re:.*mlp.gate$. Note that the modules that need to be ignored vary for different model architectures, and you should add them as needed.
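
To preview which modules an ignore list would exclude, you can test the patterns against module names offline. The sketch below assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name and that other entries must equal the name exactly; this matching rule and the sample module names are assumptions for illustration, not confirmed QLean behavior.

```python
import re

IGNORE = ["re:.*lm_head", "re:.*embed_tokens", "re:.*mlp.gate$"]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if any ignore entry matches the module name."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Assumed rule: regex entries are matched from the start of the name.
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
sample_names = [
    "lm_head",
    "model.embed_tokens",
    "model.layers.3.mlp.gate",
    "model.layers.3.mlp.gate_proj",
    "model.layers.3.self_attn.q_proj",
]
for name in sample_names:
    print(f"{name}: {'ignored' if is_ignored(name) else 'quantized'}")
```

Under these assumptions, the trailing `$` in `re:.*mlp.gate$` is what keeps a name such as `mlp.gate_proj` from being ignored along with the router gate.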

To perform mixed-precision quantization on a model that is not listed in the "Supported models" table, add a mixedPrecisionModifier section to the preceding recipe. The layers parameter specifies the IDs of the layers to replace with their BF16 versions, and num_layers specifies the total number of layers in the model.

---
quant_stage:
  quant_modifiers:
    generalDay0Modifier:
      ignore: ["re:.*lm_head", "re:.*embed_tokens", "re:.*mlp.gate$"] # Modify this based on your requirements
      scheme: W8A8
    mixedPrecisionModifier:
      layers: ['88', '89', '92-93']  # Modify this based on your requirements
      num_layers: 94                 # Modify this based on your requirements
...
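
Before running a mixed-precision recipe, it can help to check the layers list against num_layers. The sketch below expands entries such as '92-93' into individual layer IDs. It assumes that a dash denotes an inclusive range and that layer IDs are zero-based (0 to num_layers - 1); both are inferences from the example above rather than documented behavior, and `expand_layers` is a hypothetical helper.

```python
def expand_layers(layers, num_layers):
    """Expand entries such as '88' or '92-93' into a sorted list of layer IDs."""
    ids = set()
    for entry in layers:
        start, _, end = entry.partition("-")
        first = int(start)
        last = int(end) if end else first
        if first > last:
            raise ValueError(f"invalid range: {entry!r}")
        ids.update(range(first, last + 1))  # assumed inclusive on both ends
    out_of_range = sorted(i for i in ids if not 0 <= i < num_layers)
    if out_of_range:
        raise ValueError(f"layer IDs outside 0..{num_layers - 1}: {out_of_range}")
    return sorted(ids)


print(expand_layers(["88", "89", "92-93"], num_layers=94))
```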

After creating the YAML file, pass its path using the --recipe parameter.

# Replace /name/of/.../ and /path/to/.../ with your actual names and paths
qlean --model_name /name/of/model/ --model_path /path/to/original_model/ --save_path /path/to/result_model/ --recipe /path/to/your/recipe.yaml

Example: W8A8-INT8 quantization of DeepSeek-V3.2

Step 1: Create a deepseek-v3.2-recipe.yaml file.

---
quant_stage:
  quant_modifiers:
    generalDay0Modifier:
      ignore: ["re:.*lm_head", "re:.*embed_tokens", "re:.*mlp.gate$"]
      scheme: W8A8
...

Step 2: Run the quantization command, specifying the model name, the original model path, the save path for the quantized model, and the recipe file.

qlean --model_name deepseek-ai/DeepSeek-V3.2 --model_path /path/to/DeepSeek-V3.2/ --save_path /path/to/DeepSeek-V3.2-INT8/ --recipe /path/to/deepseek-v3.2-recipe.yaml 
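
If you script this workflow, the two steps can be combined. The sketch below keeps the recipe from Step 1 as a string and assembles the Step 2 command from the four parameters documented above. The helper names are hypothetical, the paths are placeholders, and actually running the command (for example with `subprocess.run`) is left to the caller.

```python
import shlex
from pathlib import Path

RECIPE = """\
---
quant_stage:
  quant_modifiers:
    generalDay0Modifier:
      ignore: ["re:.*lm_head", "re:.*embed_tokens", "re:.*mlp.gate$"]
      scheme: W8A8
...
"""


def write_recipe(path):
    """Write the recipe text to the given path and return the path."""
    Path(path).write_text(RECIPE, encoding="utf-8")
    return str(path)


def build_command(model_name, model_path, save_path, recipe_path):
    """Assemble the qlean invocation as an argument list."""
    return [
        "qlean",
        "--model_name", model_name,
        "--model_path", model_path,
        "--save_path", save_path,
        "--recipe", recipe_path,
    ]


command = build_command(
    "deepseek-ai/DeepSeek-V3.2",
    "/path/to/DeepSeek-V3.2/",
    "/path/to/DeepSeek-V3.2-INT8/",
    "/path/to/deepseek-v3.2-recipe.yaml",
)
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.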

Test data

In the tests below, models converted with QLean reach 98.5% to 102.3% of the baseline accuracy and 98.0% to 177.2% of the baseline throughput. The details are as follows:

SGLang inference accuracy

Test image: pytorch2.8.0-ubuntu24.04-cuda12.9-sglang0.5.5-py312

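| Model | Dataset | Original data type | Original accuracy | Data type after QLean | Accuracy after QLean | Accuracy ratio |
| --- | --- | --- | --- | --- | --- | --- |
| deepseek-ai/DeepSeek-V3.2 | ifeval | W8A8-INT8 | 88.5 | W8A8-INT8 | 88.1 | 99.55% |
| deepseek-ai/DeepSeek-V3.2 | gsm8k | W8A8-INT8 | 95.3 | W8A8-INT8 | 96.3 | 101.05% |
| deepseek-ai/DeepSeek-V3.2 | ceval | W8A8-INT8 | 92.1 | W8A8-INT8 | 91.0 | 98.81% |
| deepseek-ai/DeepSeek-R1-0528 | ifeval | BF16 | 80.2 | BF16 | 81.8 | 102.00% |
| deepseek-ai/DeepSeek-R1-0528 | gsm8k | BF16 | 96.1 | BF16 | 96.7 | 100.62% |
| deepseek-ai/DeepSeek-R1-0528 | ceval | BF16 | 89.6 | BF16 | 90.1 | 100.56% |
| deepseek-ai/DeepSeek-R1-0528 | ifeval | W8A8-INT8 | 81.7 | W8A8-INT8 | 81.5 | 99.76% |
| deepseek-ai/DeepSeek-R1-0528 | gsm8k | W8A8-INT8 | 96.3 | W8A8-INT8 | 96.4 | 100.10% |
| deepseek-ai/DeepSeek-R1-0528 | ceval | W8A8-INT8 | 89.8 | W8A8-INT8 | 90.2 | 100.45% |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | ifeval | BF16 | 87.2 | W8A8-INT8 mixed quantization | 88.1 | 101.03% |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | gsm8k | BF16 | 96.6 | W8A8-INT8 mixed quantization | 96.4 | 99.79% |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | ceval | BF16 | 91.8 | W8A8-INT8 mixed quantization | 91.0 | 99.13% |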

Note: The Zhenwu 810E does not support FP8. For models originally in FP8 format, their W8A8-quantized version is used as the comparison baseline.
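
The accuracy ratio reported above is the accuracy after QLean divided by the original accuracy, expressed as a percentage. A minimal sketch that recomputes it for a few rows of the table (the function name `accuracy_ratio` is hypothetical):

```python
def accuracy_ratio(original, converted):
    """Accuracy after conversion as a percentage of the original accuracy."""
    return converted / original * 100


# (model, dataset, original accuracy, accuracy after QLean) from the table above
rows = [
    ("deepseek-ai/DeepSeek-V3.2", "ifeval", 88.5, 88.1),
    ("deepseek-ai/DeepSeek-V3.2", "gsm8k", 95.3, 96.3),
    ("Qwen/Qwen3-235B-A22B-Instruct-2507", "ceval", 91.8, 91.0),
]
for model, dataset, original, converted in rows:
    print(f"{model} {dataset}: {accuracy_ratio(original, converted):.2f}%")
```

A ratio slightly above 100% means the converted model scored higher than the baseline on that run; differences of this size may fall within normal run-to-run variation.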

vLLM inference accuracy

Test image: pytorch2.9.0-ubuntu24.04-cuda12.9-vllm0.12.0-py312

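| Model | Dataset | Original type (Zhenwu 810E) | Original accuracy (Zhenwu 810E) | Type after QLean | Accuracy after QLean | Accuracy ratio |
| --- | --- | --- | --- | --- | --- | --- |
| deepseek-ai/DeepSeek-V3.2 | ifeval | W8A8-INT8 | 86.5 | W8A8-INT8 | 88.5 | 102.3% |
| deepseek-ai/DeepSeek-V3.2 | gsm8k | W8A8-INT8 | 95.6 | W8A8-INT8 | 96.0 | 100.4% |
| deepseek-ai/DeepSeek-V3.2 | ceval | W8A8-INT8 | 91.4 | W8A8-INT8 | 91.3 | 99.9% |
| deepseek-ai/DeepSeek-R1-0528 | ifeval | BF16 | 79.1 | BF16 | 80.5 | 101.8% |
| deepseek-ai/DeepSeek-R1-0528 | gsm8k | BF16 | 96.2 | BF16 | 96.7 | 100.5% |
| deepseek-ai/DeepSeek-R1-0528 | ceval | BF16 | 89.6 | BF16 | 91.3 | 101.9% |
| deepseek-ai/DeepSeek-R1-0528 | ifeval | W8A8-INT8 | 79.8 | W8A8-INT8 | 79.6 | 99.7% |
| deepseek-ai/DeepSeek-R1-0528 | gsm8k | W8A8-INT8 | 96.2 | W8A8-INT8 | 96.7 | 100.5% |
| deepseek-ai/DeepSeek-R1-0528 | ceval | W8A8-INT8 | 90.1 | W8A8-INT8 | 90.0 | 99.9% |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | ifeval | BF16 | 88.5 | W8A8-INT8 mixed quantization | 87.2 | 98.5% |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | gsm8k | BF16 | 96.8 | W8A8-INT8 mixed quantization | 96.5 | 99.7% |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | ceval | BF16 | 90.6 | W8A8-INT8 mixed quantization | 91.2 | 100.7% |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | mmlu_pro | BF16 | 79.5 | W8A8-INT8 mixed quantization | 79.2 | 99.6% |
| Qwen/Qwen3-Next-80B-A3B-Instruct | ifeval | BF16 | 87.2 | W8A8-INT8 | 87.4 | 100.2% |
| Qwen/Qwen3-Next-80B-A3B-Instruct | gsm8k | BF16 | 96.5 | W8A8-INT8 | 96.2 | 99.7% |
| Qwen/Qwen3-Next-80B-A3B-Instruct | ceval | BF16 | 90.6 | W8A8-INT8 | 90.2 | 99.6% |
| Qwen/Qwen3-Next-80B-A3B-Instruct | mmlu_pro | BF16 | 81.2 | W8A8-INT8 | 81.3 | 100.1% |
| Qwen/Qwen3-VL-30B-A3B-Instruct | mvbench | BF16 | 66.1 | W8A8-INT8 | 65.8 | 99.5% |
| Qwen/Qwen3-VL-30B-A3B-Instruct | ifeval | BF16 | 84.8 | W8A8-INT8 | 85.2 | 100.5% |
| Qwen/Qwen3-VL-30B-A3B-Instruct | gsm8k | BF16 | 96.0 | W8A8-INT8 | 96.0 | 100.0% |
| Qwen/Qwen3-VL-30B-A3B-Instruct | ceval | BF16 | 84.9 | W8A8-INT8 | 84.4 | 99.4% |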

Note: Since the Zhenwu 810E does not support FP8, this test uses W8A8 weights as the baseline for models with original FP8 weights.

SGLang performance

Test image: pytorch2.8.0-ubuntu24.04-cuda12.9-sglang0.5.5-py312

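Each configuration is listed as two rows: the original weights, then the QLean-converted weights together with the reported ratio.

| Model | input_len | output_len | Weights | Type | TP | Concurrency | Requests | Total throughput (tokens/s/card) | Output throughput (tokens/s/card) | TTFT (ms) | TPOT (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek-ai/DeepSeek-V3.2 | 800-1000 | 300-500 | Original | W8A8-INT8 | 8 | 32 | 320 | 56.286 | 17.762 | 1479.319 | 100.041 | - |
| deepseek-ai/DeepSeek-V3.2 | 800-1000 | 300-500 | QLean-converted | W8A8-INT8 | 8 | 30 | 300 | 55.364 | 17.444 | 1364.088 | 100.625 | 98.4% |
| deepseek-ai/DeepSeek-R1-0528 | 2800-5200 | 1050-1950 | Original | BF16 | 16 | 13 | 130 | 76.078 | 20.745 | 668.408 | 36.507 | - |
| deepseek-ai/DeepSeek-R1-0528 | 2800-5200 | 1050-1950 | QLean-converted | BF16 | 16 | 13 | 130 | 76.063 | 20.741 | 670.283 | 36.53 | 100.0% |
| deepseek-ai/DeepSeek-R1-0528 | 2800-5200 | 1050-1950 | Original | W8A8-INT8 | 8 | 13 | 130 | 132.563 | 36.148 | 639.606 | 42.055 | - |
| deepseek-ai/DeepSeek-R1-0528 | 2800-5200 | 1050-1950 | QLean-converted | W8A8-INT8 | 8 | 13 | 130 | 132.123 | 36.028 | 638.907 | 42.18 | 99.7% |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | 2800-5200 | 1050-1950 | Original | BF16 | 8 | 106 | 1060 | 489.726 | 133.54 | 2780.778 | 92.187 | - |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | 2800-5200 | 1050-1950 | QLean-converted | W8A8-INT8 mixed-precision quantization | 4 | 95 | 950 | 832.724 | 227.071 | 2337.891 | 97.278 | 170.0% |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 2800-5200 | 1050-1950 | Original | BF16 | 2 | 72 | 720 | 2615.439 | 713.185 | 431.293 | 47.78 | - |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 2800-5200 | 1050-1950 | QLean-converted | W8A8-INT8 | 2 | 98 | 980 | 3371.56 | 919.365 | 430.125 | 50.481 | 128.9% |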

Note: Since the Zhenwu 810E does not support FP8, the W8A8 weights from this test serve as the baseline for models with original FP8 weights.
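
The text does not define how the ratio in the SGLang performance results is calculated. The reported values are consistent with dividing the total throughput (tokens/s/card) of the QLean-converted weights by that of the original weights; the sketch below checks this assumption against the five reported SGLang ratios.

```python
# (original total throughput, QLean total throughput, reported ratio in %)
# Values are taken from the SGLang performance results above.
SGLANG_RESULTS = [
    (56.286, 55.364, 98.4),
    (76.078, 76.063, 100.0),
    (132.563, 132.123, 99.7),
    (489.726, 832.724, 170.0),
    (2615.439, 3371.56, 128.9),
]


def throughput_ratio(original, converted):
    """Converted total throughput as a percentage of the original, one decimal."""
    return round(converted / original * 100, 1)


for original, converted, reported in SGLANG_RESULTS:
    computed = throughput_ratio(original, converted)
    status = "matches" if computed == reported else "differs"
    print(f"computed {computed}% vs reported {reported}%: {status}")
```

Because throughput is reported per card, the 170.0% figure compares a TP 8 BF16 deployment with a TP 4 mixed-precision deployment on a per-card basis.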

vLLM performance

Test image: pytorch2.9.0-ubuntu24.04-cuda12.9-vllm0.12.0-py312

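Each configuration is listed as two rows: the original weights, then the QLean-converted weights together with the reported comparison ratio.

| Model | Input len | Output len | Weights | Type | TP | Concurrency | Requests | Total throughput (tokens/s/card) | Output throughput (tokens/s/card) | TTFT (ms) | TPOT (ms) | Comparison ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek-ai/DeepSeek-V3.2 | 800-1000 | 300-500 | Original | W8A8-INT8 | 8 | 39 | 390 | 72.259 | 22.777 | 804.091 | 100.271 | - |
| deepseek-ai/DeepSeek-V3.2 | 800-1000 | 300-500 | QLean-converted | W8A8-INT8 | 8 | 40 | 400 | 72.506 | 22.845 | 947.792 | 98.14 | 100.3% |
| deepseek-ai/DeepSeek-R1-0528 | 2800-5200 | 1050-1950 | Original | BF16 | 16 | 28 | 280 | 107.755 | 29.382 | 2371.914 | 55.35 | - |
| deepseek-ai/DeepSeek-R1-0528 | 2800-5200 | 1050-1950 | QLean-converted | BF16 | 16 | 26 | 260 | 105.566 | 28.786 | 589.644 | 53.167 | 98.0% |
| deepseek-ai/DeepSeek-R1-0528 | 2800-5200 | 1050-1950 | Original | W8A8-INT8 | 8 | 28 | 280 | 194.585 | 53.059 | 2652.877 | 61.402 | - |
| deepseek-ai/DeepSeek-R1-0528 | 2800-5200 | 1050-1950 | QLean-converted | W8A8-INT8 | 8 | 26 | 260 | 191.23 | 52.145 | 686.914 | 58.728 | 98.3% |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | 2800-5200 | 1050-1950 | Original | BF16 | 8 | 120 | 1200 | 507.971 | 138.515 | 2690.231 | 100.796 | - |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | 2800-5200 | 1050-1950 | QLean-converted | W8A8-INT8 mixed quantization | 4 | 104 | 1040 | 899.99 | 245.412 | 956.795 | 98.988 | 177.2% |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 2800-5200 | 1050-1950 | Original | BF16 | 2 | 82 | 820 | 2822.723 | 769.657 | 321.365 | 50.33 | - |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 2800-5200 | 1050-1950 | QLean-converted | W8A8-INT8 | 2 | 104 | 1040 | 3628.596 | 988.915 | 287.425 | 49.591 | 128.5% |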

Note: The Zhenwu 810E does not support FP8. For models with original FP8 weights, W8A8 weights are used for the comparison.

Known issues

  • Because the Qwen3-Next series uses a new model architecture, its W8A8-INT8 quantized models still require performance optimization on vLLM and are temporarily incompatible with later versions of SGLang.

  • The framework does not currently support the dequantized DeepSeek-V3.2 model.