Skip to content

Int8 Quantization

Overview

Int8 quantization supports W8A8 diffusion transformer inference on CUDA and Ascend NPU. It can quantize BF16/FP16 weights at load time, or load serialized Int8 checkpoints that already contain quantized weights and scales.

Only online activation scaling is currently supported.

Hardware Support

Device Support
NVIDIA Blackwell GPU (SM 100+)
NVIDIA Ada/Hopper GPU (SM 89+)
NVIDIA Ampere GPU (SM 80+)
AMD ROCm
Intel XPU
Ascend NPU

Legend: supported, unsupported, not verified in this guide.

Model Type Support

Diffusion Model (Qwen-Image, Wan2.2)

Model HF models CUDA Ascend NPU Mode Recommendation
Qwen-Image Qwen/Qwen-Image, Qwen/Qwen-Image-2512 Yes Yes Online W8A8 All layers
Wan2.2 Wan2.2 diffusion pipelines Not validated Not validated Online W8A8 Validate before enabling in docs
Z-Image Tongyi-MAI/Z-Image-Turbo Yes Yes Online W8A8 All layers

Other diffusion models may work if their transformer uses supported linear layers, but they are not validated in this guide.

Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)

Model Scope Status Notes
Qwen3-Omni Thinker language-model stage Not validated Prefer checkpoint-supported ModelOpt FP8 or AutoRound paths
Qwen3-TTS TTS language-model stage Not validated No Int8 TTS stage support is documented

Multi-Stage Diffusion Model (BAGEL, GLM-Image)

Model Scope Status Notes
BAGEL Stage-specific transformer or DiT module Not validated Requires explicit stage routing
GLM-Image Stage-specific transformer or DiT module Not validated Requires quality comparison with BF16

Configuration

Python API:

from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(model="<your-model>", quantization="int8")

omni_with_skips = Omni(
    model="<your-model>",
    quantization_config={
        "method": "int8",
        "ignored_layers": ["<layer-name>"],
    },
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)

CLI:

python text_to_image.py --model <your-model> --quantization int8
python text_to_image.py --model <your-model> --quantization int8 --ignored-layers "img_mlp"
vllm serve <your-model> --omni --quantization int8

Parameters

Parameter Type Default Description
method str - Quantization method ("int8")
activation_scheme str "dynamic" "dynamic" selects online activation scaling; static is not supported
ignored_layers list[str] [] Layer name patterns to keep in BF16/FP16
is_checkpoint_int8_serialized bool False Set by checkpoint config when loading serialized Int8 weights

Validation and Notes

Int8 quantization can be combined with cache acceleration:

omni = Omni(
    model="<your-model>",
    quantization="int8",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2},
)

Only add a new model to the supported table after comparing the Int8 output against a BF16 baseline and documenting any required ignored_layers.