Int8 Quantization
Overview
Int8 quantization supports W8A8 (8-bit weights, 8-bit activations) diffusion transformer inference on CUDA and Ascend NPU. It can quantize BF16/FP16 weights at load time, or load serialized Int8 checkpoints that already contain quantized weights and scales.
Only online (dynamic) activation scaling is currently supported; static activation scales are not.
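To make the W8A8 idea concrete, here is a minimal pure-Python sketch of symmetric Int8 quantization with online activation scaling. It is an illustration of the general technique only, not the vLLM-Omni kernel: the function names `quantize_symmetric` and `w8a8_dot` are hypothetical, and a single per-tensor absmax scale is assumed for simplicity.

```python
# Illustrative sketch only: symmetric Int8 quantization with online
# (dynamic) activation scaling. Names and the per-tensor absmax scale
# are assumptions, not the vLLM-Omni implementation.

def quantize_symmetric(values):
    """Quantize a list of floats to the int8 range using scale = absmax / 127."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def w8a8_dot(activations, q_weights, w_scale):
    """Dot product of a float activation row with a pre-quantized weight row."""
    # The activation scale is derived from the live values at inference time,
    # so no calibration pass is required ("online" scaling).
    q_act, a_scale = quantize_symmetric(activations)
    acc = sum(a * w for a, w in zip(q_act, q_weights))  # integer accumulation
    return acc * a_scale * w_scale  # dequantize the result back to float


weights = [0.5, -1.0, 0.25, 0.75]
activations = [2.0, 0.5, -1.5, 1.0]

# Weights are quantized once (at load time, or ahead of time in a checkpoint).
q_weights, w_scale = quantize_symmetric(weights)

approx = w8a8_dot(activations, q_weights, w_scale)
exact = sum(a * w for a, w in zip(activations, weights))
print(f"exact={exact:.4f} int8={approx:.4f}")
```

The two printed values should be close but not identical; the gap is the quantization error that the baseline comparison described later in this guide is meant to catch.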
Hardware Support
| Device | Support |
|---|---|
| NVIDIA Blackwell GPU (SM 100+) | ✅ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ✅ |
| NVIDIA Ampere GPU (SM 80+) | ✅ |
| AMD ROCm | ⭕ |
| Intel XPU | ⭕ |
| Ascend NPU | ✅ |
Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide.
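A quick pre-flight check against the CUDA rows of the table above can be sketched as follows. The helper name `is_listed_cuda_capability` is hypothetical, and the threshold simply mirrors the lowest compute capability the table marks as supported (SM 80); it says nothing about devices the table does not list.

```python
# Hypothetical helper: compare a CUDA compute capability with the lowest
# one marked as supported in the table above (Ampere, SM 80).
MIN_CUDA_CAPABILITY = (8, 0)


def is_listed_cuda_capability(major: int, minor: int) -> bool:
    """Return True if (major, minor) is at or above SM 80."""
    return (major, minor) >= MIN_CUDA_CAPABILITY


# With PyTorch installed, the pair can be read from the current device:
#   major, minor = torch.cuda.get_device_capability()
examples = [
    ("Turing", (7, 5)),
    ("Ampere", (8, 0)),
    ("Ada", (8, 9)),
    ("Hopper", (9, 0)),
    ("Blackwell", (10, 0)),
]
for name, capability in examples:
    print(name, is_listed_cuda_capability(*capability))
```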
Model Type Support
Diffusion Model (Qwen-Image, Wan2.2)
| Model | HF models | CUDA | Ascend NPU | Mode | Recommendation |
|---|---|---|---|---|---|
| Qwen-Image | Qwen/Qwen-Image, Qwen/Qwen-Image-2512 | Yes | Yes | Online W8A8 | All layers |
| Wan2.2 | Wan2.2 diffusion pipelines | Not validated | Not validated | Online W8A8 | Validate against a BF16 baseline before listing as supported |
| Z-Image | Tongyi-MAI/Z-Image-Turbo | Yes | Yes | Online W8A8 | All layers |
Other diffusion models may work if their transformer uses supported linear layers, but they are not validated in this guide.
Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)
| Model | Scope | Status | Notes |
|---|---|---|---|
| Qwen3-Omni | Thinker language-model stage | Not validated | Prefer checkpoint-supported ModelOpt FP8 or AutoRound paths |
| Qwen3-TTS | TTS language-model stage | Not validated | No Int8 TTS stage support is documented |
Multi-Stage Diffusion Model (BAGEL, GLM-Image)
| Model | Scope | Status | Notes |
|---|---|---|---|
| BAGEL | Stage-specific transformer or DiT module | Not validated | Requires explicit stage routing |
| GLM-Image | Stage-specific transformer or DiT module | Not validated | Requires quality comparison with BF16 |
Configuration
Python API:
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# Quantize all supported layers to Int8.
omni = Omni(model="<your-model>", quantization="int8")

# Alternative: keep selected layers in BF16/FP16 via ignored_layers.
omni_with_skips = Omni(
    model="<your-model>",
    quantization_config={
        "method": "int8",
        "ignored_layers": ["<layer-name>"],
    },
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```
CLI:
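```shell
# Quantize all supported layers
python text_to_image.py --model <your-model> --quantization int8

# Keep layers matching "img_mlp" in BF16/FP16
python text_to_image.py --model <your-model> --quantization int8 --ignored-layers "img_mlp"

# Serve with Int8 quantization
vllm serve <your-model> --omni --quantization int8
```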
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `method` | str | - | Quantization method (`"int8"`) |
| `activation_scheme` | str | `"dynamic"` | `"dynamic"` selects online activation scaling; static is not supported |
| `ignored_layers` | list[str] | `[]` | Layer name patterns to keep in BF16/FP16 |
| `is_checkpoint_int8_serialized` | bool | `False` | Set by checkpoint config when loading serialized Int8 weights |
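The parameters above can be assembled into a `quantization_config` dictionary with a small helper such as the one sketched below. `build_int8_config` is a hypothetical name, not part of vLLM-Omni, and passing `activation_scheme` as a dictionary key is an assumption based on the table. `is_checkpoint_int8_serialized` is deliberately left out because the table says the checkpoint config sets it.

```python
def build_int8_config(ignored_layers=None, activation_scheme="dynamic"):
    """Hypothetical helper: assemble a quantization_config dict for Int8."""
    if activation_scheme != "dynamic":
        # The parameter table states that static activation scaling
        # is not supported, so fail early instead of at load time.
        raise ValueError(f"unsupported activation_scheme: {activation_scheme!r}")
    return {
        "method": "int8",
        "activation_scheme": activation_scheme,  # assumed to be an accepted key
        "ignored_layers": list(ignored_layers or []),
    }


config = build_int8_config(ignored_layers=["img_mlp"])
print(config)

# Assumed usage, following the Python API example in this guide:
#   Omni(model="<your-model>", quantization_config=config)
```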
Validation and Notes
Int8 quantization can be combined with cache acceleration:
```python
from vllm_omni import Omni

omni = Omni(
    model="<your-model>",
    quantization="int8",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2},
)
```
Only add a new model to the supported table after comparing the Int8 output against a BF16 baseline and documenting any required ignored_layers.
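One way to quantify such a comparison is peak signal-to-noise ratio (PSNR) between the BF16 and Int8 images. The sketch below is only an example metric: this guide does not prescribe one, the pixel values are toy data, and the acceptance threshold is left to the reviewer. It assumes both images come from the same prompt and seed and have been flattened to equal-length sequences of 8-bit pixel values.

```python
import math


def psnr(baseline, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(baseline) != len(candidate):
        raise ValueError("images must have the same number of pixel values")
    mse = sum((b - c) ** 2 for b, c in zip(baseline, candidate)) / len(baseline)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy data standing in for flattened BF16 and Int8 output images.
bf16_pixels = [10, 200, 35, 128, 90, 255]
int8_pixels = [11, 198, 35, 129, 92, 254]

score = psnr(bf16_pixels, int8_pixels)
print(f"PSNR vs BF16 baseline: {score:.2f} dB")
```

A higher PSNR means the Int8 output is closer to the baseline. If the score drops sharply for a model, adding the most sensitive layers to `ignored_layers` and re-measuring is a reasonable next step.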