Online Quantization
Overview
Online quantization means vLLM-Omni computes quantized weights and scales while loading the model. Use it when you want memory savings without preparing a separate quantized checkpoint.
This mode is different from pre-quantized checkpoint formats such as GGUF, AutoRound, msModelSlim, or serialized Int8 checkpoints. Those formats are prepared before serving and are documented in their method-specific guides. For MXFP8 and MXFP4, use this page for load-time quantization from BF16 checkpoints, and use the method-specific pages for offline checkpoints produced by msModelSlim and the merge tools.
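To make "computes quantized weights and scales while loading" concrete, here is a minimal sketch of symmetric per-tensor Int8 quantization in plain Python. It illustrates the general idea only and is not vLLM-Omni's implementation: the function names are hypothetical, and real backends work on tensors, often with per-channel or per-block scales and hardware kernels.

```python
def quantize_int8(weights):
    """Symmetric per-tensor Int8 quantization (illustrative sketch).

    Returns the integer codes and the single scale needed to recover
    approximate floating-point values.
    """
    amax = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floating-point values from codes and scale."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.02, 1.0]   # stand-in for a BF16 weight tensor
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

Only the codes and the scale need to stay resident after loading, which is where the memory saving comes from. The rounding error per element is at most half a quantization step (`scale / 2`).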
Hardware Support
| Device | FP8 W8A8 | Int8 W8A8 | MXFP8 W8A8 | MXFP4 W4A4 |
|---|---|---|---|---|
| NVIDIA Blackwell GPU (SM 100+) | ✅ | ✅ | ⭕ | ⭕ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ✅ | ✅ | ⭕ | ⭕ |
| NVIDIA Ampere GPU (SM 80+) | ✅ | ✅ | ⭕ | ⭕ |
| AMD ROCm | ⭕ | ⭕ | ⭕ | ⭕ |
| Intel XPU | ⭕ | ⭕ | ✅ | ⭕ |
| Ascend NPU | ❌ | ✅ | ✅ | ✅ |
Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide. FP8 on Ampere may use a weight-only path where available. MXFP8 and MXFP4 are documented primarily for the Ascend NPU path; MXFP8 is also marked as supported on Intel XPU.
Model Type Support
Diffusion Model (Qwen-Image, Wan2.2)
| Method | Guide | Example models | Status |
|---|---|---|---|
| FP8 W8A8 | FP8 | Qwen-Image; Wan2.2 is not validated | Validated for Qwen-Image family and other DiT models |
| Int8 W8A8 | Int8 | Qwen-Image; Wan2.2 is not validated | Validated for Qwen-Image and Z-Image |
| MXFP8 W8A8 | MXFP8 | Wan2.2-T2V-A14B, Wan2.2-I2V-A14B, Wan2.2-TI2V-5B | Validated on Ascend NPU and Intel XPU |
| MXFP4 W4A4 | MXFP4 | Wan2.2-T2V-A14B, Wan2.2-I2V-A14B | Ascend NPU only; TI2V-5B is not supported |
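For readers unfamiliar with the MX methods named in the table above, the following sketch shows how one block of values could be quantized to MXFP4, following the scheme in the OCP Microscaling specification: 32 elements share one power-of-two scale, and each element is stored as a 4-bit E2M1 value. This is a plain-Python illustration, not the Ascend or vLLM-Omni kernel. The function name is hypothetical, and ties are resolved toward the smaller magnitude here for simplicity.

```python
import math

# Non-negative magnitudes representable in E2M1 (the 4-bit MXFP4 element format).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2      # largest element exponent: 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32    # elements sharing one scale in the MX formats


def quantize_mxfp4_block(values):
    """Quantize one block to (shared power-of-two scale, E2M1 elements)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0, [0.0] * len(values)
    # Choose the shared exponent so the largest value lands in the top E2M1 binade.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        magnitude = min(abs(v) / scale, E2M1_VALUES[-1])  # clamp to max normal
        nearest = min(E2M1_VALUES, key=lambda c: abs(c - magnitude))
        elements.append(math.copysign(nearest, v))
    return scale, elements


block = [12.0, -1.0] + [0.0] * (BLOCK_SIZE - 2)
scale, elements = quantize_mxfp4_block(block)
# Dequantized values are scale * element.
```

With only eight magnitudes per element, small values inside a block dominated by a large outlier lose most of their precision, which is one reason W4A4 methods are more sensitive to which layers are quantized than W8A8 methods.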
Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)
Online quantization is not currently validated for the omni/TTS stages. For Qwen3-Omni and related models, prefer checkpoint-declared ModelOpt or AutoRound paths when available.
Multi-Stage Diffusion Model (BAGEL, GLM-Image)
Online quantization must be routed to the intended stage. BAGEL and GLM-Image need model-specific validation before they are listed as supported targets.
Configuration
Python API:
from vllm_omni import Omni
omni_fp8 = Omni(model="<your-model>", quantization="fp8")
omni_int8 = Omni(model="<your-model>", quantization="int8")
omni_mxfp8 = Omni(model="<your-model>", quantization="mxfp8")
omni_mxfp4 = Omni(model="<your-model>", quantization="mxfp4")
omni_mxfp4_dualscale = Omni(model="<your-model>", quantization="mxfp4_dualscale")
CLI:
vllm serve <your-model> --omni --quantization fp8
vllm serve <your-model> --omni --quantization int8
vllm serve <your-model> --omni --quantization mxfp8
vllm serve <your-model> --omni --quantization mxfp4
vllm serve <your-model> --omni --quantization mxfp4_dualscale
Per-component routing:
from vllm_omni.quantization import build_quant_config
config = build_quant_config({
"transformer": {"method": "fp8"},
"vae": None,
})
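When several components need different settings, it can help to assemble the routing dictionary with a small validating helper before handing it to `build_quant_config`. The sketch below is plain Python and does not import vLLM-Omni. The helper `component_quant_spec` and the layer name `proj_out` are hypothetical, and passing `ignored_layers` inside the per-component dictionary is an assumption based on the parameter table on this page, not a confirmed signature.

```python
# Method strings listed on this page.
VALID_METHODS = {"fp8", "int8", "mxfp8", "mxfp4", "mxfp4_dualscale"}


def component_quant_spec(method, ignored_layers=None, **extra):
    """Build one per-component entry, rejecting unknown method names early."""
    if method not in VALID_METHODS:
        raise ValueError(f"unknown quantization method: {method!r}")
    spec = {"method": method}
    if ignored_layers:
        spec["ignored_layers"] = list(ignored_layers)
    spec.update(extra)  # e.g. activation_scheme="dynamic"
    return spec


routing = {
    # Quantize the transformer, keeping a hypothetical output projection unquantized.
    "transformer": component_quant_spec("fp8", ignored_layers=["proj_out"]),
    # None leaves the VAE in its original precision, as in the example above.
    "vae": None,
}
# Assumed usage: config = build_quant_config(routing)
```

Validating the method string up front turns a typo such as `"fp-8"` into an immediate error instead of a failure partway through model loading.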
Parameters
| Parameter | Methods | Description |
|---|---|---|
| method | FP8, Int8, MXFP8, MXFP4 | Quantization method: "fp8", "int8", "mxfp8", "mxfp4", or "mxfp4_dualscale" |
| ignored_layers | FP8, Int8, MXFP8, MXFP4 | Layer name patterns to keep in BF16/FP16 |
| activation_scheme | FP8, Int8 | The runtime value "dynamic" selects online activation scaling |
| weight_block_size | FP8 | Optional block-wise FP8 weight quantization size |
| num_bf16_fallback_layers | MXFP4 DualScale | Leading transformer blocks to keep in BF16 for online mxfp4_dualscale; defaults to 5 |
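The two layer-selection parameters in the table can be pictured with a short sketch. It assumes glob-style matching for `ignored_layers` and zero-based block indices for `num_bf16_fallback_layers`. Both are assumptions for illustration: the matching rule vLLM-Omni actually applies (exact, prefix, substring, or regex) should be checked in the method guide, and the function and layer names below are hypothetical.

```python
from fnmatch import fnmatchcase


def stays_high_precision(layer_name, ignored_layers):
    """Return True if the layer matches any ignore pattern (glob-style, assumed)."""
    return any(fnmatchcase(layer_name, pattern) for pattern in ignored_layers)


def bf16_fallback_blocks(num_blocks, num_bf16_fallback_layers=5):
    """Indices of the leading transformer blocks kept in BF16."""
    kept = max(0, min(num_blocks, num_bf16_fallback_layers))
    return list(range(kept))


patterns = ["*.ff.*", "proj_out"]
layers = [
    "transformer_blocks.0.attn.to_q",   # quantized
    "transformer_blocks.3.ff.net.2",    # kept in BF16/FP16 via "*.ff.*"
    "proj_out",                         # kept in BF16/FP16 via exact name
]
kept = [name for name in layers if stays_high_precision(name, patterns)]
```

With the documented default of 5, `bf16_fallback_blocks(40)` selects blocks 0 through 4 and leaves the remaining 35 blocks to be quantized.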
Validation and Notes
- Compare the online-quantized output against a BF16 baseline with the same seed and generation parameters.
- Use ignored_layers for quality-sensitive MLPs or output projections.
- Document any required skipped layers in the method page before marking a new model as supported.
- If a model already ships quantized weights, use the matching pre-quantized method guide instead of online quantization.
- For Ascend MXFP4 deployments, prefer offline mxfp4_dualscale checkpoints when production quality is more important than avoiding preprocessing.
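The baseline comparison in the first note can be made quantitative with simple error metrics. The sketch below assumes both outputs were generated with the same seed and parameters and have already been flattened to equal-length sequences of floats in [0, 1]. The function names are hypothetical, and any acceptance threshold is a project-specific choice, not something this guide defines.

```python
import math


def mean_abs_error(baseline, candidate):
    """Average absolute difference between two equal-length outputs."""
    if len(baseline) != len(candidate):
        raise ValueError("outputs must have the same length")
    return sum(abs(a - b) for a, b in zip(baseline, candidate)) / len(baseline)


def psnr(baseline, candidate, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical outputs."""
    if len(baseline) != len(candidate):
        raise ValueError("outputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(baseline, candidate)) / len(baseline)
    if mse == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse)


bf16_pixels = [0.0, 0.5, 1.0]       # stand-in for the BF16 baseline output
quant_pixels = [0.0, 0.5, 0.9]      # stand-in for the online-quantized output
mae = mean_abs_error(bf16_pixels, quant_pixels)
score = psnr(bf16_pixels, quant_pixels)
```

Tracking these numbers while adding or removing entries in ignored_layers gives a repeatable way to see which skipped layers actually recover quality.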