W8A8 MXFP8 Quantization¶
Overview¶
W8A8 MXFP8 (Microscaling FP8) quantizes both weights and activations to FP8 using the OCP MX format: groups of 32 K-dimension elements share a single float8_e8m0fnu exponent scale. This gives better accuracy than channel-wise FP8 while keeping the same 8-bit weight footprint.
This method supports three modes:
| Mode | Description |
|---|---|
| Online | BF16 weights are quantized to MXFP8 at load time — no pre-processing needed |
| Offline (Native) | msModelSlim-exported MXFP8 weights converted to diffusers format via merge_mxfp8_checkpoint.py — weights and scales are loaded directly from the preprocessed checkpoint |
| Offline (AutoRound) | AutoRound MXFP8 checkpoints with data_type="mx_fp" — auto-detected from config.json |
Hardware Support¶
| Device | Online | Offline (Native) | Offline (AutoRound) |
|---|---|---|---|
| NVIDIA Blackwell GPU (SM 100+) | ⭕ | ⭕ | ⭕ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ⭕ | ⭕ | ⭕ |
| NVIDIA Ampere GPU (SM 80+) | ⭕ | ⭕ | ⭕ |
| AMD ROCm | ⭕ | ⭕ | ⭕ |
| Intel XPU | ✅ | ❌ | ✅ |
| Ascend NPU (Atlas 950 A5) | ✅ | ✅ | ⭕ |
Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide.
Note: Intel XPU only supports AutoRound MXFP8 for offline mode. Use AutoRound quantized checkpoints or online mode for XPU.
Model Type Support¶
Diffusion Model (Wan2.2)¶
| Model | Mode | Notes |
|---|---|---|
| Wan2.2-T2V-A14B | Online + Offline | MoE cascade; quantizes two transformers (transformer + transformer_2) |
| Wan2.2-I2V-A14B | Online + Offline | MoE cascade; quantizes two transformers (transformer + transformer_2) |
| Wan2.2-TI2V-5B | Online + Offline | Single transformer |
Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)¶
| Model | Status | Notes |
|---|---|---|
| Qwen3-Omni | Not validated | — |
| Qwen3-TTS | Not validated | — |
Multi-Stage Diffusion Model (BAGEL, GLM-Image)¶
| Model | Status | Notes |
|---|---|---|
| BAGEL | Not validated | — |
| GLM-Image | Not validated | — |
Configuration¶
Online Mode¶
Online mode requires no pre-processing. vLLM-Omni quantizes BF16 weights to MXFP8 at load time.
Python API:
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
omni = Omni(model="<your-model>", quantization="mxfp8")
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(num_inference_steps=50),
)
CLI:
python text_to_video.py --model <your-model> --quantization mxfp8
# Online serving
vllm serve <your-model> --omni --quantization mxfp8
Offline Mode (Native)¶
Native offline mode loads a pre-quantized checkpoint from msModelSlim. A preprocessing step converts the raw quantized output to the diffusers format expected by vLLM-Omni and injects the quantization config into transformer/config.json so that vLLM-Omni auto-detects the offline path without a --quantization flag.
Step 1 — Quantize with msModelSlim¶
msmodelslim quant \
--model_path /path/to/Wan2.2-TI2V-5B-Diffusers \
--save_path /path/to/wan2_2_ti2v_quantized_raw \
--device npu \
--model_type Wan2_2 \
--config_path /path/to/wan2_2_w8a8f8_mxfp.yaml \
--trust_remote_code True
After this step, --save_path contains the raw quantized safetensors files and a metadata JSON (quant_model_description*.json).
For cascade MoE models (T2V-A14B, I2V-A14B), msModelSlim outputs two subdirectories: high_noise_model/ and low_noise_model/.
Step 2 — Preprocess with merge_mxfp8_checkpoint.py¶
The script (vllm_omni/quantization/tools/merge_mxfp8_checkpoint.py):
- Copies the original diffusers model to
--output-path(VAE, text encoder, scheduler, etc. are preserved). - Remaps tensor names from msModelSlim convention to diffusers convention.
- Saves the converted weights as
diffusion_pytorch_model.safetensors. - Copies the original
transformer/config.jsonand injectsquantization_configso that vLLM-Omni auto-detects offline MXFP8.
For cascade MoE models, steps 2–4 run separately for high_noise_model/ → transformer/ and low_noise_model/ → transformer_2/.
python vllm_omni/quantization/tools/merge_mxfp8_checkpoint.py \
--model-type Wan2.2-TI2V-5B \
--original-model /path/to/Wan2.2-TI2V-5B-Diffusers \
--quant-path /path/to/wan2_2_ti2v_quantized_raw \
--output-path /path/to/Wan2.2-TI2V-5B-MXFP8
| Argument | Description |
|---|---|
--model-type | Model variant: Wan2.2-T2V-A14B, Wan2.2-I2V-A14B, or Wan2.2-TI2V-5B |
--original-model | Root directory of the original BF16 diffusers model |
--quant-path | Root directory of the msModelSlim quantized output |
--output-path | Output directory for the merged model (created by the script) |
The script outputs a complete diffusers model directory at --output-path, with each transformer subfolder containing:
diffusion_pytorch_model.safetensors— converted FP8 weightsconfig.json— original transformer config withquantization_configinjectedquant_model_description.json— renamed quantization metadata (reference only)
Step 3 — Serve¶
python text_to_video.py --model /path/to/Wan2.2-TI2V-5B-MXFP8
# Online serving
vllm serve /path/to/Wan2.2-TI2V-5B-MXFP8 --omni
Python API:
Note
No --quantization flag is needed for native offline mode. The preprocessing script injects quantization_config into each transformer/config.json, which vLLM-Omni reads automatically to activate the offline MXFP8 method.
Offline Mode (AutoRound)¶
AutoRound MXFP8 checkpoints declare quant_method="auto-round" with data_type="mx_fp" in their config.json. These are automatically detected and use the IncMxfp8OfflineLinearMethod backend.
To use an AutoRound MXFP8 checkpoint:
python text_to_video.py --model <autoround-mxfp8-model>
# Online serving
vllm serve <autoround-mxfp8-model> --omni
Python API:
Note
AutoRound MXFP8 checkpoints are auto-detected from config.json and do not require a --quantization flag. The config must include:
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
method | str | — | Must be "mxfp8" |
is_checkpoint_mxfp8_serialized | bool | False | True for offline pre-quantized checkpoints; auto-set from config.json when using the preprocessing script |
ignored_layers | list[str] | [] | Layer name substrings to keep in BF16 (e.g. "to_out" matches blocks.0.attn1.to_out.0) |
Validation and Notes¶
- Online mode quantizes BF16 weights at load time using
npu_dynamic_mx_quant. This adds a one-time overhead on the first load but requires no checkpoint preparation. - Offline mode loads FP8 weights directly from the checkpoint. Scales are stored as
uint8bytes in safetensors (same bit layout asfloat8_e8m0fnu) and are reinterpreted at load time without a dtype conversion. - If the offline checkpoint was produced with the old
merge_mxfp8_checkpoint.pyinterface (arguments--quant-dir,--orig-dir,--meta-json,--output-dir), regenerate it with the current script. The old script wrote a separatequantization_config.jsonthat is not read by vLLM-Omni; the current script injects the config directly intotransformer/config.json.