W8A8 MXFP8 Quantization

Overview

W8A8 MXFP8 (Microscaling FP8) quantizes both weights and activations to FP8 using the OCP MX format: each group of 32 consecutive elements along the K dimension shares a single float8_e8m0fnu power-of-two scale. This typically gives better accuracy than channel-wise FP8 while keeping 8-bit weights; the shared scales add one extra byte per 32 elements.
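
To make the block-scaling idea concrete, here is a minimal pure-Python sketch of how a shared power-of-two scale could be chosen for each group of 32 elements. It follows the OCP MX convention (shared exponent = floor(log2(max|x|)) minus the largest exponent of the element format) and assumes E4M3 elements. The function name is hypothetical, the cast of the scaled elements to FP8 is omitted, and the kernels vLLM-Omni actually uses may round or clamp differently.

```python
import math

BLOCK = 32        # elements that share one scale
E4M3_EMAX = 8     # assumption: E4M3 elements, max normal 448 = 1.75 * 2**8
E8M0_BIAS = 127   # an E8M0 byte b encodes the scale 2**(b - 127)


def shared_scale_bytes(row):
    """Return one E8M0-encoded scale byte per 32-element block of `row`."""
    if len(row) % BLOCK != 0:
        raise ValueError("row length must be a multiple of 32")
    scale_bytes = []
    for start in range(0, len(row), BLOCK):
        amax = max(abs(v) for v in row[start:start + BLOCK])
        if amax == 0.0:
            exponent = -E8M0_BIAS  # all-zero block: use the smallest scale
        else:
            # frexp gives amax = m * 2**e with 0.5 <= m < 1,
            # so floor(log2(amax)) == e - 1 without rounding error.
            exponent = (math.frexp(amax)[1] - 1) - E4M3_EMAX
        exponent = max(-E8M0_BIAS, min(E8M0_BIAS, exponent))
        scale_bytes.append(exponent + E8M0_BIAS)
    return scale_bytes


row = [1.0] * 31 + [448.0] + [0.5] * 32
print(shared_scale_bytes(row))  # → [127, 118]
```

The first block peaks at 448, the E4M3 maximum, so its scale is 2**0 (byte 127). The second block peaks at 0.5 and gets the scale 2**-9 (byte 118), which maps 0.5 to 256 before the FP8 cast.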

This method supports three modes:

Mode	Description
Online	BF16 weights are quantized to MXFP8 at load time — no pre-processing needed
Offline (Native)	msModelSlim-exported MXFP8 weights converted to diffusers format via `merge_mxfp8_checkpoint.py` — weights and scales are loaded directly from the preprocessed checkpoint
Offline (AutoRound)	AutoRound MXFP8 checkpoints with `data_type="mx_fp"` — auto-detected from `config.json`

Hardware Support

Device	Online	Offline (Native)	Offline (AutoRound)
NVIDIA Blackwell GPU (SM 100+)	⭕	⭕	⭕
NVIDIA Ada/Hopper GPU (SM 89+)	⭕	⭕	⭕
NVIDIA Ampere GPU (SM 80+)	⭕	⭕	⭕
AMD ROCm	⭕	⭕	⭕
Intel XPU	✅	❌	✅
Ascend NPU (Atlas 950 A5)	✅	✅	⭕

Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide.

Note: Intel XPU only supports AutoRound MXFP8 for offline mode. Use AutoRound quantized checkpoints or online mode for XPU.

Model Type Support

Diffusion Model (Wan2.2)

Model	Mode	Notes
Wan2.2-T2V-A14B	Online + Offline	MoE cascade; quantizes two transformers (`transformer` + `transformer_2`)
Wan2.2-I2V-A14B	Online + Offline	MoE cascade; quantizes two transformers (`transformer` + `transformer_2`)
Wan2.2-TI2V-5B	Online + Offline	Single transformer

Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)

Model	Status	Notes
Qwen3-Omni	Not validated	—
Qwen3-TTS	Not validated	—

Multi-Stage Diffusion Model (BAGEL, GLM-Image)

Model	Status	Notes
BAGEL	Not validated	—
GLM-Image	Not validated	—

Configuration

Online Mode

Online mode requires no pre-processing. vLLM-Omni quantizes BF16 weights to MXFP8 at load time.

Python API:

from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(model="<your-model>", quantization="mxfp8")

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)

CLI:

python text_to_video.py --model <your-model> --quantization mxfp8

# Online serving
vllm serve <your-model> --omni --quantization mxfp8

Offline Mode (Native)

Native offline mode loads a checkpoint that was pre-quantized with msModelSlim. A preprocessing step converts the raw quantized output to the diffusers format expected by vLLM-Omni and injects the quantization config into transformer/config.json. vLLM-Omni then auto-detects the offline path, so no --quantization flag is needed.

Step 1: Quantize with msModelSlim

msmodelslim quant \
  --model_path /path/to/Wan2.2-TI2V-5B-Diffusers \
  --save_path  /path/to/wan2_2_ti2v_quantized_raw \
  --device npu \
  --model_type Wan2_2 \
  --config_path /path/to/wan2_2_w8a8f8_mxfp.yaml \
  --trust_remote_code True

After this step, --save_path contains the raw quantized safetensors files and a metadata JSON (quant_model_description*.json).

For cascade MoE models (T2V-A14B, I2V-A14B), msModelSlim outputs two subdirectories: high_noise_model/ and low_noise_model/.
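
Before moving on to preprocessing, it can help to confirm what msModelSlim actually wrote. The sketch below only counts files matching the names mentioned above in the root and in the two cascade subdirectories; it makes no claim about which layout is correct for a given model, and the helper name is hypothetical.

```python
from pathlib import Path


def summarize_raw_output(save_path):
    """Count weight and metadata files in the root and the cascade subdirectories."""
    root = Path(save_path)
    summary = {}
    for name in (".", "high_noise_model", "low_noise_model"):
        folder = root / name
        if not folder.is_dir():
            continue  # e.g. single-transformer models have no cascade subdirectories
        summary[name] = {
            "safetensors": len(list(folder.glob("*.safetensors"))),
            "metadata": len(list(folder.glob("quant_model_description*.json"))),
        }
    return summary
```

Calling `summarize_raw_output("/path/to/wan2_2_ti2v_quantized_raw")` returns a dictionary keyed by directory name, so a missing metadata JSON or an empty subdirectory shows up as a zero count.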

Step 2: Preprocess with merge_mxfp8_checkpoint.py

The script (vllm_omni/quantization/tools/merge_mxfp8_checkpoint.py):

1. Copies the original diffusers model to --output-path (VAE, text encoder, scheduler, etc. are preserved).
2. Remaps tensor names from the msModelSlim convention to the diffusers convention.
3. Saves the converted weights as diffusion_pytorch_model.safetensors.
4. Copies the original transformer/config.json and injects quantization_config so that vLLM-Omni auto-detects offline MXFP8.

For cascade MoE models, steps 2–4 run separately for high_noise_model/ → transformer/ and low_noise_model/ → transformer_2/.

python vllm_omni/quantization/tools/merge_mxfp8_checkpoint.py \
  --model-type     Wan2.2-TI2V-5B \
  --original-model /path/to/Wan2.2-TI2V-5B-Diffusers \
  --quant-path     /path/to/wan2_2_ti2v_quantized_raw \
  --output-path    /path/to/Wan2.2-TI2V-5B-MXFP8

Argument	Description
`--model-type`	Model variant: `Wan2.2-T2V-A14B`, `Wan2.2-I2V-A14B`, or `Wan2.2-TI2V-5B`
`--original-model`	Root directory of the original BF16 diffusers model
`--quant-path`	Root directory of the msModelSlim quantized output
`--output-path`	Output directory for the merged model (created by the script)
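
When the preprocessing step is driven from Python (for example in a batch job covering several variants), the command line can be assembled from the four arguments in the table. This is a sketch only: the helper name is hypothetical, and it validates nothing beyond the model variant names listed above.

```python
import shlex
import sys

MODEL_TYPES = ("Wan2.2-T2V-A14B", "Wan2.2-I2V-A14B", "Wan2.2-TI2V-5B")
SCRIPT = "vllm_omni/quantization/tools/merge_mxfp8_checkpoint.py"


def build_merge_command(model_type, original_model, quant_path, output_path):
    """Return the argument list for one run of the preprocessing script."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type!r}")
    return [
        sys.executable, SCRIPT,
        "--model-type", model_type,
        "--original-model", str(original_model),
        "--quant-path", str(quant_path),
        "--output-path", str(output_path),
    ]


cmd = build_merge_command(
    "Wan2.2-T2V-A14B",
    "/path/to/original",   # placeholder paths
    "/path/to/quantized_raw",
    "/path/to/output",
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, so that the relative script path resolves.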

The script outputs a complete diffusers model directory at --output-path, with each transformer subfolder containing:

- diffusion_pytorch_model.safetensors: converted FP8 weights
- config.json: original transformer config with quantization_config injected
- quant_model_description.json: renamed quantization metadata (reference only)
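
A quick sanity check of the merged output can catch a failed or partial run before serving. The sketch below assumes only what is listed above: the three file names and the presence of a quantization_config key in config.json. It does not inspect the contents of that key, and the helper name is hypothetical.

```python
import json
from pathlib import Path

EXPECTED_FILES = (
    "diffusion_pytorch_model.safetensors",
    "config.json",
    "quant_model_description.json",
)


def check_transformer_dir(folder):
    """Return a list of problems found in one transformer subfolder."""
    folder = Path(folder)
    problems = [
        f"missing {name}" for name in EXPECTED_FILES
        if not (folder / name).is_file()
    ]
    config_file = folder / "config.json"
    if config_file.is_file():
        config = json.loads(config_file.read_text())
        if "quantization_config" not in config:
            problems.append("config.json has no quantization_config")
    return problems
```

For a single-transformer model, call it on `<output-path>/transformer`; for the cascade MoE models, call it on both `transformer` and `transformer_2`. An empty list means the expected files are in place.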

Step 3: Serve

python text_to_video.py --model /path/to/Wan2.2-TI2V-5B-MXFP8

# Online serving
vllm serve /path/to/Wan2.2-TI2V-5B-MXFP8 --omni

Python API:

from vllm_omni import Omni
omni = Omni(model="/path/to/Wan2.2-TI2V-5B-MXFP8")

Note

No --quantization flag is needed for native offline mode. The preprocessing script injects quantization_config into each transformer/config.json, which vLLM-Omni reads automatically to activate the offline MXFP8 method.

Offline Mode (AutoRound)

AutoRound MXFP8 checkpoints declare quant_method="auto-round" with data_type="mx_fp" in their config.json. vLLM-Omni detects these checkpoints automatically and runs them through the IncMxfp8OfflineLinearMethod backend.

To use an AutoRound MXFP8 checkpoint:

python text_to_video.py --model <autoround-mxfp8-model>

# Online serving
vllm serve <autoround-mxfp8-model> --omni

Python API:

from vllm_omni import Omni
omni = Omni(model="<autoround-mxfp8-model>")

Note

AutoRound MXFP8 checkpoints are auto-detected from config.json and do not require a --quantization flag. The config must include:

{
  "quantization_config": {
    "quant_method": "auto-round",
    "data_type": "mx_fp",
    ...
  }
}
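
To check ahead of time whether a checkpoint's config carries the two required fields, a small standalone test like the following can be used. It mirrors the condition stated above and is not vLLM-Omni's own detection code; the function name is hypothetical.

```python
import json


def is_autoround_mxfp8(config):
    """Return True if a parsed config.json declares an AutoRound MXFP8 checkpoint."""
    quant = config.get("quantization_config") or {}
    return (
        quant.get("quant_method") == "auto-round"
        and quant.get("data_type") == "mx_fp"
    )


example = json.loads(
    '{"quantization_config": {"quant_method": "auto-round", "data_type": "mx_fp"}}'
)
print(is_autoround_mxfp8(example))  # → True
print(is_autoround_mxfp8({}))       # → False
```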

Parameters

Parameter	Type	Default	Description
`method`	str	—	Must be `"mxfp8"`
`is_checkpoint_mxfp8_serialized`	bool	`False`	`True` for offline pre-quantized checkpoints; auto-set from `config.json` when using the preprocessing script
`ignored_layers`	list[str]	`[]`	Layer name substrings to keep in BF16 (e.g. `"to_out"` matches `blocks.0.attn1.to_out.0`)
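
The substring matching described for `ignored_layers` can be illustrated as follows. This sketch assumes plain substring semantics, as the table states; the helper name is hypothetical and the matching code inside vLLM-Omni may differ in detail.

```python
def is_ignored(layer_name, ignored_layers):
    """Return True if any entry of `ignored_layers` occurs inside `layer_name`."""
    return any(pattern in layer_name for pattern in ignored_layers)


print(is_ignored("blocks.0.attn1.to_out.0", ["to_out"]))  # → True
print(is_ignored("blocks.0.attn1.to_q", ["to_out"]))      # → False
print(is_ignored("blocks.0.attn1.to_q", []))              # → False
```

Because the match is by substring, a short entry such as `"to_out"` covers every block at once, while a longer entry such as `"blocks.0."` would restrict the exclusion to a single block.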

Validation and Notes

- Online mode quantizes BF16 weights at load time using npu_dynamic_mx_quant. This adds a one-time overhead on the first load but requires no checkpoint preparation.
- Offline mode loads FP8 weights directly from the checkpoint. Scales are stored as uint8 bytes in safetensors (same bit layout as float8_e8m0fnu) and are reinterpreted at load time without a dtype conversion.
- If the offline checkpoint was produced with the old merge_mxfp8_checkpoint.py interface (arguments --quant-dir, --orig-dir, --meta-json, --output-dir), regenerate it with the current script. The old script wrote a separate quantization_config.json that is not read by vLLM-Omni; the current script injects the config directly into transformer/config.json.
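
The note on scale storage can be made concrete with a small decoder. An E8M0 value is an 8-bit biased exponent with no sign or mantissa bits, so a stored uint8 byte b stands for the scale 2**(b - 127), and the byte 255 is reserved for NaN. The sketch below decodes bytes in pure Python for inspection purposes; the function name is hypothetical, and the loader itself reinterprets the bytes in place rather than converting them.

```python
import math


def e8m0_to_float(byte):
    """Decode one E8M0 scale byte into its floating-point value."""
    if not 0 <= byte <= 255:
        raise ValueError("E8M0 scales are single bytes")
    if byte == 255:
        return math.nan  # reserved NaN encoding
    return math.ldexp(1.0, byte - 127)  # exactly 2**(byte - 127)


print([e8m0_to_float(b) for b in (127, 118, 130)])  # → [1.0, 0.001953125, 8.0]
```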