W4A4 MXFP4 Quantization¶
Overview¶
W4A4 MXFP4 (Microscaling FP4) quantizes both weights and activations to FP4 (float4_e2m1fn_x2, packed 2 values per byte) using the OCP MX format: groups of 32 K-dimension elements share a single float8_e8m0fnu exponent scale.
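To make the block format concrete, the following is a minimal pure-Python sketch of MX-style FP4 quantization for one 32-element group. It assumes the scale rule from the OCP MX specification (shared exponent = floor(log2(max|x|)) minus the largest E2M1 exponent, which is 2) and uses simple nearest-value rounding. The function names are illustrative only, and the rounding and packing performed by the actual NPU kernels may differ.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1 (sign is a separate bit).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # largest element exponent: 6.0 == 1.5 * 2**2


def quantize_block(block):
    """Quantize one group (normally 32 values) to a shared exponent + FP4 values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        # Any exponent represents an all-zero block; 0 is chosen arbitrarily here.
        return 0, [0.0] * len(block)
    # Shared power-of-two scale, stored on disk as an E8M0 exponent.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in block:
        # Nearest representable magnitude; values above 6.0 saturate to 6.0.
        mag = min(FP4_E2M1_MAGNITUDES, key=lambda c: abs(abs(v) / scale - c))
        quantized.append(math.copysign(mag, v))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Reconstruct approximate values from the shared exponent and FP4 values."""
    return [q * 2.0 ** shared_exp for q in quantized]
```

For the block 0.0, 1.0, ..., 31.0 the shared exponent is 2 (scale 4.0), so 24.0 is reproduced exactly while 31.0 saturates to 6.0 × 4.0 = 24.0.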
vLLM-Omni provides two quantization methods with different scale structures:
| Method | Scale structure | Mode | Use case |
|---|---|---|---|
| mxfp4 | Single-scale (per-32 fine only) | Online only | Quick accuracy baseline; no checkpoint prep needed |
| mxfp4_dualscale | Dual-scale (fine per-32 + coarse per-512 + per-channel mul_scale) | Online + Offline | Production; better accuracy; offline recommended |
Recommended: mxfp4_dualscale offline
For production deployments, use the mxfp4_dualscale offline mode with a pre-quantized checkpoint produced by msModelSlim. Offline checkpoints load calibrated mul_scale tensors from disk, providing measurably better accuracy than any online method. The one-time preprocessing cost amortises across all subsequent inference runs.
Use mxfp4 online only for quick experimentation where preprocessing time is not acceptable and accuracy loss is tolerable.
Online single-scale ≠ Offline dual-scale
mxfp4_dualscale offline mode uses NPUMxfp4DualScaleLinearMethod: fine scale (per-32 K), coarse scale (per-512 K), and per-input-channel mul_scale from calibration — all loaded from the checkpoint. mxfp4_dualscale online mode uses NPUMxfp4DualScaleOnlineLinearMethod: dual-level scales computed on the fly from BF16 weights; no calibration mul_scale is available. Loading an offline checkpoint with the online method (or vice versa) will produce incorrect results or shape errors.
Hardware Support¶
| Device | Support |
|---|---|
| NVIDIA Blackwell GPU (SM 100+) | ⭕ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ⭕ |
| NVIDIA Ampere GPU (SM 80+) | ⭕ |
| AMD ROCm | ⭕ |
| Intel XPU | ⭕ |
| Ascend NPU (Atlas 950 A5) | ✅ |
Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide.
Model Type Support¶
Diffusion Model (Wan2.2)¶
| Model | Online | Offline | Notes |
|---|---|---|---|
| Wan2.2-T2V-A14B | mxfp4 / mxfp4_dualscale | mxfp4_dualscale | MoE cascade (transformer + transformer_2); both transformers quantized with the same config |
| Wan2.2-I2V-A14B | mxfp4 / mxfp4_dualscale | mxfp4_dualscale | MoE cascade; same scheme as T2V-A14B |
| Wan2.2-TI2V-5B | ❌ | ❌ | Parameter count too small; W4A4 causes unacceptable accuracy loss |
The choice between mxfp4 and mxfp4_dualscale in online mode is about quantization quality, not model compatibility — both work on cascade (A14B) and single-transformer models alike, the same as mxfp8 online:
- mxfp4: single-scale, lower overhead, simpler compute, online only
- mxfp4_dualscale: dual-scale + optional BF16 fallback, better accuracy, online and offline
Offline checkpoints for A14B are always in mxfp4_dualscale format (produced by the merge script); there is no offline mxfp4 single-scale format.
Per-layer BF16 fallback in offline cascade models
The A14B offline checkpoint uses quant_method: mxfp4_dualscale. Most linear layers are stored as W4A4 MXFP4 DualScale; precision-sensitive layers retain their original BF16 weights and are listed in ignored_layers inside each transformer's config.json. The two transformers may have different ignored_layers sets — the pipeline reads each transformer's own config.json and rebuilds the config locally when they differ, so routing is always per-transformer-accurate.
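To check whether the two transformers of a merged checkpoint really carry different ignored_layers sets, a small inspection helper can be sketched as follows. It assumes the quantization settings live under a quantization_config key in each transformer's config.json, as described for the merge script; the helper names are hypothetical and are not part of vLLM-Omni.

```python
import json
from pathlib import Path


def load_ignored_layers(model_root, subfolders=("transformer", "transformer_2")):
    """Return {subfolder: set of ignored layer prefixes} for each transformer found."""
    result = {}
    for sub in subfolders:
        cfg_path = Path(model_root) / sub / "config.json"
        if not cfg_path.is_file():
            continue  # single-transformer models have no transformer_2
        cfg = json.loads(cfg_path.read_text())
        # Assumed key name: "quantization_config" (injected by the merge script).
        quant_cfg = cfg.get("quantization_config", {})
        result[sub] = set(quant_cfg.get("ignored_layers", []))
    return result


def ignored_layer_diff(per_transformer):
    """Prefixes kept in BF16 by some transformers but not by all of them."""
    sets = list(per_transformer.values())
    if not sets:
        return set()
    return set.union(*sets) - set.intersection(*sets)
```

An empty result from ignored_layer_diff means both transformers fall back on the same layers; a non-empty result lists the prefixes for which routing differs per transformer.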
TI2V-5B not supported
Wan2.2-TI2V-5B is excluded from W4A4 quantization. Its smaller parameter count makes it significantly more sensitive to 4-bit quantization noise, resulting in unacceptable accuracy loss. Use MXFP8 for TI2V-5B.
Configuration¶
mxfp4 — Single-Scale Online Mode¶
Online mode requires no pre-processing. vLLM-Omni quantizes BF16 weights to MXFP4 at load time using npu_dynamic_mx_quant. A single block scale (float8_e8m0fnu, one per 32 K elements) is computed on the fly; no calibration mul_scale is available. Applies equally to single-transformer and cascade (A14B) models — both transformers in a cascade receive the same quantization config automatically.
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(model="<your-model>", quantization="mxfp4")
outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

```bash
# Single-transformer or cascade model: same command
python text_to_video.py --model <your-model> --quantization mxfp4

# Online serving
vllm serve <your-model> --omni --quantization mxfp4
```
mxfp4_dualscale — DualScale Online Mode¶
Online DualScale mode computes both fine and coarse scales on the fly from BF16 weights using npu_dynamic_dual_level_mx_quant. Applies equally to single-transformer and cascade (A14B) models. Compared to mxfp4 online, DualScale provides better quantization accuracy at higher compute cost.
The default configuration keeps the leading 5 transformer blocks in BF16 (num_bf16_fallback_layers=5). Accuracy evaluation on Wan2.2-A14B shows this is sufficient to meet quality requirements and is the recommended setting.
If accuracy debugging identifies additional precision-sensitive layers, they can be pinned to BF16 via the Python API:
```python
from vllm_omni import Omni

omni = Omni(
    model="<your-model>",
    quantization_config={
        "method": "mxfp4_dualscale",
        "ignored_layers": ["blocks.10.attn1.to_q"],  # explicit per-layer override
    },
)
```
BF16 fallback routing in online mode applies two rules in priority order:
1. ignored_layers (explicit per-layer override): any layer whose prefix matches is kept in BF16 regardless of block index.
2. num_bf16_fallback_layers (coarse leading-block rule): the first N transformer blocks (blocks.0 … blocks.N-1) fall back to BF16. Defaults to 5 (recommended).

Layers outside blocks.N.* (e.g. condition_embedder) are always quantized unless they appear in ignored_layers.
mxfp4_dualscale — DualScale Offline Mode (Recommended)¶
Offline mode loads a pre-quantized DualScale checkpoint from msModelSlim. A preprocessing step converts the raw quantized output to the diffusers format expected by vLLM-Omni and injects the quantization config into each transformer/config.json so that vLLM-Omni auto-detects the offline path without a --quantization flag.
BF16 fallback layers may be interleaved anywhere in the transformer — they are not restricted to leading blocks. The merge script detects them from quant_model_description.json and writes their prefixes into ignored_layers inside config.json. At runtime, each layer's prefix is matched against ignored_layers to decide BF16 vs. MXFP4 DualScale.
Checkpoint tensor layout¶
Each quantized linear layer stores four tensors:
| Tensor | Shape | dtype | Description |
|---|---|---|---|
| weight | (N, K) | float8_e4m3fn | FP4 packed (2 values per byte) |
| weight_scale | (N, K//32) | uint8 | Fine block scale (float8_e8m0fnu bit pattern) |
| weight_dual_scale | (N, K//512, 1) | float32 | Coarse block scale |
| mul_scale | (K,) | float32 | Per-input-channel smooth pre-scale (from calibration) |
BF16 fallback layers have no quantization tensors; only the original weight (and optional bias) are present, loaded directly from the base checkpoint.
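The shapes in the table follow directly from the layer dimensions, which makes them easy to sanity-check when inspecting a checkpoint. The helper below is a sketch under the assumption that K is a multiple of 512 (the coarse block size); its name and the error it raises are illustrative and not part of vLLM-Omni.

```python
def expected_tensor_shapes(n, k, fine_block=32, coarse_block=512):
    """Expected per-layer tensor shapes for an (N, K) DualScale linear layer."""
    if k % coarse_block != 0:
        # Assumption of this sketch: K divides evenly into coarse blocks.
        raise ValueError(f"K={k} is not a multiple of {coarse_block}")
    return {
        "weight": (n, k),                           # as listed in the table above
        "weight_scale": (n, k // fine_block),       # one fine scale per 32 K elements
        "weight_dual_scale": (n, k // coarse_block, 1),  # one coarse scale per 512
        "mul_scale": (k,),                          # one pre-scale per input channel
    }
```

For example, a layer with N = K = 5120 would be expected to carry a (5120, 160) weight_scale and a (5120, 10, 1) weight_dual_scale.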
Step 1 — Quantize with msModelSlim¶
```bash
msmodelslim quant \
    --model_path /path/to/Wan2.2-T2V-A14B-Diffusers \
    --save_path /path/to/wan2_2_t2v_quantized_raw \
    --device npu \
    --model_type Wan2_2 \
    --config_path /path/to/wan2_2_w4a4_mxfp4_dualscale.yaml \
    --trust_remote_code True
```
After this step, --save_path contains raw quantized safetensors files, scale files, and a metadata JSON (quant_model_description*.json).
For cascade MoE models (T2V-A14B, I2V-A14B), msModelSlim outputs two subdirectories: high_noise_model/ (transformer) and low_noise_model/ (transformer_2).
Step 2 — Preprocess with merge_mxfp4_dualscale_checkpoint.py¶
The script (vllm_omni/quantization/tools/merge_mxfp4_dualscale_checkpoint.py):
1. Copies the original diffusers model to --output-path (VAE, text encoder, scheduler, etc. are preserved).
2. Remaps tensor names from msModelSlim convention to diffusers convention and strips .linear. / .div. wrappers added by the quantization tool.
3. Overlays MXFP4 tensors (weight, fine/coarse scales, mul_scale) onto the BF16 base checkpoint. Non-quantized layers keep their original BF16 weights.
4. Detects all linear layers that remain in BF16 and writes their prefixes into ignored_layers in config.json.
5. Injects quantization_config so vLLM-Omni auto-detects offline MXFP4 DualScale.
For cascade MoE models, steps 2–5 run separately for each transformer.
```bash
python vllm_omni/quantization/tools/merge_mxfp4_dualscale_checkpoint.py \
    --model-type Wan2.2-T2V-A14B \
    --original-model /path/to/Wan2.2-T2V-A14B-Diffusers \
    --quant-path /path/to/wan2_2_t2v_quantized_raw \
    --output-path /path/to/Wan2.2-T2V-A14B-MXFP4-DualScale
```
| Argument | Description |
|---|---|
| --model-type | Model variant: Wan2.2-T2V-A14B or Wan2.2-I2V-A14B |
| --original-model | Root directory of the original BF16 diffusers model |
| --quant-path | Root directory of the msModelSlim quantized output |
| --output-path | Output directory for the merged model (created by the script) |
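The wrapper-stripping part of the name remapping performed by the script can be illustrated with a short sketch. The tensor names used here are made up for illustration, since the exact msModelSlim naming is not shown in this guide, and the helper is not the function used by the merge script.

```python
def strip_quant_wrappers(name, wrappers=("linear", "div")):
    """Drop wrapper components such as '.linear.' and '.div.' from a tensor name.

    Caveat of this sketch: any path component equal to a wrapper name is removed,
    so a model with a genuine sub-module called 'linear' would need a stricter rule.
    """
    parts = [part for part in name.split(".") if part not in wrappers]
    return ".".join(parts)


# Hypothetical example names, not taken from a real checkpoint.
print(strip_quant_wrappers("blocks.0.attn1.to_q.linear.weight"))
print(strip_quant_wrappers("blocks.0.attn1.to_q.div.mul_scale"))
```

A real merge script would apply this kind of rule together with any prefix fixes before overlaying the tensors onto the BF16 base checkpoint.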
The script outputs a complete diffusers model directory at --output-path, with each transformer subfolder containing:
- diffusion_pytorch_model.safetensors: MXFP4 weights + scale tensors, with BF16 fallback layers from the base checkpoint
- config.json: original transformer config with quantization_config injected
- quant_model_description.json: quantization metadata (reference only)
The quantization_config injected into config.json for each transformer:
```json
{
  "quant_method": "mxfp4_dualscale",
  "is_checkpoint_serialized": true,
  "ignored_layers": [
    "blocks.0.attn1.to_qkv",
    "blocks.0.attn1.to_out",
    "proj_out"
  ]
}
```
ignored_layers lists every linear layer that retains its original BF16 weight, using vllm-omni model parameter names (QKV-fused, FFN underscored, to_out unindexed). The exact entries are determined by the quantization tool (msModelSlim) and may differ between transformer and transformer_2 in a cascade model.
Step 3 — Serve¶
```bash
python text_to_video.py --model /path/to/Wan2.2-T2V-A14B-MXFP4-DualScale

# Online serving
vllm serve /path/to/Wan2.2-T2V-A14B-MXFP4-DualScale --omni
```
Note
No --quantization flag is needed for offline mode. The preprocessing script injects quantization_config into each transformer/config.json, which vLLM-Omni reads automatically to activate the correct offline path.
Parameters¶
mxfp4 (single-scale, online only)¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| method | str | — | "mxfp4" |
| ignored_layers | list[str] | [] | Layer prefixes to keep in BF16 |
mxfp4_dualscale (dual-scale, online + offline)¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| method | str | — | "mxfp4_dualscale" |
| is_checkpoint_serialized | bool | False | True for offline DualScale checkpoints; auto-set from config.json when using the preprocessing script |
| ignored_layers | list[str] | [] | Layer prefixes to keep in BF16. Works in both modes. Offline: populated by the merge script for interleaved sensitive layers. Online: user-supplied for explicit per-layer precision override |
| num_bf16_fallback_layers | int | 5 | Online mode only: leading N transformer blocks (blocks.0 … blocks.N-1) kept in BF16. Applied after ignored_layers; ignored in offline mode. Default of 5 is the evaluated recommended value for Wan2.2-A14B |
BF16 fallback priority (online mode)¶
```text
for each linear layer:
    if prefix in ignored_layers                 → BF16 (explicit override, highest priority)
    elif block_idx < num_bf16_fallback_layers   → BF16 (coarse leading-block rule)
    else                                        → MXFP4 DualScale online
```
Layers outside blocks.N.* (e.g. condition_embedder.*) are always quantized unless they appear in ignored_layers.
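The priority rule can also be written as a small runnable function. This is a sketch, not the vLLM-Omni implementation: it assumes an ignored_layers entry matches a layer when it equals the layer prefix or is a dotted parent of it, and it extracts the block index with a regular expression. The function name is hypothetical.

```python
import re

# Matches "blocks.<idx>." anywhere in a dotted layer prefix.
_BLOCK_RE = re.compile(r"(?:^|\.)blocks\.(\d+)\.")


def keeps_bf16(prefix, ignored_layers=(), num_bf16_fallback_layers=5):
    """Return True if the layer stays in BF16 under the online fallback rules."""
    # Rule 1 (highest priority): explicit per-layer override.
    for entry in ignored_layers:
        if prefix == entry or prefix.startswith(entry + "."):
            return True
    # Rule 2: coarse leading-block rule, only for layers inside blocks.N.*
    match = _BLOCK_RE.search(prefix)
    if match is None:
        return False  # e.g. condition_embedder.*: quantized unless ignored
    return int(match.group(1)) < num_bf16_fallback_layers
```

With the default of 5, keeps_bf16("blocks.4.attn1.to_qkv") is true and keeps_bf16("blocks.5.attn1.to_qkv") is false, while any layer listed in ignored_layers stays in BF16 regardless of its block index.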
Validation and Notes¶
- Online single-scale (mxfp4) quantizes BF16 weights at load time using npu_dynamic_mx_quant (single-scale). No calibration mul_scale is available, so all output partitions receive an identity pre-scale. No offline checkpoint format exists for this method.
- Online dual-scale (mxfp4_dualscale, is_checkpoint_serialized=False) quantizes BF16 weights using npu_dynamic_dual_level_mx_quant (fine + coarse scales computed on the fly). No calibration mul_scale; leading blocks or explicit ignored_layers stay in BF16 for accuracy.
- Offline dual-scale (mxfp4_dualscale, is_checkpoint_serialized=True), recommended for production, loads four tensors per quantized layer: FP4 weight, fine scale (uint8 reinterpreted as float8_e8m0fnu), coarse scale (float32), and per-input-channel mul_scale (float32). BF16 fallback layers have no quantization tensors and are routed via ignored_layers.
- Scale dtype: fine scales are stored as uint8 in safetensors (same bit layout as float8_e8m0fnu) and reinterpreted at load time without a lossy float32 round-trip.
- Cascade model config propagation: in a cascade model (transformer + transformer_2), vLLM-Omni reads each transformer's own config.json and rebuilds the quant config locally when ignored_layers differs between transformers, ensuring per-layer routing is accurate for each. The first transformer's config is propagated to od_config so the second transformer can reuse it as a starting point.
- Self-attention QKV fusion: Q, K, V projection weights are fused into a single QKVParallelLinear layer at runtime. ignored_layers entries use the fused name (attn1.to_qkv), written automatically by the merge script.
- W4A4 carries higher quantization noise than W8A8 (16 vs 256 levels). The DualScale offline method mitigates this with calibrated mul_scale smooth quantization. Use ignored_layers and num_bf16_fallback_layers to trade off compression vs. accuracy for precision-sensitive layers.
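As background for the scale dtype note above, an E8M0 byte encodes a pure power of two with an exponent bias of 127, and the bit pattern 0xFF is reserved for NaN. The sketch below decodes a single byte in plain Python purely for illustration; the runtime reinterprets the tensor dtype directly rather than converting values one by one, and the function name is not part of any library.

```python
import math

E8M0_BIAS = 127  # exponent bias defined by the OCP MX E8M0 scale format


def decode_e8m0(byte):
    """Decode one E8M0 scale byte (0..255) into its floating-point value."""
    if not 0 <= byte <= 255:
        raise ValueError("E8M0 scale must fit in one unsigned byte")
    if byte == 0xFF:
        return math.nan  # reserved NaN encoding
    return 2.0 ** (byte - E8M0_BIAS)
```

A stored byte of 127 therefore means a scale of 1.0, 128 means 2.0, and 126 means 0.5.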
Adapting MXFP4 for a New Model¶
This section is aimed at developers who want to add MXFP4 support to a model other than Wan2.2. The three integration points are: (1) discovering the correct runtime layer names, (2) wiring ignored_layers into the model, and (3) writing a merge script for offline checkpoints.
Step 1 — Discover runtime layer names¶
ignored_layers entries must match the runtime parameter names used inside vllm-omni, which may differ from the names stored in the diffusers checkpoint. The canonical source of truth is the model's own named_parameters().
```python
from vllm_omni import Omni

# Load the model without quantization to inspect parameter names.
omni = Omni(model="/path/to/your-model")  # no quantization argument

for name, _ in omni.pipeline.transformer.named_parameters():
    if "weight" in name and "scale" not in name:
        print(name)
```
Compare the printed names against the diffusers checkpoint keys (safetensors.safe_open or torch.load) to identify any renames your model applies. Common patterns that differ in Wan2.2 (and may appear in other models):
| Diffusers checkpoint name | vllm-omni runtime name | Reason |
|---|---|---|
| attn1.to_q, attn1.to_k, attn1.to_v | attn1.to_qkv | Self-attention Q/K/V fused into QKVParallelLinear |
| ffn.net.0.proj | ffn.net_0.proj | Dots in sub-module names replaced with underscores |
| ffn.net.2 | ffn.net_2 | Same underscore rule |
| to_out.0 | to_out | Sequential index stripped |
If your model has different fusion patterns, inspect packed_modules_mapping on the model class — this dict records how checkpoint keys are mapped to fused runtime parameters.
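The Wan2.2 renames in the table above can be expressed as a few regular-expression substitutions. The function below is a sketch written for this guide, not the translator used by the merge script. It operates on layer prefixes (without the trailing .weight) and leaves any pattern not listed in the table unchanged, including cross-attention projections, whose handling is not covered here.

```python
import re


def diffusers_to_runtime(prefix):
    """Translate a diffusers layer prefix to the vllm-omni runtime name (Wan2.2 rules)."""
    # Self-attention Q/K/V are fused into one QKVParallelLinear.
    prefix = re.sub(r"\battn1\.to_[qkv]$", "attn1.to_qkv", prefix)
    # FFN sub-modules: "ffn.net.<i>" becomes "ffn.net_<i>".
    prefix = re.sub(r"\bffn\.net\.(\d+)", r"ffn.net_\1", prefix)
    # Sequential index on the output projection is stripped.
    prefix = re.sub(r"\bto_out\.0$", "to_out", prefix)
    return prefix


for name in ["blocks.0.attn1.to_k", "blocks.0.ffn.net.0.proj", "blocks.0.attn1.to_out.0"]:
    print(name, "->", diffusers_to_runtime(name))
```

Because Q, K and V map to the same fused name, translated results should be de-duplicated before being written to ignored_layers.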
Partial QKV fallback is not allowed
If your model fuses Q, K, V into a single layer, ignored_layers must include all three or none. A partial fallback (e.g. to_q in BF16 but to_k, to_v quantized) cannot be expressed at runtime because they share one QKVParallelLinear. The merge script enforces this and raises an error if only some of the trio appear as non-quantized.
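An all-or-none check of this kind could look like the sketch below. It assumes Wan2.2-style diffusers names (attn1.to_q, attn1.to_k, attn1.to_v) and is an illustration of the rule rather than the code used by the merge script; the function name and error message are hypothetical.

```python
from collections import defaultdict

_QKV = {"to_q", "to_k", "to_v"}


def check_qkv_fallback(bf16_prefixes):
    """Return fused to_qkv names for BF16 self-attention layers, or raise on a partial trio."""
    groups = defaultdict(set)
    for prefix in bf16_prefixes:
        parent, _, leaf = prefix.rpartition(".")
        if parent.endswith("attn1") and leaf in _QKV:
            groups[parent].add(leaf)
    fused = []
    for parent, leaves in sorted(groups.items()):
        if leaves != _QKV:
            missing = ", ".join(sorted(_QKV - leaves))
            raise ValueError(f"partial QKV fallback in {parent}: missing {missing}")
        fused.append(parent + ".to_qkv")
    return fused
```

Prefixes that are not part of a self-attention Q/K/V trio (for example proj_out) are ignored by this check and would be translated separately.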
Step 2 — Add ignored_layers to the model¶
Online mode¶
Pass ignored_layers directly in the quantization config using the runtime names discovered in Step 1. No code changes to the model are required.
```python
from vllm_omni import Omni

omni = Omni(
    model="/path/to/your-model",
    quantization_config={
        "method": "mxfp4_dualscale",
        "ignored_layers": [
            "blocks.0.attn1.to_qkv",  # runtime name, not diffusers name
            "blocks.0.attn1.to_out",
            "blocks.0.ffn.net_0.proj",
        ],
    },
)
```
```bash
# CLI does not support list-typed ignored_layers directly.
# Use the Python API or set ignored_layers in config.json (offline).
python your_script.py --model /path/to/your-model --quantization mxfp4_dualscale
```
The num_bf16_fallback_layers coarse rule is an alternative to listing layers individually: set it to N to keep all linear layers in blocks 0 … N-1 in BF16. The right value depends on the model's sensitivity; evaluate on a validation set and pick the smallest N that meets your accuracy target.
Offline mode¶
For offline checkpoints, ignored_layers is written into each transformer's config.json by the merge script (see Step 3). No manual editing is needed if the merge script is correct. The injected block:
```json
{
  "quant_method": "mxfp4_dualscale",
  "is_checkpoint_serialized": true,
  "ignored_layers": [
    "blocks.0.attn1.to_qkv",
    "blocks.0.attn1.to_out"
  ]
}
```
To add a layer manually (e.g. to pin an additional layer to BF16 without re-running the merge script), edit config.json inside the transformer subfolder. Use runtime names, not diffusers checkpoint names.
Step 3 — Write a merge script for offline mode¶
The merge script for a new model mirrors vllm_omni/quantization/tools/merge_mxfp4_dualscale_checkpoint.py. The four things it must do:
1. Remap tensor names from the quantization tool convention to diffusers convention (strip wrappers like .linear., .div.; fix any prefix differences).
2. Collect ignored_layers: after loading, enumerate all *.weight keys that have no corresponding *.weight_scale (i.e. layers the tool left in BF16). Convert diffusers names to vllm-omni runtime names (fuse QKV, rename FFN sub-modules, etc.). Write the result to config.json.
3. Inject quantization_config into config.json, using the same fields as the block shown under Offline mode in Step 2 (quant_method, is_checkpoint_serialized, ignored_layers).
4. Save the merged safetensors and the updated config.json.
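The collection rule in the second item (a weight key with no matching weight_scale) can be sketched over a plain list of tensor keys. This is an illustrative helper, not the merge script's code. Note that it also reports non-linear parameters such as normalization weights, so a real script would additionally restrict the result to linear layers.

```python
def find_bf16_layers(tensor_keys):
    """Return prefixes whose '.weight' has no matching '.weight_scale' key."""
    keys = set(tensor_keys)
    bf16_prefixes = []
    for key in sorted(keys):
        if not key.endswith(".weight"):
            continue  # skips scales, biases and mul_scale tensors
        prefix = key[: -len(".weight")]
        if prefix + ".weight_scale" not in keys:
            bf16_prefixes.append(prefix)
    return bf16_prefixes


example_keys = [
    "blocks.0.attn1.to_out.0.weight",   # BF16 fallback: no scale tensors
    "blocks.1.ffn.net.2.weight",        # quantized layer
    "blocks.1.ffn.net.2.weight_scale",
    "blocks.1.ffn.net.2.weight_dual_scale",
    "blocks.1.ffn.net.2.mul_scale",
]
print(find_bf16_layers(example_keys))
```

The returned prefixes are still diffusers names and must be translated to runtime names before they are written to ignored_layers.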
The key helper to implement is the diffusers-to-runtime name translator (equivalent to _diffusers_to_vllm_ignored in the Wan2.2 merge script). For each non-quantized diffusers weight key, apply your model's specific renaming rules and collect the results.
Validate before serving
After producing the offline checkpoint, load it without a --quantization flag and verify that vLLM-Omni auto-detects the correct method. Check that the layer count reported in the startup log matches expectations: quantized layer count + ignored_layers count should equal total linear layer count. Any mismatch indicates a name-mapping bug in the merge script.