ModelOpt Quantization
Overview
ModelOpt quantization loads checkpoints produced by NVIDIA ModelOpt. The quantized weights and scale tensors are generated before serving, so inference does not run online calibration or convert a BF16 checkpoint at startup.
vLLM-Omni validates ModelOpt FP8, ModelOpt NVFP4, and ModelOpt mixed FP8/NVFP4 checkpoint loading for diffusion transformer stages. The loader auto-detects supported ModelOpt checkpoint configs and keeps non-transformer components, such as the tokenizer, scheduler, text encoder, vision/audio encoder, and VAE, on the base checkpoint unless a model-specific guide says otherwise.
Note
ModelOpt checkpoints are pre-quantized checkpoints. Do not pass --quantization fp8 for these checkpoints. The checkpoint quantization_config selects the ModelOpt path.
Note
--force-cutlass-fp8, --linear-backend cutlass, and --moe-backend cutlass are runtime backend selections for checkpoints that already carry supported ModelOpt quantized weights and scales. They do not quantize BF16 checkpoints at startup.
Supported ModelOpt Checkpoint Formats
vLLM-Omni treats ModelOpt checkpoints as pre-quantized checkpoints. The checkpoint config must identify ModelOpt as the quantization method or producer, and the quantization algorithm must be one of the validated algorithms below.
| Checkpoint field | Supported value |
|---|---|
| method / quant_method | modelopt, modelopt_fp4, modelopt_mixed |
| producer.name | modelopt |
| quant_algo | FP8, FP8_PER_CHANNEL_PER_TOKEN, NVFP4, MIXED_PRECISION |

| quant_algo | Runtime method | Typical use |
|---|---|---|
| FP8, FP8_PER_CHANNEL_PER_TOKEN | modelopt | FP8 diffusion transformer checkpoints |
| NVFP4 | modelopt_fp4 | NVFP4 diffusion transformer checkpoints |
| MIXED_PRECISION | modelopt_mixed | Mixed FP8/NVFP4 checkpoints with a ModelOpt per-layer policy |
For multi-component diffusion or omni models, only the checkpoint component that contains ModelOpt quantized weights should use the ModelOpt quantization method. Encoders, decoders, tokenizers, schedulers, and other BF16 components stay unquantized unless the model-specific recipe validates otherwise.
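The two tables above amount to a small lookup from checkpoint fields to a runtime method. The sketch below illustrates that lookup only; it is not the vLLM-Omni loader. The function name `resolve_runtime_method` and the flat dictionary layout are assumptions made for illustration, and a real checkpoint config may nest these fields differently.

```python
from typing import Optional

# Mapping copied from the quant_algo table in this guide.
ALGO_TO_METHOD = {
    "FP8": "modelopt",
    "FP8_PER_CHANNEL_PER_TOKEN": "modelopt",
    "NVFP4": "modelopt_fp4",
    "MIXED_PRECISION": "modelopt_mixed",
}
SUPPORTED_METHODS = {"modelopt", "modelopt_fp4", "modelopt_mixed"}


def resolve_runtime_method(quant_config: dict) -> Optional[str]:
    """Hypothetical helper: return the runtime method, or None if unsupported.

    Assumes a flat dict with optional keys "method"/"quant_method",
    "producer" (a dict with "name"), and "quant_algo".
    """
    method = quant_config.get("quant_method") or quant_config.get("method")
    producer = (quant_config.get("producer") or {}).get("name")
    # The config must identify ModelOpt as the method or the producer.
    if method not in SUPPORTED_METHODS and producer != "modelopt":
        return None
    algo = quant_config.get("quant_algo")
    if not algo:
        return None
    # Unknown algorithms fall through to None rather than guessing.
    return ALGO_TO_METHOD.get(str(algo).upper())


if __name__ == "__main__":
    print(resolve_runtime_method({"quant_method": "modelopt", "quant_algo": "NVFP4"}))
```

Returning `None` for anything outside the validated set mirrors the guide's rule that only the listed algorithms are supported.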
Hardware Support
| Device | Support |
|---|---|
| NVIDIA Blackwell GPU (SM 100+) | ✅ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ✅ |
| NVIDIA Ampere GPU (SM 80+) | ⭕ |
| AMD ROCm | ⭕ |
| Intel XPU | ⭕ |
| Ascend NPU | ❌ |
Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide. The optional CUTLASS FP8 runtime override requires CUDA SM89+. ModelOpt NVFP4 and mixed FP8/NVFP4 diffusion checkpoints are currently validated on Blackwell CUDA systems in the recipes below; other CUDA generations require separate backend and quality validation.
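The thresholds in the legend can be expressed as a check on the CUDA compute capability. This is a minimal sketch: the function name and the returned keys are hypothetical, and the `(major, minor)` pair is assumed to come from something like `torch.cuda.get_device_capability()`. It encodes only what this guide states: the CUTLASS FP8 override needs SM89+, and the NVFP4 and mixed recipes are validated on Blackwell (SM 100+).

```python
def modelopt_hardware_status(major: int, minor: int) -> dict:
    """Hypothetical helper mapping a CUDA compute capability to this guide's thresholds."""
    sm = major * 10 + minor  # e.g. (8, 9) -> 89, (10, 0) -> 100
    return {
        # Optional CUTLASS FP8 runtime override requires CUDA SM89+.
        "cutlass_fp8_override": sm >= 89,
        # NVFP4 and mixed FP8/NVFP4 recipes are validated on Blackwell only.
        "nvfp4_validated": sm >= 100,
    }


if __name__ == "__main__":
    for capability in [(8, 0), (8, 9), (9, 0), (10, 0)]:
        print(capability, modelopt_hardware_status(*capability))
```

A `False` for `nvfp4_validated` means "not validated in this guide", not "known to fail"; other CUDA generations still need their own backend and quality validation.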
Model Type Support
Diffusion Model
| Model | HF checkpoint | Scope | Status |
|---|---|---|---|
| Qwen-Image 2512 | feizhai123/qwen-image-2512-modelopt-fp8-dynamic-all | Diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| Qwen-Image 2512 | feizhai123/qwen-image-2512-modelopt-mixed-fp8-sensitive-nvfp4-heavy | Diffusion transformer | Validated for ModelOpt mixed FP8/NVFP4 checkpoints |
| Z-Image | feizhai123/z-image-modelopt-fp8-conservative | Diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| FLUX.2-dev | feizhai123/flux2-dev-modelopt-fp8 | Diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| FLUX.2-klein 4B | feizhai123/flux2-klein-4b-modelopt-fp8 | Diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| HunyuanImage-3.0 | feizhai123/hunyuan-image3-modelopt-fp8 | MoE diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| HunyuanImage-3.0 | feizhai123/hunyuan-image3-modelopt-mixed-experts-nvfp4-dense-fp8 | MoE diffusion transformer | Validated for ModelOpt mixed FP8/NVFP4 checkpoints |
| Wan2.2 | Not available | Diffusion transformer | Not validated |
For full serving commands and benchmark context, see recipes/Qwen/Qwen-Image.md and recipes/Tencent/HunyuanImage-3.0-Instruct.md.
Multi-Stage Omni/TTS Model
| Model | Scope | Status |
|---|---|---|
| Qwen3-Omni | Thinker language-model stage | ModelOpt FP8 checkpoint path |
| Qwen3-Omni | Thinker language-model stage (W4A4 NVFP4) | Validated; see Qwen3-Omni NVFP4 W4A4 below |
| Qwen3-TTS | TTS language-model stage | Not validated |
Audio encoder, vision encoder, talker, and code2wav stages stay in BF16 unless a model-specific guide documents otherwise.
Qwen3-Omni NVFP4 W4A4 (thinker)
vLLM-Omni serves ModelOpt NVFP4 W4A4 quantizations of the Qwen3-Omni-30B-A3B-Instruct thinker language model. The thinker text body (attention + MoE experts) is quantized to NVFP4 with FP8 per-tensor input scales; the audio encoder, vision encoder, talker, and code2wav stay in BF16.
| Variant | HF checkpoint | Hardware |
|---|---|---|
| W4A4 NVFP4 (full thinker) | YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip | sm_100+ (Blackwell, FlashInfer FP4 GEMM) |
Calibration uses ModelOpt mtq.NVFP4_DEFAULT_CFG with the awq_clip algorithm on 1024 ultrachat samples chat-templated through the Qwen3-Omni tokenizer. Excluded modules: *audio_tower*, *visual*, *talker*, *code2wav*, *lm_head*, *mlp.gate*. See scripts/nvfp4/calibrate.py for the reference recipe.
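To see which modules the exclusion list above would leave unquantized, the patterns can be tried against module names with shell-style wildcard matching. This sketch assumes `fnmatch` semantics for the patterns, and the module names in the usage lines are hypothetical examples rather than names read from the checkpoint. It does not call ModelOpt or reproduce scripts/nvfp4/calibrate.py.

```python
from fnmatch import fnmatchcase

# Exclusion patterns as listed in this guide.
EXCLUDED_PATTERNS = [
    "*audio_tower*",
    "*visual*",
    "*talker*",
    "*code2wav*",
    "*lm_head*",
    "*mlp.gate*",
]


def is_excluded(module_name: str) -> bool:
    """Return True if any exclusion pattern matches the module name."""
    return any(fnmatchcase(module_name, pattern) for pattern in EXCLUDED_PATTERNS)


if __name__ == "__main__":
    # Hypothetical module names, for illustration only.
    for name in [
        "thinker.audio_tower.layers.0.fc1",
        "thinker.model.layers.0.mlp.gate",
        "thinker.model.layers.0.mlp.experts.3.gate_proj",
        "thinker.model.layers.0.self_attn.q_proj",
    ]:
        print(name, "excluded" if is_excluded(name) else "quantized")
```

Under these semantics `*mlp.gate*` matches the MoE router (`mlp.gate`) but not an expert projection such as `mlp.experts.3.gate_proj`, which is consistent with the experts being quantized.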
ModelOpt 0.44 NaN regression workaround
ModelOpt 0.44's float32 -> FP8 E4M3 cast of per-block weight scales occasionally emits literal NaN bytes (E4M3 encoding 0x7F / 0xFF) for blocks whose pre-cast scale rounds above the FP8 max of 448 after the global-scale division. A single NaN byte in any weight_scale propagates through the FP4 GEMM and collapses the served model output to !!!!. vLLM-Omni's vllm_omni.patch installs a defensive override of ModelOptNvFp4LinearMethod.process_weights_after_loading that clamps these bytes to the FP8 E4M3 max at load time. The override self-extinguishes once vllm-omni's vllm pin moves to a release containing the upstream fix. Set VLLM_OMNI_SKIP_NVFP4_NAN_CLAMP=1 to disable the override for diagnostics.
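The byte-level effect of such a clamp can be sketched in plain Python. This is not the code in vllm_omni.patch; the function names are hypothetical, and preserving the sign bit when clamping is an assumption of this sketch. In FP8 E4M3, NaN has all exponent and mantissa bits set (0x7F or 0xFF), and the largest finite value, 448, is one mantissa step below it (0x7E or 0xFE).

```python
E4M3_NAN_MAGNITUDE = 0x7F  # exponent 1111, mantissa 111, either sign


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (the variant with no infinities) to a float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")
    if exponent == 0:
        # Subnormal: no implicit leading one, fixed exponent of -6.
        return sign * (mantissa / 8) * 2.0 ** -6
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)


def clamp_nan_scales(raw: bytes) -> bytes:
    """Replace NaN bytes with the largest finite E4M3 value of the same sign.

    Clearing the lowest mantissa bit turns 0x7F into 0x7E and 0xFF into 0xFE.
    """
    return bytes(
        (b & 0xFE) if (b & 0x7F) == E4M3_NAN_MAGNITUDE else b for b in raw
    )


if __name__ == "__main__":
    scales = bytes([0x38, 0x7F, 0xFF])
    fixed = clamp_nan_scales(scales)
    print([decode_e4m3(b) for b in scales])
    print([decode_e4m3(b) for b in fixed])
```

Bytes that are already finite pass through unchanged, so applying a clamp like this to a healthy checkpoint is a no-op.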
Serving:
vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip \
--omni --port 8000
Do not pass --enforce-eager for production or benchmarks. CUDA graphs amortize launch overhead and unlock the FP4 throughput gains; with --enforce-eager set, W4A4 TPOT degrades ~10x relative to the CUDA-graph configuration.
Multi-Stage Diffusion Model
ModelOpt checkpoints must be routed to the stage whose checkpoint contains the ModelOpt quantization_config. BAGEL and GLM-Image are not listed as validated ModelOpt targets yet.
Configuration
For pre-quantized ModelOpt checkpoints, no --quantization fp8 flag is needed. The checkpoint config selects the ModelOpt path.
Online serving:
vllm serve <modelopt-checkpoint> \
--omni \
--tensor-parallel-size <N> \
--linear-backend cutlass \
--force-cutlass-fp8
For mixed FP8/NVFP4 MoE checkpoints, also select the validated MoE backend:
vllm serve <modelopt-mixed-moe-checkpoint> \
--omni \
--tensor-parallel-size <N> \
--enable-expert-parallel \
--linear-backend cutlass \
--moe-backend cutlass \
--force-cutlass-fp8
Offline inference:
python examples/offline_inference/text_to_image/text_to_image.py \
--model <modelopt-checkpoint> \
--tensor-parallel-size <N> \
--prompt "a red ceramic teapot on a wooden table" \
--height 1024 \
--width 1024 \
--num-inference-steps 20 \
--seed 42 \
--output outputs/modelopt.png
Python API:
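from vllm_omni import Omni

# No quantization argument is passed: the checkpoint's quantization_config
# selects the ModelOpt path.
omni = Omni(
    model="<modelopt-checkpoint>",
    tensor_parallel_size=2,
    force_cutlass_fp8=True,  # optional CUTLASS FP8 override, CUDA SM89+
)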
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| force_cutlass_fp8 / --force-cutlass-fp8 | bool | False | Force CUTLASS FP8 linear kernels for supported ModelOpt FP8 diffusion stages on CUDA SM89+ |
| --linear-backend cutlass | str | auto | Select the validated CUTLASS linear backend for supported ModelOpt NVFP4 or mixed FP8/NVFP4 diffusion stages |
| --moe-backend cutlass | str | auto | Select the validated CUTLASS MoE backend for supported ModelOpt mixed MoE checkpoints |
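When launching from a script, the flag combinations in this section can be assembled programmatically. The sketch below uses only flags that appear in this guide; the function `build_serve_command` and its arguments are hypothetical and not part of vLLM-Omni. It deliberately never emits `--quantization`, because the checkpoint config selects the ModelOpt path.

```python
import shlex


def build_serve_command(checkpoint: str, tensor_parallel_size: int,
                        mixed_moe: bool = False) -> list:
    """Hypothetical helper: build the `vllm serve` argv for a ModelOpt checkpoint."""
    argv = [
        "vllm", "serve", checkpoint,
        "--omni",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if mixed_moe:
        # Mixed FP8/NVFP4 MoE checkpoints also enable expert parallelism.
        argv.append("--enable-expert-parallel")
    argv += ["--linear-backend", "cutlass"]
    if mixed_moe:
        argv += ["--moe-backend", "cutlass"]
    argv.append("--force-cutlass-fp8")
    return argv


if __name__ == "__main__":
    cmd = build_serve_command("feizhai123/flux2-dev-modelopt-fp8", 2)
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` as-is; `shlex.join` is used here only to print a copy-pasteable command line.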
Validation and Notes
- Compare the ModelOpt checkpoint against the BF16 baseline with the same prompt, resolution, seed, and inference steps.
- Use tests/diffusion/quantization/test_quantization_quality.py with VLLM_OMNI_QUALITY_CONFIGS to validate local baseline and quantized model paths.
- For HunyuanImage-3.0 quantized DiT checkpoints, the opt-in accuracy check is:
CUDA_VISIBLE_DEVICES=2,3 \
HUNYUAN_IMAGE3_RUN_QUANT_ACCURACY=1 \
HUNYUAN_IMAGE3_QUANT_DEVICES=0,1 \
HUNYUAN_IMAGE3_QUANT_TP=2 \
HUNYUAN_IMAGE3_BF16_MODEL=/path/to/hunyuan-image3-bf16 \
HUNYUAN_IMAGE3_FP8_MODEL=/path/to/hunyuan-image3-modelopt-fp8 \
HUNYUAN_IMAGE3_NVFP4_MODEL=/path/to/hunyuan-image3-modelopt-mixed-experts-nvfp4-dense-fp8 \
PYTHONPATH=/path/to/vllm-omni:${PYTHONPATH:-} \
python -m pytest -s -v \
tests/e2e/accuracy/test_hunyuan_image3.py \
-k quantized_dit_matches_bf16_accuracy
- Report CLIP score deltas, SSIM, PSNR, throughput, latency, and peak memory when adding a new validated ModelOpt diffusion checkpoint.
- Keep --quantization fp8 for online FP8 from BF16 checkpoints; use this ModelOpt path only when the checkpoint already contains ModelOpt quantized weights and scales.
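Of the metrics listed above, PSNR is the simplest to compute by hand when comparing a quantized output against its BF16 baseline. The following is a minimal sketch over flat sequences of 8-bit pixel values; it is not the implementation used by the repository's tests, and the function name and the 255 peak value are assumptions.

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between baseline and quantized pixels.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    baseline = [0, 64, 128, 255]
    quantized = [0, 66, 127, 250]
    print(f"PSNR: {psnr(baseline, quantized):.2f} dB")
```

Both images must come from the same prompt, resolution, seed, and inference steps, otherwise the score reflects sampling differences rather than quantization error.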