Quantization¶

vLLM-Omni exposes quantization through the unified quantization_config path. The same configuration entrypoint is used across diffusion-only models, multi-stage omni/TTS models, and multi-stage diffusion models, but each model type has a different quantization scope.

Quantization Modes¶

Mode	Guide	Description	Methods
Online quantization	Online Quantization	vLLM-Omni computes quantized weights and scales while loading the model.	FP8 W8A8, Int8 W8A8, MXFP8 W8A8, MXFP4 W4A4
Runtime attention quantization	Quantized KV Cache	vLLM-Omni dynamically quantizes eligible diffusion Flash Attention tensors during inference.	FP8 FA
Pre-quantized checkpoints	Method-specific guides	The checkpoint or an offline quantizer provides quantized weights and scales before serving.	ModelOpt, GGUF, AutoRound, msModelSlim, serialized Int8, offline MXFP8, offline MXFP4 DualScale

Hardware Support¶

Device	FP8 W8A8	Int8 W8A8	ModelOpt	MXFP8 W8A8	MXFP4 W4A4	GGUF	AutoRound	msModelSlim
NVIDIA Blackwell GPU (SM 100+)	✅	✅	✅	⭕	⭕	✅	✅	❌
NVIDIA Ada/Hopper GPU (SM 89+)	✅	✅	✅	⭕	⭕	✅	✅	❌
NVIDIA Ampere GPU (SM 80+)	✅	✅	⭕	⭕	⭕	✅	✅	❌
AMD ROCm	⭕	⭕	⭕	⭕	✅	⭕	⭕	❌
Intel XPU	⭕	⭕	⭕	⭕	⭕	⭕	✅	❌
Ascend NPU	❌	✅	❌	✅	✅	❌	❌	✅

Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide. FP8 on Ampere may use a weight-only path where available.

Model Type Support¶

Diffusion Model (Qwen-Image, Wan2.2)¶

These models run a diffusion transformer as the primary inference module. The default quantization target is the transformer; tokenizer, scheduler, text encoder, and VAE stay on the base checkpoint unless a method guide says otherwise.

Method	Guide	Mode	Example models	Status
FP8 W8A8	FP8	Online W8A8 or checkpoint FP8	Qwen-Image; Wan2.2 is not validated	Validated for Qwen-Image family and other DiT models
Int8 W8A8	Int8	Online or serialized W8A8	Qwen-Image; Wan2.2 is not validated	Validated for Qwen-Image and Z-Image
ModelOpt	ModelOpt	Pre-quantized FP8 checkpoints	Qwen-Image, Z-Image, FLUX.2, HunyuanImage-3.0	Validated for ModelOpt FP8 diffusion checkpoints
MXFP8 W8A8	MXFP8	Online W8A8 or offline pre-quantized	Wan2.2-T2V-A14B, I2V-A14B, TI2V-5B	Ascend NPU only; validated for Wan2.2
MXFP4 W4A4	MXFP4	`mxfp4`: online single-scale only; `mxfp4_dualscale`: online or offline dual-scale (offline recommended)	Wan2.2-T2V-A14B, I2V-A14B	Ascend NPU only; validated for Wan2.2 A14B cascade models; TI2V-5B not supported; offline `mxfp4_dualscale` uses calibrated `mul_scale` for best accuracy
GGUF	GGUF	Pre-quantized transformer weights	Qwen-Image	Validated where a model-specific GGUF adapter exists
AutoRound	AutoRound	Pre-quantized W4A16 checkpoints	FLUX.1-dev; Qwen-Image/Wan2.2 not validated	Checkpoint-driven
msModelSlim	msModelSlim	Pre-quantized Ascend checkpoints	Wan2.2 recipe; HunyuanImage-3.0 inference target	Ascend/NPU path

Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)¶

These models combine an AR language model with audio, vision, talker, or TTS stages. Quantization is scoped to the AR language-model stage when the checkpoint contains a supported quantization_config; the non-AR stages stay in BF16 unless the model guide explicitly adds support.

Method	Guide	Scope	Example models	Status
ModelOpt	ModelOpt	Thinker or language-model checkpoint config	Qwen3-Omni thinker	ModelOpt checkpoint path
Int8	Int8	Not currently validated for omni/TTS stages	Qwen3-Omni, Qwen3-TTS	Not validated
MXFP8	MXFP8	Not currently validated for omni/TTS stages	Qwen3-Omni, Qwen3-TTS	Not validated
MXFP4	MXFP4	Not currently validated for omni/TTS stages	Qwen3-Omni, Qwen3-TTS	Not validated
GGUF	GGUF	Not currently validated for omni/TTS stages	Qwen3-Omni, Qwen3-TTS	Not validated
AutoRound	AutoRound	Thinker or language-model checkpoint config	Qwen2.5-Omni, Qwen3-Omni	Supported through AutoRound checkpoints
msModelSlim	msModelSlim	Not currently validated for omni/TTS stages	Qwen3-Omni, Qwen3-TTS	Not validated

Multi-Stage Diffusion Model (BAGEL, GLM-Image)¶

These models split generation across multiple stages. Quantization must be attached to the intended stage rather than applied globally.

Method	Guide	Scope	Example models	Status
FP8	FP8	Stage-specific DiT or transformer module	BAGEL, GLM-Image	Requires model-specific validation
Int8	Int8	Stage-specific DiT or transformer module	BAGEL, GLM-Image	Requires model-specific validation
ModelOpt	ModelOpt	Checkpoint-defined diffusion stage	BAGEL, GLM-Image	Requires model-specific validation
MXFP8	MXFP8	Stage-specific DiT or transformer module	BAGEL, GLM-Image	Not validated
MXFP4	MXFP4	Stage-specific DiT or transformer module	BAGEL, GLM-Image	Not validated
GGUF	GGUF	Stage-specific transformer weights	BAGEL, GLM-Image	No validated adapter listed
AutoRound	AutoRound	Checkpoint-defined stage	BAGEL, GLM-Image	No validated checkpoint listed
msModelSlim	msModelSlim	Ascend-generated stage weights	GLM-Image	Requires model-specific adaptation

Note

"Online quantization" means vLLM-Omni computes the quantization data while loading the model. "Pre-quantized" means the checkpoint or external quantizer provides the required quantized weights and scales.

Quantization Scope¶

Diffusion Model (Qwen-Image, Wan2.2)¶

The default target is the diffusion transformer. Component routing is available through build_quant_config():

from vllm_omni.quantization import build_quant_config

config = build_quant_config({
    "transformer": {"method": "fp8"},
    "vae": None,
})

Component	Default quantized?	Notes
Diffusion transformer	Yes	Primary target for FP8, Int8, ModelOpt, MXFP8, MXFP4, GGUF, AutoRound, and msModelSlim
Text encoder	No	Keep BF16 unless a method-specific guide documents support
VAE	No	Keep BF16; storage-only paths are method-specific
Scheduler/tokenizer	No	Loaded from the base model repository

Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)¶

Component	Default quantized?	Notes
Thinker or AR language model	Yes, when checkpoint config is supported	ModelOpt FP8/NVFP4 or AutoRound checkpoint config
Audio encoder	No	BF16
Vision encoder	No	BF16
Talker or TTS stage	No	BF16 unless model-specific support is documented
Code2Wav	No	BF16

Multi-Stage Diffusion Model (BAGEL, GLM-Image)¶

Component	Default quantized?	Notes
Selected diffusion or transformer stage	Method-specific	Must be routed to the intended stage
Other generation stages	No	Keep BF16 unless separately validated
VAE, tokenizer, scheduler	No	Loaded from the base checkpoint

Python API¶

build_quant_config() accepts strings, dictionaries, per-component dictionaries, existing QuantizationConfig objects, or None.

from vllm_omni.quantization import build_quant_config

build_quant_config("fp8")
build_quant_config({"method": "fp8", "activation_scheme": "static"})
build_quant_config("auto-round", bits=4, group_size=128)
build_quant_config({"method": "gguf", "gguf_model": "/path/to/model.gguf"})
build_quant_config({"transformer": {"method": "fp8"}, "vae": None})
build_quant_config(None)

Output Similarity Comparison Tool¶

Use vllm_omni.quantization.tools.compare_diffusion_trajectory_similarity to compare a reference diffusion run with a quantized candidate run using the same prompt, seed, resolution, scheduler settings, and inference steps. The tool compares final decoded images or video frames, and also reports generation latency and worker-reported peak memory when available.

This is useful when validating whether online quantization, an offline pre-quantized checkpoint, or a new ignored_layers choice keeps generation quality close to the BF16 reference.

Online Quantization Example¶

python -m vllm_omni.quantization.tools.compare_diffusion_trajectory_similarity \
  --task t2i \
  --model Qwen/Qwen-Image \
  --candidate-quantization fp8 \
  --ignored-layers img_mlp \
  --prompt "a cup of coffee on the table" \
  --height 512 --width 512 \
  --num-inference-steps 20 \
  --seed 142 \
  --output-json /tmp/qwen_image_fp8_similarity/result.json \
  --save-output-dir /tmp/qwen_image_fp8_similarity/images \
  --enforce-eager

Offline Checkpoint Example¶

Use --candidate-model when the candidate is already quantized or lives at a different model path:

python -m vllm_omni.quantization.tools.compare_diffusion_trajectory_similarity \
  --task t2i \
  --reference-model Qwen/Qwen-Image \
  --candidate-model /path/to/qwen-image-fp8-checkpoint \
  --prompt "a cup of coffee on the table" \
  --height 512 --width 512 \
  --num-inference-steps 20 \
  --seed 142 \
  --output-json /tmp/qwen_image_fp8_checkpoint_similarity/result.json

If the checkpoint does not include a loadable quantization config, pass one explicitly:

--candidate-quantization-config-json '{"method":"fp8"}'

Output Metrics¶

The output JSON includes output_metrics, reference_generation, and candidate_generation.

Metric	Direction	Meaning
`cosine_similarity`	Higher is better	Vector direction similarity between output pixels or frames. Useful as a broad sanity check.
`mae`	Lower is better	Mean absolute pixel or frame error. For decoded outputs, values are in uint8 pixel units.
`mse` / `rmse`	Lower is better	Squared error and its square root. These penalize localized large differences more than `mae`.
`max_abs`	Lower is better	Worst single-element absolute error. Treat it as an outlier/debug signal, not as a release gate.
`l2` / `relative_l2`	Lower is better	Absolute and reference-normalized L2 distance. `relative_l2` is easier to compare across resolutions.
`psnr_db`	Higher is better	Pixel-space signal-to-noise ratio in dB for uint8 images or frames.
`avg_generation_time_s`	Lower is better	Average wall-clock generation time across measured runs.
`max_peak_memory_mb`	Lower is better	Maximum worker-reported peak device memory across measured runs, when the worker reports it.

Recommended starting thresholds for same-seed diffusion comparisons:

Metric	Smoke threshold	Stricter target	Notes
`psnr_db`	`>= 20.0`	`>= 25.0`	Good for quick image or frame regression checks.
`mae`	`<= 12.0`	`<= 6.0`	Interpreted in decoded uint8 pixel units.
`cosine_similarity`	`>= 0.98`	`>= 0.995`	Less sensitive to global scale than L2-style metrics.
`relative_l2`	`<= 0.20`	`<= 0.08`	Useful when comparing across prompts or resolutions.

These thresholds are heuristics. Tune them by model family, task, resolution, quantization method, and deployment tolerance. For release gating, pair the numeric report with visual inspection of saved reference and candidate outputs.

The tool intentionally reports separate quality, latency, and memory metrics instead of a single consolidated similarity score. A single score can hide important tradeoffs, for example a candidate with good PSNR but a meaningful memory regression, or a candidate with low average error but localized visual artifacts. If you need a project-specific pass/fail gate, define it as an explicit policy over the individual metrics.

Pixel-level metrics do not measure semantic consistency. For higher-cost evaluation, you can complement this report with a vision-language judge that describes the reference and candidate outputs and compares those descriptions. Keep that semantic check separate from this lightweight tool so users can choose whether the additional model cost and latency are appropriate.