Quantization

vLLM-Omni exposes quantization through the unified quantization_config path. The same configuration entrypoint is used across diffusion-only models, multi-stage omni/TTS models, and multi-stage diffusion models, but each model type has a different quantization scope.

Quantization Modes

| Mode | Guide | Description | Methods |
| --- | --- | --- | --- |
| Online quantization | Online Quantization | vLLM-Omni computes quantized weights and scales while loading the model. | FP8 W8A8, Int8 W8A8, MXFP8 W8A8, MXFP4 W4A4 |
| Runtime attention quantization | Quantized KV Cache | vLLM-Omni dynamically quantizes eligible diffusion Flash Attention tensors during inference. | FP8 FA |
| Pre-quantized checkpoints | Method-specific guides | The checkpoint or an offline quantizer provides quantized weights and scales before serving. | ModelOpt, GGUF, AutoRound, msModelSlim, serialized Int8, offline MXFP8, offline MXFP4 DualScale |
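
To make the online mode in the table above concrete, here is a minimal sketch of computing quantized weights and a scale at load time. It uses symmetric per-tensor Int8 on a plain Python list purely for illustration. The function names are hypothetical, and this is not vLLM-Omni's implementation: the real FP8, MXFP8, and MXFP4 paths use different number formats and scale granularities.

```python
def quantize_int8_per_tensor(weights):
    """Symmetric per-tensor Int8 quantization of a flat list of floats.

    Illustrative only: hypothetical helper, not part of vLLM-Omni.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # An all-zero tensor would give a zero scale; fall back to 1.0
    # so the division below stays well defined.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the Int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map Int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8_per_tensor(weights)
restored = dequantize(codes, scale)
```

A pre-quantized checkpoint would ship the equivalent of `codes` and `scale` directly, so the loader skips the computation above.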

Hardware Support

Device FP8 W8A8 Int8 W8A8 ModelOpt MXFP8 W8A8 MXFP4 W4A4 GGUF AutoRound msModelSlim
NVIDIA Blackwell GPU (SM 100+)
NVIDIA Ada/Hopper GPU (SM 89+)
NVIDIA Ampere GPU (SM 80+)
AMD ROCm
Intel XPU
Ascend NPU

Legend: supported, unsupported, not verified in this guide. FP8 on Ampere may use a weight-only path where available.

Model Type Support

Diffusion Model (Qwen-Image, Wan2.2)

These models run a diffusion transformer as the primary inference module. The default quantization target is the transformer; tokenizer, scheduler, text encoder, and VAE stay on the base checkpoint unless a method guide says otherwise.

| Method | Guide | Mode | Example models | Status |
| --- | --- | --- | --- | --- |
| FP8 W8A8 | FP8 | Online W8A8 or checkpoint FP8 | Qwen-Image; Wan2.2 is not validated | Validated for Qwen-Image family and other DiT models |
| Int8 W8A8 | Int8 | Online or serialized W8A8 | Qwen-Image; Wan2.2 is not validated | Validated for Qwen-Image and Z-Image |
| ModelOpt | ModelOpt | Pre-quantized FP8 checkpoints | Qwen-Image, Z-Image, FLUX.2, HunyuanImage-3.0 | Validated for ModelOpt FP8 diffusion checkpoints |
| MXFP8 W8A8 | MXFP8 | Online W8A8 or offline pre-quantized | Wan2.2-T2V-A14B, I2V-A14B, TI2V-5B | Ascend NPU only; validated for Wan2.2 |
| MXFP4 W4A4 | MXFP4 | mxfp4: online single-scale only; mxfp4_dualscale: online or offline dual-scale (offline recommended) | Wan2.2-T2V-A14B, I2V-A14B | Ascend NPU only; validated for Wan2.2 A14B cascade models; TI2V-5B not supported; offline mxfp4_dualscale uses calibrated mul_scale for best accuracy |
| GGUF | GGUF | Pre-quantized transformer weights | Qwen-Image | Validated where a model-specific GGUF adapter exists |
| AutoRound | AutoRound | Pre-quantized W4A16 checkpoints | FLUX.1-dev; Qwen-Image/Wan2.2 not validated | Checkpoint-driven |
| msModelSlim | msModelSlim | Pre-quantized Ascend checkpoints | Wan2.2 recipe; HunyuanImage-3.0 inference target | Ascend/NPU path |

Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)

These models combine an AR language model with audio, vision, talker, or TTS stages. Quantization is scoped to the AR language-model stage when the checkpoint contains a supported quantization_config; the non-AR stages stay in BF16 unless the model guide explicitly adds support.

| Method | Guide | Scope | Example models | Status |
| --- | --- | --- | --- | --- |
| ModelOpt | ModelOpt | Thinker or language-model checkpoint config | Qwen3-Omni thinker | ModelOpt checkpoint path |
| Int8 | Int8 | Not currently validated for omni/TTS stages | Qwen3-Omni, Qwen3-TTS | Not validated |
| MXFP8 | MXFP8 | Not currently validated for omni/TTS stages | Qwen3-Omni, Qwen3-TTS | Not validated |
| MXFP4 | MXFP4 | Not currently validated for omni/TTS stages | Qwen3-Omni, Qwen3-TTS | Not validated |
| GGUF | GGUF | Not currently validated for omni/TTS stages | Qwen3-Omni, Qwen3-TTS | Not validated |
| AutoRound | AutoRound | Thinker or language-model checkpoint config | Qwen2.5-Omni, Qwen3-Omni | Supported through AutoRound checkpoints |
| msModelSlim | msModelSlim | Not currently validated for omni/TTS stages | Qwen3-Omni, Qwen3-TTS | Not validated |

Multi-Stage Diffusion Model (BAGEL, GLM-Image)

These models split generation across multiple stages. Quantization must be attached to the intended stage rather than applied globally.

| Method | Guide | Scope | Example models | Status |
| --- | --- | --- | --- | --- |
| FP8 | FP8 | Stage-specific DiT or transformer module | BAGEL, GLM-Image | Requires model-specific validation |
| Int8 | Int8 | Stage-specific DiT or transformer module | BAGEL, GLM-Image | Requires model-specific validation |
| ModelOpt | ModelOpt | Checkpoint-defined diffusion stage | BAGEL, GLM-Image | Requires model-specific validation |
| MXFP8 | MXFP8 | Stage-specific DiT or transformer module | BAGEL, GLM-Image | Not validated |
| MXFP4 | MXFP4 | Stage-specific DiT or transformer module | BAGEL, GLM-Image | Not validated |
| GGUF | GGUF | Stage-specific transformer weights | BAGEL, GLM-Image | No validated adapter listed |
| AutoRound | AutoRound | Checkpoint-defined stage | BAGEL, GLM-Image | No validated checkpoint listed |
| msModelSlim | msModelSlim | Ascend-generated stage weights | GLM-Image | Requires model-specific adaptation |

Note

"Online quantization" means vLLM-Omni computes the quantization data while loading the model. "Pre-quantized" means the checkpoint or external quantizer provides the required quantized weights and scales.

Quantization Scope

Diffusion Model (Qwen-Image, Wan2.2)

The default target is the diffusion transformer. Component routing is available through build_quant_config():

from vllm_omni.quantization import build_quant_config

config = build_quant_config({
    "transformer": {"method": "fp8"},
    "vae": None,
})

| Component | Default quantized? | Notes |
| --- | --- | --- |
| Diffusion transformer | Yes | Primary target for FP8, Int8, ModelOpt, MXFP8, MXFP4, GGUF, AutoRound, and msModelSlim |
| Text encoder | No | Keep BF16 unless a method-specific guide documents support |
| VAE | No | Keep BF16; storage-only paths are method-specific |
| Scheduler/tokenizer | No | Loaded from the base model repository |

Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)

| Component | Default quantized? | Notes |
| --- | --- | --- |
| Thinker or AR language model | Yes, when checkpoint config is supported | ModelOpt FP8/NVFP4 or AutoRound checkpoint config |
| Audio encoder | No | BF16 |
| Vision encoder | No | BF16 |
| Talker or TTS stage | No | BF16 unless model-specific support is documented |
| Code2Wav | No | BF16 |

Multi-Stage Diffusion Model (BAGEL, GLM-Image)

| Component | Default quantized? | Notes |
| --- | --- | --- |
| Selected diffusion or transformer stage | Method-specific | Must be routed to the intended stage |
| Other generation stages | No | Keep BF16 unless separately validated |
| VAE, tokenizer, scheduler | No | Loaded from the base checkpoint |

Python API

build_quant_config() accepts strings, dictionaries, per-component dictionaries, existing QuantizationConfig objects, or None.

from vllm_omni.quantization import build_quant_config

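# Method name as a string.
build_quant_config("fp8")
# Dictionary with method-specific options.
build_quant_config({"method": "fp8", "activation_scheme": "static"})
# Method name plus keyword options.
build_quant_config("auto-round", bits=4, group_size=128)
# GGUF, pointing at a pre-quantized weights file.
build_quant_config({"method": "gguf", "gguf_model": "/path/to/model.gguf"})
# Per-component dictionary: FP8 for the transformer, None for the VAE.
build_quant_config({"transformer": {"method": "fp8"}, "vae": None})
# None is also accepted.
build_quant_config(None)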

Output Similarity Comparison Tool

Use vllm_omni.quantization.tools.compare_diffusion_trajectory_similarity to compare a reference diffusion run with a quantized candidate run using the same prompt, seed, resolution, scheduler settings, and inference steps. The tool compares final decoded images or video frames, and also reports generation latency and worker-reported peak memory when available.

This is useful when validating whether online quantization, an offline pre-quantized checkpoint, or a new ignored_layers choice keeps generation quality close to the BF16 reference.

Online Quantization Example

python -m vllm_omni.quantization.tools.compare_diffusion_trajectory_similarity \
  --task t2i \
  --model Qwen/Qwen-Image \
  --candidate-quantization fp8 \
  --ignored-layers img_mlp \
  --prompt "a cup of coffee on the table" \
  --height 512 --width 512 \
  --num-inference-steps 20 \
  --seed 142 \
  --output-json /tmp/qwen_image_fp8_similarity/result.json \
  --save-output-dir /tmp/qwen_image_fp8_similarity/images \
  --enforce-eager

Offline Checkpoint Example

Use --candidate-model when the candidate is already quantized or lives at a different model path:

python -m vllm_omni.quantization.tools.compare_diffusion_trajectory_similarity \
  --task t2i \
  --reference-model Qwen/Qwen-Image \
  --candidate-model /path/to/qwen-image-fp8-checkpoint \
  --prompt "a cup of coffee on the table" \
  --height 512 --width 512 \
  --num-inference-steps 20 \
  --seed 142 \
  --output-json /tmp/qwen_image_fp8_checkpoint_similarity/result.json

If the checkpoint does not include a loadable quantization config, pass one explicitly:

--candidate-quantization-config-json '{"method":"fp8"}'
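
One way to decide whether the explicit flag is needed is to inspect the checkpoint before running the tool. The sketch below assumes a Hugging Face style layout in which the quantization settings live under a `quantization_config` key in `config.json` at the checkpoint root. That location is an assumption: some checkpoints keep the config in a subfolder or a separate file, and what the tool treats as "loadable" may involve more than the key being present. Both helper names are hypothetical.

```python
import json
import shlex
from pathlib import Path


def has_quantization_config(checkpoint_dir):
    """Return True if <checkpoint_dir>/config.json has a "quantization_config" key.

    Assumes a Hugging Face style layout; adjust the path for other layouts.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return False
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return isinstance(config, dict) and "quantization_config" in config


def quantization_flag(quant_config):
    """Render the explicit flag with compact, shell-quoted JSON."""
    payload = json.dumps(quant_config, separators=(",", ":"))
    return "--candidate-quantization-config-json " + shlex.quote(payload)
```

For example, `quantization_flag({"method": "fp8"})` renders the same argument shown above, which is convenient when the config is generated by a script rather than typed by hand.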

Output Metrics

The output JSON includes output_metrics, reference_generation, and candidate_generation.

| Metric | Direction | Meaning |
| --- | --- | --- |
| cosine_similarity | Higher is better | Vector direction similarity between output pixels or frames. Useful as a broad sanity check. |
| mae | Lower is better | Mean absolute pixel or frame error. For decoded outputs, values are in uint8 pixel units. |
| mse / rmse | Lower is better | Squared error and its square root. These penalize localized large differences more than mae. |
| max_abs | Lower is better | Worst single-element absolute error. Treat it as an outlier/debug signal, not as a release gate. |
| l2 / relative_l2 | Lower is better | Absolute and reference-normalized L2 distance. relative_l2 is easier to compare across resolutions. |
| psnr_db | Higher is better | Pixel-space signal-to-noise ratio in dB for uint8 images or frames. |
| avg_generation_time_s | Lower is better | Average wall-clock generation time across measured runs. |
| max_peak_memory_mb | Lower is better | Maximum worker-reported peak device memory across measured runs, when the worker reports it. |
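
For readers who want to see how the quality metrics relate to each other, the following sketch computes them with their standard textbook definitions on two flat sequences of decoded uint8 pixel values. The function name is hypothetical, and the tool's exact conventions (dtype handling, per-frame aggregation, the peak value used for PSNR) are not shown here and may differ.

```python
import math


def output_metrics(reference, candidate, peak=255.0):
    """Compare two equal-length flat sequences of decoded pixel values.

    Textbook definitions only; not the tool's actual implementation.
    """
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [float(c) - float(r) for r, c in zip(reference, candidate)]
    n = len(diffs)
    mse = sum(d * d for d in diffs) / n
    l2 = math.sqrt(sum(d * d for d in diffs))
    ref_norm = math.sqrt(sum(float(r) ** 2 for r in reference))
    cand_norm = math.sqrt(sum(float(c) ** 2 for c in candidate))
    dot = sum(float(r) * float(c) for r, c in zip(reference, candidate))
    return {
        # Undefined when either input is all zeros.
        "cosine_similarity": (
            dot / (ref_norm * cand_norm) if ref_norm and cand_norm else float("nan")
        ),
        "mae": sum(abs(d) for d in diffs) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "max_abs": max(abs(d) for d in diffs),
        "l2": l2,
        # Normalizing by the reference norm removes the dependence on size.
        "relative_l2": l2 / ref_norm if ref_norm else float("inf"),
        # Identical outputs have zero error, so PSNR is unbounded.
        "psnr_db": float("inf") if mse == 0 else 10.0 * math.log10(peak * peak / mse),
    }
```

Because `l2` grows with the number of pixels while `relative_l2` does not, the normalized form is the one to compare across resolutions.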

Recommended starting thresholds for same-seed diffusion comparisons:

| Metric | Smoke threshold | Stricter target | Notes |
| --- | --- | --- | --- |
| psnr_db | >= 20.0 | >= 25.0 | Good for quick image or frame regression checks. |
| mae | <= 12.0 | <= 6.0 | Interpreted in decoded uint8 pixel units. |
| cosine_similarity | >= 0.98 | >= 0.995 | Less sensitive to global scale than L2-style metrics. |
| relative_l2 | <= 0.20 | <= 0.08 | Useful when comparing across prompts or resolutions. |

These thresholds are heuristics. Tune them by model family, task, resolution, quantization method, and deployment tolerance. For release gating, pair the numeric report with visual inspection of saved reference and candidate outputs.

The tool intentionally reports separate quality, latency, and memory metrics instead of a single consolidated similarity score. A single score can hide important tradeoffs: for example, a candidate with good PSNR but a meaningful memory regression, or one with low average error but localized visual artifacts. If you need a project-specific pass/fail gate, define it as an explicit policy over the individual metrics.
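
An explicit policy of this kind can be quite small. The sketch below encodes the smoke thresholds from the table above and checks each metric independently. It assumes `output_metrics` is a flat mapping from metric name to number, which is a guess about the JSON layout, and the names `SMOKE_THRESHOLDS` and `evaluate_gate` are hypothetical.

```python
# Smoke thresholds from the table above: metric -> (comparison, limit).
SMOKE_THRESHOLDS = {
    "psnr_db": (">=", 20.0),
    "mae": ("<=", 12.0),
    "cosine_similarity": (">=", 0.98),
    "relative_l2": ("<=", 0.20),
}


def evaluate_gate(metrics, thresholds=SMOKE_THRESHOLDS):
    """Return (passed, failures) for a flat metric-name -> value mapping."""
    failures = []
    for name, (op, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A missing metric is treated as a failure, not silently skipped.
            failures.append(f"{name}: missing")
            continue
        ok = value >= limit if op == ">=" else value <= limit
        if not ok:
            failures.append(f"{name}: {value} violates {op} {limit}")
    return (not failures), failures
```

Returning the list of failed checks rather than a single boolean keeps the report readable: a reviewer sees which metric regressed and by how much. Latency and memory limits can be added to the same mapping once a project decides what regression it tolerates.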

Pixel-level metrics do not measure semantic consistency. For higher-cost evaluation, you can complement this report with a vision-language judge that describes the reference and candidate outputs and compares those descriptions. Keep that semantic check separate from this lightweight tool so users can choose whether the additional model cost and latency are appropriate.