FP8 Quantization¶

Overview¶

FP8 quantization converts BF16/FP16 weights to FP8 at model load time. Online activation scaling is the default and does not require calibration. Static activation scaling is supported when calibrated scale information is available. For ModelOpt-produced pre-quantized checkpoints, see ModelOpt Quantization.

Some architectures can quantize all linear layers. Others have quality-sensitive layers that should stay in BF16 through ignored_layers. Image-stream MLPs (img_mlp) are a common sensitive target because denoising latent ranges shift across timesteps and small per-layer errors can compound in deep DiT blocks.

Hardware Support¶

Device	Support
NVIDIA Blackwell GPU (SM 100+)	✅
NVIDIA Ada/Hopper GPU (SM 89+)	✅
NVIDIA Ampere GPU (SM 80+)	✅
AMD ROCm	⭕
Intel XPU	⭕
Ascend NPU	❌

Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide. FP8 on Ampere may use a weight-only path where available.

Faster FP8 GEMM on Blackwell (quack)¶

On Blackwell (SM 100+), vLLM runs FP8 linears through the FlashInfer kernel, which applies the bias as a separate kernel after the GEMM. On the small GEMMs in video DiTs this bias add is a significant overhead. Installing the optional quack kernel lets vLLM-Omni fuse alpha * (A @ B) + bias into a single CuteDSL GEMM, recovering that overhead (e.g. HunyuanVideo-1.5 FP8 goes from slower-than-BF16 to faster).

# CUDA 12.9
pip install vllm-omni[quack]

# CUDA 13.x
pip install 'quack-kernels[cu13]' --extra-index-url https://download.pytorch.org/whl/cu130

It is enabled automatically once installed (no flag needed) and is datacenter Blackwell only (sm_100 / sm_101 / sm_103, compute capability 10.x, e.g. B200): quack's CuteDSL GEMM uses the 5th-gen tcgen05 tensor-core MMA, which exists only on those parts. On Hopper/Ada the CUTLASS FP8 kernel already fuses bias, and on workstation/consumer Blackwell (sm_120 / sm_121, compute capability 12.x, e.g. RTX PRO 6000 / RTX 50-series) tcgen05 is absent — so quack is not auto-enabled there and FlashInfer's native FP8 path is used instead. Set VLLM_OMNI_USE_QUACK_FP8=1 to force quack on, or VLLM_OMNI_USE_QUACK_FP8=0 to force the FlashInfer path. If quack-kernels is not installed, FP8 still works — it just keeps the unfused FlashInfer path.

Compile cache and warmup¶

quack JIT-compiles its kernel once per distinct GEMM shape (tens of seconds, longer the first time across all autotuned configs). The compiled .o files are cached on disk and reused on later runs, so this is a one-time cost — not per request.

vLLM-Omni points that cache at ~/.cache/vllm_omni/quack (override with QUACK_CACHE_DIR) instead of quack's default under /tmp, so it survives restarts. In containers, set QUACK_CACHE_DIR to a mounted/persistent path — or bake it into the image — so the first cold start does not recompile. The engine's startup dummy run already exercises the kernels, so with a warm cache the first real request is fast.

To pre-warm specific shapes (e.g. at image build time):

from vllm_omni.quantization.quack_fp8 import warmup_quack_fp8
# (M, K, N) per linear; M = number of tokens for your resolution/frame count
warmup_quack_fp8([(14040, 2048, 6144), (14040, 2048, 2048)])

The PyPI package is quack-kernels (imported as quack); plain pip install quack is an unrelated statistics library. Requires CUDA 12.9+ and Python 3.12.

Model Type Support¶

Diffusion Model (Qwen-Image, Wan2.2)¶

Model	HF models	Online	Pre-calibrated	Recommendation	`ignored_layers`	Text-Encoder quantization
Qwen-Image	`Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512`	Yes	Yes	Skip sensitive image-stream MLPs when quality regresses	`img_mlp`
Wan2.2	Wan2.2 diffusion pipelines	Not validated	Not validated	Validate against BF16 before documenting as supported	TBD
Z-Image	`Tongyi-MAI/Z-Image-Turbo`	Yes	Yes	All layers	None	✅︎
FLUX.1	`black-forest-labs/FLUX.1-dev`, `black-forest-labs/FLUX.1-schnell`	Yes	Yes	All layers	None
FLUX.2-klein	`black-forest-labs/FLUX.2-klein-4B`	Yes	Yes	All layers	None
HunyuanImage-3.0	`tencent/HunyuanImage-3.0`, `tencent/HunyuanImage-3.0-Instruct`	Yes	Yes	All layers; use the Hunyuan stage config for multi-stage runs	None
HunyuanVideo-1.5	`hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v`, `720p_t2v`, `480p_i2v`	Yes	Yes	All layers	None
Cosmos3	`nvidia/Cosmos3-Nano`, `nvidia/Cosmos3-Super`	Yes	Not validated	All layers	None

Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)¶

Model	Scope	Format	Status
Qwen3-Omni	Thinker language-model stage	ModelOpt `quant_algo=FP8`	Tested for thinker memory reduction
Qwen3-TTS	TTS language-model stage	Checkpoint config	Not validated

Audio encoder, vision encoder, talker, and code2wav stay in BF16 unless a model-specific guide says otherwise.

Multi-Stage Diffusion Model (BAGEL, GLM-Image)¶

Model	Scope	Status	Notes
BAGEL	Stage-specific transformer or DiT module	Not validated	Route FP8 to the intended stage before enabling
GLM-Image	Stage-specific transformer or DiT module	Not validated	Validate quality against BF16 baseline

Configuration¶

Python API:

from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(model="<your-model>", quantization="fp8")

omni_with_skips = Omni(
    model="<your-model>",
    quantization_config={
        "method": "fp8",
        "ignored_layers": ["img_mlp"],
    },
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)

CLI:

python text_to_image.py --model <your-model> --quantization fp8
python text_to_image.py --model <your-model> --quantization fp8 --ignored-layers "img_mlp"
vllm serve <your-model> --omni --quantization fp8

Parameters¶

Parameter	Type	Default	Description
`method`	str	-	Quantization method (`"fp8"`)
`ignored_layers`	list[str]	`[]`	Layer name patterns to keep in BF16
`activation_scheme`	str	`"dynamic"`	`"dynamic"` selects online activation scaling, or `"static"` when scales are available
`weight_block_size`	list[int] \| None	`None`	Block size for block-wise weight quantization

The available ignored_layers names depend on the model architecture, for example to_qkv, to_out, img_mlp, or txt_mlp.

Validation and Notes¶

FP8 quantization can be combined with cache acceleration:

omni = Omni(
    model="<your-model>",
    quantization="fp8",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2},
)

Compare generated outputs with a BF16 baseline before adding a new model to the supported table. GLM-Image and Helios are not listed as FP8-supported diffusion models until they have method-specific validation.