GGUF Quantization

Overview

GGUF support loads pre-quantized diffusion transformer weights while keeping the rest of the pipeline on the base Hugging Face checkpoint. Use the base model for tokenizer, text encoder, scheduler, and VAE, then pass the GGUF file for the transformer.

GGUF is static quantization: the quantized weights are produced before serving.

Hardware Support

| Device | Support |
| --- | --- |
| NVIDIA Blackwell GPU (SM 100+) | ✅ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ✅ |
| NVIDIA Ampere GPU (SM 80+) | ✅ |
| AMD ROCm | ⭕ |
| Intel XPU | ⭕ |
| Ascend NPU | ❌ |

Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide.

Model Type Support

Diffusion Model (Qwen-Image, Wan2.2)

| Model | HF base model | GGUF input | Scope | Adapter |
| --- | --- | --- | --- | --- |
| Qwen-Image family | `Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512`, edit and layered Qwen-Image pipelines | Local `.gguf`, `repo/file.gguf`, or `repo:quant_type` | Transformer only | `QwenImageGGUFAdapter` |
| Wan2.2 | Wan2.2 diffusion pipelines | Not validated | Transformer only | No validated adapter listed |
| Z-Image | `Tongyi-MAI/Z-Image-Turbo` | Local `.gguf`, `repo/file.gguf`, or `repo:quant_type` | Transformer only | `ZImageGGUFAdapter` |
| FLUX.2-klein | `black-forest-labs/FLUX.2-klein-4B` | Local `.gguf`, `repo/file.gguf`, or `repo:quant_type` | Transformer only | `Flux2KleinGGUFAdapter` |

Generic FLUX.1 GGUF checkpoints are not listed here; the implemented adapter is for the FLUX.2-klein path.

Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)

| Model | Scope | Status | Notes |
| --- | --- | --- | --- |
| Qwen3-Omni | Thinker language-model stage | Not validated | GGUF is not documented for omni/TTS AR stages |
| Qwen3-TTS | TTS language-model stage | Not validated | GGUF is not documented for TTS stages |

Multi-Stage Diffusion Model (BAGEL, GLM-Image)

| Model | Scope | Status | Notes |
| --- | --- | --- | --- |
| BAGEL | Stage-specific transformer weights | Not validated | Requires a model-specific GGUF adapter |
| GLM-Image | Stage-specific transformer weights | Not validated | Requires a model-specific GGUF adapter |

Configuration

Offline:

python examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image \
  --gguf-model QuantStack/Qwen-Image-GGUF/Qwen_Image-Q4_K_M.gguf \
  --quantization gguf \
  --prompt "a red paper kite hanging from a pine tree in a winter courtyard" \
  --height 1024 \
  --width 1024 \
  --seed 42 \
  --num_inference_steps 20 \
  --output outputs/qwen_image_q4km.png

Online:

vllm serve Qwen/Qwen-Image \
  --omni \
  --port 8000 \
  --diffusion-quantization-config '{"method":"gguf","gguf_model":"QuantStack/Qwen-Image-GGUF/Qwen_Image-Q4_K_M.gguf"}'
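
The JSON argument above is easy to break with shell quoting when it is assembled by hand. The following is a minimal sketch, not part of vLLM-Omni, that builds the same command line from Python. The helper name `build_serve_command` is hypothetical; only the flags and the two JSON keys shown in the command above are used.

```python
import json
import shlex


def build_serve_command(base_model: str, gguf_model: str, port: int = 8000) -> str:
    """Return a shell-safe `vllm serve` command line for a GGUF transformer.

    Hypothetical helper for illustration; it only formats a string.
    """
    # The same two keys used in the online example: method and gguf_model.
    quant_config = {"method": "gguf", "gguf_model": gguf_model}
    args = [
        "vllm", "serve", base_model,
        "--omni",
        "--port", str(port),
        "--diffusion-quantization-config",
        json.dumps(quant_config, separators=(",", ":")),
    ]
    # shlex.join quotes the JSON so the shell passes it as a single argument.
    return shlex.join(args)


command = build_serve_command(
    "Qwen/Qwen-Image",
    "QuantStack/Qwen-Image-GGUF/Qwen_Image-Q4_K_M.gguf",
)
print(command)
```

Generating the JSON with `json.dumps` rather than typing it keeps the keys and quoting consistent when the GGUF path comes from a variable.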

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `method` | str | - | Quantization method (`"gguf"`) |
| `gguf_model` | str | - | Local GGUF file, explicit Hugging Face file, or `repo:quant_type` selector |

`gguf_model` accepts:

| Form | Example |
| --- | --- |
| Local file | `/models/z-image-Q4_K_M.gguf` |
| Explicit HF file | `QuantStack/Qwen-Image-GGUF/Qwen_Image-Q4_K_M.gguf` |
| HF repo plus quant type | `owner/repo:Q4_K_M` |
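
To make the three accepted forms concrete, here is a minimal sketch of how a `gguf_model` string could be told apart. This is an illustration under assumptions, not the resolver vLLM-Omni actually uses: the function name `classify_gguf_model`, the returned labels, and the exact precedence rules are all hypothetical.

```python
import os
from typing import Callable


def classify_gguf_model(
    value: str, is_file: Callable[[str], bool] = os.path.isfile
) -> str:
    """Classify a gguf_model string as 'local', 'hf_file', or 'hf_repo_quant'.

    Hypothetical helper; the real loader may apply different rules.
    """
    # An existing file or an absolute path is treated as a local GGUF file.
    if is_file(value) or os.path.isabs(value) or value.startswith("/"):
        return "local"
    # owner/repo:QUANT contains a colon and does not name a .gguf file.
    if ":" in value and not value.endswith(".gguf"):
        repo, _, quant = value.partition(":")
        if repo.count("/") == 1 and quant:
            return "hf_repo_quant"
    # owner/repo/file.gguf needs at least three path segments.
    if value.endswith(".gguf") and value.count("/") >= 2:
        return "hf_file"
    raise ValueError(f"Unrecognized gguf_model value: {value!r}")


for example in (
    "/models/z-image-Q4_K_M.gguf",
    "QuantStack/Qwen-Image-GGUF/Qwen_Image-Q4_K_M.gguf",
    "owner/repo:Q4_K_M",
):
    print(example, "->", classify_gguf_model(example))
```

Checking for a local file first matters because a relative local path and an explicit Hugging Face file can look alike; the `is_file` parameter exists only so the check can be stubbed out in tests.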

Validation and Notes

`OmniDiffusionConfig` receives `{"method": "gguf", "gguf_model": ...}`.
`DiffusersPipelineLoader` resolves the GGUF file.
A model-specific adapter remaps GGUF tensor names to vLLM-Omni transformer names.
Only transformer weights are loaded from GGUF. Weights that the GGUF file does not contain, such as the non-transformer components, are loaded from the base model repository.
vLLM's GGUF linear method performs dequantization and GEMM at runtime.
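
To illustrate what "dequantization and GEMM at runtime" means, the sketch below expands one GGUF `Q8_0` block in pure Python and uses it in a dot product. `Q8_0` is chosen only because it is the simplest layout (32 int8 values sharing one fp16 scale); K-quants such as `Q4_K_M` use a more involved layout, and vLLM's actual kernels run on the GPU. The function names here are hypothetical.

```python
import struct

# One Q8_0 block: an fp16 scale followed by 32 int8 quants (34 bytes).
BLOCK = struct.Struct("<e32b")


def dequantize_q8_0(raw: bytes) -> list[float]:
    """Expand packed Q8_0 blocks into floats: value = scale * quant."""
    if len(raw) % BLOCK.size:
        raise ValueError("buffer is not a whole number of Q8_0 blocks")
    out: list[float] = []
    for offset in range(0, len(raw), BLOCK.size):
        scale, *quants = BLOCK.unpack_from(raw, offset)
        out.extend(scale * q for q in quants)
    return out


def linear_row(x: list[float], raw_weight_row: bytes) -> float:
    """Dot product of an activation vector with one dequantized weight row."""
    weights = dequantize_q8_0(raw_weight_row)
    return sum(a * w for a, w in zip(x, weights))


# Build one block by hand: scale 0.5, quants 0..31.
row = BLOCK.pack(0.5, *range(32))
print(dequantize_q8_0(row)[:4])      # [0.0, 0.5, 1.0, 1.5]
print(linear_row([1.0] * 32, row))   # 248.0
```

The weights stay in their packed form in memory and are expanded only when a layer is evaluated, which is why the quantized file can be served without first converting it back to full precision.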

Unsupported models fail fast with a "No GGUF adapter matched" error instead of falling back to a generic mapper. Many GGUF repositories do not include `model_index.json`, so always pass the base model through `--model`.
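
The adapter selection and fail-fast behavior described above can be sketched as follows. This is not the vLLM-Omni implementation: the `PrefixAdapter` class, the `ADAPTERS` registry, the matching rule, and the `blk.` to `transformer_blocks.` prefix mapping are all hypothetical and exist only to show the pattern of remapping names and refusing to guess.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PrefixAdapter:
    """Hypothetical adapter: matches a model family and rewrites name prefixes."""

    name: str
    model_keyword: str
    prefix_map: Dict[str, str]

    def matches(self, base_model: str) -> bool:
        return self.model_keyword in base_model.lower()

    def remap(self, tensor_name: str) -> str:
        # Rewrite the first matching prefix; leave other names unchanged.
        for old, new in self.prefix_map.items():
            if tensor_name.startswith(old):
                return new + tensor_name[len(old):]
        return tensor_name


# Illustrative registry; the prefix mapping is made up for this example.
ADAPTERS: List[PrefixAdapter] = [
    PrefixAdapter("demo-qwen-image", "qwen-image", {"blk.": "transformer_blocks."}),
]


def select_adapter(base_model: str) -> PrefixAdapter:
    """Return the first matching adapter, or fail fast with a clear error."""
    for adapter in ADAPTERS:
        if adapter.matches(base_model):
            return adapter
    # No generic fallback: an unknown model is an error, not a best guess.
    raise ValueError(f"No GGUF adapter matched {base_model!r}")


adapter = select_adapter("Qwen/Qwen-Image")
print(adapter.remap("blk.0.attn.weight"))
```

Raising instead of falling back avoids silently loading tensors under the wrong names, which would otherwise surface later as corrupted outputs rather than a clear load-time error.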