AutoRound Quantization

Overview

AutoRound produces pre-quantized checkpoints for LLMs, VLMs, and diffusion models. vLLM-Omni reads the checkpoint's `config.json` and auto-detects AutoRound when `quantization_config.quant_method` is `"auto-round"`.

AutoRound is a static quantization method, so the weights are quantized before inference. No `--quantization` flag is needed at inference time when the checkpoint already contains the quantization config.
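The detection step described above can be sketched with the standard library alone. This is an illustration, not vLLM-Omni's actual loader: the helper names `detect_quant_method` and `is_autoround_checkpoint` are hypothetical, and the sketch assumes `config.json` sits at the root of a local checkpoint directory.

```python
import json
from pathlib import Path
from typing import Optional


def detect_quant_method(checkpoint_dir: str) -> Optional[str]:
    """Return quantization_config.quant_method from config.json, or None.

    Hypothetical helper: assumes config.json is at the checkpoint root.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    # A checkpoint without a quantization_config is treated as unquantized.
    quant_config = config.get("quantization_config") or {}
    return quant_config.get("quant_method")


def is_autoround_checkpoint(checkpoint_dir: str) -> bool:
    """True when the checkpoint declares quant_method == "auto-round"."""
    return detect_quant_method(checkpoint_dir) == "auto-round"
```

A check like this can be useful before launching inference, to confirm that a downloaded checkpoint carries the quantization config the runtime will look for.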

Hardware Support

Device	Support
NVIDIA Blackwell GPU (SM 100+)	✅
NVIDIA Ada/Hopper GPU (SM 89+)	✅
NVIDIA Ampere GPU (SM 80+)	✅
AMD ROCm	⭕
Intel XPU	✅
Ascend NPU	❌

Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide. AutoRound is Intel-supported.

Model Type Support

Diffusion Model (Qwen-Image, Wan2.2)

Model	Checkpoint	Scope	Scheme	Backend
FLUX.1-dev	`vllm-project-org/FLUX.1-dev-AutoRound-w4a16`	Diffusion transformer	W4A16	GPTQ-Marlin or Intel-supported AutoRound backend
Qwen-Image	Not listed	Diffusion transformer	W4A16	Not validated
Wan2.2-I2V	`Intel/Wan2.2-I2V-A14B-Diffusers-int4-AutoRound`	Diffusion transformer	W4A16	GPTQ-Marlin or Intel-supported AutoRound backend
Wan2.2-T2V	`Intel/Wan2.2-T2V-A14B-Diffusers-int4-AutoRound`	Diffusion transformer	W4A16	GPTQ-Marlin or Intel-supported AutoRound backend
Wan2.2-TI2V	`Intel/Wan2.2-TI2V-5B-Diffusers-int4-AutoRound`	Diffusion transformer	W4A16	GPTQ-Marlin or Intel-supported AutoRound backend

Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)

Model	Checkpoint	Scope	Scheme	Backend
Qwen2.5-Omni-7B	`Intel/Qwen2.5-Omni-7B-int4-AutoRound`	Language-model stage	W4A16	AutoRound
Qwen3-Omni-30B-A3B-Instruct	`Intel/Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound`	Thinker language-model stage	W4A16	AutoRound
Qwen3-TTS	Not listed	TTS language-model stage	W4A16	Not validated

AutoRound support is checkpoint-driven. A model is supported when its checkpoint uses a compatible INC/AutoRound config and the target stage maps to vLLM-Omni's runtime module names.

Multi-Stage Diffusion Model (BAGEL, GLM-Image)

Model	Scope	Status	Notes
GLM-Image	Diffusion transformer	✅	`Intel/GLM-Image-int4-AutoRound`
BAGEL	Checkpoint-defined diffusion or transformer stage	Not validated	Requires a compatible AutoRound checkpoint

Configuration

Python API:

from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(model="vllm-project-org/FLUX.1-dev-AutoRound-w4a16")

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=28),
)
outputs[0].save_images("output.png")

CLI:

python examples/offline_inference/text_to_image/text_to_image.py \
  --model vllm-project-org/FLUX.1-dev-AutoRound-w4a16 \
  --prompt "A cat sitting on a windowsill" \
  --num-inference-steps 28 \
  --output outputs/flux_w4a16.png

Parameters

Field	Type	Description
`quant_method`	str	Must be `"auto-round"`
`bits`	int	Quantized weight bit width, usually `4`
`group_size`	int	Quantization group size
`packing_format`	str	AutoRound packing format, for example `auto_round:auto_gptq`
`block_name_to_quantize`	str	Comma-separated checkpoint block names that are mapped to runtime module names

The checkpoint's `config.json` should contain a quantization config like:

{
  "quantization_config": {
    "quant_method": "auto-round",
    "bits": 4,
    "group_size": 128,
    "sym": true,
    "packing_format": "auto_round:auto_gptq",
    "block_name_to_quantize": "transformer_blocks,single_transformer_blocks"
  }
}
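As a quick sanity check, the fields in the table above can be verified against a loaded config. The following is a minimal sketch, not part of vLLM-Omni: the function name `check_autoround_config` is hypothetical, and treating every field in the table as required (with `block_name_to_quantize` always a string) is an assumption of this example.

```python
# Fields from the Parameters table and the Python type each is expected to have.
# Treating all of them as required is an assumption of this sketch.
EXPECTED_FIELDS = {
    "quant_method": str,
    "bits": int,
    "group_size": int,
    "packing_format": str,
    "block_name_to_quantize": str,
}


def check_autoround_config(config: dict) -> list:
    """Return a list of problems; an empty list means none were found."""
    quant = config.get("quantization_config")
    if not isinstance(quant, dict):
        return ["missing quantization_config"]

    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in quant:
            problems.append(f"missing field: {name}")
            continue
        value = quant[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")

    if quant.get("quant_method") != "auto-round":
        problems.append('quant_method must be "auto-round"')
    return problems


# The example config shown above, as a Python dict.
example = {
    "quantization_config": {
        "quant_method": "auto-round",
        "bits": 4,
        "group_size": 128,
        "sym": True,
        "packing_format": "auto_round:auto_gptq",
        "block_name_to_quantize": "transformer_blocks,single_transformer_blocks",
    }
}
print(check_autoround_config(example))
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a checkpoint in a single pass.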

Validation and Notes

At load time, vLLM-Omni builds an `OmniINCConfig`, remaps checkpoint block names to runtime module names, and selects the matching vLLM compute backend.
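The remapping step can be pictured as a prefix substitution over the comma-separated `block_name_to_quantize` value. The sketch below is purely illustrative: the functions `split_block_names` and `remap_block_names` and the example prefix map are hypothetical, and the real mapping rules live inside vLLM-Omni and are not documented here.

```python
def split_block_names(block_name_to_quantize: str) -> list:
    """Split the comma-separated config value into individual block names."""
    return [name.strip() for name in block_name_to_quantize.split(",") if name.strip()]


def remap_block_names(block_names: list, prefix_map: dict) -> list:
    """Rewrite checkpoint block names using a prefix map.

    prefix_map is a hypothetical {checkpoint_prefix: runtime_prefix} table.
    A prefix matches only on a whole dotted component, so the prefix
    "transformer_blocks" does not match "single_transformer_blocks".
    """
    remapped = []
    for name in block_names:
        for old, new in prefix_map.items():
            if name == old or name.startswith(old + "."):
                name = new + name[len(old):]
                break
        remapped.append(name)
    return remapped


# Hypothetical mapping, for illustration only.
example_map = {"transformer_blocks": "transformer.transformer_blocks"}
names = split_block_names("transformer_blocks,single_transformer_blocks")
print(remap_block_names(names, example_map))
```

Names without a matching prefix pass through unchanged, which mirrors the idea that only blocks whose checkpoint names differ from the runtime names need translation.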

Example checkpoint creation:

auto-round \
  --model black-forest-labs/FLUX.1-dev \
  --scheme W4A16 \
  --batch_size 1 \
  --disable_opt_rtn \
  --dataset coco2014 \
  --iters 0

Pass the generated output directory directly as the `model` argument (`--model` on the CLI). See the AutoRound documentation for all available schemes and options.