AutoRound Quantization
Overview
AutoRound produces pre-quantized checkpoints for LLMs, VLMs, and diffusion models. vLLM-Omni reads the checkpoint's config.json and detects AutoRound automatically when quantization_config.quant_method is "auto-round".
AutoRound is static quantization, so the quantization settings are stored in the checkpoint itself. When the checkpoint already contains the quantization config, no --quantization flag is needed at inference time.
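The detection rule described above can be illustrated with a minimal sketch. It assumes a local checkpoint directory with config.json at its top level; the helper names detect_quant_method and is_autoround_checkpoint are hypothetical and are not part of vLLM-Omni.

```python
import json
from pathlib import Path


def detect_quant_method(checkpoint_dir):
    """Return quantization_config.quant_method from config.json, or None.

    Assumption: config.json sits at the top level of a local checkpoint
    directory. This is an illustration, not vLLM-Omni's loader.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    # A checkpoint without a quantization_config is treated as unquantized.
    quant_config = config.get("quantization_config") or {}
    return quant_config.get("quant_method")


def is_autoround_checkpoint(checkpoint_dir):
    """True when the checkpoint declares quant_method "auto-round"."""
    return detect_quant_method(checkpoint_dir) == "auto-round"
```

A check like this can be run before launching inference to confirm that a downloaded checkpoint declares AutoRound in its config.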
Hardware Support
| Device | Support |
|---|---|
| NVIDIA Blackwell GPU (SM 100+) | ✅ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ✅ |
| NVIDIA Ampere GPU (SM 80+) | ✅ |
| AMD ROCm | ⭕ |
| Intel XPU | ✅ |
| Ascend NPU | ❌ |
Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide. AutoRound is Intel-supported.
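As a rough aid for reading the NVIDIA rows, the sketch below maps a CUDA compute capability to the matching row of the table. The function name nvidia_support_row is hypothetical, and the thresholds are copied from the table rather than from any vLLM-Omni API. With PyTorch installed, the (major, minor) pair would typically come from torch.cuda.get_device_capability().

```python
def nvidia_support_row(major, minor):
    """Return the hardware-table row that covers this compute capability.

    Thresholds mirror the table above (SM 80+, SM 89+, SM 100+).
    Returns None for GPUs older than Ampere, which the table does not list.
    """
    sm = major * 10 + minor
    if sm >= 100:
        return "NVIDIA Blackwell GPU (SM 100+)"
    if sm >= 89:
        return "NVIDIA Ada/Hopper GPU (SM 89+)"
    if sm >= 80:
        return "NVIDIA Ampere GPU (SM 80+)"
    return None
```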
Model Type Support
Diffusion Model (Qwen-Image, Wan2.2)
| Model | Checkpoint | Scope | Scheme | Backend |
|---|---|---|---|---|
| FLUX.1-dev | vllm-project-org/FLUX.1-dev-AutoRound-w4a16 | Diffusion transformer | W4A16 | GPTQ-Marlin or Intel-supported AutoRound backend |
| Qwen-Image | Not listed | Diffusion transformer | W4A16 | Not validated |
| Wan2.2-I2V | Intel/Wan2.2-I2V-A14B-Diffusers-int4-AutoRound | Diffusion transformer | W4A16 | GPTQ-Marlin or Intel-supported AutoRound backend |
| Wan2.2-T2V | Intel/Wan2.2-T2V-A14B-Diffusers-int4-AutoRound | Diffusion transformer | W4A16 | GPTQ-Marlin or Intel-supported AutoRound backend |
| Wan2.2-TI2V | Intel/Wan2.2-TI2V-5B-Diffusers-int4-AutoRound | Diffusion transformer | W4A16 | GPTQ-Marlin or Intel-supported AutoRound backend |
Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)
| Model | Checkpoint | Scope | Scheme | Backend |
|---|---|---|---|---|
| Qwen2.5-Omni-7B | Intel/Qwen2.5-Omni-7B-int4-AutoRound | Language-model stage | W4A16 | AutoRound |
| Qwen3-Omni-30B-A3B-Instruct | Intel/Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound | Thinker language-model stage | W4A16 | AutoRound |
| Qwen3-TTS | Not listed | TTS language-model stage | W4A16 | Not validated |
AutoRound support is checkpoint-driven. A model is supported when its checkpoint uses a compatible INC/AutoRound config and the target stage maps to vLLM-Omni's runtime module names.
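The two conditions in this rule can be sketched as a single check. This is an illustration under stated assumptions: the set of compatible quant_method values, the requirement that at least one block is listed, and the function name is_checkpoint_compatible are all hypothetical, and the real loader may apply different rules.

```python
# Assumption: "auto-round" is the only quant_method value this guide names.
COMPATIBLE_QUANT_METHODS = {"auto-round"}


def is_checkpoint_compatible(quant_config, runtime_module_names):
    """Check the config type and that every quantized block exists at runtime.

    quant_config is the quantization_config dict from config.json.
    runtime_module_names is an iterable of dotted module names.
    """
    if quant_config.get("quant_method") not in COMPATIBLE_QUANT_METHODS:
        return False
    raw = quant_config.get("block_name_to_quantize", "")
    blocks = [b.strip() for b in raw.split(",") if b.strip()]
    if not blocks:
        # Assumption: a config that names no blocks cannot be mapped.
        return False
    names = list(runtime_module_names)

    def covered(block):
        # A block is covered when a module is the block or lives under it.
        return any(n == block or n.startswith(block + ".") for n in names)

    return all(covered(block) for block in blocks)
```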
Multi-Stage Diffusion Model (BAGEL, GLM-Image)
| Model | Scope | Status | Notes |
|---|---|---|---|
| GLM-Image | Diffusion transformer | ✅ | Intel/GLM-Image-int4-AutoRound |
| BAGEL | Checkpoint-defined diffusion or transformer stage | Not validated | Requires a compatible AutoRound checkpoint |
Configuration
Python API:
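from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# The checkpoint carries its own AutoRound config, so no quantization argument is passed.
omni = Omni(model="vllm-project-org/FLUX.1-dev-AutoRound-w4a16")

# Generate one image from a text prompt with 28 inference steps.
outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=28),
)
outputs[0].save_images("output.png")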
CLI:
python examples/offline_inference/text_to_image/text_to_image.py \
    --model vllm-project-org/FLUX.1-dev-AutoRound-w4a16 \
    --prompt "A cat sitting on a windowsill" \
    --num-inference-steps 28 \
    --output outputs/flux_w4a16.png
Parameters
| Field | Type | Description |
|---|---|---|
| quant_method | str | Must be "auto-round" |
| bits | int | Quantized weight bit width, usually 4 |
| group_size | int | Quantization group size |
| packing_format | str | AutoRound packing format, for example auto_round:auto_gptq |
| block_name_to_quantize | str | Comma-separated checkpoint block names that should map to runtime module names |
The checkpoint should contain a config like:
{
  "quantization_config": {
    "quant_method": "auto-round",
    "bits": 4,
    "group_size": 128,
    "sym": true,
    "packing_format": "auto_round:auto_gptq",
    "block_name_to_quantize": "transformer_blocks,single_transformer_blocks"
  }
}
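A config like the one above can be checked against the parameter table before loading. The sketch below is a minimal validator; it assumes every field in the table is required, which the table does not state, and check_quantization_config is a hypothetical helper rather than a vLLM-Omni function.

```python
# Field names and types taken from the parameter table.
EXPECTED_TYPES = {
    "quant_method": str,
    "bits": int,
    "group_size": int,
    "packing_format": str,
    "block_name_to_quantize": str,
}


def check_quantization_config(config):
    """Return a list of problems found; an empty list means none were found."""
    quant = config.get("quantization_config")
    if not isinstance(quant, dict):
        return ["missing quantization_config object"]
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in quant:
            problems.append(f"missing field: {field}")
            continue
        value = quant[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field} should be {expected.__name__}")
    method = quant.get("quant_method")
    if isinstance(method, str) and method != "auto-round":
        problems.append('quant_method must be "auto-round"')
    return problems
```

Returning a list rather than raising on the first error lets a caller report every mismatch in one pass.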
Validation and Notes
At load time, vLLM-Omni builds an OmniINCConfig, remaps checkpoint block names to runtime module names, and selects the matching vLLM compute backend.
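The remapping step can be pictured as a prefix rewrite over the comma-separated block list. The sketch below is only an illustration: the prefix map is a hypothetical example, remap_block_names is not a vLLM-Omni function, and the actual OmniINCConfig logic may differ.

```python
def remap_block_names(block_name_to_quantize, prefix_map):
    """Rewrite checkpoint block names to runtime module names.

    block_name_to_quantize is the comma-separated string from the config.
    prefix_map maps a checkpoint prefix to its runtime prefix (hypothetical).
    Names with no matching prefix are passed through unchanged.
    """
    remapped = []
    for block in block_name_to_quantize.split(","):
        block = block.strip()
        if not block:
            continue
        for old, new in prefix_map.items():
            # Match whole dotted components only, so "transformer_blocks"
            # does not accidentally match "single_transformer_blocks".
            if block == old or block.startswith(old + "."):
                block = new + block[len(old):]
                break
        remapped.append(block)
    return remapped


# Hypothetical mapping: the runtime nests both block lists under "transformer".
example_map = {
    "transformer_blocks": "transformer.transformer_blocks",
    "single_transformer_blocks": "transformer.single_transformer_blocks",
}
runtime_names = remap_block_names(
    "transformer_blocks,single_transformer_blocks", example_map
)
```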
Example checkpoint creation:
auto-round \
    --model black-forest-labs/FLUX.1-dev \
    --scheme W4A16 \
    --batch_size 1 \
    --disable_opt_rtn \
    --dataset coco2014 \
    --iters 0
Use the generated output directory directly as the model argument. See the AutoRound documentation for all available schemes and options.