Qwen-Image Usage Guide¶

The Qwen-Image series includes the following models:

Model	HuggingFace	Description
Qwen-Image	🤗 Qwen/Qwen-Image	Text-to-image generation (20B parameters, Aug 2025)
Qwen-Image-2512	🤗 Qwen/Qwen-Image-2512	Updated T2I with enhanced realism and text rendering (Dec 2025)
Qwen-Image-Edit	🤗 Qwen/Qwen-Image-Edit	Single-image editing with semantic and appearance control (Aug 2025)
Qwen-Image-Edit-2509	🤗 Qwen/Qwen-Image-Edit-2509	Multi-image editing with improved consistency (Sep 2025)
Qwen-Image-Edit-2511	🤗 Qwen/Qwen-Image-Edit-2511	Further enhanced consistency, built-in LoRA support (Nov 2025)
Qwen-Image-Layered	🤗 Qwen/Qwen-Image-Layered	Decomposes an input image into multiple RGBA layers (Dec 2025)

All models share the same DiT transformer core, so the acceleration methods (e.g., caching and parallelism) apply across the entire series.

Installation¶

# Clone and install vllm-omni
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv venv
source .venv/bin/activate
uv pip install -e . vllm==0.18.0

Usage¶

Text-to-Image (Qwen-Image, Qwen-Image-2512)¶

Qwen-Image and Qwen-Image-2512 are text-to-image models. Use the text_to_image.py script:

# Qwen-Image (default)
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --output output_qwen_image.png \
    --num-inference-steps 50 \
    --cfg-scale 4.0

# Qwen-Image-2512
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image-2512 \
    --prompt "a cup of coffee on the table" \
    --output output_qwen_image_2512.png \
    --num-inference-steps 50 \
    --cfg-scale 4.0

Notes:

1. vLLM-Omni enables torch.compile by default. Pass --enforce-eager to disable it.
2. vLLM-Omni does not enable CPU offload automatically. If you run out of GPU memory (OOM), pass --enable-cpu-offload or --enable-layerwise-offload.

Image Editing (Qwen-Image-Edit)¶

Qwen-Image-Edit is the image editing version of Qwen-Image. It feeds the input image into both Qwen2.5-VL (for visual semantic control) and the VAE encoder (for visual appearance control), which enables both semantic and appearance editing.

# Single image input (Qwen-Image-Edit)
python3 ./examples/offline_inference/image_to_image/image_edit.py \
    --model Qwen/Qwen-Image-Edit \
    --image qwen_bear.png \
    --prompt "Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'" \
    --output output_image_edit.png \
    --num-inference-steps 50 \
    --cfg-scale 4.0

For multiple image inputs, use Qwen/Qwen-Image-Edit-2509 or Qwen/Qwen-Image-Edit-2511:

# Qwen-Image-Edit-2511 example (multiple images)
python3 ./examples/offline_inference/image_to_image/image_edit.py \
    --model Qwen/Qwen-Image-Edit-2511 \
    --image image1.png image2.png \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. Position the bear standing in front of the art board as if painting" \
    --output output_image_edit.png \
    --num-inference-steps 50 \
    --cfg-scale 4.0

Image Layering (Qwen-Image-Layered)¶

Qwen-Image-Layered decomposes an input image into multiple RGBA layers:

python3 ./examples/offline_inference/image_to_image/image_edit.py \
    --model Qwen/Qwen-Image-Layered \
    --image input.png \
    --prompt "" \
    --output layered \
    --num-inference-steps 50 \
    --cfg-scale 4.0 \
    --layers 4 \
    --color-format "RGBA"
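
To make the idea of RGBA layers concrete, the sketch below recombines layers for a single pixel with the standard Porter-Duff "over" operator. It assumes straight (non-premultiplied) alpha and layers ordered back to front; neither is stated in this guide, and `over` and `flatten` are hypothetical helper names, not part of vLLM-Omni.

```python
def over(top, bottom):
    """Porter-Duff "over" for one straight-alpha RGBA pixel, floats in [0, 1]."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Fully transparent result: colour is undefined, return zeros.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t, b):
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def flatten(layers):
    """Composite layers ordered back to front (ASSUMPTION about layer order)."""
    result = (0.0, 0.0, 0.0, 0.0)
    for layer in layers:
        result = over(layer, result)
    return result


background = (0.0, 0.0, 1.0, 1.0)  # opaque blue
foreground = (1.0, 0.0, 0.0, 0.5)  # half-transparent red
pixel = flatten([background, foreground])
print(pixel)
```

Applying the same operator to every pixel of the decomposed layers should approximately reproduce the input image, which is a quick sanity check on a decomposition.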

Key Arguments¶

Argument	Description
`--model`	Model name or local path. Use `Qwen/Qwen-Image-Edit-2509` or later for multiple image support.
`--image`	Path(s) to the source image(s) (PNG/JPG, converted to RGB). Can specify multiple images.
`--prompt` / `--negative-prompt`	Text prompt describing the desired image or edit, and an optional negative prompt describing what to avoid (strings).
`--cfg-scale`	True classifier-free guidance scale (default: 4.0). Classifier-free guidance is enabled by setting `cfg_scale > 1` and providing a `negative_prompt`. A higher guidance scale pushes the image closer to the text prompt, usually at the expense of image quality.
`--guidance-scale`	Guidance scale for guidance-distilled models (default: 1.0, disabled). Unlike classifier-free guidance (`--cfg-scale`), guidance-distilled models take the guidance scale directly as an input parameter. Enabled when `guidance_scale > 1`. Ignored when not using guidance-distilled models.
`--num-inference-steps`	Diffusion sampling steps (more steps = higher quality, slower).
`--output`	Path to save the generated PNG. For Qwen-Image-Layered, this is used as the filename prefix.
`--vae-use-slicing`	Enable VAE slicing for memory optimization.
`--vae-use-tiling`	Enable VAE tiling for memory optimization.
`--cfg-parallel-size`	Set to `2` to enable CFG Parallel.
`--enable-cpu-offload`	Enable CPU offloading for diffusion models.
`--layers`	Number of layers to decompose the input image into (Qwen-Image-Layered only).
`--color-format`	Output color channel format (`RGB` or `RGBA`). Qwen-Image-Layered uses `RGBA`.
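
The role of `--cfg-scale` can be illustrated with the standard classifier-free guidance combination. This is a minimal sketch of the textbook formula on toy vectors, not the vLLM-Omni pipeline code (which may apply additional rescaling); `cfg_combine` is a hypothetical name.

```python
def cfg_combine(cond, uncond, cfg_scale):
    """Standard classifier-free guidance: start from the negative-prompt
    prediction and move toward the positive-prompt prediction by cfg_scale."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0, 3.0]    # toy prediction for the prompt
uncond = [0.5, 1.0, 1.5]  # toy prediction for the negative prompt

# With cfg_scale == 1 the result equals the conditional prediction, which is
# why guidance only takes effect for cfg_scale > 1.
no_guidance = cfg_combine(cond, uncond, 1.0)
guided = cfg_combine(cond, uncond, 4.0)
print(no_guidance, guided)
```

Each guided step needs two model evaluations (prompt and negative prompt). A guidance-distilled model instead receives the scale as an input and needs only one evaluation per step.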

Acceleration Methods¶

Cache Acceleration¶

vLLM-Omni supports two cache backends for Qwen-Image models: Cache-DiT and TeaCache.

Cache-DiT¶

# Text-to-Image with Cache-DiT
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --cache-backend cache_dit

Advanced Cache-DiT options:

python3 ./examples/offline_inference/image_to_image/image_edit.py \
    --model Qwen/Qwen-Image-Edit \
    --image qwen_bear.png \
    --prompt "Edit description" \
    --cache-backend cache_dit \
    --cache-dit-max-continuous-cached-steps 3 \
    --cache-dit-residual-diff-threshold 0.24 \
    --cache-dit-enable-taylorseer
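
As a rough intuition for how a residual-difference threshold and a cap on consecutive cached steps interact, consider the simplified policy below. It is a sketch under stated assumptions, not the Cache-DiT implementation: the per-step difference values are made up, `plan_cached_steps` is a hypothetical function, and TaylorSeer is not modelled.

```python
def plan_cached_steps(residual_diffs, threshold, max_continuous_cached_steps):
    """Return "compute" or "reuse" for every denoising step.

    A step reuses the cached result only if its residual difference is below
    the threshold and the cap on consecutive reuses has not been reached.
    """
    decisions = []
    consecutive = 0
    for index, diff in enumerate(residual_diffs):
        can_reuse = (
            index > 0  # the first step always has to be computed
            and diff < threshold
            and consecutive < max_continuous_cached_steps
        )
        if can_reuse:
            decisions.append("reuse")
            consecutive += 1
        else:
            decisions.append("compute")
            consecutive = 0
    return decisions


# Hypothetical relative differences between consecutive steps.
diffs = [1.0, 0.30, 0.10, 0.12, 0.08, 0.05, 0.40, 0.20]
plan = plan_cached_steps(diffs, threshold=0.24, max_continuous_cached_steps=3)
print(plan)
```

In this toy policy, a higher threshold reuses more steps (faster, with more risk of drift), while a lower cap forces a fresh computation more often.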

TeaCache¶

# Text-to-Image with TeaCache
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --cache-backend tea_cache

Ulysses Sequence Parallelism¶

Distributes computation across GPUs without quality loss. Recommended for high-resolution images (>1536px) with 2–8 GPUs.

# Text-to-Image with Ulysses SP
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --ulysses-degree 4

# Image Editing with Ulysses SP
python3 ./examples/offline_inference/image_to_image/image_edit.py \
    --model Qwen/Qwen-Image-Edit \
    --image qwen_bear.png \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland." \
    --output output_image_edit.png \
    --num-inference-steps 50 \
    --cfg-scale 4.0 \
    --ulysses-degree 4
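
To see why sequence parallelism pays off mainly at high resolution, the sketch below estimates how many tokens the DiT processes per image. It assumes 8x VAE downsampling followed by 2x2 latent patching, which this guide does not state; `latent_tokens` and `tokens_per_gpu` are hypothetical helper names.

```python
def latent_tokens(height, width, vae_downsample=8, patch_size=2):
    """Tokens per image. ASSUMPTION: 8x VAE downsampling and 2x2 patching."""
    stride = vae_downsample * patch_size
    return (height // stride) * (width // stride)


def tokens_per_gpu(height, width, ulysses_degree):
    """Tokens each GPU holds when the sequence is split evenly."""
    total = latent_tokens(height, width)
    return -(-total // ulysses_degree)  # ceiling division


small = latent_tokens(1024, 1024)
large = latent_tokens(1536, 1536)
# Self-attention cost grows roughly with the square of the token count.
attention_cost_ratio = (large / small) ** 2
print(small, large, attention_cost_ratio, tokens_per_gpu(1536, 1536, 4))
```

Under these assumptions, going from 1024px to 1536px more than doubles the token count and raises attention cost about fivefold, which is the regime where splitting the sequence across GPUs helps most.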

Ring-Attention Sequence Parallelism¶

Ring-based sequence parallelism, suitable for memory-constrained environments with very long sequences.

# Text-to-Image with Ring-Attention
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --ring-degree 4

# Image Editing with Ring-Attention
python3 ./examples/offline_inference/image_to_image/image_edit.py \
    --model Qwen/Qwen-Image-Edit \
    --image qwen_bear.png \
    --prompt "Edit description" \
    --ring-degree 4
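
The core idea behind ring attention is that attention over a long sequence can be computed one key/value block at a time and merged with a running softmax, so no GPU needs all keys and values at once. The pure-Python sketch below demonstrates that equivalence for a single query; it is an illustration of the technique, not vLLM-Omni code, and it omits the usual 1/sqrt(d) scaling for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax over all keys at once."""
    scores = [dot(q, k) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def ring_attention(q, key_blocks, value_blocks):
    """Consume one K/V block at a time, as if each block arrived from the
    next GPU in the ring, merging with a running (online) softmax."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(value_blocks[0][0])
    for keys, values in zip(key_blocks, value_blocks):
        scores = [dot(q, k) for k in keys]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -0.3], [0.5, 0.5], [-0.4, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0]]

full = full_attention(q, keys, values)
ring = ring_attention(q, [keys[:2], keys[2:]], [values[:2], values[2:]])
print(full, ring)
```

Because only one block of keys and values is resident per step, peak memory scales with the block size rather than the full sequence length, at the cost of passing blocks around the ring.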

CFG Parallelism¶

Splits classifier-free guidance positive/negative branches across 2 GPUs. Particularly effective for image editing with cfg-scale > 1.

# Image Editing with CFG Parallel (2 GPUs)
python3 ./examples/offline_inference/image_to_image/image_edit.py \
    --model Qwen/Qwen-Image-Edit \
    --image qwen_bear.png \
    --prompt "Edit description" \
    --cfg-parallel-size 2 \
    --num-inference-steps 50 \
    --cfg-scale 4.0
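
Conceptually, CFG parallelism runs the two guidance branches at the same time and merges them afterwards. The toy sketch below uses two threads as stand-ins for the two GPUs; `toy_denoiser` is a hypothetical placeholder for a DiT forward pass, and nothing here reflects how vLLM-Omni actually schedules work across devices.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_denoiser(latent, prompt_weight):
    """Hypothetical stand-in for one DiT forward pass."""
    return [x * prompt_weight for x in latent]


def cfg_step_parallel(latent, cfg_scale):
    # Two workers play the role of the two GPUs: one per CFG branch.
    with ThreadPoolExecutor(max_workers=2) as pool:
        positive = pool.submit(toy_denoiser, latent, 1.0)
        negative = pool.submit(toy_denoiser, latent, 0.5)
        cond, uncond = positive.result(), negative.result()
    # The branches are merged once both results are available.
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


result = cfg_step_parallel([2.0, 4.0], cfg_scale=4.0)
print(result)
```

Since the merge needs both branches, the step time is bounded by the slower branch plus communication, which is one reason the speedup stays below 2x.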

Tensor Parallelism¶

Shards model weights across multiple GPUs. Useful for running the 20B model across 2+ GPUs.

# Text-to-Image with Tensor Parallelism (2 GPUs)
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --tensor-parallel-size 2

# Image Editing with Tensor Parallelism
python3 ./examples/offline_inference/image_to_image/image_edit.py \
    --model Qwen/Qwen-Image-Edit \
    --image qwen_bear.png \
    --prompt "Edit description" \
    --tensor-parallel-size 2
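
A back-of-the-envelope estimate shows when tensor parallelism becomes necessary. The sketch below counts weight memory only, using the 20B parameter figure from the model table. It assumes BF16 weights by default and perfectly even sharding, and it ignores the text encoder, VAE, and activations, so real usage will be higher; `weight_gib_per_gpu` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1, "int8": 1}


def weight_gib_per_gpu(num_params, dtype="bf16", tensor_parallel_size=1):
    """Weights-only memory per GPU in GiB, assuming even sharding."""
    total_bytes = num_params * BYTES_PER_PARAM[dtype]
    return total_bytes / tensor_parallel_size / 2**30


single_gpu = weight_gib_per_gpu(20e9)
two_gpus = weight_gib_per_gpu(20e9, tensor_parallel_size=2)
fp8_single = weight_gib_per_gpu(20e9, dtype="fp8")
print(round(single_gpu, 2), round(two_gpus, 2), round(fp8_single, 2))
```

Under these assumptions, sharding across two GPUs and 8-bit quantization on one GPU halve the weight footprint by the same amount, so the choice depends on available hardware and quality requirements.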

CPU Offload¶

Offloads DiT layers to CPU memory between forward passes. Enables inference on limited VRAM.

# Text-to-Image with CPU offload (module-wise)
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --enable-cpu-offload

# Image Editing with CPU offload (layerwise, saves more VRAM, but slower)
python3 ./examples/offline_inference/image_to_image/image_edit.py \
    --model Qwen/Qwen-Image-Edit \
    --image qwen_bear.png \
    --prompt "Edit description" \
    --enable-layerwise-offload
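
The trade-off between the two offload modes can be approximated with simple arithmetic. The sketch below is an assumption about how the modes differ (whole modules versus a few DiT blocks resident on the GPU at a time), and every size and block count in it is hypothetical rather than measured.

```python
def peak_gib_module_wise(modules):
    """ASSUMPTION: only one whole module is resident on the GPU at a time,
    so peak weight memory is the size of the largest module."""
    return max(modules.values())


def peak_gib_layerwise(dit_gib, num_blocks, resident_blocks=2):
    """ASSUMPTION: only a few DiT blocks (current plus prefetched) are
    resident, so peak weight memory is a small fraction of the DiT."""
    return dit_gib * resident_blocks / num_blocks


# Hypothetical module sizes in GiB and a hypothetical block count.
modules = {"text_encoder": 16.0, "dit": 40.0, "vae": 0.5}
module_peak = peak_gib_module_wise(modules)
layer_peak = peak_gib_layerwise(modules["dit"], num_blocks=60)
print(module_peak, round(layer_peak, 2))
```

The lower peak of the layerwise mode comes from moving weights between CPU and GPU far more often, which is consistent with it being the slower option.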

VAE Patch Parallelism¶

Distributes VAE decode tiling across GPUs, reducing peak VAE memory usage at high resolutions.

# Text-to-Image with VAE Patch Parallelism
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --height 1536 --width 1536 \
    --ulysses-degree 2 \
    --vae-patch-parallel-size 2

VAE patch parallelism cannot be used alone. It must be used together with other parallelism methods.
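
One way to picture patch-parallel decoding is to cut the output into overlapping tiles and hand them out to ranks in turn. The sketch below illustrates that idea only; the tile size, overlap, and round-robin assignment are hypothetical choices, not values or behaviour taken from vLLM-Omni.

```python
def tile_boxes(height, width, tile, overlap):
    """Tiles as (top, left, bottom, right), overlapping by `overlap` pixels."""
    stride = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), stride):
        for left in range(0, max(width - overlap, 1), stride):
            boxes.append((top, left,
                          min(top + tile, height), min(left + tile, width)))
    return boxes


def assign_tiles(boxes, num_ranks):
    """Round-robin: rank r decodes tiles r, r + num_ranks, ..."""
    return {rank: boxes[rank::num_ranks] for rank in range(num_ranks)}


boxes = tile_boxes(1536, 1536, tile=512, overlap=64)
assignment = assign_tiles(boxes, num_ranks=2)
print(len(boxes), {rank: len(tiles) for rank, tiles in assignment.items()})
```

Each rank then holds activations for only its own tiles, so peak decode memory per GPU drops roughly in proportion to the number of ranks.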

Quantization¶

Qwen-Image and Qwen-Image-2512 support FP8 and INT8 quantization. Qwen-Image-Edit variants do not support quantization.

FP8¶

# Text-to-Image with FP8 quantization
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --quantization fp8

# Skip sensitive layers (recommended for better quality)
python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --quantization fp8 \
    --ignored-layers "img_mlp"

INT8¶

python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --quantization int8
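
For intuition about what 8-bit quantization does to weights, the sketch below applies symmetric per-tensor INT8 quantization to a toy list and measures the round-trip error. This is a generic textbook scheme, shown as an assumption; the scheme vLLM-Omni actually uses (for example per-channel scales) may differ, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: one scale maps the largest magnitude to 127.
    Assumes at least one non-zero weight."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    return [v * scale for v in quantized]


weights = [0.02, -0.5, 0.13, 1.27, -0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, scale, max_error)
```

The error per weight is bounded by half the scale, and the scale is set by the largest magnitude in the tensor. Layers with large outliers therefore lose more precision, which is the usual motivation for excluding sensitive layers from quantization.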

Feature Support Summary¶

For detailed feature support for Qwen-Image series models in vLLM-Omni, see the Feature Support Table. For detailed compatibility between features (e.g., combining Cache + SP + CFG-Parallel), see the Feature Compatibility Guide.

Combining Acceleration Methods¶

A few guidelines help pick the right combination:

Cache (TeaCache or Cache-DiT) reduces redundant DiT computation per inference. Parallelism (SP, TP, CFG-parallel, VAE patch) splits work across GPUs. Use one cache backend together with any supported parallel strategy. See the Feature Compatibility Guide for supported combinations.
TeaCache and Cache-DiT cannot be used together.
Sequence parallelism (Ulysses / Ring) is the best parallelism choice for high-resolution or long-sequence workloads. It generally outperforms tensor parallelism (TP) in these settings by distributing token-dimension computation across GPUs.
Tensor parallelism is most useful when model weights alone do not fit on a single GPU.
CFG parallelism targets non-distilled diffusion with full classifier-free guidance (--cfg-scale > 1). It assigns the positive and negative CFG branches to separate GPUs, achieving up to ~1.5× speedup when guidance is the dominant cost. It is not well-suited for guidance-distilled models (CFG is not applied).
To reduce peak VRAM, use --enable-cpu-offload or --enable-layerwise-offload, or pair --vae-patch-parallel-size with another parallel method to lower VAE decode memory at high resolutions.
To trade quality for speed, FP8 / INT8 quantization is available for Qwen-Image and Qwen-Image-2512.
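
The guidelines above can be encoded as a small pre-flight check that builds the argument list for the example scripts. This is a sketch: `build_args` is a hypothetical helper, it covers only the constraints stated in this guide (it is no substitute for the Feature Compatibility Guide), and the quantization check compares Hugging Face model names, so it would reject local paths.

```python
QUANTIZABLE = {"Qwen/Qwen-Image", "Qwen/Qwen-Image-2512"}
PARALLEL_FLAGS = ("ulysses_degree", "ring_degree",
                  "cfg_parallel_size", "tensor_parallel_size")


def build_args(model, cache_backend=None, quantization=None,
               vae_patch_parallel_size=1, **parallel):
    """Validate a combination against the guidelines and return CLI args."""
    unknown = set(parallel) - set(PARALLEL_FLAGS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    # A single cache_backend value means the two backends cannot be mixed.
    if cache_backend not in (None, "cache_dit", "tea_cache"):
        raise ValueError("cache backend must be cache_dit or tea_cache")
    if quantization and model not in QUANTIZABLE:
        raise ValueError(f"{model} does not support quantization")
    uses_parallel = any(parallel.get(name, 1) > 1 for name in PARALLEL_FLAGS)
    if vae_patch_parallel_size > 1 and not uses_parallel:
        raise ValueError("VAE patch parallelism needs another parallel method")

    args = ["--model", model]
    if cache_backend:
        args += ["--cache-backend", cache_backend]
    if quantization:
        args += ["--quantization", quantization]
    for name in PARALLEL_FLAGS:
        if parallel.get(name, 1) > 1:
            args += ["--" + name.replace("_", "-"), str(parallel[name])]
    if vae_patch_parallel_size > 1:
        args += ["--vae-patch-parallel-size", str(vae_patch_parallel_size)]
    return args


args = build_args("Qwen/Qwen-Image", cache_backend="cache_dit", ulysses_degree=4)
print(" ".join(args))
```

The returned list can be appended to the script path and passed to `subprocess.run`, so invalid combinations fail before any model is loaded.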

Examples¶

1) Sequence parallelism only:

python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --ulysses-degree 4

2) Cache only (single GPU):

python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --cache-backend cache_dit

3) Cache + SP (recommended for long-sequence generation):

python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --cache-backend cache_dit \
    --ulysses-degree 4

4) SP + VAE patch parallel (high-resolution, VRAM-constrained):

python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --height 1536 --width 1536 \
    --ulysses-degree 2 \
    --vae-patch-parallel-size 2

5) Image editing: cache + CFG parallel:

python3 ./examples/offline_inference/image_to_image/image_edit.py \
    --model Qwen/Qwen-Image-Edit \
    --image qwen_bear.png \
    --prompt "Edit description" \
    --cache-backend cache_dit \
    --cfg-parallel-size 2 \
    --num-inference-steps 50 \
    --cfg-scale 4.0

6) CPU offload (add when OOM):

python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --enable-cpu-offload

7) Quantization + SP:

python3 ./examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --quantization fp8 \
    --ulysses-degree 2