Diffusion Advanced Features¶

Table of Contents¶

Overview
Supported Features
Supported Models
Feature Compatibility
Learn More

Overview¶

vLLM-Omni supports various advanced features for diffusion models:

Acceleration: cache methods, parallelism methods, startup optimizations
Memory optimization: CPU offloading, quantization, VAE parallelism
Extensions: LoRA inference, frame interpolation
Execution modes: request-level batching, step execution

Supported Features¶

Acceleration¶

Lossy Acceleration¶

Cache methods reuse intermediate results across denoising steps, trading a small amount of output quality for a significant speedup. With proper tuning, the quality loss is typically imperceptible.

Method	Description	Best For
TeaCache	Adaptive caching using modulated inputs	Quick setup, balanced quality/speed on single GPU
Cache-DiT	Multiple caching techniques: DBCache, TaylorSeer, SCM	Fine-grained control, tunable quality-speed tradeoff
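To make the caching idea concrete, here is a minimal pure-Python sketch of a threshold-based step cache in the spirit of TeaCache. It is an illustration under simplifying assumptions, not the vLLM-Omni implementation: the class `ThresholdStepCache`, its `threshold` knob, and the toy `expensive_blocks` function are all hypothetical names. The sketch accumulates the relative L1 change of a per-step signal (standing in for the modulated input) and reuses the last computed residual while the accumulated change stays below the threshold.

```python
from typing import Callable, List, Optional


class ThresholdStepCache:
    """Toy step cache: reuse the last residual while the signal barely changes."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold  # hypothetical tuning knob
        self.accumulated = 0.0      # relative change since the last full compute
        self.prev_signal: Optional[List[float]] = None
        self.cached_residual: Optional[List[float]] = None
        self.computed_steps = 0

    def _relative_l1(self, signal: List[float]) -> float:
        assert self.prev_signal is not None
        num = sum(abs(a - b) for a, b in zip(signal, self.prev_signal))
        den = sum(abs(b) for b in self.prev_signal) or 1.0
        return num / den

    def step(
        self,
        hidden: List[float],
        signal: List[float],
        block_fn: Callable[[List[float]], List[float]],
    ) -> List[float]:
        must_compute = self.prev_signal is None or self.cached_residual is None
        if not must_compute:
            self.accumulated += self._relative_l1(signal)
            must_compute = self.accumulated >= self.threshold
        self.prev_signal = list(signal)
        if must_compute:
            # Full forward pass: refresh the cached residual and reset the counter.
            out = block_fn(hidden)
            self.cached_residual = [o - h for o, h in zip(out, hidden)]
            self.accumulated = 0.0
            self.computed_steps += 1
            return out
        # Cheap path: skip the blocks and re-apply the cached residual.
        return [h + r for h, r in zip(hidden, self.cached_residual)]


def expensive_blocks(hidden: List[float]) -> List[float]:
    """Stand-in for the transformer blocks (hypothetical)."""
    return [2.0 * h + 1.0 for h in hidden]


cache = ThresholdStepCache(threshold=0.15)
hidden = [1.0, 2.0]
signals = [[1.0, 1.0], [1.05, 1.0], [1.1, 1.0], [1.5, 1.2]]
outputs = [cache.step(hidden, s, expensive_blocks) for s in signals]
print(cache.computed_steps)
```

In this toy run only the first and last steps execute the blocks; the two middle steps reuse the cached residual. A lower threshold recomputes more often (higher quality, less speedup), which is the tradeoff the table above describes.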

Lossless Acceleration¶

Parallelism methods distribute computation across GPUs without quality loss: the output is mathematically equivalent to single-GPU execution.

Method	Description	Best For
Ulysses-SP	Sequence parallelism via all-to-all communication	High-resolution images (>1536px) or long videos with 2-8 GPUs
Ring-Attention	Sequence parallelism via ring-based communication	Videos, very long sequences, memory-constrained, with 2-8 GPUs
CFG-Parallel	Splits CFG positive/negative branches across devices	Image editing with CFG guidance (true_cfg_scale > 1) on 2 GPUs
Tensor Parallelism	Shards model weights across devices	Large models that do not fit on a single GPU, with 2+ GPUs
Pipeline Parallelism	Splits the denoising transformer block-wise across sequential GPU stages	Large diffusion transformers that need lower per-GPU model memory
HSDP	Weight sharding via FSDP2, redistributed on-demand at runtime	Very large models (14B+) on limited VRAM, combinable with SP
Expert Parallelism	Shards MoE expert MLP blocks across devices	MoE diffusion models (e.g., HunyuanImage3.0)
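As one illustration of why these methods are lossless, the sketch below mimics CFG-Parallel in plain Python. It assumes the standard classifier-free guidance combine, `uncond + scale * (cond - uncond)`; a given pipeline may apply additional steps such as rescaling. The two worker threads stand in for two GPUs, and `toy_denoiser` is a hypothetical placeholder for a transformer forward pass.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def toy_denoiser(latent: List[float], conditioning: float) -> List[float]:
    """Stand-in for one transformer forward pass (hypothetical)."""
    return [0.5 * x + conditioning for x in latent]


def cfg_combine(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance combine (assumed, see lead-in)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def cfg_sequential(latent: List[float], scale: float) -> List[float]:
    # Single device: run the positive and negative branches back to back.
    cond = toy_denoiser(latent, 1.0)
    uncond = toy_denoiser(latent, 0.0)
    return cfg_combine(cond, uncond, scale)


def cfg_parallel(latent: List[float], scale: float) -> List[float]:
    # Two workers stand in for two GPUs, one per CFG branch.
    with ThreadPoolExecutor(max_workers=2) as pool:
        cond_future = pool.submit(toy_denoiser, latent, 1.0)
        uncond_future = pool.submit(toy_denoiser, latent, 0.0)
        return cfg_combine(cond_future.result(), uncond_future.result(), scale)


latent = [0.2, -0.4, 1.0]
sequential_out = cfg_sequential(latent, 4.0)
parallel_out = cfg_parallel(latent, 4.0)
print(parallel_out == sequential_out)
```

Both paths run the same two forward passes and the same combine; only where the branches execute differs. This is also why the method targets exactly 2 devices and only helps when guidance is active.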

Startup Optimization¶

Method	Description	Best For
Multi-Thread Weight Loading	Loads safetensors shards in parallel using a thread pool	All diffusion models; cuts weight-loading time roughly 5-6x in the H800 benchmarks

Note: Some acceleration methods can be combined for better performance. See Feature Compatibility Table and Feature Compatibility Tutorial for detailed configuration examples.

Memory Optimization¶

Memory optimization methods help reduce GPU memory usage, enabling inference on resource-constrained hardware or larger models.

Method	Description	Best For
CPU Offload	Offloads model components to CPU memory	Limited VRAM, large models on consumer GPUs
Quantization	Reduces transformer-stage precision from BF16 to lower-precision formats such as FP8 or INT8	Limited VRAM, minimal accuracy loss
VAE Parallelism	Distributes VAE decode work across GPUs	High-resolution generation with reduced VAE memory peak

Extensions¶

Extension methods add specialized capabilities to diffusion models beyond standard inference.

Method	Description	Best For
LoRA Inference	Enables inference with Low-Rank Adaptation (LoRA) adapter weights	Reinforcement learning extensions
Frame Interpolation	Inserts intermediate video frames after generation for smoother motion	Video generation pipelines that need higher temporal smoothness

Execution Modes¶

Execution modes control how the diffusion pipeline processes requests and denoising steps.

Method	Description	Best For
Request-Level Batching	Scheduler batches compatible independent diffusion requests into one pipeline forward pass	Bursty online serving and multi-request throughput
Step Execution	Per-step denoise execution with mid-request abort support	Request cancellation between denoise steps, fine-grained execution control

Note: Request-level batching is available for pipelines that declare the request-batch forward contract. Step execution is currently supported by QwenImagePipeline only. See Supported Models for details.
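The idea behind step execution can be sketched as a loop that checks an abort signal between denoising steps. This is a simplified model, not the scheduler's actual code: `run_denoise_steps`, `toy_step`, and the use of `threading.Event` as the abort signal are assumptions made for illustration.

```python
import threading
from typing import Callable, List, Tuple


def run_denoise_steps(
    latent: List[float],
    num_steps: int,
    step_fn: Callable[[List[float], int], List[float]],
    abort_event: threading.Event,
) -> Tuple[List[float], int, bool]:
    """Run up to num_steps; return (latent, completed_steps, aborted)."""
    completed = 0
    for step in range(num_steps):
        # The abort check happens only between steps, never mid-step.
        if abort_event.is_set():
            return latent, completed, True
        latent = step_fn(latent, step)
        completed += 1
    return latent, completed, False


abort = threading.Event()


def toy_step(latent: List[float], step: int) -> List[float]:
    """Stand-in for one denoising step (hypothetical)."""
    if step == 2:
        abort.set()  # simulate a client cancelling while step 2 is running
    return [x * 0.9 for x in latent]


final_latent, completed, aborted = run_denoise_steps([1.0, 1.0], 10, toy_step, abort)
print(completed, aborted)
```

The step that is already running finishes, and the request stops before the next one starts, so a cancelled request frees the device after at most one more step instead of running all remaining steps.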

Quantization Methods¶

Method	Configuration	Description	Best For
FP8	`quantization="fp8"`	FP8 W8A8 on validated transformer stages	Memory reduction, inference speedup
INT8	`quantization="int8"`	INT8 W8A8 on validated transformer stages	Memory reduction, broad GPU compatibility
GGUF	`quantization="gguf"`	Native GGUF transformer-only weights (Q4, Q8, etc.)	Memory reduction on consumer GPUs
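For a rough sense of the memory effect, the sketch below estimates weight storage for a hypothetical 14B-parameter transformer at 2 bytes per parameter (BF16) versus 1 byte per parameter (FP8 or INT8). It is back-of-the-envelope arithmetic only: it ignores quantization scales, activations, the text encoder, and the VAE, and it leaves out GGUF because its block formats carry per-block metadata.

```python
# Bytes needed to store one weight in each format.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GiB for a given parameter count."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


params = 14e9  # hypothetical 14B-parameter transformer
bf16_gib = weight_memory_gib(params, "bf16")
fp8_gib = weight_memory_gib(params, "fp8")
print(round(bf16_gib, 1), round(fp8_gib, 1))
```

Halving the bytes per weight halves the transformer's weight footprint; actual savings depend on which stages are quantized.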

Supported Models¶

The following tables show which models support each feature:

🔀SP (Ulysses & Ring): Includes both Ulysses-SP and Ring-Attention methods
✅ = Fully supported
❌ = Not supported
❓ = Not verified yet

Notes:

CPU Offload has two methods: Module-wise (default for models with DiT + text encoder) and Layerwise. The tables below show Layerwise support only.

The 💾Quantization column is collapsed for readability. See Quantization Overview for per-method and per-model support details.

ImageGen¶

Model	⚡TeaCache	⚡Cache-DiT	🔀SP (Ulysses & Ring)	🔀CFG-Parallel	🔀Tensor-Parallel	🔀Pipeline-Parallel	🔀HSDP	💾CPU Offload (Layerwise)	💾VAE-Patch-Parallel	💾Quantization	🔄Step Execution
Bagel	✅	✅	✅	✅	✅	❌	✅	✅	❌	❌	❌
FLUX.1-dev	✅	✅	❌	✅	✅	❌	✅	❌	❌	✅	❌
FLUX.1-schnell	❌	✅	❌	✅	✅	❌	✅	❌	❌	✅	❌
FLUX.2-klein	✅	✅	✅	✅	✅	❌	✅	❌	❌	✅	❌
FLUX.1-Kontext-dev	❌	❌	❌	❌	✅	❌	✅	❌	❌	❌	❌
FLUX.2-dev	✅	✅	✅	✅	✅	❌	✅	❌	❌	❌	❌
GLM-Image	❌	❌	❌	✅	✅	❌	✅	❌	❌	❌	❌
Hidream-I1-Full	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
HunyuanImage3	❌	✅	❌	❌	✅	❌	❌	❌	❌	✅	❌
LongCat-Image	✅	✅	✅	✅	✅	❌	❌	✅	❌	❌	❌
LongCat-Image-Edit	✅	✅	✅	✅	✅	❌	❌	✅	❌	❌	❌
MagiHuman	❌	❌	❌	❓	✅	❌	❌	✅	❌	❌	❌
MammothModa2(T2I)	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌
Nextstep_1(T2I)	❓	❓	❌	✅	✅	❌	❌	✅	❌	❌	❌
OmniGen2	❌	✅	❌	❌	✅	❌	❌	❌	❌	❌	❌
Ovis-Image	❌	✅	❌	✅	❌	❌	❌	✅	❌	❌	❌
Qwen-Image	✅	✅	✅	✅	✅	❌	✅	✅	✅ (decode)	✅	✅
Qwen-Image-2512	✅	✅	✅	✅	✅	❌	✅	✅	✅ (decode)	✅	✅
Qwen-Image-Edit	✅	✅	✅	✅	✅	❌	✅	✅	✅ (decode)	❌	❌
Qwen-Image-Edit-2509	✅	✅	✅	✅	✅	❌	✅	✅	✅ (decode)	❌	❌
Qwen-Image-Layered	✅	✅	✅	✅	✅	❌	✅	✅	✅ (decode)	❌	❌
SenseNova-U1	❌	✅	❌	✅	✅	❌	❌	✅	❌	❌	❌
Stable-Diffusion-XL	❌	❌	✅	✅	✅	❌	✅	✅	✅ (decode)	❌	❌
Stable-Diffusion3.5	❌	✅	❌	✅	✅	❌	❌	✅	✅ (decode)	❌	❌
Z-Image	✅	✅	✅	❓	✅ (TP=2 only)	❌	✅	❌	✅ (decode)	✅	❌
ERNIE-Image	❌	✅	✅	❓	✅	❌	✅	✅	❌	❌	❌
Cosmos3	❌	✅	✅	✅	✅	❌	✅	✅	✅ (decode)	✅	❌

Notes:

1. Nextstep_1(T2I) does not support cache acceleration methods such as TeaCache or Cache-DiT.
2. Tongyi-MAI/Z-Image-Turbo and SII-GAIR/daVinci-MagiHuman-Base-1080p are distilled models with minimal NFEs; CFG-Parallel is not necessary.
3. Cosmos3 T2I uses Cosmos3OmniDiffusersPipeline with modalities=["image"]. Model-level CPU offload is not supported; use layerwise offload.

VideoGen¶

Model	⚡TeaCache	⚡Cache-DiT	🔀SP (Ulysses & Ring)	🔀CFG-Parallel	🔀Tensor-Parallel	🔀Pipeline-Parallel	🔀HSDP	💾CPU Offload (Layerwise)	💾VAE-Patch-Parallel	💾Quantization	🔄Step Execution
Wan2.2	❌	✅	✅	✅	✅	✅	✅	✅	✅ (encode/decode)	❌	❌
Wan2.2-S2V	❌	✅	✅	✅	✅	❌	✅	✅	✅ (encode/decode)	❌	❌
Wan2.1-VACE	❌	✅	✅	✅	✅	❌	✅	✅	✅ (decode)	❌	❌
LTX-2	❌	✅	✅	✅	✅	❌	✅	✅	❌	❌	❌
LTX-2.3	❌	✅	✅	✅	✅	❌	❌	✅	✅ (decode)	❌	❌
Helios	❌	✅	✅	✅	✅	❌	✅	✅	❌	❌	❌
HunyuanVideo-1.5 T2V I2V	❌	✅	✅	✅	✅	❌	✅	✅	✅ (encode/decode)	✅	❌
DreamID-Omni	❌	❌	❌	✅	❌	❌	✅	✅	❌	❌	❌
Cosmos3	❌	✅	✅	✅	✅	❌	✅	✅	✅ (encode/decode)	✅	❌

Frame Interpolation Support

Supported: Wan2.2 text-to-video, image-to-video, and TI2V pipelines
Not supported: Wan2.1-VACE, LTX-2, LTX-2.3, Helios, HunyuanVideo-1.5, DreamID-Omni

AudioGen¶

Model	⚡TeaCache	⚡Cache-DiT	🔀SP (Ulysses & Ring)	🔀CFG-Parallel	🔀Tensor-Parallel	🔀Pipeline-Parallel	🔀HSDP	💾CPU Offload (Layerwise)	💾VAE-Patch-Parallel	💾Quantization	🔄Step Execution
Stable-Audio-Open	✅	❌	❓	❓	❌	❌	✅	✅	❌	✅	❌

Feature Compatibility¶

Legend:

✅: Functionality is supported
❌: No support plan
❓: Not verified yet; not recommended

	⚡TeaCache	⚡Cache-DiT	🔀Ulysses-SP	🔀Ring-Attn	🔀CFG-Parallel	🔀Tensor Parallel	🔀HSDP	🔀Expert Parallel	💾CPU Offloading (Layerwise)	💾CPU Offloading (Module-wise)	💾VAE Patch Parallel	💾FP8 Quant	🔧LoRA Inference
⚡TeaCache
⚡Cache-DiT	❌
🔀Ulysses-SP	✅	✅
🔀Ring-Attn	✅	✅	✅
🔀CFG-Parallel	✅	✅	✅	✅
🔀Tensor Parallel	✅	✅	✅	✅	✅
🔀HSDP	❓	❓	❓	❓	❓	❌
🔀Expert Parallel	❓	❓	❓	❓	❓	❓	❓
💾CPU Offloading (Layerwise)	✅	✅	❌	❌	❌	❌	❌	❌
💾CPU Offloading (Module-wise)	✅	✅	✅	✅	✅	✅	❓	❓	❌
💾VAE Patch Parallel	✅	✅	✅	✅	✅	✅	✅	✅	❌	❌
💾FP8 Quant	✅	✅	✅	✅	✅	✅	❓	❓	✅	✅	✅
🔧LoRA Inference	❓	❓	❓	❓	❓	❓	❓	❓	❓	❓	❓	❓
🔄Step Execution	❌	❌	✅	✅	✅	✅	❓	❓	✅	❓	✅	✅	✅

Info

Tensor Parallel and HSDP are not compatible.
TeaCache and Cache-DiT are not compatible.
CPU Offloading (Layerwise) and CPU Offloading (Module-wise) are not compatible.
CPU Offloading (Layerwise) currently supports single-GPU setups only.
The FP8 Quant column serves as a representative example of quantization methods.
Step Execution is not compatible with cache backends (TeaCache, Cache-DiT). LoRA is supported, but each scheduled batch must use a single adapter (requests with different lora_request or lora_scale are kept in separate batches).
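The pairwise rules in the list above can be encoded as a small pre-flight check. The sketch below is a hypothetical helper, not part of vLLM-Omni: the feature labels (`"teacache"`, `"hsdp"`, and so on) are illustrative strings rather than real configuration keys, and only the incompatibilities stated explicitly in the list are included.

```python
from itertools import combinations
from typing import Iterable, List, Tuple

# Incompatible pairs taken from the Info list; labels are illustrative only.
INCOMPATIBLE_PAIRS = {
    frozenset({"tensor_parallel", "hsdp"}),
    frozenset({"teacache", "cache_dit"}),
    frozenset({"cpu_offload_layerwise", "cpu_offload_modulewise"}),
    frozenset({"step_execution", "teacache"}),
    frozenset({"step_execution", "cache_dit"}),
}


def find_conflicts(features: Iterable[str]) -> List[Tuple[str, str]]:
    """Return every enabled pair that the Info list marks as incompatible."""
    enabled = sorted(set(features))
    return [
        (a, b)
        for a, b in combinations(enabled, 2)
        if frozenset({a, b}) in INCOMPATIBLE_PAIRS
    ]


conflicts = find_conflicts(["teacache", "cache_dit", "tensor_parallel"])
no_conflicts = find_conflicts(["cache_dit", "ulysses_sp", "cfg_parallel"])
print(conflicts, no_conflicts)
```

An empty result only means that no listed rule is violated. Combinations marked ❓ in the matrix remain unverified and still need testing.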

Multi-Thread Weight Loading¶

Large diffusion models can take several minutes to load weights at startup (e.g., ~3 min for Qwen-Image, ~5 min for Wan2.2 I2V 14B). Multi-thread weight loading speeds up this process by loading safetensors shards in parallel using a thread pool instead of sequentially.

This optimization is enabled by default with 4 threads. No configuration is needed for the default behavior.
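The general pattern looks roughly like the sketch below, which reads several shard files concurrently with a thread pool. It is a generic illustration rather than the loader used by vLLM-Omni: `load_shard` reads raw bytes as a stand-in for parsing a safetensors shard, and the file names are made up for the demo.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List


def load_shard(path: Path) -> bytes:
    """Stand-in for reading one safetensors shard (here: raw bytes)."""
    return path.read_bytes()


def load_shards_parallel(paths: List[Path], num_threads: int = 4) -> Dict[str, bytes]:
    """Read all shards with a thread pool instead of one after another."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map returns results in the same order as the input paths.
        contents = list(pool.map(load_shard, paths))
    return {p.name: data for p, data in zip(paths, contents)}


with tempfile.TemporaryDirectory() as tmp:
    shard_paths = []
    for i in range(6):
        shard = Path(tmp) / f"shard-{i}.bin"  # hypothetical shard file
        shard.write_bytes(bytes([i]) * 1024)
        shard_paths.append(shard)
    weights = load_shards_parallel(shard_paths, num_threads=4)

print(len(weights))
```

File reads are I/O-bound, so threads can overlap them even under Python's global interpreter lock. The achievable gain depends on storage throughput.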

Configuration¶

Parameter	CLI Flag	Default	Description
`enable_multithread_weight_load`	`--disable-multithread-weight-load`	`True` (enabled)	Pass the flag to disable multi-thread loading
`num_weight_load_threads`	`--num-weight-load-threads`	`4`	Number of threads for parallel weight loading

Tip

The default of 4 threads balances speed and disk I/O contention. On fast NVMe storage you may benefit from more threads (e.g., 8). On HDD or network storage, the default of 4 avoids saturating I/O bandwidth.

Online Serving¶

```bash
# Default (multi-thread enabled, 4 threads)
vllm serve Qwen/Qwen-Image --omni --port 8091

# Custom thread count
vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers --omni --num-weight-load-threads 8

# Disable multi-thread loading
vllm serve Qwen/Qwen-Image --omni --disable-multithread-weight-load
```

Offline Inference¶

```python
from vllm_omni import Omni

# Default (multi-thread enabled, 4 threads)
omni = Omni(model="Qwen/Qwen-Image")

# Custom thread count
omni = Omni(
    model="Wan-AI/Wan2.2-I2V-A14B-Diffusers",
    num_weight_load_threads=8,
)
```

Benchmarks¶

Measured on NVIDIA H800:

Model	Before	After	Speedup
Qwen/Qwen-Image (53.7 GiB)	168s	27s	6.2x
Wan-AI/Wan2.2-I2V-A14B-Diffusers (64.5 GiB)	283s	56s	5.1x
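The Speedup column is the "Before" time divided by the "After" time, rounded to one decimal place. The short check below recomputes it from the values in the table.

```python
# (before_seconds, after_seconds) taken from the benchmark table above.
benchmarks = {
    "Qwen/Qwen-Image": (168, 27),
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers": (283, 56),
}

speedups = {
    model: round(before / after, 1)
    for model, (before, after) in benchmarks.items()
}

for model, speedup in speedups.items():
    print(f"{model}: {speedup}x")
```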

Learn More¶

Cache Acceleration:

TeaCache Configuration Guide - Parameter tuning, performance tips, troubleshooting
Cache-DiT Advanced Guide - DBCache, TaylorSeer, SCM techniques and optimization

Parallelism Methods:

Parallelism Overview - Tensor Parallelism, Sequence Parallelism, CFG Parallelism, Pipeline Parallelism, HSDP, and Expert Parallelism

Memory Optimization:

CPU Offload Guide - Offload model components to CPU, reduce GPU memory usage
VAE Parallelism Guide - Distribute VAE decode work across GPUs for high-resolution images and videos
Quantization Overview - Overview of quantization methods for diffusion, multi-stage omni/TTS, and multi-stage diffusion models

Extensions:

LoRA Inference Guide - Low-Rank Adaptation for style customization and fine-tuning
Frame Interpolation Guide - Worker-side post-generation video frame interpolation for smoother motion

Execution Modes:

Step Execution Guide - Per-step denoise execution with mid-request abort support

Startup Optimization:

Multi-Thread Weight Loading - Speed up model startup by loading safetensors shards in parallel

Advanced Topics:

Feature Compatibility - How to combine multiple features for maximum performance