Diffusion Advanced Features¶
Overview¶
vLLM-Omni supports various advanced features for diffusion models:
- Acceleration: cache methods, parallelism methods, startup optimizations
- Memory optimization: CPU offloading, quantization
- Extensions: LoRA inference, frame interpolation
- Execution modes: step execution
Supported Features¶
Acceleration¶
Lossy Acceleration¶
Cache methods trade a small amount of output quality for a significant speedup. With proper tuning, the quality loss is typically imperceptible.
| Method | Description | Best For |
|---|---|---|
| TeaCache | Adaptive caching using modulated inputs | Quick setup, balanced quality/speed on single GPU |
| Cache-DiT | Multiple caching techniques: DBCache, TaylorSeer, SCM | Fine-grained control, tunable quality-speed tradeoff |
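As a rough illustration of the general idea behind adaptive step caching (not vLLM-Omni's implementation), the sketch below accumulates the relative change of a per-step input and reruns the expensive block only when the accumulated change reaches a threshold; otherwise it reuses the last cached residual. The names `run_with_adaptive_cache`, `relative_l1`, `threshold`, and the toy `expensive_block` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def relative_l1(current: Sequence[float], previous: Sequence[float]) -> float:
    """Sum of absolute changes divided by the sum of absolute previous values."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous) or 1e-8
    return num / den


def run_with_adaptive_cache(
    inputs: List[List[float]],
    expensive_block: Callable[[List[float]], List[float]],
    threshold: float,
) -> Tuple[List[List[float]], int]:
    """Return (outputs, number of times expensive_block actually ran)."""
    outputs: List[List[float]] = []
    cached_residual = None
    previous_input = None
    accumulated = 0.0
    computed = 0
    for x in inputs:
        if previous_input is not None:
            accumulated += relative_l1(x, previous_input)
        if cached_residual is None or accumulated >= threshold:
            # Input drifted too far (or nothing cached yet): recompute and refresh cache.
            out = expensive_block(x)
            cached_residual = [o - xi for o, xi in zip(out, x)]
            accumulated = 0.0
            computed += 1
        else:
            # Input barely changed: reuse the cached residual instead of recomputing.
            out = [xi + r for xi, r in zip(x, cached_residual)]
        outputs.append(out)
        previous_input = x
    return outputs, computed
```

A larger `threshold` skips more steps (faster, lower fidelity); a smaller one recomputes more often. This is the quality-speed tradeoff that the cache methods expose through their tuning parameters.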
Lossless Acceleration¶
Parallelism methods distribute computation across GPUs without quality loss (mathematically equivalent to single-GPU).
| Method | Description | Best For |
|---|---|---|
| Ulysses-SP | Sequence parallelism via all-to-all communication | High-resolution images (>1536px) or long videos with 2-8 GPUs |
| Ring-Attention | Sequence parallelism via ring-based communication | Videos, very long sequences, memory-constrained, with 2-8 GPUs |
| CFG-Parallel | Splits CFG positive/negative branches across devices | Image editing with CFG guidance (true_cfg_scale > 1) on 2 GPUs |
| Tensor Parallelism | Shards model weights across devices | Large models that don't fit on a single GPU, with 2+ GPUs |
| Pipeline Parallelism | Splits the denoising transformer block-wise across sequential GPU stages | Large diffusion transformers that need lower per-GPU model memory |
| HSDP | Weight sharding via FSDP2, redistributed on-demand at runtime | Very large models (14B+) on limited VRAM, combinable with SP |
| Expert Parallelism | Shards MoE expert MLP blocks across devices | MoE diffusion models (e.g., HunyuanImage3.0) |
Startup Optimization¶
| Method | Description | Best For |
|---|---|---|
| Multi-Thread Weight Loading | Loads safetensors shards in parallel using a thread pool | All diffusion models; reduces startup from minutes to seconds |
Note: Some acceleration methods can be combined for better performance. See Feature Compatibility Table and Feature Compatibility Tutorial for detailed configuration examples.
Memory Optimization¶
Memory optimization methods help reduce GPU memory usage, enabling inference on resource-constrained hardware or larger models.
| Method | Description | Best For |
|---|---|---|
| CPU Offload | Offloads model components to CPU memory | Limited VRAM, large models on consumer GPUs |
| Quantization | Runs transformer stages in lower precision (FP8, INT8, etc.) instead of BF16 | Limited VRAM, minimal accuracy loss |
| VAE Patch Parallelism | Distributes VAE decode tiling across GPUs | High-resolution generation with reduced VAE memory peak |
Extensions¶
Extension methods add specialized capabilities to diffusion models beyond standard inference.
| Method | Description | Best For |
|---|---|---|
| LoRA Inference | Enables inference with Low-Rank Adaptation (LoRA) adapter weights | Reinforcement learning extensions |
| Frame Interpolation | Inserts intermediate video frames after generation for smoother motion | Video generation pipelines that need higher temporal smoothness |
Execution Modes¶
Execution modes control how the diffusion pipeline processes denoise steps.
| Method | Description | Best For |
|---|---|---|
| Step Execution | Per-step denoise execution with mid-request abort support | Request cancellation between denoise steps, fine-grained execution control |
Note: Step execution is currently supported by QwenImagePipeline only. See Supported Models for details.
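To make the idea of per-step execution with mid-request abort concrete, here is a minimal sketch, assuming a generic denoise loop rather than the actual QwenImagePipeline code. The function `denoise_with_abort`, the `step_fn` callback, and the use of `threading.Event` as the abort signal are illustrative assumptions.

```python
import threading
from typing import Callable, List, Optional


def denoise_with_abort(
    latents: List[float],
    num_steps: int,
    step_fn: Callable[[List[float], int], List[float]],
    abort_event: threading.Event,
) -> Optional[List[float]]:
    """Run denoise steps one at a time; return None if aborted between steps."""
    for step in range(num_steps):
        # The abort flag is only checked at step boundaries, so a step that has
        # already started always runs to completion.
        if abort_event.is_set():
            return None
        latents = step_fn(latents, step)
    return latents
```

Because control returns to the caller-visible loop after every step, a cancelled request stops consuming GPU time at the next step boundary instead of running all remaining steps.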
Quantization Methods¶
| Method | Configuration | Description | Best For |
|---|---|---|---|
| FP8 | quantization="fp8" | FP8 W8A8 on validated transformer stages | Memory reduction, inference speedup |
| INT8 | quantization="int8" | INT8 W8A8 on validated transformer stages | Memory reduction, broad GPU compatibility |
| GGUF | quantization="gguf" | Native GGUF transformer-only weights (Q4, Q8, etc.) | Memory reduction on consumer GPUs |
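For a back-of-the-envelope feel for the memory reduction, the sketch below estimates weight storage from the parameter count and bytes per parameter. It is a simplification: it ignores quantization scales, activations, and any components that stay in higher precision, and it omits GGUF because its bit width depends on the chosen variant. The helper `weight_memory_gib` is hypothetical.

```python
# Storage cost per parameter for each format (bytes).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate weight memory in GiB for a given parameter count and dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    params = 14e9  # an illustrative 14B-parameter transformer
    for dtype in ("bf16", "fp8", "int8"):
        print(f"{dtype}: {weight_memory_gib(params, dtype):.1f} GiB")
```

Under these assumptions, moving a transformer from BF16 to an 8-bit format roughly halves its weight memory.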
Supported Models¶
The following tables show which models support each feature:
- 🔀SP (Ulysses & Ring): Includes both Ulysses-SP and Ring-Attention methods
- ✅ = Fully supported
- ❌ = Not supported
- ❓ = Not yet verified
Notes:
- CPU Offload has two methods: Module-wise (default for models with DiT + text encoder) and Layerwise. The tables below show Layerwise support only.
- The 💾Quantization column is collapsed for readability. See Quantization Overview for per-method and per-model support details.
ImageGen¶
| Model | ⚡TeaCache | ⚡Cache-DiT | 🔀SP (Ulysses & Ring) | 🔀CFG-Parallel | 🔀Tensor-Parallel | 🔀Pipeline-Parallel | 🔀HSDP | 💾CPU Offload (Layerwise) | 💾VAE-Patch-Parallel | 💾Quantization | 🔄Step Execution |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bagel | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| FLUX.1-dev | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ |
| FLUX.1-schnell | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ |
| FLUX.2-klein | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ |
| FLUX.1-Kontext-dev | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| FLUX.2-dev | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| GLM-Image | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Hidream-I1-Full | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| HunyuanImage3 | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| LongCat-Image | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| LongCat-Image-Edit | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| MagiHuman | ❌ | ❌ | ❌ | ❓ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| MammothModa2(T2I) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Nextstep_1(T2I) | ❓ | ❓ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| OmniGen2 | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Ovis-Image | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| Qwen-Image | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ (decode) | ✅ | ✅ |
| Qwen-Image-2512 | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ (decode) | ✅ | ✅ |
| Qwen-Image-Edit | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ (decode) | ❌ | ❌ |
| Qwen-Image-Edit-2509 | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ (decode) | ❌ | ❌ |
| Qwen-Image-Layered | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ (decode) | ❌ | ❌ |
| SenseNova-U1 | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| Stable-Diffusion3.5 | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ (decode) | ❌ | ❌ |
| Z-Image | ✅ | ✅ | ✅ | ❓ | ✅ (TP=2 only) | ❌ | ✅ | ❌ | ✅ (decode) | ✅ | ❌ |
| ERNIE-Image | ❌ | ✅ | ✅ | ❓ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| Cosmos3 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ (decode) | ✅ | ❌ |
Notes:
1. Nextstep_1(T2I) does not support cache acceleration methods such as TeaCache or Cache-DiT.
2. Tongyi-MAI/Z-Image-Turbo and SII-GAIR/daVinci-MagiHuman-Base-1080p are distilled models with minimal NFEs; CFG-Parallel is not necessary.
3. Cosmos3 T2I uses Cosmos3OmniDiffusersPipeline with modalities=["image"]. Model-level CPU offload is not supported; use layerwise offload.
VideoGen¶
| Model | ⚡TeaCache | ⚡Cache-DiT | 🔀SP (Ulysses & Ring) | 🔀CFG-Parallel | 🔀Tensor-Parallel | 🔀Pipeline-Parallel | 🔀HSDP | 💾CPU Offload (Layerwise) | 💾VAE-Patch-Parallel | 💾Quantization | 🔄Step Execution |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Wan2.2 | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ (encode/decode) | ❌ | ❌ |
| Wan2.2-S2V | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ (encode/decode) | ❌ | ❌ |
| Wan2.1-VACE | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ (decode) | ❌ | ❌ |
| LTX-2 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| LTX-2.3 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Helios | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| HunyuanVideo-1.5 T2V I2V | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ (encode/decode) | ✅ | ❌ |
| DreamID-Omni | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| Cosmos3 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ (encode/decode) | ✅ | ❌ |
Frame Interpolation Support
- Supported: Wan2.2 text-to-video, image-to-video, and TI2V pipelines
- Not supported: Wan2.1-VACE, LTX-2, LTX-2.3, Helios, HunyuanVideo-1.5, DreamID-Omni
AudioGen¶
| Model | ⚡TeaCache | ⚡Cache-DiT | 🔀SP (Ulysses & Ring) | 🔀CFG-Parallel | 🔀Tensor-Parallel | 🔀Pipeline-Parallel | 🔀HSDP | 💾CPU Offload (Layerwise) | 💾VAE-Patch-Parallel | 💾Quantization | 🔄Step Execution |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable-Audio-Open | ✅ | ❌ | ❓ | ❓ | ❌ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ |
Feature Compatibility¶
Legend:
- ✅: Functionality is supported
- ❌: No support plan
- ❓: Not yet verified and not recommended
| | ⚡TeaCache | ⚡Cache-DiT | 🔀Ulysses-SP | 🔀Ring-Attn | 🔀CFG-Parallel | 🔀Tensor Parallel | 🔀HSDP | 🔀Expert Parallel | 💾CPU Offloading (Layerwise) | 💾CPU Offloading (Module-wise) | 💾VAE Patch Parallel | 💾FP8 Quant | 🔧LoRA Inference | 🔄Step Execution |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ⚡TeaCache | | | | | | | | | | | | | | |
| ⚡Cache-DiT | ❌ | | | | | | | | | | | | | |
| 🔀Ulysses-SP | ✅ | ✅ | | | | | | | | | | | | |
| 🔀Ring-Attn | ✅ | ✅ | ✅ | | | | | | | | | | | |
| 🔀CFG-Parallel | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | |
| 🔀Tensor Parallel | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | |
| 🔀HSDP | ❓ | ❓ | ❓ | ❓ | ❓ | ❌ | | | | | | | | |
| 🔀Expert Parallel | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | | | | | | | |
| 💾CPU Offloading (Layerwise) | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | |
| 💾CPU Offloading (Module-wise) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❓ | ❓ | ❌ | | | | | |
| 💾VAE Patch Parallel | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | | | | |
| 💾FP8 Quant | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❓ | ❓ | ✅ | ✅ | ✅ | | | |
| 🔧LoRA Inference | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | | |
| 🔄Step Execution | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❓ | ❓ | ✅ | ❓ | ✅ | ✅ | ✅ | |
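The ❌ cells of the matrix above can be encoded as a small lookup so that a planned feature combination is checked before launch. This is a sketch only: the helper `find_conflicts` and the plain-string feature names are hypothetical and not part of vLLM-Omni, and ❓ (unverified) pairs are deliberately not treated as conflicts.

```python
from itertools import combinations
from typing import Iterable, List, Tuple

LAYERWISE = "CPU Offloading (Layerwise)"
MODULEWISE = "CPU Offloading (Module-wise)"

# Every pair marked ❌ in the compatibility matrix, stored order-independently.
INCOMPATIBLE = {
    frozenset(pair)
    for pair in [
        ("TeaCache", "Cache-DiT"),
        ("Tensor Parallel", "HSDP"),
        (LAYERWISE, "Ulysses-SP"),
        (LAYERWISE, "Ring-Attn"),
        (LAYERWISE, "CFG-Parallel"),
        (LAYERWISE, "Tensor Parallel"),
        (LAYERWISE, "HSDP"),
        (LAYERWISE, "Expert Parallel"),
        (LAYERWISE, MODULEWISE),
        (LAYERWISE, "VAE Patch Parallel"),
        (MODULEWISE, "VAE Patch Parallel"),
        ("Step Execution", "TeaCache"),
        ("Step Execution", "Cache-DiT"),
    ]
}


def find_conflicts(features: Iterable[str]) -> List[Tuple[str, str]]:
    """Return every pair of requested features that the matrix marks as ❌."""
    return [
        (a, b)
        for a, b in combinations(sorted(set(features)), 2)
        if frozenset((a, b)) in INCOMPATIBLE
    ]
```

For example, `find_conflicts(["TeaCache", "Cache-DiT", "Ulysses-SP"])` reports the TeaCache/Cache-DiT pair, while a combination of Ulysses-SP, Ring-Attn, and FP8 Quant returns an empty list.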
Info
- Tensor Parallel and HSDP are not compatible.
- TeaCache and Cache-DiT are not compatible.
- CPU Offloading (Layerwise) and CPU Offloading (Module-wise) are not compatible.
- CPU Offloading (Layerwise) currently supports a single GPU only.
- FP8 Quant is used as a representative example of the quantization methods.
- Step Execution is not compatible with cache backends (TeaCache, Cache-DiT). LoRA is supported, but each scheduled batch must use a single adapter (requests with different lora_request or lora_scale are kept in separate batches).
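The single-adapter-per-batch rule can be pictured as grouping pending requests by their adapter settings. The sketch below is a simplified illustration, not the scheduler's code: the `Request` dataclass and `group_by_adapter` are hypothetical, and `lora_request` is modelled as a plain string here purely for readability.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Request:
    request_id: str
    lora_request: Optional[str] = None  # stand-in for an adapter identifier
    lora_scale: float = 1.0


def group_by_adapter(requests: List[Request]) -> List[List[Request]]:
    """Split requests so that each group shares one (lora_request, lora_scale)."""
    groups: Dict[Tuple[Optional[str], float], List[Request]] = defaultdict(list)
    for req in requests:
        groups[(req.lora_request, req.lora_scale)].append(req)
    # dicts preserve insertion order, so groups appear in first-seen order.
    return list(groups.values())
```

Two requests that use the same adapter at different scales land in different groups, matching the rule that a batch must not mix `lora_scale` values.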
Multi-Thread Weight Loading¶
Large diffusion models can take several minutes to load weights at startup (e.g., ~3 min for Qwen-Image, ~5 min for Wan2.2 I2V 14B). Multi-thread weight loading speeds up this process by loading safetensors shards in parallel using a thread pool instead of sequentially.
This optimization is enabled by default with 4 threads. No configuration is needed for the default behavior.
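The underlying pattern, loading shards concurrently with a thread pool and merging the results, can be sketched in a few lines of standard-library Python. This is an illustration of the technique only, not vLLM-Omni's loader: `load_shards_parallel` and the injected `load_shard` callable are hypothetical, and a real loader would return tensors rather than bytes.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


def load_shards_parallel(
    shard_paths: List[Path],
    load_shard: Callable[[Path], Dict[str, bytes]],
    num_threads: int = 4,
) -> Dict[str, bytes]:
    """Load every shard with a thread pool and merge them into one mapping."""
    merged: Dict[str, bytes] = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in input order, so merging is deterministic
        # even though the shards are read concurrently.
        for shard in pool.map(load_shard, shard_paths):
            merged.update(shard)
    return merged
```

Threads suit this workload because shard loading is dominated by file I/O, during which Python threads can overlap.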
Configuration¶
| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| enable_multithread_weight_load | --disable-multithread-weight-load | True (enabled) | Pass the flag to disable multi-thread loading |
| num_weight_load_threads | --num-weight-load-threads | 4 | Number of threads for parallel weight loading |
Tip
The default of 4 threads balances speed and disk I/O contention. On fast NVMe storage you may benefit from more threads (e.g., 8). On HDD or network storage, the default of 4 avoids saturating I/O bandwidth.
Online Serving¶
# Default (multi-thread enabled, 4 threads)
vllm serve Qwen/Qwen-Image --omni --port 8091
# Custom thread count
vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers --omni --num-weight-load-threads 8
# Disable multi-thread loading
vllm serve Qwen/Qwen-Image --omni --disable-multithread-weight-load
Offline Inference¶
from vllm_omni import Omni
# Default (multi-thread enabled, 4 threads)
omni = Omni(model="Qwen/Qwen-Image")
# Custom thread count
omni = Omni(
    model="Wan-AI/Wan2.2-I2V-A14B-Diffusers",
    num_weight_load_threads=8,
)
Benchmarks¶
Measured on NVIDIA H800:
| Model | Before | After | Speedup |
|---|---|---|---|
| Qwen/Qwen-Image (53.7 GiB) | 168s | 27s | 6.2x |
| Wan-AI/Wan2.2-I2V-A14B-Diffusers (64.5 GiB) | 283s | 56s | 5.1x |
Learn More¶
Cache Acceleration:
- TeaCache Configuration Guide - Parameter tuning, performance tips, troubleshooting
- Cache-DiT Advanced Guide - DBCache, TaylorSeer, SCM techniques and optimization
Parallelism Methods:
- Parallelism Overview - Tensor Parallelism, Sequence Parallelism, CFG Parallelism, Pipeline Parallelism, HSDP, and Expert Parallelism
Memory Optimization:
- CPU Offload Guide - Offload model components to CPU, reduce GPU memory usage
- VAE Patch Parallelism Guide - Distribute VAE decode tiling across GPUs for high-resolution images
- Quantization Overview - Overview of quantization methods for diffusion, multi-stage omni/TTS, and multi-stage diffusion models
Extensions:
- LoRA Inference Guide - Low-Rank Adaptation for style customization and fine-tuning
- Frame Interpolation Guide - Worker-side post-generation video frame interpolation for smoother motion
Execution Modes:
- Step Execution Guide - Per-step denoise execution with mid-request abort support
Startup Optimization:
- Multi-Thread Weight Loading - Speed up model startup by loading safetensors shards in parallel
Advanced Topics:
- Feature Compatibility - How to combine multiple features for maximum performance