Parallelism Acceleration Guide
This guide covers the parallelism methods in vLLM-Omni for speeding up diffusion model inference and reducing per-device memory requirements.
Supported Methods
| Method | Description |
|---|---|
| Tensor Parallelism | Shards diffusion transformer (DiT) weights across GPUs to reduce per-GPU memory |
| Sequence Parallelism | Splits the sequence dimension across GPUs (Ulysses-SP, Ring-Attention, or a hybrid of both) for high-resolution images and videos |
| CFG-Parallel | Runs CFG positive/negative branches on separate GPUs for ~1.8x speedup on guided generation |
| Pipeline Parallelism | Splits the denoising transformer block-wise across sequential GPU stages to reduce per-GPU model memory |
| VAE Patch Parallelism | Distributes VAE decode spatially across GPUs to reduce peak VAE memory |
| HSDP | Shards full model weights via PyTorch FSDP2 to enable large-model inference on memory-constrained GPUs |
| Expert Parallelism | Shards expert blocks across GPUs for Mixture-of-Experts (MoE) models (e.g. HunyuanImage3.0) |
See Supported Models for per-model compatibility.
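
To build intuition for three of the methods in the table, here is a minimal back-of-the-envelope sketch in plain Python. It does not use the vLLM-Omni API. The function names (`shard_weight_bytes`, `split_sequence`, `cfg_parallel_speedup`) are hypothetical, and the cost model (perfectly even weight sharding, contiguous token chunks, a single per-step communication overhead term) is an assumption made for illustration only.

```python
from __future__ import annotations


def shard_weight_bytes(total_bytes: int, tp_size: int) -> int:
    """Idealized per-GPU weight footprint when weights are split evenly.

    Assumption: every weight is shardable and nothing is replicated.
    """
    if tp_size < 1:
        raise ValueError("tp_size must be >= 1")
    return -(-total_bytes // tp_size)  # ceiling division


def split_sequence(num_tokens: int, sp_size: int) -> list[range]:
    """Assign contiguous token ranges to each rank.

    Assumption: leftover tokens go to the lowest ranks. Real
    implementations may instead require even divisibility or pad.
    """
    if sp_size < 1:
        raise ValueError("sp_size must be >= 1")
    base, extra = divmod(num_tokens, sp_size)
    ranges = []
    start = 0
    for rank in range(sp_size):
        length = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + length))
        start += length
    return ranges


def cfg_parallel_speedup(step_time: float, comm_overhead: float) -> float:
    """Ratio of sequential CFG time to CFG-parallel time per denoising step.

    Sequential CFG runs the positive and negative branches back to back
    (2 * step_time). With the branches on separate GPUs, one step costs a
    single forward pass plus the assumed communication overhead.
    """
    return (2 * step_time) / (step_time + comm_overhead)


if __name__ == "__main__":
    gib = 1024 ** 3
    print(shard_weight_bytes(24 * gib, 4) // gib, "GiB of weights per GPU")
    print([len(r) for r in split_sequence(10, 4)], "tokens per rank")
    print(round(cfg_parallel_speedup(1.0, 0.11), 2), "x speedup")
```

Under this toy model, the speedup from CFG-Parallel is capped at 2x and approaches the ~1.8x quoted above once communication costs roughly a tenth of a forward pass. Measured numbers will depend on the model, resolution, and interconnect.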