Parallelism Acceleration Guide

This guide covers the parallelism methods in vLLM-Omni for speeding up diffusion model inference and reducing per-device memory requirements.

Supported Methods

Method	Description
Tensor Parallelism	Shards DiT weights across GPUs to reduce per-GPU memory
Sequence Parallelism	Splits the sequence dimension across GPUs (Ulysses-SP, Ring-Attention, or a hybrid of the two) for high-resolution images and videos
CFG-Parallel	Runs the positive and negative classifier-free guidance (CFG) branches on separate GPUs for a ~1.8x speedup on guided generation
Pipeline Parallelism	Splits the denoising transformer block-wise across sequential GPU stages to reduce per-GPU model memory
VAE Parallelism	Distributes VAE decode spatially across GPUs to reduce peak VAE memory
HSDP	Shards full model weights via PyTorch FSDP2 to enable large-model inference on memory-constrained GPUs
Expert Parallelism	Shards MoE expert blocks across GPUs for MoE models (e.g. HunyuanImage3.0)

See Supported Models for per-model compatibility.
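
The sharding ideas behind tensor and sequence parallelism in the table above can be illustrated with a toy, single-process sketch. This is not vLLM-Omni code and uses none of its APIs: the helper names (`split_evenly`, `matmul`, `tensor_parallel_matmul`) are hypothetical, the "ranks" are simulated by a loop, and nested Python lists stand in for GPU tensors. It only shows that column-sharding a weight matrix and concatenating the per-rank outputs reproduces the unsharded result, and that a sequence can be cut into equal contiguous chunks, one per rank.

```python
def split_evenly(items, world_size):
    """Split a list into world_size contiguous chunks of equal length."""
    if len(items) % world_size != 0:
        raise ValueError("length must be divisible by world_size")
    chunk = len(items) // world_size
    return [items[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def matmul(x, w):
    """Plain matrix product of x (n x k) and w (k x m), as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def tensor_parallel_matmul(x, w, world_size):
    """Column-shard w across simulated ranks, then concatenate the outputs."""
    columns = list(zip(*w))                     # m columns, each of length k
    shards = split_evenly(columns, world_size)  # each rank holds m / world_size columns
    outputs = []
    for shard in shards:                        # one iteration per simulated rank
        w_shard = [list(row) for row in zip(*shard)]  # back to k x (m / world_size)
        outputs.append(matmul(x, w_shard))
    # Concatenate per-rank outputs along the column axis
    # (a real multi-GPU setup would use a collective such as all-gather here).
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    # Each of the 2 simulated ranks stores only half of w's columns.
    print(tensor_parallel_matmul(x, w, world_size=2) == matmul(x, w))
    # Sequence-parallel style split: 8 tokens over 4 simulated ranks.
    print(split_evenly(list(range(8)), world_size=4))
```

In this sketch each simulated rank stores only `1 / world_size` of the weight columns, which is the source of the per-GPU memory reduction that the table attributes to Tensor Parallelism. The divisibility check reflects a simplifying assumption of the example, not a documented requirement of vLLM-Omni.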