Parallelism Acceleration Guide

This guide covers the parallelism methods in vLLM-Omni for speeding up diffusion model inference and reducing per-device memory requirements.

Supported Methods

| Method | Description |
| --- | --- |
| Tensor Parallelism | Shards DiT weights across GPUs to reduce per-GPU memory |
| Sequence Parallelism | Splits the sequence dimension across GPUs (Ulysses-SP, Ring-Attention, or hybrid) for high-resolution images and videos |
| CFG-Parallel | Runs the CFG positive and negative branches on separate GPUs for ~1.8x speedup on guided generation |
| Pipeline Parallelism | Splits the denoising transformer block-wise across sequential GPU stages to reduce per-GPU model memory |
| VAE Patch Parallelism | Distributes VAE decode spatially across GPUs to reduce peak VAE memory |
| HSDP | Shards full model weights via PyTorch FSDP2 to enable large-model inference on memory-constrained GPUs |
| Expert Parallelism | Shards MoE expert blocks across GPUs for MoE models (e.g. HunyuanImage3.0) |
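
The memory and speedup figures above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only and is not part of vLLM-Omni: the helper names `per_gpu_weight_gib` and `cfg_parallel_speedup` are hypothetical, weights are assumed to shard evenly with no replicated parameters, and the CFG model assumes the two branches cost the same and that all communication is captured in a single `overhead` term.

```python
def per_gpu_weight_gib(num_params: float, bytes_per_param: int, num_shards: int) -> float:
    """Idealized per-GPU weight memory (GiB) when weights are sharded evenly.

    Assumption: no replicated parameters, activations, or buffers are counted,
    so real per-GPU usage under tensor parallelism or HSDP will be higher.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_shards / 2**30


def cfg_parallel_speedup(branch_time: float, overhead: float) -> float:
    """Idealized speedup from running both CFG branches concurrently.

    Sequential cost is two branch evaluations; the parallel cost is one
    branch evaluation plus a communication/synchronization overhead.
    """
    if branch_time <= 0 or overhead < 0:
        raise ValueError("branch_time must be > 0 and overhead must be >= 0")
    return (2 * branch_time) / (branch_time + overhead)


if __name__ == "__main__":
    # Hypothetical 12B-parameter DiT stored in 2-byte (bf16) weights.
    for shards in (1, 2, 4, 8):
        gib = per_gpu_weight_gib(12e9, 2, shards)
        print(f"{shards} shard(s): {gib:.2f} GiB of weights per GPU")

    # With zero overhead the ceiling is 2x; overhead pulls it below that.
    print(f"speedup at 10% overhead: {cfg_parallel_speedup(1.0, 0.1):.2f}x")
```

Under this model, a speedup near 1.8x corresponds to an overhead of roughly one ninth of a single branch's runtime, which is one plausible reading of why CFG-Parallel falls short of an ideal 2x.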

See Supported Models for per-model compatibility.