Pipeline Parallelism Guide¶
Table of Content¶
Overview¶
Pipeline Parallelism splits the denoising transformer block-wise into sequential stages across GPUs. Each rank owns only part of the transformer, which reduces per-GPU model memory and enables larger diffusion models to run across multiple devices.
It can also be combined with other distributed methods such as CFG-Parallel, Tensor Parallelism, and Sequence Parallelism.
See supported models list in Supported Models.
Quick Start¶
Basic Usage¶
Simplest working example:
from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
omni = Omni(
model="Wan-AI/Wan2.2-TI2V-5B-Diffusers",
parallel_config=DiffusionParallelConfig(
pipeline_parallel_size=2,
),
)
outputs = omni.generate(
{"prompt": "A cinematic drone shot over snowy mountains"},
OmniDiffusionSamplingParams(
num_inference_steps=40,
num_frames=81,
height=704,
width=1280,
),
)
Example Script¶
Offline Inference¶
Use python scripts under:
examples/offline_inference/text_to_video/text_to_video.pyexamples/offline_inference/image_to_video/image_to_video.py
Text-to-video example:
python examples/offline_inference/text_to_video/text_to_video.py \
--model=Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--width=1280 \
--height=704 \
--guidance-scale=5.0 \
--prompt="Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage" \
--output=t2v_5B_pp2.mp4 \
--pipeline-parallel-size=2
Pipeline Parallelism can also be combined with CFG-Parallel:
python examples/offline_inference/text_to_video/text_to_video.py \
--model=Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--width=1280 \
--height=704 \
--guidance-scale=5.0 \
--prompt="Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage" \
--output=t2v_5B_pp2_cfg2.mp4 \
--pipeline-parallel-size=2 \
--cfg-parallel-size=2
Online Serving¶
Enable Pipeline Parallelism in online serving:
# Default PP configuration
vllm serve Wan-AI/Wan2.2-TI2V-5B-Diffusers --omni --port 8091 --pipeline-parallel-size 2
# PP + CFG-Parallel
vllm serve Wan-AI/Wan2.2-TI2V-5B-Diffusers --omni --port 8091 \
--pipeline-parallel-size 2 \
--cfg-parallel-size 2
Configuration Parameters¶
In DiffusionParallelConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
pipeline_parallel_size | int | 1 | Number of pipeline-parallel stages. Set to a value greater than 1 to split the denoising transformer across GPUs |
[!NOTE] Total GPU count is the product of all enabled distributed dimensions, for example
pipeline_parallel_size * cfg_parallel_size * tensor_parallel_size * ulysses_degree * ring_degree.
Manual Layer Partitioning¶
By default, transformer layers are distributed evenly across PP ranks. You can override this with the VLLM_PP_LAYER_PARTITION environment variable to assign a specific number of layers to each rank:
# Example: 40 layers across 4 PP ranks, assigning 8 / 12 / 12 / 8 layers
export VLLM_PP_LAYER_PARTITION=8,12,12,8
The value must be a comma-separated list of integers whose length equals pipeline_parallel_size and whose sum equals the total number of transformer layers. This is useful when you want to balance memory or compute asymmetrically across ranks.
Best Practices¶
When to Use¶
Good for:
- Large diffusion transformers that do not fit comfortably on one GPU
- Multi-GPU setups where reducing per-GPU model memory is more important than minimizing communication
- Combining with CFG-Parallel or other distributed methods on supported models
Not for:
- Single GPU setups
- Models that do not support Pipeline Parallelism (check supported models)
- Very small models where inter-stage communication overhead may outweigh the benefit
Expected Behavior¶
Pipeline Parallelism primarily reduces per-GPU memory usage by splitting the transformer into stage-local blocks. Depending on the model, topology, and resolution, it may also help execution fit into available hardware, but it is not primarily a latency optimization.
Troubleshooting¶
Common Issue 1: No benefit from Pipeline Parallelism¶
Symptoms: PP is enabled, but latency does not improve or becomes slightly worse.
Solutions:
- Check your goal:
# PP is mainly for memory scaling, not guaranteed latency speedup
parallel_config = DiffusionParallelConfig(pipeline_parallel_size=2)
-
Check model support:
- Verify your model in supported models
- PP is currently validated only on selected pipelines
-
Combine with other methods when appropriate:
- PP can be combined with CFG-Parallel, Tensor Parallelism, or Sequence Parallelism on supported models
Common Issue 2: PP pipeline fails at import¶
Symptoms: Importing a custom pipeline raises a TypeError about CFGParallelMixin or mixin order.
Solutions:
- Inherit both mixins.
- Put
PipelineParallelMixinbeforeCFGParallelMixinin the class base list. - Use
class YourPipeline(nn.Module, PipelineParallelMixin, CFGParallelMixin): ...as the reference pattern.
Summary¶
- ✅ Enable Pipeline Parallelism - Set
pipeline_parallel_size > 1inDiffusionParallelConfig - ✅ Use Supported Models - Verify your model supports PP in supported models
- ✅ Combine When Needed - PP can be combined with CFG-Parallel and other distributed methods on supported pipelines
- ✅ Scale for Memory - Use PP primarily to reduce per-GPU model memory and fit larger transformers