Expert Parallelism Guide¶
Table of Content¶
Overview¶
Unlike Tensor Parallelism which shards every layer's weights, Expert Parallelism (EP) only shards the MoE expert MLP blocks. This significantly reduces the memory footprint of MoE models (e.g., HunyuanImage3.0) while maintaining constant dense-equivalent compute efficiency.
During the forward pass, a gating mechanism routes tokens to their designated experts, requiring all-to-all communication to dispatch tokens to the correct ranks and combine results.
See supported models list in Supported Models.
EP Size Constraint
The effective EP size equals tp × sp × cfg × dp. At least one of TP/SP/CFG/DP must be set when EP is enabled.
Quick Start¶
Basic Usage¶
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
omni = Omni(
model="tencent/HunyuanImage-3.0",
parallel_config=DiffusionParallelConfig(
tensor_parallel_size=8,
enable_expert_parallel=True,
),
)
outputs = omni.generate(
"A brown and white dog is running on the grass",
OmniDiffusionSamplingParams(
num_inference_steps=50,
width=1024,
height=1024,
),
)
Configuration Parameters¶
In DiffusionParallelConfig:
| Parameter | Type | Default | Description |
|---|---|---|---|
enable_expert_parallel | bool | False | Enable Expert Parallelism for MoE models |
EP size is derived automatically as tp × sp × cfg × dp — configure at least one of those to set the EP degree.
Best Practices¶
When to Use¶
Good for:
- MoE models (e.g., HunyuanImage3.0) with numbers of experts
- Memory-constrained multi-GPU setups where only expert blocks need sharding
Not for:
- Dense models (no MoE layers) — EP has no effect
- Single GPU setups
Summary¶
- ✅ Enable EP - Set
enable_expert_parallel=TrueinDiffusionParallelConfigfor MoE models - ✅ Set parallelism degree - At least one of
tensor_parallel_size/ulysses_degree/cfg_parallel_sizemust be > 1 to define the EP size