CFG-Parallel Guide
Overview
CFG-Parallel accelerates diffusion models by running the positive and negative classifier-free guidance (CFG) passes on different GPUs, giving roughly a 1.5x to 1.8x speedup when CFG is enabled. It is best suited to image editing tasks that use guidance scales greater than 1.0.
See the list of supported models in Supported Models.
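To make the idea concrete, here is a minimal pure-Python sketch of the textbook CFG combination and of how the two passes can run concurrently. The `denoise` function is a hypothetical stand-in for a model forward pass, the two worker threads stand in for two GPU ranks, and the combination rule is the standard CFG formula. The actual vLLM-Omni pipeline may apply further steps, so treat this as an illustration rather than the library's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def denoise(latent, conditioning_strength):
    """Hypothetical stand-in for one model forward pass (not a vLLM-Omni API)."""
    return [x * conditioning_strength for x in latent]


def combine_cfg(positive, negative, scale):
    """Textbook CFG rule: negative + scale * (positive - negative)."""
    return [n + scale * (p - n) for p, n in zip(positive, negative)]


def cfg_step_parallel(latent, scale):
    """Run both CFG branches concurrently, then combine their predictions."""
    # Two workers stand in for rank 0 (positive) and rank 1 (negative).
    with ThreadPoolExecutor(max_workers=2) as pool:
        pos_future = pool.submit(denoise, latent, 1.0)  # positive branch
        neg_future = pool.submit(denoise, latent, 0.5)  # negative branch
        return combine_cfg(pos_future.result(), neg_future.result(), scale)


latent = [1.0, 2.0, 4.0]
guided = cfg_step_parallel(latent, scale=4.0)
print(guided)
```

With a scale of exactly 1.0 the combined result equals the positive prediction, so the negative pass contributes nothing. This is why splitting the passes across GPUs only pays off for guidance scales above 1.0.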
Quick Start
Basic Usage
Simplest working example:
from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from PIL import Image

omni = Omni(
    model="Qwen/Qwen-Image-Edit",
    parallel_config=DiffusionParallelConfig(cfg_parallel_size=2),  # Enable CFG-Parallel
)

input_image = Image.open("input.png").convert("RGB")

outputs = omni.generate(
    {
        "prompt": "turn this cat to a dog",
        "negative_prompt": "low quality, blurry",
        "multi_modal_data": {"image": input_image},
    },
    OmniDiffusionSamplingParams(
        true_cfg_scale=4.0,
        num_inference_steps=50,
    ),
)
Example Script
Offline Inference
Use the Python script at examples/offline_inference/image_to_image/image_edit.py:
cd examples/offline_inference/image_to_image/
python image_edit.py \
    --model "Qwen/Qwen-Image-Edit" \
    --image "input.png" \
    --prompt "turn this cat to a dog" \
    --negative-prompt "low quality, blurry" \
    --cfg-scale 4.0 \
    --output "edited_image.png" \
    --cfg-parallel-size 2
Online Serving
Enable CFG-Parallel in online serving:
Configuration Parameters
In DiffusionParallelConfig:
| Parameter | Type | Default | Description |
|---|---|---|---|
| cfg_parallel_size | int | 1 | Number of GPUs for CFG parallelism. Set to 2 to enable CFG-Parallel (rank 0 runs the positive branch, rank 1 the negative branch) |
Info
Most models support cfg_parallel_size=2 (positive branch on rank 0, negative branch on rank 1). Bagel is an exception: it supports cfg_parallel_size=3, which adds a third branch on rank 2 for full three-way CFG parallelism.
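The rank-to-branch assignment described above can be sketched as a small lookup. This is an illustrative helper, not part of vLLM-Omni: the function name `branch_for_rank` is hypothetical, and `"third"` is a placeholder label because the text does not name Bagel's third branch.

```python
def branch_for_rank(rank, cfg_parallel_size):
    """Return the CFG branch a rank handles (illustrative mapping only)."""
    if cfg_parallel_size not in (1, 2, 3):
        raise ValueError("cfg_parallel_size must be 1, 2, or 3")
    if not 0 <= rank < cfg_parallel_size:
        raise ValueError("rank out of range for this cfg_parallel_size")
    if cfg_parallel_size == 1:
        # CFG-Parallel disabled: a single rank runs every pass itself.
        return "all branches"
    # "third" is a placeholder; the source does not name Bagel's third branch.
    return ("positive", "negative", "third")[rank]


for size in (1, 2, 3):
    print(size, [branch_for_rank(r, size) for r in range(size)])
```

The sketch does not check which model is in use. In practice, a size of 3 applies only to Bagel, as noted above.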
Best Practices
When to Use
Good for:
- Tasks requiring classifier-free guidance
- Multi-GPU setups (at least 2 GPUs available)
- Combining with other parallelism methods (sequence/tensor parallel)
Not for:
- Single GPU setups
- Models that don't support CFG-Parallel (check supported models)
- Workloads without negative prompts or classifier-free guidance
- Very short inference runs (< 10 steps) where parallelism overhead may outweigh benefits
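The checklist above can be expressed as a small pre-flight function. This is a hypothetical helper for illustration: the name `should_use_cfg_parallel` and its arguments are assumptions, and the 10-step threshold is taken directly from the last bullet rather than from any measured cutoff.

```python
def should_use_cfg_parallel(num_gpus, uses_cfg, model_supported, num_steps):
    """Return (decision, reason) following the checklist above."""
    if num_gpus < 2:
        return False, "needs at least 2 GPUs"
    if not model_supported:
        return False, "model does not support CFG-Parallel"
    if not uses_cfg:
        return False, "workload does not use classifier-free guidance"
    if num_steps < 10:
        return False, "very short run; parallelism overhead may outweigh benefits"
    return True, "CFG-Parallel is likely to help"


decision, reason = should_use_cfg_parallel(
    num_gpus=2, uses_cfg=True, model_supported=True, num_steps=50
)
print(decision, reason)
```

Returning the reason alongside the decision makes it easier to log why CFG-Parallel was skipped for a given deployment.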
Expected Performance
| Configuration | Speedup | Quality | Use Case |
|---|---|---|---|
| CFG-Parallel (2 GPUs) | 1.5~1.8x | No degradation | Large model, VRAM limited |
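One way to read the speedup range is through a simple cost model, sketched below. It assumes that sequential CFG costs two forward passes per step and that the parallel version costs one pass plus some overhead, expressed as a fraction of one pass. This model is an assumption made for illustration. It ignores work outside the denoising loop and is not derived from vLLM-Omni measurements.

```python
def estimated_speedup(overhead_fraction):
    """Speedup of 2-GPU CFG-Parallel under the simple cost model.

    Sequential cost per step: 2 passes.
    Parallel cost per step:   1 pass + overhead_fraction of a pass.
    """
    if overhead_fraction < 0:
        raise ValueError("overhead_fraction must be non-negative")
    return 2.0 / (1.0 + overhead_fraction)


def implied_overhead(speedup):
    """Invert the model: overhead fraction that would yield a given speedup."""
    if not 0 < speedup <= 2.0:
        raise ValueError("speedup must be in (0, 2] for two GPUs")
    return 2.0 / speedup - 1.0


for s in (1.5, 1.8):
    print(f"speedup {s}x -> overhead {implied_overhead(s):.2f} of one pass")
```

Under this model, a 1.8x speedup corresponds to overhead of about 11% of one forward pass, and 1.5x to about 33%. The ideal 2x is reached only when overhead is zero.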
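Troubleshooting
Common Issue 1: No Speedup with CFG-Parallel
Symptoms: CFG-Parallel is enabled but there is no performance improvement.
Solutions:
- Ensure the CFG scale is set correctly: true_cfg_scale must be greater than 1.0 in OmniDiffusionSamplingParams.
- Add a negative prompt: include a "negative_prompt" entry in the request, as in the Basic Usage example.
- Check model support:
  - Verify that your model is listed in supported models.
  - Some models don't support CFG-Parallel.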
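The request-level checks can be automated with a small diagnostic, sketched here. The helper `diagnose_no_speedup` is hypothetical and only inspects plain Python values. It mirrors the request dictionary and parameter names from the Basic Usage example and does not call any vLLM-Omni API.

```python
def diagnose_no_speedup(request, true_cfg_scale, cfg_parallel_size):
    """Return a list of likely reasons CFG-Parallel gives no speedup."""
    problems = []
    if cfg_parallel_size < 2:
        problems.append("cfg_parallel_size is below 2, so CFG-Parallel is disabled")
    if true_cfg_scale <= 1.0:
        problems.append("true_cfg_scale must be greater than 1.0")
    # Treat a missing, None, or whitespace-only negative prompt as absent.
    if not str(request.get("negative_prompt") or "").strip():
        problems.append("request has no negative_prompt")
    return problems


request = {"prompt": "turn this cat to a dog", "negative_prompt": ""}
for problem in diagnose_no_speedup(request, true_cfg_scale=1.0, cfg_parallel_size=2):
    print("-", problem)
```

An empty list means the request passes these checks. Model support still has to be verified against the supported models list.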
Summary
- ✅ Enable CFG-Parallel - Set cfg_parallel_size=2 in DiffusionParallelConfig to get speedup when using CFG
- ✅ Set CFG Scale - Ensure true_cfg_scale > 1.0 in OmniDiffusionSamplingParams for CFG-Parallel to take effect
- ✅ Check Model Support - Verify your model supports CFG-Parallel in supported models