BAGEL-7B-MoT¶
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/bagel.
Setup¶
Please refer to the stage configuration documentation to configure memory allocation appropriately for your hardware setup.
Architecture¶
BAGEL-7B-MoT is a Mixture-of-Transformers (MoT) model supporting both image generation and understanding. It offers two deployment topologies:
| Topology | Stages | Description |
|---|---|---|
| Two-stage (default) | Stage 0 (Thinker, AR) + Stage 1 (DiT, Diffusion) | Thinker handles text/understanding via vLLM AR engine; DiT handles image generation. KV cache is transferred between stages. |
| Single-stage | Stage 0 (DiT, Diffusion) only | The DiT stage contains a full LLM, ViT, VAE, and tokenizer internally. All modalities are handled within a single diffusion process. |
Both topologies support all four modalities: text2img, img2img, img2text, text2text.
Quick Start¶
cd examples/offline_inference/bagel
# Default two-stage mode (auto-detected)
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A cute cat"
# Single-stage mode
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A cute cat" \
--deploy-config vllm_omni/deploy/bagel_single_stage.yaml
Note: These examples work with the default configuration on an NVIDIA A100 (80GB). For dual-GPU setups, modify the deploy YAML to distribute stages across devices.
Modality Control¶
Control the mode using the --modality argument:
| Modality | Input | Output | Description |
|---|---|---|---|
| text2img | Text | Image | Generate images from text prompts |
| img2img | Image + Text | Image | Transform images using text guidance |
| img2text | Image + Text | Text | Generate text descriptions from images |
| text2text | Text | Text | Pure text generation (language model mode) |
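The table above implies that img2img and img2text need an input image while the other two modalities do not. As a minimal sketch of that rule, the snippet below assembles an argument list for end2end.py using only flags documented on this page. The helper `build_args` and the `MODALITIES` dict are hypothetical names introduced for illustration; they are not part of the example script.

```python
# Input requirements per modality, copied from the table above.
MODALITIES = {
    "text2img": {"needs_image": False, "output": "image"},
    "img2img": {"needs_image": True, "output": "image"},
    "img2text": {"needs_image": True, "output": "text"},
    "text2text": {"needs_image": False, "output": "text"},
}


def build_args(modality, prompt, image_path=None):
    """Build an argv list for end2end.py (hypothetical helper, illustration only)."""
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality}")
    needs_image = MODALITIES[modality]["needs_image"]
    if needs_image and image_path is None:
        raise ValueError(f"{modality} requires an input image")

    args = ["python", "end2end.py", "--modality", modality, "--prompts", prompt]
    if needs_image:
        # Only image-conditioned modalities receive --image-path in this sketch.
        args += ["--image-path", image_path]
    return args
```

The returned list can be passed to `subprocess.run` from the examples/offline_inference/bagel directory.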
Text to Image (text2img)¶
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A cute cat" \
--steps 50
Image to Image (img2img)¶
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality img2img \
--image-path /path/to/image.jpg \
--prompts "Let the woman wear a blue dress"
Image to Text (img2text)¶
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality img2text \
--image-path /path/to/image.jpg \
--prompts "Describe this image in detail"
Text to Text (text2text)¶
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2text \
--prompts "What is the capital of France?"
# Load prompts from a text file (one prompt per line):
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2text \
--txt-prompts /path/to/prompts.txt
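Since --txt-prompts expects one prompt per line, a prompt that itself contains a newline would be read as two prompts. The sketch below writes such a file while collapsing internal whitespace so each prompt stays on a single line. `write_prompts` is a hypothetical helper, and how end2end.py treats blank lines is not covered here.

```python
from pathlib import Path


def write_prompts(path, prompts):
    """Write prompts to a text file, one per line; return the number written."""
    # Collapse newlines and repeated spaces so every prompt occupies one line.
    lines = [" ".join(p.split()) for p in prompts]
    # Drop prompts that are empty after normalization.
    lines = [line for line in lines if line]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```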
Think Mode¶
Think mode lets the model generate <think>...</think> planning/reasoning tokens before producing the final output. This can improve generation quality for complex prompts, at the cost of decoding up to --max-think-tokens extra tokens.
- Two-stage: The Thinker (AR) stage decodes think tokens, then transfers the augmented KV cache to the DiT stage for image generation.
- Single-stage: The DiT's internal LLM generates think tokens in-place before proceeding to denoise.
# Think + text2img: plan before generating
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A futuristic city with flying cars" \
--think \
--max-think-tokens 1000
# Think + img2img: reason about the edit
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality img2img \
--image-path /path/to/image.jpg \
--prompts "Make it look like a watercolor painting" \
--think
# Think + img2text: reason before describing
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality img2text \
--image-path /path/to/image.jpg \
--prompts "What is happening in this image?" \
--think
# Think + text2text: chain-of-thought reasoning
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2text \
--prompts "Solve: 23 * 47" \
--think
Think mode parameters:
| Argument | Default | Description |
|---|---|---|
| --think | False | Enable thinking mode |
| --max-think-tokens | 1000 | Maximum tokens for think generation |
| --do-sample | False | Enable sampling (vs. greedy) for text generation |
| --text-temperature | 0.3 | Temperature for text generation sampling |
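When think mode is used with a text-producing modality, it can be convenient to separate the reasoning from the final answer. The sketch below assumes the generated text contains the <think>...</think> block inline, which is an assumption about the output format rather than something this page guarantees. `split_think` is a hypothetical helper.

```python
import re

# Non-greedy match so only the first think block is captured; DOTALL lets it span lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text):
    """Return (reasoning, answer); reasoning is None when no think block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```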
Classifier-Free Guidance (CFG)¶
CFG controls the trade-off between prompt fidelity and diversity. These parameters apply to image generation modalities (text2img, img2img).
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A photorealistic portrait" \
--cfg-text-scale 6.0 \
--cfg-img-scale 2.0 \
--negative-prompt "blurry, low quality, distorted" \
--cfg-interval 0.4 1.0 \
--cfg-renorm-type global \
--cfg-renorm-min 0.0
| Argument | Default | Description |
|---|---|---|
| --cfg-text-scale | 4.0 | Text CFG scale (higher = more prompt-adherent) |
| --cfg-img-scale | 1.5 | Image CFG scale (for img2img) |
| --negative-prompt | None | Negative prompt for CFG conditioning |
| --cfg-interval | pipeline default | CFG active interval [start, end] as fractions of total timesteps |
| --cfg-renorm-type | None | Renormalization type: global, text_channel, channel |
| --cfg-renorm-min | None | Minimum renormalization value |
| --cfg-parallel-size | 1 | CFG parallel size: 1 = batched (single GPU), 2 = 2-branch parallel, 3 = full 3-GPU parallel |
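To build intuition for --cfg-interval, the sketch below lists which denoising steps fall inside a given window. It rests on two assumptions that this page does not confirm: timesteps are evenly spaced from 1.0 down toward 0.0, and the window is half-open, (start, end]. The real scheduler may shift or reorder timesteps, so treat the result as a rough guide only. `cfg_active_steps` is a hypothetical helper.

```python
def cfg_active_steps(num_steps, interval):
    """Return indices of steps whose timestep t satisfies start < t <= end.

    Assumes t = 1.0 - i / num_steps for step i (evenly spaced, no timestep shift).
    """
    start, end = interval
    active = []
    for i in range(num_steps):
        t = 1.0 - i / num_steps
        if start < t <= end:
            active.append(i)
    return active
```

Under these assumptions, a narrower window means fewer steps pay the extra cost of the guidance branches, while the remaining steps run without guidance.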
Deployment Topologies¶
Two-Stage (Default)¶
The two-stage topology is the default and is auto-detected from the model, so no extra flags are needed.
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A cute cat"
The pipeline is defined in bagel.yaml. Stage 0 (Thinker) and Stage 1 (DiT) share GPU 0 by default. For dual-GPU setups, customize the deploy YAML and set devices: "1" for stage 1.
Single-Stage¶
Pass the single-stage deploy config via --deploy-config:
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A cute cat" \
--deploy-config vllm_omni/deploy/bagel_single_stage.yaml
See bagel_single_stage.yaml for configuration details. The pipeline: bagel_single_stage field selects the single-stage topology from the pipeline registry.
Tensor Parallelism (TP)¶
For larger models or multi-GPU environments, customize the deploy YAML (see bagel.yaml) and set per-stage tensor_parallel_size and devices:
# Example: TP=2 on GPUs 0,1 for the Thinker stage
stages:
  - stage_id: 0
    tensor_parallel_size: 2
    devices: "0,1"
VAE Patch Parallelism¶
VAE Patch Parallelism splits Bagel VAE decode/encode tiles across multiple GPUs on the DiT stage, reducing per-GPU peak memory during VAE decode. Use it when high-resolution text2img or img2img hits VAE OOM or large decode spikes.
Bagel-specific notes:
- Implemented in BagelPipeline via DistributedAutoEncoder (DiT stage only).
- Single-stage is the simplest path: one DiT process with TP + VAE patch parallel.
- Two-stage: enable on stage 1 (DiT) only; stage 0 (Thinker) keeps the encoder-only VAEEncoder and does not use VAE patch parallel.
- You need a DiT world_size ≥ vae_patch_parallel_size (typically tensor_parallel_size=2 on that stage). VAE PP reuses the DiT process group; it is not a standalone second-GPU VAE worker.
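The size constraint in the notes above can be checked before launching. The sketch below takes a dict shaped like one entry of stages in a deploy YAML and reports violations. It treats tensor_parallel_size as the DiT world size, and it additionally assumes that the number of listed devices must be at least tensor_parallel_size; that second check is an assumption, not something stated on this page. `check_vae_patch_parallel` is a hypothetical helper.

```python
def check_vae_patch_parallel(stage):
    """Return a list of problems for one stage entry (an empty list means OK)."""
    parallel = stage.get("parallel_config", {})
    tp = int(parallel.get("tensor_parallel_size", 1))
    vae_pp = int(parallel.get("vae_patch_parallel_size", 1))
    devices = [d for d in str(stage.get("devices", "0")).split(",") if d.strip()]

    problems = []
    # Documented constraint: DiT world size must be >= vae_patch_parallel_size.
    if vae_pp > tp:
        problems.append(
            f"vae_patch_parallel_size={vae_pp} exceeds tensor_parallel_size={tp}"
        )
    # Assumption: each TP rank needs its own device.
    if tp > len(devices):
        problems.append(
            f"tensor_parallel_size={tp} exceeds device count {len(devices)}"
        )
    return problems
```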
Single-stage via deploy YAML (recommended for end2end.py):
pipeline: bagel_single_stage
async_chunk: false
stages:
  - stage_id: 0
    max_num_batched_tokens: 32768
    max_num_seqs: 1
    enforce_eager: true
    trust_remote_code: true
    enable_prefix_caching: false
    devices: "0,1"
    vae_use_tiling: true
    parallel_config:
      tensor_parallel_size: 2
      vae_patch_parallel_size: 2
    default_sampling_params:
      seed: 52
cd examples/offline_inference/bagel
CUDA_VISIBLE_DEVICES=0,1 python end2end.py \
--model /path/to/BAGEL-7B-MoT \
--deploy-config /path/to/bagel_single_stage_vae_pp.yaml \
--modality text2img \
--prompts "A cute cat" \
--steps 10 \
--output ./out_vae_pp
Single-stage via Omni kwargs (same flags as online serving):
from vllm_omni.entrypoints.omni import Omni

omni = Omni(
    model="ByteDance-Seed/BAGEL-7B-MoT",
    deploy_config="vllm_omni/deploy/bagel_single_stage.yaml",
    tensor_parallel_size=2,
    vae_patch_parallel_size=2,
    vae_use_tiling=True,
)
# Then call omni.generate(...) as in end2end.py
Two-stage (VAE PP on DiT only):
stages:
  - stage_id: 0
    devices: "0"
    # AR Thinker: no vae_patch_parallel here
  - stage_id: 1
    devices: "0,1"
    vae_use_tiling: true
    parallel_config:
      tensor_parallel_size: 2
      vae_patch_parallel_size: 2
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--deploy-config /path/to/bagel_vae_pp.yaml \
--modality text2img \
--prompts "A cute cat"
Startup log checks:
| Setting | Role |
|---|---|
| parallel_config.tensor_parallel_size | DiT world size / TP (must be ≥ vae_patch_parallel_size) |
| parallel_config.vae_patch_parallel_size | Number of ranks for distributed VAE tiles (1 = off) |
| vae_use_tiling | Enable spatial tiling (auto-enabled when vae_patch_parallel_size > 1) |
Hybrid Sharded Data Parallel (HSDP)¶
For larger Bagel deployments on multiple GPUs, you can enable HSDP (Hybrid Sharded Data Parallel) by modifying the stage configuration (for example, bagel.yaml). HSDP shards transformer weights across GPUs to reduce per-GPU memory usage.
- Enable HSDP: Set use_hsdp: true.
- Set shard size: Set hsdp_shard_size to the number of GPUs used for sharding (for example, 4).
- Set replicate size: Usually keep hsdp_replicate_size: 1 unless you want replicated HSDP groups.
- Set devices: Specify the comma-separated GPU IDs used by the diffusion stage (for example, "0,1,2,3").
Example configuration for HSDP across 4 GPUs:
- stage_id: 1
  devices: "0,1,2,3"
  parallel_config:
    use_hsdp: true
    hsdp_shard_size: 4
    hsdp_replicate_size: 1
Then pass the custom deploy YAML:
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A cute cat" \
--deploy-config /path/to/custom_bagel.yaml
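A quick sanity check on the HSDP settings can catch mismatched sizes before startup. The sketch below assumes that hsdp_shard_size × hsdp_replicate_size should equal the number of devices assigned to the stage, which matches the 4-GPU example above (4 × 1 = 4) but is not stated as a rule on this page. `check_hsdp` is a hypothetical helper operating on a dict shaped like one stage entry.

```python
def check_hsdp(stage):
    """Return True if the HSDP sizes are consistent with the stage's device list.

    Assumes shard_size * replicate_size must equal the device count.
    """
    parallel = stage.get("parallel_config", {})
    if not parallel.get("use_hsdp", False):
        return True  # HSDP disabled, nothing to check
    shard = int(parallel.get("hsdp_shard_size", 1))
    replicate = int(parallel.get("hsdp_replicate_size", 1))
    devices = [d for d in str(stage.get("devices", "0")).split(",") if d.strip()]
    return shard * replicate == len(devices)
```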
FP8 Quantization¶
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A cute cat" \
--quantization fp8
Command Line Reference¶
Core Arguments¶
| Argument | Type | Default | Description |
|---|---|---|---|
| --model | string | ByteDance-Seed/BAGEL-7B-MoT | Model path or HuggingFace name |
| --modality | choice | text2img | text2img, img2img, img2text, text2text |
| --prompts | list | None | Input text prompts |
| --txt-prompts | string | None | Path to text file with one prompt per line |
| --image-path | string | None | Input image path (required for img2img/img2text) |
| --output | string | . | Output directory for saved images |
| --steps | int | 50 | Number of diffusion inference steps |
| --seed | int | None | Random seed for reproducibility |
Think Mode Arguments¶
| Argument | Type | Default | Description |
|---|---|---|---|
| --think | flag | False | Enable <think>...</think> planning/reasoning |
| --max-think-tokens | int | 1000 | Maximum tokens for think generation |
| --do-sample | flag | False | Use sampling instead of greedy decoding |
| --text-temperature | float | 0.3 | Sampling temperature for text generation |
CFG Arguments¶
| Argument | Type | Default | Description |
|---|---|---|---|
| --cfg-text-scale | float | 4.0 | Text CFG guidance scale |
| --cfg-img-scale | float | 1.5 | Image CFG guidance scale |
| --negative-prompt | string | None | Negative prompt for CFG |
| --cfg-parallel-size | int | 1 | CFG parallel GPU count (1, 2, or 3) |
| --cfg-interval | float[2] | pipeline default | CFG active window [start, end] |
| --cfg-renorm-type | string | None | global, text_channel, or channel |
| --cfg-renorm-min | float | None | Minimum renormalization value |
Engine Arguments¶
| Argument | Type | Default | Description |
|---|---|---|---|
| --deploy-config | string | None | Path to deploy YAML (auto-detected if omitted) |
| --worker-backend | choice | process | process or ray |
| --ray-address | string | None | Ray cluster address |
| --quantization | string | None | Quantization method (e.g. fp8) |
| --log-stats | flag | False | Enable statistics logging |
| --init-timeout | int | 300 | Initialization timeout (seconds) |
| --batch-timeout | int | 5 | Batch timeout (seconds) |
| --enable-diffusion-pipeline-profiler | flag | False | Profile diffusion stage durations |
FAQ¶
- If you encounter OOM errors, try decreasing max_model_len or gpu_memory_utilization in the deploy YAML.
Two-stage VRAM usage:
| Stage | VRAM |
|---|---|
| Stage 0 (Thinker) | 15.04 GiB + KV Cache |
| Stage 1 (DiT) | 26.50 GiB |
| Total | ~42 GiB + KV Cache |
Single-stage VRAM usage: The DiT loads the full model (~42 GiB) in one process.
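The figures above can be turned into a rough estimate of how much memory remains for the KV cache on a single GPU. The sketch below is simple arithmetic on the two-stage numbers from the table. It treats the utilization value as one global fraction of the card, which is a simplification (the deploy YAML configures gpu_memory_utilization per stage), and it ignores activation memory. `kv_cache_headroom_gib` is a hypothetical helper.

```python
# Weight footprints from the two-stage table above, in GiB.
STAGE0_THINKER_GIB = 15.04
STAGE1_DIT_GIB = 26.50


def kv_cache_headroom_gib(gpu_total_gib, utilization=1.0):
    """Estimate GiB left for KV cache after loading both stages on one GPU.

    A negative result means the weights alone do not fit in the budget.
    """
    weights = STAGE0_THINKER_GIB + STAGE1_DIT_GIB
    return gpu_total_gib * utilization - weights
```

For instance, calling `kv_cache_headroom_gib(80.0, utilization=0.9)` estimates the headroom on an 80 GiB card when 90% of memory is budgeted.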
Example materials¶
end2end.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/bagel/end2end.py.