BAGEL-7B-MoT

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/bagel

Setup

See the stage configuration documentation to set per-stage memory allocation for your hardware.

Architecture

BAGEL-7B-MoT is a Mixture-of-Transformers (MoT) model supporting both image generation and understanding. It offers two deployment topologies:

Topology	Stages	Description
Two-stage (default)	Stage 0 (Thinker, AR) + Stage 1 (DiT, Diffusion)	Thinker handles text/understanding via vLLM AR engine; DiT handles image generation. KV cache is transferred between stages.
Single-stage	Stage 0 (DiT, Diffusion) only	The DiT stage contains a full LLM, ViT, VAE, and tokenizer internally. All modalities are handled within a single diffusion process.

Both topologies support all four modalities: text2img, img2img, img2text, text2text.

Quick Start

cd examples/offline_inference/bagel

# Default two-stage mode (auto-detected)
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2img \
                  --prompts "A cute cat"

# Single-stage mode
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2img \
                  --prompts "A cute cat" \
                  --deploy-config vllm_omni/deploy/bagel_single_stage.yaml

Note: These examples run with the default configuration on a single NVIDIA A100 (80 GB). For dual-GPU setups, modify the deploy YAML to distribute stages across devices.

Modality Control

Select the modality with the --modality argument:

Modality	Input	Output	Description
`text2img`	Text	Image	Generate images from text prompts
`img2img`	Image + Text	Image	Transform images using text guidance
`img2text`	Image + Text	Text	Generate text descriptions from images
`text2text`	Text	Text	Pure text generation (language model mode)
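
The table can also be read as a small lookup: each modality fixes whether an input image is needed and what kind of output to expect. The sketch below encodes that in Python as a pre-flight check. ModalitySpec, MODALITIES, and check_request are hypothetical names for illustration and are not part of end2end.py or vllm-omni; the only facts taken from this page are the four modality names and their inputs and outputs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModalitySpec:
    needs_image: bool  # True when the modality takes an input image
    output: str        # "image" or "text"


# Transcribed from the modality table above.
MODALITIES = {
    "text2img": ModalitySpec(needs_image=False, output="image"),
    "img2img": ModalitySpec(needs_image=True, output="image"),
    "img2text": ModalitySpec(needs_image=True, output="text"),
    "text2text": ModalitySpec(needs_image=False, output="text"),
}


def check_request(modality: str, prompt: str,
                  image_path: Optional[str] = None) -> ModalitySpec:
    """Hypothetical pre-flight check; not part of end2end.py."""
    if modality not in MODALITIES:
        raise ValueError(
            f"unknown modality {modality!r}; expected one of {sorted(MODALITIES)}"
        )
    spec = MODALITIES[modality]
    if spec.needs_image and not image_path:
        raise ValueError(f"{modality} requires an input image (--image-path)")
    if not prompt:
        raise ValueError("a text prompt is required")
    return spec
```

Running a check like this before launching a job catches a missing --image-path before the model is loaded.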

Text to Image (text2img)

python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2img \
                  --prompts "A cute cat" \
                  --steps 50

Image to Image (img2img)

python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality img2img \
                  --image-path /path/to/image.jpg \
                  --prompts "Let the woman wear a blue dress"

Image to Text (img2text)

python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality img2text \
                  --image-path /path/to/image.jpg \
                  --prompts "Describe this image in detail"

Text to Text (text2text)

python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2text \
                  --prompts "What is the capital of France?"

# Load prompts from a text file (one prompt per line):
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2text \
                  --txt-prompts /path/to/prompts.txt
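
A prompts file in the one-prompt-per-line layout can be read with a few lines of Python. This is a minimal sketch, not the loader used by end2end.py: load_prompts is a hypothetical helper, and stripping whitespace and skipping blank lines are assumptions about how such a file should be treated.

```python
from pathlib import Path
from typing import List


def load_prompts(path: str) -> List[str]:
    """Read one prompt per line, dropping surrounding whitespace and blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

The returned list has one entry per non-empty line, in file order.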

Think Mode

Think mode lets the model generate <think>...</think> planning/reasoning tokens before producing the final output. This can improve generation quality for complex prompts.

Two-stage: The Thinker (AR) stage decodes think tokens, then transfers the augmented KV cache to the DiT stage for image generation.
Single-stage: The DiT's internal LLM generates think tokens in-place before proceeding to denoise.

# Think + text2img: plan before generating
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2img \
                  --prompts "A futuristic city with flying cars" \
                  --think \
                  --max-think-tokens 1000

# Think + img2img: reason about the edit
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality img2img \
                  --image-path /path/to/image.jpg \
                  --prompts "Make it look like a watercolor painting" \
                  --think

# Think + img2text: reason before describing
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality img2text \
                  --image-path /path/to/image.jpg \
                  --prompts "What is happening in this image?" \
                  --think

# Think + text2text: chain-of-thought reasoning
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2text \
                  --prompts "Solve: 23 * 47" \
                  --think

Think mode parameters:

Argument	Default	Description
`--think`	`False`	Enable thinking mode
`--max-think-tokens`	`1000`	Maximum tokens for think generation
`--do-sample`	`False`	Enable sampling (vs. greedy) for text generation
`--text-temperature`	`0.3`	Temperature for text generation sampling
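
When think mode is used for text output, it can be handy to separate the reasoning from the final answer. The sketch below assumes the decoded text contains literal <think> and </think> tags around the reasoning; whether end2end.py already strips them is not stated on this page, and split_think is a hypothetical helper.

```python
import re
from typing import Tuple

# Non-greedy match so only the first think block is captured; DOTALL lets it span lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no think block is present."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

Text without a think block is returned unchanged as the answer, so the helper is safe to call whether or not --think was passed.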

Classifier-Free Guidance (CFG)

CFG controls the trade-off between prompt fidelity and diversity. These parameters apply to image generation modalities (text2img, img2img).

python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2img \
                  --prompts "A photorealistic portrait" \
                  --cfg-text-scale 6.0 \
                  --cfg-img-scale 2.0 \
                  --negative-prompt "blurry, low quality, distorted" \
                  --cfg-interval 0.4 1.0 \
                  --cfg-renorm-type global \
                  --cfg-renorm-min 0.0

Argument	Default	Description
`--cfg-text-scale`	`4.0`	Text CFG scale (higher = more prompt-adherent)
`--cfg-img-scale`	`1.5`	Image CFG scale (for img2img)
`--negative-prompt`	`None`	Negative prompt for CFG conditioning
`--cfg-interval`	pipeline default	CFG active interval `[start, end]` as fractions of total timesteps
`--cfg-renorm-type`	`None`	Renormalization type: `global`, `text_channel`, `channel`
`--cfg-renorm-min`	`None`	Minimum renormalization value
`--cfg-parallel-size`	`1`	CFG parallel size: `1` = batched (single GPU), `2` = 2-branch parallel, `3` = full 3-GPU parallel
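
For intuition, the sketch below shows the textbook classifier-free guidance combination and how an active interval could gate it. It is a generic illustration under stated assumptions, not BAGEL's pipeline code: the function names are hypothetical, predictions are plain lists of floats, t is assumed to be the current position expressed as a fraction of total timesteps, and renormalization (--cfg-renorm-type, --cfg-renorm-min) is not modeled.

```python
from typing import List, Sequence, Tuple


def cfg_active(t: float, interval: Tuple[float, float]) -> bool:
    """True when the timestep fraction t lies inside the [start, end] window."""
    start, end = interval
    return start <= t <= end


def apply_cfg(cond: Sequence[float], uncond: Sequence[float],
              scale: float) -> List[float]:
    """Textbook CFG: uncond + scale * (cond - uncond), element by element."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def guided_prediction(cond: Sequence[float], uncond: Sequence[float],
                      scale: float, t: float,
                      interval: Tuple[float, float]) -> List[float]:
    """Apply guidance inside the interval; otherwise return the conditional prediction."""
    if not cfg_active(t, interval):
        return list(cond)
    return apply_cfg(cond, uncond, scale)
```

With a scale of 1.0 the formula returns the conditional prediction unchanged; larger scales move the result further from the unconditional branch, which is why higher values track the prompt more closely.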

Deployment Topologies

Two-Stage (Default)

The default topology is auto-detected from the model, so no extra flags are needed.

python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2img \
                  --prompts "A cute cat"

The pipeline is defined in bagel.yaml. Stage 0 (Thinker) and Stage 1 (DiT) share GPU 0 by default. For dual-GPU setups, customize the deploy YAML and set devices: "1" for stage 1.

Single-Stage

Pass the single-stage deploy config via --deploy-config:

python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2img \
                  --prompts "A cute cat" \
                  --deploy-config vllm_omni/deploy/bagel_single_stage.yaml

See bagel_single_stage.yaml for configuration details. The pipeline: bagel_single_stage field selects the single-stage topology from the pipeline registry.

Tensor Parallelism (TP)

For larger models or multi-GPU environments, customize the deploy YAML (see bagel.yaml) and set per-stage tensor_parallel_size and devices:

# Example: TP=2 on GPUs 0,1 for the Thinker stage
stages:
  - stage_id: 0
    tensor_parallel_size: 2
    devices: "0,1"
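
A quick consistency check for a stage entry like the one above: the sketch parses the devices string and verifies that it lists at least tensor_parallel_size GPUs. parse_devices and check_stage_tp are hypothetical helpers, not vllm-omni APIs, and the defaults used when a key is missing (TP of 1, devices "0") are assumptions.

```python
from typing import Any, Dict, List


def parse_devices(devices: str) -> List[int]:
    """Turn a comma-separated devices string such as "0,1" into [0, 1]."""
    ids = [int(part) for part in devices.split(",") if part.strip()]
    if len(set(ids)) != len(ids):
        raise ValueError(f"duplicate GPU id in devices={devices!r}")
    return ids


def check_stage_tp(stage: Dict[str, Any]) -> None:
    """Raise ValueError when a stage lists fewer GPUs than its TP size."""
    tp = int(stage.get("tensor_parallel_size", 1))    # assumed default
    ids = parse_devices(str(stage.get("devices", "0")))  # assumed default
    if len(ids) < tp:
        raise ValueError(
            f"stage {stage.get('stage_id')}: tensor_parallel_size={tp} "
            f"but only {len(ids)} device(s) listed"
        )
```

The stage dictionary is what a YAML loader would return for one entry under stages.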

VAE Patch Parallelism

VAE Patch Parallelism splits Bagel VAE decode/encode tiles across multiple GPUs on the DiT stage, reducing per-GPU peak memory during VAE decode. Use it when high-resolution text2img or img2img runs out of memory in the VAE or shows large memory spikes during decode.

Bagel-specific notes:

Implemented in BagelPipeline via DistributedAutoEncoder (DiT stage only).
Single-stage is the simplest path: one DiT process with TP + VAE patch parallel.
Two-stage: enable on stage 1 (DiT) only; stage 0 (Thinker) keeps encoder-only VAEEncoder and does not use VAE patch parallel.
You need a DiT world_size ≥ vae_patch_parallel_size (typically tensor_parallel_size=2 on that stage). VAE PP reuses the DiT process group; it is not a standalone second-GPU VAE worker.
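
As a toy illustration of the tile-splitting idea, the sketch below cuts an image into a grid of tiles and deals them out to ranks round-robin. It is not the DistributedAutoEncoder implementation: the helper names, the 512-pixel tile size in the usage comment, the non-overlapping grid, and the round-robin policy are all assumptions chosen for brevity.

```python
from typing import List, Tuple

Tile = Tuple[int, int]  # (row, col) index in the tile grid


def tile_grid(height: int, width: int, tile: int) -> List[Tile]:
    """List tile indices covering a height x width image, row by row."""
    rows = -(-height // tile)  # ceiling division
    cols = -(-width // tile)
    return [(r, c) for r in range(rows) for c in range(cols)]


def assign_tiles(tiles: List[Tile], world_size: int) -> List[List[Tile]]:
    """Round-robin assignment: rank i takes tiles i, i + world_size, ..."""
    return [tiles[rank::world_size] for rank in range(world_size)]


# Example: a 1024x1024 image with 512-pixel tiles on 2 ranks.
# assign_tiles(tile_grid(1024, 1024, 512), 2)
```

Each rank then processes only its share of the tiles, which is where the per-GPU peak-memory saving comes from.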

Single-stage via deploy YAML (recommended for end2end.py):

pipeline: bagel_single_stage
async_chunk: false

stages:
  - stage_id: 0
    max_num_batched_tokens: 32768
    max_num_seqs: 1
    enforce_eager: true
    trust_remote_code: true
    enable_prefix_caching: false
    devices: "0,1"
    vae_use_tiling: true
    parallel_config:
      tensor_parallel_size: 2
      vae_patch_parallel_size: 2
    default_sampling_params:
      seed: 52

Save this configuration to a file (for example, bagel_single_stage_vae_pp.yaml), then run:

cd examples/offline_inference/bagel

CUDA_VISIBLE_DEVICES=0,1 python end2end.py \
    --model /path/to/BAGEL-7B-MoT \
    --deploy-config /path/to/bagel_single_stage_vae_pp.yaml \
    --modality text2img \
    --prompts "A cute cat" \
    --steps 10 \
    --output ./out_vae_pp

Single-stage via Omni kwargs (same flags as online serving):

from vllm_omni.entrypoints.omni import Omni

omni = Omni(
    model="ByteDance-Seed/BAGEL-7B-MoT",
    deploy_config="vllm_omni/deploy/bagel_single_stage.yaml",
    tensor_parallel_size=2,
    vae_patch_parallel_size=2,
    vae_use_tiling=True,
)
# Then call omni.generate(...) as in end2end.py

Two-stage (VAE PP on DiT only):

stages:
  - stage_id: 0
    devices: "0"
    # AR Thinker — no vae_patch_parallel here

  - stage_id: 1
    devices: "0,1"
    vae_use_tiling: true
    parallel_config:
      tensor_parallel_size: 2
      vae_patch_parallel_size: 2

Merge these settings into a copy of the two-stage deploy YAML (here, bagel_vae_pp.yaml), then run:

python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --deploy-config /path/to/bagel_vae_pp.yaml \
    --modality text2img \
    --prompts "A cute cat"

Startup log checks:

INFO ... vae_patch_parallel_size=2 requires vae_use_tiling; automatically enabling it.

Setting	Role
`parallel_config.tensor_parallel_size`	DiT world size / TP (must be ≥ `vae_patch_parallel_size`)
`parallel_config.vae_patch_parallel_size`	Number of ranks for distributed VAE tiles (`1` = off)
`vae_use_tiling`	Enable spatial tiling (auto-enabled when `vae_patch_parallel_size > 1`)
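
The constraints in the table can be checked before launch. The sketch below is a hypothetical helper (not a vllm-omni API) that operates on a stage dictionary shaped like the YAML above; it treats tensor_parallel_size as the DiT world size, rejects a vae_patch_parallel_size larger than that, and mirrors the auto-enable behavior reported in the startup log.

```python
from typing import Any, Dict


def resolve_vae_pp(stage: Dict[str, Any]) -> Dict[str, Any]:
    """Validate VAE patch parallel settings and return a copy with tiling resolved."""
    pc = stage.get("parallel_config", {})
    tp = int(pc.get("tensor_parallel_size", 1))          # assumed to equal world size
    vae_pp = int(pc.get("vae_patch_parallel_size", 1))   # 1 means off
    if vae_pp < 1:
        raise ValueError("vae_patch_parallel_size must be >= 1")
    if vae_pp > tp:
        raise ValueError(
            f"vae_patch_parallel_size={vae_pp} exceeds DiT world size {tp}"
        )
    resolved = dict(stage)
    if vae_pp > 1 and not stage.get("vae_use_tiling", False):
        # Same outcome as the startup log line: tiling is switched on automatically.
        resolved["vae_use_tiling"] = True
    return resolved
```

The input dictionary is left untouched; only the returned copy carries the resolved vae_use_tiling value.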

Hybrid Sharded Data Parallel (HSDP)

For larger Bagel deployments on multiple GPUs, you can enable HSDP (Hybrid Sharded Data Parallel) by modifying the stage configuration (for example, bagel.yaml). HSDP shards transformer weights across GPUs to reduce per-GPU memory usage.

Enable HSDP: Set use_hsdp: true.
Set shard size: Set hsdp_shard_size to the number of GPUs used for sharding (for example, 4).
Set replicate size: Usually keep hsdp_replicate_size: 1 unless you want replicated HSDP groups.
Set devices: Specify the comma-separated GPU IDs used by the diffusion stage (for example, "0,1,2,3").

Example configuration for HSDP across 4 GPUs:

stages:
  - stage_id: 1
    devices: "0,1,2,3"
    parallel_config:
      use_hsdp: true
      hsdp_shard_size: 4
      hsdp_replicate_size: 1

Then pass the custom deploy YAML:

python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2img \
                  --prompts "A cute cat" \
                  --deploy-config /path/to/custom_bagel.yaml
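
To sanity-check an HSDP layout before editing the YAML, the sketch below groups GPU IDs into shard groups. It assumes the conventional HSDP arrangement in which the number of devices equals hsdp_shard_size × hsdp_replicate_size and each shard group is a contiguous run of IDs; neither assumption is confirmed by this page, and hsdp_groups is a hypothetical helper.

```python
from typing import List


def hsdp_groups(devices: List[int], shard_size: int,
                replicate_size: int = 1) -> List[List[int]]:
    """Split devices into replicate_size shard groups of shard_size GPUs each."""
    if shard_size < 1 or replicate_size < 1:
        raise ValueError("shard_size and replicate_size must be >= 1")
    if shard_size * replicate_size != len(devices):
        raise ValueError(
            f"{len(devices)} device(s) cannot form {replicate_size} group(s) "
            f"of {shard_size}"
        )
    return [devices[i * shard_size:(i + 1) * shard_size]
            for i in range(replicate_size)]
```

For the 4-GPU example above, hsdp_groups([0, 1, 2, 3], 4, 1) yields a single shard group containing all four GPUs.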

FP8 Quantization

python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
                  --modality text2img \
                  --prompts "A cute cat" \
                  --quantization fp8

Command Line Reference

Core Arguments

Argument	Type	Default	Description
`--model`	string	`ByteDance-Seed/BAGEL-7B-MoT`	Model path or HuggingFace name
`--modality`	choice	`text2img`	`text2img`, `img2img`, `img2text`, `text2text`
`--prompts`	list	`None`	Input text prompts
`--txt-prompts`	string	`None`	Path to text file with one prompt per line
`--image-path`	string	`None`	Input image path (required for `img2img`/`img2text`)
`--output`	string	`.`	Output directory for saved images
`--steps`	int	`50`	Number of diffusion inference steps
`--seed`	int	`None`	Random seed for reproducibility

Think Mode Arguments

Argument	Type	Default	Description
`--think`	flag	`False`	Enable `<think>...</think>` planning/reasoning
`--max-think-tokens`	int	`1000`	Maximum tokens for think generation
`--do-sample`	flag	`False`	Use sampling instead of greedy decoding
`--text-temperature`	float	`0.3`	Sampling temperature for text generation

CFG Arguments

Argument	Type	Default	Description
`--cfg-text-scale`	float	`4.0`	Text CFG guidance scale
`--cfg-img-scale`	float	`1.5`	Image CFG guidance scale
`--negative-prompt`	string	`None`	Negative prompt for CFG
`--cfg-parallel-size`	int	`1`	CFG parallel GPU count (1, 2, or 3)
`--cfg-interval`	float[2]	pipeline default	CFG active window `[start, end]`
`--cfg-renorm-type`	string	`None`	`global`, `text_channel`, or `channel`
`--cfg-renorm-min`	float	`None`	Minimum renormalization value

Engine Arguments

Argument	Type	Default	Description
`--deploy-config`	string	`None`	Path to deploy YAML (auto-detected if omitted)
`--worker-backend`	choice	`process`	`process` or `ray`
`--ray-address`	string	`None`	Ray cluster address
`--quantization`	string	`None`	Quantization method (e.g. `fp8`)
`--log-stats`	flag	`False`	Enable statistics logging
`--init-timeout`	int	`300`	Initialization timeout (seconds)
`--batch-timeout`	int	`5`	Batch timeout (seconds)
`--enable-diffusion-pipeline-profiler`	flag	`False`	Profile diffusion stage durations
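
When scripting many runs, it can help to assemble the command line in Python rather than by string concatenation. The sketch below builds an argument list using only flags listed in the tables above; build_command is a hypothetical helper, it covers a subset of the flags, and it does not reproduce the script's own argument parsing.

```python
import shlex
from typing import List, Optional


def build_command(
    modality: str,
    prompt: str,
    model: str = "ByteDance-Seed/BAGEL-7B-MoT",
    image_path: Optional[str] = None,
    steps: Optional[int] = None,
    think: bool = False,
    deploy_config: Optional[str] = None,
) -> List[str]:
    """Assemble an end2end.py argv list from flags documented on this page."""
    cmd = ["python", "end2end.py", "--model", model,
           "--modality", modality, "--prompts", prompt]
    if image_path is not None:
        cmd += ["--image-path", image_path]
    if steps is not None:
        cmd += ["--steps", str(steps)]
    if think:
        cmd.append("--think")
    if deploy_config is not None:
        cmd += ["--deploy-config", deploy_config]
    return cmd


if __name__ == "__main__":
    argv = build_command("img2img", "Let the woman wear a blue dress",
                         image_path="/path/to/image.jpg", think=True)
    print(shlex.join(argv))  # shell-quoted, copy-pasteable line
    # To launch: subprocess.run(argv, check=True, cwd="examples/offline_inference/bagel")
```

Passing the list to subprocess.run avoids shell quoting problems with prompts that contain spaces or quotes.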

FAQ

If you encounter OOM errors, try decreasing max_model_len or gpu_memory_utilization in the deploy YAML.

Two-stage VRAM usage:

Stage	VRAM
Stage 0 (Thinker)	15.04 GiB + KV Cache
Stage 1 (DiT)	26.50 GiB
Total	~42 GiB + KV Cache

Single-stage VRAM usage: The DiT loads the full model (~42 GiB) in one process.
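
The figures above give a rough way to estimate how much memory is left for the KV cache when both stages share one GPU. The sketch below is back-of-the-envelope arithmetic only: headroom_gib is a hypothetical helper, the GPU capacity and budget fraction are caller-supplied assumptions (the example treats an 80 GB card as 80 GiB), and activation memory is not counted.

```python
THINKER_GIB = 15.04  # Stage 0 (Thinker), from the table above
DIT_GIB = 26.50      # Stage 1 (DiT), from the table above


def two_stage_resident_gib() -> float:
    """Combined resident memory of both stages, excluding KV cache."""
    return THINKER_GIB + DIT_GIB


def headroom_gib(gpu_capacity_gib: float, budget_fraction: float) -> float:
    """Memory left under the given budget once both stages are resident."""
    return gpu_capacity_gib * budget_fraction - two_stage_resident_gib()


if __name__ == "__main__":
    print(f"resident: {two_stage_resident_gib():.2f} GiB")
    print(f"headroom at 90% of 80 GiB: {headroom_gib(80.0, 0.9):.2f} GiB")
```

A negative result means the chosen budget cannot hold both stages on one GPU, which points toward splitting the stages across devices.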

Example materials

end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/bagel/end2end.py.