BAGEL-7B-MoT¶
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/bagel.
Installation¶
Please refer to README.md
Architecture¶
BAGEL-7B-MoT is a Mixture-of-Transformers (MoT) model supporting both image generation and understanding. It offers two deployment topologies:
| Topology | Stages | Description |
|---|---|---|
| Two-stage (default) | Stage 0 (Thinker, AR) + Stage 1 (DiT, Diffusion) | Thinker handles text/understanding via vLLM AR engine; DiT handles image generation. KV cache is transferred between stages. |
| Single-stage | Stage 0 (DiT, Diffusion) only | The DiT stage contains a full LLM, ViT, VAE, and tokenizer internally. All modalities are handled within a single diffusion process. |
Both topologies support all four modalities: text2img, img2img, img2text, text2text.
Note: These examples work with the default configuration on an NVIDIA A100 (80GB). We also tested on dual NVIDIA RTX 5000 Ada (32GB each). For dual-GPU setups, modify the deploy YAML to distribute stages across devices.
Launch the Server¶
Two-Stage (Default)¶
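The default pipeline is auto-detected from the model, so no extra flags are needed beyond --omni and --port (that is, the vllm serve command shown below without --deploy-config).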
Or use the convenience script:
cd examples/online_serving/bagel
bash run_server.sh
# Alternatively, launch each stage as its own process in a separate terminal
bash run_server_stage_cli.sh --stage 0
bash run_server_stage_cli.sh --stage 1
To use a custom deploy YAML, pass it via --deploy-config:
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
--deploy-config /path/to/deploy_config.yaml
See bagel.yaml for the default two-stage deploy configuration.
Single-Stage¶
The DiT stage contains a full LLM, ViT, VAE, and tokenizer, so it can handle all modalities (text2img, img2img, img2text, text2text, think) without a separate Thinker stage:
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
--deploy-config vllm_omni/deploy/bagel_single_stage.yaml
See bagel_single_stage.yaml for configuration. The pipeline: bagel_single_stage field selects the single-stage topology from the pipeline registry.
Tensor Parallelism (TP)¶
For larger models or multi-GPU environments, enable TP with the --tensor-parallel-size CLI flag (for example, --tensor-parallel-size 2).
Or set tensor_parallel_size per stage in a custom deploy YAML.
VAE Patch Parallelism¶
VAE Patch Parallelism distributes Bagel VAE decode/encode across multiple GPUs by splitting latent tiles. It lowers per-GPU peak memory during VAE decode, which helps high-resolution text2img / img2img when VAE becomes a bottleneck.
Scope for Bagel:
| Topology | VAE patch parallel |
|---|---|
| Single-stage (DiT only) | Supported on stage 0 (BagelPipeline + DistributedAutoEncoder) |
| Two-stage | Supported on stage 1 (DiT) only; stage 0 (Thinker) uses an encoder-only VAE and is unaffected |
Requirements:
- vae_patch_parallel_size > 1 and a distributed VAE (DistributedAutoEncoder on the DiT pipeline).
- The DiT process group must have at least vae_patch_parallel_size ranks. In practice this means the diffusion stage world_size must be ≥ 2 (commonly tensor_parallel_size=2 on that stage).
- vae_use_tiling must be enabled. If you set vae_patch_parallel_size > 1 and omit tiling, the registry auto-enables vae_use_tiling at startup.
VAE patch parallel reuses the DiT process group (dit_group); it does not create a separate VAE-only worker pool. It is not a substitute for single-GPU VAE tiling (vae_pp=1).
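The requirements above can be summarized as a small pre-flight check. The following is a minimal sketch, not vLLM-Omni code: `VaeParallelSettings` and `resolve_vae_settings` are hypothetical names, and only the field names mirror the deploy YAML keys.

```python
from dataclasses import dataclass


@dataclass
class VaeParallelSettings:
    """Hypothetical container; field names mirror the deploy YAML keys."""

    vae_patch_parallel_size: int = 1
    vae_use_tiling: bool = False


def resolve_vae_settings(
    settings: VaeParallelSettings, dit_world_size: int
) -> VaeParallelSettings:
    """Apply the documented rules; illustrative only."""
    pp = settings.vae_patch_parallel_size
    if pp < 1:
        raise ValueError("vae_patch_parallel_size must be >= 1")
    if pp == 1:
        # Patch parallelism is off; tiling stays as the user configured it.
        return settings
    if dit_world_size < pp:
        raise ValueError(
            f"DiT process group has {dit_world_size} rank(s), "
            f"but vae_patch_parallel_size={pp}"
        )
    # Tiling is required for patch parallelism, so enable it when omitted.
    return VaeParallelSettings(vae_patch_parallel_size=pp, vae_use_tiling=True)


if __name__ == "__main__":
    print(resolve_vae_settings(VaeParallelSettings(2, False), dit_world_size=2))
```

A check like this fails fast on a misconfigured stage (for example, two VAE ranks requested on a single-GPU DiT stage) before any model weights are loaded.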
Online serving (single-stage, 2 GPUs):
CUDA_VISIBLE_DEVICES=0,1 vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
--deploy-config vllm_omni/deploy/bagel_single_stage.yaml \
--tensor-parallel-size 2 \
--vae-patch-parallel-size 2 \
--vae-use-tiling
Online serving (two-stage, VAE PP on DiT stage 1): use a custom deploy YAML, for example:
stages:
  - stage_id: 0
    devices: "0"
    # Thinker (AR): no VAE patch parallel here
  - stage_id: 1
    devices: "0,1"
    vae_use_tiling: true
    parallel_config:
      tensor_parallel_size: 2
      vae_patch_parallel_size: 2
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
--deploy-config /path/to/bagel_vae_pp.yaml
Verify it is active (check server logs at startup):
| CLI flag | Default | Description |
|---|---|---|
| --vae-patch-parallel-size | 1 | Number of DiT ranks used for VAE tile parallelism. Set to 2 or higher to enable. Should be ≤ DiT process group size (typically match --tensor-parallel-size on the diffusion stage). |
| --vae-use-tiling | off | Enable VAE spatial tiling. Required for VAE patch parallel (auto-enabled when vae_patch_parallel_size > 1). |
Hybrid Sharded Data Parallel (HSDP)¶
For larger Bagel deployments on multiple GPUs, you can enable HSDP (Hybrid Sharded Data Parallel) by modifying the stage configuration (for example, bagel.yaml). HSDP shards transformer weights across GPUs to reduce per-GPU memory usage.
- Enable HSDP: Set use_hsdp: true.
- Set shard size: Set hsdp_shard_size to the number of GPUs used for sharding (for example, 4).
- Set replicate size: Usually keep hsdp_replicate_size: 1 unless you want replicated HSDP groups.
- Set devices: Specify the comma-separated GPU IDs used by the diffusion stage (for example, "0,1,2,3").
Example configuration for HSDP across 4 GPUs:
- stage_id: 1
  devices: "0,1,2,3"
  parallel_config:
    use_hsdp: true
    hsdp_shard_size: 4
    hsdp_replicate_size: 1
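A quick consistency check for these fields might look like the sketch below. It assumes that hsdp_shard_size × hsdp_replicate_size must equal the number of GPUs listed in devices; this holds for the 4-GPU example (4 × 1 = 4) but is an assumption about the general rule, and `check_hsdp` is a hypothetical helper.

```python
def check_hsdp(devices: str, hsdp_shard_size: int, hsdp_replicate_size: int = 1) -> int:
    """Return the GPU count if the assumed mesh rule holds, else raise."""
    gpu_ids = [d.strip() for d in devices.split(",") if d.strip()]
    if len(set(gpu_ids)) != len(gpu_ids):
        raise ValueError(f"duplicate GPU id in devices={devices!r}")
    # Assumption: shard groups times replica groups covers every listed GPU.
    expected = hsdp_shard_size * hsdp_replicate_size
    if expected != len(gpu_ids):
        raise ValueError(
            f"{hsdp_shard_size} x {hsdp_replicate_size} = {expected}, "
            f"but devices lists {len(gpu_ids)} GPU(s)"
        )
    return len(gpu_ids)


if __name__ == "__main__":
    print(check_hsdp("0,1,2,3", hsdp_shard_size=4, hsdp_replicate_size=1))
```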
Multi-Node Deployment¶
Deploy each stage on a separate node for better resource utilization. Replace <ORCHESTRATOR_IP> with the actual IP address of your orchestrator node.
1. Launch Stage 0 (Thinker / Orchestrator) on the orchestrator node:
# API server port for client requests: 8000
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
--port 8000 \
--stage-id 0 \
--omni-master-address <ORCHESTRATOR_IP> \
--omni-master-port 8091
2. Launch Stage 1 (DiT) on the remote node in headless mode:
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
--stage-id 1 \
--headless \
--omni-master-address <ORCHESTRATOR_IP> \
--omni-master-port 8091
Or use the convenience script:
# Terminal 1: Stage 0
bash run_server_stage_cli.sh --stage 0
# Terminal 2: Stage 1
bash run_server_stage_cli.sh --stage 1
# With extra args
bash run_server_stage_cli.sh --stage 0 -- --tensor-parallel-size 2
bash run_server_stage_cli.sh --stage 1 -- --gpu-memory-utilization 0.9
vllm serve arguments:
| Argument | Description |
|---|---|
| --stage-id | Which stage this process runs (0 = Thinker, 1 = DiT) |
| --headless | Run without the API server (worker-only mode) |
| -oma / --omni-master-address | Orchestrator master address |
| -omp / --omni-master-port | Orchestrator master port |
Important (startup order): Stage 0 (orchestrator) must be launched before Stage 1 (headless). Stage 0 will appear to hang on startup until Stage 1 (worker) connects; this is expected behavior.
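An automation script can enforce this ordering by waiting until the orchestrator port accepts connections before starting Stage 1. The sketch below assumes Stage 0 opens a plain TCP listener on the master port, which the text above does not confirm; `wait_for_port` is a hypothetical helper.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # The connection is closed immediately; only reachability matters.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # Assumed values: local orchestrator with master port 8091.
    ready = wait_for_port("127.0.0.1", 8091, timeout=1.0, interval=0.2)
    print("Stage 0 reachable" if ready else "Stage 0 not reachable yet")
```

Polling of this kind is an alternative to a fixed sleep between the two launches; whether a raw TCP probe is appropriate depends on the transport the orchestrator actually uses.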
Inter-Stage Connectors¶
When deploying stages across nodes, configure the connector type in the deploy YAML:
- SharedMemoryConnector (default): Used for single-node deployments. No explicit configuration needed.
- MooncakeTransferEngineConnector: For multi-node setups with RDMA hardware. Defined in bagel.yaml under connectors.rdma_connector.
To use Mooncake, create a custom deploy YAML that binds output_connectors / input_connectors on each stage to the rdma_connector defined in the connectors section.
Send Requests¶
Text to Image (text2img)¶
Python client:
python openai_chat_client.py \
--prompt "A beautiful sunset over mountains" \
--modality text2img \
--output sunset.png \
--steps 50
curl:
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": [{"type": "text", "text": "<|im_start|>A beautiful sunset over mountains<|im_end|>"}]}],
"modalities": ["image"],
"height": 512,
"width": 512,
"num_inference_steps": 50,
"seed": 42
}'
Image to Image (img2img)¶
Python client:
python openai_chat_client.py \
--prompt "Make the cat stand up" \
--modality img2img \
--image-url /path/to/input.jpg \
--output transformed.png
curl:
IMAGE_BASE64=$(base64 -w 0 cat.jpg)
cat <<EOF > payload.json
{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "<|im_start|>Make the cat stand up<|im_end|>"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMAGE_BASE64}"}}
]
}],
"modalities": ["image"],
"height": 512,
"width": 512,
"num_inference_steps": 50,
"seed": 42
}
EOF
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d @payload.json
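The base64 step in the curl example can also be done in Python. This is a minimal sketch with a hypothetical helper, `image_to_data_url`; it guesses the MIME type from the file extension and falls back to image/jpeg, matching the curl payload.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image as a data URL for an image_url content item."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        mime = "image/jpeg"  # fallback, same type the curl example uses
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        # Print only a prefix; full data URLs are long.
        print(image_to_data_url(sys.argv[1])[:60] + "...")
```

The returned string goes in the `url` field of the `image_url` content item, in place of the `${IMAGE_BASE64}` shell substitution.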
Image to Text (img2text)¶
Python client:
python openai_chat_client.py \
--prompt "Describe this image in detail" \
--modality img2text \
--image-url /path/to/image.jpg
curl:
IMAGE_BASE64=$(base64 -w 0 cat.jpg)
cat <<EOF > payload.json
{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "<|im_start|>user\n<|image_pad|>\nDescribe this image in detail<|im_end|>\n<|im_start|>assistant\n"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMAGE_BASE64}"}}
]
}],
"modalities": ["text"]
}
EOF
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d @payload.json
Text to Text (text2text)¶
Python client:
python openai_chat_client.py \
--prompt "What is the capital of France?" \
--modality text2text
curl:
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": [{"type": "text", "text": "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"}]}],
"modalities": ["text"]
}'
Python Client Arguments¶
| Argument | Default | Description |
|---|---|---|
| --prompt / -p | A cute cat | Text prompt |
| --output / -o | bagel_output.png | Output file path |
| --server / -s | http://localhost:8091 | Server URL |
| --image-url / -i | None | Input image URL or local path (img2img/img2text) |
| --modality / -m | text2img | text2img, img2img, img2text, text2text |
| --height | 512 | Image height in pixels |
| --width | 512 | Image width in pixels |
| --steps | 25 | Number of inference steps |
| --seed | 42 | Random seed |
| --negative | None | Negative prompt for CFG |
Example with custom parameters:
python openai_chat_client.py \
--prompt "A futuristic city" \
--modality text2img \
--height 768 \
--width 768 \
--steps 50 \
--seed 42 \
--negative "blurry, low quality"
Configuration Reference¶
Deploy YAML Files¶
| File | Description |
|---|---|
| bagel.yaml | Two-stage default (Thinker + DiT on GPU 0) |
| bagel_single_stage.yaml | Single-stage (DiT only) |
Key Deploy YAML Fields¶
| Field | Scope | Description |
|---|---|---|
| pipeline | top-level | Override auto-detected pipeline (e.g. bagel_single_stage) |
| stages[].stage_id | per-stage | Stage identifier (0, 1, ...) |
| stages[].devices | per-stage | GPU device IDs (e.g. "0", "0,1") |
| stages[].max_num_seqs | per-stage | Maximum concurrent sequences |
| stages[].gpu_memory_utilization | per-stage | Fraction of GPU memory to use |
| stages[].enforce_eager | per-stage | Disable CUDA graphs |
| stages[].tensor_parallel_size | per-stage | TP degree for this stage |
| stages[].parallel_config.vae_patch_parallel_size | per-stage (DiT) | VAE tile parallelism degree (DiT stage only) |
| stages[].vae_use_tiling | per-stage (DiT) | Enable VAE tiling (required for VAE patch parallel) |
| connectors | top-level | Define available connector instances (SHM, Mooncake) |
| platforms | top-level | Platform-specific overrides (e.g. xpu) |
FAQ¶
- If you encounter OOM errors, try decreasing max_model_len or gpu_memory_utilization in the deploy YAML.
Two-stage VRAM usage:
| Stage | VRAM |
|---|---|
| Stage 0 (Thinker) | 15.04 GiB + KV Cache |
| Stage 1 (DiT) | 26.50 GiB |
| Total | ~42 GiB + KV Cache |
Single-stage VRAM usage: The DiT loads the full model (~42 GiB) in one process.
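The figures above allow a rough placement check before launching. The following is a back-of-the-envelope sketch: the KV cache size and the usable capacity per GPU are caller-supplied assumptions (the table does not give them), and `per_gpu_usage` and `fits` are hypothetical helpers.

```python
# Per-stage footprints from the two-stage VRAM table, in GiB.
STAGE_VRAM_GIB = {0: 15.04, 1: 26.50}


def per_gpu_usage(placement: dict[int, int], kv_cache_gib: float = 0.0) -> dict[int, float]:
    """Sum stage footprints per GPU; placement maps stage_id -> GPU index."""
    usage: dict[int, float] = {}
    for stage_id, gpu in placement.items():
        need = STAGE_VRAM_GIB[stage_id]
        if stage_id == 0:
            need += kv_cache_gib  # only the Thinker row lists a KV cache
        usage[gpu] = usage.get(gpu, 0.0) + need
    return usage


def fits(placement: dict[int, int], gpu_capacity_gib: float, kv_cache_gib: float = 0.0) -> bool:
    """True if every GPU stays within the given usable capacity."""
    usage = per_gpu_usage(placement, kv_cache_gib)
    return all(used <= gpu_capacity_gib for used in usage.values())


if __name__ == "__main__":
    # Both stages on GPU 0, no KV cache counted.
    print(per_gpu_usage({0: 0, 1: 0}))
    # One stage per 32 GiB GPU with an assumed 4 GiB KV cache.
    print(fits({0: 0, 1: 1}, gpu_capacity_gib=32, kv_cache_gib=4))
    # Both stages on a single 32 GiB GPU.
    print(fits({0: 0, 1: 0}, gpu_capacity_gib=32))
```

The check ignores activation memory and framework overhead, so a passing result is a necessary condition rather than a guarantee.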
Example materials¶
openai_chat_client.py
#!/usr/bin/env python3
"""
Bagel OpenAI-compatible chat client for image generation and multimodal tasks.

Usage:
    python openai_chat_client.py --prompt "A cute cat" --output output.png
    python openai_chat_client.py --prompt "Describe this image" --image-url https://example.com/image.png
"""

import argparse
import base64
from pathlib import Path

import requests


def generate_image(
    prompt: str,
    server_url: str = "http://localhost:8091",
    image_url: str | None = None,
    height: int | None = None,
    width: int | None = None,
    steps: int | None = None,
    seed: int | None = None,
    negative_prompt: str | None = None,
    modality: str = "text2img",  # "text2img" (default), "img2img", "img2text", "text2text"
) -> bytes | str | None:
    """Generate an image or text using the chat completions API.

    Args:
        prompt: Text description or prompt
        server_url: Server URL
        image_url: URL or path to input image (for img2img/img2text)
        height: Image height in pixels
        width: Image width in pixels
        steps: Number of inference steps
        seed: Random seed
        negative_prompt: Negative prompt
        modality: Task modality hint

    Returns:
        Image bytes (for image outputs) or Text string (for text outputs) or None if failed
    """
    # Construct Message Content
    content = [{"type": "text", "text": f"<|im_start|>{prompt}<|im_end|>"}]
    if image_url:
        # Check if local file
        if Path(image_url).exists():
            with open(image_url, "rb") as f:
                b64_data = base64.b64encode(f.read()).decode("utf-8")
            final_image_url = f"data:image/jpeg;base64,{b64_data}"
        else:
            final_image_url = image_url
        content.append({"type": "image_url", "image_url": {"url": final_image_url}})
    messages = [{"role": "user", "content": content}]

    # Build request payload with all parameters at top level
    # Note: vLLM ignores "extra_body", so we put parameters directly in the payload
    payload = {"messages": messages}

    # Set output modalities at top level
    if modality == "text2img" or modality == "img2img":
        payload["modalities"] = ["image"]
    elif modality == "img2text" or modality == "text2text":
        payload["modalities"] = ["text"]

    # Add generation parameters directly to payload
    if height is not None:
        payload["height"] = height
    if width is not None:
        payload["width"] = width
    if steps is not None:
        payload["num_inference_steps"] = steps
    if seed is not None:
        payload["seed"] = seed
    if negative_prompt:
        payload["negative_prompt"] = negative_prompt

    # Send request
    try:
        print(f"Sending request to {server_url} with modality {modality}...")
        response = requests.post(
            f"{server_url}/v1/chat/completions",
            headers={"Content-Type": "application/json"},
            json=payload,
            timeout=300,
        )
        response.raise_for_status()
        data = response.json()

        # Extract content - check ALL choices since server may return multiple
        # (e.g., text in choices[0], image in choices[1])
        choices = data.get("choices", [])

        # First pass: look for image output in any choice
        for choice in choices:
            choice_content = choice.get("message", {}).get("content")
            # Handle Image Output
            if isinstance(choice_content, list) and len(choice_content) > 0:
                first_item = choice_content[0]
                if isinstance(first_item, dict) and "image_url" in first_item:
                    img_url_str = first_item["image_url"].get("url", "")
                    if img_url_str.startswith("data:image"):
                        _, b64_data = img_url_str.split(",", 1)
                        return base64.b64decode(b64_data)

        # Second pass: look for text output if no image found
        for choice in choices:
            choice_content = choice.get("message", {}).get("content")
            if isinstance(choice_content, str) and choice_content:
                return choice_content

        print(f"Unexpected response format: {choices}")
        return None
    except Exception as e:
        print(f"Error: {e}")
        return None
def main():
    parser = argparse.ArgumentParser(description="Bagel multimodal chat client")
    parser.add_argument("--prompt", "-p", default="A cute cat", help="Text prompt")
    parser.add_argument("--output", "-o", default="bagel_output.png", help="Output file (for image results)")
    parser.add_argument("--server", "-s", default="http://localhost:8091", help="Server URL")

    # Modality Control
    parser.add_argument("--image-url", "-i", type=str, help="Input image URL or local path")
    parser.add_argument(
        "--modality",
        "-m",
        default="text2img",
        choices=["text2img", "img2img", "img2text", "text2text"],
        help="Task modality",
    )

    # Generation Params
    parser.add_argument("--height", type=int, default=512, help="Image height")
    parser.add_argument("--width", type=int, default=512, help="Image width")
    parser.add_argument("--steps", type=int, default=25, help="Inference steps")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    parser.add_argument("--negative", help="Negative prompt")
    args = parser.parse_args()

    print(f"Mode: {args.modality}")
    if args.image_url:
        print(f"Input Image: {args.image_url}")

    result = generate_image(
        prompt=args.prompt,
        server_url=args.server,
        image_url=args.image_url,
        height=args.height,
        width=args.width,
        steps=args.steps,
        seed=args.seed,
        negative_prompt=args.negative,
        modality=args.modality,
    )

    if result:
        if isinstance(result, bytes):
            # It's an image
            output_path = Path(args.output)
            output_path.write_bytes(result)
            print(f"Image saved to: {output_path}")
            print(f"Size: {len(result) / 1024:.1f} KB")
        elif isinstance(result, str):
            # It's text
            print("Response:")
            print(result)
    else:
        print("Failed to generate response")
        exit(1)


if __name__ == "__main__":
    main()
run_server.sh
run_server_stage_cli.sh
#!/bin/bash
# Bagel multi-stage online serving startup script.
#
# Usage:
# ./run_server_stage_cli.sh --stage 0
# ./run_server_stage_cli.sh --stage 1
# ./run_server_stage_cli.sh --stage 0 -- --tensor-parallel-size 2
# ./run_server_stage_cli.sh --stage 1 -- --gpu-memory-utilization 0.9
#
# By default, `--stage all` keeps the old behavior and launches both stages in
# one session. Use `--stage 0` / `--stage 1` to launch each stage separately in
# different terminal sessions, with stage-specific extra CLI arguments passed
# after `--`.
set -euo pipefail
MODEL="${MODEL:-ByteDance-Seed/BAGEL-7B-MoT}"
PORT="${PORT:-8091}"
MASTER_ADDRESS="${MASTER_ADDRESS:-127.0.0.1}"
MASTER_PORT="${MASTER_PORT:-8092}"
STAGE="all"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DEPLOY_CONFIG="${DEPLOY_CONFIG:-$SCRIPT_DIR/../../../vllm_omni/deploy/bagel.yaml}"
EXTRA_ARGS=()
usage() {
cat <<EOF
Usage: $0 [OPTIONS] [-- EXTRA_VLLM_ARGS...]
Options:
--stage {0|1|all} Stage to launch (default: all)
--model MODEL Model name/path (default: $MODEL)
--port PORT API port for stage 0 (default: $PORT)
--master-address ADDRESS Master/orchestrator address (default: $MASTER_ADDRESS)
--master-port PORT Master/orchestrator port (default: $MASTER_PORT)
--deploy-config PATH Deploy config YAML path (default: $DEPLOY_CONFIG)
--help Show this help message
Examples:
$0 --stage 0
$0 --stage 1
$0 --stage 0 -- --tensor-parallel-size 2
$0 --stage 1 -- --gpu-memory-utilization 0.9
Notes:
- Use different terminal sessions to launch stage 0 and stage 1 separately.
- Extra args after '--' are forwarded only to the selected stage.
- When using '--stage all', the extra args are forwarded to both stages.
EOF
}
while [[ $# -gt 0 ]]; do
  case "$1" in
    --stage)
      STAGE="$2"
      shift 2
      ;;
    --model)
      MODEL="$2"
      shift 2
      ;;
    --port)
      PORT="$2"
      shift 2
      ;;
    --master-address)
      MASTER_ADDRESS="$2"
      shift 2
      ;;
    --master-port)
      MASTER_PORT="$2"
      shift 2
      ;;
    --deploy-config)
      DEPLOY_CONFIG="$2"
      shift 2
      ;;
    --help|-h)
      usage
      exit 0
      ;;
    --)
      shift
      EXTRA_ARGS=("$@")
      break
      ;;
    *)
      echo "Unknown option: $1" >&2
      usage
      exit 1
      ;;
  esac
done

if [[ "$STAGE" != "0" && "$STAGE" != "1" && "$STAGE" != "all" ]]; then
  echo "Invalid --stage value: $STAGE" >&2
  usage
  exit 1
fi
print_config() {
  echo "Model: $MODEL"
  echo "API Port: $PORT"
  echo "Master Address: $MASTER_ADDRESS"
  echo "Master Port: $MASTER_PORT"
  echo "Deploy Config: $DEPLOY_CONFIG"
  echo "Selected Stage: $STAGE"
  if [[ ${#EXTRA_ARGS[@]} -gt 0 ]]; then
    echo "Extra Args: ${EXTRA_ARGS[*]}"
  fi
}

run_stage_0() {
  echo "Starting Stage 0 (Thinker) as master..."
  vllm serve "$MODEL" --omni \
    --port "$PORT" \
    --deploy-config "$DEPLOY_CONFIG" \
    --stage-id 0 \
    --omni-master-address "$MASTER_ADDRESS" \
    --omni-master-port "$MASTER_PORT" \
    "${EXTRA_ARGS[@]}"
}

run_stage_1() {
  echo "Starting Stage 1 (DiT) in headless mode..."
  vllm serve "$MODEL" --omni \
    --deploy-config "$DEPLOY_CONFIG" \
    --stage-id 1 \
    --headless \
    --omni-master-address "$MASTER_ADDRESS" \
    --omni-master-port "$MASTER_PORT" \
    "${EXTRA_ARGS[@]}"
}
echo "Starting Bagel multi-stage server..."
print_config
case "$STAGE" in
  0)
    run_stage_0
    ;;
  1)
    run_stage_1
    ;;
  all)
    echo "Launching both stages in one session (legacy mode)..."
    echo "Starting Stage 0 (Thinker) in background first..."
    run_stage_0 &
    STAGE_0_PID=$!
    cleanup() {
      if [[ -n "${STAGE_0_PID:-}" ]]; then
        kill "$STAGE_0_PID" 2>/dev/null || true
        wait "$STAGE_0_PID" 2>/dev/null || true
      fi
    }
    trap cleanup EXIT INT TERM
    echo "Waiting briefly for Stage 0 to initialize..."
    sleep 2
    run_stage_1
    ;;
esac