BAGEL-7B-MoT

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/bagel

Installation

Please refer to README.md.

Architecture

BAGEL-7B-MoT is a Mixture-of-Transformers (MoT) model supporting both image generation and understanding. It offers two deployment topologies:

| Topology | Stages | Description |
| --- | --- | --- |
| Two-stage (default) | Stage 0 (Thinker, AR) + Stage 1 (DiT, Diffusion) | Thinker handles text/understanding via vLLM AR engine; DiT handles image generation. KV cache is transferred between stages. |
| Single-stage | Stage 0 (DiT, Diffusion) only | The DiT stage contains a full LLM, ViT, VAE, and tokenizer internally. All modalities are handled within a single diffusion process. |

Both topologies support all four modalities: text2img, img2img, img2text, text2text.

Note: These examples work with the default configuration on an NVIDIA A100 (80GB). We also tested on dual NVIDIA RTX 5000 Ada (32GB each). For dual-GPU setups, modify the deploy YAML to distribute stages across devices.

Launch the Server

Two-Stage (Default)

The default pipeline is auto-detected from the model, so no extra flags are needed:

vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091

Or use the convenience script:

cd examples/online_serving/bagel
bash run_server.sh

# Alternatively, launch each stage in its own terminal session
bash run_server_stage_cli.sh --stage 0
bash run_server_stage_cli.sh --stage 1

To use a custom deploy YAML, pass it via --deploy-config:

vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
    --deploy-config /path/to/deploy_config.yaml

See bagel.yaml for the default two-stage deploy configuration.

Single-Stage

The DiT stage contains a full LLM, ViT, VAE, and tokenizer, so it can handle all modalities (text2img, img2img, img2text, text2text, think) without a separate Thinker stage:

vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
    --deploy-config vllm_omni/deploy/bagel_single_stage.yaml

See bagel_single_stage.yaml for the configuration. Its `pipeline: bagel_single_stage` field selects the single-stage topology from the pipeline registry.

Tensor Parallelism (TP)

For larger models or multi-GPU environments, enable TP via CLI:

vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --tensor-parallel-size 2

Or set tensor_parallel_size per stage in a custom deploy YAML.
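
If you script server launches, the flags shown in this section can be assembled programmatically. The following is a minimal Python sketch and not part of the repository: `build_serve_command` is a hypothetical helper, and it only uses flags that appear in the commands above.

```python
from __future__ import annotations

import shlex


def build_serve_command(
    model: str = "ByteDance-Seed/BAGEL-7B-MoT",
    port: int = 8091,
    deploy_config: str | None = None,
    tensor_parallel_size: int | None = None,
) -> list[str]:
    """Assemble the argv for `vllm serve` from the flags shown above."""
    argv = ["vllm", "serve", model, "--omni", "--port", str(port)]
    if deploy_config is not None:
        argv += ["--deploy-config", deploy_config]
    if tensor_parallel_size is not None:
        if tensor_parallel_size < 1:
            raise ValueError("tensor_parallel_size must be >= 1")
        argv += ["--tensor-parallel-size", str(tensor_parallel_size)]
    return argv


cmd = build_serve_command(tensor_parallel_size=2)
# Prints: vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --tensor-parallel-size 2
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the server; building a list rather than a shell string avoids quoting problems with config paths.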

VAE Patch Parallelism

VAE Patch Parallelism distributes Bagel VAE decode/encode across multiple GPUs by splitting latent tiles. It lowers per-GPU peak memory during VAE decode, which helps high-resolution text2img / img2img when the VAE becomes a bottleneck.

Scope for Bagel:

| Topology | VAE patch parallel |
| --- | --- |
| Single-stage (DiT only) | Supported on stage 0 (`BagelPipeline` + `DistributedAutoEncoder`) |
| Two-stage | Supported on stage 1 (DiT) only; stage 0 (Thinker) uses encoder-only VAE and is unrelated |

Requirements:

vae_patch_parallel_size > 1 and a distributed VAE (DistributedAutoEncoder on the DiT pipeline).
The DiT process group must have at least vae_patch_parallel_size ranks. In practice this means the diffusion stage world_size must be ≥ 2 (commonly tensor_parallel_size=2 on that stage).
vae_use_tiling must be enabled. If you set vae_patch_parallel_size > 1 and omit tiling, the registry auto-enables vae_use_tiling at startup.

VAE patch parallel reuses the DiT process group (dit_group); it does not create a separate VAE-only worker pool. It does not replace single-GPU VAE tiling (vae_patch_parallel_size=1).
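
The requirements above can be expressed as a small pre-flight check. This is a sketch under stated assumptions: `DiTStageSettings` and `resolve_vae_patch_parallel` are hypothetical names, not vLLM-Omni APIs, and the auto-enable branch only mirrors the behavior described in this section.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DiTStageSettings:
    """Hypothetical container; field names mirror the options named above."""

    world_size: int  # number of ranks in the DiT process group
    vae_patch_parallel_size: int = 1
    vae_use_tiling: bool = False


def resolve_vae_patch_parallel(s: DiTStageSettings) -> DiTStageSettings:
    """Validate the settings and return them with tiling resolved."""
    if s.vae_patch_parallel_size < 1:
        raise ValueError("vae_patch_parallel_size must be >= 1")
    if s.vae_patch_parallel_size == 1:
        return s  # patch parallelism is off; nothing else to check
    if s.world_size < s.vae_patch_parallel_size:
        raise ValueError(
            f"DiT group has {s.world_size} rank(s) but "
            f"vae_patch_parallel_size={s.vae_patch_parallel_size}"
        )
    # Tiling is required; mirror the documented auto-enable behavior.
    return replace(s, vae_use_tiling=True)


resolved = resolve_vae_patch_parallel(
    DiTStageSettings(world_size=2, vae_patch_parallel_size=2)
)
print(resolved)
```

Running a check like this before launch surfaces a rank mismatch immediately instead of after the model has loaded.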

Online serving (single-stage, 2 GPUs):

CUDA_VISIBLE_DEVICES=0,1 vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
    --deploy-config vllm_omni/deploy/bagel_single_stage.yaml \
    --tensor-parallel-size 2 \
    --vae-patch-parallel-size 2 \
    --vae-use-tiling

Online serving (two-stage, VAE PP on DiT stage 1): use a custom deploy YAML, for example:

stages:
  - stage_id: 0
    devices: "0"
    # Thinker (AR) — no VAE patch parallel here

  - stage_id: 1
    devices: "0,1"
    vae_use_tiling: true
    parallel_config:
      tensor_parallel_size: 2
      vae_patch_parallel_size: 2

vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
    --deploy-config /path/to/bagel_vae_pp.yaml

Verify it is active (check server logs at startup):

INFO ... vae_patch_parallel_size=2 requires vae_use_tiling; automatically enabling it.

| CLI flag | Default | Description |
| --- | --- | --- |
| `--vae-patch-parallel-size` | `1` | Number of DiT ranks used for VAE tile parallelism. Set to `2` or higher to enable. Should be ≤ DiT process group size (typically match `--tensor-parallel-size` on the diffusion stage). |
| `--vae-use-tiling` | off | Enable VAE spatial tiling. Required for VAE patch parallel (auto-enabled when `vae_patch_parallel_size > 1`). |

Hybrid Sharded Data Parallel (HSDP)

For larger Bagel deployments on multiple GPUs, you can enable HSDP (Hybrid Sharded Data Parallel) by modifying the stage configuration (for example, bagel.yaml). HSDP shards transformer weights across GPUs to reduce per-GPU memory usage.

Enable HSDP: Set use_hsdp: true.
Set shard size: Set hsdp_shard_size to the number of GPUs used for sharding (for example, 4).
Set replicate size: Usually keep hsdp_replicate_size: 1 unless you want replicated HSDP groups.
Set devices: Specify the comma-separated GPU IDs used by the diffusion stage (for example, "0,1,2,3").

Example configuration for HSDP across 4 GPUs:

  - stage_id: 1
    devices: "0,1,2,3"
    parallel_config:
      use_hsdp: true
      hsdp_shard_size: 4
      hsdp_replicate_size: 1
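
A quick sanity check on these values can catch mismatches before launch. The sketch below assumes that the number of listed devices should equal `hsdp_shard_size * hsdp_replicate_size`. That matches the 4-GPU example, but it is an assumption about how the groups are laid out, not something this page states. `check_hsdp_layout` and `parse_devices` are hypothetical helpers.

```python
from __future__ import annotations


def parse_devices(devices: str) -> list[int]:
    """Turn a deploy-YAML style string such as "0,1,2,3" into GPU IDs."""
    return [int(d) for d in devices.split(",") if d.strip()]


def check_hsdp_layout(devices: str, shard_size: int, replicate_size: int = 1) -> int:
    """Return the GPU count if it matches shard_size * replicate_size.

    Assumption: every listed device belongs to exactly one shard group.
    """
    gpu_ids = parse_devices(devices)
    if len(set(gpu_ids)) != len(gpu_ids):
        raise ValueError(f"duplicate device IDs in {devices!r}")
    expected = shard_size * replicate_size
    if len(gpu_ids) != expected:
        raise ValueError(
            f"{len(gpu_ids)} device(s) listed, but shard_size * replicate_size = {expected}"
        )
    return len(gpu_ids)


print(check_hsdp_layout("0,1,2,3", shard_size=4, replicate_size=1))  # 4
```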

Multi-Node Deployment

Deploy each stage on a separate node for better resource utilization. Replace <ORCHESTRATOR_IP> with the actual IP address of your orchestrator node.

1. Launch Stage 0 (Thinker / Orchestrator) on the orchestrator node:

# API server port for client requests: 8000
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
    --port 8000 \
    --stage-id 0 \
    --omni-master-address <ORCHESTRATOR_IP> \
    --omni-master-port 8091

2. Launch Stage 1 (DiT) on the remote node in headless mode:

vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
    --stage-id 1 \
    --headless \
    --omni-master-address <ORCHESTRATOR_IP> \
    --omni-master-port 8091

Or use the convenience script:

# Terminal 1: Stage 0
bash run_server_stage_cli.sh --stage 0

# Terminal 2: Stage 1
bash run_server_stage_cli.sh --stage 1

# With extra args
bash run_server_stage_cli.sh --stage 0 -- --tensor-parallel-size 2
bash run_server_stage_cli.sh --stage 1 -- --gpu-memory-utilization 0.9

vllm serve arguments:

| Argument | Description |
| --- | --- |
| `--stage-id` | Which stage this process runs (0 = Thinker, 1 = DiT) |
| `--headless` | Run without the API server (worker-only mode) |
| `-oma` / `--omni-master-address` | Orchestrator master address |
| `-omp` / `--omni-master-port` | Orchestrator master port |

Important: Startup order. Stage 0 (orchestrator) must be launched before Stage 1 (headless). Stage 0 will appear to hang on startup until the Stage 1 worker connects; this is expected behavior.
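
Because Stage 0 blocks until Stage 1 connects, a client script may want to wait before sending its first request. The following is a minimal sketch using only the Python standard library. `wait_for_port` is a hypothetical helper, and it assumes that the API port starts accepting TCP connections only once the server is up; it does not confirm that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            time.sleep(min(interval_s, remaining))
```

For the multi-node example above, the client would call `wait_for_port("<ORCHESTRATOR_IP>", 8000)` with the real orchestrator address and proceed only if it returns `True`.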

Inter-Stage Connectors

When deploying stages across nodes, configure the connector type in the deploy YAML:

SharedMemoryConnector (default): Used for single-node deployments. No explicit configuration needed.
MooncakeTransferEngineConnector: For multi-node setups with RDMA hardware. Defined in bagel.yaml under connectors.rdma_connector.

To use Mooncake, create a custom deploy YAML that binds output_connectors / input_connectors on each stage to the rdma_connector defined in the connectors section.

Send Requests

cd examples/online_serving/bagel

Text to Image (text2img)

Python client:

python openai_chat_client.py \
    --prompt "A beautiful sunset over mountains" \
    --modality text2img \
    --output sunset.png \
    --steps 50

curl:

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": [{"type": "text", "text": "<|im_start|>A beautiful sunset over mountains<|im_end|>"}]}],
    "modalities": ["image"],
    "height": 512,
    "width": 512,
    "num_inference_steps": 50,
    "seed": 42
  }'
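
curl prints the raw JSON response, so the image still has to be decoded. The sketch below assumes the response layout that the bundled `openai_chat_client.py` expects: a `choices[].message.content` list whose items carry an `image_url` data URL. `extract_image_bytes` is a hypothetical helper, and `sample` is a stand-in response for illustration only.

```python
from __future__ import annotations

import base64


def extract_image_bytes(response: dict) -> bytes | None:
    """Return the first base64 data-URL image found in any choice, else None."""
    for choice in response.get("choices", []):
        content = choice.get("message", {}).get("content")
        if not isinstance(content, list):
            continue  # plain-text choices carry a string, not a list
        for item in content:
            image_url = item.get("image_url") if isinstance(item, dict) else None
            if not isinstance(image_url, dict):
                continue
            url = image_url.get("url", "")
            if url.startswith("data:image") and "," in url:
                return base64.b64decode(url.split(",", 1)[1])
    return None


# Stand-in response; a real one comes from the server.
sample = {
    "choices": [
        {"message": {"content": [
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64," + base64.b64encode(b"fake").decode()}}
        ]}}
    ]
}
print(extract_image_bytes(sample))  # b'fake'
```

With a real server, save the curl output to a file, load it with `json.load`, and write the returned bytes to a `.png` file.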

Image to Image (img2img)

Python client:

python openai_chat_client.py \
    --prompt "Make the cat stand up" \
    --modality img2img \
    --image-url /path/to/input.jpg \
    --output transformed.png

curl:

IMAGE_BASE64=$(base64 -w 0 cat.jpg)

cat <<EOF > payload.json
{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "<|im_start|>Make the cat stand up<|im_end|>"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMAGE_BASE64}"}}
      ]
    }],
    "modalities": ["image"],
    "height": 512,
    "width": 512,
    "num_inference_steps": 50,
    "seed": 42
}
EOF

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @payload.json
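
The base64 step can also be done in Python. This is a minimal sketch with hypothetical helper names (`bytes_to_data_url`, `file_to_data_url`). Unlike the curl example, which hard-codes `image/jpeg`, it guesses the MIME type from the file name and falls back to `image/jpeg` only when the guess fails; whether the server inspects the declared MIME type is not stated here.

```python
from __future__ import annotations

import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URL, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        mime = "image/jpeg"  # same value the curl example hard-codes
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def file_to_data_url(path: str) -> str:
    """Read a local image file and return it as a data URL."""
    p = Path(path)
    return bytes_to_data_url(p.read_bytes(), p.name)


print(bytes_to_data_url(b"abc", "cat.png"))  # data:image/png;base64,YWJj
```

The resulting string goes into the `image_url.url` field of the request, in place of the `${IMAGE_BASE64}` expansion used above.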

Image to Text (img2text)

Python client:

python openai_chat_client.py \
    --prompt "Describe this image in detail" \
    --modality img2text \
    --image-url /path/to/image.jpg

curl:

IMAGE_BASE64=$(base64 -w 0 cat.jpg)

cat <<EOF > payload.json
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "<|im_start|>user\n<|image_pad|>\nDescribe this image in detail<|im_end|>\n<|im_start|>assistant\n"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMAGE_BASE64}"}}
    ]
  }],
  "modalities": ["text"]
}
EOF

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @payload.json

Text to Text (text2text)

Python client:

python openai_chat_client.py \
    --prompt "What is the capital of France?" \
    --modality text2text

curl:

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": [{"type": "text", "text": "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"}]}],
    "modalities": ["text"]
  }'
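
The four curl requests above differ only in the prompt template, the presence of an image, and the output modality. The sketch below gathers those differences in one place. `build_payload` is a hypothetical helper that reproduces the payloads from the curl examples; it is not the bundled client, which wraps every prompt in the same way.

```python
from __future__ import annotations

import json

IMAGE_OUTPUT = {"text2img", "img2img"}  # requests that ask for an image back
NEEDS_IMAGE = {"img2img", "img2text"}   # requests that send an image


def build_payload(prompt: str, modality: str, image_data_url: str | None = None, **gen_params) -> dict:
    """Build a /v1/chat/completions payload matching the curl examples above."""
    if modality not in IMAGE_OUTPUT | {"img2text", "text2text"}:
        raise ValueError(f"unknown modality: {modality}")
    if modality in NEEDS_IMAGE and not image_data_url:
        raise ValueError(f"{modality} requires image_data_url")

    if modality in IMAGE_OUTPUT:
        text = f"<|im_start|>{prompt}<|im_end|>"
    elif modality == "img2text":
        text = f"<|im_start|>user\n<|image_pad|>\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    else:  # text2text
        text = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

    content = [{"type": "text", "text": text}]
    if modality in NEEDS_IMAGE:
        content.append({"type": "image_url", "image_url": {"url": image_data_url}})

    payload = {
        "messages": [{"role": "user", "content": content}],
        "modalities": ["image"] if modality in IMAGE_OUTPUT else ["text"],
    }
    if modality in IMAGE_OUTPUT:
        # e.g. height, width, num_inference_steps, seed
        payload.update(gen_params)
    return payload


print(json.dumps(build_payload("What is the capital of France?", "text2text"), indent=2))
```

The returned dictionary can be posted with `requests.post(url, json=payload)`, as the bundled client does.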

Python Client Arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--prompt` / `-p` | `A cute cat` | Text prompt |
| `--output` / `-o` | `bagel_output.png` | Output file path |
| `--server` / `-s` | `http://localhost:8091` | Server URL |
| `--image-url` / `-i` | `None` | Input image URL or local path (img2img/img2text) |
| `--modality` / `-m` | `text2img` | `text2img`, `img2img`, `img2text`, `text2text` |
| `--height` | `512` | Image height in pixels |
| `--width` | `512` | Image width in pixels |
| `--steps` | `25` | Number of inference steps |
| `--seed` | `42` | Random seed |
| `--negative` | `None` | Negative prompt for CFG |

Example with custom parameters:

python openai_chat_client.py \
    --prompt "A futuristic city" \
    --modality text2img \
    --height 768 \
    --width 768 \
    --steps 50 \
    --seed 42 \
    --negative "blurry, low quality"

Configuration Reference

Deploy YAML Files

| File | Description |
| --- | --- |
| `bagel.yaml` | Two-stage default (Thinker + DiT on GPU 0) |
| `bagel_single_stage.yaml` | Single-stage (DiT only) |

Key Deploy YAML Fields

| Field | Scope | Description |
| --- | --- | --- |
| `pipeline` | top-level | Override auto-detected pipeline (e.g. `bagel_single_stage`) |
| `stages[].stage_id` | per-stage | Stage identifier (0, 1, ...) |
| `stages[].devices` | per-stage | GPU device IDs (e.g. `"0"`, `"0,1"`) |
| `stages[].max_num_seqs` | per-stage | Maximum concurrent sequences |
| `stages[].gpu_memory_utilization` | per-stage | Fraction of GPU memory to use |
| `stages[].enforce_eager` | per-stage | Disable CUDA graphs |
| `stages[].tensor_parallel_size` | per-stage | TP degree for this stage |
| `stages[].parallel_config.vae_patch_parallel_size` | per-stage (DiT) | VAE tile parallelism degree (DiT stage only) |
| `stages[].vae_use_tiling` | per-stage (DiT) | Enable VAE tiling (required for VAE patch parallel) |
| `connectors` | top-level | Define available connector instances (SHM, Mooncake) |
| `platforms` | top-level | Platform-specific overrides (e.g. `xpu`) |

FAQ

If you encounter OOM errors, try decreasing max_model_len or gpu_memory_utilization in the deploy YAML.

Two-stage VRAM usage:

| Stage | VRAM |
| --- | --- |
| Stage 0 (Thinker) | 15.04 GiB + KV Cache |
| Stage 1 (DiT) | 26.50 GiB |
| Total | ~42 GiB + KV Cache |

Single-stage VRAM usage: The DiT loads the full model (~42 GiB) in one process.
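
The arithmetic behind these figures, and behind the advice to split stages across devices on smaller cards, can be checked with a few lines of Python. This is a rough sketch: `headroom_gib` is a hypothetical helper, it treats the nominal card size as GiB, and it ignores activations and framework overhead, so real headroom will be lower.

```python
from __future__ import annotations

# Weight footprints reported in the table above, in GiB (KV cache excluded).
STAGE_WEIGHTS_GIB = {"thinker": 15.04, "dit": 26.50}


def headroom_gib(gpu_capacity_gib: float, stages: list[str]) -> float:
    """Memory left on one GPU after loading the given stages' weights."""
    return gpu_capacity_gib - sum(STAGE_WEIGHTS_GIB[s] for s in stages)


# Both stages on one 80 GiB card: room remains for KV cache and activations.
print(round(headroom_gib(80, ["thinker", "dit"]), 2))  # 38.46
# Both stages on one 32 GiB card: negative, so the stages must be split.
print(round(headroom_gib(32, ["thinker", "dit"]), 2))  # -9.54
# DiT alone on a 32 GiB card.
print(round(headroom_gib(32, ["dit"]), 2))  # 5.5
```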

Example materials

openai_chat_client.py

#!/usr/bin/env python3
"""
Bagel OpenAI-compatible chat client for image generation and multimodal tasks.

Usage:
    python openai_chat_client.py --prompt "A cute cat" --output output.png
    python openai_chat_client.py --prompt "Describe this image" --image-url https://example.com/image.png
"""

import argparse
import base64
from pathlib import Path

import requests


def generate_image(
    prompt: str,
    server_url: str = "http://localhost:8091",
    image_url: str | None = None,
    height: int | None = None,
    width: int | None = None,
    steps: int | None = None,
    seed: int | None = None,
    negative_prompt: str | None = None,
    modality: str = "text2img",  # "text2img" (default), "img2img", "img2text", "text2text"
) -> bytes | str | None:
    """Generate an image or text using the chat completions API.

    Args:
        prompt: Text description or prompt
        server_url: Server URL
        image_url: URL or path to input image (for img2img/img2text)
        height: Image height in pixels
        width: Image width in pixels
        steps: Number of inference steps
        seed: Random seed
        negative_prompt: Negative prompt
        modality: Task modality hint

    Returns:
        Image bytes (for image outputs) or Text string (for text outputs) or None if failed
    """

    # Construct Message Content
    content = [{"type": "text", "text": f"<|im_start|>{prompt}<|im_end|>"}]

    if image_url:
        # Check if local file
        if Path(image_url).exists():
            with open(image_url, "rb") as f:
                b64_data = base64.b64encode(f.read()).decode("utf-8")
                final_image_url = f"data:image/jpeg;base64,{b64_data}"
        else:
            final_image_url = image_url

        content.append({"type": "image_url", "image_url": {"url": final_image_url}})

    messages = [{"role": "user", "content": content}]

    # Build request payload with all parameters at top level
    # Note: vLLM ignores "extra_body", so we put parameters directly in the payload
    payload = {"messages": messages}

    # Set output modalities at top level
    if modality == "text2img" or modality == "img2img":
        payload["modalities"] = ["image"]
    elif modality == "img2text" or modality == "text2text":
        payload["modalities"] = ["text"]

    # Add generation parameters directly to payload
    if height is not None:
        payload["height"] = height
    if width is not None:
        payload["width"] = width
    if steps is not None:
        payload["num_inference_steps"] = steps
    if seed is not None:
        payload["seed"] = seed
    if negative_prompt:
        payload["negative_prompt"] = negative_prompt

    # Send request
    try:
        print(f"Sending request to {server_url} with modality {modality}...")
        response = requests.post(
            f"{server_url}/v1/chat/completions",
            headers={"Content-Type": "application/json"},
            json=payload,
            timeout=300,
        )
        response.raise_for_status()
        data = response.json()

        # Extract content - check ALL choices since server may return multiple
        # (e.g., text in choices[0], image in choices[1])
        choices = data.get("choices", [])

        # First pass: look for image output in any choice
        for choice in choices:
            choice_content = choice.get("message", {}).get("content")

            # Handle Image Output
            if isinstance(choice_content, list) and len(choice_content) > 0:
                first_item = choice_content[0]
                if isinstance(first_item, dict) and "image_url" in first_item:
                    img_url_str = first_item["image_url"].get("url", "")
                    if img_url_str.startswith("data:image"):
                        _, b64_data = img_url_str.split(",", 1)
                        return base64.b64decode(b64_data)

        # Second pass: look for text output if no image found
        for choice in choices:
            choice_content = choice.get("message", {}).get("content")
            if isinstance(choice_content, str) and choice_content:
                return choice_content

        print(f"Unexpected response format: {choices}")
        return None

    except Exception as e:
        print(f"Error: {e}")
        return None


def main():
    parser = argparse.ArgumentParser(description="Bagel multimodal chat client")
    parser.add_argument("--prompt", "-p", default="A cute cat", help="Text prompt")
    parser.add_argument("--output", "-o", default="bagel_output.png", help="Output file (for image results)")
    parser.add_argument("--server", "-s", default="http://localhost:8091", help="Server URL")

    # Modality Control
    parser.add_argument("--image-url", "-i", type=str, help="Input image URL or local path")
    parser.add_argument(
        "--modality",
        "-m",
        default="text2img",
        choices=["text2img", "img2img", "img2text", "text2text"],
        help="Task modality",
    )

    # Generation Params
    parser.add_argument("--height", type=int, default=512, help="Image height")
    parser.add_argument("--width", type=int, default=512, help="Image width")
    parser.add_argument("--steps", type=int, default=25, help="Inference steps")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    parser.add_argument("--negative", help="Negative prompt")

    args = parser.parse_args()

    print(f"Mode: {args.modality}")
    if args.image_url:
        print(f"Input Image: {args.image_url}")

    result = generate_image(
        prompt=args.prompt,
        server_url=args.server,
        image_url=args.image_url,
        height=args.height,
        width=args.width,
        steps=args.steps,
        seed=args.seed,
        negative_prompt=args.negative,
        modality=args.modality,
    )

    if result:
        if isinstance(result, bytes):
            # It's an image
            output_path = Path(args.output)
            output_path.write_bytes(result)
            print(f"Image saved to: {output_path}")
            print(f"Size: {len(result) / 1024:.1f} KB")
        elif isinstance(result, str):
            # It's text
            print("Response:")
            print(result)
    else:
        print("Failed to generate response")
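        # The bare exit() helper is provided by the site module and is not
        # guaranteed to exist; raising SystemExit works everywhere.
        raise SystemExit(1)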


if __name__ == "__main__":
    main()

run_server.sh

#!/bin/bash
# Bagel online serving startup script

MODEL="${MODEL:-ByteDance-Seed/BAGEL-7B-MoT}"
PORT="${PORT:-8091}"

echo "Starting Bagel server..."
echo "Model: $MODEL"
echo "Port: $PORT"

vllm serve "$MODEL" --omni \
    --port "$PORT"

run_server_stage_cli.sh

#!/bin/bash
# Bagel multi-stage online serving startup script.
#
# Usage:
#   ./run_server_stage_cli.sh --stage 0
#   ./run_server_stage_cli.sh --stage 1
#   ./run_server_stage_cli.sh --stage 0 -- --tensor-parallel-size 2
#   ./run_server_stage_cli.sh --stage 1 -- --gpu-memory-utilization 0.9
#
# By default, `--stage all` keeps the old behavior and launches both stages in
# one session. Use `--stage 0` / `--stage 1` to launch each stage separately in
# different terminal sessions, with stage-specific extra CLI arguments passed
# after `--`.

set -euo pipefail

MODEL="${MODEL:-ByteDance-Seed/BAGEL-7B-MoT}"
PORT="${PORT:-8091}"
MASTER_ADDRESS="${MASTER_ADDRESS:-127.0.0.1}"
MASTER_PORT="${MASTER_PORT:-8092}"
STAGE="all"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DEPLOY_CONFIG="${DEPLOY_CONFIG:-$SCRIPT_DIR/../../../vllm_omni/deploy/bagel.yaml}"
EXTRA_ARGS=()

usage() {
    cat <<EOF
Usage: $0 [OPTIONS] [-- EXTRA_VLLM_ARGS...]

Options:
  --stage {0|1|all}          Stage to launch (default: all)
  --model MODEL              Model name/path (default: $MODEL)
  --port PORT                API port for stage 0 (default: $PORT)
  --master-address ADDRESS   Master/orchestrator address (default: $MASTER_ADDRESS)
  --master-port PORT         Master/orchestrator port (default: $MASTER_PORT)
  --deploy-config PATH       Deploy config YAML path (default: $DEPLOY_CONFIG)
  --help                     Show this help message

Examples:
  $0 --stage 0
  $0 --stage 1
  $0 --stage 0 -- --tensor-parallel-size 2
  $0 --stage 1 -- --gpu-memory-utilization 0.9

Notes:
  - Use different terminal sessions to launch stage 0 and stage 1 separately.
  - Extra args after '--' are forwarded only to the selected stage.
  - When using '--stage all', the extra args are forwarded to both stages.
EOF
}

while [[ $# -gt 0 ]]; do
    case "$1" in
        --stage)
            STAGE="$2"
            shift 2
            ;;
        --model)
            MODEL="$2"
            shift 2
            ;;
        --port)
            PORT="$2"
            shift 2
            ;;
        --master-address)
            MASTER_ADDRESS="$2"
            shift 2
            ;;
        --master-port)
            MASTER_PORT="$2"
            shift 2
            ;;
        --deploy-config)
            DEPLOY_CONFIG="$2"
            shift 2
            ;;
        --help|-h)
            usage
            exit 0
            ;;
        --)
            shift
            EXTRA_ARGS=("$@")
            break
            ;;
        *)
            echo "Unknown option: $1" >&2
            usage
            exit 1
            ;;
    esac
done

if [[ "$STAGE" != "0" && "$STAGE" != "1" && "$STAGE" != "all" ]]; then
    echo "Invalid --stage value: $STAGE" >&2
    usage
    exit 1
fi

print_config() {
    echo "Model: $MODEL"
    echo "API Port: $PORT"
    echo "Master Address: $MASTER_ADDRESS"
    echo "Master Port: $MASTER_PORT"
    echo "Deploy Config: $DEPLOY_CONFIG"
    echo "Selected Stage: $STAGE"
    if [[ ${#EXTRA_ARGS[@]} -gt 0 ]]; then
        echo "Extra Args: ${EXTRA_ARGS[*]}"
    fi
}

run_stage_0() {
    echo "Starting Stage 0 (Thinker) as master..."
    vllm serve "$MODEL" --omni \
        --port "$PORT" \
        --deploy-config "$DEPLOY_CONFIG" \
        --stage-id 0 \
        --omni-master-address "$MASTER_ADDRESS" \
        --omni-master-port "$MASTER_PORT" \
        ${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"}
}

run_stage_1() {
    echo "Starting Stage 1 (DiT) in headless mode..."
    vllm serve "$MODEL" --omni \
        --deploy-config "$DEPLOY_CONFIG" \
        --stage-id 1 \
        --headless \
        --omni-master-address "$MASTER_ADDRESS" \
        --omni-master-port "$MASTER_PORT" \
        ${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"}
}

echo "Starting Bagel multi-stage server..."
print_config

case "$STAGE" in
    0)
        run_stage_0
        ;;
    1)
        run_stage_1
        ;;
    all)
        echo "Launching both stages in one session (legacy mode)..."
        echo "Starting Stage 0 (Thinker) in background first..."
        run_stage_0 &
        STAGE_0_PID=$!

        cleanup() {
            if [[ -n "${STAGE_0_PID:-}" ]]; then
                kill "$STAGE_0_PID" 2>/dev/null || true
                wait "$STAGE_0_PID" 2>/dev/null || true
            fi
        }

        trap cleanup EXIT INT TERM

        echo "Waiting briefly for Stage 0 to initialize..."
        sleep 2
        run_stage_1
        ;;
esac