BAGEL-7B-MoT¶

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/bagel

Installation¶

See README.md for installation instructions.

Architecture¶

BAGEL-7B-MoT is a Mixture-of-Transformers (MoT) model supporting both image generation and understanding. It offers two deployment topologies:

Topology	Stages	Description
Two-stage (default)	Stage 0 (Thinker, AR) + Stage 1 (DiT, Diffusion)	Thinker handles text/understanding via vLLM AR engine; DiT handles image generation. KV cache is transferred between stages.
Single-stage	Stage 0 (DiT, Diffusion) only	The DiT stage contains a full LLM, ViT, VAE, and tokenizer internally. All modalities are handled within a single diffusion process.

Both topologies support all four modalities: text2img, img2img, img2text, text2text.

Note: These examples work with the default configuration on a single NVIDIA A100 (80 GB). They were also tested on two NVIDIA RTX 5000 Ada GPUs (32 GB each). For dual-GPU setups, modify the deploy YAML to distribute the stages across devices.

Launch the Server¶

Two-Stage (Default)¶

The default pipeline is auto-detected from the model, so no extra flags are needed:

vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091

Or use the convenience script:

cd examples/online_serving/bagel
bash run_server.sh

# Launch a single stage per terminal
bash run_server_stage_cli.sh --stage 0
bash run_server_stage_cli.sh --stage 1

To use a custom deploy YAML (note: --stage-configs-path is deprecated in favor of --deploy-config):

vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
    --deploy-config /path/to/deploy_config.yaml

See bagel.yaml for the default two-stage deploy configuration.

Single-Stage¶

The DiT stage contains a full LLM, ViT, VAE, and tokenizer, so it can handle all modalities (text2img, img2img, img2text, text2text, think) without a separate Thinker stage:

vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
    --deploy-config vllm_omni/deploy/bagel_single_stage.yaml

See bagel_single_stage.yaml for the configuration. Its `pipeline: bagel_single_stage` field selects the single-stage topology from the pipeline registry.

Tensor Parallelism (TP)¶

For larger models or multi-GPU environments, enable TP via CLI:

vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --tensor-parallel-size 2

Or set tensor_parallel_size per stage in a custom deploy YAML.

Multi-Node Deployment¶

Deploy each stage on a separate node for better resource utilization. Replace <ORCHESTRATOR_IP> with the actual IP address of your orchestrator node.

1. Launch Stage 0 (Thinker / Orchestrator) on the orchestrator node:

# API server port for client requests: 8000
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
    --port 8000 \
    --stage-id 0 \
    --omni-master-address <ORCHESTRATOR_IP> \
    --omni-master-port 8091

2. Launch Stage 1 (DiT) on the remote node in headless mode:

vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
    --stage-id 1 \
    --headless \
    --omni-master-address <ORCHESTRATOR_IP> \
    --omni-master-port 8091

Or use the convenience script:

# Terminal 1: Stage 0
bash run_server_stage_cli.sh --stage 0

# Terminal 2: Stage 1
bash run_server_stage_cli.sh --stage 1

# With extra args
bash run_server_stage_cli.sh --stage 0 -- --tensor-parallel-size 2
bash run_server_stage_cli.sh --stage 1 -- --gpu-memory-utilization 0.9

vllm serve arguments:

Argument	Description
`--stage-id`	Which stage this process runs (0 = Thinker, 1 = DiT)
`--headless`	Run without the API server (worker-only mode)
`-oma` / `--omni-master-address`	Orchestrator master address
`-omp` / `--omni-master-port`	Orchestrator master port

Important (startup order): Stage 0 (orchestrator) must be launched before Stage 1 (headless). Stage 0 will appear to hang on startup until the Stage 1 worker connects; this is expected behavior.
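The two launch commands and their required order can be generated programmatically, which helps when scripting a multi-node rollout. The following is a minimal sketch that uses only the flags listed in the table above; the helper names `stage_command` and `launch_plan` and the address `10.0.0.1` are hypothetical placeholders, not part of vLLM-Omni.

```python
import shlex

MODEL = "ByteDance-Seed/BAGEL-7B-MoT"


def stage_command(stage_id, master_address, master_port=8091, api_port=8000):
    """Build the `vllm serve` argv for one stage (hypothetical helper)."""
    cmd = ["vllm", "serve", MODEL, "--omni", "--stage-id", str(stage_id)]
    if stage_id == 0:
        # Stage 0 hosts the API server, so it gets the client-facing port.
        cmd += ["--port", str(api_port)]
    else:
        # Remote stages run as headless workers without an API server.
        cmd.append("--headless")
    cmd += [
        "--omni-master-address", master_address,
        "--omni-master-port", str(master_port),
    ]
    return cmd


def launch_plan(master_address):
    """Return the commands in startup order: Stage 0 first, then Stage 1."""
    return [stage_command(0, master_address), stage_command(1, master_address)]


# Print shell-ready commands; 10.0.0.1 stands in for <ORCHESTRATOR_IP>.
for argv in launch_plan("10.0.0.1"):
    print(shlex.join(argv))
```

The sketch only prints the commands. Each one still has to be run on its own node, with Stage 0 started first.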

Inter-Stage Connectors¶

When deploying stages across nodes, configure the connector type in the deploy YAML:

SharedMemoryConnector (default): Used for single-node deployments. No explicit configuration needed.
MooncakeTransferEngineConnector: For multi-node setups with RDMA hardware. Defined in bagel.yaml under connectors.rdma_connector.

To use Mooncake, create a custom deploy YAML that binds output_connectors / input_connectors on each stage to the rdma_connector defined in the connectors section.

Send Requests¶

cd examples/online_serving/bagel

Text to Image (text2img)¶

Python client:

python openai_chat_client.py \
    --prompt "A beautiful sunset over mountains" \
    --modality text2img \
    --output sunset.png \
    --steps 50

curl:

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": [{"type": "text", "text": "<|im_start|>A beautiful sunset over mountains<|im_end|>"}]}],
    "modalities": ["image"],
    "height": 512,
    "width": 512,
    "num_inference_steps": 50,
    "seed": 42
  }'
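The same request can be sent from Python without the bundled client. This is a minimal standard-library sketch that reproduces the curl payload above; the function names `build_text2img_payload` and `post_chat_completion` are hypothetical, and the response schema is not described here, so the example stops at returning the decoded JSON.

```python
import json
import urllib.request


def build_text2img_payload(prompt, height=512, width=512, steps=50, seed=42):
    """Mirror the curl body: the prompt is wrapped in <|im_start|>...<|im_end|>."""
    text = f"<|im_start|>{prompt}<|im_end|>"
    return {
        "messages": [{"role": "user", "content": [{"type": "text", "text": text}]}],
        "modalities": ["image"],
        "height": height,
        "width": width,
        "num_inference_steps": steps,
        "seed": seed,
    }


def post_chat_completion(payload, server="http://localhost:8091"):
    """POST the payload to /v1/chat/completions and return the decoded JSON."""
    request = urllib.request.Request(
        f"{server}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_text2img_payload("A beautiful sunset over mountains")
print(json.dumps(payload, indent=2))
# response = post_chat_completion(payload)  # requires a running server
```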

Image to Image (img2img)¶

Python client:

python openai_chat_client.py \
    --prompt "Make the cat stand up" \
    --modality img2img \
    --image-url /path/to/input.jpg \
    --output transformed.png

curl:

IMAGE_BASE64=$(base64 -w 0 cat.jpg)

cat <<EOF > payload.json
{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "<|im_start|>Make the cat stand up<|im_end|>"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMAGE_BASE64}"}}
      ]
    }],
    "modalities": ["image"],
    "height": 512,
    "width": 512,
    "num_inference_steps": 50,
    "seed": 42
}
EOF

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @payload.json
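The shell steps above (base64-encode the image, embed it as a data URL, build the JSON body) translate directly to Python. The sketch below assumes the same payload layout as the curl example; `to_data_url` and `build_img2img_payload` are hypothetical helper names, and the three stand-in bytes only keep the example runnable without a real image.

```python
import base64
import mimetypes


def to_data_url(image_bytes, filename):
    """Encode raw image bytes as a data URL, like `base64 -w 0` plus the prefix."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        raise ValueError(f"cannot infer MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")  # no line wrapping
    return f"data:{mime};base64,{encoded}"


def build_img2img_payload(prompt, image_bytes, filename,
                          height=512, width=512, steps=50, seed=42):
    """Mirror the payload.json written by the heredoc above."""
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": f"<|im_start|>{prompt}<|im_end|>"},
                {"type": "image_url",
                 "image_url": {"url": to_data_url(image_bytes, filename)}},
            ],
        }],
        "modalities": ["image"],
        "height": height,
        "width": width,
        "num_inference_steps": steps,
        "seed": seed,
    }


# In practice, read the bytes with open("cat.jpg", "rb").read().
payload = build_img2img_payload("Make the cat stand up", b"abc", "cat.jpg")
print(payload["messages"][0]["content"][1]["image_url"]["url"])
```

Building the body in Python avoids the shell quoting pitfalls of interpolating a large base64 string into a heredoc.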

Image to Text (img2text)¶

Python client:

python openai_chat_client.py \
    --prompt "Describe this image in detail" \
    --modality img2text \
    --image-url /path/to/image.jpg

curl:

IMAGE_BASE64=$(base64 -w 0 cat.jpg)

cat <<EOF > payload.json
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "<|im_start|>user\n<|image_pad|>\nDescribe this image in detail<|im_end|>\n<|im_start|>assistant\n"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMAGE_BASE64}"}}
    ]
  }],
  "modalities": ["text"]
}
EOF

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @payload.json

Text to Text (text2text)¶

Python client:

python openai_chat_client.py \
    --prompt "What is the capital of France?" \
    --modality text2text

curl:

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": [{"type": "text", "text": "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"}]}],
    "modalities": ["text"]
  }'
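The two understanding requests above (img2text and text2text) wrap the prompt in a chat template by hand, and img2text additionally includes an `<|image_pad|>` placeholder. A small helper makes that pattern explicit. This is a sketch derived only from the two curl bodies shown here; `wrap_understanding_prompt` and `build_text_payload` are hypothetical names, and whether the server also accepts unwrapped prompts is not covered here.

```python
import json


def wrap_understanding_prompt(prompt, with_image=False):
    """Wrap a prompt in the user/assistant template used by the curl examples."""
    image_slot = "<|image_pad|>\n" if with_image else ""
    return (
        f"<|im_start|>user\n{image_slot}{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def build_text_payload(prompt, image_data_url=None):
    """Build a text-output request; pass a data URL to make it img2text."""
    content = [{
        "type": "text",
        "text": wrap_understanding_prompt(prompt, with_image=image_data_url is not None),
    }]
    if image_data_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_data_url}})
    return {"messages": [{"role": "user", "content": content}], "modalities": ["text"]}


text2text = build_text_payload("What is the capital of France?")
print(json.dumps(text2text))
```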

Python Client Arguments¶

Argument	Default	Description
`--prompt` / `-p`	`A cute cat`	Text prompt
`--output` / `-o`	`bagel_output.png`	Output file path
`--server` / `-s`	`http://localhost:8091`	Server URL
`--image-url` / `-i`	`None`	Input image URL or local path (img2img/img2text)
`--modality` / `-m`	`text2img`	`text2img`, `img2img`, `img2text`, `text2text`
`--height`	`512`	Image height in pixels
`--width`	`512`	Image width in pixels
`--steps`	`25`	Number of inference steps
`--seed`	`42`	Random seed
`--negative`	`None`	Negative prompt for CFG

Example with custom parameters:

python openai_chat_client.py \
    --prompt "A futuristic city" \
    --modality text2img \
    --height 768 \
    --width 768 \
    --steps 50 \
    --seed 42 \
    --negative "blurry, low quality"
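For readers writing their own client, the argument table above maps onto `argparse` as follows. This is an illustrative sketch that mirrors the documented flags and defaults; it is not the source of `openai_chat_client.py`, and the `build_parser` name is hypothetical.

```python
import argparse

MODALITIES = ["text2img", "img2img", "img2text", "text2text"]


def build_parser():
    """Declare the flags and defaults listed in the argument table."""
    parser = argparse.ArgumentParser(description="BAGEL chat client (sketch)")
    parser.add_argument("--prompt", "-p", default="A cute cat", help="Text prompt")
    parser.add_argument("--output", "-o", default="bagel_output.png", help="Output file path")
    parser.add_argument("--server", "-s", default="http://localhost:8091", help="Server URL")
    parser.add_argument("--image-url", "-i", default=None,
                        help="Input image URL or local path (img2img/img2text)")
    parser.add_argument("--modality", "-m", default="text2img", choices=MODALITIES)
    parser.add_argument("--height", type=int, default=512, help="Image height in pixels")
    parser.add_argument("--width", type=int, default=512, help="Image width in pixels")
    parser.add_argument("--steps", type=int, default=25, help="Number of inference steps")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    parser.add_argument("--negative", default=None, help="Negative prompt for CFG")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(["--modality", "img2img", "-i", "/path/to/input.jpg"])
print(args.modality, args.image_url, args.steps)
```

Note that the documented default for `--steps` is 25, while the curl examples pass `num_inference_steps` of 50 explicitly.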

Configuration Reference¶

Deploy YAML Files¶

File	Description
`bagel.yaml`	Two-stage default (Thinker + DiT on GPU 0)
`bagel_single_stage.yaml`	Single-stage (DiT only)

Key Deploy YAML Fields¶

Field	Scope	Description
`pipeline`	top-level	Override auto-detected pipeline (e.g. `bagel_single_stage`)
`stages[].stage_id`	per-stage	Stage identifier (0, 1, ...)
`stages[].devices`	per-stage	GPU device IDs (e.g. `"0"`, `"0,1"`)
`stages[].max_num_seqs`	per-stage	Maximum concurrent sequences
`stages[].gpu_memory_utilization`	per-stage	Fraction of GPU memory to use
`stages[].enforce_eager`	per-stage	Disable CUDA graphs
`stages[].tensor_parallel_size`	per-stage	TP degree for this stage
`connectors`	top-level	Define available connector instances (SHM, Mooncake)
`platforms`	top-level	Platform-specific overrides (e.g. `xpu`)
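As an illustration of the per-stage fields, the sketch below builds a two-GPU layout that places the Thinker on GPU 0 and the DiT on GPU 1. The nesting is an assumption inferred from the `stages[].field` notation in the table, the values are illustrative, and a real deploy file probably needs further fields, so start from the shipped bagel.yaml rather than from this fragment. The output is JSON, which YAML 1.2 parsers accept, so no third-party YAML library is needed. The `two_gpu_deploy_config` name is hypothetical.

```python
import json


def two_gpu_deploy_config(memory_fraction=0.9):
    """Return a stages list with one stage per GPU (assumed layout)."""
    return {
        "stages": [
            # Stage 0: Thinker (AR) on the first GPU.
            {"stage_id": 0, "devices": "0",
             "gpu_memory_utilization": memory_fraction},
            # Stage 1: DiT (diffusion) on the second GPU.
            {"stage_id": 1, "devices": "1",
             "gpu_memory_utilization": memory_fraction},
        ]
    }


config = two_gpu_deploy_config()
print(json.dumps(config, indent=2))
```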

FAQ¶

If you encounter OOM errors, try decreasing max_model_len or gpu_memory_utilization in the deploy YAML.

Two-stage VRAM usage:

Stage	VRAM
Stage 0 (Thinker)	15.04 GiB + KV Cache
Stage 1 (DiT)	26.50 GiB
Total	~42 GiB + KV Cache

Single-stage VRAM usage: The DiT loads the full model (~42 GiB) in one process.
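The figures above can be turned into a rough headroom estimate for the KV cache. The sketch below is back-of-the-envelope arithmetic under stated assumptions: card capacities are treated as GiB, activation and framework overhead are ignored, and `gpu_memory_utilization` is modeled as a simple cap on usable memory. The `kv_cache_headroom` helper is hypothetical.

```python
THINKER_GIB = 15.04  # Stage 0 weights, from the table above
DIT_GIB = 26.50      # Stage 1 weights, from the table above


def kv_cache_headroom(gpu_gib, resident_gib, gpu_memory_utilization=1.0):
    """Memory left for the KV cache after the resident weights are loaded."""
    return gpu_gib * gpu_memory_utilization - resident_gib


total = THINKER_GIB + DIT_GIB

# Both stages on one 80 GB card versus one stage per 32 GB card.
single_gpu = kv_cache_headroom(80, total)
dual_gpu_stage0 = kv_cache_headroom(32, THINKER_GIB)
dual_gpu_stage1 = kv_cache_headroom(32, DIT_GIB)

print(f"total weights:           {total:.2f} GiB")
print(f"one 80 GB GPU headroom:  {single_gpu:.2f} GiB")
print(f"32 GB GPU 0 (Thinker):   {dual_gpu_stage0:.2f} GiB")
print(f"32 GB GPU 1 (DiT):       {dual_gpu_stage1:.2f} GiB")
```

Under these assumptions the DiT stage leaves the least slack on a 32 GB card, so that is the stage to watch first when OOM errors appear.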

Example materials¶

openai_chat_client.py

run_server.sh

run_server_stage_cli.sh