Continuous Batching for Step-Wise Diffusion

Experimental Feature

This feature is experimental. It currently applies only to native diffusion pipelines running with step_execution=True.

This document describes the batching extension built on top of Diffusion Step Execution. The base step-execution contract is unchanged. The batching work is mainly in the scheduler and runner layers.

Why It Helps

Step-wise execution breaks a long denoise loop into scheduler-visible units. That gives the runtime a place to admit other compatible requests between steps instead of waiting for an entire request to finish.

This matters most in serving scenarios with low MFU (model FLOPs utilization) or bursty traffic:

one request's denoise step may not fully saturate the GPU
several compatible requests can share the same denoise forward pass
throughput and device utilization can improve without changing request-local scheduler state

This is not a guaranteed single-request latency win. The main benefit is usually higher utilization and better throughput when the workload contains multiple in-flight compatible requests.

Overview

With continuous batching enabled:

the scheduler may keep multiple compatible requests active at the same time
the runner packs request-local step state into one InputBatch
denoise_step() runs on that batch
step_scheduler() and post_decode() still run per request

The current implementation is conservative:

only compatible requests are batched together
per-request progress and completion remain independent

Here, "continuous batching" means the step-wise path enabled by step_execution=True. Request-mode DiffusionRequestBatch is static request-level batching for one full pipeline forward() call; it does not admit or remove requests between denoise steps.

Enablement

Use --step-execution as the feature gate, then increase --max-num-seqs above 1 if you want batching:

vllm serve Qwen/Qwen-Image --omni \
  --port 8091 \
  --step-execution \
  --max-num-seqs 8

--max-num-seqs 1 keeps the step-wise path without enabling batching.

For a reproducible replay flow using the bundled serving benchmark, see the Qwen-Image replay commands in benchmarks/diffusion/README.md and benchmarks/diffusion/performance_dashboard/qwen_image_serving_performance.md.

Scheduler

The scheduler derives its batch capacity from max_num_seqs through max_num_running_reqs.

Batch admission is gated by SamplingParamsKey, which is built from shape-sensitive and CFG-sensitive sampling fields. This is the core correctness rule for batching: requests are only co-batched when they share the same denoise tensor contract.

There are three important details:

num_inference_steps is not part of the key, so requests with different total step counts can still share a batch
requests also do not need to be at the same current denoise progress; active requests can continue batching even when their current step indices diverge
admission is still FIFO, so an incompatible request at the head of the waiting queue blocks later compatible requests
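
The head-of-line behavior in the last point can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheduler's actual code: `admit_fifo`, the `key_of` callback, and the tuple keys are hypothetical names invented for this example.

```python
from collections import deque


def admit_fifo(waiting, running, capacity, key_of):
    """Admit requests from the head of `waiting` while they stay compatible.

    Hypothetical helper: stops at the first incompatible head request,
    even if later requests in the queue would be compatible.
    """
    admitted = []
    while waiting and len(running) + len(admitted) < capacity:
        active = running + admitted
        head = waiting[0]
        if active and key_of(head) != key_of(active[0]):
            break  # head-of-line blocking: do not skip past the head
        admitted.append(waiting.popleft())
    return admitted


# Toy keys: (height, width) stands in for the full compatibility key.
keys = {"r0": (1024, 1024), "a": (1024, 1024), "b": (512, 512), "c": (1024, 1024)}
waiting = deque(["a", "b", "c"])
running = ["r0"]
admitted = admit_fifo(waiting, running, capacity=4, key_of=keys.__getitem__)
```

In this toy run, "a" is admitted, "b" stops admission because its key differs, and "c" stays queued behind "b" even though it would be compatible with the running batch.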

Today that compatibility rule is still shape-sensitive. height, width, num_frames, and CFG-related fields remain part of the key, so different resolutions or incompatible guidance settings do not co-batch yet. The key also covers LoRA identity (lora_int_id, lora_scale), so requests targeting different adapters or scales run in separate batches and the worker can activate exactly one adapter per step.
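
A compatibility key of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not the real SamplingParamsKey: the class name `SketchSamplingKey`, the helper `key_from_params`, the use of `guidance_scale` as the stand-in for the CFG-related fields, and all default values are hypothetical. Only the field names height, width, num_frames, lora_int_id, and lora_scale come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchSamplingKey:
    """Hypothetical stand-in for a shape-, CFG-, and LoRA-sensitive key."""
    height: int
    width: int
    num_frames: int
    guidance_scale: float  # assumed stand-in for the CFG-related fields
    lora_int_id: Optional[int]
    lora_scale: float


def key_from_params(params):
    # num_inference_steps is deliberately not read here, so requests with
    # different total step counts map to the same key.
    return SketchSamplingKey(
        height=params["height"],
        width=params["width"],
        num_frames=params.get("num_frames", 1),          # assumed default
        guidance_scale=params.get("guidance_scale", 1.0),  # assumed default
        lora_int_id=params.get("lora_int_id"),
        lora_scale=params.get("lora_scale", 1.0),        # assumed default
    )


base = {"height": 1024, "width": 1024, "num_inference_steps": 50}
fewer_steps = {**base, "num_inference_steps": 20}
smaller = {**base, "height": 512}
other_lora = {**base, "lora_int_id": 7}

can_share_batch = key_from_params(base) == key_from_params(fewer_steps)
```

Because the dataclass is frozen, instances are hashable and compare by value, which makes them usable as dictionary keys or for direct equality checks during admission.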

The scheduler batching unit is one logical OmniDiffusionRequest. In the step-wise path, runtime tensor batching is represented as StepInputBatch. For request-mode prompt semantics, see Request-Level Batching.

Runner

The runner keeps persistent per-request execution state in DiffusionRequestState, while the scheduler owns a separate lightweight request state for queueing and lifecycle tracking.

For each step, the runner builds an InputBatch from the active request states:

prompt embeddings and masks are normalized and padded
dynamic tensors such as latents and timesteps are gathered each step
buffers are reused when batch composition stays the same
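
The three points above can be illustrated with a small pure-Python sketch. It is not the real InputBatch: `SketchBatchBuilder`, the dictionary-based request states, and the use of nested lists in place of tensors are all assumptions made for the example.

```python
class SketchBatchBuilder:
    """Hypothetical builder; nested lists stand in for tensors."""

    def __init__(self):
        self._ids = None      # batch composition the cached buffers belong to
        self._prompts = None  # padded prompt embeddings
        self._masks = None    # 1 for real tokens, 0 for padding
        self.rebuilds = 0     # how often the padded buffers were rebuilt

    def build(self, states):
        ids = tuple(s["req_id"] for s in states)
        if ids != self._ids:
            # Composition changed: re-pad prompts to the longest sequence.
            max_len = max(len(s["prompt"]) for s in states)
            pad = [0.0] * len(states[0]["prompt"][0])
            self._prompts = [
                s["prompt"] + [pad] * (max_len - len(s["prompt"])) for s in states
            ]
            self._masks = [
                [1] * len(s["prompt"]) + [0] * (max_len - len(s["prompt"]))
                for s in states
            ]
            self._ids = ids
            self.rebuilds += 1
        return {
            "prompts": self._prompts,
            "masks": self._masks,
            # Dynamic values are gathered fresh on every step.
            "latents": [s["latent"] for s in states],
            "timesteps": [s["timestep"] for s in states],
        }


states = [
    {"req_id": "a", "prompt": [[1.0, 2.0]], "latent": 0.5, "timestep": 900},
    {"req_id": "b", "prompt": [[3.0, 4.0], [5.0, 6.0]], "latent": 0.1, "timestep": 500},
]
builder = SketchBatchBuilder()
first = builder.build(states)       # pads "a" to two tokens
states[0]["timestep"] = 800
second = builder.build(states)      # same composition: padded buffers reused
third = builder.build(states[1:])   # "a" left the batch: buffers rebuilt
```

The second call picks up the new timestep without re-padding the prompts, while the third call rebuilds the padded buffers because the set of active requests changed.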

The step-wise batched path is:

Run prepare_encode() for newly admitted requests.
Build or refresh InputBatch.
Run one batched denoise_step(input_batch).
Slice the batched noise_pred back per request.
Run per-request step_scheduler().
Run post_decode() only for requests that finished denoising.
Scatter updated latents back into persistent request state with scatter_latents().
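
The seven steps above can be sketched as a toy loop. The function names match the ones used in this document, but every body is a stand-in written for illustration: latents are plain floats, the "model" predicts half of each latent, and the request-state dictionaries are hypothetical.

```python
def prepare_encode(state):
    state["encoded"] = True  # toy stand-in for prompt encoding


def denoise_step(batch):
    # One shared "forward pass"; the toy model predicts half of each latent.
    return [latent / 2 for latent in batch["latents"]]


def step_scheduler(state, latent, noise_pred):
    state["step"] += 1  # request-local progress
    return latent - noise_pred


def post_decode(latent):
    return {"image": latent}  # toy stand-in for decoding


def scatter_latents(states, new_latents):
    for state, latent in zip(states, new_latents):
        state["latent"] = latent


def run_one_step(active, outputs):
    for s in active:
        if not s["encoded"]:
            prepare_encode(s)                                    # step 1
    batch = {"latents": [s["latent"] for s in active]}          # step 2
    noise_pred = denoise_step(batch)                             # step 3
    new_latents = [
        step_scheduler(s, latent, pred)                          # steps 4-5
        for s, latent, pred in zip(active, batch["latents"], noise_pred)
    ]
    for s, latent in zip(active, new_latents):
        if s["step"] == s["num_steps"]:
            outputs[s["req_id"]] = post_decode(latent)           # step 6
    scatter_latents(active, new_latents)                         # step 7
    return [s for s in active if s["step"] < s["num_steps"]]


active = [
    {"req_id": "a", "latent": 8.0, "step": 0, "num_steps": 2, "encoded": False},
    {"req_id": "b", "latent": 24.0, "step": 0, "num_steps": 3, "encoded": False},
]
outputs, batch_sizes = {}, []
while active:
    batch_sizes.append(len(active))
    active = run_one_step(active, outputs)
```

Request "a" finishes after two steps and leaves the batch while "b" runs one more step alone, so the batch shrinks from two requests to one without either request waiting on the other's total step count.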

This keeps the shared work limited to the denoise forward pass while preserving request-local scheduler state and outputs.

Engine

DiffusionEngine provides the background loop and async add-request path needed for multiple requests to accumulate in the scheduler.

When step_execution=True, the engine routes work through the step-wise executor path. The continuous batching behavior is defined by scheduler-side compatibility gating and runner-side StepInputBatch packing.
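
To show why a background loop plus an async add-request path lets requests accumulate, here is a minimal asyncio sketch. It is not DiffusionEngine: `SketchEngine`, its `add_request` signature, and the countdown that stands in for denoise steps are all hypothetical, and compatibility gating is omitted for brevity.

```python
import asyncio


class SketchEngine:
    """Hypothetical engine: a background loop drains an async request queue."""

    def __init__(self, max_num_seqs):
        self.max_num_seqs = max_num_seqs
        self.waiting = asyncio.Queue()
        self.batch_sizes = []  # size of the active set at each step

    async def add_request(self, req_id, num_steps):
        done = asyncio.get_running_loop().create_future()
        await self.waiting.put({"req_id": req_id, "left": num_steps, "done": done})
        return await done  # resolves when the request finishes its steps

    async def run(self):
        active = []
        while True:
            if not active:
                active.append(await self.waiting.get())  # idle: block for work
            # Between steps, admit whatever arrived, up to capacity.
            while not self.waiting.empty() and len(active) < self.max_num_seqs:
                active.append(self.waiting.get_nowait())
            self.batch_sizes.append(len(active))
            for r in active:  # one toy "step" for every active request
                r["left"] -= 1
                if r["left"] == 0:
                    r["done"].set_result(r["req_id"])
            active = [r for r in active if r["left"] > 0]
            await asyncio.sleep(0)  # yield so new requests can be enqueued


async def main():
    engine = SketchEngine(max_num_seqs=4)
    loop_task = asyncio.create_task(engine.run())
    results = await asyncio.gather(
        engine.add_request("a", 2), engine.add_request("b", 3)
    )
    loop_task.cancel()
    return results, engine.batch_sizes


results, engine_batch_sizes = asyncio.run(main())
```

The design point is that callers await their own result while the loop decides, once per step, which queued requests join the active set. With a fully synchronous call path, a second request could not be enqueued until the first had finished.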

Current Limitations

Experimental feature; set max_num_seqs=1 to keep the step-wise path without batching.
Only native pipelines that already support step_execution=True are eligible.
Only homogeneous batches keyed by SamplingParamsKey are supported.
cache_backend, KV transfer, and other request-mode extras are not wired into the batched step-wise path yet.
Future work can relax the current same-shape restriction with richer heterogeneous batching policies such as bucketing or padded execution for different resolutions.

Scheduler base: vllm_omni/diffusion/sched/base_scheduler.py
Scheduler interface: vllm_omni/diffusion/sched/interface.py
Step scheduler: vllm_omni/diffusion/sched/step_scheduler.py
Runner: vllm_omni/diffusion/worker/diffusion_model_runner.py
Input batch: vllm_omni/diffusion/worker/input_batch.py
Tests: tests/diffusion/test_diffusion_scheduler.py