Continuous Batching for Step-Wise Diffusion
Experimental Feature
This feature is experimental. It currently applies only to native diffusion pipelines running with step_execution=True.
This document describes the batching extension built on top of Diffusion Step Execution. The base step-execution contract is unchanged. The batching work is mainly in the scheduler and runner layers.
Why It Helps
Step-wise execution breaks a long denoise loop into scheduler-visible units. That gives the runtime a place to admit other compatible requests between steps instead of waiting for an entire request to finish.
This matters most in low-MFU or bursty serving scenarios:
- one request's denoise step may not fully saturate the GPU
- several compatible requests can share the same denoise forward pass
- throughput and device utilization can improve without changing request-local scheduler state
This is not a guaranteed single-request latency win. The main benefit is usually higher utilization and better throughput when the workload contains multiple in-flight compatible requests.
Overview
With continuous batching enabled:
- the scheduler may keep multiple compatible requests active at the same time
- the runner packs request-local step state into one InputBatch
- denoise_step() runs on that batch
- step_scheduler() and post_decode() still run per request
The current implementation is conservative:
- only compatible requests are batched together
- request-mode diffusion still runs with max_num_seqs=1
- per-request progress and completion remain independent
Enablement
Use --step-execution as the feature gate, then increase --max-num-seqs above 1 if you want batching.
--max-num-seqs 1 keeps the step-wise path without enabling batching.
For a reproducible replay flow using the bundled serving benchmark, see the Qwen-Image replay commands in benchmarks/diffusion/README.md and benchmarks/diffusion/performance_dashboard/qwen_image_serving_performance.md.
Scheduler
The scheduler derives its batch capacity from max_num_seqs through max_num_running_reqs.
Batch admission is gated by SamplingParamsKey, which is built from shape-sensitive and CFG-sensitive sampling fields. This is the core correctness rule for batching: requests are only co-batched when they share the same denoise tensor contract.
There are three important details:
- num_inference_steps is not part of the key, so requests with different total step counts can still share a batch
- requests also do not need to be at the same current denoise progress; active requests can continue batching even when their current step indices diverge
- admission is still FIFO, so an incompatible request at the head of the waiting queue blocks later compatible requests
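The FIFO rule and its head-of-line blocking can be illustrated with a minimal sketch. This is not the scheduler's actual code: WaitingRequest, admit_fifo, and the tuple used as a key are hypothetical stand-ins, and the sketch assumes an empty running set accepts whatever request is at the head of the queue.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Hashable, List


@dataclass
class WaitingRequest:
    # Hypothetical stand-in for the scheduler's lightweight request state.
    request_id: str
    key: Hashable  # stand-in for SamplingParamsKey


def admit_fifo(
    waiting: Deque[WaitingRequest],
    running: List[WaitingRequest],
    max_num_running_reqs: int,
) -> List[WaitingRequest]:
    """Admit from the head of the queue while the head matches the running batch."""
    admitted: List[WaitingRequest] = []
    while waiting and len(running) < max_num_running_reqs:
        head = waiting[0]
        # Assumption: the running requests define the batch key; if nothing
        # is running, the head request is admitted and sets the key.
        if running and head.key != running[0].key:
            break  # head-of-line blocking: never look past an incompatible head
        running.append(waiting.popleft())
        admitted.append(head)
    return admitted


# "c" is compatible with "a" but sits behind the incompatible "b".
waiting = deque([
    WaitingRequest("a", ("1024x1024", "cfg")),
    WaitingRequest("b", ("512x512", "cfg")),
    WaitingRequest("c", ("1024x1024", "cfg")),
])
running: List[WaitingRequest] = []
admitted = admit_fifo(waiting, running, max_num_running_reqs=4)
print([r.request_id for r in admitted])  # ['a']
print([r.request_id for r in waiting])   # ['b', 'c']
```

In this sketch request c stays queued even though a batch slot is free, which is the trade-off of keeping admission strictly FIFO.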
Today that compatibility rule is still shape-sensitive. height, width, num_frames, and CFG-related fields remain part of the key, so different resolutions or incompatible guidance settings do not co-batch yet. The key also covers LoRA identity (lora_int_id, lora_scale), so requests targeting different adapters or scales run in separate batches and the worker can activate exactly one adapter per step.
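A rough sketch of such a key, assuming it behaves like a frozen, hashable record, is shown below. The class name SamplingParamsKeySketch, the helper key_from_params, the uses_cfg field, and the guidance_scale parameter are illustrative assumptions; only height, width, num_frames, lora_int_id, and lora_scale are named in the text above, and the real key's CFG fields may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingParamsKeySketch:
    # Shape-sensitive fields named in the text.
    height: int
    width: int
    num_frames: int
    # Hypothetical stand-in for the CFG-related fields.
    uses_cfg: bool
    # LoRA identity fields named in the text.
    lora_int_id: Optional[int] = None
    lora_scale: Optional[float] = None


def key_from_params(params: dict) -> SamplingParamsKeySketch:
    """Build a key from a params dict; num_inference_steps is deliberately ignored."""
    return SamplingParamsKeySketch(
        height=params["height"],
        width=params["width"],
        num_frames=params.get("num_frames", 1),
        # Assumption: a guidance scale above 1.0 means CFG is active.
        uses_cfg=params.get("guidance_scale", 1.0) > 1.0,
        lora_int_id=params.get("lora_int_id"),
        lora_scale=params.get("lora_scale"),
    )


base = {"height": 1024, "width": 1024, "guidance_scale": 4.0}
key_20_steps = key_from_params({**base, "num_inference_steps": 20})
key_50_steps = key_from_params({**base, "num_inference_steps": 50})
key_small = key_from_params({**base, "height": 512, "width": 512})
key_lora = key_from_params({**base, "lora_int_id": 1, "lora_scale": 1.0})

print(key_20_steps == key_50_steps)  # True: step count is not part of the key
print(key_20_steps == key_small)     # False: different resolution
print(key_20_steps == key_lora)      # False: different adapter
```

Because the dataclass is frozen, instances are hashable and can be compared or used as dictionary keys when grouping requests.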
The current batching unit is one OmniDiffusionRequest. Requests with multiple prompts do not participate in batching today.
Runner
The runner keeps persistent per-request execution state in DiffusionRequestState, while the scheduler owns a separate lightweight request state for queueing and lifecycle tracking.
For each step, the runner builds an InputBatch from the active request states:
- prompt embeddings and masks are normalized and padded
- dynamic tensors such as latents and timesteps are gathered each step
- buffers are reused when batch composition stays the same
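The padding step can be pictured with a small sketch that uses plain Python lists in place of tensors. PackedPrompts and pad_prompt_embeds are hypothetical names, and right-padding with zero rows is an assumption; the actual InputBatch may normalize and pad differently.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PackedPrompts:
    embeds: List[List[List[float]]]  # [batch, max_len, dim]
    mask: List[List[int]]            # [batch, max_len]; 1 = real token, 0 = padding


def pad_prompt_embeds(per_request: Sequence[List[List[float]]]) -> PackedPrompts:
    """Right-pad each request's [len, dim] embeddings with zero rows to a common length."""
    max_len = max(len(e) for e in per_request)
    dim = len(per_request[0][0])
    embeds: List[List[List[float]]] = []
    mask: List[List[int]] = []
    for e in per_request:
        pad = max_len - len(e)
        # Copy the real rows, then append zero rows up to max_len.
        embeds.append([list(row) for row in e] + [[0.0] * dim for _ in range(pad)])
        mask.append([1] * len(e) + [0] * pad)
    return PackedPrompts(embeds=embeds, mask=mask)


req_a = [[1.0, 2.0], [3.0, 4.0]]               # 2 tokens, dim 2
req_b = [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]  # 3 tokens, dim 2
packed = pad_prompt_embeds([req_a, req_b])
print(packed.mask)       # [[1, 1, 0], [1, 1, 1]]
print(packed.embeds[0])  # [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
```

The mask lets the batched forward pass ignore padded positions, so a shorter prompt is not affected by the longer prompt it shares a batch with.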
The step-wise batched path is:
- Run prepare_encode() for newly admitted requests.
- Build or refresh InputBatch.
- Run one batched denoise_step(input_batch).
- Slice the batched noise_pred back per request.
- Run per-request step_scheduler().
- Run post_decode() only for requests that finished denoising.
- Scatter updated latents back into persistent request state with scatter_latents().
This keeps the shared work limited to the denoise forward pass while preserving request-local scheduler state and outputs.
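The step path can be sketched as a toy loop over plain Python floats. Everything here is a simplified assumption: RequestState stands in for DiffusionRequestState, the function bodies are placeholder arithmetic rather than the real model or scheduler math, and prepare_encode() is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestState:
    # Hypothetical, simplified stand-in for DiffusionRequestState.
    request_id: str
    latents: List[float]
    num_inference_steps: int
    step_index: int = 0
    output: Optional[List[float]] = None

    @property
    def finished(self) -> bool:
        return self.step_index >= self.num_inference_steps


def denoise_step(batched_latents: List[List[float]]) -> List[List[float]]:
    """Toy batched forward pass: one call covers every row in the batch."""
    return [[0.5 * x for x in row] for row in batched_latents]


def step_scheduler(state: RequestState, noise_pred: List[float]) -> List[float]:
    """Toy per-request update; advances only this request's step index."""
    state.step_index += 1
    return [x - n for x, n in zip(state.latents, noise_pred)]


def post_decode(latents: List[float]) -> List[float]:
    """Toy decode: returns a copy of the final latents."""
    return list(latents)


def run_one_step(active: List[RequestState]) -> List[RequestState]:
    batch = [s.latents for s in active]          # build or refresh the batch
    noise_pred = denoise_step(batch)             # one shared forward pass
    still_active: List[RequestState] = []
    for state, pred in zip(active, noise_pred):  # slice back per request
        new_latents = step_scheduler(state, pred)
        if state.finished:
            state.output = post_decode(new_latents)
        else:
            still_active.append(state)
        state.latents = new_latents              # scatter back into request state
    return still_active


# Two requests with different total step counts share the first forward pass.
a = RequestState("a", [8.0], num_inference_steps=1)
b = RequestState("b", [8.0], num_inference_steps=3)
active = [a, b]
calls = 0
while active:
    active = run_one_step(active)
    calls += 1

print(a.output, b.output, calls)  # [4.0] [1.0] 3
```

Request a finishes after the first step and leaves the batch, while b keeps stepping on its own schedule; only the denoise call is shared.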
Engine
DiffusionEngine provides the background loop and async add-request path needed for multiple requests to accumulate in the scheduler.
This is supporting infrastructure, not the main design point. The batching behavior is defined by scheduler-side compatibility gating and runner-side batch packing.
Current Limitations
- Experimental feature; use max_num_seqs=1 for the older conservative path.
- Only native pipelines that already support step_execution=True.
- Request-mode diffusion still clamps max_num_seqs back to 1.
- Only homogeneous batches keyed by SamplingParamsKey are supported.
- Multi-prompt requests are not batched.
- cache_backend, KV transfer, and other request-mode extras are not wired into the batched step-wise path yet.
- Future work can relax the current same-shape restriction with richer heterogeneous batching policies such as bucketing or padded execution for different resolutions.
Related Files
- Scheduler base: vllm_omni/diffusion/sched/base_scheduler.py
- Scheduler interface: vllm_omni/diffusion/sched/interface.py
- Step scheduler: vllm_omni/diffusion/sched/step_scheduler.py
- Runner: vllm_omni/diffusion/worker/diffusion_model_runner.py
- Input batch: vllm_omni/diffusion/worker/input_batch.py
- Tests: tests/diffusion/test_diffusion_scheduler.py