Adding Step Execution Support for Diffusion Pipelines
This guide documents vLLM-Omni's stepwise diffusion contract for model authors and contributors implementing step_execution=True support for a diffusion pipeline.
For end-user enablement, supported models, and current limitations, see Step Execution.
This document describes the base step-execution contract only. For the experimental batching policy layered on top of the step-wise path, see Continuous Batching for Step-Wise Diffusion.
Current Support Scope
step_execution is not a generic diffusion toggle. It only works for pipelines that implement the segmented stateful contract in vllm_omni/diffusion/models/interface.py.
This page is intentionally author-facing. Treat runtime enablement (step_execution=True in Python or --step-execution in serving) as an opt-in user knob layered on top of the implementation contract below.
Current in-tree support:
| Pipeline | Example models | Step execution |
|---|---|---|
| QwenImagePipeline | Qwen/Qwen-Image, Qwen/Qwen-Image-2512 | Yes |
| All other diffusion pipelines | QwenImageEditPipeline, QwenImageEditPlusPipeline, QwenImageLayeredPipeline, GLM-Image, Wan, Flux, etc. | No |
Current engine/runtime limitations:
- Continuous batching with max_num_seqs > 1 is experimental and documented in Continuous Batching for Step-Wise Diffusion. Keep max_num_seqs=1 if you want the older conservative behavior.
- cache_backend is not supported in step mode.
- Request-mode extras such as KV transfer are not wired into step mode yet.
- Unsupported pipelines now fail early during model loading instead of failing on the first request.
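The fail-early behavior in the last item can be pictured as a load-time guard. The sketch below is only an illustration of that idea, not the loader's actual code: `check_step_execution_support` and `StepExecutionUnsupportedError` are hypothetical names, and it assumes pipelines opt in through the `supports_step_execution` class attribute mentioned in the validation checklist.

```python
class StepExecutionUnsupportedError(ValueError):
    """Hypothetical error type; the real loader may raise something else."""


def check_step_execution_support(pipeline_cls: type, step_execution: bool) -> None:
    """Raise at load time if step mode is requested but not implemented."""
    if not step_execution:
        return  # request-level mode needs no extra contract
    # Assumption: pipelines opt in via a class attribute that defaults to False.
    if not getattr(pipeline_cls, "supports_step_execution", False):
        raise StepExecutionUnsupportedError(
            f"{pipeline_cls.__name__} does not implement the step-execution contract"
        )


class SupportedPipeline:
    """Toy pipeline class that declares support."""

    supports_step_execution = True


class LegacyPipeline:
    """Toy pipeline class that never opted in."""
```

Checking once at load time means a misconfigured deployment surfaces the problem before any request is accepted.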
Execution Contract
Step mode is driven by four pipeline methods plus the shared mutable request state object:
- prepare_encode(state): one-time request preparation.
- denoise_step(state): compute the noise prediction for the current step.
- step_scheduler(state, noise_pred): mutate latents and advance step state.
- post_decode(state): decode the final output after denoising is complete.
The state lives in vllm_omni/diffusion/worker/utils.py as DiffusionRequestState. Store request-scoped tensors there, or use state.extra for model-specific fields that do not justify extending the core dataclass.
The worker-side step loop lives in vllm_omni/diffusion/worker/diffusion_model_runner.py:
- prepare_encode() runs once for a new request.
- denoise_step() runs every scheduler tick.
- step_scheduler() mutates state.latents and advances state.step_index.
- post_decode() runs exactly once after state.denoise_completed becomes true.
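The call order described above can be sketched for a single request as follows. This is a simplified illustration, not the runner's code: `ToyState` stands in for DiffusionRequestState and models only the fields named in this guide, `ToyPipeline` and `run_request` are hypothetical, and in-place mutation by `prepare_encode()` is an assumption. The real runner advances one tick per scheduler iteration rather than looping to completion.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyState:
    """Stand-in for DiffusionRequestState; the field set is an assumption."""

    num_steps: int
    latents: float = 0.0
    step_index: int = 0
    extra: dict[str, Any] = field(default_factory=dict)

    @property
    def denoise_completed(self) -> bool:
        return self.step_index >= self.num_steps


class ToyPipeline:
    """Hypothetical pipeline that records the order of contract calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def prepare_encode(self, state: ToyState) -> None:
        self.calls.append("prepare_encode")
        state.latents = 1.0  # fake latent init

    def denoise_step(self, state: ToyState) -> float:
        self.calls.append("denoise_step")
        return 0.5 * state.latents  # fake noise prediction

    def step_scheduler(self, state: ToyState, noise_pred: float) -> None:
        self.calls.append("step_scheduler")
        state.latents -= noise_pred  # mutate latents
        state.step_index += 1  # the only place the step advances

    def post_decode(self, state: ToyState) -> float:
        self.calls.append("post_decode")
        return state.latents  # fake decode


def run_request(pipeline: ToyPipeline, state: ToyState) -> float:
    """Drive one request through the four contract methods."""
    pipeline.prepare_encode(state)  # once per request
    while not state.denoise_completed:
        noise_pred = pipeline.denoise_step(state)  # every tick
        pipeline.step_scheduler(state, noise_pred)  # every tick
    return pipeline.post_decode(state)  # exactly once, at the end
```

Calling `run_request(ToyPipeline(), ToyState(num_steps=3))` invokes `prepare_encode` once, then three `denoise_step`/`step_scheduler` pairs, then `post_decode` once.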
Recommended Split
When converting an existing request-level forward() pipeline, keep the split strict and mechanical:
| Request-level phase | Stepwise method | What belongs there |
|---|---|---|
| Input validation, prompt encoding, latent init, timestep prep, per-request scheduler creation | prepare_encode() | Anything that should happen once per request |
| Transformer forward / noise prediction | denoise_step() | Pure denoise computation for the current timestep |
| scheduler.step(...) and step_index += 1 | step_scheduler() | Only latent/state mutation for one step |
| VAE decode / postprocess | post_decode() | Final decode only |
Keep the stepwise path reusing the same helpers as the request-level path whenever possible. Reimplementing the denoise loop from scratch is the easiest way to introduce behavioral drift.
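One way to keep the two paths aligned is to have `forward()` and the four stepwise methods call the same private helpers. The following is a toy sketch of that layout under stated assumptions: `ToySplitPipeline`, `ToyState`, and every helper name are hypothetical, and the arithmetic is a placeholder for real prompt encoding, transformer, scheduler, and VAE calls.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyState:
    """Stand-in for DiffusionRequestState; the field set is an assumption."""

    prompt: str
    num_steps: int
    latents: list[float] = field(default_factory=list)
    timesteps: list[float] = field(default_factory=list)
    step_index: int = 0
    extra: dict[str, Any] = field(default_factory=dict)


class ToySplitPipeline:
    # --- helpers shared by both paths (all hypothetical) ---
    def _prepare_context(self, prompt: str, num_steps: int):
        cond = float(len(prompt))  # fake prompt embedding
        latents = [1.0, 2.0]  # fake latent init
        timesteps = [1.0 - i / num_steps for i in range(num_steps)]
        return cond, latents, timesteps

    def _predict_noise(self, latents, t, cond):
        return [x * t + 0.01 * cond for x in latents]

    def _scheduler_update(self, latents, noise, num_steps):
        return [x - n / num_steps for x, n in zip(latents, noise)]

    def _decode(self, latents):
        return [round(x, 6) for x in latents]

    # --- request-level path ---
    def forward(self, prompt: str, num_steps: int):
        cond, latents, timesteps = self._prepare_context(prompt, num_steps)
        for t in timesteps:
            noise = self._predict_noise(latents, t, cond)
            latents = self._scheduler_update(latents, noise, num_steps)
        return self._decode(latents)

    # --- stepwise path: same helpers, one phase per method ---
    def prepare_encode(self, state: ToyState) -> None:
        cond, state.latents, state.timesteps = self._prepare_context(
            state.prompt, state.num_steps
        )
        state.extra["cond"] = cond  # model-specific field kept in state.extra

    def denoise_step(self, state: ToyState):
        t = state.timesteps[state.step_index]
        return self._predict_noise(state.latents, t, state.extra["cond"])

    def step_scheduler(self, state: ToyState, noise_pred) -> None:
        state.latents = self._scheduler_update(
            state.latents, noise_pred, state.num_steps
        )
        state.step_index += 1

    def post_decode(self, state: ToyState):
        return self._decode(state.latents)
```

Because neither path contains logic of its own beyond sequencing, a later change to a helper affects both paths equally, which is what keeps them from drifting apart.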
Qwen-Image Reference
pipeline_qwen_image.py is the reference implementation and is split correctly for the current contract:
- prepare_encode() reuses _prepare_generation_context() so prompt encoding, latent init, timestep creation, CFG setup, and shape bookkeeping stay aligned with forward().
- prepare_encode() deep-copies self.scheduler after prepare_timesteps() so request-specific scheduler state is isolated.
- denoise_step() reuses _build_denoise_kwargs() plus predict_noise_maybe_with_cfg(), so sequential CFG, CFG-parallel, and non-CFG behavior stay identical to the request-level path.
- step_scheduler() only calls scheduler_step_maybe_with_cfg(..., per_request_scheduler=state.scheduler) and increments state.step_index.
- post_decode() reuses _decode_latents(), so the final image decode matches the normal forward() path.
That decomposition is the target pattern for future models.
Rules For New Pipelines
- Do not keep request-scoped scheduler state on self.scheduler. Copy it into state.scheduler during prepare_encode().
- Do not mutate state.step_index inside denoise_step(). Only step_scheduler() should advance the step.
- Do not decode partial outputs in denoise_step() or step_scheduler().
- If the request-level pipeline has condition latents, masks, or edit-specific tensors, store them in state or state.extra, not in global pipeline attributes.
- Preserve CFG behavior by sharing the same helper path used by forward().
- Keep post_decode() equivalent to the tail of forward().
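The first rule matters as soon as two requests interleave. A minimal sketch of the copy-into-state pattern, assuming a scheduler that keeps an internal step counter: `ToyScheduler`, `ToyState`, and `ToyPipeline` are hypothetical stand-ins, and only `copy.deepcopy` is a real library call.

```python
import copy
from dataclasses import dataclass
from typing import Optional


class ToyScheduler:
    """Hypothetical scheduler that carries per-run internal state."""

    def __init__(self) -> None:
        self.timesteps: list[int] = []
        self.calls = 0  # internal progress counter

    def set_timesteps(self, num_steps: int) -> None:
        self.timesteps = list(range(num_steps, 0, -1))
        self.calls = 0

    def step(self, latents: float, noise_pred: float) -> float:
        t = self.timesteps[self.calls]
        self.calls += 1
        return latents - noise_pred / t


@dataclass
class ToyState:
    """Stand-in for DiffusionRequestState; the field set is an assumption."""

    num_steps: int
    latents: float = 1.0
    step_index: int = 0
    scheduler: Optional[ToyScheduler] = None


class ToyPipeline:
    def __init__(self) -> None:
        self.scheduler = ToyScheduler()  # shared template, never stepped

    def prepare_encode(self, state: ToyState) -> None:
        self.scheduler.set_timesteps(state.num_steps)
        # Isolate request-specific scheduler state from other requests.
        state.scheduler = copy.deepcopy(self.scheduler)

    def denoise_step(self, state: ToyState) -> float:
        return 0.1 * state.latents  # fake noise prediction, no state mutation

    def step_scheduler(self, state: ToyState, noise_pred: float) -> None:
        state.latents = state.scheduler.step(state.latents, noise_pred)
        state.step_index += 1
```

If two states with different step counts are prepared back to back and then stepped in alternation, each copy keeps its own timesteps and counter, while the shared `self.scheduler` is never advanced.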
Validation Checklist
Before marking a pipeline as supports_step_execution = True, verify:
- Stepwise output matches request-level output for the same seed and sampling params.
- Per-request scheduler state is isolated across concurrent requests.
- Abort during denoise does not leak cached state.
- step_index reported by RunnerOutput matches the scheduler progress.
- CFG-parallel and non-CFG paths both work if the request-level pipeline supports them.
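The first check, and a simplified version of the step_index check, can be automated with a small parity harness. The sketch below is illustrative only: `ToyPipeline`, `ToyState`, `run_stepwise`, and `check_parity` are hypothetical, and the toy uses Python's `random` module in place of a seeded tensor generator. With real tensors, a tolerance-based comparison such as `torch.testing.assert_close` would likely replace the exact equality used here.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyState:
    """Stand-in for DiffusionRequestState; the field set is an assumption."""

    seed: int
    num_steps: int
    latents: list[float] = field(default_factory=list)
    step_index: int = 0

    @property
    def denoise_completed(self) -> bool:
        return self.step_index >= self.num_steps


class ToyPipeline:
    """Hypothetical pipeline exposing both execution paths."""

    def _init_latents(self, seed: int) -> list[float]:
        rng = random.Random(seed)  # seeded, so both paths start identically
        return [rng.gauss(0.0, 1.0) for _ in range(4)]

    def _predict(self, latents, i):
        return [x / (i + 2) for x in latents]

    def _update(self, latents, noise):
        return [x - n for x, n in zip(latents, noise)]

    def forward(self, seed: int, num_steps: int) -> list[float]:
        latents = self._init_latents(seed)
        for i in range(num_steps):
            latents = self._update(latents, self._predict(latents, i))
        return latents

    def prepare_encode(self, state: ToyState) -> None:
        state.latents = self._init_latents(state.seed)

    def denoise_step(self, state: ToyState):
        return self._predict(state.latents, state.step_index)

    def step_scheduler(self, state: ToyState, noise_pred) -> None:
        state.latents = self._update(state.latents, noise_pred)
        state.step_index += 1

    def post_decode(self, state: ToyState) -> list[float]:
        return state.latents


def run_stepwise(pipeline, state):
    """Run the stepwise path and record step_index after every tick."""
    pipeline.prepare_encode(state)
    reported = []
    while not state.denoise_completed:
        pipeline.step_scheduler(state, pipeline.denoise_step(state))
        reported.append(state.step_index)
    return pipeline.post_decode(state), reported


def check_parity(pipeline, seed: int, num_steps: int) -> None:
    expected = pipeline.forward(seed, num_steps)
    actual, reported = run_stepwise(pipeline, ToyState(seed, num_steps))
    assert actual == expected, "stepwise output drifted from forward()"
    assert reported == list(range(1, num_steps + 1)), "step_index out of sync"
```

Running `check_parity` across several seeds and step counts gives a quick regression signal before the heavier concurrency and abort checks.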
Related Files
- Contract: vllm_omni/diffusion/models/interface.py
- State: vllm_omni/diffusion/worker/utils.py
- Runner loop: vllm_omni/diffusion/worker/diffusion_model_runner.py
- Scheduler transport: vllm_omni/diffusion/sched/interface.py
- Reference pipeline: vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py