vllm_omni.diffusion.models.dreamzero.pipeline_dreamzero ¶

DreamZero pipeline for vllm-omni.

Entry point: DiffusionEngine.step() calls pipeline.forward(req).

MAX_DREAMZERO_SESSIONS `module-attribute` ¶

MAX_DREAMZERO_SESSIONS = 64

logger `module-attribute` ¶

logger = logging.getLogger(__name__)

DreamZeroPipeline ¶

Bases: Module, CFGParallelMixin

DreamZero world model pipeline.

Multi-output: predict_noise() returns a (video_pred, action_pred) pair. Under classifier-free guidance (CFG), the video prediction gets standard CFG, while the action prediction takes the positive (conditional) branch only.

KV state is managed by the AR-Diffusion engine: the runner sets self._ar_diffusion_kv_state before forward(), and the pipeline routes all KV access (get / update / commit / reset) through that pool-backed state. The interface is purely duck-typed, so the pipeline does not import the engine.
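The duck-typed contract described above can be illustrated with a structural type. This is a minimal sketch, not the engine's interface: only the four operation names (get, update, commit, reset) come from the docstring, while the signatures, the layer_idx argument, and the KVStateLike and InMemoryKVState classes are hypothetical.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class KVStateLike(Protocol):
    """Hypothetical shape of the object assigned to _ar_diffusion_kv_state."""

    def get(self, layer_idx: int) -> Any: ...
    def update(self, layer_idx: int, kv: Any) -> None: ...
    def commit(self) -> None: ...
    def reset(self) -> None: ...


class InMemoryKVState:
    """Toy stand-in. It does not inherit from KVStateLike; it only matches by shape."""

    def __init__(self) -> None:
        self._pending: dict[int, Any] = {}
        self._committed: dict[int, Any] = {}

    def get(self, layer_idx: int) -> Any:
        # Prefer the uncommitted value, fall back to the committed one.
        return self._pending.get(layer_idx, self._committed.get(layer_idx))

    def update(self, layer_idx: int, kv: Any) -> None:
        self._pending[layer_idx] = kv

    def commit(self) -> None:
        self._committed.update(self._pending)
        self._pending.clear()

    def reset(self) -> None:
        self._pending.clear()
        self._committed.clear()


kv_state = InMemoryKVState()
kv_state.update(0, "kv-for-layer-0")
kv_state.commit()
# The structural check passes without any import of an engine class.
is_compatible = isinstance(kv_state, KVStateLike)
```

Because the check is structural, any object exposing these four methods would satisfy it, which is the property that lets the pipeline avoid an engine import.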

action_horizon `instance-attribute` ¶

action_horizon: int = ah_config['action_horizon']

action_norm_stats `instance-attribute` ¶

action_norm_stats = self._parse_action_norm_stats(metadata)

cfg_scale `instance-attribute` ¶

cfg_scale: float = model_config.get(
    "cfg_scale", DEFAULT_CFG_SCALE
)

decouple_inference_noise `instance-attribute` ¶

decouple_inference_noise: bool = ah_config[
    "decouple_inference_noise"
]

default_robot_embodiment `instance-attribute` ¶

default_robot_embodiment = model_config.get(
    "default_robot_embodiment", DEFAULT_EMBODIMENT
)

embodiment_name_to_id `instance-attribute` ¶

embodiment_name_to_id: dict[str, int] = model_config.get(
    "embodiment_name_to_id", DEFAULT_EMBODIMENT_NAME_TO_ID
)

image_encoder `instance-attribute` ¶

image_encoder = DreamZeroImageEncoder()

max_action_dim `instance-attribute` ¶

max_action_dim: int = ah_config['max_action_dim']

max_state_dim `instance-attribute` ¶

max_state_dim: int = ah_config['max_state_dim']

negative_prompt `instance-attribute` ¶

negative_prompt: str = model_config.get(
    "negative_prompt", DEFAULT_NEGATIVE_PROMPT
)

num_frame_per_block `instance-attribute` ¶

num_frame_per_block: int = ah_config['num_frame_per_block']

num_frames `instance-attribute` ¶

num_frames: int = ah_config['num_frames']

num_inference_steps `instance-attribute` ¶

num_inference_steps: int = model_config.get(
    "num_inference_steps", DEFAULT_NUM_INFERENCE_STEPS
)

od_config `instance-attribute` ¶

od_config = od_config

relative_action `instance-attribute` ¶

relative_action: bool = model_config.get(
    "relative_action", True
)

relative_action_dim `instance-attribute` ¶

relative_action_dim: int = model_config.get(
    "relative_action_dim", 7
)

scheduler `instance-attribute` ¶

scheduler = FlowUniPCMultistepScheduler(
    num_train_timesteps=1000,
    shift=1,
    use_dynamic_shifting=False,
)

seed `instance-attribute` ¶

seed: int = model_config.get('seed', DEFAULT_SEED)

sigma_shift `instance-attribute` ¶

sigma_shift: float = model_config.get(
    "sigma_shift", DEFAULT_SIGMA_SHIFT
)

state `instance-attribute` ¶

state = self._get_or_create_state('default')

state_norm_stats `instance-attribute` ¶

state_norm_stats = self._parse_state_norm_stats(metadata)

text_encoder `instance-attribute` ¶

text_encoder = UMT5EncoderModel(umt5_config)

tokenizer `instance-attribute` ¶

tokenizer = AutoTokenizer.from_pretrained(tokenizer_source)

transformer `instance-attribute` ¶

transformer = CausalWanModel(**transformer_kwargs)

vae `instance-attribute` ¶

vae = DistributedAutoencoderKLWan.from_pretrained(
    vae_source, torch_dtype=torch.float32
)

video_inference_final_noise `instance-attribute` ¶

video_inference_final_noise: float = ah_config[
    "video_inference_final_noise"
]

weights_sources `property` ¶

weights_sources

ComponentSource list for DiffusersPipelineLoader.

clear_accumulated_video_latents ¶

clear_accumulated_video_latents(
    session_id: str | None = None,
) -> None

Clear accumulated video latents for session_id without resetting KV state.

combine_cfg_noise ¶

combine_cfg_noise(
    positive_noise_pred: Tensor | tuple[Tensor, ...],
    negative_noise_pred: Tensor | tuple[Tensor, ...],
    true_cfg_scale: float,
    cfg_normalize: bool = False,
) -> Tensor | tuple[Tensor, ...]

Video: standard CFG. Action: positive (conditional) branch only, with no unconditional blending.
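The asymmetric rule above can be sketched in a few lines. This sketch assumes the usual CFG formula, negative + scale * (positive - negative), for the video branch; it uses flat lists of floats in place of tensors, ignores cfg_normalize, and the function name combine_cfg_noise_sketch is hypothetical.

```python
def combine_cfg_noise_sketch(positive, negative, true_cfg_scale):
    """Combine (video_pred, action_pred) pairs from the two CFG branches.

    positive / negative: tuples of (video, action), each a flat list of floats
    standing in for a tensor.
    """
    pos_video, pos_action = positive
    neg_video, _unused_neg_action = negative

    # Video: standard CFG blend of unconditional and conditional predictions.
    video = [n + true_cfg_scale * (p - n) for p, n in zip(pos_video, neg_video)]

    # Action: the conditional prediction is passed through unchanged.
    return video, pos_action


video_out, action_out = combine_cfg_noise_sketch(
    positive=([1.0, 2.0], [0.5]),
    negative=([0.0, 1.0], [9.0]),
    true_cfg_scale=2.0,
)
```

Note that the negative action prediction never influences the result, so its value (9.0 here) is irrelevant.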

decode_accumulated_video_latents ¶

decode_accumulated_video_latents(
    session_id: str | None = None,
) -> Tensor

Decode all AR-chunk latents accumulated for session_id.

decode_video_latents ¶

decode_video_latents(video_latents: Tensor) -> Tensor

Decode normalized VAE latents into RGB video tensors.

diffuse ¶

diffuse(
    video_latents: Tensor,
    action_latents: Tensor,
    timesteps_video: Tensor,
    timesteps_action: Tensor,
    prompt_embeds: Tensor,
    negative_prompt_embeds: Tensor | None,
    video_action_scheduler: VideoActionScheduler,
    do_true_cfg: bool,
    state: DreamZeroState,
    **kwargs,
) -> tuple[Tensor, Tensor]

Denoising loop with CFG parallel support.

For each timestep:

1. Build positive_kwargs / negative_kwargs
2. predict_noise_maybe_with_cfg() -> (video_pred, action_pred)
3. scheduler_step_maybe_with_cfg() -> VideoActionScheduler
4. _synchronize_cfg_parallel_step_output()
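The per-timestep sequence listed above can be sketched as a plain loop. This is an illustrative skeleton only: the three callables stand in for predict_noise_maybe_with_cfg(), scheduler_step_maybe_with_cfg(), and _synchronize_cfg_parallel_step_output(), whose real signatures are not shown in this reference, and floats stand in for tensors.

```python
from typing import Callable, Optional

Pair = tuple[float, float]  # (video, action) stand-ins for tensors


def diffuse_sketch(
    latents: Pair,
    timesteps: list[float],
    predict: Callable[[Pair, float, dict, Optional[dict]], Pair],
    step: Callable[[Pair, float, Pair], Pair],
    synchronize: Callable[[Pair], Pair],
    do_true_cfg: bool,
) -> Pair:
    for t in timesteps:
        # 1. Build the per-branch keyword arguments (contents are hypothetical).
        positive_kwargs = {"branch": "cond"}
        negative_kwargs = {"branch": "uncond"} if do_true_cfg else None
        # 2. Predict noise for both outputs at once.
        noise_pred = predict(latents, t, positive_kwargs, negative_kwargs)
        # 3. Advance video and action latents with one scheduler call.
        latents = step(noise_pred, t, latents)
        # 4. Make the step output consistent across CFG-parallel ranks.
        latents = synchronize(latents)
    return latents


# Toy callables: the "model" predicts the latents themselves as noise,
# and the "scheduler" removes half of the predicted noise per step.
final_latents = diffuse_sketch(
    latents=(8.0, 4.0),
    timesteps=[0.9, 0.6, 0.3],
    predict=lambda lat, t, pos, neg: lat,
    step=lambda noise, t, lat: (lat[0] - 0.5 * noise[0], lat[1] - 0.5 * noise[1]),
    synchronize=lambda lat: lat,
    do_true_cfg=True,
)
```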

forward ¶

forward(
    req: DiffusionRequestBatch, **kwargs
) -> DiffusionOutput

Full inference step. Called by DiffusionEngine.step().

load_weights ¶

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load checkpoint weights with key remapping.
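A minimal sketch of what key remapping during loading might look like, assuming a simple prefix table. The prefixes in KEY_PREFIX_REMAP, the params dictionary, and the helper names are hypothetical; the actual remapping rules are not given in this reference. The return value mirrors the documented signature: the set of parameter names that were loaded.

```python
from collections.abc import Iterable

# Hypothetical prefix table; the real mapping is not shown in this reference.
KEY_PREFIX_REMAP = {"model.dit.": "transformer.", "model.vae.": "vae."}


def remap_key(name: str) -> str:
    """Rewrite a checkpoint key to the pipeline's parameter name."""
    for old, new in KEY_PREFIX_REMAP.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name


def load_weights_sketch(
    weights: Iterable[tuple[str, object]],
    params: dict[str, object],
) -> set[str]:
    loaded: set[str] = set()
    for name, tensor in weights:
        target = remap_key(name)
        if target not in params:
            continue  # skip checkpoint entries with no matching parameter
        params[target] = tensor
        loaded.add(target)
    return loaded


params = {"transformer.w": None, "vae.b": None}
loaded = load_weights_sketch(
    [("model.dit.w", 1), ("model.vae.b", 2), ("unused.x", 3)],
    params,
)
```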

predict_noise ¶

predict_noise(**kwargs) -> tuple[Tensor, Tensor]

Call CausalWanModel, return (video_pred, action_pred).

setup_compile ¶

setup_compile() -> None

Compile DreamZero encoders, VAE, and per-block DiT for inference.

Paper D.2 uses mode=reduce-overhead, fullgraph=True, dynamic=False for the text encoder, image encoder, VAE, and DiT. VAE decode uses a tensor feat_cache patch and compiles decoder.forward rather than _decode, which contains a Python frame loop. Incremental VAE encode (_vae_encode_encoder_chunk) stays eager because Wan's feat_cache mutation is incompatible with CUDA graph capture. DiT blocks are compiled per block with fullgraph=True.

warmup_compile ¶

warmup_compile() -> None

Warm up compiled text/image/VAE paths before timed inference.

VideoActionScheduler ¶

Wraps the video and action schedulers behind a single .step() interface.

action_scheduler `instance-attribute` ¶

action_scheduler = action_scheduler

video_scheduler `instance-attribute` ¶

video_scheduler = video_scheduler

step ¶

step(
    noise_pred,
    t,
    latents,
    return_dict=False,
    generator=None,
)
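A minimal sketch of how a wrapper with this step() signature could fan out to the two schedulers. It assumes that noise_pred, t, and latents each arrive as a (video, action) pair and that the result is returned as a one-element tuple in the style of return_dict=False; neither assumption is confirmed by this reference. ToyScheduler and VideoActionSchedulerSketch are hypothetical names, and floats stand in for tensors.

```python
class ToyScheduler:
    """Hypothetical stand-in for a real scheduler; returns a one-element tuple."""

    def __init__(self, rate: float) -> None:
        self.rate = rate

    def step(self, noise_pred, t, latents, return_dict=False, generator=None):
        return (latents - self.rate * noise_pred,)


class VideoActionSchedulerSketch:
    def __init__(self, video_scheduler, action_scheduler) -> None:
        self.video_scheduler = video_scheduler
        self.action_scheduler = action_scheduler

    def step(self, noise_pred, t, latents, return_dict=False, generator=None):
        # Assumption: every argument is a (video, action) pair.
        video_noise, action_noise = noise_pred
        t_video, t_action = t
        video_latents, action_latents = latents

        new_video = self.video_scheduler.step(
            video_noise, t_video, video_latents,
            return_dict=False, generator=generator,
        )[0]
        new_action = self.action_scheduler.step(
            action_noise, t_action, action_latents,
            return_dict=False, generator=generator,
        )[0]
        return ((new_video, new_action),)


wrapper = VideoActionSchedulerSketch(ToyScheduler(0.5), ToyScheduler(0.25))
stepped = wrapper.step(noise_pred=(2.0, 8.0), t=(0.9, 0.9), latents=(4.0, 4.0))
```

Keeping both schedulers behind one step() call lets a denoising loop written for a single output advance the video and action latents together.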
