vllm_omni.diffusion.models.dreamzero.transform.base

Base transform interface for DreamZero robot policy serving.

Transforms handle ONLY dataset-specific concerns:

Observation key mapping
Multi-view stitching (embodiment-specific layout)
Language template wrapping (embodiment-specific)
Raw state extraction (dataset-specific keys)
Output action slicing (to actual action_dim)
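The stitching layout is embodiment-specific and is not spelled out here. The sketch below shows one plausible layout: views placed side by side. It assumes each camera view is a video array shaped (T, H, W, C) and that all views share frame count and height. `stitch_side_by_side` is a hypothetical helper, not part of the module.

```python
import numpy as np


def stitch_side_by_side(views: list[np.ndarray]) -> np.ndarray:
    """Concatenate same-height camera views along the width axis.

    Assumes each view is a video array shaped (T, H, W, C).
    """
    heights = {v.shape[1] for v in views}
    frames = {v.shape[0] for v in views}
    if len(heights) != 1 or len(frames) != 1:
        raise ValueError("views must share frame count and height")
    # Width is axis 2 for (T, H, W, C); the result is (T, H, sum(W_i), C).
    return np.concatenate(views, axis=2)


left = np.zeros((4, 64, 80, 3), dtype=np.uint8)
wrist = np.full((4, 64, 48, 3), 255, dtype=np.uint8)
stitched = stitch_side_by_side([left, wrist])
print(stitched.shape)  # (4, 64, 128, 3)
```

A real embodiment might use a grid or a resized layout instead. Whatever layout is chosen belongs in that embodiment's `_stitch_views()`.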

Model-specific concerns belong in the pipeline:

Tokenization (pipeline owns tokenizer)
State padding (pipeline knows MAX_STATE_DIM)
Negative prompt (pipeline owns the string)
Noise generation, encoding, decoding
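State padding happens on the pipeline side because only the pipeline knows MAX_STATE_DIM. The sketch below assumes right zero-padding of the last axis. Both the value `64` and the helper name `pad_state` are hypothetical, and the real pipeline may pad differently.

```python
import numpy as np

MAX_STATE_DIM = 64  # hypothetical value; the real constant is defined by the pipeline


def pad_state(raw_state, max_dim: int = MAX_STATE_DIM) -> np.ndarray:
    """Zero-pad the last axis of a raw state array up to max_dim (assumed scheme)."""
    state = np.asarray(raw_state, dtype=np.float32)
    dim = state.shape[-1]
    if dim > max_dim:
        raise ValueError(f"state dim {dim} exceeds max_dim={max_dim}")
    # Pad nothing on leading axes; append (max_dim - dim) zeros on the last axis.
    pad_width = [(0, 0)] * (state.ndim - 1) + [(0, max_dim - dim)]
    return np.pad(state, pad_width)


padded = pad_state(np.ones(14))
print(padded.shape)  # (64,)
```

Keeping this out of the transforms means a transform only returns the raw, dataset-sized state, and swapping model checkpoints with a different MAX_STATE_DIM does not touch any transform.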

Flow:

raw obs (dataset format)
→ DreamZeroPipeline selects a transform by embodiment
→ transform_input() → unified dict (stitched video, templated prompt str, raw state)
→ pipeline: tokenize, pad, encode, denoise
→ transform_action_output()
→ ndarray (N, action_dim)
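The flow can be sketched as a small driver function. This is a hedged illustration, not DreamZeroPipeline itself. `run_policy`, `ToyTransform`, and `fake_model` are hypothetical, and the model call stands in for all of the pipeline-owned work (tokenize, pad, encode, denoise).

```python
from typing import Any, Callable

import numpy as np


def run_policy(
    obs: dict[str, Any],
    embodiment: str,
    transforms: dict[str, Any],
    model_fn: Callable[[dict[str, Any]], Any],
) -> np.ndarray:
    # 1. Select the dataset-specific transform for this embodiment.
    transform = transforms[embodiment]
    # 2. Dataset-specific preprocessing -> unified dict.
    unified = transform.transform_input(obs)
    # 3. Model-specific work (tokenize, pad, encode, denoise) happens here.
    raw_actions = model_fn(unified)
    # 4. Slice back to the embodiment's real action dimensions.
    return transform.transform_action_output(raw_actions)


class ToyTransform:
    """Hypothetical transform: passes the prompt and state through unchanged."""

    ACTION_DIM = 3

    def transform_input(self, obs: dict[str, Any]) -> dict[str, Any]:
        return {"prompt": obs["prompt"], "state": np.asarray(obs["state"], dtype=np.float32)}

    def transform_action_output(self, actions: Any) -> np.ndarray:
        return np.asarray(actions)[..., : self.ACTION_DIM]


def fake_model(unified: dict[str, Any]) -> np.ndarray:
    # Stand-in for the diffusion model: 4 action steps with 8 padded dims each.
    return np.zeros((4, 8), dtype=np.float32)


obs = {"prompt": "pick up the cup", "state": [0.1, 0.2]}
actions = run_policy(obs, "toy", {"toy": ToyTransform()}, fake_model)
print(actions.shape)  # (4, 3)
```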

TRANSFORMS `module-attribute`

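TRANSFORMS: dict[str, RobotPolicyTransform] = {}

Module-level registry mapping a transform name to its RobotPolicyTransform instance. It is written by register_transform() and read by get_transform().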

RobotPolicyTransform

Base class for dataset-specific observation transforms.

Subclasses MUST define:

IMAGE_KEY_MAP: dict (dataset obs keys → unified keys)
EMBODIMENT_NAME: str (embodiment identity; the pipeline maps it to a numeric ID)
ACTION_DIM: int (actual action dimensions, used for output slicing)

Subclasses MUST override:

_stitch_views(): multi-view → single stitched image
_language_template(): prompt → embodiment-aware template
_extract_raw_state(): obs → raw state ndarray
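One way to make the "MUST define / MUST override" contract fail loudly is sketched below. It assumes `abc` for the three hooks and `__init_subclass__` for the class attributes. The source does not say the real base class enforces either check, and the hook parameter names and all class names here are hypothetical. Note that this `__init_subclass__` check would also reject intermediate abstract subclasses that leave the attributes unset.

```python
from abc import ABC, abstractmethod
from typing import Any

import numpy as np


class RobotPolicyTransformSketch(ABC):
    """Hypothetical stand-in for the documented base-class contract."""

    # Annotations only: subclasses must assign real values.
    IMAGE_KEY_MAP: dict[str, str]
    EMBODIMENT_NAME: str
    ACTION_DIM: int

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # abc does not check plain attributes, so verify them at class-creation time.
        for attr in ("IMAGE_KEY_MAP", "EMBODIMENT_NAME", "ACTION_DIM"):
            if not hasattr(cls, attr):
                raise TypeError(f"{cls.__name__} must define {attr}")

    @abstractmethod
    def _stitch_views(self, views: dict[str, np.ndarray]) -> np.ndarray:
        """Combine the mapped camera views into one stitched image."""

    @abstractmethod
    def _language_template(self, prompt: str) -> str:
        """Wrap the raw instruction in an embodiment-aware template."""

    @abstractmethod
    def _extract_raw_state(self, obs: dict[str, Any]) -> np.ndarray:
        """Pull the unpadded state out of the observation."""


class ToyArmTransform(RobotPolicyTransformSketch):
    IMAGE_KEY_MAP = {"cam_high": "top"}
    EMBODIMENT_NAME = "toy_arm"
    ACTION_DIM = 7

    def _stitch_views(self, views):
        return views["top"]  # single camera: nothing to stitch

    def _language_template(self, prompt):
        return f"Task: {prompt}"

    def _extract_raw_state(self, obs):
        return np.asarray(obs["joint_positions"], dtype=np.float32)


class MissingHooks(RobotPolicyTransformSketch):
    IMAGE_KEY_MAP = {"cam_high": "top"}
    EMBODIMENT_NAME = "broken_arm"
    ACTION_DIM = 7
    # None of the three hooks are overridden, so instantiation fails.


ok = ToyArmTransform()
try:
    MissingHooks()
except TypeError as err:
    print("rejected:", type(err).__name__)  # rejected: TypeError
```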

ACTION_DIM `instance-attribute`

ACTION_DIM: int

EMBODIMENT_NAME `instance-attribute`

EMBODIMENT_NAME: str

IMAGE_KEY_MAP `instance-attribute`

IMAGE_KEY_MAP: dict[str, str]

transform_action_output

transform_action_output(actions: Any) -> ndarray

Convert the model's action output to an ndarray of shape (N, action_dim), slicing it down to this transform's ACTION_DIM.
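A minimal sketch of the slicing step follows. It assumes the model emits a 2-D (N, D) array whose leading ACTION_DIM channels are the real actions and whose trailing channels are padding. That channel layout is an assumption, and `slice_actions` is a hypothetical helper rather than the method's actual body.

```python
from typing import Any

import numpy as np


def slice_actions(actions: Any, action_dim: int) -> np.ndarray:
    """Return model actions as an (N, action_dim) float32 ndarray."""
    # Lists and ndarrays convert directly; a CUDA tensor (or one requiring
    # grad) would need .detach().cpu() before np.asarray can convert it.
    arr = np.asarray(actions, dtype=np.float32)
    if arr.ndim != 2:
        raise ValueError(f"expected (N, D) actions, got shape {arr.shape}")
    if arr.shape[1] < action_dim:
        raise ValueError(f"model emitted {arr.shape[1]} dims, need {action_dim}")
    # Keep the leading action_dim channels; trailing ones are assumed padding.
    return arr[:, :action_dim]


model_out = np.arange(16 * 32, dtype=np.float32).reshape(16, 32)
actions = slice_actions(model_out, action_dim=7)
print(actions.shape)  # (16, 7)
```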

transform_input

transform_input(obs: dict) -> dict

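Dataset-specific transform, applied in order: key map → stitch → template → state. Returns the unified dict (stitched video, templated prompt str, raw state) that the pipeline consumes.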
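The four steps can be illustrated with a self-contained toy class. Everything here is hypothetical: the dataset keys, the unified key names (`"video"`, `"prompt"`, `"state"`), the side-by-side layout, and the template wording are assumptions chosen to show the ordering, not DreamZero's real values.

```python
from typing import Any

import numpy as np


class ToyBimanualTransform:
    """Hypothetical two-camera transform illustrating the transform_input order."""

    IMAGE_KEY_MAP = {"observation.images.left": "left", "observation.images.right": "right"}
    EMBODIMENT_NAME = "toy_bimanual"
    ACTION_DIM = 14

    def _stitch_views(self, views: dict[str, np.ndarray]) -> np.ndarray:
        # Assumed side-by-side layout: concatenate along the width axis.
        return np.concatenate([views["left"], views["right"]], axis=-2)

    def _language_template(self, prompt: str) -> str:
        return f"A robot with two arms. Task: {prompt}"

    def _extract_raw_state(self, obs: dict[str, Any]) -> np.ndarray:
        return np.asarray(obs["observation.state"], dtype=np.float32)

    def transform_input(self, obs: dict[str, Any]) -> dict[str, Any]:
        # 1. Key map: rename dataset camera keys to unified view names.
        views = {unified: obs[raw] for raw, unified in self.IMAGE_KEY_MAP.items()}
        # 2. Stitch: merge the views into one video.
        video = self._stitch_views(views)
        # 3. Template: wrap the instruction; tokenization stays in the pipeline.
        prompt = self._language_template(obs["prompt"])
        # 4. State: raw and unpadded; the pipeline pads it to MAX_STATE_DIM.
        state = self._extract_raw_state(obs)
        return {"video": video, "prompt": prompt, "state": state}


obs = {
    "observation.images.left": np.zeros((2, 48, 64, 3), dtype=np.uint8),
    "observation.images.right": np.zeros((2, 48, 64, 3), dtype=np.uint8),
    "observation.state": [0.0] * 14,
    "prompt": "fold the towel",
}
unified = ToyBimanualTransform().transform_input(obs)
print(unified["video"].shape)  # (2, 48, 128, 3)
```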

get_transform

get_transform(name: str) -> RobotPolicyTransform

Return the transform registered in TRANSFORMS under name.

register_transform

register_transform(
    name: str, transform: RobotPolicyTransform
) -> None

Store transform in TRANSFORMS under name so that get_transform(name) can retrieve it.
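A minimal sketch of the registry pair, assuming a plain dict keyed by name that holds transform instances (not classes). Refusing duplicate names and listing known names in the lookup error are design choices of this sketch and may differ from the real module. `ToyTransform` is hypothetical.

```python
from typing import Any

# Stand-in for the module-level registry; values would be RobotPolicyTransform instances.
TRANSFORMS: dict[str, Any] = {}


def register_transform(name: str, transform: Any) -> None:
    # Sketch-only policy: refuse to silently overwrite an existing name.
    if name in TRANSFORMS:
        raise ValueError(f"transform {name!r} is already registered")
    TRANSFORMS[name] = transform


def get_transform(name: str) -> Any:
    try:
        return TRANSFORMS[name]
    except KeyError:
        known = ", ".join(sorted(TRANSFORMS)) or "<none>"
        raise KeyError(f"unknown transform {name!r}; registered: {known}") from None


class ToyTransform:
    EMBODIMENT_NAME = "toy_arm"
    ACTION_DIM = 7


toy = ToyTransform()
register_transform("toy_arm", toy)
print(get_transform("toy_arm").ACTION_DIM)  # 7
```

Storing instances rather than classes matches the documented signature and lets one transform object be shared across requests without re-instantiation.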