vllm_omni.diffusion.models.cosmos3.pipeline_cosmos3 ¶

Cosmos3 text/image/video/sound/action pipeline for vllm-omni.

One pipeline class serves every mode in the Cosmos3 family. Output modality is selected mainly by prompt["modalities"]:

"image" selects T2I (text-to-image) and forces a single visual frame.
"video" or omitted modalities select video generation.
"audio" is accepted for compatibility but does not request sound by itself; sound is enabled with generate_sound or sound_gen.

Video generation is further specialized by inputs and extra args:

no image/video input: T2V (text-to-video).
multi_modal_data["image"]: I2V (image-to-video).
multi_modal_data["video"] with no action/transfer mode: V2V (video-to-video).
transfer hints (edge, blur, depth, seg, or wsm): control transfer video generation.
action_mode: action-capable video generation. RoboLab/OpenPI observation payloads in extra_args["robot_obs"] or extra_args["observation"] bypass normal video output and return action-only custom output.

Generated sound is video-only, cannot be combined with action or transfer, and is produced from sound latents rather than from multi_modal_data["audio"].
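The selection rules above can be sketched as a small resolver. This is an illustrative reading of the overview, not the pipeline's actual code: the function name, the returned labels, the precedence between checks, and the assumption that transfer hints and action_mode arrive as keys of extra_args are all hypothetical.

```python
from __future__ import annotations

from typing import Any

# Hint names come from the overview; where they are stored is an assumption.
TRANSFER_HINTS = ("edge", "blur", "depth", "seg", "wsm")


def resolve_cosmos3_mode(
    prompt: dict[str, Any],
    extra_args: dict[str, Any] | None = None,
) -> str:
    """Return a label for the generation mode (hypothetical helper)."""
    extra_args = extra_args or {}
    modalities = prompt.get("modalities") or []
    mm_data = prompt.get("multi_modal_data") or {}

    # "image" in modalities selects T2I regardless of other inputs.
    if "image" in modalities:
        return "t2i"
    # Robot observation payloads bypass normal video output.
    if "robot_obs" in extra_args or "observation" in extra_args:
        return "action_only"
    if extra_args.get("action_mode"):
        return "action"
    if any(hint in extra_args for hint in TRANSFER_HINTS):
        return "transfer"
    if "video" in mm_data:
        return "v2v"
    if "image" in mm_data:
        return "i2v"
    # "video", "audio", or omitted modalities with no inputs: plain T2V.
    return "t2v"
```

Note that "audio" in modalities falls through to a video mode here, matching the statement that it does not request sound by itself.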

COSMOS3_DEFAULT_CONDITION_PIXEL_FRAMES `module-attribute` ¶

COSMOS3_DEFAULT_CONDITION_PIXEL_FRAMES = (
    max(COSMOS3_DEFAULT_CONDITION_FRAME_INDEXES_VISION)
    * COSMOS3_VAE_TEMPORAL_COMPRESSION
    + 1
)
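
As a worked example of the formula above, the sketch below plugs in stand-in values. Both numbers are hypothetical and chosen only to show the arithmetic; the real values of COSMOS3_DEFAULT_CONDITION_FRAME_INDEXES_VISION and COSMOS3_VAE_TEMPORAL_COMPRESSION are defined in the module and may differ. The `+ 1` is consistent with a causal video VAE that encodes the first pixel frame on its own, which is assumed here rather than confirmed by this page.

```python
# Hypothetical stand-ins for the module constants.
condition_frame_indexes = (0, 1)  # latent frame indexes kept as conditioning
temporal_compression = 4          # pixel frames per latent frame after the first


def condition_pixel_frames(latent_indexes, compression):
    """Pixel frames needed to cover latent frames up to the largest index."""
    return max(latent_indexes) * compression + 1


pixel_frames = condition_pixel_frames(condition_frame_indexes, temporal_compression)
print(pixel_frames)  # 5
```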

COSMOS3_DEFAULT_MAX_SEQUENCE_LENGTH `module-attribute` ¶

COSMOS3_DEFAULT_MAX_SEQUENCE_LENGTH = 4096

COSMOS3_DURATION_TEMPLATE `module-attribute` ¶

COSMOS3_DURATION_TEMPLATE = "The video is {duration:.1f} seconds long and is of {fps:.0f} FPS."

COSMOS3_IMAGE_RESOLUTION_TEMPLATE `module-attribute` ¶

COSMOS3_IMAGE_RESOLUTION_TEMPLATE = (
    "This image is of {height}x{width} resolution."
)

COSMOS3_INVERSE_DURATION_TEMPLATE `module-attribute` ¶

COSMOS3_INVERSE_DURATION_TEMPLATE = "The video is not {duration:.1f} seconds long and is not of {fps:.0f} FPS."

COSMOS3_INVERSE_IMAGE_RESOLUTION_TEMPLATE `module-attribute` ¶

COSMOS3_INVERSE_IMAGE_RESOLUTION_TEMPLATE = (
    "This image is not of {height}x{width} resolution."
)

COSMOS3_INVERSE_RESOLUTION_TEMPLATE `module-attribute` ¶

COSMOS3_INVERSE_RESOLUTION_TEMPLATE = (
    "This video is not of {height}x{width} resolution."
)

COSMOS3_RESOLUTION_TEMPLATE `module-attribute` ¶

COSMOS3_RESOLUTION_TEMPLATE = (
    "This video is of {height}x{width} resolution."
)
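
The duration and resolution templates are ordinary str.format strings. The sketch below copies two of them and fills them in; the helper name, the choice of 24 FPS, computing duration as num_frames / fps, and joining the sentences with a space are assumptions for illustration, not confirmed pipeline behavior.

```python
# Copied from the constants documented on this page.
DURATION_TEMPLATE = "The video is {duration:.1f} seconds long and is of {fps:.0f} FPS."
RESOLUTION_TEMPLATE = "This video is of {height}x{width} resolution."


def build_video_suffix(num_frames, fps, height, width):
    """Format the metadata sentences for a video prompt (hypothetical helper)."""
    duration = num_frames / fps  # assumed definition of duration
    return " ".join(
        [
            DURATION_TEMPLATE.format(duration=duration, fps=fps),
            RESOLUTION_TEMPLATE.format(height=height, width=width),
        ]
    )


suffix = build_video_suffix(num_frames=189, fps=24.0, height=720, width=1280)
print(suffix)
# The video is 7.9 seconds long and is of 24 FPS. This video is of 720x1280 resolution.
```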

COSMOS3_SYSTEM_PROMPT `module-attribute` ¶

COSMOS3_SYSTEM_PROMPT = "You are a helpful assistant who will generate videos from a give prompt."

COSMOS3_T2I_DEFAULT_FLOW_SHIFT `module-attribute` ¶

COSMOS3_T2I_DEFAULT_FLOW_SHIFT = 3.0

COSMOS3_T2I_DEFAULT_GUIDANCE_INTERVAL `module-attribute` ¶

COSMOS3_T2I_DEFAULT_GUIDANCE_INTERVAL: tuple[
    float, float
] = (400.0, 1000.0)

COSMOS3_T2I_DEFAULT_GUIDANCE_SCALE `module-attribute` ¶

COSMOS3_T2I_DEFAULT_GUIDANCE_SCALE = 7.0

COSMOS3_T2I_DEFAULT_HEIGHT `module-attribute` ¶

COSMOS3_T2I_DEFAULT_HEIGHT = 1024

COSMOS3_T2I_DEFAULT_NUM_INFERENCE_STEPS `module-attribute` ¶

COSMOS3_T2I_DEFAULT_NUM_INFERENCE_STEPS = 50

COSMOS3_T2I_DEFAULT_WIDTH `module-attribute` ¶

COSMOS3_T2I_DEFAULT_WIDTH = 1024

COSMOS3_T2I_SYSTEM_PROMPT `module-attribute` ¶

COSMOS3_T2I_SYSTEM_PROMPT = "You are a helpful assistant who will generate images from a give prompt."

COSMOS3_T2V_DEFAULT_GUIDANCE_SCALE `module-attribute` ¶

COSMOS3_T2V_DEFAULT_GUIDANCE_SCALE = 6.0

COSMOS3_T2V_DEFAULT_HEIGHT `module-attribute` ¶

COSMOS3_T2V_DEFAULT_HEIGHT = 720

COSMOS3_T2V_DEFAULT_NUM_FRAMES `module-attribute` ¶

COSMOS3_T2V_DEFAULT_NUM_FRAMES = 189

COSMOS3_T2V_DEFAULT_NUM_INFERENCE_STEPS `module-attribute` ¶

COSMOS3_T2V_DEFAULT_NUM_INFERENCE_STEPS = 35

COSMOS3_T2V_DEFAULT_WIDTH `module-attribute` ¶

COSMOS3_T2V_DEFAULT_WIDTH = 1280

COSMOS3_V2V_DEFAULT_FLOW_SHIFT `module-attribute` ¶

COSMOS3_V2V_DEFAULT_FLOW_SHIFT = 10.0

logger `module-attribute` ¶

logger = init_logger(__name__)

Cosmos3OmniDiffusersPipeline ¶

Bases: Module, CFGParallelMixin, SupportImageInput, ProgressBarMixin, DiffusionPipelineProfilerMixin

Cosmos3 text/image/video/sound/action pipeline.

Architecture: Mixture-of-Transformers with Qwen3-VL backbone.

Understanding pathway: causal self-attention on text (runs once, K/V cached).
Generation pathway: cross-attention on visual latents and optional transfer-control, action, and sound latents (runs each step).

Supports T2V, I2V, V2V, T2I, transfer, sound-enabled video, and action generation from the same class. Mode is selected at runtime:

T2I when prompt["modalities"] contains "image". Latent T-dim is forced to 1, T2I-specific scheduler defaults are applied (50 steps, flow_shift=3.0, guidance_interval=[400, 1000]), the duration template is suppressed, and post-process emits PIL images.
I2V when the request supplies a preprocessed image via multi_modal_data['image'] (handled by get_cosmos3_pre_process_func) and the requested output modality is not image. Frame 0 of the initial latent is set to the VAE-encoded conditioning image, frame-0 noise predictions are masked to zero, and the clean image latent is re-injected at frame 0 after each scheduler step.
V2V when the request supplies a preprocessed video via multi_modal_data['video'] without an action mode. Explicit latent frame indexes are kept clean with noisy_frame_mask and re-injected after each scheduler step.
Transfer when edge, blur, depth, seg, or wsm hints are supplied. Transfer is video-output only and cannot be combined with sound or action generation.
Sound-enabled video when generate_sound or sound_gen is true. Sound is generated from sound latents, not from multi_modal_data['audio']; T2I, transfer, and action+sound are rejected.
Action generation when action_mode is provided. policy and forward_dynamics require an image or video input; inverse_dynamics requires video input. Action predictions are returned in custom_output. RoboLab/OpenPI observations in extra_args['robot_obs'] or extra_args['observation'] return action-only custom output.
T2V otherwise (default video generation).
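
The compatibility rules in the list above can be collected into one check. The function below is a hypothetical sketch that restates those rules; its name, its flag-style parameters, and its error messages are not part of the pipeline.

```python
from __future__ import annotations


def check_cosmos3_request(
    *,
    output_image: bool = False,
    has_image: bool = False,
    has_video: bool = False,
    action_mode: str | None = None,
    transfer: bool = False,
    sound: bool = False,
) -> list[str]:
    """Return the rule violations for a request (empty list means accepted)."""
    errors = []
    # Sound is video-only and excludes transfer and action.
    if sound and output_image:
        errors.append("sound cannot be combined with T2I")
    if sound and transfer:
        errors.append("sound cannot be combined with transfer")
    if sound and action_mode:
        errors.append("sound cannot be combined with action")
    # Transfer is video-output only and excludes action.
    if transfer and output_image:
        errors.append("transfer is video-output only")
    if transfer and action_mode:
        errors.append("transfer cannot be combined with action")
    # Input requirements per action mode.
    if action_mode in ("policy", "forward_dynamics") and not (has_image or has_video):
        errors.append(f"{action_mode} requires an image or video input")
    if action_mode == "inverse_dynamics" and not has_video:
        errors.append("inverse_dynamics requires video input")
    return errors
```

For example, check_cosmos3_request(action_mode="inverse_dynamics", has_image=True) reports a single violation because an image alone does not satisfy the video requirement.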

color_format `class-attribute` ¶

color_format: str = 'RGB'

device `instance-attribute` ¶

device = get_local_device()

do_classifier_free_guidance `property` ¶

do_classifier_free_guidance

dtype `instance-attribute` ¶

dtype = od_config.dtype

guidance_scale `property` ¶

guidance_scale

num_timesteps `property` ¶

num_timesteps

od_config `instance-attribute` ¶

od_config = od_config

scheduler `instance-attribute` ¶

scheduler = UniPCMultistepScheduler.from_pretrained(
    model_path,
    subfolder="scheduler",
    local_files_only=local_files_only,
)

support_image_input `class-attribute` ¶

support_image_input: bool = True

tokenizer `instance-attribute` ¶

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    subfolder="text_tokenizer",
    local_files_only=local_files_only,
)

transformer `instance-attribute` ¶

transformer = Cosmos3VFMTransformer(
    od_config=od_config,
    temporal_compression_factor=self.vae_scale_factor_temporal,
    sound_gen=sound_gen,
    sound_dim=sound_dim,
    sound_latent_fps=sound_latent_fps,
)

vae `instance-attribute` ¶

vae = DistributedAutoencoderKLWan.from_pretrained(
    model_path,
    subfolder="vae",
    torch_dtype=self.dtype,
    local_files_only=local_files_only,
).to(self.device)

vae_scale_factor_spatial `instance-attribute` ¶

vae_scale_factor_spatial = getattr(
    self.vae.config, "scale_factor_spatial", 16
)

vae_scale_factor_temporal `instance-attribute` ¶

vae_scale_factor_temporal = int(
    self.vae.config.scale_factor_temporal
)

video_processor `instance-attribute` ¶

video_processor = VideoProcessor(
    vae_scale_factor=self.vae_scale_factor_spatial
)

weights_sources `instance-attribute` ¶

weights_sources = [
    DiffusersPipelineLoader.ComponentSource(
        model_or_path=model_path,
        subfolder=None,
        revision=None,
        prefix="transformer.",
        fall_back_to_pt=True,
        allow_patterns_overrides=[
            "transformer/*.safetensors"
        ],
    )
]

combine_multi_branch_cfg_noise ¶

combine_multi_branch_cfg_noise(
    predictions: list[Tensor | tuple[Tensor, ...]],
    true_cfg_scale: float | dict[str, float],
    cfg_normalize: bool = False,
) -> Tensor | tuple[Tensor, ...]

diffuse ¶

diffuse(
    latents: Tensor,
    timesteps: Tensor,
    cond_ids: Tensor,
    cond_mask: Tensor,
    uncond_ids: Tensor,
    uncond_mask: Tensor,
    guidance_scale: float,
    shared_kwargs: dict,
    *,
    action_latents: Tensor | None = None,
    action_velocity_mask: Tensor | None = None,
    action_condition_latents: Tensor | None = None,
    sound_latents: Tensor | None = None,
    velocity_mask: Tensor | None = None,
    image_latent: Tensor | None = None,
    condition_latents: Tensor | None = None,
    guidance_interval: tuple[float, float] | None = None,
    raw_action_dim: int | None = None,
    scheduler: Any | None = None,
) -> Tensor | tuple[Tensor, ...]

Denoising loop with three CFG modes: parallel, sequential, and none (CFG disabled).

Cosmos3's understanding (UND) pathway is text-dependent, so CFG needs separate K/V caches for conditional and unconditional text.

The two CFG-enabled modes are:

CFG parallel (multi-GPU): each rank handles one condition via predict_noise_maybe_with_cfg; caching is rank-local.
Sequential CFG (single-GPU or cfg_size=1): two separate forward passes with explicit cache swapping. We cannot batch B=2 because different text lengths would cause the shorter branch to attend to padding in cross-attention.
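
The sequential mode can be pictured with a toy model: the text cache is built once per branch, then swapped in before each of the two forward passes of every step. Everything below is a stand-in written for illustration (class name, methods, the fake velocity, and the toy update rule); it uses the standard CFG combination and does not reproduce the real transformer or scheduler.

```python
class ToyCosmos3:
    """Toy stand-in that counts understanding passes and reads a swappable cache."""

    def __init__(self):
        self.understanding_passes = 0
        self.active_cache = None

    def build_text_cache(self, token_ids):
        # Understanding pathway: runs once per text branch.
        self.understanding_passes += 1
        return {"num_tokens": len(token_ids)}

    def generation_pass(self, latent, timestep):
        # Generation pathway: runs every step against the active cache.
        # The fake "velocity" is just the cached token count.
        return float(self.active_cache["num_tokens"])


def sequential_cfg_denoise(model, latent, timesteps, cond_ids, uncond_ids, scale):
    caches = {
        "cond": model.build_text_cache(cond_ids),
        "uncond": model.build_text_cache(uncond_ids),
    }
    for t in timesteps:
        # Two separate passes with explicit cache swapping (no B=2 batching).
        model.active_cache = caches["cond"]
        cond = model.generation_pass(latent, t)
        model.active_cache = caches["uncond"]
        uncond = model.generation_pass(latent, t)
        velocity = uncond + scale * (cond - uncond)  # standard CFG combination
        latent = latent - 0.01 * velocity  # toy update, not UniPC
    return latent


model = ToyCosmos3()
result = sequential_cfg_denoise(model, 0.0, [1000, 500, 0], [1, 2, 3], [1], 6.0)
print(model.understanding_passes)  # 2
```

The understanding pass count stays at two however many timesteps are run, which is the point of caching K/V per branch.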

I2V conditioning (when both arguments are supplied):

velocity_mask zeros frame-0 noise predictions before stepping.
image_latent is re-injected into frame 0 after each scheduler step, since UniPC's predictor-corrector update rescales the sample (sigma-dependent), so even zero velocity does not preserve frame 0.
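
A minimal numeric sketch of why both steps are needed follows. Each frame is reduced to a single float, and toy_scheduler_step is a made-up update that rescales the sample; it stands in for the sigma-dependent behavior described above and is not UniPC. All names and constants are hypothetical.

```python
def toy_scheduler_step(sample, velocity, scale=0.9, dt=0.1):
    """Made-up update that rescales the sample (stand-in, NOT UniPC)."""
    return [scale * s + dt * v for s, v in zip(sample, velocity)]


def i2v_step(latents, velocity, velocity_mask, image_latent, reinject=True):
    # 1) Zero the frame-0 prediction before stepping.
    masked = [v * m for v, m in zip(velocity, velocity_mask)]
    stepped = toy_scheduler_step(latents, masked)
    # 2) Restore the clean conditioning latent at frame 0.
    if reinject:
        stepped[0] = image_latent
    return stepped


image_latent = 2.0
latents = [image_latent, 0.5, -0.3]  # frame 0 holds the conditioning image
velocity = [1.0, 1.0, 1.0]
velocity_mask = [0.0, 1.0, 1.0]      # 0 at frame 0, 1 elsewhere

drifted = i2v_step(latents, velocity, velocity_mask, image_latent, reinject=False)
kept = i2v_step(latents, velocity, velocity_mask, image_latent, reinject=True)
```

With masking alone, frame 0 moves from 2.0 to about 1.8 because the update rescales the sample; with re-injection it stays at 2.0.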

guidance_interval (T2I) restricts CFG to timesteps inside the closed interval [lo, hi]. The interval is compared against the raw scheduler timestep value, which works for both the [0, 1000] discrete scale and normalized flow-matching scales. Outside the interval the cond/uncond delta is zeroed rather than skipped, so all ranks continue to execute identical control flow (CFG-Parallel safe).
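
The interval gating can be sketched with scalars. The function below multiplies the delta by a 0/1 gate instead of branching, which mirrors the "zero the delta" behavior. Writing the combination as cond + (scale - 1) * delta, so that the result falls back to the conditional prediction outside the interval, is an assumption; the docstring above does not say which branch is kept.

```python
def cfg_with_interval(cond, uncond, scale, timestep, interval=None):
    """Gated CFG combination (hypothetical helper, scalar stand-ins for tensors)."""
    inside = interval is None or interval[0] <= timestep <= interval[1]
    gate = 1.0 if inside else 0.0
    # The delta is always computed; only its weight changes, so every
    # caller runs the same operations regardless of the timestep.
    delta = (cond - uncond) * gate
    # Inside the interval this equals uncond + scale * (cond - uncond).
    return cond + (scale - 1.0) * delta


interval = (400.0, 1000.0)  # value of COSMOS3_T2I_DEFAULT_GUIDANCE_INTERVAL
inside_value = cfg_with_interval(1.0, 0.0, 7.0, 500.0, interval)
outside_value = cfg_with_interval(1.0, 0.0, 7.0, 100.0, interval)
boundary_value = cfg_with_interval(1.0, 0.0, 7.0, 400.0, interval)
```

Here inside_value and boundary_value are 7.0 (full guidance, closed interval), while outside_value is 1.0 (the unguided conditional prediction).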

diffuse_transfer ¶

diffuse_transfer(
    latents: Tensor,
    timesteps: Tensor,
    cond_ids: Tensor,
    cond_mask: Tensor,
    uncond_ids: Tensor,
    uncond_mask: Tensor,
    guidance_scale: float,
    control_guidance: float,
    control_guidance_interval: tuple[float, float] | None,
    control_latents: list[Tensor],
    shared_kwargs: dict[str, Any],
    *,
    velocity_mask: Tensor,
    condition_latents: Tensor,
    guidance_interval: tuple[float, float] | None = None,
) -> Tensor

forward ¶

forward(req: DiffusionRequestBatch) -> DiffusionOutput

load_weights ¶

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Stream-remap checkpoint weights and load via AutoWeightsLoader.

Handles quantization, TP-aware weight_loader, and buffer loading. Returns the set of loaded parameter names for strict validation.
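
One way a caller might use the returned set for strict validation is sketched below. The helper and its error message are hypothetical; the real check is performed by the loading code, not by this function.

```python
def check_all_loaded(expected_names, loaded_names):
    """Raise if any expected parameter was not loaded (hypothetical helper)."""
    missing = sorted(set(expected_names) - set(loaded_names))
    if missing:
        raise ValueError(f"weights not initialized from checkpoint: {missing}")
    return True
```

In practice expected_names would come from something like the model's named parameters, and loaded_names from the return value of load_weights.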

predict_noise ¶

predict_noise(**kwargs) -> Tensor | tuple[Tensor, ...]

Override CFGParallelMixin.predict_noise for Cosmos3.

The transformer returns the raw prediction: video-only as a tensor, or a tuple in video, action, sound order for multimodal generation.

reference_video_decode_spec `classmethod` ¶

reference_video_decode_spec(
    *,
    num_frames: int | None = None,
    extra_args: dict[str, Any] | None = None,
) -> ReferenceVideoDecodeSpec

get_cosmos3_action_post_process_func ¶

get_cosmos3_action_post_process_func(
    od_config: OmniDiffusionConfig,
)

Build the custom-output postprocessor for Cosmos3 action predictions.

Action modes return predicted action tensors in custom_output alongside normal video output. RoboLab/OpenPI policy serving marks action-only output and carries observation metadata used here to map model-space actions back to the requested robot action representation.

get_cosmos3_ir_op_priority_func ¶

get_cosmos3_ir_op_priority_func(
    od_config: OmniDiffusionConfig,
)

get_cosmos3_post_process_func ¶

get_cosmos3_post_process_func(
    od_config: OmniDiffusionConfig,
)

Build the postprocessor for Cosmos3 image, video, and video+audio output.

The pipeline returns image payloads as {"image": tensor} and video payloads as {"video": tensor}. Sound-enabled video returns the same video payload plus audio and audio_sample_rate. Image output with audio is rejected because Cosmos3 sound generation is video-only.
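
The payload shapes described above can be told apart with a few key checks. The classifier below is a hypothetical sketch: only the key names image, video, audio, and audio_sample_rate come from the description, and plain Python objects stand in for tensors.

```python
def describe_payload(payload):
    """Classify a pipeline payload by its keys (hypothetical helper)."""
    has_audio = "audio" in payload
    if "image" in payload:
        if has_audio:
            # Sound generation is video-only, so this combination is rejected.
            raise ValueError("image output cannot carry audio")
        return "image"
    if "video" in payload:
        if has_audio:
            return f"video+audio@{payload['audio_sample_rate']}"
        return "video"
    raise ValueError("payload has neither 'image' nor 'video'")
```

For instance, describe_payload({"video": frames, "audio": wave, "audio_sample_rate": 48000}) returns "video+audio@48000", where frames, wave, and the sample rate are placeholder values.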

get_cosmos3_pre_process_func ¶

get_cosmos3_pre_process_func(
    od_config: OmniDiffusionConfig,
)

Build the request preprocessor for Cosmos3 image/video inputs.

For plain T2V (no image or video in multi_modal_data), the request is returned unchanged after the optional guardrail check. For I2V, the conditioning image is loaded, aspect-resized, center-cropped, and stored as additional_information.preprocessed_image. For V2V, source frames are cropped to the target size and stored as additional_information.preprocessed_video.

Action modes reuse image/video preprocessing but use action-specific resize and padding rules. Transfer requests store additional_information.preprocessed_transfer_video for optional input video conditioning. Cosmos3 sound generation is not driven by multi_modal_data["audio"]; it is enabled later from sampling params.
