vllm_omni.diffusion.models.cosmos3.pipeline_cosmos3 ¶
Cosmos3 text/image-to-video and text-to-image pipeline for vllm-omni.
Single pipeline class supports T2V, I2V, and T2I; the mode is selected at runtime by:
- prompt["modalities"] contains "image": T2I (text-to-image).
- prompt["modalities"] contains "video" or is omitted: T2V (text-to-video).
- multi_modal_data['image'] present on the prompt: I2V (handled by :func:get_cosmos3_pre_process_func).
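The selection rules above can be sketched as a small dispatch function. This is only an illustration: select_cosmos3_mode is a hypothetical helper, not part of the module, and the precedence (T2I wins over an attached image) is taken from the class description, which says I2V applies only when the requested output modality is not image.

```python
def select_cosmos3_mode(prompt: dict) -> str:
    """Hypothetical helper: classify a prompt dict as "t2i", "i2v", or "t2v"."""
    # An omitted "modalities" entry is treated as a video request.
    modalities = prompt.get("modalities") or ["video"]
    if "image" in modalities:
        return "t2i"
    # A conditioning image on a non-image request selects I2V.
    multi_modal_data = prompt.get("multi_modal_data") or {}
    if multi_modal_data.get("image") is not None:
        return "i2v"
    return "t2v"


print(select_cosmos3_mode({"prompt": "a red fox", "modalities": ["image"]}))
print(select_cosmos3_mode({"prompt": "a red fox"}))
```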
COSMOS3_DURATION_TEMPLATE module-attribute ¶
COSMOS3_IMAGE_RESOLUTION_TEMPLATE module-attribute ¶
COSMOS3_INVERSE_DURATION_TEMPLATE module-attribute ¶
COSMOS3_INVERSE_DURATION_TEMPLATE = "The video is not {duration:.1f} seconds long and is not of {fps:.0f} FPS."
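As a quick illustration of how the format fields in this template resolve, the sketch below fills it with example values. The frame count and FPS are made up, and deriving the duration as num_frames / fps is an assumption rather than something this page states.

```python
COSMOS3_INVERSE_DURATION_TEMPLATE = (
    "The video is not {duration:.1f} seconds long and is not of {fps:.0f} FPS."
)

# Illustrative values only; the pipeline's own duration computation is not shown here.
num_frames, fps = 120, 24.0
duration = num_frames / fps

text = COSMOS3_INVERSE_DURATION_TEMPLATE.format(duration=duration, fps=fps)
print(text)  # The video is not 5.0 seconds long and is not of 24 FPS.
```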
COSMOS3_INVERSE_IMAGE_RESOLUTION_TEMPLATE module-attribute ¶
COSMOS3_INVERSE_RESOLUTION_TEMPLATE module-attribute ¶
COSMOS3_RESOLUTION_TEMPLATE module-attribute ¶
COSMOS3_SYSTEM_PROMPT module-attribute ¶
COSMOS3_T2I_DEFAULT_GUIDANCE_INTERVAL module-attribute ¶
COSMOS3_T2I_DEFAULT_NUM_INFERENCE_STEPS module-attribute ¶
COSMOS3_T2I_SYSTEM_PROMPT module-attribute ¶
COSMOS3_T2I_SYSTEM_PROMPT = "You are a helpful assistant who will generate images from a given prompt."
COSMOS3_T2V_DEFAULT_NUM_INFERENCE_STEPS module-attribute ¶
Cosmos3OmniDiffusersPipeline ¶
Bases: Module, CFGParallelMixin, SupportImageInput, ProgressBarMixin, DiffusionPipelineProfilerMixin
Cosmos3 text/image-to-video / text-to-image pipeline.
Architecture: Mixture-of-Transformers with Qwen3-VL backbone.
- Understanding pathway: causal self-attention on text (runs once, K/V cached)
- Generation pathway: cross-attention on noisy visual latents (runs each step)
Supports T2V, I2V, and T2I from the same class. Mode is selected at runtime:
- T2I when prompt["modalities"] contains "image". Latent T-dim is forced to 1, T2I-specific scheduler defaults are applied (50 steps, flow_shift=3.0, guidance_interval=[400, 1000]), the duration template is suppressed, and post-process emits PIL images.
- I2V when the request supplies a preprocessed image via multi_modal_data['image'] (handled by :func:get_cosmos3_pre_process_func) and the requested output modality is not image. Frame 0 of the initial latent is set to the VAE-encoded conditioning image, frame-0 noise predictions are masked to zero, and the clean image latent is re-injected at frame 0 after each scheduler step.
- T2V otherwise (default video generation).
scheduler instance-attribute ¶
scheduler = from_pretrained(
model_path,
subfolder="scheduler",
local_files_only=local_files_only,
)
tokenizer instance-attribute ¶
tokenizer = from_pretrained(
model_path,
subfolder="text_tokenizer",
local_files_only=local_files_only,
)
transformer instance-attribute ¶
transformer = Cosmos3VFMTransformer(
od_config=od_config,
temporal_compression_factor=vae_scale_factor_temporal,
sound_gen=sound_gen,
sound_dim=sound_dim,
sound_latent_fps=sound_latent_fps,
)
vae_scale_factor_spatial instance-attribute ¶
vae_scale_factor_spatial = getattr(
config, "scale_factor_spatial", 16
)
vae_scale_factor_temporal instance-attribute ¶
vae_scale_factor_temporal = int(scale_factor_temporal)
video_processor instance-attribute ¶
weights_sources instance-attribute ¶
weights_sources = [
ComponentSource(
model_or_path=model_path,
subfolder=None,
revision=None,
prefix="transformer.",
fall_back_to_pt=True,
allow_patterns_overrides=[
"transformer/*.safetensors"
],
)
]
diffuse ¶
diffuse(
latents: Tensor,
timesteps: Tensor,
cond_ids: Tensor,
cond_mask: Tensor,
uncond_ids: Tensor,
uncond_mask: Tensor,
guidance_scale: float,
shared_kwargs: dict,
*,
action_latents: Tensor | None = None,
action_velocity_mask: Tensor | None = None,
action_condition_latents: Tensor | None = None,
sound_latents: Tensor | None = None,
velocity_mask: Tensor | None = None,
image_latent: Tensor | None = None,
condition_latents: Tensor | None = None,
guidance_interval: tuple[float, float] | None = None,
raw_action_dim: int | None = None,
) -> Tensor | tuple[Tensor, ...]
Denoising loop with 3-mode CFG support (parallel, sequential, none).
Cosmos3's UND pathway is text-dependent, so CFG needs separate K/V caches for conditional and unconditional text.
Two modes apply when CFG is active:
- CFG parallel (multi-GPU): each rank handles one condition via predict_noise_maybe_with_cfg; caching is rank-local.
- Sequential CFG (single-GPU or cfg_size=1): two separate forward passes with explicit cache swapping. We cannot batch B=2 because different text lengths would cause the shorter branch to attend to padding in cross-attention.
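The sequential mode can be sketched with a toy model: the text K/V for each branch is computed once, kept in its own cache slot, and swapped in for each of the two forward passes. Everything here (ToyTextCache, toy_forward, sequential_cfg_step) is a hypothetical stand-in, the "K/V" is just a tuple of floats, and the standard uncond + scale * (cond - uncond) combination is assumed rather than taken from the source.

```python
class ToyTextCache:
    """Hypothetical per-branch cache standing in for cached text K/V."""

    def __init__(self):
        self.store = {}
        self.encode_calls = 0

    def get(self, branch, token_ids):
        # Encode each branch's text once; later steps reuse the stored entry.
        if branch not in self.store:
            self.encode_calls += 1
            self.store[branch] = tuple(float(t) for t in token_ids)
        return self.store[branch]


def toy_forward(latent, kv):
    # Fake generation pathway: depends on the latent and the branch's text K/V.
    return 0.5 * latent + sum(kv) / len(kv)


def sequential_cfg_step(latent, cache, cond_ids, uncond_ids, guidance_scale):
    # Two separate passes, each with its own cache entry; no padding is needed
    # even though the two prompts have different lengths.
    cond = toy_forward(latent, cache.get("cond", cond_ids))
    uncond = toy_forward(latent, cache.get("uncond", uncond_ids))
    return uncond + guidance_scale * (cond - uncond)


cache = ToyTextCache()
results = [
    sequential_cfg_step(2.0, cache, [1, 2, 3], [4], guidance_scale=2.0)
    for _ in range(3)
]
print(results[-1], cache.encode_calls)
```

In this toy setup the text is encoded twice in total (once per branch) no matter how many denoising steps run, which is the point of keeping the two caches separate.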
I2V conditioning (when both arguments are supplied):
* velocity_mask zeros frame-0 noise predictions before stepping.
* image_latent is re-injected into frame 0 after each scheduler step. UniPC's predictor-corrector update rescales the sample (sigma-dependent), so even zero velocity does not preserve frame 0.
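A minimal sketch of why both steps are needed, using one scalar per frame instead of real latent tensors. toy_step is a hypothetical stand-in that merely rescales the sample; it is not UniPC, and the scale and step size are arbitrary.

```python
def toy_step(sample, velocity, scale=0.9, dt=0.1):
    """Toy scheduler step that rescales the sample (NOT the real UniPC update)."""
    return [scale * x + dt * v for x, v in zip(sample, velocity)]


def i2v_step(latents, velocity, velocity_mask, image_latent):
    # 1) Zero the frame-0 prediction (mask is 0.0 for frame 0, 1.0 elsewhere).
    masked = [v * m for v, m in zip(velocity, velocity_mask)]
    # 2) Run the scheduler update on all frames.
    stepped = toy_step(latents, masked)
    # 3) Re-inject the clean conditioning latent at frame 0.
    stepped[0] = image_latent
    return stepped


latents = [2.0, 0.5, -0.3]   # frame 0 holds the conditioning latent
velocity = [1.0, 1.0, 1.0]
mask = [0.0, 1.0, 1.0]

# Masking alone is not enough: the rescaling still moves frame 0.
drifted = toy_step(latents, [v * m for v, m in zip(velocity, mask)])
fixed = i2v_step(latents, velocity, mask, image_latent=2.0)
print(drifted[0], fixed[0])
```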
guidance_interval (T2I) restricts CFG to timesteps inside the closed interval [lo, hi]. The interval is compared against the raw scheduler timestep value, so the bounds must be given on the same scale as the scheduler's timesteps, whether that is the [0, 1000] discrete scale or a normalized flow-matching scale. Outside the interval, the cond/uncond delta is zeroed so that all ranks continue to execute identical control flow (CFG-Parallel safe).
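The gating can be sketched as below. guided_prediction is a hypothetical helper operating on plain lists, and the combination rule is an assumption: it uses cond + (scale - 1) * (cond - uncond), which is algebraically the usual CFG formula and falls back to the conditional prediction when the delta is zeroed. The actual formula used by the pipeline is not given on this page.

```python
def guided_prediction(cond, uncond, guidance_scale, timestep, guidance_interval=None):
    """Hypothetical sketch of interval-gated CFG on per-element predictions."""
    if guidance_interval is None:
        in_interval = True
    else:
        lo, hi = guidance_interval
        in_interval = lo <= timestep <= hi  # closed interval, raw timestep value
    # Multiply the delta by a gate instead of branching, so every caller runs
    # the same arithmetic whether or not guidance is active at this step.
    gate = 1.0 if in_interval else 0.0
    return [
        c + (guidance_scale - 1.0) * gate * (c - u)
        for c, u in zip(cond, uncond)
    ]


inside = guided_prediction([1.0], [0.0], 3.0, timestep=500, guidance_interval=(400, 1000))
outside = guided_prediction([1.0], [0.0], 3.0, timestep=200, guidance_interval=(400, 1000))
print(inside, outside)  # [3.0] [1.0]
```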
load_weights ¶
Stream-remap checkpoint weights and load via AutoWeightsLoader.
Handles quantization, TP-aware weight_loader, and buffer loading. Returns the set of loaded parameter names for strict validation.
get_cosmos3_pre_process_func ¶
get_cosmos3_pre_process_func(
od_config: OmniDiffusionConfig,
)
Pre-process function for both T2V and I2V.
For T2V (no image in multi_modal_data), the request is returned unchanged after the optional guardrails check. For I2V (image present), the conditioning image is loaded, aspect-resized + center-cropped, and stored back on the prompt as additional_information.preprocessed_image.
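The geometry of an aspect-preserving resize followed by a center crop can be sketched as follows. resize_and_center_crop_box is a hypothetical helper that only computes sizes and the crop box; the rounding rule and the "scale to cover, then crop" interpretation are assumptions about what "aspect-resized + center-cropped" means here.

```python
def resize_and_center_crop_box(src_w, src_h, dst_w, dst_h):
    """Hypothetical helper: return (resized_size, crop_box) for a cover-fit crop."""
    # Scale so the resized image covers the target in both dimensions.
    scale = max(dst_w / src_w, dst_h / src_h)
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    # Center the crop window inside the resized image.
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


size, box = resize_and_center_crop_box(1920, 1080, 1280, 704)
print(size, box)  # (1280, 720) (0, 8, 1280, 712)
```

The returned box uses the (left, upper, right, lower) convention, so it could be passed to an image library's crop call after resizing to the returned size.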