vllm_omni.diffusion.models.cosmos3 ¶
Modules:
| Name | Description |
|---|---|
| action | Action-token helpers for Cosmos3 UVA/action generation. |
| audio_tokenizer | |
| guardrails | Cosmos3 guardrail hooks for vllm-omni. |
| pipeline_cosmos3 | Cosmos3 text/image-to-video and text-to-image pipeline for vllm-omni. |
| sound_tokenizer | Cosmos3 sound tokenizer integration. |
| transformer_cosmos3 | Cosmos3 VFM Transformer for vllm-omni. |
Cosmos3OmniDiffusersPipeline ¶
Bases: Module, CFGParallelMixin, SupportImageInput, ProgressBarMixin, DiffusionPipelineProfilerMixin
Cosmos3 text/image-to-video / text-to-image pipeline.
Architecture: Mixture-of-Transformers with Qwen3-VL backbone.
- Understanding pathway: causal self-attention on text (runs once, K/V cached)
- Generation pathway: cross-attention on noisy visual latents (runs each step)
Supports T2V, I2V, and T2I from the same class. Mode is selected at runtime:
- T2I when `prompt["modalities"]` contains `"image"`. Latent T-dim is forced to 1, T2I-specific scheduler defaults are applied (50 steps, flow_shift=3.0, guidance_interval=[400, 1000]), the duration template is suppressed, and post-process emits PIL images.
- I2V when the request supplies a preprocessed image via `multi_modal_data['image']` (handled by `get_cosmos3_pre_process_func`) and the requested output modality is not image. Frame 0 of the initial latent is set to the VAE-encoded conditioning image, frame-0 noise predictions are masked to zero, and the clean image latent is re-injected at frame 0 after each scheduler step.
- T2V otherwise (default video generation).
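The selection rules above can be sketched as a small function. This is only an illustration: `select_mode` is a hypothetical name, and the plain-dict prompt layout is an assumption based on the keys mentioned in the list, not the pipeline's actual request type.

```python
def select_mode(prompt: dict) -> str:
    """Return "t2i", "i2v", or "t2v" following the rules listed above.

    Hypothetical helper: the real pipeline may inspect its request
    object differently.
    """
    modalities = prompt.get("modalities") or []
    mm_data = prompt.get("multi_modal_data") or {}

    # T2I takes priority: the requested output modality is an image.
    if "image" in modalities:
        return "t2i"
    # I2V: a conditioning image is supplied and the output is not an image.
    if mm_data.get("image") is not None:
        return "i2v"
    # Default: plain text-to-video.
    return "t2v"
```

Note that a request carrying both an image input and an image output modality resolves to T2I in this sketch, because the I2V rule requires that the requested output modality is not image.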
scheduler instance-attribute ¶
scheduler = from_pretrained(
model_path,
subfolder="scheduler",
local_files_only=local_files_only,
)
tokenizer instance-attribute ¶
tokenizer = from_pretrained(
model_path,
subfolder="text_tokenizer",
local_files_only=local_files_only,
)
transformer instance-attribute ¶
transformer = Cosmos3VFMTransformer(
od_config=od_config,
temporal_compression_factor=vae_scale_factor_temporal,
sound_gen=sound_gen,
sound_dim=sound_dim,
sound_latent_fps=sound_latent_fps,
)
vae_scale_factor_spatial instance-attribute ¶
vae_scale_factor_spatial = getattr(
config, "scale_factor_spatial", 16
)
vae_scale_factor_temporal instance-attribute ¶
vae_scale_factor_temporal = int(scale_factor_temporal)
video_processor instance-attribute ¶
weights_sources instance-attribute ¶
weights_sources = [
ComponentSource(
model_or_path=model_path,
subfolder=None,
revision=None,
prefix="transformer.",
fall_back_to_pt=True,
allow_patterns_overrides=[
"transformer/*.safetensors"
],
)
]
diffuse ¶
diffuse(
latents: Tensor,
timesteps: Tensor,
cond_ids: Tensor,
cond_mask: Tensor,
uncond_ids: Tensor,
uncond_mask: Tensor,
guidance_scale: float,
shared_kwargs: dict,
*,
action_latents: Tensor | None = None,
action_velocity_mask: Tensor | None = None,
action_condition_latents: Tensor | None = None,
sound_latents: Tensor | None = None,
velocity_mask: Tensor | None = None,
image_latent: Tensor | None = None,
condition_latents: Tensor | None = None,
guidance_interval: tuple[float, float] | None = None,
raw_action_dim: int | None = None,
) -> Tensor | tuple[Tensor, ...]
Denoising loop with 3-mode CFG support (parallel, sequential, none).
Cosmos3's UND pathway is text-dependent, so CFG needs separate K/V caches for conditional and unconditional text.
When CFG is enabled, there are two modes:
- CFG parallel (multi-GPU): each rank handles one condition via predict_noise_maybe_with_cfg; caching is rank-local.
- Sequential CFG (single-GPU or cfg_size=1): two separate forward passes with explicit cache swapping. We cannot batch B=2 because different text lengths would cause the shorter branch to attend to padding in cross-attention.
I2V conditioning (when both arguments are supplied):
- velocity_mask zeros frame-0 noise predictions before stepping.
- image_latent is re-injected into frame 0 after each scheduler step. UniPC's predictor-corrector update rescales the sample (sigma-dependent), so even zero velocity does not preserve frame 0.
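A toy sketch of why both steps are needed. It assumes one scalar per latent frame and a made-up update rule `scale * sample + dt * velocity` standing in for a scheduler that rescales the sample. It is not UniPC, and `toy_step` and `i2v_step` are hypothetical names.

```python
def toy_step(sample, velocity, scale, dt):
    # Stand-in for a scheduler update that rescales the sample (NOT UniPC).
    return [scale * x + dt * v for x, v in zip(sample, velocity)]


def i2v_step(latents, velocity, velocity_mask, image_latent, scale=0.9, dt=0.1):
    # 1) Zero the prediction on conditioned frames (mask is 0 at frame 0).
    masked = [v * m for v, m in zip(velocity, velocity_mask)]
    # 2) Scheduler step: frame 0 still drifts because the sample is rescaled.
    stepped = toy_step(latents, masked, scale, dt)
    # 3) Re-inject the clean conditioning latent at frame 0.
    stepped[0] = image_latent
    return stepped


latents = [2.0, 1.0, 1.0]     # frame 0 holds the clean image latent
velocity = [5.0, 1.0, 1.0]
mask = [0.0, 1.0, 1.0]

# Masking alone is not enough: frame 0 becomes 0.9 * 2.0 = 1.8.
drifted = toy_step(latents, [v * m for v, m in zip(velocity, mask)], 0.9, 0.1)
fixed = i2v_step(latents, velocity, mask, image_latent=2.0)
```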
guidance_interval (T2I) restricts CFG to timesteps inside the closed interval [lo, hi]. The interval is compared against the raw scheduler timestep value, which works for both the [0, 1000] discrete scale and normalized flow-matching scales. Outside the interval, the cond/uncond delta is zeroed so that all ranks continue to execute identical control flow (CFG-Parallel safe).
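A minimal sketch of interval-gated CFG on plain Python lists. It assumes the combination is written as `cond + (guidance_scale - 1) * delta`, so a zeroed delta falls back to the conditional prediction. The source only states that the delta is zeroed, so this formula and the name `cfg_combine` are assumptions.

```python
def cfg_combine(cond, uncond, guidance_scale, timestep, guidance_interval=None):
    """Combine cond/uncond predictions, gating CFG by a closed interval."""
    if guidance_interval is None:
        gate = 1.0
    else:
        lo, hi = guidance_interval
        # Closed interval: both endpoints keep CFG active.
        gate = 1.0 if lo <= timestep <= hi else 0.0
    # The delta is multiplied by the gate rather than skipped, so every
    # call performs the same operations regardless of the timestep.
    return [
        c + (guidance_scale - 1.0) * gate * (c - u)
        for c, u in zip(cond, uncond)
    ]
```

Multiplying by a gate instead of branching around the combination mirrors the stated goal of keeping control flow identical across ranks.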
load_weights ¶
Stream-remap checkpoint weights and load via AutoWeightsLoader.
Handles quantization, TP-aware weight_loader, and buffer loading. Returns the set of loaded parameter names for strict validation.
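The stream-remap-then-validate pattern can be sketched without any framework code. Everything here is hypothetical: the rename table, `remap_stream`, and `load_into` are illustrative stand-ins and do not reproduce AutoWeightsLoader or the actual Cosmos3 key mapping.

```python
def remap_stream(weights, rename):
    """Lazily rewrite checkpoint names; tensors are passed through untouched."""
    for name, tensor in weights:
        for old, new in rename.items():
            if name.startswith(old):
                name = new + name[len(old):]
                break
        yield name, tensor


def load_into(params, weights):
    """Copy matching entries into params and return the set of loaded names."""
    loaded = set()
    for name, tensor in weights:
        if name in params:
            params[name] = tensor
            loaded.add(name)
    return loaded


# Hypothetical checkpoint stream and parameter table.
checkpoint = iter([("net.a", 1), ("net.b", 2)])
params = {"transformer.a": None, "transformer.b": None, "transformer.c": None}

loaded = load_into(params, remap_stream(checkpoint, {"net.": "transformer."}))
missing = set(params) - loaded  # strict validation can fail if this is non-empty
```

Because the remapping is a generator, names are rewritten one entry at a time and the full checkpoint never has to be materialized as a renamed copy.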
Cosmos3VFMTransformer ¶
Bases: Module
Cosmos3 VFM Transformer: UND language model + GEN denoising layers.
The UND pathway runs once per generation (K/V cached). The GEN pathway runs at each denoising step.
Layerwise offloading uses gen_layers as the block container.
Sequence parallelism uses _sp_plan to shard/gather the GEN pathway at module boundaries. Cosmos3CrossAttention checks forward_context.sp_active at runtime and routes to the framework Attention layer (with Ulysses all-to-all) or plain SDPA accordingly.
action_dim instance-attribute ¶
action_dim = int(
action_dim_value if action_dim_value is not None else 64
)
action_gen instance-attribute ¶
action_modality_embed instance-attribute ¶
action_proj_in instance-attribute ¶
action_proj_in = DomainAwareLinear(
action_dim,
hidden_size,
num_embodiment_domains,
dtype=dtype,
)
action_proj_out instance-attribute ¶
action_proj_out = DomainAwareLinear(
hidden_size,
action_dim,
num_embodiment_domains,
dtype=dtype,
)
enable_fps_modulation instance-attribute ¶
enable_fps_modulation = bool(
_tf_config_get(
model_config, "enable_fps_modulation", True
)
)
gen_layers instance-attribute ¶
gen_layers = ModuleList(
[
(
Cosmos3GenDecoderLayer(
layer_idx=i,
hidden_size=hidden_size,
intermediate_size=intermediate_size,
num_attention_heads=num_attention_heads,
num_key_value_heads=num_key_value_heads,
head_dim=head_dim,
rms_norm_eps=rms_norm_eps,
quant_config=quant_config,
prefix=f"gen_layers.{i}",
)
)
for i in (range(num_hidden_layers))
]
)
hidden_size instance-attribute ¶
hidden_size = int(
_tf_config_get(model_config, "hidden_size", 4096)
)
intermediate_size instance-attribute ¶
intermediate_size = int(
_tf_config_get(model_config, "intermediate_size", 12288)
)
language_model instance-attribute ¶
language_model = Cosmos3LanguageModel(
hidden_size=hidden_size,
intermediate_size=intermediate_size,
num_hidden_layers=num_hidden_layers,
num_attention_heads=num_attention_heads,
num_key_value_heads=num_key_value_heads,
head_dim=head_dim,
vocab_size=vocab_size,
rms_norm_eps=rms_norm_eps,
rope_theta=rope_theta,
mrope_section=mrope_section,
quant_config=quant_config,
prefix="language_model",
)
latent_channel_size instance-attribute ¶
latent_channel_size = int(
_tf_config_get(model_config, "latent_channel", 48)
)
latent_patch_size instance-attribute ¶
latent_patch_size = int(
_tf_config_get(model_config, "latent_patch_size", 2)
)
num_attention_heads instance-attribute ¶
num_attention_heads = int(
_tf_config_get(model_config, "num_attention_heads", 32)
)
num_embodiment_domains instance-attribute ¶
num_embodiment_domains = int(
_od_config_get(od_config, "num_embodiment_domains", 32)
)
num_hidden_layers instance-attribute ¶
num_hidden_layers = int(
_tf_config_get(model_config, "num_hidden_layers", 36)
)
num_key_value_heads instance-attribute ¶
num_key_value_heads = int(
_tf_config_get(model_config, "num_key_value_heads", 8)
)
patch_latent_dim instance-attribute ¶
rms_norm_eps instance-attribute ¶
rms_norm_eps = float(
_tf_config_get(model_config, "rms_norm_eps", 1e-06)
)
rope_theta instance-attribute ¶
rope_theta = float(
_tf_config_get(model_config, "rope_theta", 5000000)
)
temporal_compression_factor instance-attribute ¶
temporal_compression_factor = int(
temporal_compression_factor
)
temporal_compression_factor_sound instance-attribute ¶
temporal_compression_factor_sound = int(
_tf_config_get(
model_config, "temporal_compression_factor_sound", 1
)
)
temporal_modality_margin instance-attribute ¶
temporal_modality_margin = int(
_tf_config_get(
model_config,
"unified_3d_mrope_temporal_modality_margin",
15000,
)
)
time_embedder instance-attribute ¶
time_embedder = TimestepEmbedder(
hidden_size, target_dtype=dtype
)
timestep_scale instance-attribute ¶
timestep_scale = float(
_tf_config_get(model_config, "timestep_scale", 0.001)
)
vocab_size instance-attribute ¶
vocab_size = int(
_tf_config_get(model_config, "vocab_size", 151936)
)
forward ¶
forward(
hidden_states: Tensor,
timestep: Tensor,
text_ids: Tensor,
text_mask: Tensor,
video_shape: tuple[int, int, int],
fps: float | None = None,
action_latents: Tensor | None = None,
action_domain_ids: Tensor | None = None,
action_noisy_mask: Tensor | None = None,
action_start_frame_offset: int = 1,
action_fps: float | None = None,
sound_latents: Tensor | None = None,
noisy_frame_mask: Tensor | None = None,
**kwargs,
) -> Tensor | tuple[Tensor, ...]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | Tensor | [B, C, t, h, w] noisy latents | required |
| timestep | Tensor | [B] diffusion timestep | required |
| text_ids | Tensor | [B, S_text] tokenized text | required |
| text_mask | Tensor | [B, S_text] attention mask (1=real, 0=pad) | required |
| video_shape | tuple[int, int, int] | (t, h, w) in latent space | required |
| fps | float \| None | video frame rate for temporal mRoPE modulation | None |
| action_latents | Tensor \| None | Optional [B, T_action, D_action] noisy action latents. | None |
| action_domain_ids | Tensor \| None | Optional [B] embodiment domain IDs for action projections. | None |
| action_noisy_mask | Tensor \| None | Optional [B, T_action, 1] mask where 1=noisy action token and 0=clean conditioned token. | None |
| sound_latents | Tensor \| None | Optional [B, C_sound, T_sound] noisy sound latents. | None |
| noisy_frame_mask | Tensor \| None | Optional [B, 1, t, 1, 1] mask where 1=noisy (add timestep embedding, predict velocity) and 0=conditioned (clean context, skip timestep embedding). None means all frames noisy (T2V mode). | None |
Returns:
| Type | Description |
|---|---|
| Tensor \| tuple[Tensor, ...] | [B, C, t, h, w] velocity prediction, or tuple outputs in video, action, sound order when extra modalities are provided. |
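To make the noisy_frame_mask layout concrete, here is a sketch that builds an I2V-style mask with nested Python lists standing in for a [B, 1, t, 1, 1] tensor. `make_noisy_frame_mask` and its `num_conditioned_frames` argument are hypothetical. The only behaviour taken from the table above is that 0 marks a conditioned frame and 1 a noisy frame.

```python
def make_noisy_frame_mask(batch_size, num_latent_frames, num_conditioned_frames=1):
    """Build a nested-list mask indexed as mask[b][0][frame][0][0]."""
    # One [1, 1] cell per latent frame: 0.0 = conditioned, 1.0 = noisy.
    frames = [
        [[0.0 if f < num_conditioned_frames else 1.0]]
        for f in range(num_latent_frames)
    ]
    # Wrap in the singleton channel axis and repeat (with copies) over the batch.
    return [
        [[[list(row) for row in frame] for frame in frames]]
        for _ in range(batch_size)
    ]


mask = make_noisy_frame_mask(batch_size=2, num_latent_frames=4)
```

With a tensor library, the same mask would typically be a ones tensor of that shape with the leading frame slice set to zero.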
pack_action ¶
Validate and return action latents as [B, T_action, D_action] tokens.
pack_sound ¶
[B, C_sound, T_sound] -> [B, T_sound, C_sound].
patchify ¶
`[B, C, t, h, w] -> [B, t*hp*wp, p*p*C]`, padding h/w if needed.
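A shape-only sketch of the patchify arithmetic, where p is the latent patch size and hp, wp are the patch counts along height and width. It assumes h and w are padded up to the next multiple of p. `patchify_shape` is a hypothetical helper that computes the output shape and does not touch tensor data.

```python
import math


def patchify_shape(shape, patch_size):
    """Map a [B, C, t, h, w] shape to the patchified [B, tokens, dim] shape."""
    b, c, t, h, w = shape
    # Assumed padding rule: round h and w up to a multiple of patch_size.
    hp = math.ceil(h / patch_size)
    wp = math.ceil(w / patch_size)
    return (b, t * hp * wp, patch_size * patch_size * c)


# 48 latent channels and patch size 2 are the config defaults listed above;
# h=45 is odd, so it is padded to 46 before patching.
out = patchify_shape((1, 48, 4, 45, 80), patch_size=2)  # (1, 3680, 192)
```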
unpack_action staticmethod ¶
Return [B, T_action, D_action] action predictions.
unpack_sound staticmethod ¶
[B, T_sound, C_sound] -> [B, C_sound, T_sound].
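The sound pack/unpack pair is a transpose of the last two axes. The sketch below shows the round trip with nested lists standing in for tensors. `swap_last_two_axes` is a hypothetical name, and the real methods operate on tensors and may do extra validation.

```python
def swap_last_two_axes(batch):
    """[B, X, Y] -> [B, Y, X] for nested lists."""
    return [[list(col) for col in zip(*sample)] for sample in batch]


# B=1, C_sound=2, T_sound=3
latents = [[[1, 2, 3],
            [4, 5, 6]]]

tokens = swap_last_two_axes(latents)    # pack:   [B, T_sound, C_sound]
restored = swap_last_two_axes(tokens)   # unpack: [B, C_sound, T_sound]
```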
get_cosmos3_pre_process_func ¶
get_cosmos3_pre_process_func(
od_config: OmniDiffusionConfig,
)
Pre-process function for both T2V and I2V.
For T2V (no image in multi_modal_data), the request is returned unchanged after the optional guardrails check. For I2V (image present), the conditioning image is loaded, resized with its aspect ratio preserved, center-cropped, and stored back on the prompt as additional_information.preprocessed_image.
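The resize-and-crop geometry can be sketched as pure arithmetic. This assumes a "cover" style resize (scale until the image fills the target in both dimensions, then crop the excess evenly). The source does not specify the exact rule, and `resize_and_center_crop_box` is a hypothetical helper.

```python
def resize_and_center_crop_box(src_w, src_h, dst_w, dst_h):
    """Return the resized size and the (left, top, right, bottom) crop box."""
    # Scale so that both dimensions cover the target (aspect ratio preserved).
    scale = max(dst_w / src_w, dst_h / src_h)
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    # Center the crop window inside the resized image.
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


# A square 1000x1000 input targeted at 1280x720 is scaled to 1280x1280,
# then 280 rows are cropped from the top and from the bottom.
size, box = resize_and_center_crop_box(1000, 1000, 1280, 720)
```

The returned values have the shape expected by typical image libraries' resize and crop calls, which is where they would be applied in practice.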