vllm_omni.diffusion.models.cosmos3 ¶

Modules:

Name	Description
`action`	Action-token helpers for Cosmos3 action generation.
`audio_tokenizer`
`guardrails`	Cosmos3 guardrail hooks for vllm-omni.
`pipeline_cosmos3`	Cosmos3 text/image/video/sound/action pipeline for vllm-omni.
`sound_tokenizer`	Cosmos3 sound tokenizer integration.
`transfer`	Cosmos3 transfer inference helpers.
`transformer_cosmos3`	Cosmos3 VFM Transformer for vllm-omni.
`utils`

Cosmos3OmniDiffusersPipeline ¶

Bases: Module, CFGParallelMixin, SupportImageInput, ProgressBarMixin, DiffusionPipelineProfilerMixin

Cosmos3 text/image/video/sound/action pipeline.

Architecture: Mixture-of-Transformers with Qwen3-VL backbone.

- Understanding pathway: causal self-attention on text (runs once, K/V cached)
- Generation pathway: cross-attention on visual latents and optional transfer-control, action, and sound latents (runs each step)

Supports T2V, I2V, V2V, T2I, transfer, sound-enabled video, and action generation from the same class. Mode is selected at runtime:

T2I when prompt["modalities"] contains "image". Latent T-dim is forced to 1, T2I-specific scheduler defaults are applied (50 steps, flow_shift=3.0, guidance_interval=[400, 1000]), the duration template is suppressed, and post-process emits PIL images.
I2V when the request supplies a preprocessed image via multi_modal_data['image'] (handled by get_cosmos3_pre_process_func) and the requested output modality is not image. Frame 0 of the initial latent is set to the VAE-encoded conditioning image, frame-0 noise predictions are masked to zero, and the clean image latent is re-injected at frame 0 after each scheduler step.
V2V when the request supplies a preprocessed video via multi_modal_data['video'] without an action mode. Explicit latent frame indexes are kept clean with noisy_frame_mask and re-injected after each scheduler step.
Transfer when edge, blur, depth, seg, or wsm hints are supplied. Transfer is video-output only and cannot be combined with sound or action generation.
Sound-enabled video when generate_sound or sound_gen is true. Sound is generated from sound latents, not from multi_modal_data['audio']; T2I, transfer, and action+sound are rejected.
Action generation when action_mode is provided. policy and forward_dynamics require an image or video input; inverse_dynamics requires video input. Action predictions are returned in custom_output. RoboLab/OpenPI observations in extra_args['robot_obs'] or extra_args['observation'] return action-only custom output.
T2V otherwise (default video generation).
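The selection rules above can be sketched as a small dispatcher. This is an illustration only: `select_cosmos3_mode`, its parameters, and the returned labels are hypothetical, and the precedence between modes is an assumption, since the list states the conditions but not the order in which the pipeline checks them.

```python
from typing import Optional, Sequence

# Hint names taken from the Transfer entry in the mode list.
TRANSFER_HINTS = {"edge", "blur", "depth", "seg", "wsm"}


def select_cosmos3_mode(
    modalities: Sequence[str],
    has_image: bool = False,
    has_video: bool = False,
    hints: Sequence[str] = (),
    sound: bool = False,
    action_mode: Optional[str] = None,
) -> str:
    """Return a label for the base mode (hypothetical helper, not the pipeline API)."""
    wants_image = "image" in modalities
    is_transfer = any(h in TRANSFER_HINTS for h in hints)

    # Combinations the mode list describes as rejected.
    if is_transfer and (wants_image or sound or action_mode):
        raise ValueError("transfer is video-output only; no sound or action")
    if sound and (wants_image or action_mode):
        raise ValueError("sound cannot be combined with T2I or action")
    if action_mode in ("policy", "forward_dynamics") and not (has_image or has_video):
        raise ValueError(f"{action_mode} requires an image or video input")
    if action_mode == "inverse_dynamics" and not has_video:
        raise ValueError("inverse_dynamics requires video input")

    # Assumed precedence; the real pipeline may order these checks differently.
    if wants_image:
        return "t2i"
    if is_transfer:
        return "transfer"
    if action_mode:
        return f"action:{action_mode}"
    if has_video:
        return "v2v"
    if has_image:
        return "i2v"
    return "t2v"
```

Sound is treated here as a flag layered on a video mode rather than as a mode of its own, which matches the description of sound-enabled video but is still a modelling choice of this sketch.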

color_format `class-attribute` ¶

color_format: str = 'RGB'

device `instance-attribute` ¶

device = get_local_device()

do_classifier_free_guidance `property` ¶

do_classifier_free_guidance

dtype `instance-attribute` ¶

dtype = od_config.dtype

guidance_scale `property` ¶

guidance_scale

num_timesteps `property` ¶

num_timesteps

od_config `instance-attribute` ¶

od_config = od_config

scheduler `instance-attribute` ¶

scheduler = UniPCMultistepScheduler.from_pretrained(
    model_path,
    subfolder="scheduler",
    local_files_only=local_files_only,
)

support_image_input `class-attribute` ¶

support_image_input: bool = True

tokenizer `instance-attribute` ¶

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    subfolder="text_tokenizer",
    local_files_only=local_files_only,
)

transformer `instance-attribute` ¶

transformer = Cosmos3VFMTransformer(
    od_config=od_config,
    temporal_compression_factor=self.vae_scale_factor_temporal,
    sound_gen=sound_gen,
    sound_dim=sound_dim,
    sound_latent_fps=sound_latent_fps,
)

vae `instance-attribute` ¶

vae = DistributedAutoencoderKLWan.from_pretrained(
    model_path,
    subfolder="vae",
    torch_dtype=self.dtype,
    local_files_only=local_files_only,
).to(self.device)

vae_scale_factor_spatial `instance-attribute` ¶

vae_scale_factor_spatial = getattr(
    self.vae.config, "scale_factor_spatial", 16
)

vae_scale_factor_temporal `instance-attribute` ¶

vae_scale_factor_temporal = int(
    self.vae.config.scale_factor_temporal
)
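As a rough illustration of how the two scale factors relate pixel-space sizes to latent sizes, the sketch below computes a latent `(t, h, w)` shape. The helper `latent_shape` is hypothetical, the temporal factor of 4 is an assumed example value (the attribute above reads it from the VAE config without a default), and the `(num_frames - 1) // temporal + 1` rule is the convention used by Wan-style causal video VAEs, assumed here to apply.

```python
def latent_shape(num_frames: int, height: int, width: int,
                 spatial: int = 16, temporal: int = 4) -> tuple:
    """Map pixel-space video dimensions to latent (t, h, w).

    Assumes the first frame is encoded on its own and every following
    group of `temporal` frames shares one latent frame.
    """
    t = (num_frames - 1) // temporal + 1
    h = height // spatial
    w = width // spatial
    return t, h, w


# 81 frames at 480x832 with spatial=16, temporal=4
print(latent_shape(81, 480, 832))  # (21, 30, 52)
```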

video_processor `instance-attribute` ¶

video_processor = VideoProcessor(
    vae_scale_factor=self.vae_scale_factor_spatial
)

weights_sources `instance-attribute` ¶

weights_sources = [
    DiffusersPipelineLoader.ComponentSource(
        model_or_path=model_path,
        subfolder=None,
        revision=None,
        prefix="transformer.",
        fall_back_to_pt=True,
        allow_patterns_overrides=[
            "transformer/*.safetensors"
        ],
    )
]

combine_multi_branch_cfg_noise ¶

combine_multi_branch_cfg_noise(
    predictions: list[Tensor | tuple[Tensor, ...]],
    true_cfg_scale: float | dict[str, float],
    cfg_normalize: bool = False,
) -> Tensor | tuple[Tensor, ...]

diffuse ¶

diffuse(
    latents: Tensor,
    timesteps: Tensor,
    cond_ids: Tensor,
    cond_mask: Tensor,
    uncond_ids: Tensor,
    uncond_mask: Tensor,
    guidance_scale: float,
    shared_kwargs: dict,
    *,
    action_latents: Tensor | None = None,
    action_velocity_mask: Tensor | None = None,
    action_condition_latents: Tensor | None = None,
    sound_latents: Tensor | None = None,
    velocity_mask: Tensor | None = None,
    image_latent: Tensor | None = None,
    condition_latents: Tensor | None = None,
    guidance_interval: tuple[float, float] | None = None,
    raw_action_dim: int | None = None,
    scheduler: Any | None = None,
) -> Tensor | tuple[Tensor, ...]

Denoising loop with 3-mode CFG support (parallel, sequential, none).

Cosmos3's UND pathway is text-dependent, so CFG needs separate K/V caches for conditional and unconditional text.

Two modes apply when CFG is active:

CFG parallel (multi-GPU): each rank handles one condition via predict_noise_maybe_with_cfg; caching is rank-local.
Sequential CFG (single-GPU or cfg_size=1): two separate forward passes with explicit cache swapping. We cannot batch B=2 because different text lengths would cause the shorter branch to attend to padding in cross-attention.
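The cache swapping used by sequential CFG can be sketched with a toy model. Everything here is a stand-in: `ToyTransformer`, `sequential_cfg_step`, and the arithmetic inside `forward` are hypothetical and only mimic the idea that the text-dependent pass runs once per branch while the per-step pass reuses that branch's cache.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ToyTransformer:
    """Stand-in for a model with a text-dependent cache (not the real class)."""
    cached_kv: Optional[Tuple[int, ...]] = None
    und_passes: int = 0

    def forward(self, latent: float, text_ids: Tuple[int, ...]) -> float:
        if self.cached_kv is None:
            # Text pass: runs only when no cache is installed.
            self.cached_kv = tuple(text_ids)
            self.und_passes += 1
        # Per-step pass: toy prediction that depends on the cached text.
        return latent + 0.01 * sum(self.cached_kv)


def sequential_cfg_step(
    model: ToyTransformer,
    caches: Dict[str, Optional[Tuple[int, ...]]],
    latent: float,
    cond_ids: Tuple[int, ...],
    uncond_ids: Tuple[int, ...],
) -> Tuple[float, float]:
    """Run the conditional and unconditional branches one after the other."""
    preds = {}
    for name, ids in (("cond", cond_ids), ("uncond", uncond_ids)):
        model.cached_kv = caches.get(name)  # swap in this branch's cache
        preds[name] = model.forward(latent, ids)
        caches[name] = model.cached_kv      # keep it for the next step
    return preds["cond"], preds["uncond"]
```

Calling `sequential_cfg_step` for several steps with the same `caches` dict leaves `model.und_passes` at 2: one text pass per branch, regardless of the number of denoising steps. Because each branch is a separate forward call, the two prompts never share a padded batch.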

I2V conditioning (when both arguments are supplied):

* velocity_mask zeros frame-0 noise predictions before stepping.
* image_latent is re-injected into frame 0 after each scheduler step, since UniPC's predictor-corrector update rescales the sample (sigma-dependent), so even zero velocity does not preserve frame 0.
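A toy loop shows why masking alone is not enough. The update in `toy_step` is not UniPC; it is a made-up step that rescales the sample, which is the property the text attributes to the real scheduler. `denoise_i2v` and the one-scalar-per-frame latents are likewise simplifications for illustration.

```python
from typing import List


def toy_step(sample: List[float], velocity: List[float],
             scale: float = 0.9, dt: float = 0.1) -> List[float]:
    """Made-up update that rescales the sample before adding the velocity term."""
    return [scale * x + dt * v for x, v in zip(sample, velocity)]


def denoise_i2v(latents: List[float], image_latent: float,
                velocities: List[List[float]], reinject: bool = True) -> List[float]:
    """One scalar per latent frame; frame 0 carries the conditioning image."""
    velocity_mask = [0.0] + [1.0] * (len(latents) - 1)  # 0 at frame 0
    for velocity in velocities:
        masked = [m * v for m, v in zip(velocity_mask, velocity)]
        latents = toy_step(latents, masked)
        if reinject:
            # Restore the clean conditioning latent after every step.
            latents = [image_latent] + latents[1:]
    return latents


steps = [[1.0, 1.0, 1.0]] * 3
with_reinject = denoise_i2v([2.0, 0.5, 0.5], 2.0, steps, reinject=True)
without = denoise_i2v([2.0, 0.5, 0.5], 2.0, steps, reinject=False)
```

With re-injection, frame 0 stays at the conditioning value; without it, frame 0 shrinks by the step's scale factor at every iteration even though its velocity is zero.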

guidance_interval (T2I) restricts CFG to timesteps inside the closed interval [lo, hi]. The interval is compared against the raw scheduler timestep value. This works for both the [0, 1000] discrete scale and normalized flow-matching scales. Outside the interval the cond/uncond delta is zeroed so all ranks continue to execute identical control flow (CFG-Parallel safe).
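The interval gate can be sketched as follows. `gated_cfg_delta` is a hypothetical helper that returns only the gated cond/uncond difference; how the pipeline combines that difference with a base prediction is not stated here, so the sketch stops short of it.

```python
from typing import List, Optional, Sequence, Tuple


def gated_cfg_delta(
    cond: Sequence[float],
    uncond: Sequence[float],
    timestep: float,
    guidance_interval: Optional[Tuple[float, float]] = None,
) -> List[float]:
    """Return (cond - uncond), zeroed when timestep is outside [lo, hi]."""
    if guidance_interval is None:
        gate = 1.0  # no interval: guidance applies at every step
    else:
        lo, hi = guidance_interval
        gate = 1.0 if lo <= timestep <= hi else 0.0
    # Multiplying by the gate keeps the same code path on every call,
    # instead of branching around the subtraction.
    return [gate * (c - u) for c, u in zip(cond, uncond)]
```

Multiplying by a 0/1 gate rather than skipping the computation mirrors the stated goal that all ranks keep executing the same control flow.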

diffuse_transfer ¶

diffuse_transfer(
    latents: Tensor,
    timesteps: Tensor,
    cond_ids: Tensor,
    cond_mask: Tensor,
    uncond_ids: Tensor,
    uncond_mask: Tensor,
    guidance_scale: float,
    control_guidance: float,
    control_guidance_interval: tuple[float, float] | None,
    control_latents: list[Tensor],
    shared_kwargs: dict[str, Any],
    *,
    velocity_mask: Tensor,
    condition_latents: Tensor,
    guidance_interval: tuple[float, float] | None = None,
) -> Tensor

forward ¶

forward(req: DiffusionRequestBatch) -> DiffusionOutput

load_weights ¶

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Stream-remap checkpoint weights and load via AutoWeightsLoader.

Handles quantization, TP-aware weight_loader, and buffer loading. Returns the set of loaded parameter names for strict validation.
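A minimal sketch of the stream-remap-then-validate pattern, under stated assumptions: it does not use AutoWeightsLoader, the `strip_model_prefix` rename rule is invented for the example, and `params` is a plain dict standing in for module parameters.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Set, Tuple


def strip_model_prefix(name: str) -> Optional[str]:
    """Hypothetical rename rule: drop a leading 'model.' from checkpoint keys."""
    return name[len("model."):] if name.startswith("model.") else name


def remap_stream(
    weights: Iterable[Tuple[str, Any]],
    rename: Callable[[str], Optional[str]],
) -> Iterator[Tuple[str, Any]]:
    """Lazily rename checkpoint entries without materializing the whole checkpoint."""
    for name, tensor in weights:
        new_name = rename(name)
        if new_name is not None:  # None means "skip this entry"
            yield new_name, tensor


def load_and_report(
    weights: Iterable[Tuple[str, Any]],
    params: Dict[str, Any],
    rename: Callable[[str], Optional[str]] = strip_model_prefix,
) -> Set[str]:
    """Assign matching entries and return the names that were loaded."""
    loaded: Set[str] = set()
    for name, tensor in remap_stream(weights, rename):
        if name in params:
            params[name] = tensor
            loaded.add(name)
    return loaded
```

A caller can then perform the strict check by comparing the returned set with the expected names, for example `missing = set(params) - loaded`.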

predict_noise ¶

predict_noise(**kwargs) -> Tensor | tuple[Tensor, ...]

Override CFGParallelMixin.predict_noise for Cosmos3.

The transformer returns the raw prediction: video-only as a tensor, or a tuple in video, action, sound order for multimodal generation.

reference_video_decode_spec `classmethod` ¶

reference_video_decode_spec(
    *,
    num_frames: int | None = None,
    extra_args: dict[str, Any] | None = None,
) -> ReferenceVideoDecodeSpec

Cosmos3VFMTransformer ¶

Bases: Module

Cosmos3 VFM Transformer: UND language model + GEN denoising layers.

The UND pathway runs once per generation (K/V cached). The GEN pathway runs at each denoising step over the target video/image latent stream and optional transfer-control, action, and sound latent streams.

Layerwise offloading uses gen_layers as the block container.

Sequence parallelism uses _sp_plan to shard/gather the GEN pathway at module boundaries. Cosmos3CrossAttention checks forward_context.sp_active at runtime and routes to the framework Attention layer (with Ulysses all-to-all) or plain SDPA accordingly.

action_dim `instance-attribute` ¶

action_dim = int(
    action_dim_value if action_dim_value is not None else 64
)

action_gen `instance-attribute` ¶

action_gen = (
    _as_bool(action_gen_value)
    if action_gen_value is not None
    else False
)

action_modality_embed `instance-attribute` ¶

action_modality_embed = nn.Parameter(
    torch.zeros(self.hidden_size, dtype=dtype)
)

action_proj_in `instance-attribute` ¶

action_proj_in = DomainAwareLinear(
    self.action_dim,
    self.hidden_size,
    self.num_embodiment_domains,
    dtype=dtype,
)

action_proj_out `instance-attribute` ¶

action_proj_out = DomainAwareLinear(
    self.hidden_size,
    self.action_dim,
    self.num_embodiment_domains,
    dtype=dtype,
)

audio_modality_embed `instance-attribute` ¶

audio_modality_embed = nn.Parameter(
    torch.zeros(self.hidden_size)
)

audio_proj_in `instance-attribute` ¶

audio_proj_in = nn.Linear(self.sound_dim, self.hidden_size)

audio_proj_out `instance-attribute` ¶

audio_proj_out = nn.Linear(self.hidden_size, self.sound_dim)

base_fps `instance-attribute` ¶

base_fps = float(
    _tf_config_get(model_config, "base_fps", 24.0)
)

cached_freqs_gen `instance-attribute` ¶

cached_freqs_gen: tuple[Tensor, Tensor] | None = None

cached_kv `instance-attribute` ¶

cached_kv: list[tuple[Tensor, Tensor]] | None = None

device `property` ¶

device: device

enable_fps_modulation `instance-attribute` ¶

enable_fps_modulation = bool(
    _tf_config_get(
        model_config, "enable_fps_modulation", True
    )
)

gen_layers `instance-attribute` ¶

gen_layers = nn.ModuleList(
    [
        (
            Cosmos3GenDecoderLayer(
                layer_idx=i,
                hidden_size=self.hidden_size,
                intermediate_size=self.intermediate_size,
                num_attention_heads=self.num_attention_heads,
                num_key_value_heads=self.num_key_value_heads,
                head_dim=self.head_dim,
                rms_norm_eps=self.rms_norm_eps,
                quant_config=quant_config,
                prefix=f"gen_layers.{i}",
            )
        )
        for i in (range(self.num_hidden_layers))
    ]
)

gen_sp_gather `instance-attribute` ¶

gen_sp_gather = nn.Identity()

gen_sp_prepare `instance-attribute` ¶

gen_sp_prepare = Cosmos3GenSPPrepare()

head_dim `instance-attribute` ¶

head_dim = int(
    _tf_config_get(model_config, "head_dim", 128)
)

hidden_size `instance-attribute` ¶

hidden_size = int(
    _tf_config_get(model_config, "hidden_size", 4096)
)

intermediate_size `instance-attribute` ¶

intermediate_size = int(
    _tf_config_get(model_config, "intermediate_size", 12288)
)

language_model `instance-attribute` ¶

language_model = Cosmos3LanguageModel(
    hidden_size=self.hidden_size,
    intermediate_size=self.intermediate_size,
    num_hidden_layers=self.num_hidden_layers,
    num_attention_heads=self.num_attention_heads,
    num_key_value_heads=self.num_key_value_heads,
    head_dim=self.head_dim,
    vocab_size=self.vocab_size,
    rms_norm_eps=self.rms_norm_eps,
    rope_theta=self.rope_theta,
    mrope_section=self.mrope_section,
    quant_config=quant_config,
    prefix="language_model",
)

latent_channel_size `instance-attribute` ¶

latent_channel_size = int(
    _tf_config_get(model_config, "latent_channel", 48)
)

latent_patch_size `instance-attribute` ¶

latent_patch_size = int(
    _tf_config_get(model_config, "latent_patch_size", 2)
)

mrope_section `instance-attribute` ¶

mrope_section = list(
    rope_scaling.get("mrope_section", [24, 20, 20])
)

norm_moe_gen `instance-attribute` ¶

norm_moe_gen = RMSNorm(
    self.hidden_size, eps=self.rms_norm_eps
)

num_attention_heads `instance-attribute` ¶

num_attention_heads = int(
    _tf_config_get(model_config, "num_attention_heads", 32)
)

num_embodiment_domains `instance-attribute` ¶

num_embodiment_domains = int(
    _od_config_get(od_config, "num_embodiment_domains", 32)
)

num_hidden_layers `instance-attribute` ¶

num_hidden_layers = int(
    _tf_config_get(model_config, "num_hidden_layers", 36)
)

num_key_value_heads `instance-attribute` ¶

num_key_value_heads = int(
    _tf_config_get(model_config, "num_key_value_heads", 8)
)
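The default values listed for head_dim, mrope_section, num_attention_heads, and num_key_value_heads fit together as shown in this short arithmetic check. Reading the head counts as grouped-query attention and the mrope_section sum as the number of rotary frequency pairs follows the usual Qwen-VL conventions and is an assumption about this model, not something stated on this page.

```python
# Defaults copied from the attribute listings on this page.
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 128
mrope_section = [24, 20, 20]

# Query heads that share each key/value head.
queries_per_kv_head = num_attention_heads // num_key_value_heads

# Projection widths implied by the head counts.
q_width = num_attention_heads * head_dim
kv_width = num_key_value_heads * head_dim

# Rotary frequency pairs split across the three position axes.
rotary_pairs = sum(mrope_section)

print(queries_per_kv_head, q_width, kv_width, rotary_pairs)  # 4 4096 1024 64
```

With these defaults the query width equals the default hidden_size of 4096, and the section sizes add up to half of head_dim.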

packed_modules_mapping `class-attribute` `instance-attribute` ¶

packed_modules_mapping = {}

patch_latent_dim `instance-attribute` ¶

patch_latent_dim = (
    self.latent_patch_size**2 * self.latent_channel_size
)

proj_in `instance-attribute` ¶

proj_in = nn.Linear(self.patch_latent_dim, self.hidden_size)

proj_out `instance-attribute` ¶

proj_out = nn.Linear(
    self.hidden_size, self.patch_latent_dim
)

rms_norm_eps `instance-attribute` ¶

rms_norm_eps = float(
    _tf_config_get(model_config, "rms_norm_eps", 1e-06)
)

rope_theta `instance-attribute` ¶

rope_theta = float(
    _tf_config_get(model_config, "rope_theta", 5000000)
)

sound_dim `instance-attribute` ¶

sound_dim = sound_dim

sound_gen `instance-attribute` ¶

sound_gen = sound_gen

sound_latent_fps `instance-attribute` ¶

sound_latent_fps = sound_latent_fps

temporal_compression_factor `instance-attribute` ¶

temporal_compression_factor = int(
    temporal_compression_factor
)

temporal_compression_factor_sound `instance-attribute` ¶

temporal_compression_factor_sound = int(
    _tf_config_get(
        model_config, "temporal_compression_factor_sound", 1
    )
)

temporal_modality_margin `instance-attribute` ¶

temporal_modality_margin = int(
    _tf_config_get(
        model_config,
        "unified_3d_mrope_temporal_modality_margin",
        15000,
    )
)

time_embedder `instance-attribute` ¶

time_embedder = TimestepEmbedder(
    self.hidden_size, target_dtype=dtype
)

timestep_scale `instance-attribute` ¶

timestep_scale = float(
    _tf_config_get(model_config, "timestep_scale", 0.001)
)

vocab_size `instance-attribute` ¶

vocab_size = int(
    _tf_config_get(model_config, "vocab_size", 151936)
)

forward ¶

forward(
    hidden_states: Tensor,
    timestep: Tensor,
    text_ids: Tensor,
    text_mask: Tensor,
    video_shape: tuple[int, int, int],
    fps: float | None = None,
    action_latents: Tensor | None = None,
    action_domain_ids: Tensor | None = None,
    action_noisy_mask: Tensor | None = None,
    action_start_frame_offset: int = 1,
    action_fps: float | None = None,
    sound_latents: Tensor | None = None,
    noisy_frame_mask: Tensor | None = None,
    control_latents: list[Tensor]
    | tuple[Tensor, ...]
    | Tensor
    | None = None,
    transfer_share_vision_temporal_positions: bool = True,
    **kwargs,
) -> Tensor | tuple[Tensor, ...]

Parameters:

Name	Type	Description	Default
`hidden_states`	`Tensor`	[B, C, t, h, w] noisy latents	required
`timestep`	`Tensor`	[B] diffusion timestep	required
`text_ids`	`Tensor`	[B, S_text] tokenized text	required
`text_mask`	`Tensor`	[B, S_text] attention mask (1=real, 0=pad)	required
`video_shape`	`tuple[int, int, int]`	(t, h, w) in latent space	required
`fps`	`float \| None`	video frame rate for temporal mRoPE modulation	`None`
`action_latents`	`Tensor \| None`	Optional [B, T_action, D_action] noisy action latents.	`None`
`action_domain_ids`	`Tensor \| None`	Optional [B] embodiment domain IDs for action projections.	`None`
`action_noisy_mask`	`Tensor \| None`	Optional [B, T_action, 1] mask where 1=noisy action token and 0=clean conditioned token.	`None`
`sound_latents`	`Tensor \| None`	Optional [B, C_sound, T_sound] noisy sound latents.	`None`
`noisy_frame_mask`	`Tensor \| None`	Optional [B, 1, t, 1, 1] mask where 1=noisy (add timestep embedding, predict velocity) and 0=conditioned (clean context, skip timestep embedding). None means all target vision frames are noisy, as in T2I/T2V.	`None`
`control_latents`	`list[Tensor] \| tuple[Tensor, ...] \| Tensor \| None`	Optional transfer-control latents. Controls are clean vision context and are packed before the noisy target.	`None`

Returns:

Type	Description
`Tensor \| tuple[Tensor, ...]`	[B, C, t, h, w] velocity prediction, or tuple outputs in video, action, sound order when action/sound streams are provided. Transfer-control streams condition the video prediction and are not returned.
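The noisy_frame_mask convention (1 = noisy, 0 = conditioned) can be illustrated per latent frame. `build_noisy_frame_mask` is a hypothetical helper that returns a flat list of length t; reshaping it to [B, 1, t, 1, 1] is left to the tensor library.

```python
from typing import Iterable, List


def build_noisy_frame_mask(num_latent_frames: int,
                           clean_frame_indexes: Iterable[int] = ()) -> List[int]:
    """Per-frame mask: 0 for clean conditioning frames, 1 for frames to denoise."""
    clean = set(clean_frame_indexes)
    for index in clean:
        if not 0 <= index < num_latent_frames:
            raise IndexError(f"clean frame index {index} out of range")
    return [0 if i in clean else 1 for i in range(num_latent_frames)]


print(build_noisy_frame_mask(5, [0]))     # [0, 1, 1, 1, 1]
print(build_noisy_frame_mask(5, [0, 1]))  # [0, 0, 1, 1, 1]
```

An all-ones mask corresponds to passing None, where every target frame is treated as noisy.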

pack_action ¶

pack_action(action_latents: Tensor) -> Tensor

Validate and return action latents as [B, T_action, D_action] tokens.

pack_sound ¶

pack_sound(sound_latents: Tensor) -> Tensor

[B, C_sound, T_sound] -> [B, T_sound, C_sound].

patchify ¶

patchify(latents: Tensor, t: int, h: int, w: int) -> Tensor

[B, C, t, h, w] -> [B, t*hp*wp, p*p*C], padding h/w if needed.
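The shape bookkeeping for patchify can be checked with plain arithmetic. In this sketch `patchify_shape` is a hypothetical helper, p is the latent patch size, hp and wp are the patch counts along height and width, and padding is assumed to round h and w up to the next multiple of p.

```python
import math


def patchify_shape(batch: int, channels: int, t: int, h: int, w: int,
                   patch: int = 2) -> tuple:
    """Return (B, number of tokens, token width) after patchifying."""
    hp = math.ceil(h / patch)  # patches along height, after padding
    wp = math.ceil(w / patch)  # patches along width, after padding
    return batch, t * hp * wp, patch * patch * channels


# Defaults from this page: latent_patch_size=2, latent_channel=48
print(patchify_shape(1, 48, 21, 30, 52))  # (1, 8190, 192)
print(patchify_shape(1, 48, 21, 31, 52))  # (1, 8736, 192)
```

The token width of 192 matches patch_latent_dim computed from the same defaults (2**2 * 48).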

post_load_weights ¶

post_load_weights() -> None

Post-load processing: ensure correct dtypes.

reset_cache ¶

reset_cache() -> None

sound_latent_frames_for_sequence_parallel ¶

sound_latent_frames_for_sequence_parallel(
    *,
    video_shape: tuple[int, int, int],
    sound_frames: int,
    num_vision_items: int = 1,
) -> int

unpack_action `staticmethod` ¶

unpack_action(tokens: Tensor) -> Tensor

Return [B, T_action, D_action] action predictions.

unpack_sound `staticmethod` ¶

unpack_sound(tokens: Tensor) -> Tensor

[B, T_sound, C_sound] -> [B, C_sound, T_sound].

unpatchify ¶

unpatchify(
    tokens: Tensor, t: int, h: int, w: int
) -> Tensor

[B, t*hp*wp, p*p*C] -> [B, C, t, h, w], cropping padding if needed.

get_cosmos3_action_post_process_func ¶

get_cosmos3_action_post_process_func(
    od_config: OmniDiffusionConfig,
)

Build the custom-output postprocessor for Cosmos3 action predictions.

Action modes return predicted action tensors in custom_output alongside normal video output. RoboLab/OpenPI policy serving marks action-only output and carries observation metadata used here to map model-space actions back to the requested robot action representation.

get_cosmos3_post_process_func ¶

get_cosmos3_post_process_func(
    od_config: OmniDiffusionConfig,
)

Build the postprocessor for Cosmos3 image, video, and video+audio output.

The pipeline returns image payloads as {"image": tensor} and video payloads as {"video": tensor}. Sound-enabled video returns the same video payload plus audio and audio_sample_rate. Image output with audio is rejected because Cosmos3 sound generation is video-only.

get_cosmos3_pre_process_func ¶

get_cosmos3_pre_process_func(
    od_config: OmniDiffusionConfig,
)

Build the request preprocessor for Cosmos3 image/video inputs.

For plain T2V (no image or video in multi_modal_data), the request is returned unchanged after the optional guardrail check. For I2V, the conditioning image is loaded, aspect-resized, center-cropped, and stored as additional_information.preprocessed_image. For V2V, source frames are cropped to the target size and stored as additional_information.preprocessed_video.
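The geometry of an aspect-preserving resize followed by a center crop can be sketched as below. `cover_resize_and_crop` is a hypothetical helper, and it assumes "aspect-resized" means scaling until the image covers the target in both dimensions; the preprocessor's actual rounding and interpolation choices are not documented here.

```python
def cover_resize_and_crop(src_w: int, src_h: int, dst_w: int, dst_h: int) -> tuple:
    """Return the resized (width, height) and the center-crop box (left, top, right, bottom)."""
    # Scale so the resized image covers the target in both dimensions.
    scale = max(dst_w / src_w, dst_h / src_h)
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    # Center the crop window inside the resized image.
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


size, box = cover_resize_and_crop(1920, 1080, 832, 480)
print(size, box)  # (853, 480) (10, 0, 842, 480)
```

Only the longer-than-needed axis is cropped; here the height already matches, so the crop removes 10 columns on the left and 11 on the right.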

Action modes reuse image/video preprocessing but use action-specific resize and padding rules. Transfer requests store additional_information.preprocessed_transfer_video for optional input video conditioning. Cosmos3 sound generation is not driven by multi_modal_data["audio"]; it is enabled later from sampling params.
