vllm_omni.diffusion.models.ltx2

Modules:

Name Description
ltx2_transformer
pipeline_ltx2
pipeline_ltx2_3

Fully independent LTX-2.3 pipeline for vLLM-Omni.

pipeline_ltx2_3_image2video

Re-exports for LTX-2.3 I2V pipeline variants.

pipeline_ltx2_image2video
pipeline_ltx2_latent_upsample

LTX23ImageToVideoPipeline

Bases: Module

LTX-2.3 image-to-video pipeline placeholder.

LTX23Pipeline

Bases: Module, CFGParallelMixin, ProgressBarMixin

Fully independent LTX-2.3 pipeline.

Key differences from LTX2Pipeline:

- Text encoding: uses all 49 hidden states from Gemma-3-12B, flattened
- Connectors: uses the padding_side API (not additive_mask)
- Vocoder: uses LTX2VocoderWithBWE (48 kHz output)
- Transformer: passes sigma for prompt_adaln
- CPU offloading: text encoder, connectors, VAE, and vocoder stay on CPU

audio_hop_length instance-attribute

audio_hop_length = (
    mel_hop_length if audio_vae is not None else 160
)

audio_sampling_rate instance-attribute

audio_sampling_rate = (
    sample_rate if audio_vae is not None else 16000
)

audio_vae instance-attribute

audio_vae = from_pretrained_with_prefetch(
    from_pretrained,
    model,
    subfolder="audio_vae",
    prefetch_list=ltx2_subfolders,
    local_files_only=local_files_only,
    torch_dtype=dtype,
)

audio_vae_mel_compression_ratio instance-attribute

audio_vae_mel_compression_ratio = (
    mel_compression_ratio if audio_vae is not None else 4
)

audio_vae_temporal_compression_ratio instance-attribute

audio_vae_temporal_compression_ratio = (
    temporal_compression_ratio
    if audio_vae is not None
    else 4
)

connectors instance-attribute

connectors = from_pretrained_with_prefetch(
    from_pretrained,
    model,
    subfolder="connectors",
    prefetch_list=ltx2_subfolders,
    local_files_only=local_files_only,
    torch_dtype=dtype,
)

current_timestep property

current_timestep

device instance-attribute

device = get_local_device()

do_classifier_free_guidance property

do_classifier_free_guidance

dummy_run_num_frames class-attribute instance-attribute

dummy_run_num_frames = 2

guidance_scale property

guidance_scale

interrupt property

interrupt

num_timesteps property

num_timesteps

od_config instance-attribute

od_config = od_config

scheduler instance-attribute

scheduler = from_pretrained(
    model,
    subfolder="scheduler",
    local_files_only=local_files_only,
)

text_encoder instance-attribute

text_encoder = from_pretrained_with_prefetch(
    from_pretrained,
    model,
    subfolder="text_encoder",
    prefetch_list=ltx2_subfolders,
    local_files_only=local_files_only,
    torch_dtype=dtype,
)

tokenizer instance-attribute

tokenizer = from_pretrained(
    model,
    subfolder="tokenizer",
    local_files_only=local_files_only,
)

tokenizer_max_length instance-attribute

tokenizer_max_length = int(tokenizer_max_length)

transformer instance-attribute

transformer = create_transformer_from_config(
    transformer_config, quant_config=quant_config
)

transformer_spatial_patch_size instance-attribute

transformer_spatial_patch_size = (
    patch_size if transformer is not None else 1
)

transformer_temporal_patch_size instance-attribute

transformer_temporal_patch_size = (
    patch_size_t if transformer is not None else 1
)

vae instance-attribute

vae = from_pretrained_with_prefetch(
    from_pretrained,
    model,
    subfolder="vae",
    prefetch_list=ltx2_subfolders,
    local_files_only=local_files_only,
    torch_dtype=dtype,
)

vae_spatial_compression_ratio instance-attribute

vae_spatial_compression_ratio = (
    spatial_compression_ratio if vae is not None else 32
)

vae_temporal_compression_ratio instance-attribute

vae_temporal_compression_ratio = (
    temporal_compression_ratio if vae is not None else 8
)

video_processor instance-attribute

video_processor = VideoProcessor(
    vae_scale_factor=vae_spatial_compression_ratio
)

vocoder instance-attribute

vocoder = from_pretrained(
    model,
    subfolder="vocoder",
    torch_dtype=dtype,
    local_files_only=local_files_only,
)

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=model,
        subfolder="transformer",
        revision=None,
        prefix="transformer.",
        fall_back_to_pt=True,
    )
]

check_inputs

check_inputs(
    prompt,
    height,
    width,
    prompt_embeds=None,
    negative_prompt_embeds=None,
    prompt_attention_mask=None,
    negative_prompt_attention_mask=None,
)

combine_cfg_noise

combine_cfg_noise(
    positive_noise_pred,
    negative_noise_pred,
    true_cfg_scale,
    cfg_normalize=False,
    *,
    video_latents: Tensor | None = None,
    audio_latents: Tensor | None = None,
    video_sigma: Tensor | None = None,
    audio_sigma: Tensor | None = None,
)

encode_prompt

encode_prompt(
    prompt: str | list[str],
    negative_prompt: str | list[str] | None = None,
    do_classifier_free_guidance: bool = True,
    num_videos_per_prompt: int = 1,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    prompt_attention_mask: Tensor | None = None,
    negative_prompt_attention_mask: Tensor | None = None,
    max_sequence_length: int = 1024,
    device: device | None = None,
    dtype: dtype | None = None,
)

forward

forward(
    req: OmniDiffusionRequest,
    prompt: str | list[str] | None = None,
    negative_prompt: str | list[str] | None = None,
    height: int | None = None,
    width: int | None = None,
    num_frames: int | None = None,
    frame_rate: float | None = None,
    num_inference_steps: int | None = None,
    sigmas: list[float] | None = None,
    timesteps: list[int] | None = None,
    guidance_scale: float = 4.0,
    noise_scale: float = 0.0,
    num_videos_per_prompt: int | None = 1,
    generator: Generator | list[Generator] | None = None,
    latents: Tensor | None = None,
    audio_latents: Tensor | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    prompt_attention_mask: Tensor | None = None,
    negative_prompt_attention_mask: Tensor | None = None,
    decode_timestep: float | list[float] = 0.0,
    decode_noise_scale: float | list[float] | None = None,
    output_type: str = "np",
    return_dict: bool = True,
    attention_kwargs: dict[str, Any] | None = None,
    max_sequence_length: int | None = None,
) -> DiffusionOutput

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

predict_noise

predict_noise(**kwargs)

predict_noise_with_parallel_cfg

predict_noise_with_parallel_cfg(
    true_cfg_scale: float,
    positive_kwargs: dict[str, Any],
    negative_kwargs: dict[str, Any],
    cfg_normalize: bool = True,
    output_slice: int | None = None,
    *,
    video_latents: Tensor | None = None,
    audio_latents: Tensor | None = None,
    video_sigma: Tensor | None = None,
    audio_sigma: Tensor | None = None,
) -> tuple[Tensor, Tensor]

prepare_audio_latents

prepare_audio_latents(
    batch_size: int = 1,
    num_channels_latents: int = 8,
    audio_latent_length: int = 1,
    num_mel_bins: int = 64,
    noise_scale: float = 0.0,
    dtype: dtype | None = None,
    device: device | None = None,
    generator: Generator | list[Generator] | None = None,
    latents: Tensor | None = None,
) -> tuple[Tensor, int, int]

prepare_latents

prepare_latents(
    batch_size: int = 1,
    num_channels_latents: int = 128,
    height: int = 512,
    width: int = 768,
    num_frames: int = 121,
    noise_scale: float = 0.0,
    dtype: dtype | None = None,
    device: device | None = None,
    generator: Generator | None = None,
    latents: Tensor | None = None,
) -> Tensor
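
The defaults above (height=512, width=768, num_frames=121) are pixel-space sizes, while the latents live in the VAE's compressed space. The following is a minimal sketch of that shape arithmetic. It uses the fallback ratios documented for this class (32 spatial, 8 temporal) and assumes the convention of the diffusers LTX pipelines, where the first frame gets its own latent frame. `latent_video_shape` and `num_video_tokens` are hypothetical helpers for illustration, not part of this module.

```python
def latent_video_shape(
    height: int = 512,
    width: int = 768,
    num_frames: int = 121,
    spatial_ratio: int = 32,  # fallback vae_spatial_compression_ratio
    temporal_ratio: int = 8,  # fallback vae_temporal_compression_ratio
) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width).

    Assumption: the first frame maps to its own latent frame and the
    remaining frames are compressed in groups of `temporal_ratio`.
    """
    latent_frames = (num_frames - 1) // temporal_ratio + 1
    return latent_frames, height // spatial_ratio, width // spatial_ratio


def num_video_tokens(
    shape: tuple[int, int, int],
    patch_size: int = 1,    # fallback transformer_spatial_patch_size
    patch_size_t: int = 1,  # fallback transformer_temporal_patch_size
) -> int:
    """Count the tokens produced by patchifying a latent video of `shape`."""
    frames, latent_h, latent_w = shape
    return (frames // patch_size_t) * (latent_h // patch_size) * (latent_w // patch_size)


shape = latent_video_shape()
print(shape, num_video_tokens(shape))
```

Under these assumptions the default request yields a 16 x 16 x 24 latent grid, which is the sequence length the transformer would see after patchifying with patch sizes of 1.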

LTX2I2VDMD2Pipeline

Bases: DMD2PipelineMixin, LTX2ImageToVideoPipeline

LTX-2 I2V pipeline for FastGen DMD2-distilled models.

LTX2ImageToVideoPipeline

Bases: LTX2Pipeline

support_image_input class-attribute instance-attribute

support_image_input = True

video_processor instance-attribute

video_processor = VideoProcessor(
    vae_scale_factor=vae_spatial_compression_ratio,
    resample="bilinear",
)

check_inputs

check_inputs(
    image,
    height,
    width,
    prompt,
    latents=None,
    prompt_embeds=None,
    negative_prompt_embeds=None,
    prompt_attention_mask=None,
    negative_prompt_attention_mask=None,
)

forward

forward(
    req: OmniDiffusionRequest,
    image: Image | Tensor | None = None,
    prompt: str | list[str] | None = None,
    negative_prompt: str | list[str] | None = None,
    height: int | None = None,
    width: int | None = None,
    num_frames: int | None = None,
    frame_rate: float | None = None,
    num_inference_steps: int | None = None,
    sigmas: list[float] | None = None,
    timesteps: list[int] | None = None,
    guidance_scale: float = 4.0,
    guidance_rescale: float = 0.0,
    noise_scale: float = 0.0,
    num_videos_per_prompt: int | None = 1,
    generator: Generator | list[Generator] | None = None,
    latents: Tensor | None = None,
    audio_latents: Tensor | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    prompt_attention_mask: Tensor | None = None,
    negative_prompt_attention_mask: Tensor | None = None,
    decode_timestep: float | list[float] = 0.0,
    decode_noise_scale: float | list[float] | None = None,
    output_type: str = "np",
    return_dict: bool = True,
    attention_kwargs: dict[str, Any] | None = None,
    max_sequence_length: int | None = None,
) -> DiffusionOutput

prepare_latents

prepare_latents(
    image: Tensor | None = None,
    batch_size: int = 1,
    num_channels_latents: int = 128,
    height: int = 512,
    width: int = 768,
    num_frames: int = 121,
    noise_scale: float = 0.0,
    dtype: dtype | None = None,
    device: device | None = None,
    generator: Generator | list[Generator] | None = None,
    latents: Tensor | None = None,
) -> tuple[Tensor, Tensor]

LTX2ImageToVideoTwoStagesPipeline

Bases: Module, SupportsComponentDiscovery

LTX2ImageToVideoTwoStagesPipeline performs two-stage image-to-video generation.

device instance-attribute

device = get_local_device()

distilled instance-attribute

distilled = False

dtype instance-attribute

dtype = getattr(od_config, 'dtype', bfloat16)

dummy_run_num_frames class-attribute instance-attribute

dummy_run_num_frames = 2

lora_manager instance-attribute

lora_manager = DiffusionLoRAManager(
    pipeline=pipe,
    device=device,
    dtype=dtype,
    max_cached_adapters=max_cpu_loras,
)

model_path instance-attribute

model_path = model

pipe instance-attribute

pipe = LTX2ImageToVideoPipeline(
    od_config=od_config, prefix=prefix
)

support_image_input class-attribute instance-attribute

support_image_input = True

upsample_pipe instance-attribute

upsample_pipe = LTX2LatentUpsamplePipeline(
    vae=vae, od_config=od_config
)

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=model,
        subfolder="transformer",
        revision=None,
        prefix="pipe.transformer.",
        fall_back_to_pt=True,
    )
]

forward

forward(
    req: OmniDiffusionRequest,
    image: Image | Tensor | None = None,
    prompt: str | list[str] | None = None,
    negative_prompt: str | list[str] | None = None,
    height: int | None = None,
    width: int | None = None,
    num_frames: int | None = None,
    frame_rate: float | None = None,
    num_inference_steps: int | None = None,
    sigmas: list[float] | None = None,
    timesteps: list[int] | None = None,
    guidance_scale: float = 4.0,
    guidance_rescale: float = 0.0,
    noise_scale: float = 0.0,
    num_videos_per_prompt: int | None = 1,
    generator: Generator | list[Generator] | None = None,
    latents: Tensor | None = None,
    audio_latents: Tensor | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    prompt_attention_mask: Tensor | None = None,
    negative_prompt_attention_mask: Tensor | None = None,
    decode_timestep: float | list[float] = 0.0,
    decode_noise_scale: float | list[float] | None = None,
    output_type: str = "np",
    return_dict: bool = True,
    attention_kwargs: dict[str, Any] | None = None,
    max_sequence_length: int | None = None,
)

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

LTX2LatentUpsamplePipeline

Bases: Module

device instance-attribute

device = get_local_device()

latent_upsampler instance-attribute

latent_upsampler = latent_upsampler

vae instance-attribute

vae = vae

vae_spatial_compression_ratio instance-attribute

vae_spatial_compression_ratio = (
    spatial_compression_ratio
    if getattr(self, "vae", None) is not None
    else 32
)

vae_temporal_compression_ratio instance-attribute

vae_temporal_compression_ratio = (
    temporal_compression_ratio
    if getattr(self, "vae", None) is not None
    else 8
)

video_processor instance-attribute

video_processor = VideoProcessor(
    vae_scale_factor=vae_spatial_compression_ratio
)

adain_filter_latent

adain_filter_latent(
    latents: Tensor,
    reference_latents: Tensor,
    factor: float = 1.0,
)
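
The name and signature of adain_filter_latent suggest adaptive instance normalization (AdaIN): the latents' per-channel statistics are matched to those of the reference latents, then blended with the original by `factor`. The sketch below shows that technique on a single channel, flattened to a list of floats. It is an illustration of AdaIN in general, not the module's implementation; `adain_channel` is a hypothetical helper, and it uses the population standard deviation for simplicity.

```python
from statistics import fmean, pstdev


def adain_channel(
    latent: list[float],
    reference: list[float],
    factor: float = 1.0,
) -> list[float]:
    """Match one channel's mean/std to a reference channel, then blend.

    factor=0.0 returns the input unchanged; factor=1.0 returns the fully
    re-normalized channel.
    """
    lat_mean, lat_std = fmean(latent), pstdev(latent)
    ref_mean, ref_std = fmean(reference), pstdev(reference)
    if lat_std == 0.0:
        # A constant channel cannot be normalized; leave it as it is.
        return list(latent)
    matched = [(x - lat_mean) / lat_std * ref_std + ref_mean for x in latent]
    # Linear interpolation between the original and the matched values.
    return [x + factor * (m - x) for x, m in zip(latent, matched)]


print(adain_channel([0.0, 2.0, 4.0], [10.0, 11.0, 12.0], factor=1.0))
```

A real implementation would apply this independently to every (batch, channel) slice of the latent tensor.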

check_inputs

check_inputs(
    video,
    height,
    width,
    latents,
    tone_map_compression_ratio,
)

forward

forward(
    video: list[PipelineImageInput] | None = None,
    height: int = 512,
    width: int = 768,
    num_frames: int = 121,
    spatial_patch_size: int = 1,
    temporal_patch_size: int = 1,
    latents: Tensor | None = None,
    latents_normalized: bool = False,
    decode_timestep: float | list[float] = 0.0,
    decode_noise_scale: float | list[float] | None = None,
    adain_factor: float = 0.0,
    tone_map_compression_ratio: float = 0.0,
    generator: Generator | list[Generator] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
)

prepare_latents

prepare_latents(
    video: Tensor | None = None,
    batch_size: int = 1,
    num_frames: int = 121,
    height: int = 512,
    width: int = 768,
    spatial_patch_size: int = 1,
    temporal_patch_size: int = 1,
    dtype: dtype | None = None,
    device: device | None = None,
    generator: Generator | None = None,
    latents: Tensor | None = None,
) -> Tensor

tone_map_latents

tone_map_latents(
    latents: Tensor, compression: float
) -> Tensor

LTX2Pipeline

Bases: Module, CFGParallelMixin, ProgressBarMixin

attention_kwargs property

attention_kwargs

audio_hop_length instance-attribute

audio_hop_length = (
    mel_hop_length
    if getattr(self, "audio_vae", None) is not None
    else 160
)

audio_sampling_rate instance-attribute

audio_sampling_rate = (
    sample_rate
    if getattr(self, "audio_vae", None) is not None
    else 16000
)

audio_vae instance-attribute

audio_vae = to(device)

audio_vae_mel_compression_ratio instance-attribute

audio_vae_mel_compression_ratio = (
    mel_compression_ratio
    if getattr(self, "audio_vae", None) is not None
    else 4
)

audio_vae_temporal_compression_ratio instance-attribute

audio_vae_temporal_compression_ratio = (
    temporal_compression_ratio
    if getattr(self, "audio_vae", None) is not None
    else 4
)

connectors instance-attribute

connectors = to(device)

current_timestep property

current_timestep

device instance-attribute

device = get_local_device()

do_classifier_free_guidance property

do_classifier_free_guidance

dummy_run_num_frames class-attribute instance-attribute

dummy_run_num_frames = 2

guidance_rescale property

guidance_rescale

guidance_scale property

guidance_scale

interrupt property

interrupt

num_timesteps property

num_timesteps

od_config instance-attribute

od_config = od_config

scheduler instance-attribute

scheduler = from_pretrained(
    model,
    subfolder="scheduler",
    local_files_only=local_files_only,
)

text_encoder instance-attribute

text_encoder = to(device)

tokenizer instance-attribute

tokenizer = from_pretrained(
    model,
    subfolder="tokenizer",
    local_files_only=local_files_only,
)

tokenizer_max_length instance-attribute

tokenizer_max_length = int(tokenizer_max_length)

transformer instance-attribute

transformer = create_transformer_from_config(
    transformer_config, quant_config=quant_config
)

transformer_spatial_patch_size instance-attribute

transformer_spatial_patch_size = (
    patch_size
    if getattr(self, "transformer", None) is not None
    else 1
)

transformer_temporal_patch_size instance-attribute

transformer_temporal_patch_size = (
    patch_size_t
    if getattr(self, "transformer", None) is not None
    else 1
)

vae instance-attribute

vae = to(device)

vae_spatial_compression_ratio instance-attribute

vae_spatial_compression_ratio = (
    spatial_compression_ratio
    if getattr(self, "vae", None) is not None
    else 32
)

vae_temporal_compression_ratio instance-attribute

vae_temporal_compression_ratio = (
    temporal_compression_ratio
    if getattr(self, "vae", None) is not None
    else 8
)

video_processor instance-attribute

video_processor = VideoProcessor(
    vae_scale_factor=vae_spatial_compression_ratio
)

vocoder instance-attribute

vocoder = to(device)

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=model,
        subfolder="transformer",
        revision=None,
        prefix="transformer.",
        fall_back_to_pt=True,
    )
]

check_inputs

check_inputs(
    prompt,
    height,
    width,
    prompt_embeds=None,
    negative_prompt_embeds=None,
    prompt_attention_mask=None,
    negative_prompt_attention_mask=None,
)

combine_cfg_noise

combine_cfg_noise(
    positive_noise_pred,
    negative_noise_pred,
    true_cfg_scale,
    cfg_normalize=False,
)

Per-element CFG combine with guidance_rescale support.
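
As a rough illustration of what a CFG combine with guidance rescale computes, the sketch below works on one sample's noise prediction, flattened to a list of floats. It assumes the standard classifier-free guidance formula and the commonly used rescale step that pulls the combined prediction's standard deviation back toward that of the positive prediction. `combine_cfg` is a hypothetical stand-in, not this method's implementation, and it uses the population standard deviation for simplicity.

```python
from statistics import pstdev


def combine_cfg(
    positive: list[float],
    negative: list[float],
    scale: float,
    guidance_rescale: float = 0.0,
) -> list[float]:
    """Combine conditional and unconditional predictions for one sample."""
    # Standard CFG: move from the negative prediction toward the positive one.
    combined = [n + scale * (p - n) for p, n in zip(positive, negative)]
    if guidance_rescale > 0.0:
        std_pos = pstdev(positive)
        std_cfg = pstdev(combined)
        if std_cfg > 0.0:
            # Rescale so the result has the positive prediction's std,
            # then blend with the unrescaled result.
            rescaled = [c * std_pos / std_cfg for c in combined]
            combined = [
                guidance_rescale * r + (1.0 - guidance_rescale) * c
                for r, c in zip(rescaled, combined)
            ]
    return combined


pos = [1.0, -1.0, 2.0, 0.0]
neg = [0.5, -0.5, 1.0, 0.0]
print(combine_cfg(pos, neg, scale=4.0, guidance_rescale=0.7))
```

With scale=1.0 the combined prediction equals the positive prediction, and with guidance_rescale=1.0 its standard deviation matches the positive prediction's.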

encode_prompt

encode_prompt(
    prompt: str | list[str],
    negative_prompt: str | list[str] | None = None,
    do_classifier_free_guidance: bool = True,
    num_videos_per_prompt: int = 1,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    prompt_attention_mask: Tensor | None = None,
    negative_prompt_attention_mask: Tensor | None = None,
    max_sequence_length: int = 1024,
    scale_factor: int = 8,
    device: device | None = None,
    dtype: dtype | None = None,
)

forward

forward(
    req: OmniDiffusionRequest,
    prompt: str | list[str] | None = None,
    negative_prompt: str | list[str] | None = None,
    height: int | None = None,
    width: int | None = None,
    num_frames: int | None = None,
    frame_rate: float | None = None,
    num_inference_steps: int | None = None,
    sigmas: list[float] | None = None,
    timesteps: list[int] | None = None,
    guidance_scale: float = 4.0,
    guidance_rescale: float = 0.0,
    noise_scale: float = 0.0,
    num_videos_per_prompt: int | None = 1,
    generator: Generator | list[Generator] | None = None,
    latents: Tensor | None = None,
    audio_latents: Tensor | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    prompt_attention_mask: Tensor | None = None,
    negative_prompt_attention_mask: Tensor | None = None,
    decode_timestep: float | list[float] = 0.0,
    decode_noise_scale: float | list[float] | None = None,
    output_type: str = "np",
    return_dict: bool = True,
    attention_kwargs: dict[str, Any] | None = None,
    max_sequence_length: int | None = None,
) -> DiffusionOutput

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]
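
The signature implies a contract: load_weights consumes (name, tensor) pairs and returns the set of parameter names it loaded. The sketch below illustrates that contract together with the `prefix="transformer."` entry in weights_sources. It assumes the prefix is prepended to each checkpoint name before matching, as in vLLM's model loader; that is an assumption, not something this page states. `add_prefix` and `load_weights_sketch` are hypothetical helpers, and plain Python values stand in for tensors.

```python
from collections.abc import Iterable, Iterator
from typing import Any


def add_prefix(
    weights: Iterable[tuple[str, Any]], prefix: str
) -> Iterator[tuple[str, Any]]:
    """Prepend `prefix` to every checkpoint weight name (assumed behavior)."""
    for name, tensor in weights:
        yield prefix + name, tensor


def load_weights_sketch(
    params: dict[str, Any], weights: Iterable[tuple[str, Any]]
) -> set[str]:
    """Copy matching weights into `params` and report which names were loaded."""
    loaded: set[str] = set()
    for name, tensor in weights:
        if name in params:
            params[name] = tensor
            loaded.add(name)
    return loaded


# Stand-in for the pipeline's named parameters.
params = {"transformer.proj_in.weight": None, "transformer.proj_out.weight": None}
checkpoint = [("proj_in.weight", 1), ("unknown.weight", 2)]
loaded = load_weights_sketch(params, add_prefix(checkpoint, "transformer."))
print(sorted(loaded))
```

A caller can compare the returned set against the expected parameter names to detect weights missing from the checkpoint.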

predict_noise

predict_noise(**kwargs)

prepare_audio_latents

prepare_audio_latents(
    batch_size: int = 1,
    num_channels_latents: int = 8,
    audio_latent_length: int = 1,
    num_mel_bins: int = 64,
    noise_scale: float = 0.0,
    dtype: dtype | None = None,
    device: device | None = None,
    generator: Generator | list[Generator] | None = None,
    latents: Tensor | None = None,
) -> tuple[Tensor, int, int]
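
The audio latent length has to cover the same duration as the video. The sketch below shows one way to derive it from the attributes documented for this class (fallback values: 16000 Hz sampling rate, hop length 160, temporal compression 4, mel compression 4). It assumes the arithmetic used by the diffusers LTX-2 pipeline, and the frame rate of 24 is only an illustrative value. `audio_latent_dims` is a hypothetical helper, not part of this module.

```python
def audio_latent_dims(
    num_frames: int = 121,
    frame_rate: float = 24.0,   # illustrative value, not a documented default
    sampling_rate: int = 16000, # fallback audio_sampling_rate
    hop_length: int = 160,      # fallback audio_hop_length
    temporal_ratio: int = 4,    # fallback audio_vae_temporal_compression_ratio
    num_mel_bins: int = 64,
    mel_ratio: int = 4,         # fallback audio_vae_mel_compression_ratio
) -> tuple[int, int]:
    """Return (audio_latent_length, latent_mel_bins) for a video clip."""
    duration_s = num_frames / frame_rate
    # Mel frames per second, then compressed along time by the audio VAE.
    latents_per_second = sampling_rate / hop_length / temporal_ratio
    audio_latent_length = round(duration_s * latents_per_second)
    latent_mel_bins = num_mel_bins // mel_ratio
    return audio_latent_length, latent_mel_bins


print(audio_latent_dims())
```

With the fallback values the audio VAE produces 25 latent frames per second, so a clip of about 5 seconds needs roughly 126 audio latent frames.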

prepare_latents

prepare_latents(
    batch_size: int = 1,
    num_channels_latents: int = 128,
    height: int = 512,
    width: int = 768,
    num_frames: int = 121,
    noise_scale: float = 0.0,
    dtype: dtype | None = None,
    device: device | None = None,
    generator: Generator | None = None,
    latents: Tensor | None = None,
) -> Tensor

LTX2T2VDMD2Pipeline

Bases: DMD2PipelineMixin, LTX2Pipeline

LTX-2 T2V pipeline for FastGen DMD2-distilled models.

LTX2TwoStagesPipeline

Bases: Module, SupportsComponentDiscovery

LTX2TwoStagesPipeline performs two-stage text-to-video generation.

device instance-attribute

device = get_local_device()

distilled instance-attribute

distilled = False

dtype instance-attribute

dtype = getattr(od_config, 'dtype', bfloat16)

dummy_run_num_frames class-attribute instance-attribute

dummy_run_num_frames = 2

lora_manager instance-attribute

lora_manager = DiffusionLoRAManager(
    pipeline=pipe,
    device=device,
    dtype=dtype,
    max_cached_adapters=max_cpu_loras,
)

model_path instance-attribute

model_path = model

pipe instance-attribute

pipe = LTX2Pipeline(od_config=od_config, prefix=prefix)

upsample_pipe instance-attribute

upsample_pipe = LTX2LatentUpsamplePipeline(
    vae=vae, od_config=od_config
)

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=model,
        subfolder="transformer",
        revision=None,
        prefix="pipe.transformer.",
        fall_back_to_pt=True,
    )
]

forward

forward(
    req: OmniDiffusionRequest,
    prompt: str | list[str] | None = None,
    negative_prompt: str | list[str] | None = None,
    height: int | None = None,
    width: int | None = None,
    num_frames: int | None = None,
    frame_rate: float | None = None,
    num_inference_steps: int | None = None,
    timesteps: list[int] | None = None,
    guidance_scale: float = 4.0,
    guidance_rescale: float = 0.0,
    noise_scale: float = 0.0,
    num_videos_per_prompt: int | None = 1,
    generator: Generator | list[Generator] | None = None,
    latents: Tensor | None = None,
    audio_latents: Tensor | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    prompt_attention_mask: Tensor | None = None,
    negative_prompt_attention_mask: Tensor | None = None,
    decode_timestep: float | list[float] = 0.0,
    decode_noise_scale: float | list[float] | None = None,
    output_type: str = "np",
    return_dict: bool = True,
    attention_kwargs: dict[str, Any] | None = None,
    max_sequence_length: int | None = None,
)

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

LTX2VideoTransformer3DModel

Bases: Module

A Transformer model for video-like data used in LTX.

Parameters:

Name Type Description Default
in_channels `int`, defaults to `128`

The number of channels in the input.

128
out_channels `int`, defaults to `128`

The number of channels in the output.

128
patch_size `int`, defaults to `1`

The size of the spatial patches to use in the patch embedding layer.

1
patch_size_t `int`, defaults to `1`

The size of the temporal patches to use in the patch embedding layer.

1
num_attention_heads `int`, defaults to `32`

The number of heads to use for multi-head attention.

32
attention_head_dim `int`, defaults to `128`

The number of channels in each head.

128
cross_attention_dim `int`, defaults to `4096`

The number of channels for cross attention heads.

4096
num_layers `int`, defaults to `48`

The number of layers of Transformer blocks to use.

48
activation_fn `str`, defaults to `"gelu-approximate"`

Activation function to use in feed-forward.

'gelu-approximate'
qk_norm `str`, defaults to `"rms_norm_across_heads"`

The normalization layer to use.

'rms_norm_across_heads'
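
Many of the attributes listed for this class are sized by inner_dim, which this page does not define. The sketch below assumes the usual diffusers convention, where inner_dim is the number of attention heads times the per-head dimension. It also evaluates the `1 / inner_dim**0.5` scaling that appears in the scale_shift_table initializer. Both helpers are hypothetical and serve only to make the arithmetic concrete.

```python
def inner_dim(num_attention_heads: int = 32, attention_head_dim: int = 128) -> int:
    """Hidden width of the video stream (assumed: heads * head_dim)."""
    return num_attention_heads * attention_head_dim


def scale_shift_init_std(dim: int) -> float:
    """Std of randn(2, dim) / dim**0.5, since randn has unit std."""
    return 1.0 / dim**0.5


dim = inner_dim()
print(dim, scale_shift_init_std(dim))
```

With the default parameters above, this assumption gives a hidden width of 4096, which matches the default cross_attention_dim.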

audio_caption_projection instance-attribute

audio_caption_projection = PixArtAlphaTextProjection(
    in_features=caption_channels,
    hidden_size=audio_inner_dim,
)

audio_norm_out instance-attribute

audio_norm_out = LayerNorm(
    audio_inner_dim, eps=1e-06, elementwise_affine=False
)

audio_proj_in instance-attribute

audio_proj_in = Linear(audio_in_channels, audio_inner_dim)

audio_proj_out instance-attribute

audio_proj_out = Linear(audio_inner_dim, audio_out_channels)

audio_prompt_adaln instance-attribute

audio_prompt_adaln = LTX2AdaLayerNormSingle(
    audio_inner_dim,
    num_mod_params=2,
    use_additional_conditions=False,
)

audio_rope instance-attribute

audio_rope = LTX2AudioVideoRotaryPosEmbed(
    dim=audio_inner_dim,
    patch_size=audio_patch_size,
    patch_size_t=audio_patch_size_t,
    base_num_frames=audio_pos_embed_max_pos,
    sampling_rate=audio_sampling_rate,
    hop_length=audio_hop_length,
    scale_factors=[audio_scale_factor],
    theta=rope_theta,
    causal_offset=causal_offset,
    modality="audio",
    double_precision=rope_double_precision,
    rope_type=rope_type,
    num_attention_heads=audio_num_attention_heads,
)

audio_scale_shift_table instance-attribute

audio_scale_shift_table = Parameter(
    randn(2, audio_inner_dim) / audio_inner_dim**0.5
)

audio_time_embed instance-attribute

audio_time_embed = LTX2AdaLayerNormSingle(
    audio_inner_dim,
    num_mod_params=audio_num_mod_params,
    use_additional_conditions=False,
)

av_cross_attn_audio_scale_shift instance-attribute

av_cross_attn_audio_scale_shift = LTX2AdaLayerNormSingle(
    audio_inner_dim,
    num_mod_params=4,
    use_additional_conditions=False,
)

av_cross_attn_audio_v2a_gate instance-attribute

av_cross_attn_audio_v2a_gate = LTX2AdaLayerNormSingle(
    audio_inner_dim,
    num_mod_params=1,
    use_additional_conditions=False,
)

av_cross_attn_video_a2v_gate instance-attribute

av_cross_attn_video_a2v_gate = LTX2AdaLayerNormSingle(
    inner_dim,
    num_mod_params=1,
    use_additional_conditions=False,
)

av_cross_attn_video_scale_shift instance-attribute

av_cross_attn_video_scale_shift = LTX2AdaLayerNormSingle(
    inner_dim,
    num_mod_params=4,
    use_additional_conditions=False,
)

caption_projection instance-attribute

caption_projection = PixArtAlphaTextProjection(
    in_features=caption_channels, hidden_size=inner_dim
)

config instance-attribute

config = SimpleNamespace(
    in_channels=in_channels,
    out_channels=out_channels,
    patch_size=patch_size,
    patch_size_t=patch_size_t,
    num_attention_heads=num_attention_heads,
    attention_head_dim=attention_head_dim,
    cross_attention_dim=cross_attention_dim,
    vae_scale_factors=vae_scale_factors,
    pos_embed_max_pos=pos_embed_max_pos,
    base_height=base_height,
    base_width=base_width,
    audio_in_channels=audio_in_channels,
    audio_out_channels=audio_out_channels,
    audio_patch_size=audio_patch_size,
    audio_patch_size_t=audio_patch_size_t,
    audio_num_attention_heads=audio_num_attention_heads,
    audio_attention_head_dim=audio_attention_head_dim,
    audio_cross_attention_dim=audio_cross_attention_dim,
    audio_scale_factor=audio_scale_factor,
    audio_pos_embed_max_pos=audio_pos_embed_max_pos,
    audio_sampling_rate=audio_sampling_rate,
    audio_hop_length=audio_hop_length,
    num_layers=num_layers,
    activation_fn=activation_fn,
    qk_norm=qk_norm,
    norm_elementwise_affine=norm_elementwise_affine,
    norm_eps=norm_eps,
    caption_channels=caption_channels,
    attention_bias=attention_bias,
    attention_out_bias=attention_out_bias,
    rope_theta=rope_theta,
    rope_double_precision=rope_double_precision,
    causal_offset=causal_offset,
    timestep_scale_multiplier=timestep_scale_multiplier,
    cross_attn_timestep_scale_multiplier=cross_attn_timestep_scale_multiplier,
    rope_type=rope_type,
)

cross_attn_audio_rope instance-attribute

cross_attn_audio_rope = LTX2AudioVideoRotaryPosEmbed(
    dim=audio_cross_attention_dim,
    patch_size=audio_patch_size,
    patch_size_t=audio_patch_size_t,
    base_num_frames=cross_attn_pos_embed_max_pos,
    sampling_rate=audio_sampling_rate,
    hop_length=audio_hop_length,
    theta=rope_theta,
    causal_offset=causal_offset,
    modality="audio",
    double_precision=rope_double_precision,
    rope_type=rope_type,
    num_attention_heads=audio_num_attention_heads,
)

cross_attn_rope instance-attribute

cross_attn_rope = LTX2AudioVideoRotaryPosEmbed(
    dim=audio_cross_attention_dim,
    patch_size=patch_size,
    patch_size_t=patch_size_t,
    base_num_frames=cross_attn_pos_embed_max_pos,
    base_height=base_height,
    base_width=base_width,
    theta=rope_theta,
    causal_offset=causal_offset,
    modality="video",
    double_precision=rope_double_precision,
    rope_type=rope_type,
    num_attention_heads=num_attention_heads,
)

gradient_checkpointing instance-attribute

gradient_checkpointing = False

norm_out instance-attribute

norm_out = LayerNorm(
    inner_dim, eps=1e-06, elementwise_affine=False
)

packed_modules_mapping class-attribute instance-attribute

packed_modules_mapping = {
    "to_qkv": ["to_q", "to_k", "to_v"]
}

perturbed_attn instance-attribute

perturbed_attn = perturbed_attn

proj_in instance-attribute

proj_in = Linear(in_channels, inner_dim)

proj_out instance-attribute

proj_out = Linear(inner_dim, out_channels)

prompt_adaln instance-attribute

prompt_adaln = LTX2AdaLayerNormSingle(
    inner_dim,
    num_mod_params=2,
    use_additional_conditions=False,
)

prompt_modulation instance-attribute

prompt_modulation = cross_attn_mod or audio_cross_attn_mod

rope instance-attribute

rope = LTX2AudioVideoRotaryPosEmbed(
    dim=inner_dim,
    patch_size=patch_size,
    patch_size_t=patch_size_t,
    base_num_frames=pos_embed_max_pos,
    base_height=base_height,
    base_width=base_width,
    scale_factors=vae_scale_factors,
    theta=rope_theta,
    causal_offset=causal_offset,
    modality="video",
    double_precision=rope_double_precision,
    rope_type=rope_type,
    num_attention_heads=num_attention_heads,
)

scale_shift_table instance-attribute

scale_shift_table = Parameter(
    randn(2, inner_dim) / inner_dim**0.5
)

time_embed instance-attribute

time_embed = LTX2AdaLayerNormSingle(
    inner_dim,
    num_mod_params=video_num_mod_params,
    use_additional_conditions=False,
)

transformer_blocks instance-attribute

transformer_blocks = ModuleList(
    [
        (
            LTX2VideoTransformerBlock(
                dim=inner_dim,
                num_attention_heads=num_attention_heads,
                attention_head_dim=attention_head_dim,
                cross_attention_dim=cross_attention_dim,
                audio_dim=audio_inner_dim,
                audio_num_attention_heads=audio_num_attention_heads,
                audio_attention_head_dim=audio_attention_head_dim,
                audio_cross_attention_dim=audio_cross_attention_dim,
                video_gated_attn=gated_attn,
                video_cross_attn_adaln=cross_attn_mod,
                audio_gated_attn=audio_gated_attn,
                audio_cross_attn_adaln=audio_cross_attn_mod,
                qk_norm=qk_norm,
                activation_fn=activation_fn,
                attention_bias=attention_bias,
                attention_out_bias=attention_out_bias,
                eps=norm_eps,
                elementwise_affine=norm_elementwise_affine,
                rope_type=rope_type,
                perturbed_attn=perturbed_attn,
                quant_config=quant_config,
                prefix=f"transformer_blocks.{layer_idx}",
            )
        )
        for layer_idx in (range(num_layers))
    ]
)

disable_gradient_checkpointing

disable_gradient_checkpointing() -> None

enable_gradient_checkpointing

enable_gradient_checkpointing() -> None

forward

forward(
    hidden_states: Tensor,
    audio_hidden_states: Tensor,
    encoder_hidden_states: Tensor,
    audio_encoder_hidden_states: Tensor,
    timestep: LongTensor,
    audio_timestep: LongTensor | None = None,
    sigma: Tensor | None = None,
    audio_sigma: Tensor | None = None,
    encoder_attention_mask: Tensor | None = None,
    audio_encoder_attention_mask: Tensor | None = None,
    num_frames: int | None = None,
    height: int | None = None,
    width: int | None = None,
    fps: float = 24.0,
    audio_num_frames: int | None = None,
    video_coords: Tensor | None = None,
    audio_coords: Tensor | None = None,
    attention_kwargs: dict[str, Any] | None = None,
    return_dict: bool = True,
    **kwargs,
) -> Tensor

Forward pass for the LTX-2.0 audiovisual video transformer.

Parameters:

Name Type Description Default
hidden_states `torch.Tensor`

Input patchified video latents of shape (batch_size, num_video_tokens, in_channels).

required
audio_hidden_states `torch.Tensor`

Input patchified audio latents of shape (batch_size, num_audio_tokens, audio_in_channels).

required
encoder_hidden_states `torch.Tensor`

Input video text embeddings of shape (batch_size, text_seq_len, self.config.caption_channels).

required
audio_encoder_hidden_states `torch.Tensor`

Input audio text embeddings of shape (batch_size, text_seq_len, self.config.caption_channels).

required
timestep `torch.Tensor`

Input timestep of shape (batch_size, num_video_tokens). These should already be scaled by self.config.timestep_scale_multiplier.

required
audio_timestep `torch.Tensor`, *optional*

Input timestep of shape (batch_size,) or (batch_size, num_audio_tokens) for audio modulation params. This is only used by certain pipelines such as the I2V pipeline.

None
encoder_attention_mask `torch.Tensor`, *optional*

Optional multiplicative text attention mask of shape (batch_size, text_seq_len).

None
audio_encoder_attention_mask `torch.Tensor`, *optional*

Optional multiplicative text attention mask of shape (batch_size, text_seq_len) for audio modeling.

None
num_frames `int`, *optional*

The number of latent video frames. Used if calculating the video coordinates for RoPE.

None
height `int`, *optional*

The latent video height. Used if calculating the video coordinates for RoPE.

None
width `int`, *optional*

The latent video width. Used if calculating the video coordinates for RoPE.

None
fps `float`, *optional*, defaults to `24.0`

The desired frames per second of the generated video. Used if calculating the video coordinates for RoPE.

24.0
audio_num_frames `int`, *optional*

The number of latent audio frames. Used if calculating the audio coordinates for RoPE.

None
video_coords `torch.Tensor`, *optional*

The video coordinates to be used when calculating the rotary positional embeddings (RoPE) of shape (batch_size, 3, num_video_tokens, 2). If not supplied, this will be calculated inside forward.

None
audio_coords `torch.Tensor`, *optional*

The audio coordinates to be used when calculating the rotary positional embeddings (RoPE) of shape (batch_size, 1, num_audio_tokens, 2). If not supplied, this will be calculated inside forward.

None
attention_kwargs `dict[str, Any]`, *optional*

Optional dict of keyword args to be passed to the attention processor.

None
return_dict `bool`, *optional*, defaults to `True`

Whether to return a dict-like structured output of type AudioVisualModelOutput or a tuple.

True

Returns:

Type Description
Tensor

An AudioVisualModelOutput or a tuple. If return_dict is True, a structured output of type AudioVisualModelOutput is returned; otherwise a tuple is returned whose first element is the denoised video latent patch sequence and whose second element is the denoised audio latent patch sequence.
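
The shapes listed in the parameter table can be collected in one place to sanity-check inputs before calling forward. The following is a minimal sketch, not part of the library: `expected_forward_shapes` is a hypothetical helper, the example sizes are arbitrary, and it assumes one token per latent position (num_video_tokens = num_frames * height * width, num_audio_tokens = audio_num_frames), which the reference above does not state.

```python
def expected_forward_shapes(
    batch_size: int,
    num_frames: int,
    height: int,
    width: int,
    audio_num_frames: int,
    in_channels: int,
    audio_in_channels: int,
    text_seq_len: int,
    caption_channels: int,
) -> dict[str, tuple[int, ...]]:
    """Return the documented input shapes for ``forward``.

    Assumption (not stated in the reference): one token per latent
    position, so num_video_tokens = num_frames * height * width and
    num_audio_tokens = audio_num_frames.
    """
    num_video_tokens = num_frames * height * width
    num_audio_tokens = audio_num_frames
    text_shape = (batch_size, text_seq_len, caption_channels)
    mask_shape = (batch_size, text_seq_len)
    return {
        "hidden_states": (batch_size, num_video_tokens, in_channels),
        "audio_hidden_states": (batch_size, num_audio_tokens, audio_in_channels),
        "encoder_hidden_states": text_shape,
        "audio_encoder_hidden_states": text_shape,
        "timestep": (batch_size, num_video_tokens),
        "encoder_attention_mask": mask_shape,
        "audio_encoder_attention_mask": mask_shape,
        "video_coords": (batch_size, 3, num_video_tokens, 2),
        "audio_coords": (batch_size, 1, num_audio_tokens, 2),
    }


# Arbitrary illustrative sizes, not values taken from any LTX-2 config.
shapes = expected_forward_shapes(
    batch_size=1, num_frames=4, height=8, width=8, audio_num_frames=16,
    in_channels=128, audio_in_channels=128, text_seq_len=32, caption_channels=64,
)
```

Comparing `tuple(tensor.shape)` against these entries before the call is one way to catch a mismatched token count early, provided the token-count assumption holds for the model configuration in use.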

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights from a pretrained model, mapping separate Q/K/V projections into fused QKV projections for self-attention blocks.

Returns:

Type Description
set[str]

Set of parameter names that were successfully loaded.
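
The Q/K/V fusion described above can be illustrated with plain Python lists standing in for weight tensors. This is a sketch under assumptions, not the actual implementation: the parameter names `to_q`, `to_k`, `to_v`, and `to_qkv` are hypothetical, the fused weight is assumed to be the three projections stacked along the output dimension, and unlike the real method the sketch does not restrict fusion to self-attention blocks.

```python
from collections.abc import Iterable

# A weight matrix as a list of rows; rows correspond to output features.
Matrix = list[list[float]]

SHARDS = ("to_q", "to_k", "to_v")  # hypothetical checkpoint names


def fuse_qkv(
    weights: Iterable[tuple[str, Matrix]],
) -> tuple[dict[str, Matrix], set[str]]:
    """Stack separate Q/K/V weights into one fused QKV weight per layer.

    Returns the resulting parameters and the set of parameter names
    that were filled, mirroring the ``set[str]`` return of load_weights.
    """
    params: dict[str, Matrix] = {}
    pending: dict[str, dict[str, Matrix]] = {}
    loaded: set[str] = set()

    for name, weight in weights:
        for shard in SHARDS:
            if f".{shard}." in name:
                # Group the shard under the fused parameter's name.
                fused_name = name.replace(f".{shard}.", ".to_qkv.")
                pending.setdefault(fused_name, {})[shard] = weight
                break
        else:
            # Not a Q/K/V projection: copy through unchanged.
            params[name] = weight
            loaded.add(name)

    for fused_name, shards in pending.items():
        if all(shard in shards for shard in SHARDS):
            # Concatenate rows in Q, K, V order (the output dimension).
            params[fused_name] = shards["to_q"] + shards["to_k"] + shards["to_v"]
            loaded.add(fused_name)

    return params, loaded


checkpoint = [
    ("blocks.0.attn1.to_q.weight", [[1.0, 2.0]]),
    ("blocks.0.attn1.to_k.weight", [[3.0, 4.0]]),
    ("blocks.0.attn1.to_v.weight", [[5.0, 6.0]]),
    ("blocks.0.norm.weight", [[7.0]]),
]
params, loaded = fuse_qkv(checkpoint)
```

In the sketch, a fused parameter is only reported as loaded once all three shards have been seen, so an incomplete checkpoint shows up as a missing name in the returned set.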

create_transformer_from_config

create_transformer_from_config(
    config: dict,
    quant_config: QuantizationConfig | None = None,
) -> LTX2VideoTransformer3DModel

Create an LTX2VideoTransformer3DModel from a config dict.

get_ltx2_post_process_func

get_ltx2_post_process_func(od_config: OmniDiffusionConfig)

load_transformer_config

load_transformer_config(
    model_path: str,
    subfolder: str = "transformer",
    local_files_only: bool = True,
) -> dict

Load the transformer config from a local model directory or the Hugging Face Hub.
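
For the local-directory case, the lookup might resemble the sketch below. It is an illustration only: `load_local_config` is a hypothetical stand-in, the file name `config.json` is an assumption, and the Hub download path that the real function supports is not covered.

```python
import json
from pathlib import Path


def load_local_config(model_path: str, subfolder: str = "transformer") -> dict:
    """Read <model_path>/<subfolder>/config.json from a local directory.

    Assumption: the config file is named ``config.json``. Fetching from
    the Hub is deliberately left out of this sketch.
    """
    config_file = Path(model_path) / subfolder / "config.json"
    if not config_file.is_file():
        raise FileNotFoundError(f"No config found at {config_file}")
    with config_file.open(encoding="utf-8") as f:
        return json.load(f)
```

The resulting dict has the same type that create_transformer_from_config accepts as its `config` argument.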