vllm_omni.diffusion.models.ltx2 ¶
Modules:
| Name | Description |
|---|---|
| ltx2_transformer | |
| pipeline_ltx2 | |
| pipeline_ltx2_3 | Fully independent LTX-2.3 pipeline for vLLM-Omni. |
| pipeline_ltx2_3_image2video | Re-exports for LTX-2.3 I2V pipeline variants. |
| pipeline_ltx2_image2video | |
| pipeline_ltx2_latent_upsample | |
LTX23ImageToVideoPipeline ¶
Bases: Module
LTX-2.3 image-to-video pipeline placeholder.
LTX23Pipeline ¶
Bases: Module, CFGParallelMixin, ProgressBarMixin
Fully independent LTX-2.3 pipeline.
Key differences from LTX2Pipeline:
- Text encoding: uses all 49 hidden states from Gemma-3-12B, flattened
- Connectors: uses the padding_side API (not additive_mask)
- Vocoder: uses LTX2VocoderWithBWE (48 kHz output)
- Transformer: passes sigma for prompt_adaln
- CPU offloading: text encoder, connectors, VAE, and vocoder stay on CPU
audio_hop_length instance-attribute ¶
audio_sampling_rate instance-attribute ¶
audio_vae instance-attribute ¶
audio_vae = from_pretrained_with_prefetch(
from_pretrained,
model,
subfolder="audio_vae",
prefetch_list=ltx2_subfolders,
local_files_only=local_files_only,
torch_dtype=dtype,
)
audio_vae_mel_compression_ratio instance-attribute ¶
audio_vae_temporal_compression_ratio instance-attribute ¶
audio_vae_temporal_compression_ratio = (
temporal_compression_ratio
if audio_vae is not None
else 4
)
connectors instance-attribute ¶
connectors = from_pretrained_with_prefetch(
from_pretrained,
model,
subfolder="connectors",
prefetch_list=ltx2_subfolders,
local_files_only=local_files_only,
torch_dtype=dtype,
)
scheduler instance-attribute ¶
text_encoder instance-attribute ¶
text_encoder = from_pretrained_with_prefetch(
from_pretrained,
model,
subfolder="text_encoder",
prefetch_list=ltx2_subfolders,
local_files_only=local_files_only,
torch_dtype=dtype,
)
tokenizer instance-attribute ¶
transformer instance-attribute ¶
transformer = create_transformer_from_config(
transformer_config, quant_config=quant_config
)
transformer_spatial_patch_size instance-attribute ¶
transformer_temporal_patch_size instance-attribute ¶
vae instance-attribute ¶
vae = from_pretrained_with_prefetch(
from_pretrained,
model,
subfolder="vae",
prefetch_list=ltx2_subfolders,
local_files_only=local_files_only,
torch_dtype=dtype,
)
vae_spatial_compression_ratio instance-attribute ¶
vae_temporal_compression_ratio instance-attribute ¶
video_processor instance-attribute ¶
vocoder instance-attribute ¶
vocoder = from_pretrained(
model,
subfolder="vocoder",
torch_dtype=dtype,
local_files_only=local_files_only,
)
weights_sources instance-attribute ¶
weights_sources = [
ComponentSource(
model_or_path=model,
subfolder="transformer",
revision=None,
prefix="transformer.",
fall_back_to_pt=True,
)
]
check_inputs ¶
check_inputs(
prompt,
height,
width,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
)
combine_cfg_noise ¶
combine_cfg_noise(
positive_noise_pred,
negative_noise_pred,
true_cfg_scale,
cfg_normalize=False,
*,
video_latents: Tensor | None = None,
audio_latents: Tensor | None = None,
video_sigma: Tensor | None = None,
audio_sigma: Tensor | None = None,
)
encode_prompt ¶
encode_prompt(
prompt: str | list[str],
negative_prompt: str | list[str] | None = None,
do_classifier_free_guidance: bool = True,
num_videos_per_prompt: int = 1,
prompt_embeds: Tensor | None = None,
negative_prompt_embeds: Tensor | None = None,
prompt_attention_mask: Tensor | None = None,
negative_prompt_attention_mask: Tensor | None = None,
max_sequence_length: int = 1024,
device: device | None = None,
dtype: dtype | None = None,
)
forward ¶
forward(
req: OmniDiffusionRequest,
prompt: str | list[str] | None = None,
negative_prompt: str | list[str] | None = None,
height: int | None = None,
width: int | None = None,
num_frames: int | None = None,
frame_rate: float | None = None,
num_inference_steps: int | None = None,
sigmas: list[float] | None = None,
timesteps: list[int] | None = None,
guidance_scale: float = 4.0,
noise_scale: float = 0.0,
num_videos_per_prompt: int | None = 1,
generator: Generator | list[Generator] | None = None,
latents: Tensor | None = None,
audio_latents: Tensor | None = None,
prompt_embeds: Tensor | None = None,
negative_prompt_embeds: Tensor | None = None,
prompt_attention_mask: Tensor | None = None,
negative_prompt_attention_mask: Tensor | None = None,
decode_timestep: float | list[float] = 0.0,
decode_noise_scale: float | list[float] | None = None,
output_type: str = "np",
return_dict: bool = True,
attention_kwargs: dict[str, Any] | None = None,
max_sequence_length: int | None = None,
) -> DiffusionOutput
predict_noise_with_parallel_cfg ¶
predict_noise_with_parallel_cfg(
true_cfg_scale: float,
positive_kwargs: dict[str, Any],
negative_kwargs: dict[str, Any],
cfg_normalize: bool = True,
output_slice: int | None = None,
*,
video_latents: Tensor | None = None,
audio_latents: Tensor | None = None,
video_sigma: Tensor | None = None,
audio_sigma: Tensor | None = None,
) -> tuple[Tensor, Tensor]
prepare_audio_latents ¶
prepare_audio_latents(
batch_size: int = 1,
num_channels_latents: int = 8,
audio_latent_length: int = 1,
num_mel_bins: int = 64,
noise_scale: float = 0.0,
dtype: dtype | None = None,
device: device | None = None,
generator: Generator | list[Generator] | None = None,
latents: Tensor | None = None,
) -> tuple[Tensor, int, int]
prepare_latents ¶
prepare_latents(
batch_size: int = 1,
num_channels_latents: int = 128,
height: int = 512,
width: int = 768,
num_frames: int = 121,
noise_scale: float = 0.0,
dtype: dtype | None = None,
device: device | None = None,
generator: Generator | None = None,
latents: Tensor | None = None,
) -> Tensor
LTX2I2VDMD2Pipeline ¶
Bases: DMD2PipelineMixin, LTX2ImageToVideoPipeline
LTX-2 I2V pipeline for FastGen DMD2-distilled models.
LTX2ImageToVideoPipeline ¶
Bases: LTX2Pipeline
video_processor instance-attribute ¶
video_processor = VideoProcessor(
vae_scale_factor=vae_spatial_compression_ratio,
resample="bilinear",
)
check_inputs ¶
check_inputs(
image,
height,
width,
prompt,
latents=None,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
)
forward ¶
forward(
req: OmniDiffusionRequest,
image: Image | Tensor | None = None,
prompt: str | list[str] | None = None,
negative_prompt: str | list[str] | None = None,
height: int | None = None,
width: int | None = None,
num_frames: int | None = None,
frame_rate: float | None = None,
num_inference_steps: int | None = None,
sigmas: list[float] | None = None,
timesteps: list[int] | None = None,
guidance_scale: float = 4.0,
guidance_rescale: float = 0.0,
noise_scale: float = 0.0,
num_videos_per_prompt: int | None = 1,
generator: Generator | list[Generator] | None = None,
latents: Tensor | None = None,
audio_latents: Tensor | None = None,
prompt_embeds: Tensor | None = None,
negative_prompt_embeds: Tensor | None = None,
prompt_attention_mask: Tensor | None = None,
negative_prompt_attention_mask: Tensor | None = None,
decode_timestep: float | list[float] = 0.0,
decode_noise_scale: float | list[float] | None = None,
output_type: str = "np",
return_dict: bool = True,
attention_kwargs: dict[str, Any] | None = None,
max_sequence_length: int | None = None,
) -> DiffusionOutput
prepare_latents ¶
prepare_latents(
image: Tensor | None = None,
batch_size: int = 1,
num_channels_latents: int = 128,
height: int = 512,
width: int = 768,
num_frames: int = 121,
noise_scale: float = 0.0,
dtype: dtype | None = None,
device: device | None = None,
generator: Generator | list[Generator] | None = None,
latents: Tensor | None = None,
) -> tuple[Tensor, Tensor]
LTX2ImageToVideoTwoStagesPipeline ¶
Bases: Module, SupportsComponentDiscovery
LTX2ImageToVideoTwoStagesPipeline performs two-stage image-to-video generation.
lora_manager instance-attribute ¶
lora_manager = DiffusionLoRAManager(
pipeline=pipe,
device=device,
dtype=dtype,
max_cached_adapters=max_cpu_loras,
)
upsample_pipe instance-attribute ¶
upsample_pipe = LTX2LatentUpsamplePipeline(
vae=vae, od_config=od_config
)
weights_sources instance-attribute ¶
weights_sources = [
ComponentSource(
model_or_path=model,
subfolder="transformer",
revision=None,
prefix="pipe.transformer.",
fall_back_to_pt=True,
)
]
forward ¶
forward(
req: OmniDiffusionRequest,
image: Image | Tensor | None = None,
prompt: str | list[str] | None = None,
negative_prompt: str | list[str] | None = None,
height: int | None = None,
width: int | None = None,
num_frames: int | None = None,
frame_rate: float | None = None,
num_inference_steps: int | None = None,
sigmas: list[float] | None = None,
timesteps: list[int] | None = None,
guidance_scale: float = 4.0,
guidance_rescale: float = 0.0,
noise_scale: float = 0.0,
num_videos_per_prompt: int | None = 1,
generator: Generator | list[Generator] | None = None,
latents: Tensor | None = None,
audio_latents: Tensor | None = None,
prompt_embeds: Tensor | None = None,
negative_prompt_embeds: Tensor | None = None,
prompt_attention_mask: Tensor | None = None,
negative_prompt_attention_mask: Tensor | None = None,
decode_timestep: float | list[float] = 0.0,
decode_noise_scale: float | list[float] | None = None,
output_type: str = "np",
return_dict: bool = True,
attention_kwargs: dict[str, Any] | None = None,
max_sequence_length: int | None = None,
)
LTX2LatentUpsamplePipeline ¶
Bases: Module
vae_spatial_compression_ratio instance-attribute ¶
vae_spatial_compression_ratio = (
spatial_compression_ratio
if getattr(self, "vae", None) is not None
else 32
)
vae_temporal_compression_ratio instance-attribute ¶
vae_temporal_compression_ratio = (
temporal_compression_ratio
if getattr(self, "vae", None) is not None
else 8
)
video_processor instance-attribute ¶
adain_filter_latent ¶
adain_filter_latent(
latents: Tensor,
reference_latents: Tensor,
factor: float = 1.0,
)
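The name `adain_filter_latent` points to adaptive instance normalization (AdaIN), which usually rescales a tensor so that its mean and standard deviation match those of a reference, with `factor` blending between the input and the matched result. The sketch below shows that idea on a single channel stored as a plain list. The helper `adain_channel`, the use of the sample standard deviation, and the per-channel scope are assumptions for illustration; the actual method operates on tensors and may differ in detail.

```python
from statistics import mean, stdev


def adain_channel(values, reference, factor=1.0):
    """Match the mean/std of `values` to `reference`, then blend by `factor`.

    Hypothetical single-channel helper; assumes both inputs have at least
    two elements and non-zero spread.
    """
    v_mean, v_std = mean(values), stdev(values)
    r_mean, r_std = mean(reference), stdev(reference)
    # Normalize to zero mean / unit std, then apply the reference statistics.
    matched = [(v - v_mean) / v_std * r_std + r_mean for v in values]
    # Linear interpolation: factor=0 keeps the input, factor=1 is fully matched.
    return [v + factor * (m - v) for v, m in zip(values, matched)]


fully_matched = adain_channel([0.0, 1.0, 2.0], [10.0, 20.0, 30.0], factor=1.0)
unchanged = adain_channel([0.0, 1.0, 2.0], [10.0, 20.0, 30.0], factor=0.0)
```

With `factor=0.0` the input passes through untouched, which is consistent with `adain_factor` defaulting to `0.0` in the upsample pipeline's `forward`.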
forward ¶
forward(
video: list[PipelineImageInput] | None = None,
height: int = 512,
width: int = 768,
num_frames: int = 121,
spatial_patch_size: int = 1,
temporal_patch_size: int = 1,
latents: Tensor | None = None,
latents_normalized: bool = False,
decode_timestep: float | list[float] = 0.0,
decode_noise_scale: float | list[float] | None = None,
adain_factor: float = 0.0,
tone_map_compression_ratio: float = 0.0,
generator: Generator | list[Generator] | None = None,
output_type: str | None = "pil",
return_dict: bool = True,
)
prepare_latents ¶
prepare_latents(
video: Tensor | None = None,
batch_size: int = 1,
num_frames: int = 121,
height: int = 512,
width: int = 768,
spatial_patch_size: int = 1,
temporal_patch_size: int = 1,
dtype: dtype | None = None,
device: device | None = None,
generator: Generator | None = None,
latents: Tensor | None = None,
) -> Tensor
LTX2Pipeline ¶
Bases: Module, CFGParallelMixin, ProgressBarMixin
audio_hop_length instance-attribute ¶
audio_hop_length = (
mel_hop_length
if getattr(self, "audio_vae", None) is not None
else 160
)
audio_sampling_rate instance-attribute ¶
audio_sampling_rate = (
sample_rate
if getattr(self, "audio_vae", None) is not None
else 16000
)
audio_vae_mel_compression_ratio instance-attribute ¶
audio_vae_mel_compression_ratio = (
mel_compression_ratio
if getattr(self, "audio_vae", None) is not None
else 4
)
audio_vae_temporal_compression_ratio instance-attribute ¶
audio_vae_temporal_compression_ratio = (
temporal_compression_ratio
if getattr(self, "audio_vae", None) is not None
else 4
)
scheduler instance-attribute ¶
tokenizer instance-attribute ¶
transformer instance-attribute ¶
transformer = create_transformer_from_config(
transformer_config, quant_config=quant_config
)
transformer_spatial_patch_size instance-attribute ¶
transformer_spatial_patch_size = (
patch_size
if getattr(self, "transformer", None) is not None
else 1
)
transformer_temporal_patch_size instance-attribute ¶
transformer_temporal_patch_size = (
patch_size_t
if getattr(self, "transformer", None) is not None
else 1
)
vae_spatial_compression_ratio instance-attribute ¶
vae_spatial_compression_ratio = (
spatial_compression_ratio
if getattr(self, "vae", None) is not None
else 32
)
vae_temporal_compression_ratio instance-attribute ¶
vae_temporal_compression_ratio = (
temporal_compression_ratio
if getattr(self, "vae", None) is not None
else 8
)
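The compression-ratio, patch-size, and audio attributes above all follow one pattern: take the value from the loaded component when it exists, otherwise use a hard-coded fallback. A minimal sketch of that pattern follows. The helper `ratio_or_default` and the assumption that values live on a `config` attribute of the component are illustrative only and are not part of vLLM-Omni.

```python
from types import SimpleNamespace


def ratio_or_default(owner, component_name, attr, default):
    """Return component.config.<attr> if the component is loaded, else default.

    `component.config` is an assumption made for this sketch; the real
    pipeline may read the value from a different place.
    """
    component = getattr(owner, component_name, None)
    if component is None:
        return default
    return getattr(component.config, attr)


# Pipeline without a VAE: falls back to the documented defaults.
bare = SimpleNamespace()
spatial_fallback = ratio_or_default(bare, "vae", "spatial_compression_ratio", 32)
temporal_fallback = ratio_or_default(bare, "vae", "temporal_compression_ratio", 8)

# Pipeline with a stand-in VAE: the component's own value wins.
fake_vae = SimpleNamespace(config=SimpleNamespace(spatial_compression_ratio=16))
loaded = SimpleNamespace(vae=fake_vae)
spatial_loaded = ratio_or_default(loaded, "vae", "spatial_compression_ratio", 32)
```

Using `getattr(..., None)` rather than direct attribute access keeps the lookup safe when a component was never assigned, for example when it is skipped during loading.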
video_processor instance-attribute ¶
weights_sources instance-attribute ¶
weights_sources = [
ComponentSource(
model_or_path=model,
subfolder="transformer",
revision=None,
prefix="transformer.",
fall_back_to_pt=True,
)
]
check_inputs ¶
check_inputs(
prompt,
height,
width,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
)
combine_cfg_noise ¶
Per-element CFG combine with guidance_rescale support.
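For orientation, the standard classifier-free guidance combine and the commonly used rescale step can be sketched on flat lists as below. This is a sketch of the general technique, not of this method's implementation: the function `combine_cfg` is hypothetical, and how the pipeline applies the combine per element, or uses the sigma and latent arguments, is not shown here.

```python
from statistics import stdev


def combine_cfg(positive, negative, scale, guidance_rescale=0.0):
    """Standard CFG: negative + scale * (positive - negative).

    When guidance_rescale > 0, the guided prediction is rescaled to the
    standard deviation of the positive prediction and blended back in.
    Assumes at least two values and a non-constant guided prediction.
    """
    guided = [n + scale * (p - n) for p, n in zip(positive, negative)]
    if guidance_rescale > 0.0:
        ratio = stdev(positive) / stdev(guided)
        rescaled = [g * ratio for g in guided]
        guided = [
            guidance_rescale * r + (1.0 - guidance_rescale) * g
            for r, g in zip(rescaled, guided)
        ]
    return guided


plain = combine_cfg([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], scale=4.0)
rescaled = combine_cfg([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], scale=4.0,
                       guidance_rescale=1.0)
```

A scale of `1.0` returns the positive prediction unchanged, and `guidance_rescale=0.0` (the default in the `forward` signatures) disables the rescale step.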
encode_prompt ¶
encode_prompt(
prompt: str | list[str],
negative_prompt: str | list[str] | None = None,
do_classifier_free_guidance: bool = True,
num_videos_per_prompt: int = 1,
prompt_embeds: Tensor | None = None,
negative_prompt_embeds: Tensor | None = None,
prompt_attention_mask: Tensor | None = None,
negative_prompt_attention_mask: Tensor | None = None,
max_sequence_length: int = 1024,
scale_factor: int = 8,
device: device | None = None,
dtype: dtype | None = None,
)
forward ¶
forward(
req: OmniDiffusionRequest,
prompt: str | list[str] | None = None,
negative_prompt: str | list[str] | None = None,
height: int | None = None,
width: int | None = None,
num_frames: int | None = None,
frame_rate: float | None = None,
num_inference_steps: int | None = None,
sigmas: list[float] | None = None,
timesteps: list[int] | None = None,
guidance_scale: float = 4.0,
guidance_rescale: float = 0.0,
noise_scale: float = 0.0,
num_videos_per_prompt: int | None = 1,
generator: Generator | list[Generator] | None = None,
latents: Tensor | None = None,
audio_latents: Tensor | None = None,
prompt_embeds: Tensor | None = None,
negative_prompt_embeds: Tensor | None = None,
prompt_attention_mask: Tensor | None = None,
negative_prompt_attention_mask: Tensor | None = None,
decode_timestep: float | list[float] = 0.0,
decode_noise_scale: float | list[float] | None = None,
output_type: str = "np",
return_dict: bool = True,
attention_kwargs: dict[str, Any] | None = None,
max_sequence_length: int | None = None,
) -> DiffusionOutput
prepare_audio_latents ¶
prepare_audio_latents(
batch_size: int = 1,
num_channels_latents: int = 8,
audio_latent_length: int = 1,
num_mel_bins: int = 64,
noise_scale: float = 0.0,
dtype: dtype | None = None,
device: device | None = None,
generator: Generator | list[Generator] | None = None,
latents: Tensor | None = None,
) -> tuple[Tensor, int, int]
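The `audio_latent_length` argument has to line up with the video duration. One plausible way to derive it from the fallback audio settings listed for this class (16000 Hz sampling rate, hop length 160, temporal compression 4, mel compression 4) is sketched below. Both helper functions and the rounding rule are assumptions for illustration; the pipeline's own computation may differ.

```python
def audio_latent_length(num_frames=121, frame_rate=24.0,
                        sampling_rate=16000, hop_length=160,
                        temporal_ratio=4):
    """Estimate how many audio latent frames cover the video duration."""
    duration_s = num_frames / frame_rate
    # Mel frames per second, divided by the audio VAE temporal compression.
    latents_per_second = sampling_rate / hop_length / temporal_ratio
    return round(duration_s * latents_per_second)


def latent_mel_bins(num_mel_bins=64, mel_ratio=4):
    """Mel bins remaining after the audio VAE's mel compression."""
    return num_mel_bins // mel_ratio


length = audio_latent_length()
bins = latent_mel_bins()
```

Under these assumptions, 121 frames at 24 fps span about 5.04 seconds, and at 25 audio latents per second that rounds to 126 latent frames.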
prepare_latents ¶
prepare_latents(
batch_size: int = 1,
num_channels_latents: int = 128,
height: int = 512,
width: int = 768,
num_frames: int = 121,
noise_scale: float = 0.0,
dtype: dtype | None = None,
device: device | None = None,
generator: Generator | None = None,
latents: Tensor | None = None,
) -> Tensor
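To relate the pixel-space defaults of `prepare_latents` to the latent grid, the sketch below applies the fallback compression ratios documented above (32 spatial, 8 temporal). The helper `latent_grid` is hypothetical, and the `(num_frames - 1) // ratio + 1` rule is an assumption based on the convention of causal video VAEs, where the first frame is encoded on its own.

```python
def latent_grid(height=512, width=768, num_frames=121,
                spatial_ratio=32, temporal_ratio=8):
    """Estimate the latent (frames, height, width) grid for a video request.

    Assumes integer division by the VAE ratios and a causal first frame,
    i.e. latent_frames = (num_frames - 1) // temporal_ratio + 1.
    """
    latent_frames = (num_frames - 1) // temporal_ratio + 1
    latent_height = height // spatial_ratio
    latent_width = width // spatial_ratio
    return latent_frames, latent_height, latent_width


grid = latent_grid()
# Token count when both transformer patch sizes are 1 (the fallback values).
tokens = grid[0] * grid[1] * grid[2]
```

Under these assumptions, the default 512x768, 121-frame request maps to a 16x16x24 latent grid, or 6144 tokens.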
LTX2T2VDMD2Pipeline ¶
LTX2TwoStagesPipeline ¶
Bases: Module, SupportsComponentDiscovery
LTX2TwoStagesPipeline performs two-stage video generation.
lora_manager instance-attribute ¶
lora_manager = DiffusionLoRAManager(
pipeline=pipe,
device=device,
dtype=dtype,
max_cached_adapters=max_cpu_loras,
)
upsample_pipe instance-attribute ¶
upsample_pipe = LTX2LatentUpsamplePipeline(
vae=vae, od_config=od_config
)
weights_sources instance-attribute ¶
weights_sources = [
ComponentSource(
model_or_path=model,
subfolder="transformer",
revision=None,
prefix="pipe.transformer.",
fall_back_to_pt=True,
)
]
forward ¶
forward(
req: OmniDiffusionRequest,
prompt: str | list[str] | None = None,
negative_prompt: str | list[str] | None = None,
height: int | None = None,
width: int | None = None,
num_frames: int | None = None,
frame_rate: float | None = None,
num_inference_steps: int | None = None,
timesteps: list[int] | None = None,
guidance_scale: float = 4.0,
guidance_rescale: float = 0.0,
noise_scale: float = 0.0,
num_videos_per_prompt: int | None = 1,
generator: Generator | list[Generator] | None = None,
latents: Tensor | None = None,
audio_latents: Tensor | None = None,
prompt_embeds: Tensor | None = None,
negative_prompt_embeds: Tensor | None = None,
prompt_attention_mask: Tensor | None = None,
negative_prompt_attention_mask: Tensor | None = None,
decode_timestep: float | list[float] = 0.0,
decode_noise_scale: float | list[float] | None = None,
output_type: str = "np",
return_dict: bool = True,
attention_kwargs: dict[str, Any] | None = None,
max_sequence_length: int | None = None,
)
LTX2VideoTransformer3DModel ¶
Bases: Module
A Transformer model for video-like data used in LTX.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| in_channels | `int`, defaults to `128` | The number of channels in the input. | 128 |
| out_channels | `int`, defaults to `128` | The number of channels in the output. | 128 |
| patch_size | `int`, defaults to `1` | The size of the spatial patches to use in the patch embedding layer. | 1 |
| patch_size_t | `int`, defaults to `1` | The size of the temporal patches to use in the patch embedding layer. | 1 |
| num_attention_heads | `int`, defaults to `32` | The number of heads to use for multi-head attention. | 32 |
| attention_head_dim | `int`, defaults to `128` | The number of channels in each head. | 128 |
| cross_attention_dim | `int`, defaults to `4096` | The number of channels for cross attention heads. | 4096 |
| num_layers | `int`, defaults to `48` | The number of layers of Transformer blocks to use. | 48 |
| activation_fn | `str`, defaults to `"gelu-approximate"` | Activation function to use in feed-forward. | 'gelu-approximate' |
| qk_norm | `str`, defaults to `"rms_norm_across_heads"` | The normalization layer to use. | 'rms_norm_across_heads' |
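Several attributes below are built from `inner_dim`, which this page does not define. In multi-head attention the model width is conventionally the number of heads times the per-head dimension; the sketch below applies that convention to the defaults in the table. Treat the formula as an assumption about this model, and the function `inner_dim` as a hypothetical helper.

```python
def inner_dim(num_attention_heads=32, attention_head_dim=128):
    """Model width under the usual heads * head_dim convention (assumed)."""
    return num_attention_heads * attention_head_dim


video_inner_dim = inner_dim()
```

With the default values this gives 4096, the same number as the `cross_attention_dim` default.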
audio_caption_projection instance-attribute ¶
audio_caption_projection = PixArtAlphaTextProjection(
in_features=caption_channels,
hidden_size=audio_inner_dim,
)
audio_norm_out instance-attribute ¶
audio_prompt_adaln instance-attribute ¶
audio_prompt_adaln = LTX2AdaLayerNormSingle(
audio_inner_dim,
num_mod_params=2,
use_additional_conditions=False,
)
audio_rope instance-attribute ¶
audio_rope = LTX2AudioVideoRotaryPosEmbed(
dim=audio_inner_dim,
patch_size=audio_patch_size,
patch_size_t=audio_patch_size_t,
base_num_frames=audio_pos_embed_max_pos,
sampling_rate=audio_sampling_rate,
hop_length=audio_hop_length,
scale_factors=[audio_scale_factor],
theta=rope_theta,
causal_offset=causal_offset,
modality="audio",
double_precision=rope_double_precision,
rope_type=rope_type,
num_attention_heads=audio_num_attention_heads,
)
audio_scale_shift_table instance-attribute ¶
audio_time_embed instance-attribute ¶
audio_time_embed = LTX2AdaLayerNormSingle(
audio_inner_dim,
num_mod_params=audio_num_mod_params,
use_additional_conditions=False,
)
av_cross_attn_audio_scale_shift instance-attribute ¶
av_cross_attn_audio_scale_shift = LTX2AdaLayerNormSingle(
audio_inner_dim,
num_mod_params=4,
use_additional_conditions=False,
)
av_cross_attn_audio_v2a_gate instance-attribute ¶
av_cross_attn_audio_v2a_gate = LTX2AdaLayerNormSingle(
audio_inner_dim,
num_mod_params=1,
use_additional_conditions=False,
)
av_cross_attn_video_a2v_gate instance-attribute ¶
av_cross_attn_video_a2v_gate = LTX2AdaLayerNormSingle(
inner_dim,
num_mod_params=1,
use_additional_conditions=False,
)
av_cross_attn_video_scale_shift instance-attribute ¶
av_cross_attn_video_scale_shift = LTX2AdaLayerNormSingle(
inner_dim,
num_mod_params=4,
use_additional_conditions=False,
)
caption_projection instance-attribute ¶
caption_projection = PixArtAlphaTextProjection(
in_features=caption_channels, hidden_size=inner_dim
)
config instance-attribute ¶
config = SimpleNamespace(
in_channels=in_channels,
out_channels=out_channels,
patch_size=patch_size,
patch_size_t=patch_size_t,
num_attention_heads=num_attention_heads,
attention_head_dim=attention_head_dim,
cross_attention_dim=cross_attention_dim,
vae_scale_factors=vae_scale_factors,
pos_embed_max_pos=pos_embed_max_pos,
base_height=base_height,
base_width=base_width,
audio_in_channels=audio_in_channels,
audio_out_channels=audio_out_channels,
audio_patch_size=audio_patch_size,
audio_patch_size_t=audio_patch_size_t,
audio_num_attention_heads=audio_num_attention_heads,
audio_attention_head_dim=audio_attention_head_dim,
audio_cross_attention_dim=audio_cross_attention_dim,
audio_scale_factor=audio_scale_factor,
audio_pos_embed_max_pos=audio_pos_embed_max_pos,
audio_sampling_rate=audio_sampling_rate,
audio_hop_length=audio_hop_length,
num_layers=num_layers,
activation_fn=activation_fn,
qk_norm=qk_norm,
norm_elementwise_affine=norm_elementwise_affine,
norm_eps=norm_eps,
caption_channels=caption_channels,
attention_bias=attention_bias,
attention_out_bias=attention_out_bias,
rope_theta=rope_theta,
rope_double_precision=rope_double_precision,
causal_offset=causal_offset,
timestep_scale_multiplier=timestep_scale_multiplier,
cross_attn_timestep_scale_multiplier=cross_attn_timestep_scale_multiplier,
rope_type=rope_type,
)
cross_attn_audio_rope instance-attribute ¶
cross_attn_audio_rope = LTX2AudioVideoRotaryPosEmbed(
dim=audio_cross_attention_dim,
patch_size=audio_patch_size,
patch_size_t=audio_patch_size_t,
base_num_frames=cross_attn_pos_embed_max_pos,
sampling_rate=audio_sampling_rate,
hop_length=audio_hop_length,
theta=rope_theta,
causal_offset=causal_offset,
modality="audio",
double_precision=rope_double_precision,
rope_type=rope_type,
num_attention_heads=audio_num_attention_heads,
)
cross_attn_rope instance-attribute ¶
cross_attn_rope = LTX2AudioVideoRotaryPosEmbed(
dim=audio_cross_attention_dim,
patch_size=patch_size,
patch_size_t=patch_size_t,
base_num_frames=cross_attn_pos_embed_max_pos,
base_height=base_height,
base_width=base_width,
theta=rope_theta,
causal_offset=causal_offset,
modality="video",
double_precision=rope_double_precision,
rope_type=rope_type,
num_attention_heads=num_attention_heads,
)
norm_out instance-attribute ¶
packed_modules_mapping class-attribute instance-attribute ¶
prompt_adaln instance-attribute ¶
prompt_adaln = LTX2AdaLayerNormSingle(
inner_dim,
num_mod_params=2,
use_additional_conditions=False,
)
rope instance-attribute ¶
rope = LTX2AudioVideoRotaryPosEmbed(
dim=inner_dim,
patch_size=patch_size,
patch_size_t=patch_size_t,
base_num_frames=pos_embed_max_pos,
base_height=base_height,
base_width=base_width,
scale_factors=vae_scale_factors,
theta=rope_theta,
causal_offset=causal_offset,
modality="video",
double_precision=rope_double_precision,
rope_type=rope_type,
num_attention_heads=num_attention_heads,
)
scale_shift_table instance-attribute ¶
time_embed instance-attribute ¶
time_embed = LTX2AdaLayerNormSingle(
inner_dim,
num_mod_params=video_num_mod_params,
use_additional_conditions=False,
)
transformer_blocks instance-attribute ¶
transformer_blocks = ModuleList(
[
(
LTX2VideoTransformerBlock(
dim=inner_dim,
num_attention_heads=num_attention_heads,
attention_head_dim=attention_head_dim,
cross_attention_dim=cross_attention_dim,
audio_dim=audio_inner_dim,
audio_num_attention_heads=audio_num_attention_heads,
audio_attention_head_dim=audio_attention_head_dim,
audio_cross_attention_dim=audio_cross_attention_dim,
video_gated_attn=gated_attn,
video_cross_attn_adaln=cross_attn_mod,
audio_gated_attn=audio_gated_attn,
audio_cross_attn_adaln=audio_cross_attn_mod,
qk_norm=qk_norm,
activation_fn=activation_fn,
attention_bias=attention_bias,
attention_out_bias=attention_out_bias,
eps=norm_eps,
elementwise_affine=norm_elementwise_affine,
rope_type=rope_type,
perturbed_attn=perturbed_attn,
quant_config=quant_config,
prefix=f"transformer_blocks.{layer_idx}",
)
)
for layer_idx in (range(num_layers))
]
)
forward ¶
forward(
hidden_states: Tensor,
audio_hidden_states: Tensor,
encoder_hidden_states: Tensor,
audio_encoder_hidden_states: Tensor,
timestep: LongTensor,
audio_timestep: LongTensor | None = None,
sigma: Tensor | None = None,
audio_sigma: Tensor | None = None,
encoder_attention_mask: Tensor | None = None,
audio_encoder_attention_mask: Tensor | None = None,
num_frames: int | None = None,
height: int | None = None,
width: int | None = None,
fps: float = 24.0,
audio_num_frames: int | None = None,
video_coords: Tensor | None = None,
audio_coords: Tensor | None = None,
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
**kwargs,
) -> Tensor
Forward pass for the LTX-2.0 audiovisual video transformer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | `torch.Tensor` | Input patchified video latents. | required |
| audio_hidden_states | `torch.Tensor` | Input patchified audio latents. | required |
| encoder_hidden_states | `torch.Tensor` | Input video text embeddings. | required |
| audio_encoder_hidden_states | `torch.Tensor` | Input audio text embeddings. | required |
| timestep | `torch.Tensor` | Input timestep. | required |
| audio_timestep | `torch.Tensor`, *optional* | Input audio timestep. | None |
| encoder_attention_mask | `torch.Tensor`, *optional* | Optional multiplicative text attention mask. | None |
| audio_encoder_attention_mask | `torch.Tensor`, *optional* | Optional multiplicative text attention mask. | None |
| num_frames | `int`, *optional* | The number of latent video frames. Used if calculating the video coordinates for RoPE. | None |
| height | `int`, *optional* | The latent video height. Used if calculating the video coordinates for RoPE. | None |
| width | `int`, *optional* | The latent video width. Used if calculating the video coordinates for RoPE. | None |
| fps | `float` | | 24.0 |
| audio_num_frames | `int \| None` | | None |
| video_coords | `torch.Tensor`, *optional* | The video coordinates to be used when calculating the rotary positional embeddings (RoPE). | None |
| audio_coords | `torch.Tensor`, *optional* | The audio coordinates to be used when calculating the rotary positional embeddings (RoPE). | None |
| attention_kwargs | `Dict[str, Any]`, *optional* | Optional dict of keyword args to be passed to the attention processor. | None |
| return_dict | `bool`, *optional*, defaults to `True` | Whether to return a dict-like structured output. | True |
Returns:
| Type | Description |
|---|---|
| Tensor | |
create_transformer_from_config ¶
create_transformer_from_config(
config: dict,
quant_config: QuantizationConfig | None = None,
) -> LTX2VideoTransformer3DModel
Create LTX2VideoTransformer3DModel from config dict.