vllm_omni.diffusion.models.hunyuan_video ¶
HunyuanVideo-1.5 diffusion model components (T2V and I2V).
Modules:
| Name | Description |
|---|---|
| hunyuan_video_15_transformer | |
| pipeline_hunyuan_video_1_5 | |
| pipeline_hunyuan_video_1_5_i2v | |
HunyuanVideo15I2VPipeline ¶
Bases: Module, CFGParallelMixin, SupportImageInput, ProgressBarMixin, DiffusionPipelineProfilerMixin
feature_extractor instance-attribute ¶
feature_extractor = from_pretrained(
model,
subfolder="feature_extractor",
local_files_only=local_files_only,
)
num_channels_latents instance-attribute ¶
num_channels_latents = (
latent_channels if hasattr(vae, "config") else 32
)
scheduler instance-attribute ¶
system_message instance-attribute ¶
system_message = "You are a helpful assistant. Describe the video by detailing the following aspects: 1. The main content and theme of the video. 2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects. 3. Actions, events, behaviors temporal relationships, physical movement changes of the objects. 4. background environment, light, style and atmosphere. 5. camera angles, movements, and transitions used in the video."
tokenizer instance-attribute ¶
tokenizer_2 instance-attribute ¶
transformer instance-attribute ¶
transformer = HunyuanVideo15Transformer3DModel(
od_config=od_config, **transformer_kwargs
)
vae_scale_factor_spatial instance-attribute ¶
vae_scale_factor_spatial = (
spatial_compression_ratio
if hasattr(vae, "spatial_compression_ratio")
else 16
)
vae_scale_factor_temporal instance-attribute ¶
vae_scale_factor_temporal = (
temporal_compression_ratio
if hasattr(vae, "temporal_compression_ratio")
else 4
)
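The three VAE-derived attributes above share one pattern: read a value from the VAE when it is present, otherwise fall back to a constant (32 latent channels, spatial ratio 16, temporal ratio 4). The sketch below reproduces that pattern on stand-in objects. The helper name resolve_vae_factors is hypothetical, and reading latent_channels from vae.config is an assumption, since the rendered expressions omit the attribute prefixes.

```python
from types import SimpleNamespace


def resolve_vae_factors(vae):
    """Return (latent_channels, spatial_ratio, temporal_ratio) with fallbacks."""
    # Assumption: latent_channels lives on vae.config, as in diffusers-style VAEs.
    latent_channels = vae.config.latent_channels if hasattr(vae, "config") else 32
    spatial_ratio = (
        vae.spatial_compression_ratio
        if hasattr(vae, "spatial_compression_ratio")
        else 16
    )
    temporal_ratio = (
        vae.temporal_compression_ratio
        if hasattr(vae, "temporal_compression_ratio")
        else 4
    )
    return latent_channels, spatial_ratio, temporal_ratio


# A stand-in with no VAE attributes falls back to the documented defaults.
bare_vae = SimpleNamespace()
print(resolve_vae_factors(bare_vae))  # (32, 16, 4)

# A stand-in that defines every attribute overrides the defaults.
full_vae = SimpleNamespace(
    config=SimpleNamespace(latent_channels=16),
    spatial_compression_ratio=8,
    temporal_compression_ratio=4,
)
print(resolve_vae_factors(full_vae))  # (16, 8, 4)
```

Note that the channel count is keyed on the presence of config, while the two ratios are keyed on attributes of the VAE object itself, so a partially populated VAE can mix real and fallback values.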
weights_sources instance-attribute ¶
weights_sources = [
ComponentSource(
model_or_path=model,
subfolder="transformer",
revision=None,
prefix="transformer.",
fall_back_to_pt=True,
),
ComponentSource(
model_or_path=model,
subfolder="text_encoder_2",
revision=None,
prefix="text_encoder_2.",
fall_back_to_pt=True,
),
]
encode_prompt ¶
encode_prompt(
prompt: str | list[str],
device: device,
dtype: dtype,
negative_prompt: str | list[str] | None = None,
do_classifier_free_guidance: bool = False,
) -> tuple
forward ¶
forward(
req: OmniDiffusionRequest,
num_inference_steps: int = 50,
guidance_scale: float = 6.0,
height: int = 480,
width: int = 832,
num_frames: int = 121,
output_type: str | None = "np",
generator: Generator | list[Generator] | None = None,
**kwargs,
) -> DiffusionOutput
HunyuanVideo15Pipeline ¶
Bases: Module, CFGParallelMixin, ProgressBarMixin, DiffusionPipelineProfilerMixin
num_channels_latents instance-attribute ¶
num_channels_latents = (
latent_channels if hasattr(vae, "config") else 32
)
scheduler instance-attribute ¶
system_message instance-attribute ¶
system_message = "You are a helpful assistant. Describe the video by detailing the following aspects: 1. The main content and theme of the video. 2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects. 3. Actions, events, behaviors temporal relationships, physical movement changes of the objects. 4. background environment, light, style and atmosphere. 5. camera angles, movements, and transitions used in the video."
tokenizer instance-attribute ¶
tokenizer_2 instance-attribute ¶
transformer instance-attribute ¶
transformer = HunyuanVideo15Transformer3DModel(
od_config=od_config, **transformer_kwargs
)
vae_scale_factor_spatial instance-attribute ¶
vae_scale_factor_spatial = (
spatial_compression_ratio
if hasattr(vae, "spatial_compression_ratio")
else 16
)
vae_scale_factor_temporal instance-attribute ¶
vae_scale_factor_temporal = (
temporal_compression_ratio
if hasattr(vae, "temporal_compression_ratio")
else 4
)
weights_sources instance-attribute ¶
weights_sources = [
ComponentSource(
model_or_path=model,
subfolder="transformer",
revision=None,
prefix="transformer.",
fall_back_to_pt=True,
),
ComponentSource(
model_or_path=model,
subfolder="text_encoder_2",
revision=None,
prefix="text_encoder_2.",
fall_back_to_pt=True,
),
]
encode_prompt ¶
encode_prompt(
prompt: str | list[str],
device: device,
dtype: dtype,
negative_prompt: str | list[str] | None = None,
do_classifier_free_guidance: bool = False,
) -> tuple[
Tensor,
Tensor,
Tensor,
Tensor,
Tensor | None,
Tensor | None,
Tensor | None,
Tensor | None,
]
forward ¶
forward(
req: OmniDiffusionRequest,
num_inference_steps: int = 50,
guidance_scale: float = 6.0,
height: int = 480,
width: int = 832,
num_frames: int = 121,
output_type: str | None = "np",
generator: Generator | list[Generator] | None = None,
**kwargs,
) -> DiffusionOutput
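As a rough illustration of how the forward defaults relate to the VAE fallback values, the sketch below maps height 480, width 832, and num_frames 121 onto a latent grid using 32 channels, a spatial ratio of 16, and a temporal ratio of 4. The helper latent_shape is hypothetical, and the (num_frames - 1) // temporal_ratio + 1 rule is an assumption taken from common causal video VAE conventions; it has not been checked against this pipeline's source.

```python
def latent_shape(
    height,
    width,
    num_frames,
    channels=32,
    spatial_ratio=16,
    temporal_ratio=4,
):
    """Return (channels, latent_frames, latent_height, latent_width)."""
    if height % spatial_ratio or width % spatial_ratio:
        raise ValueError(
            f"height and width must be multiples of {spatial_ratio}, "
            f"got {height}x{width}"
        )
    # Assumed rule: the first frame is kept, the rest are compressed in groups.
    latent_frames = (num_frames - 1) // temporal_ratio + 1
    return (
        channels,
        latent_frames,
        height // spatial_ratio,
        width // spatial_ratio,
    )


print(latent_shape(480, 832, 121))  # (32, 31, 30, 52)
```

Under these assumptions, frame counts of the form 4k + 1 (such as the default 121) divide cleanly, which may be why the default is not a round number.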
HunyuanVideo15Transformer3DModel ¶
Bases: Module
HunyuanVideo-1.5 transformer with dual-stream attention optimized for tensor parallelism (TP).
Ported from the diffusers HunyuanVideo15Transformer3DModel, using vllm-omni tensor-parallel layers for the 54 main transformer blocks.
context_embedder instance-attribute ¶
context_embedder = HunyuanVideo15TokenRefiner(
text_embed_dim,
num_attention_heads,
attention_head_dim,
num_layers=num_refiner_layers,
)
context_embedder_2 instance-attribute ¶
context_embedder_2 = HunyuanVideo15ByT5TextProjection(
text_embed_2_dim, 2048, inner_dim
)
image_embedder instance-attribute ¶
image_embedder = HunyuanVideo15ImageProjection(
image_embed_dim, inner_dim
)
norm_out instance-attribute ¶
packed_modules_mapping class-attribute instance-attribute ¶
packed_modules_mapping = {
"to_qkv": ["to_q", "to_k", "to_v"],
"add_kv_proj": [
"add_q_proj",
"add_k_proj",
"add_v_proj",
],
}
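A mapping of this shape is typically consulted when loading a checkpoint, so that separately stored projections can be routed into one fused parameter. The sketch below shows one way such a lookup could work. The helper fused_name and the example parameter names are hypothetical, and the actual weight loader in vllm-omni may behave differently.

```python
PACKED_MODULES_MAPPING = {
    "to_qkv": ["to_q", "to_k", "to_v"],
    "add_kv_proj": ["add_q_proj", "add_k_proj", "add_v_proj"],
}


def fused_name(param_name, mapping=PACKED_MODULES_MAPPING):
    """Map a checkpoint name to (fused_name, shard_id).

    Returns (param_name, None) when the parameter is not part of a packed module.
    """
    parts = param_name.split(".")
    for fused, shards in mapping.items():
        for shard in shards:
            # Match whole path components only, never substrings.
            if shard in parts:
                parts[parts.index(shard)] = fused
                return ".".join(parts), shard
    return param_name, None


# Hypothetical checkpoint names, for illustration only.
print(fused_name("transformer_blocks.0.attn.to_k.weight"))
print(fused_name("transformer_blocks.0.attn.add_v_proj.bias"))
print(fused_name("proj_out.weight"))
```

Matching on whole dotted components avoids accidental hits such as to_q inside an unrelated name like to_qkv.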
proj_out instance-attribute ¶
rope instance-attribute ¶
rope = HunyuanVideo15RotaryPosEmbed(
patch_size,
patch_size_t,
list(rope_axes_dim),
rope_theta,
)
time_embed instance-attribute ¶
time_embed = HunyuanVideo15TimeEmbedding(
inner_dim, use_meanflow=use_meanflow
)
transformer_blocks instance-attribute ¶
transformer_blocks = ModuleList(
[
(
HunyuanVideo15TransformerBlock(
num_attention_heads,
attention_head_dim,
mlp_ratio=mlp_ratio,
qk_norm=qk_norm,
)
)
for _ in (range(num_layers))
]
)
x_embedder instance-attribute ¶
x_embedder = HunyuanVideo15PatchEmbed(
(patch_size_t, patch_size, patch_size),
in_channels,
inner_dim,
)
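Because the patch embedder takes a (patch_size_t, patch_size, patch_size) triple, the number of tokens it produces follows from dividing each latent axis by the matching patch size. The sketch below works through that arithmetic. The helper num_patch_tokens, the latent grid of 31 x 30 x 52, and the patch sizes used are illustrative assumptions, not values read from the model config.

```python
def num_patch_tokens(latent_frames, latent_height, latent_width, patch_size_t, patch_size):
    """Count tokens produced by non-overlapping 3D patches over a latent grid."""
    if latent_frames % patch_size_t:
        raise ValueError("latent_frames must be divisible by patch_size_t")
    if latent_height % patch_size or latent_width % patch_size:
        raise ValueError("latent height and width must be divisible by patch_size")
    return (
        (latent_frames // patch_size_t)
        * (latent_height // patch_size)
        * (latent_width // patch_size)
    )


# Illustrative latent grid; patch sizes are assumptions for this example.
print(num_patch_tokens(31, 30, 52, patch_size_t=1, patch_size=1))  # 48360
print(num_patch_tokens(31, 30, 52, patch_size_t=1, patch_size=2))  # 12090
```

The token count is what drives attention cost in the transformer blocks, so doubling the spatial patch size cuts the sequence length by a factor of four.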
forward ¶
forward(
hidden_states: Tensor,
timestep: LongTensor,
encoder_hidden_states: Tensor,
encoder_attention_mask: Tensor,
timestep_r: LongTensor | None = None,
encoder_hidden_states_2: Tensor | None = None,
encoder_attention_mask_2: Tensor | None = None,
image_embeds: Tensor | None = None,
image_embeds_mask: Tensor | None = None,
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> Tensor | Transformer2DModelOutput
get_hunyuan_video_15_i2v_post_process_func ¶
get_hunyuan_video_15_i2v_post_process_func(
od_config: OmniDiffusionConfig,
)
get_hunyuan_video_15_i2v_pre_process_func ¶
get_hunyuan_video_15_i2v_pre_process_func(
od_config: OmniDiffusionConfig,
)
Pre-processing function for I2V: loads and resizes the input image.
get_hunyuan_video_15_post_process_func ¶
get_hunyuan_video_15_post_process_func(
od_config: OmniDiffusionConfig,
)