vllm_omni.model_executor.models.hunyuan_image3 ¶
Modules:
| Name | Description |
|---|---|
| autoencoder_kl_3d | Reference code |
| hunyuan_image3 | |
| pipeline | HunyuanImage3 pipeline topology. |
| siglip2 | SigLIP2 Vision Transformer for HunyuanImage3, rewritten in vLLM style. |
HunyuanImage3ForConditionalGeneration ¶
Bases: Module, SupportsMultiModal, SupportsLoRA, SupportsPP, SupportsMRoPE
HunyuanImage3.0 model for conditional image generation.
This is the main entry point for HunyuanImage3.0 in vLLM. It wraps:
- HunyuanModel
- VAE Encoder (AutoencoderKLConv3D + TimestepEmbedder + UNetDown)
- ViT Encoder (Siglip2VisionTransformer + LightProjector)
- LM Head for token prediction
Supports:
- Text-to-Text and Image-to-Text generation
- Tensor Parallelism
HunyuanImage3Inputs class-attribute instance-attribute ¶
HunyuanImage3Inputs: TypeAlias = HunyuanImage3PixelInputs
lm_head instance-attribute ¶
lm_head = ParallelLMHead(
unpadded_vocab_size,
hidden_size,
org_num_embeddings=vocab_size,
padding_size=DEFAULT_VOCAB_PADDING_SIZE,
quant_config=quant_config,
prefix=maybe_prefix(prefix, "lm_head"),
)
logits_processor instance-attribute ¶
packed_modules_mapping class-attribute instance-attribute ¶
packed_modules_mapping = {
"qkv_proj": ["q_proj", "k_proj", "v_proj"],
"gate_up_proj": ["gate_proj", "up_proj"],
}
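As a rough illustration of what this mapping expresses, the sketch below rewrites a per-projection checkpoint weight name to the name of the fused module that holds it. The helper `to_packed_name` and the example weight names are hypothetical and are not part of vLLM; the real weight loader handles sharding and offsets that are omitted here.

```python
from __future__ import annotations

# Copied from the class attribute documented above.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def to_packed_name(checkpoint_name: str) -> tuple[str, str | None]:
    """Hypothetical helper: return (fused parameter name, shard name).

    If the name does not refer to a packed sub-module, it is returned
    unchanged together with None.
    """
    parts = checkpoint_name.split(".")
    for packed, shards in packed_modules_mapping.items():
        for shard in shards:
            if shard in parts:
                # Swap the per-projection component for the fused module name.
                parts[parts.index(shard)] = packed
                return ".".join(parts), shard
    return checkpoint_name, None


# Example weight names are assumptions chosen for illustration only.
fused, shard = to_packed_name("model.layers.0.self_attn.k_proj.weight")
print(fused, shard)
```

Mappings of this kind let LoRA and weight loading address the original `q_proj`/`k_proj`/`v_proj` and `gate_proj`/`up_proj` names even though the model stores them as fused layers.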
patch_embed instance-attribute ¶
patch_embed = UNetDown(
patch_size=patch_size,
emb_channels=hidden_size,
in_channels=vae["latent_channels"],
hidden_channels=patch_embed_hidden_dim,
out_channels=hidden_size,
)
vision_model instance-attribute ¶
vision_model = Siglip2VisionTransformer(
vit, quant_config=quant_config, prefix="vision_model"
)
embed_input_ids ¶
embed_input_ids(
input_ids: Tensor,
multimodal_embeddings: MultiModalEmbeddings
| None = None,
*,
is_multimodal: Tensor | None = None,
) -> Tensor
Embed input IDs with optional multimodal embeddings.
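The general pattern behind a signature like this is a masked merge: positions flagged by `is_multimodal` take their rows from `multimodal_embeddings`, in order, and all other positions keep their text embedding. The following is a minimal pure-Python sketch of that pattern using nested lists instead of tensors. The function `merge_embeddings` is hypothetical, and the actual method may differ in detail.

```python
from __future__ import annotations


def merge_embeddings(
    text_embeds: list[list[float]],
    mm_embeds: list[list[float]],
    is_multimodal: list[bool],
) -> list[list[float]]:
    """Hypothetical sketch: scatter mm_embeds into masked positions."""
    if len(text_embeds) != len(is_multimodal):
        raise ValueError("mask length must match the sequence length")
    if sum(is_multimodal) != len(mm_embeds):
        raise ValueError("mask must select exactly len(mm_embeds) positions")

    merged = [row[:] for row in text_embeds]  # copy, leave the input untouched
    mm_iter = iter(mm_embeds)
    for i, flag in enumerate(is_multimodal):
        if flag:
            # Multimodal rows are consumed in sequence order.
            merged[i] = list(next(mm_iter))
    return merged


text = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
image = [[9.0, 9.0], [8.0, 8.0]]
mask = [False, True, True, False]
print(merge_embeddings(text, image, mask))
```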
embed_multimodal ¶
embed_multimodal(**kwargs: object) -> MultiModalEmbeddings
Get multimodal embeddings from input.
forward ¶
forward(
input_ids: Tensor,
positions: Tensor,
intermediate_tensors: IntermediateTensors | None = None,
inputs_embeds: Tensor | None = None,
sampling_metadata: SamplingMetadata | None = None,
logits_index: int | None = None,
sampler=None,
**kwargs: object,
) -> Tensor | IntermediateTensors
get_mrope_input_positions ¶
get_mrope_input_positions(
input_tokens: list[int],
mm_features: list[MultiModalFeatureSpec] | None = None,
*,
hf_config: PretrainedConfig | None = None,
image_grid_thw: list[list[int]] | Tensor | None = None,
video_grid_thw: list[list[int]] | Tensor | None = None,
second_per_grid_ts: list[float] | None = None,
context_len: int = 0,
seq_len: int | None = None,
audio_feature_lengths: Tensor | None = None,
use_audio_in_video: bool = False,
) -> tuple[Tensor, int]
Compute mRoPE positions for HunyuanImage-3.
Maps the original model's build_2d_rope logic onto vLLM's 3-dim mRoPE position tensor of shape [3, seq_len], where dim-1 is height and dim-2 is width. dim-0 (temporal) is unused and kept equal to the flat 1D position.
For text tokens and auxiliary image tokens (timestep, separators), all three dims get the same flat 1D position id.
For VAE / ViT image tokens:
- dim-0 (T): flat 1D position id at the region start
- dim-1 (H): 2D y-position using build_2d_rope centering
- dim-2 (W): 2D x-position using build_2d_rope centering
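To make the [3, seq_len] layout concrete, here is a small pure-Python sketch that builds the three position rows for a sequence of text and image segments. Several details are assumptions rather than facts from the source: the segment encoding, the helper name `mrope_positions`, reading "region start" as one shared T value for every token in an image, and the centering offsets `(h * w - h) // 2` and `(h * w - w) // 2`, which are only one plausible reading of the build_2d_rope centering.

```python
from __future__ import annotations


def mrope_positions(segments: list[tuple]) -> list[list[int]]:
    """Hypothetical sketch: return [T, H, W] position rows.

    Each segment is ("text", n_tokens) or ("image", height, width).
    """
    t_ids: list[int] = []
    h_ids: list[int] = []
    w_ids: list[int] = []
    pos = 0  # running flat 1D position
    for seg in segments:
        if seg[0] == "text":
            n = seg[1]
            for i in range(n):
                # Text tokens: all three dims share the flat position.
                t_ids.append(pos + i)
                h_ids.append(pos + i)
                w_ids.append(pos + i)
            pos += n
        else:
            _, h, w = seg
            n = h * w
            # Assumed centering: shift the grid so it sits in the middle
            # of the flat range occupied by the image tokens.
            off_y = pos + (n - h) // 2
            off_x = pos + (n - w) // 2
            for y in range(h):
                for x in range(w):
                    t_ids.append(pos)        # region start (assumption)
                    h_ids.append(off_y + y)  # 2D y-position
                    w_ids.append(off_x + x)  # 2D x-position
            pos += n
    return [t_ids, h_ids, w_ids]


rows = mrope_positions([("text", 2), ("image", 2, 2), ("text", 1)])
for name, row in zip("THW", rows):
    print(name, row)
```

The documented method also returns an integer alongside the position tensor; that value is not modeled in this sketch.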