vllm_omni.model_executor.models.hunyuan_image3

Modules:

Name                 Description
autoencoder_kl_3d    Reference code
hunyuan_image3
pipeline             HunyuanImage3 pipeline topology.
siglip2              SigLIP2 Vision Transformer for HunyuanImage3, rewritten in vLLM style.

HunyuanImage3ForConditionalGeneration

Bases: Module, SupportsMultiModal, SupportsLoRA, SupportsPP, SupportsMRoPE

HunyuanImage3.0 model for conditional image generation.

This is the main entry point for HunyuanImage3.0 in vLLM. It wraps:

- HunyuanModel
- VAE Encoder (AutoencoderKLConv3D + TimestepEmbedder + UNetDown)
- ViT Encoder (Siglip2VisionTransformer + LightProjector)
- LM Head for token prediction

Supports:

- Text-to-Text and Image-to-Text generation
- Tensor Parallelism

HunyuanImage3Inputs class-attribute instance-attribute

HunyuanImage3Inputs: TypeAlias = HunyuanImage3PixelInputs

config instance-attribute

config = config

lm_head instance-attribute

lm_head = ParallelLMHead(
    unpadded_vocab_size,
    hidden_size,
    org_num_embeddings=vocab_size,
    padding_size=DEFAULT_VOCAB_PADDING_SIZE,
    quant_config=quant_config,
    prefix=maybe_prefix(prefix, "lm_head"),
)

logits_processor instance-attribute

logits_processor = LogitsProcessor(
    unpadded_vocab_size, vocab_size, logit_scale
)

model instance-attribute

model = HunyuanModel(
    vllm_config=vllm_config, prefix="model"
)

multimodal_config instance-attribute

multimodal_config = multimodal_config

packed_modules_mapping class-attribute instance-attribute

packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}
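
The mapping above lists, for each fused module, the per-projection names that it packs together. As a rough illustration of how such a mapping can be used, the sketch below rewrites an unfused parameter name to its fused counterpart and reports which shard it fills. The helper `resolve_packed_name` and the example parameter names are hypothetical, and the integer shard index is a simplification; this is not the lookup that `load_weights` actually performs.

```python
# Copied from the class attribute documented above.
PACKED_MODULES_MAPPING = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def resolve_packed_name(checkpoint_name, packed=PACKED_MODULES_MAPPING):
    """Hypothetical helper: map an unfused parameter name to (fused_name, shard_index).

    Returns (checkpoint_name, None) when the name does not belong to a packed module.
    """
    parts = checkpoint_name.split(".")
    for fused, shards in packed.items():
        for shard_index, shard in enumerate(shards):
            # Compare whole dotted components so "up_proj" never matches "gate_up_proj".
            if shard in parts:
                parts[parts.index(shard)] = fused
                return ".".join(parts), shard_index
    return checkpoint_name, None


# Example parameter names below are illustrative, not taken from a real checkpoint.
fused_name, shard = resolve_packed_name("model.layers.0.self_attn.k_proj.weight")
print(fused_name, shard)
```

Names that do not contain any of the listed projections pass through unchanged, so the same helper can be applied to every entry of a weight iterator.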

patch_embed instance-attribute

patch_embed = UNetDown(
    patch_size=patch_size,
    emb_channels=hidden_size,
    in_channels=vae["latent_channels"],
    hidden_channels=patch_embed_hidden_dim,
    out_channels=hidden_size,
)

prefer_model_sampler class-attribute instance-attribute

prefer_model_sampler = True

quant_config instance-attribute

quant_config = quant_config

supports_encoder_tp_data class-attribute instance-attribute

supports_encoder_tp_data = True

time_embed instance-attribute

time_embed = TimestepEmbedder(hidden_size=hidden_size)

timestep_emb instance-attribute

timestep_emb = TimestepEmbedder(hidden_size=hidden_size)

unpadded_vocab_size instance-attribute

unpadded_vocab_size = vocab_size

use_data_parallel instance-attribute

use_data_parallel = mm_encoder_tp_mode == 'data'

vae instance-attribute

vae = from_config(vae)

vision_aligner instance-attribute

vision_aligner = LightProjector(vit_aligner)

vision_model instance-attribute

vision_model = Siglip2VisionTransformer(
    vit, quant_config=quant_config, prefix="vision_model"
)

vllm_config instance-attribute

vllm_config = vllm_config

compute_logits

compute_logits(hidden_states: Tensor) -> Tensor | None

embed_input_ids

embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings: MultiModalEmbeddings
    | None = None,
    *,
    is_multimodal: Tensor | None = None,
) -> Tensor

Embed input IDs with optional multimodal embeddings.
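
A common way to combine the two inputs is to overwrite the text embedding rows flagged by `is_multimodal` with the multimodal embedding rows, in order. The sketch below shows that idea on plain Python lists; the function `merge_embeddings` is hypothetical, and whether `embed_input_ids` merges in exactly this way is an assumption, not something stated on this page.

```python
def merge_embeddings(text_embeds, mm_embeds, is_multimodal):
    """Hypothetical sketch: replace flagged rows of text_embeds with rows of mm_embeds.

    text_embeds:   list of embedding rows, one per input token
    mm_embeds:     list of embedding rows, one per multimodal placeholder token
    is_multimodal: list of bools, True where a token is a multimodal placeholder
    """
    if len(text_embeds) != len(is_multimodal):
        raise ValueError("mask length must match the number of tokens")
    if sum(is_multimodal) != len(mm_embeds):
        raise ValueError("number of flagged positions must match mm_embeds rows")

    mm_rows = iter(mm_embeds)
    merged = []
    for row, flagged in zip(text_embeds, is_multimodal):
        # Flagged positions consume multimodal rows in order; others keep the text row.
        merged.append(next(mm_rows) if flagged else row)
    return merged


merged = merge_embeddings(
    text_embeds=[[0.0], [1.0], [2.0], [3.0]],
    mm_embeds=[[9.0], [8.0]],
    is_multimodal=[False, True, True, False],
)
print(merged)
```

The length checks make a mismatch between placeholder tokens and encoder outputs fail early instead of silently shifting embeddings.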

embed_multimodal

embed_multimodal(**kwargs: object) -> MultiModalEmbeddings

Get multimodal embeddings from input.

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    sampling_metadata: SamplingMetadata | None = None,
    logits_index: int | None = None,
    sampler=None,
    **kwargs: object,
) -> Tensor | IntermediateTensors

get_language_model

get_language_model() -> Module

get_mrope_input_positions

get_mrope_input_positions(
    input_tokens: list[int],
    mm_features: list[MultiModalFeatureSpec] | None = None,
    *,
    hf_config: PretrainedConfig | None = None,
    image_grid_thw: list[list[int]] | Tensor | None = None,
    video_grid_thw: list[list[int]] | Tensor | None = None,
    second_per_grid_ts: list[float] | None = None,
    context_len: int = 0,
    seq_len: int | None = None,
    audio_feature_lengths: Tensor | None = None,
    use_audio_in_video: bool = False,
) -> tuple[Tensor, int]

Compute mRoPE positions for HunyuanImage-3.

Maps the original model's build_2d_rope logic onto vLLM's 3-dim mRoPE position tensor of shape [3, seq_len], where dim-1 is height and dim-2 is width. dim-0 (temporal) carries no 2D information and is kept equal to the flat 1D position id.

For text tokens and auxiliary image tokens (timestep, separators), all three dims get the same flat 1D position id.

For VAE / ViT image tokens:

- dim-0 (T): flat 1D position id at the region start
- dim-1 (H): 2D y-position using build_2d_rope centering
- dim-2 (W): 2D x-position using build_2d_rope centering
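
The layout described above can be sketched on plain Python lists. Everything here is illustrative: the function `sketch_mrope_positions` and its segment tuples are hypothetical, and the centering offsets (which place the h-by-w grid in the middle of the span of flat positions the image occupies, using integer division) are an assumption about what "build_2d_rope centering" means, not a copy of the actual implementation. The sketch also returns only the three position rows, not the extra `int` that the real method returns.

```python
def sketch_mrope_positions(segments):
    """Hypothetical sketch of the [3, seq_len] position layout.

    segments: list of ("text", n_tokens) or ("image", height, width) tuples,
              in sequence order. Image regions are h * w tokens in row-major order.
    Returns [t_positions, h_positions, w_positions] as three lists.
    """
    t_pos, h_pos, w_pos = [], [], []
    cursor = 0  # flat 1D position id of the next token
    for segment in segments:
        if segment[0] == "text":
            n_tokens = segment[1]
            for p in range(cursor, cursor + n_tokens):
                # Text and auxiliary tokens: same flat id on all three dims.
                t_pos.append(p)
                h_pos.append(p)
                w_pos.append(p)
            cursor += n_tokens
        else:
            _, height, width = segment
            n_tokens = height * width
            # ASSUMPTION: center the grid within the flat span of the image.
            offset_y = cursor + (n_tokens - height) // 2
            offset_x = cursor + (n_tokens - width) // 2
            for y in range(height):
                for x in range(width):
                    t_pos.append(cursor)        # dim-0: region start
                    h_pos.append(offset_y + y)  # dim-1: centered y
                    w_pos.append(offset_x + x)  # dim-2: centered x
            cursor += n_tokens
    return [t_pos, h_pos, w_pos]


# Two text tokens, a 2x2 image region, then one more text token.
positions = sketch_mrope_positions([("text", 2), ("image", 2, 2), ("text", 1)])
for row in positions:
    print(row)
```

In this sketch the flat cursor still advances by h * w across an image region, so text that follows an image continues from the same 1D position it would have had without the 2D treatment.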

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

make_empty_intermediate_tensors

make_empty_intermediate_tensors(
    batch_size: int, dtype: dtype, device: device
) -> IntermediateTensors

sample

sample(
    logits: Tensor, sampling_metadata: SamplingMetadata
) -> SamplerOutput | None