vllm_omni.model_executor.models.hunyuan_image3.siglip2

SigLIP2 Vision Transformer for HunyuanImage3, rewritten in vLLM style.

Key optimizations over the original HuggingFace-style implementation:

- QKVParallelLinear: fused QKV projection with tensor parallelism support
- MMEncoderAttention: FlashAttention / xFormers backend
- ColumnParallelLinear / RowParallelLinear: TP-aware MLP layers
- Packed sequence processing: eliminates padding waste via cu_seqlens
- Data parallel support for multi-GPU ViT inference

Config

Convert dict config to object with attribute access.
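
Wrapping a dict so that its keys read as attributes can be sketched as below. `AttrConfig` is a hypothetical stand-in rather than the module's `Config` class; the recursion into nested dicts is an assumption, and the field values are illustrative only.

```python
class AttrConfig:
    """Hypothetical stand-in for a dict-to-attribute config wrapper."""

    def __init__(self, data: dict):
        for key, value in data.items():
            # Assumption: nested dicts are wrapped recursively.
            if isinstance(value, dict):
                value = AttrConfig(value)
            setattr(self, key, value)


# Illustrative values, not the model's actual configuration.
cfg = AttrConfig({"hidden_size": 1152, "num_attention_heads": 16})
print(cfg.hidden_size)  # attribute access instead of cfg["hidden_size"]
```

Attribute access lets the layers below read fields such as `hidden_size` and `num_attention_heads` directly from the config object.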

LightProjector

Bases: Module

layers instance-attribute

layers = modules

forward

forward(x)

Siglip2Attention

Bases: Module

Multi-headed attention using QKVParallelLinear and MMEncoderAttention.

attn instance-attribute

attn = MMEncoderAttention(
    num_heads=num_heads_per_partition,
    head_size=head_dim,
    scale=scale,
    prefix=f"{prefix}.attn",
)

embed_dim instance-attribute

embed_dim = hidden_size

head_dim instance-attribute

head_dim = embed_dim // num_heads

num_heads instance-attribute

num_heads = num_attention_heads

num_heads_per_partition instance-attribute

num_heads_per_partition = divide(num_heads, tp_size)

out_proj instance-attribute

out_proj = RowParallelLinear(
    input_size=embed_dim,
    output_size=embed_dim,
    quant_config=quant_config,
    prefix=f"{prefix}.out_proj",
    disable_tp=use_data_parallel,
)

qkv_proj instance-attribute

qkv_proj = QKVParallelLinear(
    hidden_size=embed_dim,
    head_size=head_dim,
    total_num_heads=num_heads,
    quant_config=quant_config,
    prefix=f"{prefix}.qkv_proj",
    disable_tp=use_data_parallel,
)

scale instance-attribute

scale = head_dim ** -0.5

tp_size instance-attribute

tp_size = (
    1
    if use_data_parallel
    else get_tensor_model_parallel_world_size()
)
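
The attribute arithmetic above (`head_dim`, `num_heads_per_partition`, `scale`) can be checked with a small stand-alone sketch. The helper name `attention_dims` and the example sizes are hypothetical, and the divisibility checks are an assumption about what `divide` enforces.

```python
def attention_dims(hidden_size: int, num_attention_heads: int, tp_size: int) -> dict:
    """Reproduce the per-rank attention sizes listed above (hypothetical helper)."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    # Assumption: heads must split evenly across tensor-parallel ranks.
    if num_attention_heads % tp_size != 0:
        raise ValueError("num_attention_heads must be divisible by tp_size")
    head_dim = hidden_size // num_attention_heads
    return {
        "head_dim": head_dim,
        "num_heads_per_partition": num_attention_heads // tp_size,
        "scale": head_dim ** -0.5,
    }


# Illustrative sizes: 16 heads of width 64, split across 2 ranks.
dims = attention_dims(hidden_size=1024, num_attention_heads=16, tp_size=2)
print(dims)
```

With `use_data_parallel` set, `tp_size` is 1, so each rank keeps all heads and the projections are not sharded.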

forward

forward(
    hidden_states: Tensor, cu_seqlens: Tensor
) -> Tensor

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| hidden_states | Tensor | Packed input (total_tokens, embed_dim) | required |
| cu_seqlens | Tensor | Cumulative sequence lengths (B+1,) | required |
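
The relationship between the packed input and `cu_seqlens` can be illustrated in plain Python. This is a sketch with lists standing in for tensors; `build_cu_seqlens` and `split_packed` are hypothetical helper names, not functions from this module.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Prefix sums with a leading zero: B lengths give B + 1 boundaries."""
    return [0, *accumulate(seq_lens)]


def split_packed(rows, cu_seqlens):
    """Recover per-sequence slices from a packed list of token rows."""
    return [rows[start:end] for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]


seq_lens = [4, 6, 2]                     # three sequences, 12 tokens in total
cu_seqlens = build_cu_seqlens(seq_lens)  # boundaries of each sequence
rows = list(range(12))                   # stand-in for (total_tokens, embed_dim) rows
segments = split_packed(rows, cu_seqlens)
print(cu_seqlens, [len(s) for s in segments])
```

Sequence `i` occupies rows `cu_seqlens[i]` up to, but not including, `cu_seqlens[i + 1]`, so no padding rows are stored.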

Siglip2Encoder

Bases: Module

Transformer encoder with packed sequence processing.

layers instance-attribute

layers = ModuleList(
    [
        (
            Siglip2EncoderLayer(
                config,
                quant_config=quant_config,
                prefix=f"{prefix}.layers.{idx}",
            )
        )
        for idx in (range(num_hidden_layers))
    ]
)

forward

forward(
    hidden_states: Tensor, cu_seqlens: Tensor
) -> Tensor

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| hidden_states | Tensor | Packed input (total_tokens, embed_dim) | required |
| cu_seqlens | Tensor | Cumulative sequence lengths (B+1,) | required |

Siglip2EncoderLayer

Bases: Module

embed_dim instance-attribute

embed_dim = hidden_size

layer_norm1 instance-attribute

layer_norm1 = LayerNorm(embed_dim, eps=layer_norm_eps)

layer_norm2 instance-attribute

layer_norm2 = LayerNorm(embed_dim, eps=layer_norm_eps)

mlp instance-attribute

mlp = Siglip2MLP(
    config,
    quant_config=quant_config,
    prefix=f"{prefix}.mlp",
)

self_attn instance-attribute

self_attn = Siglip2Attention(
    config,
    quant_config=quant_config,
    prefix=f"{prefix}.self_attn",
)

forward

forward(
    hidden_states: Tensor, cu_seqlens: Tensor
) -> Tensor

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| hidden_states | Tensor | Packed input (total_tokens, embed_dim) | required |
| cu_seqlens | Tensor | Cumulative sequence lengths (B+1,) | required |

Siglip2MLP

Bases: Module

activation_fn instance-attribute

activation_fn = get_act_fn(hidden_act)

fc1 instance-attribute

fc1 = ColumnParallelLinear(
    hidden_size,
    intermediate_size,
    quant_config=quant_config,
    prefix=f"{prefix}.fc1",
    disable_tp=use_data_parallel,
)

fc2 instance-attribute

fc2 = RowParallelLinear(
    intermediate_size,
    hidden_size,
    quant_config=quant_config,
    prefix=f"{prefix}.fc2",
    disable_tp=use_data_parallel,
)

forward

forward(hidden_states: Tensor) -> Tensor

Siglip2VisionEmbeddings

Bases: Module

config instance-attribute

config = config

embed_dim instance-attribute

embed_dim = hidden_size

num_patches instance-attribute

num_patches = num_patches

patch_embedding instance-attribute

patch_embedding = Linear(
    in_features=num_channels * patch_size * patch_size,
    out_features=embed_dim,
)

patch_size instance-attribute

patch_size = patch_size

position_embedding instance-attribute

position_embedding = Embedding(num_patches, embed_dim)

position_embedding_size instance-attribute

position_embedding_size = int(num_patches ** 0.5)
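
A quick check of the `position_embedding_size` arithmetic. The sketch assumes the position table is meant to be read as a square grid, which the source does not state explicitly; both helper names are hypothetical.

```python
import math


def position_grid_side(num_patches: int) -> int:
    """Same expression as the attribute above: truncated square root."""
    return int(num_patches ** 0.5)


def is_perfect_square(num_patches: int) -> bool:
    """True when the truncated side length loses no patches."""
    return math.isqrt(num_patches) ** 2 == num_patches


side = position_grid_side(256)       # 256 patches form a 16 x 16 grid
truncated = position_grid_side(200)  # not a perfect square, so the side truncates
print(side, truncated, is_perfect_square(200))
```

When `num_patches` is not a perfect square, `int(...)` truncates, so the implied grid covers fewer positions than the embedding table holds.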

forward

forward(
    pixel_values: FloatTensor, spatial_shapes: LongTensor
) -> Tensor

Process packed pixel values with per-image position embeddings.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pixel_values | FloatTensor | Packed pixel values (total_real_patches, num_channels * patch_size * patch_size) | required |
| spatial_shapes | LongTensor | Per-image spatial shapes (B, 2) as [(h, w), ...] | required |

Returns:

| Type | Description |
| --- | --- |
| Tensor | Packed embeddings (total_real_patches, embed_dim) |
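
The packed shapes described above follow from `spatial_shapes` by simple arithmetic. The sketch assumes each `(h, w)` pair is measured in patches, so image `i` contributes `h * w` rows; the helper name and the example values are hypothetical.

```python
def packed_input_shape(spatial_shapes, num_channels, patch_size):
    """Return per-image patch counts and the packed pixel_values shape."""
    # Assumption: (h, w) are counted in patches, not pixels.
    patches_per_image = [h * w for h, w in spatial_shapes]
    total_real_patches = sum(patches_per_image)
    patch_dim = num_channels * patch_size * patch_size
    return patches_per_image, (total_real_patches, patch_dim)


# Illustrative values: two images with 2x3 and 4x4 patch grids, RGB, 16-pixel patches.
counts, shape = packed_input_shape([(2, 3), (4, 4)], num_channels=3, patch_size=16)
print(counts, shape)
```

The per-image counts are also the sequence lengths from which `cu_seqlens` would be derived for the encoder.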

Siglip2VisionTransformer

Bases: Module

config instance-attribute

config = config

embed_dim instance-attribute

embed_dim = hidden_size

embeddings instance-attribute

embeddings = Siglip2VisionEmbeddings(config)

encoder instance-attribute

encoder = Siglip2Encoder(
    config,
    quant_config=quant_config,
    prefix=f"{prefix}.encoder" if prefix else "encoder",
)

post_layernorm instance-attribute

post_layernorm = LayerNorm(embed_dim, eps=layer_norm_eps)

forward

forward(
    pixel_values: FloatTensor,
    attention_mask: Tensor,
    spatial_shapes: LongTensor,
) -> Tensor

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pixel_values | FloatTensor | Batched pixel values (B, max_num_patches, num_channels * patch_size * patch_size) | required |
| attention_mask | Tensor | (B, max_num_patches) with 1 for real, 0 for padding | required |
| spatial_shapes | LongTensor | (B, 2) with (height, width) per image | required |

Returns:

| Type | Description |
| --- | --- |
| Tensor | (B, max_num_patches, hidden_size) with zeros at padding positions |
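
The input and output contract above (padded batch in, padded batch with zeros out, packed processing in between) can be sketched in plain Python. This illustrates the pack/unpack idea only and is not the module's code; `pack` and `unpack` are hypothetical helpers, and the encoder pass between them is omitted.

```python
def pack(batch, mask):
    """Drop padded rows; return packed rows and per-image lengths."""
    packed, seq_lens = [], []
    for rows, row_mask in zip(batch, mask):
        real = [row for row, m in zip(rows, row_mask) if m == 1]
        packed.extend(real)
        seq_lens.append(len(real))
    return packed, seq_lens


def unpack(packed, mask, hidden_size):
    """Scatter packed rows back to (B, max_num_patches), zero-filling padding."""
    rows = iter(packed)
    return [
        [next(rows) if m == 1 else [0.0] * hidden_size for m in row_mask]
        for row_mask in mask
    ]


# Two images, max_num_patches = 3, hidden_size = 2; 9.0 marks padded garbage.
batch = [
    [[1.0, 1.0], [2.0, 2.0], [9.0, 9.0]],
    [[3.0, 3.0], [9.0, 9.0], [9.0, 9.0]],
]
mask = [[1, 1, 0], [1, 0, 0]]
packed, seq_lens = pack(batch, mask)
restored = unpack(packed, mask, hidden_size=2)
print(seq_lens, restored)
```

Only the three real rows would pass through the encoder in this example; the padded positions are restored as zeros afterwards.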

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]