vllm_omni.model_executor.models.hunyuan_image3.siglip2 ¶
SigLIP2 Vision Transformer for HunyuanImage3, rewritten in vLLM style.
Key optimizations over the original HuggingFace-style implementation:
- QKVParallelLinear: fused QKV projection with tensor parallelism support
- MMEncoderAttention: FlashAttention / xFormers backend
- ColumnParallelLinear / RowParallelLinear: TP-aware MLP layers
- Packed sequence processing: eliminates padding waste via cu_seqlens
- Data parallel support for multi-GPU ViT inference
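To make the packed-sequence idea concrete, here is a minimal sketch of how a `cu_seqlens` vector of shape `(B+1,)` could be derived from per-image spatial shapes. It assumes each image contributes `h * w` patches to the packed sequence; the helper name `build_cu_seqlens` is hypothetical and not part of this module.

```python
from itertools import accumulate


def build_cu_seqlens(spatial_shapes):
    """Return cumulative sequence lengths, one entry per image boundary.

    Assumption: an image with spatial shape (h, w) contributes h * w patches.
    """
    seq_lens = [h * w for h, w in spatial_shapes]
    # A leading 0 marks the start of the first sequence.
    return [0, *accumulate(seq_lens)]


# Three images of different sizes packed back to back, with no padding rows.
cu_seqlens = build_cu_seqlens([(2, 3), (4, 4), (1, 5)])
print(cu_seqlens)  # [0, 6, 22, 27]
```

Image `i` then occupies rows `cu_seqlens[i]` up to (but not including) `cu_seqlens[i + 1]` of the packed tensor.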
Config ¶
Convert dict config to object with attribute access.
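As a rough illustration of the dict-to-attribute conversion, the sketch below wraps a flat dictionary so that its keys can be read as attributes. The class name `AttrConfig` and the example keys and values are hypothetical; how the real `Config` treats nested dictionaries or missing keys is not documented here.

```python
class AttrConfig:
    """Hypothetical stand-in: expose the keys of a flat dict as attributes."""

    def __init__(self, config_dict):
        for key, value in config_dict.items():
            setattr(self, key, value)


# Illustrative values only, not taken from a real checkpoint.
cfg = AttrConfig({"hidden_size": 1152, "num_attention_heads": 16})
print(cfg.hidden_size)  # 1152
```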
Siglip2Attention ¶
Bases: Module
Multi-headed attention using QKVParallelLinear and MMEncoderAttention.
attn instance-attribute ¶
attn = MMEncoderAttention(
num_heads=num_heads_per_partition,
head_size=head_dim,
scale=scale,
prefix=f"{prefix}.attn",
)
out_proj instance-attribute ¶
out_proj = RowParallelLinear(
input_size=embed_dim,
output_size=embed_dim,
quant_config=quant_config,
prefix=f"{prefix}.out_proj",
disable_tp=use_data_parallel,
)
qkv_proj instance-attribute ¶
qkv_proj = QKVParallelLinear(
hidden_size=embed_dim,
head_size=head_dim,
total_num_heads=num_heads,
quant_config=quant_config,
prefix=f"{prefix}.qkv_proj",
disable_tp=use_data_parallel,
)
tp_size instance-attribute ¶
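The relationship between `tp_size` and `num_heads_per_partition` can be sketched as follows. This assumes heads are split evenly across tensor-parallel ranks and that data-parallel mode (`disable_tp=use_data_parallel`) behaves like a tensor-parallel size of 1; the function name and the `tp_world_size` parameter are hypothetical.

```python
def heads_per_partition(num_heads, tp_world_size, use_data_parallel):
    """Number of attention heads handled by one rank.

    Assumption: data-parallel mode keeps every head on every rank.
    """
    tp_size = 1 if use_data_parallel else tp_world_size
    if num_heads % tp_size != 0:
        raise ValueError("num_heads must be divisible by the tensor-parallel size")
    return num_heads // tp_size


print(heads_per_partition(16, 4, use_data_parallel=False))  # 4
print(heads_per_partition(16, 4, use_data_parallel=True))   # 16
```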
forward ¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | Tensor | Packed input `(total_tokens, embed_dim)` | required |
| cu_seqlens | Tensor | Cumulative sequence lengths `(B+1,)` | required |
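With packed inputs, `cu_seqlens` is what keeps tokens of different images from attending to each other. The sketch below shows that constraint in plain Python, under the usual variable-length attention semantics where attention is restricted to tokens within the same sequence. The helper names are hypothetical, and the real backend enforces this inside the attention kernel rather than with an explicit mask.

```python
def segment_ids(cu_seqlens):
    """Map every packed token position to the index of its sequence."""
    ids = []
    for seq_idx in range(len(cu_seqlens) - 1):
        length = cu_seqlens[seq_idx + 1] - cu_seqlens[seq_idx]
        ids.extend([seq_idx] * length)
    return ids


def can_attend(cu_seqlens, query_pos, key_pos):
    """True if both positions fall inside the same packed sequence."""
    ids = segment_ids(cu_seqlens)
    return ids[query_pos] == ids[key_pos]


cu = [0, 2, 5]  # two sequences, of length 2 and 3
print(segment_ids(cu))       # [0, 0, 1, 1, 1]
print(can_attend(cu, 0, 1))  # True: same sequence
print(can_attend(cu, 1, 2))  # False: crosses a sequence boundary
```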
Siglip2Encoder ¶
Bases: Module
Transformer encoder with packed sequence processing.
layers instance-attribute ¶
layers = ModuleList(
[
(
Siglip2EncoderLayer(
config,
quant_config=quant_config,
prefix=f"{prefix}.layers.{idx}",
)
)
for idx in (range(num_hidden_layers))
]
)
forward ¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | Tensor | Packed input `(total_tokens, embed_dim)` | required |
| cu_seqlens | Tensor | Cumulative sequence lengths `(B+1,)` | required |
Siglip2EncoderLayer ¶
Bases: Module
mlp instance-attribute ¶
mlp = Siglip2MLP(
config,
quant_config=quant_config,
prefix=f"{prefix}.mlp",
)
self_attn instance-attribute ¶
self_attn = Siglip2Attention(
config,
quant_config=quant_config,
prefix=f"{prefix}.self_attn",
)
forward ¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | Tensor | Packed input `(total_tokens, embed_dim)` | required |
| cu_seqlens | Tensor | Cumulative sequence lengths `(B+1,)` | required |
Siglip2MLP ¶
Bases: Module
fc1 instance-attribute ¶
fc1 = ColumnParallelLinear(
hidden_size,
intermediate_size,
quant_config=quant_config,
prefix=f"{prefix}.fc1",
disable_tp=use_data_parallel,
)
fc2 instance-attribute ¶
fc2 = RowParallelLinear(
intermediate_size,
hidden_size,
quant_config=quant_config,
prefix=f"{prefix}.fc2",
disable_tp=use_data_parallel,
)
Siglip2VisionEmbeddings ¶
Bases: Module
patch_embedding instance-attribute ¶
patch_embedding = Linear(
in_features=num_channels * patch_size * patch_size,
out_features=embed_dim,
)
forward ¶
Process packed pixel values with per-image position embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pixel_values | FloatTensor | Packed pixel values `(total_real_patches, num_channels * patch_size * patch_size)` | required |
| spatial_shapes | LongTensor | Per-image spatial shapes `(B, 2)` as `[(h, w), ...]` | required |
Returns:
| Type | Description |
|---|---|
| Tensor | Packed embeddings `(total_real_patches, embed_dim)` |
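To illustrate what "per-image position embeddings" means for a packed input, the sketch below enumerates a `(row, col)` grid position for every packed patch, restarting the grid for each image. It assumes patches are stored in row-major order within an image, which this page does not state. The helper name is hypothetical, and the way the position embedding itself is looked up or resized per image is not covered.

```python
def per_image_positions(spatial_shapes):
    """Return (image_idx, row, col) for each patch in packed order.

    Assumption: patches of one image are laid out in row-major order.
    """
    positions = []
    for image_idx, (h, w) in enumerate(spatial_shapes):
        for row in range(h):
            for col in range(w):
                positions.append((image_idx, row, col))
    return positions


# A 1x2 image followed by a 2x1 image: four packed patches in total.
print(per_image_positions([(1, 2), (2, 1)]))
# [(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 1, 0)]
```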
Siglip2VisionTransformer ¶
Bases: Module
encoder instance-attribute ¶
encoder = Siglip2Encoder(
config,
quant_config=quant_config,
prefix=f"{prefix}.encoder" if prefix else "encoder",
)
forward ¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pixel_values | FloatTensor | Batched pixel values `(B, max_num_patches, num_channels * patch_size * patch_size)` | required |
| attention_mask | Tensor | Mask of shape `(B, max_num_patches)`: 1 for real patches, 0 for padding | required |
| spatial_shapes | LongTensor | Shape `(B, 2)`, holding `(height, width)` per image | required |
Returns:
| Type | Description |
|---|---|
| Tensor | Hidden states of shape `(B, max_num_patches, hidden_size)`, with zeros at padding positions |
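The input and output shapes suggest a pack-then-unpack round trip: padded rows are dropped before the encoder and zero rows are restored afterwards. The following is a minimal sketch of that round trip using nested lists instead of tensors. The function names are hypothetical, and the actual implementation may organize these steps differently.

```python
def pack(batched, attention_mask):
    """Keep only the rows whose mask entry is 1, concatenated across the batch."""
    packed = []
    for rows, mask_row in zip(batched, attention_mask):
        packed.extend(row for row, keep in zip(rows, mask_row) if keep)
    return packed


def unpack(packed, attention_mask, hidden_size):
    """Scatter packed rows back to (B, max_num_patches), zero-filling padding."""
    rows = iter(packed)
    return [
        [next(rows) if keep else [0.0] * hidden_size for keep in mask_row]
        for mask_row in attention_mask
    ]


mask = [[1, 1, 0], [1, 0, 0]]
batched = [[[1.0], [2.0], [9.0]], [[3.0], [9.0], [9.0]]]  # 9.0 marks padding rows

packed = pack(batched, mask)
print(packed)                   # [[1.0], [2.0], [3.0]]
print(unpack(packed, mask, 1))  # [[[1.0], [2.0], [0.0]], [[3.0], [0.0], [0.0]]]
```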