vllm_omni.diffusion.models.sd3.sd3_transformer ¶
FeedForward ¶
GELU ¶
SD3CrossAttention ¶
Bases: Module
add_kv_proj instance-attribute ¶
add_kv_proj = QKVParallelLinear(
    added_kv_proj_dim,
    head_size=inner_kv_dim // num_heads,
    total_num_heads=num_heads,
)
attn instance-attribute ¶
attn = Attention(
    num_heads=num_heads,
    head_size=head_dim,
    softmax_scale=1.0 / head_dim**0.5,
    causal=False,
)
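The `softmax_scale=1.0 / head_dim**0.5` and `causal=False` arguments above correspond to ordinary scaled dot-product attention with no mask. The following is a minimal pure-Python sketch of that computation for a single head; the helper names `softmax` and `attention` are hypothetical and are not part of `vllm_omni`, whose `Attention` layer may dispatch to an optimized backend instead.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, head_dim):
    """Non-causal scaled dot-product attention for one head (illustrative only).

    q, k, v are lists of vectors; q and k vectors have length head_dim.
    """
    scale = 1.0 / head_dim ** 0.5  # same expression as softmax_scale above
    out = []
    for qi in q:
        # causal=False: every query position attends to every key position.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append(
            [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))]
        )
    return out
```

With an all-zero query the scores are equal, so the output is the plain average of the value vectors.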
inner_dim instance-attribute ¶
to_qkv instance-attribute ¶
SD3PatchEmbed ¶
Bases: Module
2D Image to Patch Embedding with support for SD3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `patch_size` | `int` | The size of the patches. | `16` |
| `in_channels` | `int` | The number of input channels. | `3` |
| `embed_dim` | `int` | The output dimension of the embedding. | `768` |
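To make the role of `patch_size` concrete, here is a small sketch of how an image is split into non-overlapping patches. It assumes the image dimensions divide evenly by `patch_size` and works on a single channel; the functions `patch_grid` and `patchify` are hypothetical helpers, and the learned projection to `embed_dim` that a real patch embedding applies is deliberately left out.

```python
def patch_grid(height, width, patch_size=16):
    """Return (rows, cols, num_patches) for a height x width image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


def patchify(image, patch_size):
    """Split a single-channel image (list of rows) into flattened patches.

    Patches are returned in row-major order over the patch grid.
    """
    rows, cols, _ = patch_grid(len(image), len(image[0]), patch_size)
    patches = []
    for r in range(rows):
        for c in range(cols):
            patch = [
                image[r * patch_size + dy][c * patch_size + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches
```

The number of patches is the sequence length the transformer sees, so halving `patch_size` quadruples it.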
SD3Transformer2DModel ¶
Bases: Module
The Transformer model introduced in Stable Diffusion 3.
context_embedder instance-attribute ¶
dual_attention_layers instance-attribute ¶
dual_attention_layers = (
    dual_attention_layers
    if hasattr(model_config, "dual_attention_layers")
    else ()
)
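The expression above falls back to an empty tuple when the configuration has no `dual_attention_layers` field. A minimal sketch of the same pattern follows, assuming the value is read from the config object itself; `resolve_dual_attention_layers` is a hypothetical helper and `SimpleNamespace` merely stands in for the real model config.

```python
from types import SimpleNamespace


def resolve_dual_attention_layers(model_config):
    """Return the configured layer indices, or () if the field is absent."""
    if hasattr(model_config, "dual_attention_layers"):
        return model_config.dual_attention_layers
    return ()


# Stand-in configs: one with the field, one without it.
with_field = SimpleNamespace(dual_attention_layers=(0, 1, 2))
without_field = SimpleNamespace()
layers_a = resolve_dual_attention_layers(with_field)
layers_b = resolve_dual_attention_layers(without_field)
```

Defaulting to `()` keeps later membership tests such as `i in dual_attention_layers` valid for configs that predate the field.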
norm_out instance-attribute ¶
pos_embed instance-attribute ¶
pos_embed = PatchEmbed(
    height=sample_size,
    width=sample_size,
    patch_size=patch_size,
    in_channels=in_channels,
    embed_dim=inner_dim,
    pos_embed_max_size=pos_embed_max_size,
)
proj_out instance-attribute ¶
time_text_embed instance-attribute ¶
time_text_embed = CombinedTimestepTextProjEmbeddings(
    embedding_dim=inner_dim,
    pooled_projection_dim=pooled_projection_dim,
)
transformer_blocks instance-attribute ¶
transformer_blocks = ModuleList(
    [
        (
            SD3TransformerBlock(
                dim=inner_dim,
                num_attention_heads=num_attention_heads,
                attention_head_dim=attention_head_dim,
                context_pre_only=i == num_layers - 1,
                qk_norm=qk_norm,
                use_dual_attention=True
                if i in dual_attention_layers
                else False,
            )
        )
        for i in (range(num_layers))
    ]
)
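The comprehension above gives each block two per-layer flags: `context_pre_only` is set only on the last block, and `use_dual_attention` is set on the blocks whose index appears in `dual_attention_layers`. The sketch below reproduces just that flag logic without constructing any modules; `block_flags` is a hypothetical helper used for illustration.

```python
def block_flags(num_layers, dual_attention_layers=()):
    """List the per-block flags implied by the ModuleList comprehension."""
    return [
        {
            "index": i,
            # Only the final block is built with context_pre_only=True.
            "context_pre_only": i == num_layers - 1,
            # Equivalent to `True if i in dual_attention_layers else False`.
            "use_dual_attention": i in dual_attention_layers,
        }
        for i in range(num_layers)
    ]


flags = block_flags(num_layers=4, dual_attention_layers=(0, 1))
```

For four layers with dual attention on layers 0 and 1, only index 3 has `context_pre_only` set and only indices 0 and 1 have `use_dual_attention` set.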
forward ¶
forward(
    hidden_states: Tensor,
    encoder_hidden_states: Tensor,
    pooled_projections: Tensor,
    timestep: LongTensor,
    return_dict: bool = True,
) -> Tensor | Transformer2DModelOutput
The `SD3Transformer2DModel` forward method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hidden_states` | `torch.Tensor` of shape `(batch_size, image_sequence_length, in_channels)` | The input hidden states. | required |
| `encoder_hidden_states` | `torch.Tensor` of shape `(batch_size, text_sequence_length, joint_attention_dim)` | Conditional embeddings (computed from input conditions such as prompts) to use. | required |
| `pooled_projections` | `torch.Tensor` of shape `(batch_size, projection_dim)` | Embeddings projected from the embeddings of the input conditions. | required |
| `timestep` | `torch.LongTensor` | Indicates the denoising step. | required |
| `return_dict` | `bool`, *optional* | Whether or not to return a `Transformer2DModelOutput`. | `True` |
Returns:
| Type | Description |
|---|---|
| `Tensor \| Transformer2DModelOutput` | A `Transformer2DModelOutput` or a plain `Tensor`, depending on `return_dict`. |
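The shapes in the parameter table can be checked before calling `forward`. The following sketch validates only what the table states: the rank of each tensor and that all three share the same leading `batch_size`. It operates on plain shape tuples, `check_forward_shapes` is a hypothetical helper, and the example sizes are illustrative rather than taken from any particular checkpoint.

```python
def check_forward_shapes(hidden_states, encoder_hidden_states, pooled_projections):
    """Validate shape tuples against the documented layout; return batch_size."""
    if len(hidden_states) != 3:
        raise ValueError("hidden_states must be (batch, image_seq_len, in_channels)")
    if len(encoder_hidden_states) != 3:
        raise ValueError(
            "encoder_hidden_states must be (batch, text_seq_len, joint_attention_dim)"
        )
    if len(pooled_projections) != 2:
        raise ValueError("pooled_projections must be (batch, projection_dim)")
    batch_sizes = {hidden_states[0], encoder_hidden_states[0], pooled_projections[0]}
    if len(batch_sizes) != 1:
        raise ValueError(f"batch sizes disagree: {sorted(batch_sizes)}")
    return hidden_states[0]


# Illustrative sizes only.
batch = check_forward_shapes((2, 1024, 16), (2, 154, 4096), (2, 2048))
```

With real tensors, the same check could be called as `check_forward_shapes(tuple(x.shape), ...)`.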
SD3TransformerBlock ¶
Bases: Module
A Transformer block following the MMDiT architecture, introduced in Stable Diffusion 3.
Reference: https://huggingface.co/papers/2403.03206
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dim` | `int` | The number of channels in the input and output. | required |
| `num_attention_heads` | `int` | The number of heads to use for multi-head attention. | required |
| `attention_head_dim` | `int` | The number of channels in each head. | required |
| `context_pre_only` | `bool` | Boolean to determine if we should add some blocks associated with the processing of | `False` |
attn instance-attribute ¶
attn = SD3CrossAttention(
    dim=dim,
    num_heads=num_attention_heads,
    head_dim=attention_head_dim,
    added_kv_proj_dim=dim,
    context_pre_only=context_pre_only,
    out_dim=dim,
    qk_norm=True if qk_norm == "rms_norm" else False,
    eps=1e-06,
)
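The block enables query/key normalization only when `qk_norm` equals the string `"rms_norm"`, and passes `eps=1e-06`. As a rough illustration of what RMS normalization does to a single vector, here is a pure-Python sketch; it assumes the standard RMSNorm formula without a learned scale, and `rms_norm` and `wants_qk_norm` are hypothetical helpers rather than the functions `SD3CrossAttention` actually calls.

```python
import math


def wants_qk_norm(qk_norm):
    """Same truth value as `True if qk_norm == "rms_norm" else False`."""
    return qk_norm == "rms_norm"


def rms_norm(x, eps=1e-06):
    """Divide a vector by its root mean square (no learned weight)."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms for v in x]


normalized = rms_norm([3.0, 4.0])
```

After normalization the mean of the squared entries is approximately 1, which keeps query-key dot products in a predictable range.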
attn2 instance-attribute ¶
attn2 = SD3CrossAttention(
    dim=dim,
    num_heads=num_attention_heads,
    head_dim=attention_head_dim,
    added_kv_proj_dim=None,
    context_pre_only=True,
    out_dim=dim,
    qk_norm=True if qk_norm == "rms_norm" else False,
    eps=1e-06,
)
ff_context instance-attribute ¶
ff_context = FeedForward(
    dim=dim, dim_out=dim, activation_fn="gelu-approximate"
)
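The `"gelu-approximate"` setting most likely selects the tanh approximation of GELU, as it does in the Hugging Face diffusers `FeedForward`; that mapping is an assumption here, since this page does not show the `FeedForward` body. The sketch below compares the tanh approximation with the exact erf-based GELU on scalars, using hypothetical helper names.

```python
import math


def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest absolute gap between the two on a few sample points.
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in (-3.0, -1.0, 0.0, 1.0, 3.0))
```

The two functions agree closely on typical activations, which is why the approximation is commonly used as a cheaper drop-in.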
norm1_context instance-attribute ¶
norm1_context = AdaLayerNormContinuous(
    dim,
    dim,
    elementwise_affine=False,
    eps=1e-06,
    bias=True,
    norm_type="layer_norm",
)
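With `elementwise_affine=False`, the layer norm inside `AdaLayerNormContinuous` has no learned scale or bias of its own; the scale and shift come from the conditioning embedding instead. The sketch below shows that modulation on a single vector, assuming the `norm(x) * (1 + scale) + shift` form used by the diffusers layer of the same name. The helpers `layer_norm` and `ada_layer_norm` are hypothetical, and the linear projection that produces `scale` and `shift` from the conditioning embedding is omitted.

```python
import math


def layer_norm(x, eps=1e-06):
    """Layer norm without affine parameters (elementwise_affine=False)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def ada_layer_norm(x, scale, shift, eps=1e-06):
    """Normalize x, then modulate with conditioning-derived scale and shift."""
    normed = layer_norm(x, eps)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


out = ada_layer_norm([1.0, 2.0, 3.0], scale=[1.0, 1.0, 1.0], shift=[0.5, 0.5, 0.5])
```

A zero `scale` and zero `shift` reduce this to plain layer normalization, so the conditioning acts as a learned deviation from the identity modulation.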