vllm_omni.diffusion.models.cosyvoice3_audio.cosyvoice3_dit ¶
logger module-attribute ¶
Shape notation used in this module:
b - batch
n - sequence
nt - text sequence
nw - raw wave length
d - dimension
AdaLayerNormZero_Final ¶
CausalConvPositionEmbedding ¶
DiT ¶
Bases: Module
Diffusion Transformer backbone using optimized attention backends.
This class is a drop-in replacement for the original DiT. It routes attention through the vllm_omni diffusion infrastructure, which provides FlashAttention, SageAttention, and SDPA backends.
DiTAttention ¶
Bases: Module
Attention module using diffusion infrastructure for optimized backends.
This replaces the original Attention class to leverage FlashAttention, SageAttention, or SDPA backends automatically.
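All three backends named above compute scaled dot-product attention (SageAttention via a quantized approximation). As a point of reference, here is a minimal pure-Python sketch of that operation for a single head. It is not the vllm_omni implementation, and the helper names `softmax` and `sdpa_reference` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sdpa_reference(q, k, v):
    """Single-head scaled dot-product attention.

    q, k: lists of shape [n][d]; v: list of shape [n][d_v].
    Returns a list of shape [n][d_v]. Hypothetical helper for illustration.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q_i in q:
        # Scaled similarity of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append(
            [sum(w * v_j[c] for w, v_j in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


# With identical keys the attention weights are uniform,
# so every output row is the mean of the value rows.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [3.0, 2.0]]
result = sdpa_reference(q, k, v)
```

An optimized backend changes how this computation is scheduled in memory (and, for quantized variants, its precision), not the function being computed.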
DiTBlock ¶
Bases: Module
DiT block with AdaLayerNorm modulation.
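The page does not show how the modulation is applied. The sketch below assumes the common adaLN-Zero pattern from DiT-style models: a conditioning vector yields per-channel `shift`, `scale`, and `gate` values, the normalized input is scaled and shifted, and the branch output is gated before the residual add. All function names here are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm over one feature vector, without learned affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def modulate(x, shift, scale):
    """Assumed adaLN modulation: norm(x) * (1 + scale) + shift, per channel."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]


def gated_residual(x, branch_out, gate):
    """Assumed residual update: x + gate * branch(x), per channel."""
    return [v + g * o for v, g, o in zip(x, gate, branch_out)]


x = [1.0, 2.0, 3.0, 4.0]
zeros = [0.0] * len(x)

# With zero shift and scale, modulation reduces to plain LayerNorm.
normed = modulate(x, shift=zeros, scale=zeros)

# With a zero gate, the block is the identity on x, which is the
# "zero" initialization this pattern is named for.
unchanged = gated_residual(x, branch_out=[9.0, 9.0, 9.0, 9.0], gate=zeros)
```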
attn instance-attribute ¶
attn = DiTAttention(
    dim=dim, heads=heads, dim_head=dim_head, dropout=dropout
)
ff instance-attribute ¶
ff = FeedForward(
    dim=dim,
    mult=ff_mult,
    dropout=dropout,
    approximate="tanh",
)
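The `approximate="tanh"` argument is not documented on this page. Assuming it has the same meaning as the `approximate` argument of `torch.nn.GELU`, it selects the tanh approximation of GELU rather than the exact erf-based form. The sketch below compares the two in plain Python; the function names are hypothetical.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Print both forms side by side for a few sample inputs.
samples = (-2.0, -0.5, 0.0, 0.5, 2.0)
for x in samples:
    print(f"{x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The two forms agree closely over typical activation ranges, so the choice mainly affects speed and bit-for-bit reproducibility against a reference checkpoint.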
FeedForward ¶
GRN ¶
InputEmbedding ¶
Bases: Module
Input embedding combining noised audio, condition, text, and speaker.
TextEmbedding ¶
Bases: Module
Text embedding with optional ConvNeXt modeling.
text_blocks instance-attribute ¶
text_blocks = Sequential(
    *[
        ConvNeXtV2Block(text_dim, text_dim * conv_mult)
        for _ in range(conv_layers)
    ]
)
get_pos_embed_indices ¶
Get position embedding indices.
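The signature and behavior are not shown on this page. The sketch below assumes the helper behaves like similarly named functions in other DiT-based TTS code: for each batch item it produces `length` indices starting at that item's `start` offset, optionally stretched by `scale`, and clamped so that no index reaches `max_pos`. The function name, parameter names, and the use of plain lists instead of tensors are all assumptions.

```python
def get_pos_embed_indices_sketch(start, length, max_pos, scale=1.0):
    """Hypothetical pure-Python sketch, not the library function.

    start:   list of per-item start offsets, shape [b]
    length:  number of positions to generate per item
    max_pos: size of the position-embedding table
    scale:   step multiplier applied before truncating to an integer
    Returns a nested list of shape [b][length].
    """
    rows = []
    for s in start:
        row = []
        for i in range(length):
            pos = s + int(i * scale)
            # Clamp out-of-range positions to the last valid index.
            row.append(pos if pos < max_pos else max_pos - 1)
        rows.append(row)
    return rows


# Two batch items with different start offsets and a table of size 5.
indices = get_pos_embed_indices_sketch(start=[0, 3], length=4, max_pos=5)
```

Under these assumptions the second item's indices saturate at `max_pos - 1` once they run past the end of the table, which avoids an out-of-bounds lookup.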