vllm_omni.diffusion.models.stable_audio ¶
Stable Audio Open model support for vLLM-Omni.
Modules:
| Name | Description |
|---|---|
| pipeline_stable_audio | Stable Audio Open Pipeline for vLLM-Omni. |
| stable_audio_transformer | Stable Audio DiT Model for vLLM-Omni. |
StableAudioDiTModel ¶
Bases: Module
Optimized Stable Audio DiT model using vLLM layers.
This is an optimized version of the diffusers StableAudioDiTModel that uses vLLM's efficient linear layers and attention implementations.
Architecture:
- Input: [B, in_channels, L] (e.g., [B, 64, L])
- preprocess_conv: residual conv layer (keeps 64 channels)
- proj_in: projects 64 -> 1536 (inner_dim)
- Global+time embeddings prepended to sequence
- Transformer blocks work on 1536-dim
- proj_out: projects 1536 -> 64 (out_channels)
- postprocess_conv: residual conv layer (keeps 64 channels)
- Output: [B, out_channels, L]
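The shape flow listed above can be traced with a small, dependency-free sketch. It is illustrative only: the helper name `trace_shapes`, the transpose to `[B, L, C]` before `proj_in`, and the single prepended conditioning token are assumptions, not details confirmed by this page.

```python
def trace_shapes(batch, length, in_channels=64, inner_dim=1536,
                 out_channels=64, prepended_tokens=1):
    """Return (stage, shape) pairs for the documented architecture.

    Assumptions (not stated on this page): activations are transposed to
    [B, L, C] before proj_in, and the global+time embedding occupies
    `prepended_tokens` positions that are dropped again after proj_out.
    """
    seq = length + prepended_tokens
    return [
        ("input", (batch, in_channels, length)),
        ("preprocess_conv", (batch, in_channels, length)),  # residual, same shape
        ("proj_in", (batch, length, inner_dim)),             # 64 -> 1536
        ("prepend global+time", (batch, seq, inner_dim)),
        ("transformer_blocks", (batch, seq, inner_dim)),
        ("proj_out", (batch, seq, out_channels)),            # 1536 -> 64
        ("postprocess_conv", (batch, out_channels, length)), # residual, same shape
        ("output", (batch, out_channels, length)),
    ]


for stage, shape in trace_shapes(batch=2, length=1024):
    print(f"{stage:>20}: {shape}")
```

The point of the trace is that the input and output shapes match, so the model can be applied repeatedly inside a denoising loop.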
config instance-attribute ¶
config = type(
    "Config",
    (),
    {
        "sample_size": sample_size,
        "in_channels": in_channels,
        "out_channels": out_channels,
        "num_layers": num_layers,
        "attention_head_dim": attention_head_dim,
        "num_attention_heads": num_attention_heads,
        "num_key_value_attention_heads": num_key_value_attention_heads,
        "cross_attention_dim": cross_attention_dim,
        "time_proj_dim": time_proj_dim,
        "global_states_input_dim": global_states_input_dim,
        "cross_attention_input_dim": cross_attention_input_dim,
    },
)()
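The `type("Config", (), {...})()` expression builds a throwaway class and instantiates it immediately, which gives attribute-style access to the hyperparameters. A minimal sketch of the idiom follows; `make_config` and the field values are hypothetical and chosen only for illustration.

```python
def make_config(**fields):
    # type(name, bases, namespace) creates a new class whose class
    # attributes are the given fields; calling it returns an instance.
    return type("Config", (), dict(fields))()


# Illustrative values only, not taken from a real checkpoint.
config = make_config(in_channels=64, out_channels=64, num_layers=4)

print(config.in_channels)      # attribute-style access
print(type(config).__name__)   # the class is named "Config"
print(vars(config))            # empty: the fields live on the class, not the instance
```

One consequence worth knowing is that the fields are class attributes, so `vars(config)` is empty and the values have to be read with `getattr` or by inspecting `type(config).__dict__`.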
cross_attention_proj instance-attribute ¶
cross_attention_proj = Sequential(
    Linear(
        cross_attention_input_dim,
        cross_attention_dim,
        bias=False,
    ),
    SiLU(),
    Linear(
        cross_attention_dim, cross_attention_dim, bias=False
    ),
)
global_proj instance-attribute ¶
global_proj = Sequential(
    Linear(global_states_input_dim, inner_dim, bias=False),
    SiLU(),
    Linear(inner_dim, inner_dim, bias=False),
)
postprocess_conv instance-attribute ¶
preprocess_conv instance-attribute ¶
time_proj instance-attribute ¶
time_proj = StableAudioGaussianFourierProjection(
    embedding_size=time_proj_dim // 2
)
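The `embedding_size=time_proj_dim // 2` argument is consistent with a Gaussian Fourier projection that emits one cosine and one sine feature per random frequency, so the concatenated output has `time_proj_dim` entries. The sketch below shows that general technique in plain Python; the function name, the cos-then-sin ordering, the unit-variance frequencies, and `time_proj_dim = 256` are assumptions rather than details of `StableAudioGaussianFourierProjection`.

```python
import math
import random


def gaussian_fourier_features(t, weights):
    """Map a scalar timestep t to [cos(2*pi*t*w), sin(2*pi*t*w)] features."""
    angles = [2.0 * math.pi * t * w for w in weights]
    # Ordering (cos first, then sin) is an assumption for this sketch.
    return [math.cos(a) for a in angles] + [math.sin(a) for a in angles]


time_proj_dim = 256                      # illustrative value
rng = random.Random(0)                   # fixed seed so the sketch is repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(time_proj_dim // 2)]

features = gaussian_fourier_features(0.5, weights)
print(len(features))                     # equals time_proj_dim
```

In a trained model the frequencies would be loaded from the checkpoint rather than sampled at call time; they are sampled here only to keep the example self-contained.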
timestep_proj instance-attribute ¶
timestep_proj = Sequential(
    Linear(time_proj_dim, inner_dim, bias=True),
    SiLU(),
    Linear(inner_dim, inner_dim, bias=True),
)
transformer_blocks instance-attribute ¶
transformer_blocks = ModuleList(
    [
        StableAudioDiTBlock(
            dim=inner_dim,
            num_attention_heads=num_attention_heads,
            num_key_value_attention_heads=num_key_value_attention_heads,
            attention_head_dim=attention_head_dim,
            cross_attention_dim=cross_attention_dim,
        )
        for _ in range(num_layers)
    ]
)
forward ¶
forward(
    hidden_states: Tensor,
    timestep: Tensor,
    encoder_hidden_states: Tensor,
    global_hidden_states: Tensor | None = None,
    rotary_embedding: tuple[Tensor, Tensor] | None = None,
    return_dict: bool = True,
    attention_mask: Tensor | None = None,
    encoder_attention_mask: Tensor | None = None,
) -> Tensor | Transformer2DModelOutput
Forward pass of the Stable Audio DiT model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | Tensor | Input latent tensor [B, C, L] (C=in_channels=64) | required |
| timestep | Tensor | Timestep tensor [B] or [1] | required |
| encoder_hidden_states | Tensor | Text/condition embeddings [B, S, D] | required |
| global_hidden_states | Tensor \| None | Global conditioning (duration) [B, 1, D] | None |
| rotary_embedding | tuple[Tensor, Tensor] \| None | Precomputed rotary embeddings (cos, sin) | None |
| return_dict | bool | Whether to return a dataclass or tuple | True |
| attention_mask | Tensor \| None | Attention mask for self-attention | None |
| encoder_attention_mask | Tensor \| None | Attention mask for cross-attention | None |
Returns:
| Type | Description |
|---|---|
| Tensor \| Transformer2DModelOutput | Denoised latent tensor |
StableAudioPipeline ¶
Bases: Module, SupportAudioOutput, DiffusionPipelineProfilerMixin
Pipeline for text-to-audio generation using Stable Audio Open.
This pipeline generates audio from text prompts using the Stable Audio Open model from Stability AI, integrated with vLLM-Omni's diffusion framework.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| od_config | OmniDiffusionConfig | OmniDiffusion configuration object | required |
| prefix | str | Weight prefix for loading | '' |
scheduler instance-attribute ¶
scheduler = StableAudioSchedulerWrapper(
    from_pretrained(
        model,
        subfolder="scheduler",
        local_files_only=local_files_only,
    )
)
tokenizer instance-attribute ¶
transformer instance-attribute ¶
transformer = StableAudioDiTModel(
    od_config=od_config, **transformer_kwargs
)
weights_sources instance-attribute ¶
weights_sources = [
    ComponentSource(
        model_or_path=model,
        subfolder="transformer",
        revision=None,
        prefix="transformer.",
        fall_back_to_pt=True,
    )
]
check_inputs ¶
check_inputs(
    prompt: str | list[str] | None,
    audio_start_in_s: float,
    audio_end_in_s: float,
    negative_prompt: str | list[str] | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
)
Validate input parameters.
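This page does not list the individual checks, so the following is only a sketch of the kind of validation such a method commonly performs. Every rule in it, along with the name `check_inputs_sketch`, is an assumption for illustration and may differ from what `check_inputs` actually enforces.

```python
def check_inputs_sketch(prompt, audio_start_in_s, audio_end_in_s,
                        prompt_embeds=None):
    """Hypothetical validation rules; not the library's actual checks."""
    if audio_start_in_s < 0:
        raise ValueError("audio_start_in_s must be non-negative")
    if audio_end_in_s < audio_start_in_s:
        raise ValueError("audio_end_in_s must not precede audio_start_in_s")
    if prompt is None and prompt_embeds is None:
        raise ValueError("provide either prompt or prompt_embeds")
    if prompt is not None and prompt_embeds is not None:
        raise ValueError("provide only one of prompt and prompt_embeds")
    if prompt is not None and not isinstance(prompt, (str, list)):
        raise ValueError("prompt must be a str or a list of str")


# A valid call returns None; an inverted time range raises ValueError.
check_inputs_sketch("rain on a tin roof", 0.0, 10.0)
```

Failing fast on inconsistent arguments keeps errors close to the caller instead of surfacing later as shape mismatches inside the denoising loop.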
encode_duration ¶
encode_duration(
    audio_start_in_s: float,
    audio_end_in_s: float,
    device: device,
    do_classifier_free_guidance: bool,
    batch_size: int,
) -> tuple[Tensor, Tensor]
Encode audio duration to conditioning tensors.
encode_prompt ¶
encode_prompt(
    prompt: str | list[str],
    device: device,
    do_classifier_free_guidance: bool,
    negative_prompt: str | list[str] | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    attention_mask: Tensor | None = None,
    negative_attention_mask: Tensor | None = None,
) -> Tensor
Encode text prompt to embeddings.
forward ¶
forward(
    req: OmniDiffusionRequest,
    prompt: str | list[str] | None = None,
    negative_prompt: str | list[str] | None = None,
    audio_end_in_s: float | None = None,
    audio_start_in_s: float = 0.0,
    guidance_scale: float = 7.0,
    num_waveforms_per_prompt: int = 1,
    generator: Generator | list[Generator] | None = None,
    latents: Tensor | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    output_type: str = "np",
) -> DiffusionOutput
Generate audio from text prompt.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| req | OmniDiffusionRequest | Request object containing generation parameters | required |
| prompt | str \| list[str] \| None | Text prompt for audio generation | None |
| negative_prompt | str \| list[str] \| None | Negative prompt for classifier-free guidance (CFG) | None |
| audio_end_in_s | float \| None | Audio end time in seconds (max ~47s for stable-audio-open-1.0) | None |
| audio_start_in_s | float | Audio start time in seconds | 0.0 |
| guidance_scale | float | CFG scale | 7.0 |
| num_waveforms_per_prompt | int | Number of audio outputs per prompt | 1 |
| generator | Generator \| list[Generator] \| None | Random generator for reproducibility | None |
| latents | Tensor \| None | Pre-generated latents | None |
| prompt_embeds | Tensor \| None | Pre-computed prompt embeddings | None |
| negative_prompt_embeds | Tensor \| None | Pre-computed negative prompt embeddings | None |
| output_type | str | Output format ("np", "pt", or "latent") | 'np' |
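The "max ~47s" limit on `audio_end_in_s` can be reproduced with simple arithmetic. The three constants below (latent length 1024, a VAE downsampling factor of 2048, and a 44.1 kHz sampling rate) are assumptions about the stable-audio-open-1.0 checkpoint and are not stated on this page.

```python
# Assumed checkpoint constants (not taken from this page).
LATENT_LENGTH = 1024          # transformer sample_size, in latent frames
DOWNSAMPLING_RATIO = 2048     # waveform samples per latent frame
SAMPLING_RATE = 44_100        # Hz

max_samples = LATENT_LENGTH * DOWNSAMPLING_RATIO
max_seconds = max_samples / SAMPLING_RATE

print(max_samples)                 # total waveform samples per generation
print(round(max_seconds, 2))       # maximum duration in seconds
```

Under these assumptions the ceiling is a property of the latent length the transformer was configured for, not of the pipeline code itself.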
Returns:
| Type | Description |
|---|---|
| DiffusionOutput | Output object containing the generated audio |
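The `guidance_scale` parameter is described only as the CFG scale. For orientation, the standard classifier-free guidance combination is sketched below on plain Python lists; the helper `apply_cfg` is hypothetical, and whether the pipeline applies exactly this formula is an assumption.

```python
def apply_cfg(noise_uncond, noise_text, guidance_scale=7.0):
    """Standard classifier-free guidance: uncond + scale * (text - uncond)."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]   # prediction with the negative (or empty) prompt
text = [1.0, 3.0]     # prediction with the text prompt

guided = apply_cfg(uncond, text, guidance_scale=7.0)
print(guided)
```

With `guidance_scale=1.0` the formula reduces to the text-conditioned prediction, which is why guidance is usually treated as active only for scales above 1.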
load_weights ¶
Load weights using AutoWeightsLoader for vLLM integration.
get_stable_audio_post_process_func ¶
get_stable_audio_post_process_func(
    od_config: OmniDiffusionConfig,
)
Create the post-processing function for Stable Audio output.
The returned function converts the raw audio tensor to a NumPy array for saving.