vllm_omni.diffusion.models.stable_audio.pipeline_stable_audio
Stable Audio Open Pipeline for vLLM-Omni.
This module provides text-to-audio generation using the Stable Audio Open model from Stability AI, integrated with the vLLM-Omni diffusion framework.
StableAudioPipeline
Bases: Module, SupportAudioOutput, DiffusionPipelineProfilerMixin
Pipeline for text-to-audio generation using Stable Audio Open.
It takes text prompts as input and runs inside vLLM-Omni's diffusion framework.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `od_config` | `OmniDiffusionConfig` | OmniDiffusion configuration object | required |
| `prefix` | `str` | Weight prefix for loading (default: "") | `''` |
scheduler instance-attribute
scheduler = StableAudioSchedulerWrapper(
from_pretrained(
model,
subfolder="scheduler",
local_files_only=local_files_only,
)
)
tokenizer instance-attribute
transformer instance-attribute
transformer = StableAudioDiTModel(
od_config=od_config, **transformer_kwargs
)
weights_sources instance-attribute
weights_sources = [
ComponentSource(
model_or_path=model,
subfolder="transformer",
revision=None,
prefix="transformer.",
fall_back_to_pt=True,
)
]
check_inputs
check_inputs(
prompt: str | list[str] | None,
audio_start_in_s: float,
audio_end_in_s: float,
negative_prompt: str | list[str] | None = None,
prompt_embeds: Tensor | None = None,
negative_prompt_embeds: Tensor | None = None,
)
Validate input parameters.
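The page does not list the individual checks, so the following is only a minimal sketch of what validation against this signature might look like. The function name `check_inputs_sketch`, the specific rules, and the error messages are all assumptions made for illustration, not the pipeline's actual behaviour.

```python
def check_inputs_sketch(
    prompt,
    audio_start_in_s,
    audio_end_in_s,
    negative_prompt=None,
    prompt_embeds=None,
    negative_prompt_embeds=None,
):
    """Hypothetical validation that mirrors the check_inputs signature."""
    # Assumed rule: the audio window must start at or after 0 and be non-empty.
    if audio_start_in_s < 0:
        raise ValueError("audio_start_in_s must be >= 0")
    if audio_end_in_s <= audio_start_in_s:
        raise ValueError("audio_end_in_s must be greater than audio_start_in_s")

    # Assumed rule: exactly one of prompt / prompt_embeds is supplied.
    if prompt is None and prompt_embeds is None:
        raise ValueError("provide either prompt or prompt_embeds")
    if prompt is not None and prompt_embeds is not None:
        raise ValueError("provide only one of prompt or prompt_embeds")
    if prompt is not None and not isinstance(prompt, (str, list)):
        raise ValueError("prompt must be a str or a list of str")

    # Assumed rule: the negative prompt follows the same either/or convention.
    if negative_prompt is not None and negative_prompt_embeds is not None:
        raise ValueError(
            "provide only one of negative_prompt or negative_prompt_embeds"
        )


# A valid call returns None without raising.
check_inputs_sketch("rain on a tin roof", 0.0, 10.0)
```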
encode_duration
encode_duration(
audio_start_in_s: float,
audio_end_in_s: float,
device: device,
do_classifier_free_guidance: bool,
batch_size: int,
) -> tuple[Tensor, Tensor]
Encode audio duration to conditioning tensors.
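As a rough illustration of how a start/end pair could become per-batch conditioning values, the sketch below clamps each time to a range, normalizes it to [0, 1], and repeats it once per batch row (twice as many rows when classifier-free guidance is on). The bounds `min_s` and `max_s`, the use of plain lists in place of tensors, and the doubling rule are assumptions; the real method returns tensors and may involve a learned projection.

```python
def encode_duration_sketch(
    audio_start_in_s: float,
    audio_end_in_s: float,
    do_classifier_free_guidance: bool,
    batch_size: int,
    min_s: float = 0.0,    # assumed lower bound for conditioning values
    max_s: float = 256.0,  # assumed upper bound for conditioning values
) -> tuple[list[float], list[float]]:
    """Hypothetical stand-in: returns normalized start/end values per batch row."""

    def normalize(seconds: float) -> float:
        # Clamp into [min_s, max_s], then scale to [0, 1].
        clamped = min(max(seconds, min_s), max_s)
        return (clamped - min_s) / (max_s - min_s)

    # With CFG, the batch holds an unconditional and a conditional copy.
    rows = batch_size * (2 if do_classifier_free_guidance else 1)
    start = [normalize(audio_start_in_s)] * rows
    end = [normalize(audio_end_in_s)] * rows
    return start, end


start, end = encode_duration_sketch(0.0, 64.0, True, batch_size=2)
```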
encode_prompt
encode_prompt(
prompt: str | list[str],
device: device,
do_classifier_free_guidance: bool,
negative_prompt: str | list[str] | None = None,
prompt_embeds: Tensor | None = None,
negative_prompt_embeds: Tensor | None = None,
attention_mask: Tensor | None = None,
negative_attention_mask: Tensor | None = None,
) -> Tensor
Encode a text prompt into embeddings.
forward
forward(
req: OmniDiffusionRequest,
prompt: str | list[str] | None = None,
negative_prompt: str | list[str] | None = None,
audio_end_in_s: float | None = None,
audio_start_in_s: float = 0.0,
guidance_scale: float = 7.0,
num_waveforms_per_prompt: int = 1,
generator: Generator | list[Generator] | None = None,
latents: Tensor | None = None,
prompt_embeds: Tensor | None = None,
negative_prompt_embeds: Tensor | None = None,
output_type: str = "np",
) -> DiffusionOutput
Generate audio from a text prompt.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `req` | `OmniDiffusionRequest` | Request object containing generation parameters | required |
| `prompt` | `str \| list[str] \| None` | Text prompt for audio generation | `None` |
| `negative_prompt` | `str \| list[str] \| None` | Negative prompt for classifier-free guidance (CFG) | `None` |
| `audio_end_in_s` | `float \| None` | Audio end time in seconds (max ~47 s for stable-audio-open-1.0) | `None` |
| `audio_start_in_s` | `float` | Audio start time in seconds | `0.0` |
| `guidance_scale` | `float` | CFG scale | `7.0` |
| `num_waveforms_per_prompt` | `int` | Number of audio outputs per prompt | `1` |
| `generator` | `Generator \| list[Generator] \| None` | Random generator for reproducibility | `None` |
| `latents` | `Tensor \| None` | Pre-generated latents | `None` |
| `prompt_embeds` | `Tensor \| None` | Pre-computed prompt embeddings | `None` |
| `negative_prompt_embeds` | `Tensor \| None` | Pre-computed negative prompt embeddings | `None` |
| `output_type` | `str` | Output format (`"np"`, `"pt"`, or `"latent"`) | `'np'` |
Returns:
| Type | Description |
|---|---|
| `DiffusionOutput` | DiffusionOutput containing generated audio |
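To make the role of `guidance_scale` concrete, here is a minimal sketch of the standard classifier-free guidance combination, written on plain Python lists so it runs without any tensor library. The helper name `apply_cfg` is hypothetical, and whether the pipeline applies exactly this formula at each denoising step is an assumption rather than something this page states.

```python
def apply_cfg(uncond, cond, guidance_scale=7.0):
    """Combine unconditional and conditional predictions element-wise.

    Standard CFG formula: uncond + scale * (cond - uncond).
    A scale of 1.0 returns the conditional prediction unchanged.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy values standing in for model predictions.
guided = apply_cfg([0.0, 1.0], [1.0, 1.0], guidance_scale=7.0)
```

Larger scales push the result further from the unconditional prediction, in the direction the prompt suggests; where the two predictions agree, the scale has no effect.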
load_weights
Load weights using AutoWeightsLoader for vLLM integration.
get_stable_audio_post_process_func
get_stable_audio_post_process_func(
od_config: OmniDiffusionConfig,
)
Create the post-processing function for Stable Audio output.
The returned function converts the raw audio tensor to a NumPy array for saving.
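Once the audio is available as an array of float samples, a caller will typically want to write it to disk. The sketch below shows one way to do that with only the Python standard library. It assumes mono samples in the range [-1, 1] and a 44100 Hz sample rate; both are illustrative assumptions, and `float_to_wav_bytes` is a hypothetical helper, not part of vLLM-Omni.

```python
import io
import struct
import wave


def float_to_wav_bytes(samples, sample_rate=44100):
    """Encode mono float samples (assumed in [-1, 1]) as 16-bit PCM WAV bytes."""
    pcm = bytearray()
    for s in samples:
        # Clip out-of-range values before scaling to the int16 range.
        clipped = max(-1.0, min(1.0, s))
        pcm += struct.pack("<h", int(clipped * 32767))

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono, for simplicity
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buf.getvalue()


# Usage: a few toy samples in place of a generated waveform.
wav_bytes = float_to_wav_bytes([0.0, 0.5, -0.5, 1.0])
```

The resulting bytes can be written to a `.wav` file directly; multi-channel output would need interleaved frames and a matching `setnchannels` value.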