vllm_omni.diffusion.models.stable_audio.pipeline_stable_audio

Stable Audio Open Pipeline for vLLM-Omni.

This module provides text-to-audio generation using the Stable Audio Open model from Stability AI, integrated with the vLLM-Omni diffusion framework.

logger module-attribute

logger = init_logger(__name__)

StableAudioPipeline

Bases: Module, SupportAudioOutput, DiffusionPipelineProfilerMixin

Pipeline for text-to-audio generation using Stable Audio Open.

This pipeline generates audio from text prompts using the Stable Audio Open model from Stability AI, integrated with vLLM-Omni's diffusion framework.

Parameters:

od_config (OmniDiffusionConfig, required): OmniDiffusion configuration object.
prefix (str, default ''): Weight prefix for loading.

audio_sample_rate class-attribute

audio_sample_rate: int = 44100
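
To make the sample rate concrete, the sketch below converts a clip window in seconds into a per-channel sample count. The helper `duration_to_samples` is hypothetical and is not part of this module; only the 44100 Hz value comes from the attribute above.

```python
AUDIO_SAMPLE_RATE = 44100  # mirrors StableAudioPipeline.audio_sample_rate


def duration_to_samples(
    start_s: float,
    end_s: float,
    sample_rate: int = AUDIO_SAMPLE_RATE,
) -> int:
    """Return the number of samples per channel between start_s and end_s.

    Hypothetical helper for illustration; not part of the pipeline API.
    """
    if end_s < start_s:
        raise ValueError("end_s must not be smaller than start_s")
    return int(round((end_s - start_s) * sample_rate))


print(duration_to_samples(0.0, 10.0))  # 441000
```

A 10-second clip at 44.1 kHz therefore holds 441,000 samples per channel.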

current_timestep property

current_timestep

device instance-attribute

device = get_local_device()

do_classifier_free_guidance property

do_classifier_free_guidance

guidance_scale property

guidance_scale

num_timesteps property

num_timesteps

od_config instance-attribute

od_config = od_config

projection_model instance-attribute

projection_model = to(device)

rotary_embed_dim instance-attribute

rotary_embed_dim = attention_head_dim // 2

scheduler instance-attribute

scheduler = StableAudioSchedulerWrapper(
    from_pretrained(
        model,
        subfolder="scheduler",
        local_files_only=local_files_only,
    )
)

support_audio_output class-attribute

support_audio_output: bool = True

text_encoder instance-attribute

text_encoder = to(device)

tokenizer instance-attribute

tokenizer = from_pretrained(
    model,
    subfolder="tokenizer",
    local_files_only=local_files_only,
)

transformer instance-attribute

transformer = StableAudioDiTModel(
    od_config=od_config, **transformer_kwargs
)

vae instance-attribute

vae = to(device)

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=model,
        subfolder="transformer",
        revision=None,
        prefix="transformer.",
        fall_back_to_pt=True,
    )
]

check_inputs

check_inputs(
    prompt: str | list[str] | None,
    audio_start_in_s: float,
    audio_end_in_s: float,
    negative_prompt: str | list[str] | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
)

Validate input parameters.
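
This page does not list the individual checks, so the following is only a sketch of the kind of validation such a method commonly performs, loosely modeled on similar diffusion pipelines. The function name `check_inputs_sketch` and every rule in it are assumptions, not the confirmed behavior of `check_inputs`.

```python
def check_inputs_sketch(
    prompt,
    audio_start_in_s,
    audio_end_in_s,
    negative_prompt=None,
    prompt_embeds=None,
    negative_prompt_embeds=None,
):
    """Raise ValueError on inconsistent arguments (all rules are assumed)."""
    # Assumed rule: the requested audio window must not be reversed.
    if audio_end_in_s < audio_start_in_s:
        raise ValueError("audio_end_in_s must be >= audio_start_in_s")

    # Assumed rule: supply exactly one of prompt / prompt_embeds.
    if prompt is not None and prompt_embeds is not None:
        raise ValueError("pass either prompt or prompt_embeds, not both")
    if prompt is None and prompt_embeds is None:
        raise ValueError("one of prompt or prompt_embeds is required")

    # Assumed rule: a text prompt is a string or a list of strings.
    if prompt is not None and not isinstance(prompt, (str, list)):
        raise ValueError("prompt must be a str or a list of str")

    # Assumed rule: same exclusivity for the negative prompt.
    if negative_prompt is not None and negative_prompt_embeds is not None:
        raise ValueError(
            "pass either negative_prompt or negative_prompt_embeds, not both"
        )


# A valid call returns None without raising.
check_inputs_sketch("gentle rain on a tin roof", 0.0, 10.0)
```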

encode_duration

encode_duration(
    audio_start_in_s: float,
    audio_end_in_s: float,
    device: device,
    do_classifier_free_guidance: bool,
    batch_size: int,
) -> tuple[Tensor, Tensor]

Encode audio duration to conditioning tensors.

encode_prompt

encode_prompt(
    prompt: str | list[str],
    device: device,
    do_classifier_free_guidance: bool,
    negative_prompt: str | list[str] | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    attention_mask: Tensor | None = None,
    negative_attention_mask: Tensor | None = None,
) -> Tensor

Encode text prompt to embeddings.

forward

forward(
    req: OmniDiffusionRequest,
    prompt: str | list[str] | None = None,
    negative_prompt: str | list[str] | None = None,
    audio_end_in_s: float | None = None,
    audio_start_in_s: float = 0.0,
    guidance_scale: float = 7.0,
    num_waveforms_per_prompt: int = 1,
    generator: Generator | list[Generator] | None = None,
    latents: Tensor | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    output_type: str = "np",
) -> DiffusionOutput

Generate audio from text prompt.

Parameters:

req (OmniDiffusionRequest, required): OmniDiffusionRequest containing generation parameters.
prompt (str | list[str] | None, default None): Text prompt for audio generation.
negative_prompt (str | list[str] | None, default None): Negative prompt for CFG.
audio_end_in_s (float | None, default None): Audio end time in seconds (max ~47s for stable-audio-open-1.0).
audio_start_in_s (float, default 0.0): Audio start time in seconds.
guidance_scale (float, default 7.0): CFG scale.
num_waveforms_per_prompt (int, default 1): Number of audio outputs per prompt.
generator (Generator | list[Generator] | None, default None): Random generator for reproducibility.
latents (Tensor | None, default None): Pre-generated latents.
prompt_embeds (Tensor | None, default None): Pre-computed prompt embeddings.
negative_prompt_embeds (Tensor | None, default None): Pre-computed negative prompt embeddings.
output_type (str, default 'np'): Output format ("np", "pt", or "latent").

Returns:

DiffusionOutput: DiffusionOutput containing generated audio.
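
The `guidance_scale` parameter is described only as the CFG scale. As a minimal sketch of the standard classifier-free guidance combination (the general technique, not code taken from this module), the guided prediction starts at the unconditional prediction and extrapolates toward the text-conditioned one. The function `apply_cfg` and the plain-list inputs are illustrative assumptions; the pipeline itself works on tensors.

```python
def apply_cfg(
    uncond: list[float],
    cond: list[float],
    guidance_scale: float,
) -> list[float]:
    """Combine unconditional and conditional predictions element-wise.

    Illustrative only: guided = uncond + scale * (cond - uncond).
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


guided = apply_cfg([0.0, 1.0], [1.0, 3.0], guidance_scale=7.0)
print(guided)  # [7.0, 15.0]
```

With a scale of 1.0 the formula returns the conditional prediction unchanged; larger values push the result further in the direction the prompt suggests.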

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights using AutoWeightsLoader for vLLM integration.

prepare_latents

prepare_latents(
    batch_size: int,
    num_channels_vae: int,
    sample_size: int,
    dtype: dtype,
    device: device,
    generator: Generator | list[Generator] | None,
    latents: Tensor | None = None,
) -> Tensor

Prepare initial latent noise.

get_stable_audio_post_process_func

get_stable_audio_post_process_func(
    od_config: OmniDiffusionConfig,
)

Create post-processing function for Stable Audio output.

Converts raw audio tensor to numpy array for saving.
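
Once the audio is available as an array of samples, a downstream saving step might look like the sketch below. It uses only the Python standard library and assumes mono float samples in the range [-1.0, 1.0]; the helper `write_wav` is hypothetical, and the actual output layout (channel count, dtype) of the post-processing function is not specified on this page.

```python
import io
import struct
import wave


def write_wav(samples, sample_rate=44100, target=None):
    """Write mono float samples as a 16-bit PCM WAV file.

    Hypothetical helper: `target` may be a path or a binary file object;
    an in-memory buffer is used when it is omitted.
    """
    if target is None:
        target = io.BytesIO()
    # Clip each sample to [-1.0, 1.0], then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono (assumption)
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return target


buffer = write_wav([0.0, 0.5, -1.0, 2.0])
print(len(buffer.getvalue()) > 0)  # True
```

Values outside the valid range are clipped rather than wrapped, which avoids loud artifacts if the samples overshoot slightly.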