vllm_omni.diffusion.models.stable_audio

Stable Audio Open model support for vLLM-Omni.

Modules:

pipeline_stable_audio: Stable Audio Open Pipeline for vLLM-Omni.

stable_audio_transformer: Stable Audio DiT Model for vLLM-Omni.

StableAudioDiTModel

Bases: Module

Optimized Stable Audio DiT model using vLLM layers.

This is an optimized version of the diffusers StableAudioDiTModel that uses vLLM's efficient linear layers and attention implementations.

Architecture:

- Input: [B, in_channels, L] (e.g., [B, 64, L])
- preprocess_conv: residual conv layer (keeps 64 channels)
- proj_in: projects 64 -> 1536 (inner_dim)
- Global+time embeddings prepended to sequence
- Transformer blocks work on 1536-dim
- proj_out: projects 1536 -> 64 (out_channels)
- postprocess_conv: residual conv layer (keeps 64 channels)
- Output: [B, out_channels, L]
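The data flow listed above can be traced with a small pure-Python sketch. The helper `trace_shapes` is hypothetical and not part of vLLM-Omni. It assumes the global and time embeddings occupy a single prepended token that is dropped again before the output, as in the diffusers reference model; the description above does not state the token count or where it is removed.

```python
def trace_shapes(batch, length, in_channels=64, inner_dim=1536,
                 out_channels=64, prepended_tokens=1):
    """Return (stage, shape) pairs for the documented DiT data flow.

    prepended_tokens=1 is an assumption, not something the docs confirm.
    """
    seq = length + prepended_tokens
    return [
        ("input", (batch, in_channels, length)),
        ("preprocess_conv", (batch, in_channels, length)),
        # Channels move to the last axis so Linear layers act per position.
        ("proj_in", (batch, length, inner_dim)),
        ("prepend global+time", (batch, seq, inner_dim)),
        ("transformer_blocks", (batch, seq, inner_dim)),
        ("proj_out", (batch, seq, out_channels)),
        # Prepended token(s) dropped and channels moved back to axis 1.
        ("postprocess_conv", (batch, out_channels, length)),
    ]


for stage, shape in trace_shapes(batch=2, length=1024):
    print(f"{stage:<22} {shape}")
```

With `in_channels == out_channels`, the first and last shapes match, which is what lets the model's output be used as a drop-in update for the input latents.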

attention_head_dim instance-attribute

attention_head_dim = attention_head_dim

config instance-attribute

config = type(
    "Config",
    (),
    {
        "sample_size": sample_size,
        "in_channels": in_channels,
        "out_channels": out_channels,
        "num_layers": num_layers,
        "attention_head_dim": attention_head_dim,
        "num_attention_heads": num_attention_heads,
        "num_key_value_attention_heads": num_key_value_attention_heads,
        "cross_attention_dim": cross_attention_dim,
        "time_proj_dim": time_proj_dim,
        "global_states_input_dim": global_states_input_dim,
        "cross_attention_input_dim": cross_attention_input_dim,
    },
)()
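The `type("Config", (), {...})()` expression builds an anonymous class and instantiates it immediately, so the constructor arguments become readable as `config.<name>`. A minimal sketch of the same pattern follows; `make_config` is a hypothetical helper, and the numeric values are assumptions chosen to reproduce the 1536-dim inner size mentioned in the architecture notes.

```python
def make_config(**fields):
    """Build a throwaway object whose attributes mirror the given fields."""
    # type(name, bases, namespace) creates a class; calling it gives an instance.
    return type("Config", (), dict(fields))()


# Illustrative values only (assumed, not read from a real checkpoint).
config = make_config(
    in_channels=64,
    out_channels=64,
    num_attention_heads=24,
    attention_head_dim=64,
)

inner_dim = config.num_attention_heads * config.attention_head_dim
print(config.in_channels, inner_dim)  # 64 1536
```

The fields are stored as class attributes of the generated class, so the object behaves like a read-mostly namespace and not like a full configuration class with validation or serialization.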

cross_attention_proj instance-attribute

cross_attention_proj = Sequential(
    Linear(
        cross_attention_input_dim,
        cross_attention_dim,
        bias=False,
    ),
    SiLU(),
    Linear(
        cross_attention_dim, cross_attention_dim, bias=False
    ),
)

dtype property

dtype: dtype

Return the dtype of the model parameters.

global_proj instance-attribute

global_proj = Sequential(
    Linear(global_states_input_dim, inner_dim, bias=False),
    SiLU(),
    Linear(inner_dim, inner_dim, bias=False),
)

in_channels instance-attribute

in_channels = in_channels

inner_dim instance-attribute

inner_dim = num_attention_heads * attention_head_dim

num_attention_heads instance-attribute

num_attention_heads = num_attention_heads

num_layers instance-attribute

num_layers = num_layers

out_channels instance-attribute

out_channels = out_channels

postprocess_conv instance-attribute

postprocess_conv = Conv1d(
    out_channels, out_channels, 1, bias=False
)

preprocess_conv instance-attribute

preprocess_conv = Conv1d(
    in_channels, in_channels, 1, bias=False
)

proj_in instance-attribute

proj_in = Linear(in_channels, inner_dim, bias=False)

proj_out instance-attribute

proj_out = Linear(inner_dim, out_channels, bias=False)

sample_size instance-attribute

sample_size = sample_size

time_proj instance-attribute

time_proj = StableAudioGaussianFourierProjection(
    embedding_size=time_proj_dim // 2
)
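The embedding size is half of `time_proj_dim` because a Gaussian Fourier projection emits both a cosine and a sine feature per random frequency, so the concatenated output has `time_proj_dim` entries, matching the input width of the first `timestep_proj` layer. The sketch below illustrates the idea in pure Python; the `2 * pi` scaling, the cos-then-sin ordering, and the value `time_proj_dim = 256` are assumptions, and `gaussian_fourier_features` is a hypothetical name.

```python
import math
import random


def gaussian_fourier_features(t, weights):
    """Map a scalar t to [cos(2*pi*w*t)..., sin(2*pi*w*t)...]."""
    angles = [2.0 * math.pi * w * t for w in weights]
    return [math.cos(a) for a in angles] + [math.sin(a) for a in angles]


time_proj_dim = 256  # assumed value for illustration
rng = random.Random(0)
# One random frequency per embedding slot; fixed once drawn.
weights = [rng.gauss(0.0, 1.0) for _ in range(time_proj_dim // 2)]

features = gaussian_fourier_features(0.5, weights)
print(len(features))  # 256
```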

timestep_proj instance-attribute

timestep_proj = Sequential(
    Linear(time_proj_dim, inner_dim, bias=True),
    SiLU(),
    Linear(inner_dim, inner_dim, bias=True),
)

transformer_blocks instance-attribute

transformer_blocks = ModuleList(
    [
        (
            StableAudioDiTBlock(
                dim=inner_dim,
                num_attention_heads=num_attention_heads,
                num_key_value_attention_heads=num_key_value_attention_heads,
                attention_head_dim=attention_head_dim,
                cross_attention_dim=cross_attention_dim,
            )
        )
        for _ in (range(num_layers))
    ]
)

forward

forward(
    hidden_states: Tensor,
    timestep: Tensor,
    encoder_hidden_states: Tensor,
    global_hidden_states: Tensor | None = None,
    rotary_embedding: tuple[Tensor, Tensor] | None = None,
    return_dict: bool = True,
    attention_mask: Tensor | None = None,
    encoder_attention_mask: Tensor | None = None,
) -> Tensor | Transformer2DModelOutput

Forward pass of the Stable Audio DiT model.

Parameters:

hidden_states (Tensor, required): Input latent tensor [B, C, L] (C=in_channels=64)

timestep (Tensor, required): Timestep tensor [B] or [1]

encoder_hidden_states (Tensor, required): Text/condition embeddings [B, S, D]

global_hidden_states (Tensor | None, default None): Global conditioning (duration) [B, 1, D]

rotary_embedding (tuple[Tensor, Tensor] | None, default None): Precomputed rotary embeddings (cos, sin)

return_dict (bool, default True): Whether to return a dataclass or tuple

attention_mask (Tensor | None, default None): Attention mask for self-attention

encoder_attention_mask (Tensor | None, default None): Attention mask for cross-attention

Returns:

Tensor | Transformer2DModelOutput: Denoised latent tensor

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights from a pretrained model.

Maps diffusers weight names to our module structure.

Returns:

set[str]: Set of parameter names that were successfully loaded.
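The contract described here (consume `(name, tensor)` pairs, map names, return the set of names actually loaded) can be sketched without any framework. The function `load_named_weights`, the `rename` table, and the parameter names below are hypothetical; the real name mapping used by the model is not shown in this reference.

```python
from __future__ import annotations

from typing import Any, Iterable


def load_named_weights(
    params: dict[str, Any],
    weights: Iterable[tuple[str, Any]],
    rename: dict[str, str] | None = None,
) -> set[str]:
    """Copy checkpoint entries into params and report which names were filled."""
    rename = rename or {}
    loaded: set[str] = set()
    for name, value in weights:
        target = rename.get(name, name)  # map checkpoint name to local name
        if target not in params:
            continue  # no matching parameter: skip instead of failing
        params[target] = value
        loaded.add(target)
    return loaded


params = {"proj_in.weight": None, "proj_out.weight": None}
checkpoint = [("proj_in.weight", [1.0]), ("unknown.weight", [2.0])]

loaded = load_named_weights(params, checkpoint)
print(sorted(loaded))                # ['proj_in.weight']
print(sorted(set(params) - loaded))  # ['proj_out.weight']
```

Returning the loaded names lets a caller compare them against the full parameter set and detect weights that were never initialized.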

StableAudioPipeline

Bases: Module, SupportAudioOutput, DiffusionPipelineProfilerMixin

Pipeline for text-to-audio generation using Stable Audio Open.

This pipeline generates audio from text prompts using the Stable Audio Open model from Stability AI, integrated with vLLM-Omni's diffusion framework.

Parameters:

od_config (OmniDiffusionConfig, required): OmniDiffusion configuration object

prefix (str, default ''): Weight prefix for loading

audio_sample_rate class-attribute

audio_sample_rate: int = 44100

current_timestep property

current_timestep

device instance-attribute

device = get_local_device()

do_classifier_free_guidance property

do_classifier_free_guidance

guidance_scale property

guidance_scale
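`guidance_scale` and `do_classifier_free_guidance` are documented without descriptions. As a hedged illustration, the sketch below shows the standard classifier-free guidance blend on plain Python lists. The formula and the `guidance_scale > 1.0` convention follow common diffusers practice and are assumptions about this pipeline; `apply_cfg` is a hypothetical helper.

```python
def apply_cfg(noise_uncond, noise_text, guidance_scale):
    """Blend unconditional and text-conditioned predictions element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_text)]


guidance_scale = 7.0
# Assumed convention: guidance is only active when the scale exceeds 1.
do_cfg = guidance_scale > 1.0

uncond = [0.0, 1.0]
text = [1.0, 1.0]
print(do_cfg)                                   # True
print(apply_cfg(uncond, text, guidance_scale))  # [7.0, 1.0]
```

A scale of 1.0 returns the text-conditioned prediction unchanged, which is why guidance can be skipped entirely in that case.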

num_timesteps property

num_timesteps

od_config instance-attribute

od_config = od_config

projection_model instance-attribute

projection_model = to(device)

rotary_embed_dim instance-attribute

rotary_embed_dim = attention_head_dim // 2

scheduler instance-attribute

scheduler = StableAudioSchedulerWrapper(
    from_pretrained(
        model,
        subfolder="scheduler",
        local_files_only=local_files_only,
    )
)

support_audio_output class-attribute

support_audio_output: bool = True

text_encoder instance-attribute

text_encoder = to(device)

tokenizer instance-attribute

tokenizer = from_pretrained(
    model,
    subfolder="tokenizer",
    local_files_only=local_files_only,
)

transformer instance-attribute

transformer = StableAudioDiTModel(
    od_config=od_config, **transformer_kwargs
)

vae instance-attribute

vae = to(device)

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=model,
        subfolder="transformer",
        revision=None,
        prefix="transformer.",
        fall_back_to_pt=True,
    )
]

check_inputs

check_inputs(
    prompt: str | list[str] | None,
    audio_start_in_s: float,
    audio_end_in_s: float,
    negative_prompt: str | list[str] | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
)

Validate input parameters.
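The reference does not list which checks `check_inputs` performs. The sketch below shows the kind of validation such a method typically does, modelled loosely on the diffusers pipeline; `validate_audio_request` is a hypothetical stand-in, and the specific rules are assumptions.

```python
def validate_audio_request(prompt, audio_start_in_s, audio_end_in_s,
                           prompt_embeds=None):
    """Raise ValueError for input combinations this sketch rejects."""
    if audio_end_in_s < audio_start_in_s:
        raise ValueError(
            f"audio_end_in_s ({audio_end_in_s}) must not be smaller than "
            f"audio_start_in_s ({audio_start_in_s})."
        )
    if prompt is not None and prompt_embeds is not None:
        raise ValueError("Pass either prompt or prompt_embeds, not both.")
    if prompt is None and prompt_embeds is None:
        raise ValueError("Provide prompt or prompt_embeds.")
    if prompt is not None and not isinstance(prompt, (str, list)):
        raise ValueError(f"prompt must be str or list, got {type(prompt).__name__}.")


validate_audio_request("rain on a tin roof", 0.0, 10.0)  # passes silently

try:
    validate_audio_request("rain on a tin roof", 5.0, 1.0)
except ValueError as err:
    print(f"rejected: {err}")
```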

encode_duration

encode_duration(
    audio_start_in_s: float,
    audio_end_in_s: float,
    device: device,
    do_classifier_free_guidance: bool,
    batch_size: int,
) -> tuple[Tensor, Tensor]

Encode audio duration to conditioning tensors.

encode_prompt

encode_prompt(
    prompt: str | list[str],
    device: device,
    do_classifier_free_guidance: bool,
    negative_prompt: str | list[str] | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    attention_mask: Tensor | None = None,
    negative_attention_mask: Tensor | None = None,
) -> Tensor

Encode text prompt to embeddings.

forward

forward(
    req: OmniDiffusionRequest,
    prompt: str | list[str] | None = None,
    negative_prompt: str | list[str] | None = None,
    audio_end_in_s: float | None = None,
    audio_start_in_s: float = 0.0,
    guidance_scale: float = 7.0,
    num_waveforms_per_prompt: int = 1,
    generator: Generator | list[Generator] | None = None,
    latents: Tensor | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    output_type: str = "np",
) -> DiffusionOutput

Generate audio from text prompt.

Parameters:

req (OmniDiffusionRequest, required): OmniDiffusionRequest containing generation parameters

prompt (str | list[str] | None, default None): Text prompt for audio generation

negative_prompt (str | list[str] | None, default None): Negative prompt for CFG

audio_end_in_s (float | None, default None): Audio end time in seconds (max ~47s for stable-audio-open-1.0)

audio_start_in_s (float, default 0.0): Audio start time in seconds

guidance_scale (float, default 7.0): CFG scale

num_waveforms_per_prompt (int, default 1): Number of audio outputs per prompt

generator (Generator | list[Generator] | None, default None): Random generator for reproducibility

latents (Tensor | None, default None): Pre-generated latents

prompt_embeds (Tensor | None, default None): Pre-computed prompt embeddings

negative_prompt_embeds (Tensor | None, default None): Pre-computed negative prompt embeddings

output_type (str, default 'np'): Output format ("np", "pt", or "latent")

Returns:

DiffusionOutput: DiffusionOutput containing generated audio
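The "max ~47s" limit on `audio_end_in_s` follows from the latent length, the VAE compression factor, and the 44100 Hz sample rate. The arithmetic below is a sketch under assumed values: `SAMPLE_SIZE = 1024` and `DOWNSAMPLE_RATIO = 2048` are assumptions for stable-audio-open-1.0 and are not stated on this page, and both helper functions are hypothetical.

```python
SAMPLE_RATE = 44100      # audio_sample_rate class attribute
SAMPLE_SIZE = 1024       # assumed latent sequence length
DOWNSAMPLE_RATIO = 2048  # assumed audio samples per latent frame


def max_audio_seconds(sample_size=SAMPLE_SIZE, ratio=DOWNSAMPLE_RATIO,
                      rate=SAMPLE_RATE):
    """Longest clip that fits in the latent window, in seconds."""
    return sample_size * ratio / rate


def waveform_slice(audio_start_in_s, audio_end_in_s, rate=SAMPLE_RATE):
    """Convert start/end times in seconds to sample indices."""
    return int(audio_start_in_s * rate), int(audio_end_in_s * rate)


print(round(max_audio_seconds(), 2))  # 47.55
print(waveform_slice(0.0, 10.0))      # (0, 441000)
```

Under these assumptions, a request for 10 seconds still denoises the full latent window and only the first 441000 samples of the decoded waveform would be kept.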

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights using AutoWeightsLoader for vLLM integration.

prepare_latents

prepare_latents(
    batch_size: int,
    num_channels_vae: int,
    sample_size: int,
    dtype: dtype,
    device: device,
    generator: Generator | list[Generator] | None,
    latents: Tensor | None = None,
) -> Tensor

Prepare initial latent noise.

get_stable_audio_post_process_func

get_stable_audio_post_process_func(
    od_config: OmniDiffusionConfig,
)

Create post-processing function for Stable Audio output.

Converts raw audio tensor to numpy array for saving.