Skip to content

Stable Audio Open Usage Guide

This guide provides instructions for running Stable Audio Open text-to-audio generation using vLLM-Omni.

Supported Models

  • stabilityai/stable-audio-open-1.0: Text-to-Audio generation (44.1kHz, up to ~47s audio)

Installing vLLM-Omni

uv venv
source .venv/bin/activate
uv pip install vllm==0.14.1
uv pip install git+https://github.com/vllm-project/vllm-omni.git

For audio file saving, install one of these packages:

uv pip install soundfile  # Recommended
# or
uv pip install scipy

The CLI examples below are from the vLLM-Omni repo. If you want to run them directly, clone that repo and run the scripts from its examples/offline_inference directory.

Text-to-Audio Generation

Basic Usage

import torch
import soundfile as sf
from vllm_omni.entrypoints.omni import Omni

omni = Omni(model="stabilityai/stable-audio-open-1.0")

generator = torch.Generator(device="cuda").manual_seed(42)

audio = omni.generate(
    "The sound of a dog barking",
    negative_prompt="Low quality.",
    generator=generator,
    guidance_scale=7.0,
    num_inference_steps=100,
    extra={
        "audio_start_in_s": 0.0,
        "audio_end_in_s": 10.0,
    },
)

# Save audio output
audio_data = audio[0].cpu().float().numpy().T  # [samples, channels]
sf.write("output.wav", audio_data, 44100)

CLI Usage

python examples/offline_inference/text_to_audio/text_to_audio.py \
  --model stabilityai/stable-audio-open-1.0 \
  --prompt "The sound of a dog barking" \
  --audio-length 10.0 \
  --num-inference-steps 100 \
  --guidance-scale 7.0 \
  --output dog_barking.wav

More Examples

# Generate a piano melody
python examples/offline_inference/text_to_audio/text_to_audio.py \
  --prompt "A piano playing a gentle melody" \
  --audio-length 15.0 \
  --output piano_melody.wav

# Generate ambient sounds with negative prompt
python examples/offline_inference/text_to_audio/text_to_audio.py \
  --prompt "Thunder and rain sounds" \
  --negative-prompt "Low quality, distorted" \
  --audio-length 20.0 \
  --output thunder_rain.wav

# Generate multiple waveforms
python examples/offline_inference/text_to_audio/text_to_audio.py \
  --prompt "A bird singing in the forest" \
  --num-waveforms 3 \
  --output bird_singing.wav

Key Parameters

Parameter Default Description
audio_start_in_s 0.0 Audio start time in seconds
audio_end_in_s 10.0 Audio end time in seconds
audio_length 10.0 Audio duration (CLI convenience, sets end time)
num_inference_steps 100 Number of denoising steps
guidance_scale 7.0 Classifier-free guidance scale
negative_prompt "Low quality." Text describing unwanted audio characteristics
num_waveforms 1 Number of audio samples to generate per prompt
sample_rate 44100 Output sample rate in Hz
seed 42 Random seed for reproducibility

Notes

  • Maximum audio length: 47 seconds for stable-audio-open-1.0.
  • Output format: Stereo audio at 44.1kHz sample rate.
  • Inference steps: Higher num_inference_steps produces better quality but takes longer. The diffusers default is 200; vLLM-Omni example uses 100 for faster generation.
  • Negative prompts: Use to guide the model away from undesirable characteristics (e.g., "Low quality, distorted").
  • Model size: Approximately 1.2 billion parameters.

Limitations

  • No realistic vocals: The model cannot generate realistic singing or speech.
  • English only: Trained on English descriptions; performance degrades with other languages.
  • Sound effects over music: Better at generating sound effects than complex music.
  • Prompt engineering: May require experimentation with prompts for optimal results.

License

Stable Audio Open is released under the Stability AI Community License. Commercial use requires a separate license from Stability AI.

Additional Resources