Stable Audio Online Serving¶
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/stable_audio.
Generate audio from text prompts using Stable Audio models via an OpenAI-compatible API endpoint.
Features¶
- OpenAI-compatible API: Use the `/v1/audio/generate` endpoint
- Flexible control: Adjust audio length, guidance scale, and inference steps
- Quality control: Use negative prompts to avoid unwanted characteristics
- Reproducible: Set random seed for deterministic generation
Quick Start¶
1. Start the Server¶
vllm-omni serve stabilityai/stable-audio-open-1.0 \
--host 0.0.0.0 \
--port 8091 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--enforce-eager \
--omni
2. Generate Audio¶
Using curl¶
curl -X POST http://localhost:8091/v1/audio/generate \
-H "Content-Type: application/json" \
-d '{
"input": "The sound of a cat purring",
"audio_length": 10.0
}' --output cat.wav
Using Python Client¶
python stable_audio_client.py \
--text "The sound of a cat purring" \
--audio_length 10.0 \
--output cat.wav
Using Bash Script¶
Run the bundled curl_examples.sh script (listed in full under Example materials):
bash curl_examples.sh
API Reference¶
Endpoint¶
POST /v1/audio/generate
Request Body¶
{
"input": "Text description of the audio",
"audio_length": 10.0,
"audio_start": 0.0,
"negative_prompt": "Low quality",
"guidance_scale": 7.0,
"num_inference_steps": 100,
"seed": 42,
"response_format": "wav"
}
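Only `input` is required, so a client can leave out any field it wants the server to default. The following is a minimal sketch of that idea, assuming the server treats an omitted field the same as its documented default. `build_request_body` is a hypothetical helper written for illustration, not part of vllm-omni.

```python
from typing import Any, Dict, Optional


def build_request_body(
    prompt: str,
    audio_length: Optional[float] = None,
    audio_start: Optional[float] = None,
    negative_prompt: Optional[str] = None,
    guidance_scale: Optional[float] = None,
    num_inference_steps: Optional[int] = None,
    seed: Optional[int] = None,
    response_format: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a JSON-serialisable body, omitting every field left as None."""
    if not prompt.strip():
        raise ValueError("'input' must be a non-empty prompt")
    candidates = {
        "input": prompt,
        "audio_length": audio_length,
        "audio_start": audio_start,
        "negative_prompt": negative_prompt,
        "guidance_scale": guidance_scale,
        "num_inference_steps": num_inference_steps,
        "seed": seed,
        "response_format": response_format,
    }
    # Compare against None rather than truthiness so that values such as
    # seed=0 or audio_start=0.0 are still sent.
    return {key: value for key, value in candidates.items() if value is not None}


if __name__ == "__main__":
    import json

    print(json.dumps(build_request_body("A dog barking", audio_length=5.0, seed=0)))
```

Filtering on `None` instead of truthiness matters for `seed`, where `0` is a perfectly valid value.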
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string | required | Text prompt describing the audio to generate |
| audio_length | float | ~47s | Audio duration in seconds (max ~47s for stable-audio-open-1.0) |
| audio_start | float | 0.0 | Audio start time in seconds |
| negative_prompt | string | null | Text describing what to avoid in generation |
| guidance_scale | float | 7.0 | Classifier-free guidance scale (higher = more adherence to prompt) |
| num_inference_steps | int | 50 | Number of denoising steps (higher = better quality, slower) |
| seed | int | null | Random seed for reproducibility |
| response_format | string | "wav" | Output format: wav, mp3, flac, pcm |
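A client can check a request against the table before sending it. This sketch takes the ~47 second ceiling and the four output formats from the table; the other range checks (positive length, non-negative start, at least one step) are assumptions made for illustration, and the server may validate differently. `find_problems` is a hypothetical name.

```python
from typing import Any, Dict, List

MAX_AUDIO_SECONDS = 47.0  # approximate limit quoted for stable-audio-open-1.0
ALLOWED_FORMATS = {"wav", "mp3", "flac", "pcm"}


def find_problems(body: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the body looks fine."""
    problems: List[str] = []

    prompt = body.get("input")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("input: required, must be a non-empty string")

    length = body.get("audio_length")
    if length is not None and not (0 < length <= MAX_AUDIO_SECONDS):
        problems.append(f"audio_length: expected 0 < value <= {MAX_AUDIO_SECONDS}")

    start = body.get("audio_start")
    if start is not None and start < 0:
        problems.append("audio_start: must not be negative")  # assumed rule

    steps = body.get("num_inference_steps")
    if steps is not None and (not isinstance(steps, int) or steps < 1):
        problems.append("num_inference_steps: must be a positive integer")  # assumed rule

    fmt = body.get("response_format")
    if fmt is not None and fmt not in ALLOWED_FORMATS:
        problems.append(f"response_format: expected one of {sorted(ALLOWED_FORMATS)}")

    return problems


if __name__ == "__main__":
    for line in find_problems({"input": "Rain", "audio_length": 60.0, "response_format": "ogg"}):
        print(line)
```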
Response¶
Returns audio data in the requested format (default: WAV).
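Because `curl --output` writes whatever body the server returns, a failed request can leave an error message inside a file named `.wav`. A quick look at the leading bytes can catch this. The sketch below is not part of the project; it relies on the standard magic numbers for WAV, FLAC, and MP3, and it assumes error bodies are JSON, which the source does not confirm.

```python
def sniff_audio_format(data: bytes) -> str:
    """Guess the container from the leading bytes of a response body."""
    # WAV: RIFF chunk header with the form type "WAVE" at offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # FLAC streams start with the ASCII marker "fLaC".
    if data[:4] == b"fLaC":
        return "flac"
    # MP3: either an ID3v2 tag or an MPEG frame sync (11 set bits).
    if data[:3] == b"ID3":
        return "mp3"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"
    # Assumption: a body that starts like JSON is an error payload.
    if data.lstrip()[:1] in (b"{", b"["):
        return "json"
    # Raw PCM has no header, so it cannot be told apart from other data here.
    return "unknown"


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(sniff_audio_format(handle.read(16)))
```

Saved as, say, `sniff.py`, it could be run as `python sniff.py cat.wav` after one of the curl examples.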
Usage Examples¶
Basic Generation¶
curl -X POST http://localhost:8091/v1/audio/generate \
-H "Content-Type: application/json" \
-d '{
"input": "The sound of ocean waves"
}' --output ocean.wav
Custom Duration¶
curl -X POST http://localhost:8091/v1/audio/generate \
-H "Content-Type: application/json" \
-d '{
"input": "A dog barking",
"audio_length": 5.0
}' --output dog_5s.wav
High Quality with Negative Prompt¶
curl -X POST http://localhost:8091/v1/audio/generate \
-H "Content-Type: application/json" \
-d '{
"input": "A piano playing a gentle melody",
"audio_length": 10.0,
"negative_prompt": "Low quality, distorted, noisy",
"guidance_scale": 8.0,
"num_inference_steps": 150
}' --output piano_hq.wav
Reproducible Generation¶
curl -X POST http://localhost:8091/v1/audio/generate \
-H "Content-Type: application/json" \
-d '{
"input": "Thunder and rain sounds",
"audio_length": 15.0,
"seed": 42
}' --output thunder.wav
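One way to confirm that a seed behaves deterministically on a given setup is to send the same body twice and compare hashes of the results. This is a sketch only: `generate` is a caller-supplied function (hypothetical here) that posts the body and returns the raw audio bytes. Byte-identical output may not hold across different GPUs, drivers, or library versions, so a mismatch is not necessarily a bug.

```python
import hashlib
from typing import Any, Callable, Dict


def audio_digest(data: bytes) -> str:
    """SHA-256 of the raw response bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_reproducible(
    generate: Callable[[Dict[str, Any]], bytes],
    body: Dict[str, Any],
    runs: int = 2,
) -> bool:
    """Call generate(body) several times and report whether all outputs match."""
    if body.get("seed") is None:
        raise ValueError("set 'seed' in the body before checking reproducibility")
    digests = {audio_digest(generate(dict(body))) for _ in range(runs)}
    return len(digests) == 1


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without a server.
    def fake_generate(body: Dict[str, Any]) -> bytes:
        return f"{body['input']}|{body['seed']}".encode()

    print(is_reproducible(fake_generate, {"input": "Thunder and rain sounds", "seed": 42}))
```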
Quick Generation (Fewer Steps)¶
For faster generation with slightly lower quality:
curl -X POST http://localhost:8091/v1/audio/generate \
-H "Content-Type: application/json" \
-d '{
"input": "Birds chirping in a forest",
"audio_length": 8.0,
"num_inference_steps": 50
}' --output birds_quick.wav
Python Client Examples¶
Simple Generation¶
python stable_audio_client.py --text "The sound of a cat purring"
Custom Parameters¶
python stable_audio_client.py \
--text "Thunder and rain" \
--audio_length 15.0 \
--negative_prompt "Low quality" \
--guidance_scale 7.0 \
--num_inference_steps 100 \
--seed 42 \
--output thunder.wav
Different Output Format¶
python stable_audio_client.py \
--text "Guitar playing" \
--response_format mp3 \
--output guitar.mp3
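In the client script, `--response_format` and `--output` are independent options, so it is possible to request MP3 data and save it under a `.wav` name. A small helper can derive the format from the file extension instead. `response_format_for` is a hypothetical addition, not something the script provides.

```python
from pathlib import Path

# Formats listed in the parameter table for /v1/audio/generate.
SUPPORTED_FORMATS = ("wav", "mp3", "flac", "pcm")


def response_format_for(path: str, default: str = "wav") -> str:
    """Pick response_format from the output file's extension."""
    extension = Path(path).suffix.lower().lstrip(".")
    # Fall back to the default for missing or unrecognised extensions.
    return extension if extension in SUPPORTED_FORMATS else default


if __name__ == "__main__":
    for name in ("guitar.mp3", "take.FLAC", "notes.txt"):
        print(name, "->", response_format_for(name))
```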
Tips¶
- Audio Length: Keep under 47 seconds for `stable-audio-open-1.0`
- Quality vs Speed:
    - 50 steps: Fast, decent quality
    - 100 steps: Good balance (default)
    - 150+ steps: High quality, slower
- Guidance Scale:
    - Lower (3-5): More creative/varied
    - Default (7): Good balance
    - Higher (10+): More literal to prompt
- Negative Prompts: Use to avoid "Low quality", "distorted", "noisy", etc.
- Seeds: Use the same seed for reproducible results
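The tips above can be captured as named presets so callers do not have to remember the numbers. The step counts come straight from the list; the guidance values 4.0 and 10.0 are picks from within the "3-5" and "10+" ranges, chosen for illustration. `suggest_params` is a hypothetical helper.

```python
from typing import Any, Dict

STEP_PRESETS = {"fast": 50, "balanced": 100, "high": 150}
# 4.0 and 10.0 are illustrative picks from the "3-5" and "10+" ranges.
GUIDANCE_PRESETS = {"creative": 4.0, "balanced": 7.0, "literal": 10.0}


def suggest_params(quality: str = "balanced", adherence: str = "balanced") -> Dict[str, Any]:
    """Translate two named presets into request-body fields."""
    if quality not in STEP_PRESETS:
        raise ValueError(f"quality must be one of {sorted(STEP_PRESETS)}")
    if adherence not in GUIDANCE_PRESETS:
        raise ValueError(f"adherence must be one of {sorted(GUIDANCE_PRESETS)}")
    return {
        "num_inference_steps": STEP_PRESETS[quality],
        "guidance_scale": GUIDANCE_PRESETS[adherence],
    }


if __name__ == "__main__":
    body = {"input": "Birds chirping in a forest", **suggest_params("fast", "creative")}
    print(body)
```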
Performance¶
| Inference Steps | Quality | Speed | Use Case |
|---|---|---|---|
| 50 | Good | Fast | Quick previews |
| 100 (default) | Very Good | Medium | Production |
| 150+ | Excellent | Slow | Final/critical audio |
Troubleshooting¶
Server not responding¶
- Check if the server is running: `curl http://localhost:8091/health`
- Check the server logs for errors
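Model loading can take a while after `vllm-omni serve` starts, so a script that launches the server and then sends requests may want to poll the health endpoint first. The following is a standard-library sketch. It assumes `/health` answers with HTTP 200 once the server is ready; the attempt count and delay are arbitrary.

```python
import time
import urllib.request
from typing import Callable


def probe_health(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on the URL succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and socket timeouts
        # all derive from OSError.
        return False


def wait_for_server(
    url: str = "http://localhost:8091/health",
    attempts: int = 30,
    delay: float = 1.0,
    probe: Callable[[str], bool] = probe_health,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    print("ready" if wait_for_server(attempts=3) else "server not reachable")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a network or real waiting.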
Audio quality issues¶
- Increase `num_inference_steps` (e.g., 150)
- Add negative prompts: "Low quality, distorted, noisy"
- Increase `guidance_scale` for more prompt adherence
Generation timeout¶
- Reduce `num_inference_steps`
- Reduce `audio_length`
- Check GPU memory with `nvidia-smi`
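The first suggestion can be automated: retry with fewer steps whenever a request times out. In this sketch, `send` is a caller-supplied function (hypothetical) that posts the body and raises the built-in `TimeoutError` on timeout; with `requests`, that would mean catching `requests.exceptions.Timeout` and re-raising. The fallback of 100 steps when the body has none and the floor of 25 steps are both assumptions.

```python
from typing import Any, Callable, Dict, Tuple


def generate_with_step_backoff(
    send: Callable[[Dict[str, Any]], bytes],
    body: Dict[str, Any],
    min_steps: int = 25,  # assumed floor, not a documented limit
) -> Tuple[bytes, int]:
    """Halve num_inference_steps after each timeout; return (audio, steps used)."""
    steps = body.get("num_inference_steps", 100)  # assumed starting point
    while True:
        attempt = {**body, "num_inference_steps": steps}
        try:
            return send(attempt), steps
        except TimeoutError:
            if steps <= min_steps:
                raise  # already at the floor, give up
            steps = max(min_steps, steps // 2)


if __name__ == "__main__":
    # Stand-in sender that only succeeds at 50 steps or fewer.
    def fake_send(body: Dict[str, Any]) -> bytes:
        if body["num_inference_steps"] > 50:
            raise TimeoutError
        return b"audio"

    _, used = generate_with_step_backoff(fake_send, {"input": "Rain", "num_inference_steps": 150})
    print("succeeded with", used, "steps")
```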
Wrong audio length¶
- Ensure `audio_length` is within model limits (~47s max)
- Adjust `audio_start` if trimming is needed
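To see what length the server actually produced, the duration of a WAV response can be read from its header with Python's built-in `wave` module. This sketch assumes the response is an integer-PCM WAV file. `wave` raises `wave.Error` for encodings it does not support, such as floating-point samples, and whether the server emits such files is not stated in the source.

```python
import io
import wave


def wav_duration_seconds(data: bytes) -> float:
    """Duration of in-memory WAV data: frame count divided by sample rate."""
    with wave.open(io.BytesIO(data), "rb") as reader:
        return reader.getnframes() / reader.getframerate()


def make_silent_wav(seconds: float, sample_rate: int = 8000) -> bytes:
    """Build a mono 16-bit WAV of silence, used here only as test input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # bytes per sample
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


if __name__ == "__main__":
    print(wav_duration_seconds(make_silent_wav(2.0)))
```

For a saved file, `wav_duration_seconds(open("dog_5s.wav", "rb").read())` should come out close to the requested `audio_length`.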
See Also¶
Example materials¶
curl_examples.sh
#!/bin/bash
# Examples for using Stable Audio with curl via /v1/audio/generate endpoint
# Example 1: Simple request with default parameters
echo "Example 1: Simple request with default parameters"
curl -X POST http://localhost:8091/v1/audio/generate \
-H "Content-Type: application/json" \
-d '{
"input": "The sound of an audience clapping and cheering in a stadium"
}' --output stadium.wav
# Example 2: Request with custom audio_length
echo "Example 2: Custom audio length (5 seconds)"
curl -X POST http://localhost:8091/v1/audio/generate \
-H "Content-Type: application/json" \
-d '{
"input": "The sound of a dog barking",
"audio_length": 5.0
}' --output dog_5s.wav
# Example 3: Request with negative prompt for quality control
echo "Example 3: With negative prompt"
curl -X POST http://localhost:8091/v1/audio/generate \
-H "Content-Type: application/json" \
-d '{
"input": "A piano playing a gentle melody",
"audio_length": 10.0,
"negative_prompt": "Low quality, distorted, noisy"
}' --output piano.wav
# Example 4: Full control with all parameters
echo "Example 4: Full control (custom length, guidance, steps, seed)"
curl -X POST http://localhost:8091/v1/audio/generate \
-H "Content-Type: application/json" \
-d '{
"input": "Thunder and rain sounds",
"audio_length": 15.0,
"negative_prompt": "Low quality",
"guidance_scale": 7.0,
"num_inference_steps": 100,
"seed": 42
}' --output thunder_rain.wav
# Example 5: Quick generation with fewer steps (faster but lower quality)
echo "Example 5: Quick generation (fewer steps)"
curl -X POST http://localhost:8091/v1/audio/generate \
-H "Content-Type: application/json" \
-d '{
"input": "Ocean waves crashing on a beach",
"audio_length": 8.0,
"num_inference_steps": 50
}' --output ocean.wav
echo "All examples completed!"
stable_audio_client.py
#!/usr/bin/env python3
"""
OpenAI-compatible client for Stable Audio via /v1/audio/generate endpoint.

This script demonstrates how to use the OpenAI-compatible speech API
to generate audio from text using Stable Audio models.

Examples:
    # Simple generation
    python stable_audio_client.py --text "The sound of a cat purring"

    # With custom duration
    python stable_audio_client.py --text "A dog barking" --audio_length 5.0

    # With all parameters
    python stable_audio_client.py --text "Thunder and rain" \
        --audio_length 15.0 \
        --negative_prompt "Low quality" \
        --guidance_scale 7.0 \
        --num_inference_steps 100 \
        --seed 42 \
        --output thunder.wav
"""

import argparse
import sys

import requests


def parse_args():
    parser = argparse.ArgumentParser(description="Generate audio with Stable Audio via OpenAI-compatible API")
    parser.add_argument(
        "--api_url",
        default="http://localhost:8091/v1/audio/generate",
        help="API endpoint URL",
    )
    parser.add_argument(
        "--text",
        default="The sound of a cat purring",
        help="Text prompt for audio generation",
    )
    parser.add_argument(
        "--audio_length",
        type=float,
        default=10.0,
        help="Audio length in seconds (max ~47s for stable-audio-open-1.0)",
    )
    parser.add_argument(
        "--audio_start",
        type=float,
        default=0.0,
        help="Audio start time in seconds",
    )
    parser.add_argument(
        "--negative_prompt",
        default="Low quality",
        help="Negative prompt for classifier-free guidance",
    )
    parser.add_argument(
        "--guidance_scale",
        type=float,
        default=7.0,
        help="Guidance scale for diffusion (higher = more adherence to prompt)",
    )
    parser.add_argument(
        "--num_inference_steps",
        type=int,
        default=100,
        help="Number of inference steps (higher = better quality, slower)",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=None,
        help="Random seed for reproducibility",
    )
    parser.add_argument(
        "--output",
        default="stable_audio_output.wav",
        help="Output file path",
    )
    parser.add_argument(
        "--response_format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm"],
        help="Audio output format",
    )
    return parser.parse_args()


def generate_audio(args):
    """Generate audio using the API."""
    # Build request payload
    payload = {
        "input": args.text,
        "audio_length": args.audio_length,
        "audio_start": args.audio_start,
        "response_format": args.response_format,
    }

    # Add optional parameters. Numeric options are compared against None so
    # that an explicit 0 is still sent instead of being silently dropped.
    if args.negative_prompt:
        payload["negative_prompt"] = args.negative_prompt
    if args.guidance_scale is not None:
        payload["guidance_scale"] = args.guidance_scale
    if args.num_inference_steps is not None:
        payload["num_inference_steps"] = args.num_inference_steps
    if args.seed is not None:
        payload["seed"] = args.seed

    print(f"\n{'=' * 60}")
    print("Stable Audio - Text-to-Audio Generation")
    print(f"{'=' * 60}")
    print(f"API URL: {args.api_url}")
    print(f"Prompt: {args.text}")
    print(f"Audio length: {args.audio_length}s")
    print(f"Negative prompt: {args.negative_prompt}")
    print(f"Guidance scale: {args.guidance_scale}")
    print(f"Inference steps: {args.num_inference_steps}")
    if args.seed is not None:
        print(f"Seed: {args.seed}")
    print(f"Output: {args.output}")
    print(f"{'=' * 60}\n")

    try:
        # Make the API request
        print("Generating audio...")
        response = requests.post(
            args.api_url,
            json=payload,
            headers={"Content-Type": "application/json"},
            timeout=300,  # 5 minute timeout for long generations
        )

        # Check for errors
        if response.status_code != 200:
            print(f"Error: API returned status code {response.status_code}")
            print(f"Response: {response.text}")
            return False

        # Save the audio
        with open(args.output, "wb") as f:
            f.write(response.content)
        print(f"✓ Audio saved to {args.output}")
        print(f"  File size: {len(response.content) / 1024:.1f} KB")
        return True

    except requests.exceptions.Timeout:
        print("Error: Request timed out. Try reducing inference steps or audio length.")
        return False
    except requests.exceptions.ConnectionError:
        print(f"Error: Could not connect to {args.api_url}")
        print("Make sure the server is running.")
        return False
    except Exception as e:
        print(f"Error: {e}")
        return False


def main():
    args = parse_args()
    success = generate_audio(args)
    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()