
Stable Audio Online Serving

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/stable_audio.

Generate audio from text prompts using Stable Audio models via an OpenAI-compatible API endpoint.

Features

  • OpenAI-compatible API: Use the /v1/audio/generate endpoint
  • Flexible control: Adjust audio length, guidance scale, and inference steps
  • Quality control: Use negative prompts to avoid unwanted characteristics
  • Reproducible: Set a random seed for deterministic generation

Quick Start

1. Start the Server

vllm-omni serve stabilityai/stable-audio-open-1.0 \
    --host 0.0.0.0 \
    --port 8091 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --omni

2. Generate Audio

Using curl

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "The sound of a cat purring",
        "audio_length": 10.0
    }' --output cat.wav

Using Python Client

python stable_audio_client.py \
    --text "The sound of a cat purring" \
    --audio_length 10.0 \
    --output cat.wav

Using Bash Script

bash curl_examples.sh

API Reference

Endpoint

POST /v1/audio/generate

Request Body

{
    "input": "Text description of the audio",
    "audio_length": 10.0,
    "audio_start": 0.0,
    "negative_prompt": "Low quality",
    "guidance_scale": 7.0,
    "num_inference_steps": 100,
    "seed": 42,
    "response_format": "wav"
}

Parameters

Parameter            Type    Default   Description
input                string  required  Text prompt describing the audio to generate
audio_length         float   ~47s      Audio duration in seconds (max ~47s for stable-audio-open-1.0)
audio_start          float   0.0       Audio start time in seconds
negative_prompt      string  null      Text describing what to avoid in generation
guidance_scale       float   7.0       Classifier-free guidance scale (higher = more adherence to prompt)
num_inference_steps  int     50        Number of denoising steps (higher = better quality, slower)
seed                 int     null      Random seed for reproducibility
response_format      string  "wav"     Output format: wav, mp3, flac, pcm
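
As a minimal sketch of how a client might assemble this request body, the helper below includes only the fields the caller sets, so the server applies its own defaults for everything else. The function name `build_payload`, the constants, and the client-side range check are illustrative assumptions, not part of vLLM-Omni; the 47-second limit is the approximate figure from the table above.

```python
# Hypothetical client-side helper; not part of the vLLM-Omni API.
MAX_AUDIO_LENGTH_S = 47.0  # approximate limit quoted for stable-audio-open-1.0
ALLOWED_FORMATS = {"wav", "mp3", "flac", "pcm"}


def build_payload(text, audio_length=None, audio_start=None,
                  negative_prompt=None, guidance_scale=None,
                  num_inference_steps=None, seed=None, response_format=None):
    """Return a request body containing only the fields that were set."""
    if not text:
        raise ValueError("'input' is required and must be non-empty")
    if audio_length is not None and not 0 < audio_length <= MAX_AUDIO_LENGTH_S:
        raise ValueError(f"audio_length must be in (0, {MAX_AUDIO_LENGTH_S}]")
    if response_format is not None and response_format not in ALLOWED_FORMATS:
        raise ValueError(f"response_format must be one of {sorted(ALLOWED_FORMATS)}")

    optional = {
        "audio_length": audio_length,
        "audio_start": audio_start,
        "negative_prompt": negative_prompt,
        "guidance_scale": guidance_scale,
        "num_inference_steps": num_inference_steps,
        "seed": seed,
        "response_format": response_format,
    }
    payload = {"input": text}
    # Compare against None so that legitimate zero values (seed=0) are kept.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

For example, `build_payload("A dog barking", audio_length=5.0, seed=0)` yields a body with just `input`, `audio_length`, and `seed`.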

Response

Returns audio data in the requested format (default: WAV).

Usage Examples

Basic Generation

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "The sound of ocean waves"
    }' --output ocean.wav

Custom Duration

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "A dog barking",
        "audio_length": 5.0
    }' --output dog_5s.wav

High Quality with Negative Prompt

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "A piano playing a gentle melody",
        "audio_length": 10.0,
        "negative_prompt": "Low quality, distorted, noisy",
        "guidance_scale": 8.0,
        "num_inference_steps": 150
    }' --output piano_hq.wav

Reproducible Generation

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Thunder and rain sounds",
        "audio_length": 15.0,
        "seed": 42
    }' --output thunder.wav
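
One way to check that a fixed seed really gives repeatable output on a given setup is to send the same request twice and compare digests of the returned bytes. The sketch below assumes a caller-supplied `generate` callable (hypothetical) that posts a payload to the endpoint and returns the response body as bytes. Whether the output is byte-identical across runs is not stated here and may depend on hardware and server configuration, so treat a mismatch as something to investigate rather than as a definite bug.

```python
import hashlib


def is_reproducible(generate, payload):
    """Call generate(payload) twice and compare SHA-256 digests of the results.

    `generate` is a hypothetical callable: payload dict -> audio bytes.
    """
    first = hashlib.sha256(generate(payload)).hexdigest()
    second = hashlib.sha256(generate(payload)).hexdigest()
    return first == second
```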

Quick Generation (Fewer Steps)

For faster generation with slightly lower quality:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Birds chirping in a forest",
        "audio_length": 8.0,
        "num_inference_steps": 50
    }' --output birds_quick.wav

Python Client Examples

Simple Generation

python stable_audio_client.py \
    --text "The sound of a cat purring"

Custom Parameters

python stable_audio_client.py \
    --text "Thunder and rain" \
    --audio_length 15.0 \
    --negative_prompt "Low quality" \
    --guidance_scale 7.0 \
    --num_inference_steps 100 \
    --seed 42 \
    --output thunder.wav

Different Output Format

python stable_audio_client.py \
    --text "Guitar playing" \
    --response_format mp3 \
    --output guitar.mp3

Tips

  1. Audio Length: Keep under 47 seconds for stable-audio-open-1.0
  2. Quality vs Speed:
       • 50 steps: Fast, decent quality
       • 100 steps: Good balance (default in stable_audio_client.py; the Parameters table lists 50 as the API default)
       • 150+ steps: High quality, slower
  3. Guidance Scale:
       • Lower (3-5): More creative/varied
       • Default (7): Good balance
       • Higher (10+): More literal to prompt
  4. Negative Prompts: Use to avoid "Low quality", "distorted", "noisy", etc.
  5. Seeds: Use the same seed for reproducible results

Performance

Inference Steps        Quality    Speed   Use Case
50                     Good       Fast    Quick previews
100 (client default)   Very Good  Medium  Production
150+                   Excellent  Slow    Final/critical audio
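
The table can be captured as a small lookup so that scripts name a use case instead of hard-coding a step count. This is only a sketch: the preset names and the `with_preset` helper are hypothetical, and the step counts are the ones listed above, with 150 standing in for "150+".

```python
# Step counts come from the table above; the preset names are illustrative.
STEP_PRESETS = {
    "preview": 50,
    "production": 100,
    "final": 150,  # the table says "150+", so this is a lower bound
}


def with_preset(payload, use_case):
    """Return a copy of payload with num_inference_steps set for the use case."""
    if use_case not in STEP_PRESETS:
        raise ValueError(f"unknown use case {use_case!r}; expected one of {sorted(STEP_PRESETS)}")
    return {**payload, "num_inference_steps": STEP_PRESETS[use_case]}
```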

Troubleshooting

Server not responding

  • Check if server is running: curl http://localhost:8091/health
  • Check server logs for errors
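
Because model loading can take a while after the server starts, a script may want to poll the health endpoint before sending requests. The following is a sketch using only the standard library. It assumes that an HTTP 200 from `/health` means the server is ready, and the function names, retry count, and delay are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def check_health(url="http://localhost:8091/health", timeout=5.0):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_ready(check=check_health, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Passing `check` and `sleep` as parameters keeps the retry loop easy to test without a running server.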

Audio quality issues

  • Increase num_inference_steps (e.g., 150)
  • Add negative prompts: "Low quality, distorted, noisy"
  • Increase guidance_scale for more prompt adherence

Generation timeout

  • Reduce num_inference_steps
  • Reduce audio_length
  • Check GPU memory with nvidia-smi
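
The first suggestion can be automated by retrying with progressively fewer steps when a request times out. In the sketch below, `generate` is a hypothetical callable that sends a payload and returns audio bytes, raising the built-in `TimeoutError` on timeout. A wrapper built on `requests` would need to translate `requests.exceptions.Timeout` into that exception. The step schedule is an illustrative choice.

```python
def generate_with_fallback(generate, payload, step_schedule=(100, 50)):
    """Try each step count in order; return (audio_bytes, steps_used).

    `generate` is a hypothetical callable: payload dict -> bytes,
    expected to raise TimeoutError when the request times out.
    """
    last_error = None
    for steps in step_schedule:
        # Copy the payload so the caller's dict is not modified.
        attempt = dict(payload, num_inference_steps=steps)
        try:
            return generate(attempt), steps
        except TimeoutError as exc:
            last_error = exc
    raise TimeoutError("all attempts timed out") from last_error
```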

Wrong audio length

  • Ensure audio_length is within model limits (~47s max)
  • Adjust audio_start if trimming is needed
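
To confirm what the server actually returned, the duration of a WAV response can be read from its header with Python's standard `wave` module. This sketch assumes the response is an integer-PCM WAV file. `wave` may reject other encodings such as floating-point WAV, and it does not apply to `mp3`, `flac`, or raw `pcm` responses. The helper names and the tolerance are illustrative.

```python
import io
import wave


def wav_duration_seconds(data):
    """Return the duration of PCM WAV bytes, computed as frames / sample rate."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def length_matches(data, requested_seconds, tolerance=0.1):
    """Return True if the WAV duration is within `tolerance` seconds of the request."""
    return abs(wav_duration_seconds(data) - requested_seconds) <= tolerance
```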

See Also

Example materials

curl_examples.sh
#!/bin/bash
# Examples for using Stable Audio with curl via /v1/audio/generate endpoint

# Example 1: Simple request with default parameters
echo "Example 1: Simple request with default parameters"
curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "The sound of an audience clapping and cheering in a stadium"
    }' --output stadium.wav

# Example 2: Request with custom audio_length
echo "Example 2: Custom audio length (5 seconds)"
curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "The sound of a dog barking",
        "audio_length": 5.0
    }' --output dog_5s.wav

# Example 3: Request with negative prompt for quality control
echo "Example 3: With negative prompt"
curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "A piano playing a gentle melody",
        "audio_length": 10.0,
        "negative_prompt": "Low quality, distorted, noisy"
    }' --output piano.wav

# Example 4: Full control with all parameters
echo "Example 4: Full control (custom length, guidance, steps, seed)"
curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Thunder and rain sounds",
        "audio_length": 15.0,
        "negative_prompt": "Low quality",
        "guidance_scale": 7.0,
        "num_inference_steps": 100,
        "seed": 42
    }' --output thunder_rain.wav

# Example 5: Quick generation with fewer steps (faster but lower quality)
echo "Example 5: Quick generation (fewer steps)"
curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Ocean waves crashing on a beach",
        "audio_length": 8.0,
        "num_inference_steps": 50
    }' --output ocean.wav

echo "All examples completed!"
stable_audio_client.py
#!/usr/bin/env python3
"""
OpenAI-compatible client for Stable Audio via /v1/audio/generate endpoint.

This script demonstrates how to use the OpenAI-compatible speech API
to generate audio from text using Stable Audio models.

Examples:
    # Simple generation
    python stable_audio_client.py --text "The sound of a cat purring"

    # With custom duration
    python stable_audio_client.py --text "A dog barking" --audio_length 5.0

    # With all parameters
    python stable_audio_client.py --text "Thunder and rain" \
        --audio_length 15.0 \
        --negative_prompt "Low quality" \
        --guidance_scale 7.0 \
        --num_inference_steps 100 \
        --seed 42 \
        --output thunder.wav
"""

import argparse
import sys

import requests


def parse_args():
    parser = argparse.ArgumentParser(description="Generate audio with Stable Audio via OpenAI-compatible API")
    parser.add_argument(
        "--api_url",
        default="http://localhost:8091/v1/audio/generate",
        help="API endpoint URL",
    )
    parser.add_argument(
        "--text",
        default="The sound of a cat purring",
        help="Text prompt for audio generation",
    )
    parser.add_argument(
        "--audio_length",
        type=float,
        default=10.0,
        help="Audio length in seconds (max ~47s for stable-audio-open-1.0)",
    )
    parser.add_argument(
        "--audio_start",
        type=float,
        default=0.0,
        help="Audio start time in seconds",
    )
    parser.add_argument(
        "--negative_prompt",
        default="Low quality",
        help="Negative prompt for classifier-free guidance",
    )
    parser.add_argument(
        "--guidance_scale",
        type=float,
        default=7.0,
        help="Guidance scale for diffusion (higher = more adherence to prompt)",
    )
    parser.add_argument(
        "--num_inference_steps",
        type=int,
        default=100,
        help="Number of inference steps (higher = better quality, slower)",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=None,
        help="Random seed for reproducibility",
    )
    parser.add_argument(
        "--output",
        default="stable_audio_output.wav",
        help="Output file path",
    )
    parser.add_argument(
        "--response_format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm"],
        help="Audio output format",
    )
    return parser.parse_args()


def generate_audio(args):
    """Generate audio using the API."""

    # Build request payload
    payload = {
        "input": args.text,
        "audio_length": args.audio_length,
        "audio_start": args.audio_start,
        "response_format": args.response_format,
    }

    # Add optional parameters
    if args.negative_prompt:
        payload["negative_prompt"] = args.negative_prompt
    # Compare against None so an explicit zero value is still sent
    if args.guidance_scale is not None:
        payload["guidance_scale"] = args.guidance_scale
    if args.num_inference_steps is not None:
        payload["num_inference_steps"] = args.num_inference_steps
    if args.seed is not None:
        payload["seed"] = args.seed

    print(f"\n{'=' * 60}")
    print("Stable Audio - Text-to-Audio Generation")
    print(f"{'=' * 60}")
    print(f"API URL: {args.api_url}")
    print(f"Prompt: {args.text}")
    print(f"Audio length: {args.audio_length}s")
    print(f"Negative prompt: {args.negative_prompt}")
    print(f"Guidance scale: {args.guidance_scale}")
    print(f"Inference steps: {args.num_inference_steps}")
    if args.seed is not None:
        print(f"Seed: {args.seed}")
    print(f"Output: {args.output}")
    print(f"{'=' * 60}\n")

    try:
        # Make the API request
        print("Generating audio...")
        response = requests.post(
            args.api_url,
            json=payload,
            headers={"Content-Type": "application/json"},
            timeout=300,  # 5 minute timeout for long generations
        )

        # Check for errors
        if response.status_code != 200:
            print(f"Error: API returned status code {response.status_code}")
            print(f"Response: {response.text}")
            return False

        # Save the audio
        with open(args.output, "wb") as f:
            f.write(response.content)

        print(f"✓ Audio saved to {args.output}")
        print(f"  File size: {len(response.content) / 1024:.1f} KB")
        return True

    except requests.exceptions.Timeout:
        print("Error: Request timed out. Try reducing inference steps or audio length.")
        return False
    except requests.exceptions.ConnectionError:
        print(f"Error: Could not connect to {args.api_url}")
        print("Make sure the server is running.")
        return False
    except Exception as e:
        print(f"Error: {e}")
        return False


def main():
    args = parse_args()
    success = generate_audio(args)
    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()