Audio Generate API

vLLM-Omni provides an API for text-to-audio generation using diffusion-based models such as Stable Audio.

Unlike the Speech API, which targets text-to-speech synthesis, the Audio Generate API is designed for general-purpose audio generation from text descriptions (sound effects, music, ambient soundscapes, etc.).

Each server instance runs a single model (specified at startup via vllm-omni serve <model> --omni).

Quick Start

Start the Server

vllm-omni serve stabilityai/stable-audio-open-1.0 \
    --host 0.0.0.0 \
    --port 8091 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --omni

Generate Audio

Using curl:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "The sound of a cat purring",
        "audio_length": 10.0
    }' --output cat.wav

Using Python:

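import httpx

# Diffusion can take a while, so allow a generous timeout.
response = httpx.post(
    "http://localhost:8091/v1/audio/generate",
    json={
        "input": "The sound of a cat purring",
        "audio_length": 10.0,
    },
    timeout=300.0,
)
# Fail on 4xx/5xx instead of saving a JSON error body as audio.
response.raise_for_status()

# The response body is the raw audio (WAV by default).
with open("cat.wav", "wb") as f:
    f.write(response.content)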

API Reference

Endpoint

POST /v1/audio/generate
Content-Type: application/json

Request Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string | required | Text prompt describing the audio to generate |
| model | string | server's model | Model to use (optional, should match server if specified) |
| response_format | string | "wav" | Audio format: wav, mp3, flac, pcm, aac, opus |
| speed | float | 1.0 | Playback speed (0.25 - 4.0) |
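
Checking these two fields on the client can catch a mistake before a request is sent. The sketch below is illustrative only: `check_request` is a hypothetical helper, and treating the speed bounds as inclusive is an assumption, since the table does not say.

```python
# Allowed values are taken from the Request Parameters table above.
ALLOWED_FORMATS = {"wav", "mp3", "flac", "pcm", "aac", "opus"}
SPEED_RANGE = (0.25, 4.0)  # assumed inclusive at both ends


def check_request(response_format: str = "wav", speed: float = 1.0) -> list[str]:
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    if response_format not in ALLOWED_FORMATS:
        problems.append(f"unsupported response_format: {response_format!r}")
    low, high = SPEED_RANGE
    if not low <= speed <= high:
        problems.append(f"speed {speed} is outside {low}-{high}")
    return problems


print(check_request("wav", 1.0))  # no problems
print(check_request("ogg", 5.0))  # one problem per bad field
```

The server remains the authority on validation; a check like this only shortens the feedback loop.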

Diffusion Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| audio_length | float | null | Audio duration in seconds (default value is the max ~47s for stable-audio-open-1.0) |
| audio_start | float | 0.0 | Audio start time in seconds |
| negative_prompt | string | null | Text describing what to avoid in generation |
| guidance_scale | float | model default | Classifier-free guidance scale (higher = more adherence to prompt) |
| num_inference_steps | int | model default | Number of denoising steps (higher = better quality, slower) |
| seed | int | null | Random seed for reproducible generation |
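
Because every diffusion parameter is optional, one way to build a request body is to include only the fields that were set, so that the server or model defaults apply to the rest. The helper below is a minimal sketch; `build_payload` is a hypothetical name, not part of vLLM-Omni.

```python
from typing import Any, Optional


def build_payload(
    prompt: str,
    audio_length: Optional[float] = None,
    audio_start: Optional[float] = None,
    negative_prompt: Optional[str] = None,
    guidance_scale: Optional[float] = None,
    num_inference_steps: Optional[int] = None,
    seed: Optional[int] = None,
) -> dict[str, Any]:
    """Build a JSON body, leaving out unset fields so defaults apply."""
    optional = {
        "audio_length": audio_length,
        "audio_start": audio_start,
        "negative_prompt": negative_prompt,
        "guidance_scale": guidance_scale,
        "num_inference_steps": num_inference_steps,
        "seed": seed,
    }
    payload: dict[str, Any] = {"input": prompt}
    # Keep only the parameters the caller actually provided.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


print(build_payload("A dog barking", audio_length=5.0, seed=42))
# {'input': 'A dog barking', 'audio_length': 5.0, 'seed': 42}
```

The returned dictionary can be passed as the `json=` argument of an HTTP client call.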

Response Format

Returns binary audio data with the appropriate Content-Type header:

| response_format | Content-Type |
|---|---|
| wav | audio/wav |
| mp3 | audio/mpeg |
| flac | audio/flac |
| pcm | audio/pcm |
| aac | audio/aac |
| opus | audio/opus |
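
A client can use the Content-Type header to choose a file extension, and to notice when the body is not audio at all (error responses are JSON). The following is a sketch under those assumptions; `filename_for` is a hypothetical helper.

```python
# Mapping copied from the Response Format table above.
CONTENT_TYPES = {
    "wav": "audio/wav",
    "mp3": "audio/mpeg",
    "flac": "audio/flac",
    "pcm": "audio/pcm",
    "aac": "audio/aac",
    "opus": "audio/opus",
}
EXTENSIONS = {ctype: fmt for fmt, ctype in CONTENT_TYPES.items()}


def filename_for(content_type: str, stem: str = "output") -> str:
    """Pick a file name from a Content-Type header, ignoring any parameters."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in EXTENSIONS:
        raise ValueError(f"unexpected Content-Type: {content_type!r}")
    return f"{stem}.{EXTENSIONS[media_type]}"


print(filename_for("audio/mpeg", stem="piano"))  # piano.mp3
```

With an HTTP client, the header value would typically come from something like `response.headers["content-type"]`.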

Examples

Basic Generation

Generate audio with only a text prompt (model defaults for all other parameters):

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "The sound of ocean waves crashing on a beach"
    }' --output ocean.wav

Custom Duration

Specify an explicit audio length in seconds:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "A dog barking",
        "audio_length": 5.0
    }' --output dog_5s.wav

High Quality with Negative Prompt

Use a negative prompt to steer generation away from undesired characteristics, and increase inference steps for higher quality:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "A piano playing a gentle melody",
        "audio_length": 10.0,
        "negative_prompt": "Low quality, distorted, noisy",
        "guidance_scale": 8.0,
        "num_inference_steps": 150
    }' --output piano_hq.wav

Reproducible Generation

Set a seed to get deterministic results across runs:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Thunder and rain sounds",
        "audio_length": 15.0,
        "seed": 42
    }' --output thunder.wav
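
One way to check reproducibility is to send the same request twice and compare fingerprints of the two responses. The sketch below assumes identical requests yield byte-identical output, which may not hold across different GPUs, library versions, or lossy formats. Both helper names are hypothetical.

```python
import hashlib


def audio_digest(audio_bytes: bytes) -> str:
    """Return a SHA-256 hex digest that fingerprints a generated clip."""
    return hashlib.sha256(audio_bytes).hexdigest()


def is_reproduced(first: bytes, second: bytes) -> bool:
    """True when two responses are byte-for-byte identical."""
    return audio_digest(first) == audio_digest(second)


# Stand-in byte strings; in practice these would be two response bodies.
print(is_reproduced(b"same clip", b"same clip"))   # True
print(is_reproduced(b"same clip", b"other clip"))  # False
```

If the digests differ even with a fixed seed, comparing decoded waveforms within a tolerance may be a more forgiving test.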

Full Control

Combine multiple diffusion parameters for precise control over generation:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Thunder and rain sounds",
        "audio_length": 15.0,
        "negative_prompt": "Low quality",
        "guidance_scale": 7.0,
        "num_inference_steps": 100,
        "seed": 42
    }' --output thunder_rain.wav

Quick Generation (Fewer Steps)

Reduce the number of inference steps for faster generation at slightly lower quality:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Birds chirping in a forest",
        "audio_length": 8.0,
        "num_inference_steps": 50
    }' --output birds_quick.wav

Python Client

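import httpx

# Request body using the full set of diffusion parameters.
response = httpx.post(
    "http://localhost:8091/v1/audio/generate",
    json={
        "input": "Thunder and rain",
        "audio_length": 15.0,
        "negative_prompt": "Low quality",
        "guidance_scale": 7.0,
        "num_inference_steps": 100,
        "seed": 42,
        "response_format": "wav",
    },
    timeout=300.0,  # generation can take minutes at high step counts
)
# Fail on 4xx/5xx instead of saving a JSON error body as audio.
response.raise_for_status()

with open("thunder.wav", "wb") as f:
    f.write(response.content)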

Parameter Tuning Guide

guidance_scale

Controls how closely the generated audio follows the text prompt.

| Range | Behaviour |
|---|---|
| 3 - 5 | More creative / varied output |
| 7 (default) | Balanced adherence |
| 10+ | Strict adherence to the prompt |
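
To hear the effect of the scale in isolation, it can help to hold the prompt and seed fixed and vary only guidance_scale. The sketch below just builds the request bodies; `guidance_sweep` is a hypothetical helper and the chosen scales are examples drawn from the ranges above.

```python
def guidance_sweep(prompt: str, scales: list[float], seed: int = 42) -> list[dict]:
    """One request body per guidance_scale, sharing a seed so only the scale varies."""
    return [
        {"input": prompt, "guidance_scale": scale, "seed": seed}
        for scale in scales
    ]


for body in guidance_sweep("Thunder and rain sounds", [4.0, 7.0, 10.0]):
    print(body)
```

Each body can then be posted to /v1/audio/generate and the resulting clips compared by ear.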

num_inference_steps

Controls the number of denoising steps in the diffusion process.

| Steps | Quality | Speed | Use Case |
|---|---|---|---|
| 50 | Good | Fast | Quick previews |
| 100 | Very Good | Medium | General purpose |
| 150+ | Excellent | Slow | Final / critical audio |

audio_length

Duration of the generated audio clip, in seconds. For stable-audio-open-1.0, the maximum is approximately 47 seconds. If omitted, the model uses its default length, which for stable-audio-open-1.0 is that maximum.
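
This page does not say how the server handles a request longer than the model's limit, so a cautious client might clamp the value itself. The sketch below assumes that approach; `clamp_audio_length` and `MAX_AUDIO_SECONDS` are hypothetical names, and the 47-second figure is the approximate limit quoted above.

```python
from typing import Optional

# Approximate limit for stable-audio-open-1.0; other models may differ.
MAX_AUDIO_SECONDS = 47.0


def clamp_audio_length(requested: Optional[float]) -> Optional[float]:
    """Cap a requested duration at MAX_AUDIO_SECONDS; None keeps the model default."""
    if requested is None:
        return None
    if requested <= 0:
        raise ValueError("audio_length must be positive")
    return min(requested, MAX_AUDIO_SECONDS)


print(clamp_audio_length(60.0))  # 47.0
print(clamp_audio_length(5.0))   # 5.0
```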

negative_prompt

Describes characteristics to avoid. Common negative prompts include:

  • "Low quality, distorted, noisy"
  • "Silence, static"
  • "Music" (when generating sound effects only)

Supported Models

| Model | Description |
|---|---|
| stabilityai/stable-audio-open-1.0 | Open-source audio generation model, up to ~47 seconds, 44.1 kHz stereo |

Error Responses

400 Bad Request

The request could not be fulfilled, for example because of invalid parameters or because the model returned no audio:

{
    "error": {
        "message": "Audio generation model did not produce audio output.",
        "type": "BadRequestError",
        "param": null,
        "code": 400
    }
}

404 Not Found

Model mismatch:

{
    "error": {
        "message": "The model `xxx` does not exist.",
        "type": "NotFoundError",
        "param": "model",
        "code": 404
    }
}

422 Unprocessable Entity

Pydantic validation failure (e.g. invalid response_format, speed out of range):

{
    "detail": [
        {
            "type": "literal_error",
            "msg": "Input should be 'wav', 'pcm', 'flac', 'mp3', 'aac' or 'opus'",
            ...
        }
    ]
}
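
The error bodies above come in two shapes: an "error" object for 400 and 404, and a "detail" list for 422. A client can normalize both into one message. The following is a sketch based only on the examples shown here; `describe_error` is a hypothetical helper, and real responses may carry extra fields.

```python
import json


def describe_error(status_code: int, body: str) -> str:
    """Summarize an error response in either of the two documented shapes."""
    fallback = f"HTTP {status_code}: {body[:200]}"
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    if isinstance(data.get("error"), dict):  # 400 / 404 shape
        return f"HTTP {status_code}: {data['error'].get('message', 'unknown error')}"
    if isinstance(data.get("detail"), list):  # 422 validation shape
        messages = [
            item.get("msg", "invalid value")
            for item in data["detail"]
            if isinstance(item, dict)
        ]
        return f"HTTP {status_code}: " + "; ".join(messages)
    return fallback


not_found = json.dumps({"error": {"message": "The model `xxx` does not exist.", "code": 404}})
print(describe_error(404, not_found))
# HTTP 404: The model `xxx` does not exist.
```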

Troubleshooting

"Audio generation model did not produce audio output"

The model finished but returned no audio data. Verify the server started successfully and the model loaded without errors.

Server Not Responding

# Check if the server is healthy
curl http://localhost:8091/health
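
Model loading can take a while after startup, so a script may prefer to poll the health endpoint rather than check it once. The sketch below uses only the standard library and assumes an HTTP 200 from /health means the server is ready. `probe` and `wait_until_healthy` are hypothetical helpers, and the timeout values are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8091/health",
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    check: Callable[[str], bool] = probe,
) -> bool:
    """Poll `url` until `check` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check(url):
            return True
        time.sleep(interval_s)
    return False
```

Calling `wait_until_healthy()` before the first generation request returns True once the server answers, or False if it never does within the timeout. The `check` parameter exists so the polling loop can be exercised without a running server.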

Audio Quality Issues

  • Increase num_inference_steps (e.g. 150).
  • Add a negative prompt: "Low quality, distorted, noisy".
  • Increase guidance_scale for stronger prompt adherence.

Generation Timeout

  • Reduce num_inference_steps.
  • Reduce audio_length.
  • Check GPU memory with nvidia-smi.

Out of Memory

  • Lower --gpu-memory-utilization (e.g. 0.8).
  • Reduce audio_length.

Development

Enable debug logging:

vllm-omni serve stabilityai/stable-audio-open-1.0 \
    --host 0.0.0.0 \
    --port 8091 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --omni \
    --uvicorn-log-level debug