Audio Generate API¶

vLLM-Omni provides an API for text-to-audio generation using diffusion-based models such as Stable Audio.

Unlike the Speech API which targets text-to-speech synthesis, the Audio Generate API is designed for general-purpose audio generation from text descriptions (sound effects, music, ambient soundscapes, etc.).

Each server instance runs a single model (specified at startup via vllm-omni serve <model> --omni).

Quick Start¶

Start the Server¶

vllm-omni serve stabilityai/stable-audio-open-1.0 \
    --host 0.0.0.0 \
    --port 8091 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --omni

Generate Audio¶

Using curl:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "The sound of a cat purring",
        "audio_length": 10.0
    }' --output cat.wav

Using Python:

import httpx

response = httpx.post(
    "http://localhost:8091/v1/audio/generate",
    json={
        "input": "The sound of a cat purring",
        "audio_length": 10.0,
    },
    timeout=300.0,
)

with open("cat.wav", "wb") as f:
    f.write(response.content)

API Reference¶

Endpoint¶

POST /v1/audio/generate
Content-Type: application/json

Request Parameters¶

Parameter	Type	Default	Description
`input`	string	required	Text prompt describing the audio to generate
`model`	string	server's model	Model to use (optional, should match server if specified)
`response_format`	string	"wav"	Audio format: wav, mp3, flac, pcm, aac, opus
`speed`	float	1.0	Playback speed (0.25 - 4.0)

Diffusion Parameters¶

Parameter	Type	Default	Description
`audio_length`	float	null	Audio duration in seconds (default value is the max ~47s for `stable-audio-open-1.0`)
`audio_start`	float	0.0	Audio start time in seconds
`negative_prompt`	string	null	Text describing what to avoid in generation
`guidance_scale`	float	model default	Classifier-free guidance scale (higher = more adherence to prompt)
`num_inference_steps`	int	model default	Number of denoising steps (higher = better quality, slower)
`seed`	int	null	Random seed for reproducible generation

Response Format¶

Returns binary audio data with the appropriate Content-Type header:

`response_format`	Content-Type
`wav`	`audio/wav`
`mp3`	`audio/mpeg`
`flac`	`audio/flac`
`pcm`	`audio/pcm`
`aac`	`audio/aac`
`opus`	`audio/opus`

Examples¶

Basic Generation¶

Generate audio with only a text prompt (model defaults for all other parameters):

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "The sound of ocean waves crashing on a beach"
    }' --output ocean.wav

Custom Duration¶

Specify an explicit audio length in seconds:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "A dog barking",
        "audio_length": 5.0
    }' --output dog_5s.wav

High Quality with Negative Prompt¶

Use a negative prompt to steer generation away from undesired characteristics, and increase inference steps for higher quality:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "A piano playing a gentle melody",
        "audio_length": 10.0,
        "negative_prompt": "Low quality, distorted, noisy",
        "guidance_scale": 8.0,
        "num_inference_steps": 150
    }' --output piano_hq.wav

Reproducible Generation¶

Set a seed to get deterministic results across runs:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Thunder and rain sounds",
        "audio_length": 15.0,
        "seed": 42
    }' --output thunder.wav

Full Control¶

Combine all parameters for precise control over generation:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Thunder and rain sounds",
        "audio_length": 15.0,
        "negative_prompt": "Low quality",
        "guidance_scale": 7.0,
        "num_inference_steps": 100,
        "seed": 42
    }' --output thunder_rain.wav

Quick Generation (Fewer Steps)¶

For faster generation with slightly lower quality:

curl -X POST http://localhost:8091/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Birds chirping in a forest",
        "audio_length": 8.0,
        "num_inference_steps": 50
    }' --output birds_quick.wav

Python Client¶

import httpx

response = httpx.post(
    "http://localhost:8091/v1/audio/generate",
    json={
        "input": "Thunder and rain",
        "audio_length": 15.0,
        "negative_prompt": "Low quality",
        "guidance_scale": 7.0,
        "num_inference_steps": 100,
        "seed": 42,
        "response_format": "wav",
    },
    timeout=300.0,
)

with open("thunder.wav", "wb") as f:
    f.write(response.content)

Parameter Tuning Guide¶

`guidance_scale`¶

Controls how closely the generated audio follows the text prompt.

Range	Behaviour
3 - 5	More creative / varied output
7 (default)	Balanced adherence
10+	Strict adherence to the prompt

`num_inference_steps`¶

Controls the number of denoising steps in the diffusion process.

Steps	Quality	Speed	Use Case
50	Good	Fast	Quick previews
100	Very Good	Medium	General purpose
150+	Excellent	Slow	Final / critical audio

`audio_length`¶

Duration of the generated audio clip. For stable-audio-open-1.0, the maximum is approximately 47 seconds. If omitted, the model uses its own default length.

`negative_prompt`¶

Describes characteristics to avoid. Common negative prompts include:

"Low quality, distorted, noisy"
"Silence, static"
"Music" (when generating sound effects only)

Supported Models¶

Model	Description
`stabilityai/stable-audio-open-1.0`	Open-source audio generation model, up to ~47 seconds, 44.1 kHz stereo

Error Responses¶

400 Bad Request¶

Invalid or missing parameters:

{
    "error": {
        "message": "Audio generation model did not produce audio output.",
        "type": "BadRequestError",
        "param": null,
        "code": 400
    }
}

404 Not Found¶

Model mismatch:

{
    "error": {
        "message": "The model `xxx` does not exist.",
        "type": "NotFoundError",
        "param": "model",
        "code": 404
    }
}

422 Unprocessable Entity¶

Pydantic validation failure (e.g. invalid response_format, speed out of range):

{
    "detail": [
        {
            "type": "literal_error",
            "msg": "Input should be 'wav', 'pcm', 'flac', 'mp3', 'aac' or 'opus'",
            ...
        }
    ]
}

Troubleshooting¶

"Audio generation model did not produce audio output"¶

The model finished but returned no audio data. Verify the server started successfully and the model loaded without errors.

Server Not Responding¶

# Check if the server is healthy
curl http://localhost:8091/health

Audio Quality Issues¶

Increase num_inference_steps (e.g. 150).
Add a negative prompt: "Low quality, distorted, noisy".
Increase guidance_scale for stronger prompt adherence.

Generation Timeout¶

Reduce num_inference_steps.
Reduce audio_length.
Check GPU memory with nvidia-smi.

Out of Memory¶

Lower --gpu-memory-utilization (e.g. 0.8).
Reduce audio_length.

Development¶

Enable debug logging:

vllm-omni serve stabilityai/stable-audio-open-1.0 \
    --host 0.0.0.0 \
    --port 8091 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --omni \
    --uvicorn-log-level debug