Speech API¶
vLLM-Omni provides an OpenAI-compatible API for text-to-speech (TTS) generation. Supported TTS models include:
- Qwen3-TTS (
Qwen/Qwen3-TTS-12Hz-*) -- Qwen3-based TTS with CustomVoice, VoiceDesign, and Base (voice cloning) task types. Output: 24 kHz. - Fish Speech S2 Pro (
fishaudio/s2-pro) -- Dual-AR TTS with DAC codec. Supports text-to-speech and voice cloning via reference audio. Output: 44.1 kHz. - Voxtral TTS (
mistralai/Voxtral-4B-TTS-2603) -- AR + FlowMatching TTS with preset voices. Output: 24 kHz. - CosyVoice3 (
FunAudioLLM/Fun-CosyVoice3-0.5B-2512) -- 2-stage talker + flow-matching code2wav. Voice cloning viaref_audio+ref_text(no presets). Output: 24 kHz.
See the Supported Models section below for the full list, including OmniVoice, VoxCPM2, and MOSS-TTS-Nano.
Each server instance runs a single model (specified at startup via vllm serve <model> --omni).
Quick Start¶
Start the Server¶
# Qwen3-TTS: CustomVoice model (predefined speakers)
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
--deploy-config vllm_omni/deploy/qwen3_tts.yaml \
--omni \
--port 8091 \
--trust-remote-code \
--enforce-eager
# Fish Speech S2 Pro
vllm serve fishaudio/s2-pro --omni --port 8091
# Voxtral TTS
vllm serve mistralai/Voxtral-4B-TTS-2603 --omni --port 8091
# CosyVoice3 (voice cloning only — supply ref_audio + ref_text per request)
vllm serve FunAudioLLM/Fun-CosyVoice3-0.5B-2512 \
--omni --port 8091 --trust-remote-code
Generate Speech¶
Using curl:
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Hello, how are you?",
"voice": "vivian",
"language": "English"
}' --output output.wav
Using Python:
import httpx
response = httpx.post(
"http://localhost:8091/v1/audio/speech",
json={
"input": "Hello, how are you?",
"voice": "vivian",
"language": "English",
},
timeout=300.0,
)
with open("output.wav", "wb") as f:
f.write(response.content)
Using OpenAI SDK:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
response = client.audio.speech.create(
model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
voice="vivian",
input="Hello, how are you?",
)
response.stream_to_file("output.wav")
API Reference¶
Endpoint¶
Request Parameters¶
OpenAI Standard Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
input | string | required | The text to synthesize into speech |
model | string | server's model | Model to use (optional, should match server if specified) |
voice | string | "vivian" | Speaker name (e.g., vivian, ryan, aiden) |
response_format | string | "wav" | Audio format: wav, mp3, flac, pcm, aac, opus |
speed | float | 1.0 | Playback speed (0.25-4.0) |
vLLM-Omni Extension Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
task_type | string | "CustomVoice" | TTS task type: CustomVoice, VoiceDesign, or Base |
language | string | "Auto" | Language (see supported languages below) |
instructions | string | "" | Voice style/emotion instructions |
max_new_tokens | integer | 2048 | Maximum tokens to generate |
initial_codec_chunk_frames | integer | null | Per-request initial chunk size override for TTFA tuning. When null, IC is computed dynamically based on server load. |
stream | bool | false | Stream raw PCM chunks as they are decoded (requires response_format="pcm") |
Supported languages: Auto, Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Voice Clone Parameters (Base task)¶
| Parameter | Type | Default | Description |
|---|---|---|---|
ref_audio | string | null | Reference audio (HTTP URL, base64 data URL, or file:// URI with --allowed-local-media-path) |
ref_text | string | null | Transcript of reference audio |
x_vector_only_mode | bool | null | Use speaker embedding only (no ICL) |
Response Format¶
Returns binary audio data with appropriate Content-Type header (e.g., audio/wav).
Voices Endpoint¶
Lists available voices for the loaded model.
{
"voices": ["aiden", "dylan", "eric", "ono_anna", "ryan", "serena", "sohee", "uncle_fu", "vivian", "custom_voice_1"],
"uploaded_voices": [
{
"name": "custom_voice_1",
"consent": "user_consent_id",
"created_at": 1738660000,
"file_size": 1024000,
"mime_type": "audio/wav",
"ref_text": "The exact transcript of the audio sample.",
"speaker_description": "warm narrator"
}
]
}
uploaded_voices is always present (empty list when no custom voices have been uploaded). Fields ref_text and speaker_description are omitted per-entry when not provided at upload time.
Upload a new voice sample for voice cloning in Base task TTS requests.
Form Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
audio_sample | file | Yes | Audio file (max 10MB, supported formats: wav, mp3, flac, ogg, aac, webm, mp4) |
consent | string | Yes | Consent recording ID |
name | string | Yes | Name for the new voice |
ref_text | string | No | Transcript of the audio. When provided, enables in-context voice cloning (higher quality). Without it, only the speaker embedding is extracted. |
speaker_description | string | No | Free-form description of the voice (e.g. "warm narrator", "energetic presenter"). Stored as metadata and returned in GET /v1/audio/voices. |
Response Example:
{
"success": true,
"voice": {
"name": "custom_voice_1",
"consent": "user_consent_id",
"created_at": 1738660000,
"mime_type": "audio/wav",
"file_size": 1024000,
"ref_text": "The exact transcript of the audio sample.",
"speaker_description": "warm narrator"
}
}
Fields ref_text and speaker_description are omitted when not provided at upload time.
Usage Example:
curl -X POST http://localhost:8091/v1/audio/voices \
-F "audio_sample=@/path/to/voice_sample.wav" \
-F "consent=user_consent_id" \
-F "name=custom_voice_1" \
-F "ref_text=The exact transcript of the audio sample." \
-F "speaker_description=warm narrator"
Streaming Text Input (WebSocket)¶
The /v1/audio/speech/stream WebSocket endpoint accepts text incrementally and generates audio per sentence as boundaries are detected.
Note: text input is always streamed incrementally. Audio output remains sentence-scoped: use
stream_audio=falsefor one binary frame per sentence, orstream_audio=truefor one or more PCM chunks per sentence.
WebSocket Protocol¶
Client -> Server:
| Message | Description |
|---|---|
{"type": "session.config", ...} | Session configuration (sent once, first message) |
{"type": "input.text", "text": "..."} | Text chunk |
{"type": "input.done"} | End of input, flushes remaining buffer |
Server -> Client:
| Message | Description |
|---|---|
{"type": "audio.start", "sentence_index": 0, "sentence_text": "...", "format": "pcm", "sample_rate": 24000} | Audio generation starting for a sentence |
| Binary frame | Raw audio bytes (one or more PCM chunks when stream_audio=true) |
{"type": "audio.done", "sentence_index": 0, "total_bytes": 96000, "error": false} | Audio complete for a sentence |
{"type": "session.done", "total_sentences": N} | Session complete |
{"type": "error", "message": "..."} | Non-fatal error |
Session Config Parameters¶
All REST API parameters are supported, plus:
| Parameter | Type | Default | Description |
|---|---|---|---|
stream_audio | bool | false | Stream one or more PCM chunks per sentence over WebSocket |
split_granularity | string | "sentence" | Text splitting granularity |
Delete an uploaded voice sample.
Path Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Name of the voice to delete |
Response Example:
Error Response (404 Not Found):
Usage Example:
Examples¶
CustomVoice with Style Instruction¶
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "I am so excited!",
"voice": "vivian",
"instructions": "Speak with great enthusiasm"
}' --output excited.wav
VoiceDesign (Natural Language Voice Description)¶
# Start server with VoiceDesign model first
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
--deploy-config vllm_omni/deploy/qwen3_tts.yaml \
--omni \
--port 8091 \
--trust-remote-code \
--enforce-eager
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Hello world",
"task_type": "VoiceDesign",
"instructions": "A warm, friendly female voice with a gentle tone"
}' --output designed.wav
Base (Voice Cloning)¶
# Start server with Base model first
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \
--deploy-config vllm_omni/deploy/qwen3_tts.yaml \
--omni \
--port 8091 \
--trust-remote-code \
--enforce-eager
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Hello, this is a cloned voice",
"task_type": "Base",
"ref_audio": "https://example.com/reference.wav",
"ref_text": "Original transcript of the reference audio"
}' --output cloned.wav
Upload Voice¶
Upload voice (speaker embedding only):
curl -X POST http://localhost:8091/v1/audio/voices \
-F "audio_sample=@/path/to/voice_sample.wav" \
-F "consent=user_consent_id" \
-F "name=custom_voice_1"
Upload voice with transcript (in-context cloning, higher quality):
curl -X POST http://localhost:8091/v1/audio/voices \
-F "audio_sample=@/path/to/voice_sample.wav" \
-F "consent=user_consent_id" \
-F "name=custom_voice_2" \
-F "ref_text=The exact transcript of the audio sample."
Use Uploaded Voice¶
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Hello, this is a cloned voice",
"voice": "custom_voice_1"
}' --output cloned.wav
Voice Storage & Caching¶
Uploaded voices are persisted to disk as a single .safetensors file per voice (audio samples + metadata — name, consent, ref_text, sample_rate, created_at — in the file header). On server restart the directory is scanned and all previously uploaded voices are restored automatically, so uploads survive process restarts.
Uploading an existing name overwrites the previous entry (a warning is logged).
Feature extraction artifacts (ref_code, speaker_embedding, DAC codes, etc.) are cached in-process with a shared LRU so repeated requests with the same voice=... skip the extraction pipeline. The cache is a true singleton across all TTS model types; deleting a voice invalidates every model-type slot at once.
Precomputed Custom Voices¶
Qwen3-TTS Base and VoxCPM2 can load offline-precomputed voices at startup. Generate a directory containing custom_voice_manifest.json plus one .safetensors file per voice, then set the pipeline-wide deploy config field:
Qwen3-TTS profiles are created with:
python examples/online_serving/text_to_speech/qwen3_tts/precompute_custom_voice.py \
--model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
--voice-name alice \
--ref-audio /path/to/reference.wav \
--ref-text "Original transcript of the reference audio" \
--mode icl \
--output-dir /path/to/custom_voices
VoxCPM2 profiles are created with:
python examples/online_serving/text_to_speech/voxcpm2/precompute_custom_voice.py \
--model openbmb/VoxCPM2 \
--voice-name alice \
--ref-audio /path/to/reference.wav \
--mode ref_continuation \
--prompt-text "Original transcript of the reference audio" \
--output-dir /path/to/custom_voices
Only profiles whose safetensors payload can be loaded and validated are exposed by GET /v1/audio/voices. Valid precomputed voices can be used in POST /v1/audio/speech by passing voice="alice" without ref_audio.
Configuration (environment variables):
| Variable | Default | Description |
|---|---|---|
SPEAKER_SAMPLES_DIR | ~/.cache/vllm-omni/speakers | Directory for persisted uploaded speakers (.safetensors files). |
SPEAKER_MAX_UPLOADED | 1000 | Maximum number of uploaded speakers kept on disk. Upload requests past the cap return 400. |
The in-memory LRU has a fixed 512 MiB byte budget.
Batch Speech Generation¶
The batch endpoint synthesizes multiple texts in a single request, returning all results as JSON with base64-encoded audio.
Endpoint¶
Request Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
items | array | required | List of items to synthesize (1–32) |
model | string | server's model | Model to use |
voice | string | null | Default voice for all items |
response_format | string | "wav" | Default audio format for all items |
speed | float | 1.0 | Default playback speed (0.25–4.0) |
task_type | string | null | Default TTS task type |
language | string | null | Default language |
instructions | string | null | Default voice style instructions |
ref_audio | string | null | Default reference audio (Base task) |
ref_text | string | null | Default reference transcript (Base task) |
max_new_tokens | integer | null | Default max tokens |
Each item in the items array requires only input (the text). All other fields are optional and override the batch-level defaults when set:
| Field | Type | Description |
|---|---|---|
input | string | required — text to synthesize |
voice | string | Override voice for this item |
response_format | string | Override format for this item |
speed | float | Override speed for this item |
task_type | string | Override task type |
language | string | Override language |
instructions | string | Override instructions |
ref_audio | string | Override reference audio |
ref_text | string | Override reference transcript |
max_new_tokens | integer | Override max tokens |
Response Format¶
{
"id": "speech-batch-abc123",
"results": [
{
"index": 0,
"status": "success",
"audio_data": "<base64-encoded audio>",
"media_type": "audio/wav"
},
{
"index": 1,
"status": "error",
"error": "Input text cannot be empty"
}
],
"total": 2,
"succeeded": 1,
"failed": 1
}
Examples¶
Basic batch with shared defaults:
curl -X POST http://localhost:8091/v1/audio/speech/batch \
-H "Content-Type: application/json" \
-d '{
"items": [
{"input": "Hello, how are you?"},
{"input": "Goodbye, see you later!"}
],
"voice": "vivian",
"language": "English"
}'
Per-item overrides (different voices and formats):
curl -X POST http://localhost:8091/v1/audio/speech/batch \
-H "Content-Type: application/json" \
-d '{
"items": [
{"input": "Hello!", "voice": "vivian", "response_format": "mp3"},
{"input": "你好!", "voice": "ryan", "language": "Chinese"}
],
"response_format": "wav"
}'
Voice cloning with shared reference audio (Base task):
curl -X POST http://localhost:8091/v1/audio/speech/batch \
-H "Content-Type: application/json" \
-d '{
"items": [
{"input": "First sentence in the cloned voice."},
{"input": "Second sentence in the cloned voice."}
],
"task_type": "Base",
"ref_audio": "https://example.com/reference.wav",
"ref_text": "Transcript of the reference audio"
}'
Setting ref_audio at the batch level applies it to all items, avoiding the need to repeat it per item.
Decoding the response in Python:
import base64
import httpx
response = httpx.post(
"http://localhost:8091/v1/audio/speech/batch",
json={
"items": [
{"input": "First sentence."},
{"input": "Second sentence."},
],
"voice": "vivian",
},
timeout=300.0,
)
for result in response.json()["results"]:
if result["status"] == "success":
audio_bytes = base64.b64decode(result["audio_data"])
with open(f"output_{result['index']}.wav", "wb") as f:
f.write(audio_bytes)
Configuration¶
| Parameter | Source | Default | Description |
|---|---|---|---|
tts_batch_max_items | engine kwarg | 32 | Maximum number of items per batch request |
All items are fanned out to generate() concurrently. The engine's stage worker automatically batches them up to the configured max_batch_size and queues the rest — no client-side throttling needed.
For best throughput, set both stages' max_num_seqs above 1 via --stage-overrides. On the current Qwen3-TTS CustomVoice benchmark, stage 1 performed best at max_num_seqs: 10:
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
--omni --port 8091 --trust-remote-code --enforce-eager \
--stage-overrides '{"0":{"max_num_seqs":10,"gpu_memory_utilization":0.2},
"1":{"max_num_seqs":10,"gpu_memory_utilization":0.2}}'
The bundled qwen3_tts.yaml uses a multi-request default and lets stage 1 batch chunks across in-flight requests. For latency-sensitive deployments, avoid forcing stage 1 back to max_num_seqs: 1; benchmark before reducing it below 10.
The bundled config also sets initial_codec_chunk_frames: 1. This emits only the first audio chunk early for lower TTFA, then returns to the normal codec_chunk_frames window so Code2Wav does not repeatedly decode tiny overlapping chunks.
Supported Models¶
Qwen3-TTS¶
| Model | Task Type | Description |
|---|---|---|
Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | CustomVoice | Predefined speaker voices with optional style control |
Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | VoiceDesign | Natural language voice style description |
Qwen/Qwen3-TTS-12Hz-1.7B-Base | Base | Voice cloning from reference audio |
Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice | CustomVoice | Smaller/faster variant |
Qwen/Qwen3-TTS-12Hz-0.6B-Base | Base | Smaller/faster variant for voice cloning |
Fish Speech S2 Pro¶
| Model | Description |
|---|---|
fishaudio/s2-pro | 4B dual-AR TTS with DAC codec (44.1 kHz). Supports text-to-speech and voice cloning. |
Fish Speech uses ref_audio and ref_text for voice cloning (no task_type needed). The voice field should be set to "default". See the Fish Speech section of the online TTS hub for details.
Voxtral TTS¶
| Model | Description |
|---|---|
mistralai/Voxtral-4B-TTS-2603 | 3B AR + FlowMatching TTS. Supports text-to-speech with preset voices. |
CosyVoice3¶
| Model | Description |
|---|---|
FunAudioLLM/Fun-CosyVoice3-0.5B-2512 | Voice cloning from ref_audio + ref_text. No built-in voice presets — upload a voice or pass ref_audio/ref_text per request. |
OmniVoice¶
| Model | Description |
|---|---|
k2-fsa/OmniVoice | Pure-diffusion TTS. Supports voice cloning via ref_audio (with optional ref_text); no built-in voice presets. |
VoxCPM2¶
| Model | Description |
|---|---|
openbmb/VoxCPM2 | TTS + voice cloning with built-in speaker presets and uploaded-voice support. Accepts voice (preset or uploaded) or ref_audio + optional ref_text. |
MOSS-TTS-Nano¶
| Model | Description |
|---|---|
OpenMOSS-Team/MOSS-TTS-Nano | Voice cloning only. Requires ref_audio (or an uploaded voice); no built-in voice presets. ref_text is accepted but ignored — upstream's voice_clone mode does not consume a transcript. |
Error Responses¶
400 Bad Request¶
Invalid parameters:
{
"error": {
"message": "Input text cannot be empty",
"type": "BadRequestError",
"param": null,
"code": 400
}
}
404 Not Found¶
Model not found:
{
"error": {
"message": "The model `xxx` does not exist.",
"type": "NotFoundError",
"param": "model",
"code": 404
}
}
Troubleshooting¶
"TTS model did not produce audio output"¶
Ensure you're using the correct model variant for your task type: - CustomVoice task → CustomVoice model - VoiceDesign task → VoiceDesign model - Base task → Base model
Server Not Running¶
Out of Memory¶
If you encounter OOM errors: 1. Use smaller model variant: Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice 2. Reduce --gpu-memory-utilization
Unsupported Speaker¶
Use /v1/audio/voices to list available voices for the loaded model.
Development¶
Enable debug logging: