Text-To-Speech (Online Serving)¶
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/text_to_speech.
vLLM-Omni exposes TTS models through the OpenAI-compatible POST /v1/audio/speech endpoint, launched with vllm serve <model> --omni. Each TTS model has its own subdirectory containing client snippets, gradio demos, and helper scripts; this README is the single doc entry point for all of them.
For offline inference, see examples/offline_inference/text_to_speech. For the full list of supported architectures across all modalities, see Supported Models.
Supported Models¶
| Model | HuggingFace repo | Voice cloning | Streaming | Voice presets / upload | Gradio demo |
|---|---|---|---|---|---|
| Fish Speech S2 Pro | fishaudio/s2-pro | ✓ (ref_audio+ref_text) | ✓ (PCM stream) | — | ✓ |
| GLM-TTS | zai-org/GLM-TTS | ✓ (ref_audio+ref_text, required) | ✓ (PCM stream) | — | ✓ |
| Ming-flash-omni-TTS | Jonathan1909/Ming-flash-omni-2.0 | — (caption-controlled) | — | caption fields (instructions) | — |
| MOSS-TTS-Nano | OpenMOSS-Team/MOSS-TTS-Nano | ✓ (ref_audio required) | ✓ (PCM stream) | — | ✓ |
| OmniVoice | k2-fsa/OmniVoice | ✓ | — | — | — |
| Qwen3-TTS | Qwen/Qwen3-TTS-12Hz-1.7B-{CustomVoice,VoiceDesign,Base} | ✓ (Base) | ✓ (PCM + WebSocket) | ✓ (presets + /v1/audio/voices upload) | ✓ (standard + FastRTC) |
| VoxCPM2 | openbmb/VoxCPM2 | ✓ | ✓ (AudioWorklet via gradio) | — | ✓ |
| Voxtral TTS | mistralai/Voxtral-4B-TTS-2603 | ✓ (gated upstream) | ✓ | ✓ (presets) | ✓ |
CosyVoice3 is intentionally absent: no online example exists for it yet. See its offline section instead.
Common Quick Start¶
Launch the server (defaults shown — adjust --port, --gpu-memory-utilization, etc. as needed):
Send a TTS request via curl:
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Hello, how are you?",
"voice": "default",
"response_format": "wav"
}' --output output.wav
Or via Python httpx:
import httpx

response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={
        "input": "Hello, how are you?",
        "voice": "default",
        "response_format": "wav",
    },
    timeout=300.0,
)
# Fail loudly instead of writing an error body into the audio file.
response.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(response.content)
Or via the OpenAI SDK:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
response = client.audio.speech.create(
    model="<hf-repo>",
    voice="default",
    input="Hello, how are you?",
)
response.stream_to_file("output.wav")
Streaming PCM output (where supported) — set stream=true with response_format="pcm":
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Hello, how are you?",
"voice": "default",
"stream": true,
"response_format": "pcm"
}' --no-buffer | play -t raw -r 24000 -e signed -b 16 -c 1 -
Adjust the player's sample rate to match the model: 44.1 kHz for Fish Speech, 48 kHz for VoxCPM2 and MOSS-TTS-Nano, and 24 kHz for the others.
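Streamed PCM carries no header, so a saved stream is only playable when the sample rate is supplied separately. As a minimal sketch, assuming the stream is 16-bit signed little-endian mono (which the play flags above suggest but this README does not state outright), Python's standard wave module can wrap the raw bytes in a WAV container. The helper names and the rate table keys are illustrative and not part of vLLM-Omni.

```python
import io
import wave

# Sample rates quoted in this README; the keys are illustrative labels only.
SAMPLE_RATES_HZ = {
    "fish_speech": 44_100,
    "voxcpm2": 48_000,
    "moss_tts_nano": 48_000,
    "default": 24_000,
}


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container."""
    if len(pcm) % 2:
        raise ValueError("16-bit PCM must contain an even number of bytes")
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)  # mono
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_seconds(num_bytes: int, sample_rate: int) -> float:
    """Duration of a 16-bit mono PCM stream of the given size."""
    return num_bytes / (2 * sample_rate)
```

A stream saved as output.pcm could then be converted by reading the file, calling pcm_to_wav_bytes with the model's rate, and writing the result to a .wav file.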
For full request-shape documentation (all parameters, response formats, error codes), see the Speech API reference.
GLM-TTS¶
2-stage TTS (AR + DiT flow-matching) at 24 kHz. Every request requires ref_audio + ref_text.
Launch¶
vllm serve zai-org/GLM-TTS --omni --trust-remote-code --port 8091
# or:
bash examples/online_serving/text_to_speech/glm_tts/run_server.sh /path/to/GLM-TTS
Sending requests¶
# Voice cloning (required)
python examples/online_serving/text_to_speech/glm_tts/openai_speech_client.py \
--text "你好,这是语音克隆测试。" \
--ref-audio file:///path/to/ref.wav \
--ref-text "这是参考音频的文本内容。"
# Custom format
python examples/online_serving/text_to_speech/glm_tts/openai_speech_client.py \
--text "Hello, this is a voice cloning test." \
--ref-audio file:///path/to/ref.wav \
--ref-text "Transcript of the reference audio." \
--response-format mp3 -o output.mp3
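The --ref-audio values above are file:// URIs rather than bare paths. A small sketch of building one from a local path with the standard library follows; to_file_uri is a hypothetical helper, and it assumes the path is readable from wherever the URI is ultimately resolved.

```python
from pathlib import Path


def to_file_uri(path: str) -> str:
    """Return an absolute file:// URI for a local path."""
    # as_uri() only works on absolute paths, so resolve first.
    return Path(path).expanduser().resolve().as_uri()
```

The result can be passed directly as the --ref-audio argument.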
Gradio demo¶
Notes¶
- Output: 24 kHz mono WAV via HiFT vocoder.
- ref_audio + ref_text are required together on every request. Reference audio should be 3-10 seconds.
- Voice cloning feature extraction (WhisperVQ, CampPlus, mel) runs on the model side, so the serving layer has no external dependency.
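Since the reference clip should fall in the 3-10 second range, a client can check the duration before sending a request. The sketch below uses the standard wave module, which only reads PCM WAV files; the function names are hypothetical and the check is a client-side convenience, not something the server is known to enforce.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of a PCM WAV file held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_reference(
    wav_bytes: bytes, min_s: float = 3.0, max_s: float = 10.0
) -> bool:
    """True when the clip length is inside the recommended window."""
    return min_s <= wav_duration_seconds(wav_bytes) <= max_s
```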
Fish Speech S2 Pro¶
4B dual-AR TTS at 44.1 kHz. Server uses the DAC codec.
Prerequisites¶
Kvcache attention fast path¶
Fish Speech S2 Pro uses a Triton decode-only kvcache attention fast path by default on CUDA builds. Set VLLM_OMNI_FISH_KVCACHE_ATTN=0 to disable it, or VLLM_OMNI_FISH_KVCACHE_ATTN=required to fail fast if the fast path cannot be installed.
# Verify fast path availability.
python - <<'PY'
from vllm_omni.attention import fish_kvcache_attn
print(fish_kvcache_attn.is_available())
print(fish_kvcache_attn.load_error())
PY
# Optional: disable the runtime fast path.
export VLLM_OMNI_FISH_KVCACHE_ATTN=0
Launch¶
The deploy config auto-loads from vllm_omni/deploy/fish_qwen3_omni.yaml (the HF model_type on the fishaudio checkpoint is fish_qwen3_omni).
Voice cloning¶
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Hello, this is a cloned voice.",
"voice": "default",
"ref_audio": "https://example.com/reference.wav",
"ref_text": "Transcript of the reference audio."
}' --output cloned.wav
CLI client¶
cd examples/online_serving/text_to_speech/fish_speech
python speech_client.py --text "Hello, how are you?"
python speech_client.py --text "Hello world" --stream --output output.pcm
Gradio demo¶
./fish_speech/run_gradio_demo.sh # launches server + Gradio
python fish_speech/gradio_demo.py --api-base http://localhost:8091 # if server already running
Notes¶
- Output: 44.1 kHz mono.
- Streaming PCM player command must use -r 44100.
Ming-flash-omni-TTS¶
Standalone talker-only deployment of Ming-flash-omni-2.0. Voice is controlled through caption text passed via instructions.
Launch¶
Equivalent manual command:
vllm serve Jonathan1909/Ming-flash-omni-2.0 \
--deploy-config vllm_omni/deploy/ming_flash_omni_tts.yaml \
--host 0.0.0.0 --port 8091 \
--trust-remote-code --omni
Sending requests¶
python examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py \
--text "我们当迎着阳光辛勤耕作,去摘取,去制作,去品尝,去馈赠。" \
--output ming_online.wav
ASMR-style caption via instructions:
python examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py \
--text "我会一直在这里陪着你,直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗?" \
--instructions "这是一种ASMR耳语,属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语,声音气音成分重。" \
--output ming_online_asmr.wav
Notes¶
- Server uses use_zero_spk_emb=True and the cookbook decode defaults (max_decode_steps=200, cfg=2.0, sigma=0.25, temperature=0.0). For other caption fields (语速 (speech rate), 基频 (pitch), IP, BGM, etc.) or overriding decode args, use the offline example where additional_information is set explicitly.
- This is the online counterpart of examples/offline_inference/text_to_speech/ming_flash_omni_tts/.
- For multimodal Ming-flash-omni online serving, see examples/online_serving/ming_flash_omni/.
MOSS-TTS-Nano¶
Single-stage 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec at 48 kHz mono. Every request must include ref_audio; there are no built-in speaker presets.
The OpenAI-schema voice and ref_text fields are accepted but ignored: voice_clone does not consume a transcript, and upstream's continuation mode (the only path that accepts prompt_text) emits near-silent output, so it is not exposed here. Sample reference clips ship in the upstream repo under assets/audio/.
Launch¶
The deploy config at vllm_omni/deploy/moss_tts_nano.yaml auto-loads; no --stage-configs-path, --trust-remote-code, or --enforce-eager flags are needed.
Sending requests¶
# One-off fetch of a sample reference clip; cache under XDG_CACHE_HOME.
REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
mkdir -p "$REF_DIR"
REF_WAV="$REF_DIR/zh_1.wav"
[ -s "$REF_WAV" ] || curl -L -o "$REF_WAV" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav
REF_AUDIO=$(base64 -w 0 "$REF_WAV")
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d "{
\"input\": \"你好,这是语音合成测试。\",
\"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
\"response_format\": \"wav\"
}" --output output.wav
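The same request body can be assembled in Python, which avoids hand-escaping the JSON inside a shell string. This is a sketch only: build_moss_payload is a hypothetical helper, and the field names are the ones used in the curl example above.

```python
import base64


def build_moss_payload(text: str, ref_wav: bytes, stream: bool = False) -> dict:
    """Build a /v1/audio/speech body with an inline base64 reference clip."""
    ref_b64 = base64.b64encode(ref_wav).decode("ascii")
    payload = {
        "input": text,
        "ref_audio": f"data:audio/wav;base64,{ref_b64}",
        # Streaming responses are requested as raw PCM.
        "response_format": "pcm" if stream else "wav",
    }
    if stream:
        payload["stream"] = True
    return payload
```

The returned dictionary can be sent with httpx.post(..., json=payload, timeout=300.0), as in the Common Quick Start.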
Streaming PCM¶
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d "{
\"input\": \"Hello, streaming output from MOSS-TTS-Nano.\",
\"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
\"stream\": true,
\"response_format\": \"pcm\"
}" --no-buffer | play -t raw -r 48000 -e signed -b 16 -c 1 -
Gradio demo¶
# Option 1: launch server + Gradio together
./moss_tts_nano/run_gradio_demo.sh
# Option 2: server already running
python moss_tts_nano/gradio_demo.py --api-base http://localhost:8091
Notes¶
- Output is 48 kHz mono PCM (the upstream tokenizer is internally stereo at 48 kHz; the wrapper averages to mono before reaching the engine).
- Standard /v1/audio/speech request shape: input, ref_audio (base64 data URL), response_format, stream, max_new_tokens. The voice and ref_text fields from the OpenAI schema are accepted but ignored.
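To make the stereo-to-mono averaging mentioned above concrete, here is an illustrative sketch on interleaved 16-bit little-endian PCM. It is not the wrapper's code (which presumably works on tensors rather than byte strings); the function name and the integer rounding choice are assumptions for the example.

```python
import struct


def stereo_to_mono_s16le(pcm: bytes) -> bytes:
    """Average interleaved left/right 16-bit samples into one channel."""
    if len(pcm) % 4:
        raise ValueError("stereo 16-bit PCM must be a multiple of 4 bytes")
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm)
    # Samples alternate L, R, L, R; floor division keeps results in int16 range.
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, count, 2)]
    return struct.pack(f"<{len(mono)}h", *mono)
```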
OmniVoice¶
Zero-shot multilingual TTS (600+ languages). Online serving currently exposes auto voice only; voice cloning and voice design are available offline.
Prerequisites¶
Voice cloning (offline) needs transformers>=5.3.0; auto voice works with transformers>=4.57.0.
Launch¶
CLI client¶
cd examples/online_serving/text_to_speech/omnivoice
# Text-only (auto voice)
python speech_client.py --text "Hello, how are you?"
# Language hint
python speech_client.py --text "Bonjour, comment allez-vous?" --language French
# Voice cloning (reference audio + optional ref_text)
python speech_client.py \
--text "Bonjour, comment allez-vous?" \
--ref-audio /path/to/ref_audio.wav \
--ref-text "Bonjour, comment allez-vous?"
# Style instruction (voice design-style control)
python speech_client.py \
--text "Bonjour, comment allez-vous?" \
--language French \
--instructions "loud voice"
# Deterministic output with seed parameter
python speech_client.py --text "Hello, how are you?" --seed 42
The client supports --api-base, --model, --text, --response-format, --language, --voice, --ref-audio, --ref-text, --instructions, --seed, and --output.
Qwen3-TTS¶
Three model variants exposed via separate checkpoints:
| Variant | HF repo | Use |
|---|---|---|
| CustomVoice | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | Predefined speakers (vivian, ryan, …) with optional style instructions |
| VoiceDesign | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | Natural-language voice style description |
| Base | Qwen/Qwen3-TTS-12Hz-1.7B-Base | Voice cloning from a reference audio |
Each variant ships smaller 0.6B companions where available.
Launch¶
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni --port 8091
# or:
./qwen3_tts/run_server.sh # default: CustomVoice
./qwen3_tts/run_server.sh VoiceDesign
./qwen3_tts/run_server.sh Base
Executor backend¶
Single-GPU serving now defaults to the uniproc executor, which has lower IPC overhead (see the Base cloning use case from #2603 / #2604). vllm_omni/deploy/qwen3_tts.yaml is the only Qwen3-TTS deploy config; pass --deploy-config <path> to override.
To opt out of chunked streaming, pass --no-async-chunk — the pipeline auto-dispatches to the end-to-end codec processor.
Sending requests¶
# CustomVoice with a predefined speaker
python qwen3_tts/openai_speech_client.py \
--model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
--text "今天天气真好" \
--speaker ryan \
--instructions "用开心的语气说"
# VoiceDesign with a style description
python qwen3_tts/openai_speech_client.py \
--model Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
--task-type VoiceDesign \
--text "哥哥,你回来啦" \
--instructions "体现撒娇稚嫩的萝莉女声,音调偏高"
# Base voice cloning
python qwen3_tts/openai_speech_client.py \
--model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
--task-type Base \
--text "Hello, this is a cloned voice" \
--ref-audio /path/to/reference.wav \
--ref-text "Original transcript of the reference audio"
Voices endpoint¶
List available voices, or upload a custom one for Base cloning:
# List
curl http://localhost:8091/v1/audio/voices
# Upload
curl -X POST http://localhost:8091/v1/audio/voices \
-F "audio_sample=@/path/to/voice_sample.wav" \
-F "consent=user_consent_id" \
-F "name=custom_voice_1" \
-F "ref_text=The exact transcript of the audio sample." \
-F "speaker_description=warm narrator"
Then pass voice="custom_voice_1" on subsequent requests.
Precomputed custom voices¶
For reused Base voice-cloning speakers, precompute the reference artifacts once and load them at server startup:
python qwen3_tts/precompute_custom_voice.py \
--model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
--voice-name alice \
--ref-audio /path/to/reference.wav \
--ref-text "Original transcript of the reference audio" \
--mode icl \
--output-dir /path/to/custom_voices
--mode icl stores both speaker_embedding and ref_code; --mode xvec stores only the speaker embedding.
Add the output directory to a deploy config:
Then start the server with that config and call the Speech API with only the voice name:
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base --omni --deploy-config /path/to/qwen3_tts_custom_voice.yaml
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input":"Hello from a precomputed voice.","voice":"alice","task_type":"Base"}' \
--output alice.wav
Streaming PCM¶
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Hello, how are you?",
"voice": "vivian",
"language": "English",
"stream": true,
"response_format": "pcm"
}' --no-buffer | play -t raw -r 24000 -e signed -b 16 -c 1 -
Streaming requires response_format="pcm" and async_chunk: true on the stage config (default in qwen3_tts.yaml). The speed parameter is not supported when streaming.
Streaming WebSocket¶
The /v1/audio/speech/stream endpoint accepts text incrementally, splits it at sentence boundaries, and emits one PCM stream per sentence:
python qwen3_tts/streaming_speech_client.py --text "Hello world. How are you? I am fine."
python qwen3_tts/streaming_speech_client.py --text "..." --simulate-stt --stt-delay 0.1
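The sentence-boundary behaviour described for the WebSocket endpoint can be pictured with a small buffer that accepts text incrementally and releases complete sentences. This is a conceptual sketch, not the server's implementation: the SentenceBuffer class and its punctuation rule are assumptions, and a naive rule like this one will also split on decimals such as "3.14".

```python
import re

# Runs of sentence-ending punctuation (ASCII and CJK) plus trailing whitespace.
_BOUNDARY = re.compile(r"[.!?。!?]+\s*")


class SentenceBuffer:
    """Accumulate incoming text and emit sentences as they complete."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk: str) -> list[str]:
        """Add text and return any sentences completed so far."""
        self._pending += chunk
        sentences = []
        start = 0
        for match in _BOUNDARY.finditer(self._pending):
            sentence = self._pending[start:match.end()].strip()
            if sentence:
                sentences.append(sentence)
            start = match.end()
        # Keep the unfinished tail for the next call.
        self._pending = self._pending[start:]
        return sentences

    def flush(self) -> str:
        """Return whatever text is left once the input ends."""
        rest, self._pending = self._pending.strip(), ""
        return rest
```

Each string returned by feed would correspond to one synthesized PCM stream; flush covers a final fragment that never received closing punctuation.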
Gradio demos¶
./qwen3_tts/run_gradio_demo.sh # CustomVoice (default)
./qwen3_tts/run_gradio_demo.sh --task-type VoiceDesign
./qwen3_tts/run_gradio_demo.sh --task-type Base
Speaker embedding interpolation¶
qwen3_tts/speaker_embedding_interpolation.py blends two predefined speakers' embeddings to produce intermediate voices. See the script for usage.
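As a rough picture of what blending two embeddings means, the sketch below does plain linear interpolation between two equal-length vectors. It is only an illustration under that assumption: the script itself may use a different blend (for example spherical interpolation), and interpolate_embeddings is a hypothetical name.

```python
def interpolate_embeddings(a: list[float], b: list[float], alpha: float) -> list[float]:
    """Blend two speaker embeddings: alpha=0 returns a, alpha=1 returns b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]
```

Sweeping alpha from 0 to 1 in small steps would give a sequence of intermediate voices between the two speakers.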
Batch client¶
qwen3_tts/batch_speech_client.py issues many concurrent requests for throughput measurement.
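For readers who want the general shape of a concurrent throughput measurement, here is a minimal sketch using a thread pool. It is not the batch client's code: measure_throughput is a hypothetical helper, and synthesize stands for any callable that takes text and returns audio bytes (for example a wrapper around the httpx request from the Common Quick Start).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(synthesize, texts, concurrency: int = 8) -> dict:
    """Call synthesize(text) -> bytes for every text and time the batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outputs = list(pool.map(synthesize, texts))
    elapsed = time.perf_counter() - start
    return {
        "requests": len(outputs),
        "total_bytes": sum(len(out) for out in outputs),
        "elapsed_s": elapsed,
        "requests_per_s": len(outputs) / elapsed if elapsed > 0 else float("inf"),
    }
```

Threads are enough here because each worker spends its time waiting on the network rather than computing.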
Notes¶
- Base voice cloning has uniproc-vs-mp tradeoffs depending on per-request reference audio cost; see the executor-backend section above.
- With async chunking, Qwen3-TTS Base voice cloning sends the full reference context in the first Code2Wav packet, then caches that prefix on the Code2Wav stage for follow-up chunks in the same request.
- vllm_omni/deploy/qwen3_tts.yaml is the default deploy config (loaded by HF model_type); per-stage runtime overrides are available via --stage-N-<field> <value>.
VoxCPM2¶
Single-stage native AR TTS at 48 kHz.
Launch¶
Deploy config auto-loads from vllm_omni/deploy/voxcpm2.yaml. Pass --deploy-config <path> to override or --stage-N-<field> <value> for per-stage runtime tweaks.
Sending requests¶
# Zero-shot synthesis
python voxcpm2/openai_speech_client.py --text "Hello, this is VoxCPM2."
# Voice cloning
python voxcpm2/openai_speech_client.py \
--text "This should sound like the reference speaker." \
--ref-audio /path/to/reference.wav
The ref_audio field accepts local file paths (auto-base64), HTTP URLs, or data:audio/wav;base64,... data URIs.
Precomputed custom voices¶
For repeated VoxCPM2 speakers, precompute the prompt cache and load it through custom_voice_dir:
python voxcpm2/precompute_custom_voice.py \
--model openbmb/VoxCPM2 \
--voice-name alice \
--ref-audio /path/to/reference.wav \
--mode ref_continuation \
--prompt-text "Original transcript of the reference audio" \
--output-dir /path/to/custom_voices
Once the server has loaded that directory, /v1/audio/voices lists alice, and /v1/audio/speech can use voice="alice" without sending ref_audio.
Gradio demo (gapless streaming via AudioWorklet)¶
Uses an AudioWorklet-based player adapted from the Qwen3-TTS demo for gap-free playback. Audio is streamed from the OpenAI Speech endpoint with stream=true.
Voxtral TTS¶
Voxtral-4B-TTS (Mistral). Uses the mistral_common SpeechRequest protocol; voice presets are model-specific.
Prerequisites¶
Latest mistral_common with SpeechRequest support:
Launch¶
Deploy config auto-loads from vllm_omni/deploy/voxtral_tts.yaml.
Gradio demo¶
The demo handles voice-preset selection and reference-audio upload. voxtral_tts/text_preprocess.py provides the text-normalization helpers used by the demo (also available for other clients).
Notes¶
- Voice presets are listed on the HF model card (mistralai/Voxtral-4B-TTS-2603).
- Voice cloning is gated upstream and may require a recent mistral_common.
- A standalone CLI client is not yet shipped; the gradio demo is the canonical reference for now.
Example materials¶
cosyvoice3/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for CosyVoice3 TTS
#
# Usage:
# ./run_server.sh
# CUDA_VISIBLE_DEVICES=0 ./run_server.sh
#
# Streaming (async-chunk) is on by default via vllm_omni/deploy/cosyvoice3.yaml.
# Set NO_ASYNC_CHUNK=1 to use the legacy synchronous path.
set -e
MODEL="${MODEL:-FunAudioLLM/Fun-CosyVoice3-0.5B-2512}"
PORT="${PORT:-8091}"
EXTRA_ARGS=()
if [[ -n "${NO_ASYNC_CHUNK:-}" ]]; then
EXTRA_ARGS+=(--no-async-chunk)
fi
echo "Starting CosyVoice3 server with model: $MODEL"
vllm serve "$MODEL" \
--host 0.0.0.0 \
--port "$PORT" \
--trust-remote-code \
--omni \
"${EXTRA_ARGS[@]}"
cosyvoice3/speech_client.py
"""Client for CosyVoice3 TTS via /v1/audio/speech endpoint.
CosyVoice3 has no built-in voice presets: every request is voice cloning
driven by ``ref_audio`` + ``ref_text``. The defaults below point at the
official upstream zero-shot prompt so the script runs out of the box.
Examples:
# Voice cloning with the default upstream prompt
python speech_client.py --text "收到好友从远方寄来的生日礼物。"
# Custom reference clip + transcript
python speech_client.py --text "Hello, this is a cloned voice." \
--ref-audio /path/to/reference.wav \
--ref-text "Transcript of the reference audio."
# Streaming PCM output
python speech_client.py --text "Hello world" --stream --output output.pcm
"""
import argparse
import base64
import os
import httpx
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
DEFAULT_MODEL = "FunAudioLLM/Fun-CosyVoice3-0.5B-2512"
# Official CosyVoice zero-shot prompt and its transcript.
DEFAULT_REF_AUDIO = "https://raw.githubusercontent.com/FunAudioLLM/CosyVoice/main/asset/zero_shot_prompt.wav"
DEFAULT_REF_TEXT = "希望你以后能够做的比我还好呦。"
def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "flac": "audio/flac", "ogg": "audio/ogg"}
    mime_type = mime_map.get(ext, "audio/wav")
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"
def run_tts(args) -> None:
    """Generate speech via the /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }
    if args.ref_audio.startswith(("http://", "https://")):
        payload["ref_audio"] = args.ref_audio
    else:
        payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    payload["ref_text"] = args.ref_text
    if args.stream:
        payload["stream"] = True
        payload["response_format"] = "pcm"
    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print(f"Voice cloning: ref_audio={args.ref_audio}, ref_text={args.ref_text}")
    print("Generating audio...")
    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }
    if args.stream:
        output_path = args.output or "output.pcm"
        with httpx.Client(timeout=300.0) as client:
            with client.stream("POST", api_url, json=payload, headers=headers) as resp:
                if resp.status_code != 200:
                    print(f"Error: {resp.status_code}")
                    print(resp.read().decode())
                    return
                total_bytes = 0
                with open(output_path, "wb") as f:
                    for chunk in resp.iter_bytes():
                        f.write(chunk)
                        total_bytes += len(chunk)
        print(f"Streamed {total_bytes} bytes to: {output_path}")
    else:
        with httpx.Client(timeout=300.0) as client:
            response = client.post(api_url, json=payload, headers=headers)
        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.text)
            return
        try:
            text = response.content.decode("utf-8")
            if text.startswith('{"error"'):
                print(f"Error: {text}")
                return
        except UnicodeDecodeError:
            pass
        output_path = args.output or "output.wav"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Audio saved to: {output_path}")
def main():
    parser = argparse.ArgumentParser(description="CosyVoice3 TTS client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default=DEFAULT_MODEL, help="Model name")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument(
        "--ref-audio",
        default=DEFAULT_REF_AUDIO,
        help="Reference audio for voice cloning (path or URL)",
    )
    parser.add_argument(
        "--ref-text",
        default=DEFAULT_REF_TEXT,
        help="Transcript of the reference audio",
    )
    parser.add_argument("--stream", action="store_true", help="Enable streaming (PCM output)")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    args = parser.parse_args()
    run_tts(args)
if __name__ == "__main__":
    main()
fish_speech/gradio_demo.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/fish_speech/gradio_demo.py.
fish_speech/run_gradio_demo.sh
#!/bin/bash
# Launch Fish Speech S2 Pro server + Gradio demo together.
#
# Usage:
# ./run_gradio_demo.sh
# CUDA_VISIBLE_DEVICES=0 PORT=8091 GRADIO_PORT=7860 ./run_gradio_demo.sh
set -e
MODEL="${MODEL:-fishaudio/s2-pro}"
PORT="${PORT:-8091}"
GRADIO_PORT="${GRADIO_PORT:-7860}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
echo "Starting Fish Speech S2 Pro server (port $PORT)..."
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
--omni \
--host 0.0.0.0 \
--port "$PORT" &
SERVER_PID=$!
cleanup() {
echo "Stopping server (PID $SERVER_PID)..."
kill $SERVER_PID 2>/dev/null
wait $SERVER_PID 2>/dev/null
}
trap cleanup EXIT
# Wait for server to be ready.
echo "Waiting for server to start..."
for i in $(seq 1 120); do
if curl -s "http://localhost:$PORT/health" > /dev/null 2>&1; then
echo "Server ready."
break
fi
sleep 2
done
echo "Starting Gradio demo (port $GRADIO_PORT)..."
python "$SCRIPT_DIR/gradio_demo.py" \
--api-base "http://localhost:$PORT" \
--port "$GRADIO_PORT"
fish_speech/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for Fish Speech S2 Pro
#
# Usage:
# ./run_server.sh
# CUDA_VISIBLE_DEVICES=0 ./run_server.sh
set -e
MODEL="${MODEL:-fishaudio/s2-pro}"
PORT="${PORT:-8091}"
echo "Starting Fish Speech S2 Pro server with model: $MODEL"
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
--omni \
--host 0.0.0.0 \
--port "$PORT"
fish_speech/speech_client.py
"""Client for Fish Speech S2 Pro via /v1/audio/speech endpoint.
Examples:
# Basic TTS
python speech_client.py --text "Hello, how are you?"
# Voice cloning
python speech_client.py --text "Hello, how are you?" \
--ref-audio ref.wav --ref-text "This is the reference transcript."
# Streaming PCM output
python speech_client.py --text "Hello world" --stream --output output.pcm
"""
import argparse
import base64
import os
import httpx
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "flac": "audio/flac", "ogg": "audio/ogg"}
    mime_type = mime_map.get(ext, "audio/wav")
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"
def run_tts(args) -> None:
    """Generate speech via /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "voice": "default",
        "response_format": args.response_format,
    }
    # Voice cloning parameters.
    if args.ref_audio:
        if args.ref_audio.startswith(("http://", "https://")):
            payload["ref_audio"] = args.ref_audio
        else:
            payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    if args.ref_text:
        payload["ref_text"] = args.ref_text
    if args.stream:
        payload["stream"] = True
        payload["response_format"] = "pcm"
    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    if args.ref_audio:
        print(f"Voice cloning: ref_audio={args.ref_audio}, ref_text={args.ref_text}")
    print("Generating audio...")
    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }
    if args.stream:
        output_path = args.output or "output.pcm"
        with httpx.Client(timeout=300.0) as client:
            with client.stream("POST", api_url, json=payload, headers=headers) as resp:
                if resp.status_code != 200:
                    print(f"Error: {resp.status_code}")
                    print(resp.read().decode())
                    return
                total_bytes = 0
                with open(output_path, "wb") as f:
                    for chunk in resp.iter_bytes():
                        f.write(chunk)
                        total_bytes += len(chunk)
        print(f"Streamed {total_bytes} bytes to: {output_path}")
    else:
        with httpx.Client(timeout=300.0) as client:
            response = client.post(api_url, json=payload, headers=headers)
        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.text)
            return
        try:
            text = response.content.decode("utf-8")
            if text.startswith('{"error"'):
                print(f"Error: {text}")
                return
        except UnicodeDecodeError:
            pass
        output_path = args.output or "output.wav"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Audio saved to: {output_path}")
def main():
    parser = argparse.ArgumentParser(description="Fish Speech S2 Pro TTS client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default="fishaudio/s2-pro", help="Model name")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument("--ref-audio", default=None, help="Reference audio for voice cloning (path or URL)")
    parser.add_argument("--ref-text", default=None, help="Transcript of reference audio")
    parser.add_argument("--stream", action="store_true", help="Enable streaming (PCM output)")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    args = parser.parse_args()
    run_tts(args)
if __name__ == "__main__":
    main()
glm_tts/gradio_demo.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/glm_tts/gradio_demo.py.
glm_tts/openai_speech_client.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""OpenAI-compatible client for GLM-TTS via /v1/audio/speech endpoint.
GLM-TTS is a two-stage TTS system (AR + DiT) that generates audio from text
conditioned on reference speech. Each request requires ref_audio + ref_text.
Usage:
# Voice cloning
python openai_speech_client.py --text "你好" --ref-audio file:///path/to/ref.wav --ref-text "参考文本"
# Streaming response, for async_chunk server mode
python openai_speech_client.py --text "你好" --stream --ref-audio file:///path/to/ref.wav --ref-text "参考文本"
# Specify output format
python openai_speech_client.py --text "你好" --ref-audio file:///path/to/ref.wav \
--ref-text "参考文本" --response-format mp3 -o output.mp3
"""
import argparse
import httpx
# Default server configuration
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
def run_tts_generation(args) -> None:
    """Run TTS generation via OpenAI-compatible /v1/audio/speech API."""
    if not args.ref_audio or not args.ref_text:
        raise ValueError("GLM-TTS requires --ref-audio and --ref-text for voice cloning.")
    payload = {
        "model": args.model,
        "voice": "default",
        "input": args.text,
        "response_format": args.response_format,
        "stream": bool(args.stream),
        "ref_audio": args.ref_audio,
        "ref_text": args.ref_text,
    }
    if args.max_new_tokens:
        payload["max_new_tokens"] = args.max_new_tokens
    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print(f"Voice cloning: ref_audio={args.ref_audio}, ref_text={args.ref_text}")
    print(f"Stream: {args.stream}")
    print("Generating audio...")
    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }
    if args.stream:
        output_path = args.output or "tts_output.pcm"
        with httpx.Client(timeout=300.0) as client, open(output_path, "wb") as f:
            with client.stream("POST", api_url, json=payload, headers=headers) as response:
                if response.status_code != 200:
                    print(f"Error: {response.status_code}")
                    response.read()
                    print(response.text)
                    return
                for chunk in response.iter_bytes():
                    f.write(chunk)
        print(f"Streaming audio saved to: {output_path}")
    else:
        with httpx.Client(timeout=300.0) as client:
            response = client.post(api_url, json=payload, headers=headers)
        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.text)
            return
        try:
            text = response.content.decode("utf-8")
            if text.startswith('{"error"'):
                print(f"Error: {text}")
                return
        except UnicodeDecodeError:
            pass
        output_path = args.output or f"tts_output.{args.response_format}"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Audio saved to: {output_path}")
def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="OpenAI-compatible client for GLM-TTS via /v1/audio/speech",
    )
    # Server configuration
    parser.add_argument(
        "--api-base",
        type=str,
        default=DEFAULT_API_BASE,
        help=f"API base URL (default: {DEFAULT_API_BASE})",
    )
    parser.add_argument(
        "--api-key",
        type=str,
        default=DEFAULT_API_KEY,
        help="API key (default: EMPTY)",
    )
    parser.add_argument(
        "--model",
        "-m",
        type=str,
        default="glm-tts",
        help="Model name/path",
    )
    # Input text
    parser.add_argument(
        "--text",
        type=str,
        required=True,
        help="Text to synthesize",
    )
    # Generation parameters
    parser.add_argument(
        "--max-new-tokens",
        type=int,
        default=None,
        help="Maximum new tokens to generate (default: model default)",
    )
    # Output
    parser.add_argument(
        "--response-format",
        type=str,
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio output format (default: wav)",
    )
    parser.add_argument(
        "--stream",
        action="store_true",
        help="Request a streaming audio response (use with async_chunk server mode).",
    )
    parser.add_argument(
        "--output",
        "-o",
        type=str,
        default=None,
        help="Output audio file path (default: tts_output.<format>)",
    )
    # Voice cloning parameters
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio URL, file:// URI, or base64 data URL for voice cloning",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcript of the reference audio (required with --ref-audio)",
    )
    return parser.parse_args()
if __name__ == "__main__":
    args = parse_args()
    run_tts_generation(args)
glm_tts/run_gradio_demo.sh
#!/bin/bash
# Launch GLM-TTS server + Gradio demo together.
#
# Usage:
# ./run_gradio_demo.sh
# CUDA_VISIBLE_DEVICES=0 PORT=8091 GRADIO_PORT=7860 ./run_gradio_demo.sh
set -e
MODEL="${MODEL:-zai-org/GLM-TTS}"
PORT="${PORT:-8091}"
GRADIO_PORT="${GRADIO_PORT:-7860}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"
echo "Starting GLM-TTS server (port $PORT)..."
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm-omni serve "$MODEL" \
--deploy-config "$REPO_ROOT/vllm_omni/deploy/glm_tts.yaml" \
--host 0.0.0.0 \
--port "$PORT" \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--enforce-eager \
--omni &
SERVER_PID=$!
cleanup() {
echo "Stopping server (PID $SERVER_PID)..."
kill $SERVER_PID 2>/dev/null
wait $SERVER_PID 2>/dev/null
}
trap cleanup EXIT
# Wait for server to be ready.
echo "Waiting for server to start..."
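for i in $(seq 1 120); do
    if curl -s "http://localhost:$PORT/health" > /dev/null 2>&1; then
        echo "Server ready."
        break
    fi
    # Abort instead of starting the demo against a server that never came up.
    if [ "$i" -eq 120 ]; then
        echo "Server did not become ready after 120 attempts; aborting." >&2
        exit 1
    fi
    sleep 2
done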
echo "Starting Gradio demo (port $GRADIO_PORT)..."
python "$SCRIPT_DIR/gradio_demo.py" \
--api-base "http://localhost:$PORT" \
--port "$GRADIO_PORT"
glm_tts/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for GLM-TTS models
#
# Usage:
# ./run_server.sh # Default model path, async_chunk mode
# ./run_server.sh /path/to/GLM-TTS # Custom model path, async_chunk mode
# ./run_server.sh /path/to/GLM-TTS sync # Sync two-stage mode
#
# NOTE: The model path should point to the repo ROOT (not llm/ subdirectory).
# model_subdir/tokenizer_subdir in the pipeline config resolve subdirectories.
set -e
MODEL="${1:-zai-org/GLM-TTS}"
MODE="${2:-async}"
EXTRA_ARGS=()
case "$MODE" in
async|async_chunk)
;;
sync|no_async_chunk)
EXTRA_ARGS+=("--no-async-chunk")
;;
*)
echo "Unknown mode: $MODE (expected async or sync)" >&2
exit 1
;;
esac
echo "Starting GLM-TTS server with model: $MODEL (mode: $MODE)"
vllm-omni serve "$MODEL" \
--deploy-config vllm_omni/deploy/glm_tts.yaml \
--host 0.0.0.0 \
--port 8091 \
--trust-remote-code \
--omni \
"${EXTRA_ARGS[@]}"
higgs_audio_v2/README.md
higgs-audio v2 online example¶
This directory contains the online-serving entry points for boson-ai's higgs-audio v2 as integrated by vllm-omni: a 2-stage TTS pipeline (Llama-3.2-3B talker with DualFFN audio expert + HiggsAudio codec decoder) emitting 24 kHz mono speech.
Prerequisites¶
Voice cloning uses Hugging Face's HiggsAudioV2TokenizerModel, loaded from k2-fsa/OmniVoice/audio_tokenizer/. (In the boson-ai standalone tokenizer Hub repo, model.safetensors is the 3B talker LM, not the codec.) Only that subdirectory, about 806 MB, is downloaded.
Files¶
- run_server.sh: launch the vllm-omni server with the bundled vllm_omni/deploy/higgs_audio_v2.yaml deploy config.
- batch_speech_client.py: send a list of prompts to /v1/audio/speech and save the returned WAV / PCM bytes to a directory; optionally passes --ref-audio + --ref-text for shallow voice clone.
Launching the server¶
Environment overrides:
- MODEL: HF id of the talker (default bosonai/higgs-audio-v2-generation-3B-base).
- PORT: server port (default 8094).
- GPUS: CUDA_VISIBLE_DEVICES value (default 6,7).
- GPU_UTIL: --gpu-memory-utilization (default 0.4).
The script also exports VLLM_USE_DEEP_GEMM=0 / VLLM_MOE_USE_DEEP_GEMM=0 so the example works on images without the optional deep_gemm backend.
The deploy YAML ships with async_chunk: false and codec_streaming: true, i.e. Stage 0 finishes its codec frames before Stage 1 starts decoding, and Stage 1 streams WAV/PCM bytes to the client chunk-by-chunk.
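Raw PCM carries no header, so bytes collected from a PCM response usually need a container before a player will open them. The following is a minimal sketch using only the standard library. The 24 kHz mono format comes from the description above; 16-bit samples and the pcm_to_wav name are assumptions to check against the server's actual output.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24_000, channels: int = 1,
               sample_width: int = 2) -> bytes:
    """Wrap headerless PCM samples in a WAV container.

    sample_width=2 (16-bit) is an assumption, not something the README states.
    """
    frame_size = channels * sample_width
    # Drop a trailing partial frame, which can appear if a stream is cut short.
    usable = len(pcm) - (len(pcm) % frame_size)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm[:usable])
    return buf.getvalue()


# One second of silence at 24 kHz, 16-bit mono.
wav_bytes = pcm_to_wav(b"\x00\x00" * 24_000)
```

Chunks received from a streaming response can be concatenated first and converted once at the end, since the WAV header needs the total data length.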
Driving the server¶
Plain TTS:
python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
--base-url http://localhost:8094 \
--model bosonai/higgs-audio-v2-generation-3B-base \
--output-dir /tmp/higgs_audio_v2_batch \
--prompts "Hello world." \
"The quick brown fox jumps over the lazy dog."
Voice clone — pass a reference clip and its transcript (both required together):
python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
--base-url http://localhost:8094 \
--model bosonai/higgs-audio-v2-generation-3B-base \
--output-dir /tmp/higgs_audio_v2_clone \
--ref-audio /path/to/reference.wav \
--ref-text "Exact transcript spoken in reference.wav." \
--prompts "Hello, this is a cloned voice."
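For callers that do not use the bundled client, the request body for a voice clone can be assembled by hand. The sketch below mirrors what batch_speech_client.py sends: the reference clip as a base64 data URL in ref_audio, and its transcript in ref_text. The build_clone_payload helper is hypothetical, and the placeholder bytes stand in for a real WAV file.

```python
import base64


def build_clone_payload(text: str, ref_audio: bytes, ref_text: str,
                        model: str = "bosonai/higgs-audio-v2-generation-3B-base",
                        mime: str = "audio/wav") -> dict:
    """Build a /v1/audio/speech JSON body for a voice-clone request."""
    # The reference clip and its transcript are only valid as a pair.
    if not ref_audio or not ref_text.strip():
        raise ValueError("ref_audio and ref_text must be supplied together")
    b64 = base64.b64encode(ref_audio).decode("ascii")
    return {
        "model": model,
        "input": text,
        "response_format": "wav",
        "ref_audio": f"data:{mime};base64,{b64}",
        "ref_text": ref_text,
    }


# Placeholder bytes; in practice read them from the reference WAV file.
payload = build_clone_payload(
    "Hello, this is a cloned voice.", b"RIFF....", "Exact transcript."
)
```

The resulting dictionary can be posted as JSON to /v1/audio/speech with any HTTP client.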
Notes¶
- --ref-text must be the real transcript of --ref-audio; mismatched text degrades cloned-voice quality.
- Out of scope (rejected with an explicit 4xx by the request validator): multi-speaker [SPEAKERn] tags inside input, profile: text-only speaker descriptions, the ref_audio_in_system_message system-block variant, chunked long-form generation, and per-request voice / instructions / task_type / language / speed != 1.0 / x_vector_only_mode / speaker_embedding.
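A client can catch some of these cases before sending a request. The sketch below is a hypothetical pre-flight check, not the server's validator. It covers only the per-request fields and the [SPEAKERn] tag from the list above. Two details are assumptions: n is taken to be a decimal number, and a field only counts when it is set to a truthy value.

```python
import re

# Per-request fields listed as out of scope for higgs-audio v2.
UNSUPPORTED_FIELDS = ("voice", "instructions", "task_type", "language",
                      "x_vector_only_mode", "speaker_embedding")
# Assumed tag shape: [SPEAKER0], [SPEAKER1], ...
SPEAKER_TAG = re.compile(r"\[SPEAKER\d+\]")


def find_unsupported(payload: dict) -> list:
    """Return the names of payload fields the server is expected to reject."""
    problems = [name for name in UNSUPPORTED_FIELDS if payload.get(name)]
    # speed is only out of scope when it differs from 1.0.
    if payload.get("speed", 1.0) != 1.0:
        problems.append("speed")
    # Multi-speaker tags inside the input text are also rejected.
    if SPEAKER_TAG.search(payload.get("input", "")):
        problems.append("input")
    return problems


print(find_unsupported({"input": "Hello world.", "voice": "vivian", "speed": 1.5}))
```

An empty list only means none of these known cases matched; the server remains the authority on what it accepts.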
higgs_audio_v2/batch_speech_client.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Batch client for the higgs-audio v2 online server.
Sends a fixed list of prompts to ``/v1/audio/speech`` and saves the returned
WAV files (or raw PCM bytes when ``--format pcm``) into ``--output-dir``.
Usage (plain text -> speech):
python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
--base-url http://localhost:8094 \
--output-dir /tmp/higgs_audio_v2_batch \
--prompts "Hello world." "The quick brown fox jumps over the lazy dog."
Usage (shallow voice clone — pass a reference clip + its transcript):
python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
--base-url http://localhost:8094 \
--output-dir /tmp/higgs_audio_v2_clone \
--ref-audio path/to/reference.wav \
--ref-text "the transcript of the reference clip" \
--prompts "Hello world."
"""
from __future__ import annotations
import argparse
import base64
import sys
from pathlib import Path
DEFAULT_PROMPTS = (
"Hello world.",
"The quick brown fox jumps over the lazy dog.",
"It was the night before my birthday.",
"Innovation distinguishes between a leader and a follower.",
)
def _slug(text: str) -> str:
    import re

    # Collapse whitespace to underscores, then keep only filename-safe characters.
    s = re.sub(r"\s+", "_", text.strip().lower())
    return re.sub(r"[^a-z0-9_]+", "", s)[:32] or "prompt"
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
parser.add_argument("--base-url", default="http://localhost:8094")
parser.add_argument("--model", default="higgs_audio_v2")
parser.add_argument("--prompts", nargs="+", default=list(DEFAULT_PROMPTS))
parser.add_argument("--output-dir", type=Path, default=Path("/tmp/higgs_audio_v2_batch"))
parser.add_argument("--format", choices=("wav", "pcm"), default="wav")
parser.add_argument("--max-new-tokens", type=int, default=300)
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--timeout-s", type=float, default=120.0)
parser.add_argument(
"--ref-audio",
type=Path,
default=None,
help="Reference clip for voice clone (path to a WAV file). Must be paired with --ref-text.",
)
parser.add_argument(
"--ref-text",
type=str,
default=None,
help="Transcript of the reference clip. Required when --ref-audio is set.",
)
args = parser.parse_args()
if (args.ref_audio is None) != (args.ref_text is None):
print("--ref-audio and --ref-text must be supplied together", file=sys.stderr)
return 2
ref_audio_data_url: str | None = None
if args.ref_audio is not None:
if not args.ref_audio.exists():
print(f"ref-audio file not found: {args.ref_audio}", file=sys.stderr)
return 2
mime = "audio/wav" if args.ref_audio.suffix.lower() == ".wav" else "audio/mpeg"
ref_b64 = base64.b64encode(args.ref_audio.read_bytes()).decode("ascii")
ref_audio_data_url = f"data:{mime};base64,{ref_b64}"
try:
import httpx
except ImportError:
print(
"this client needs `httpx`. Install with `pip install httpx`.",
file=sys.stderr,
)
return 2
args.output_dir.mkdir(parents=True, exist_ok=True)
url = args.base_url.rstrip("/") + "/v1/audio/speech"
failures = 0
with httpx.Client(timeout=args.timeout_s) as client:
for prompt in args.prompts:
payload = {
"model": args.model,
"input": prompt,
"response_format": args.format,
"max_new_tokens": args.max_new_tokens,
"seed": args.seed,
}
if ref_audio_data_url is not None:
payload["ref_audio"] = ref_audio_data_url
payload["ref_text"] = args.ref_text
resp = client.post(url, json=payload)
if resp.status_code != 200:
print(f"[FAIL] {prompt!r} -> {resp.status_code}: {resp.text[:200]}", file=sys.stderr)
failures += 1
continue
suffix = ".wav" if args.format == "wav" else ".pcm"
out = args.output_dir / f"{_slug(prompt)}{suffix}"
out.write_bytes(resp.content)
print(f"[ ok ] {prompt!r} -> {out} ({len(resp.content)} bytes)")
return 1 if failures else 0
if __name__ == "__main__":
sys.exit(main())
higgs_audio_v2/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for higgs-audio v2.
#
# Scope: text -> 24 kHz speech, optionally with a shallow voice clone
# (ref_audio + ref_text). Multi-speaker tags, ChatML rich content, and language
# overrides are rejected by the validator with explicit 4xx
# (see vllm_omni/entrypoints/openai/serving_speech.py).
#
# Usage:
# ./run_server.sh # default port 8094, GPUs 6 and 7
# PORT=8095 GPUS=6,7 ./run_server.sh
# MODEL=bosonai/higgs-audio-v2-generation-3B-base ./run_server.sh
set -e
MODEL="${MODEL:-bosonai/higgs-audio-v2-generation-3B-base}"
PORT="${PORT:-8094}"
GPUS="${GPUS:-6,7}"
GPU_UTIL="${GPU_UTIL:-0.4}"
echo "Starting higgs-audio v2 server"
echo " MODEL=$MODEL"
echo " PORT=$PORT"
echo " CUDA_VISIBLE_DEVICES=$GPUS"
# DeepGEMM FP8 kernels are optional and trip warmup on builds without
# the deep_gemm backend; disable them so the example works out of the box.
# Users with deep_gemm installed can re-enable via the same env vars.
CUDA_VISIBLE_DEVICES="$GPUS" \
VLLM_USE_DEEP_GEMM=0 \
VLLM_MOE_USE_DEEP_GEMM=0 \
vllm-omni serve "$MODEL" \
--deploy-config vllm_omni/deploy/higgs_audio_v2.yaml \
--host 0.0.0.0 \
--port "$PORT" \
--gpu-memory-utilization "$GPU_UTIL" \
--trust-remote-code \
--omni
ming_flash_omni_tts/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for Ming-flash-omni-2.0 standalone talker (TTS).
#
# Usage:
# ./run_server.sh
# MODEL=/path/to/local/model ./run_server.sh
# PORT=8091 ./run_server.sh
# HOST=127.0.0.1 ./run_server.sh # bind only to loopback
set -e
MODEL="${MODEL:-Jonathan1909/Ming-flash-omni-2.0}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-8091}"
DEPLOY_CONFIG="${DEPLOY_CONFIG:-vllm_omni/deploy/ming_flash_omni_tts.yaml}"
echo "Starting Ming standalone TTS server with model: $MODEL"
echo "Deploy config: $DEPLOY_CONFIG"
vllm serve "$MODEL" \
--deploy-config "$DEPLOY_CONFIG" \
--host "$HOST" \
--port "$PORT" \
--trust-remote-code \
--omni
ming_flash_omni_tts/speech_client.py
"""Client for Ming standalone TTS via /v1/audio/speech endpoint."""
import argparse
import json
import sys
import httpx
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
DEFAULT_MODEL = "Jonathan1909/Ming-flash-omni-2.0"
def run_tts(args) -> None:
payload = {
"model": args.model,
"input": args.text,
"response_format": args.response_format,
}
instructions = args.instructions
if args.instruction_json:
if instructions:
sys.exit("--instructions and --instruction-json are mutually exclusive")
try:
parsed = json.loads(args.instruction_json)
except json.JSONDecodeError as exc:
sys.exit(f"--instruction-json must be valid JSON: {exc}")
if not isinstance(parsed, dict):
sys.exit("--instruction-json must decode to a JSON object")
# Re-encode with ensure_ascii=False so UTF-8 Chinese keys/values
# arrive at the server intact rather than as \\uXXXX escapes.
instructions = json.dumps(parsed, ensure_ascii=False)
if instructions:
payload["instructions"] = instructions
print(f"Model: {args.model}")
print(f"Text: {args.text}")
print("Generating audio...")
api_url = f"{args.api_base}/v1/audio/speech"
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {args.api_key}",
}
with httpx.Client(timeout=300.0) as client:
response = client.post(api_url, json=payload, headers=headers)
if response.status_code != 200:
print(f"Error: {response.status_code}")
print(response.text)
return
output_path = args.output or "ming_tts_output.wav"
with open(output_path, "wb") as f:
f.write(response.content)
print(f"Audio saved to: {output_path}")
def main():
parser = argparse.ArgumentParser(description="Ming standalone TTS speech client")
parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
parser.add_argument("--model", "-m", default=DEFAULT_MODEL, help="Model name or local path")
parser.add_argument("--text", required=True, help="Text to synthesize")
parser.add_argument(
"--response-format",
default="wav",
choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
help="Audio format (default: wav)",
)
parser.add_argument("--output", "-o", default=None, help="Output file path")
parser.add_argument(
"--instructions",
default=None,
help="Free-form style description (mapped to caption 风格 on the server).",
)
parser.add_argument(
"--instruction-json",
default=None,
help=(
"Structured caption JSON forwarded as `instructions`. Accepts Ming "
"caption keys: 方言, 风格, 语速, 基频, 音量, 情感, IP, 说话人, BGM. "
),
)
args = parser.parse_args()
run_tts(args)
if __name__ == "__main__":
main()
moss_tts_nano/gradio_demo.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/moss_tts_nano/gradio_demo.py.
moss_tts_nano/run_gradio_demo.sh
#!/bin/bash
# Launch MOSS-TTS-Nano server + Gradio demo together.
#
# Usage:
# ./run_gradio_demo.sh
# CUDA_VISIBLE_DEVICES=0 PORT=8091 GRADIO_PORT=7860 ./run_gradio_demo.sh
set -e
MODEL="${MODEL:-OpenMOSS-Team/MOSS-TTS-Nano}"
PORT="${PORT:-8091}"
GRADIO_PORT="${GRADIO_PORT:-7860}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
echo "Starting MOSS-TTS-Nano server (port $PORT)..."
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
--host 0.0.0.0 \
--port "$PORT" \
--omni &
SERVER_PID=$!
cleanup() {
echo "Stopping server (PID $SERVER_PID)..."
kill $SERVER_PID 2>/dev/null
wait $SERVER_PID 2>/dev/null
}
trap cleanup EXIT
# Wait for server to be ready.
echo "Waiting for server to start..."
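for i in $(seq 1 120); do
    if curl -s "http://localhost:$PORT/health" > /dev/null 2>&1; then
        echo "Server ready."
        break
    fi
    # Abort instead of starting the demo against a server that never came up.
    if [ "$i" -eq 120 ]; then
        echo "Server did not become ready after 120 attempts; aborting." >&2
        exit 1
    fi
    sleep 2
done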
echo "Starting Gradio demo (port $GRADIO_PORT)..."
python "$SCRIPT_DIR/gradio_demo.py" \
--api-base "http://localhost:$PORT" \
--port "$GRADIO_PORT"
moss_tts_nano/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for MOSS-TTS-Nano
#
# Usage:
# ./run_server.sh
# CUDA_VISIBLE_DEVICES=0 PORT=8091 ./run_server.sh
set -e
MODEL="${MODEL:-OpenMOSS-Team/MOSS-TTS-Nano}"
PORT="${PORT:-8091}"
echo "Starting MOSS-TTS-Nano server with model: $MODEL"
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
--host 0.0.0.0 \
--port "$PORT" \
--omni
omnivoice/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for OmniVoice TTS
#
# Usage:
# ./run_server.sh
# CUDA_VISIBLE_DEVICES=0 ./run_server.sh
set -e
MODEL="${MODEL:-k2-fsa/OmniVoice}"
PORT="${PORT:-8091}"
echo "Starting OmniVoice server with model: $MODEL"
vllm serve "$MODEL" \
--host 0.0.0.0 \
--port "$PORT" \
--trust-remote-code \
--omni
omnivoice/speech_client.py
"""Client for OmniVoice TTS via /v1/audio/speech endpoint.
Examples:
# Basic TTS (auto voice)
python speech_client.py --text "Hello, how are you?"
# Specify language
python speech_client.py --text "Bonjour, comment allez-vous?" --language French
# Use a specific uploaded/supported voice
python speech_client.py --text "Hello" --voice my_uploaded_voice
"""
import argparse
import base64
import os
import httpx
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    # Pick the MIME type from the file extension, falling back to WAV.
    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime = {
        "wav": "audio/wav",
        "mp3": "audio/mpeg",
        "flac": "audio/flac",
        "ogg": "audio/ogg",
    }.get(ext, "audio/wav")
    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"
def run_tts(args) -> None:
    """Generate speech via /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }
    if args.seed is not None:
        payload["extra_params"] = {"seed": args.seed}
    if args.voice:
        payload["voice"] = args.voice
    if args.language:
        payload["language"] = args.language
    if args.ref_audio:
        ref = args.ref_audio
        # URLs and data: URIs are forwarded as-is; local paths are inlined.
        if ref.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = ref
        else:
            payload["ref_audio"] = encode_audio_to_base64(ref)
    if args.ref_text:
        payload["ref_text"] = args.ref_text
    if args.instructions:
        payload["instructions"] = args.instructions

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    # Compare against None so that an explicit seed of 0 is still reported.
    if args.seed is not None:
        print(f"Seed: {args.seed}")
    if args.voice:
        print(f"Voice: {args.voice}")
    if args.language:
        print(f"Language: {args.language}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }
    with httpx.Client(timeout=300.0) as client:
        response = client.post(api_url, json=payload, headers=headers)
    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return

    # A 200 response can still carry a JSON error body instead of audio.
    try:
        text = response.content.decode("utf-8")
        if text.startswith('{"error"'):
            print(f"Error: {text}")
            return
    except UnicodeDecodeError:
        pass  # Binary audio data, not an error

    output_path = args.output or "omnivoice_output.wav"
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Audio saved to: {output_path}")
def main():
parser = argparse.ArgumentParser(description="OmniVoice TTS client")
parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
parser.add_argument("--model", "-m", default="k2-fsa/OmniVoice", help="Model name")
parser.add_argument("--text", required=True, help="Text to synthesize")
parser.add_argument(
"--voice",
default=None,
help="Voice name (omit for auto voice; must match a supported or uploaded speaker if set)",
)
parser.add_argument("--language", default=None, help="Language hint (e.g., English, Chinese, French)")
parser.add_argument(
"--response-format",
default="wav",
choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
help="Audio format (default: wav)",
)
parser.add_argument(
"--ref-audio",
type=str,
default=None,
help="Reference audio for voice cloning (local path, URL, or data: URI)",
)
parser.add_argument(
"--ref-text",
type=str,
default=None,
help="Reference text for voice cloning",
)
parser.add_argument(
"--instructions",
type=str,
default=None,
help="Voice style/emotion instructions",
)
parser.add_argument(
"--seed",
type=int,
default=None,
help="Random seed for generation (default: None for stochastic output)",
)
parser.add_argument("--output", "-o", default=None, help="Output file path")
args = parser.parse_args()
run_tts(args)
if __name__ == "__main__":
main()
qwen3_tts/batch_speech_client.py
"""Batch speech client for Qwen3-TTS via /v1/audio/speech/batch endpoint.
This script demonstrates how to synthesize multiple texts in a single request.
A particularly useful scenario is voice cloning: set ref_audio once at the
batch level and generate many utterances in the cloned voice without repeating
the reference for each item.
Start the server (with batch-optimized stage settings for best throughput):
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
--omni \
--trust-remote-code \
--stage-overrides '{"0":{"max_num_seqs":4,"gpu_memory_utilization":0.2},
"1":{"max_num_seqs":4,"gpu_memory_utilization":0.2}}'
Examples:
# Batch with a predefined voice
python batch_speech_client.py \
--texts "Hello, how are you?" "Goodbye, see you later!"
# Voice cloning: one ref_audio, many outputs
python batch_speech_client.py \
--task-type Base \
--ref-audio /path/to/reference.wav \
--ref-text "Transcript of the reference audio" \
--texts "First cloned sentence." "Second cloned sentence." \
"Third cloned sentence."
"""
import argparse
import base64
import os
import httpx
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
def encode_audio_to_base64(audio_path: str) -> str:
"""Encode a local audio file to a base64 data URL."""
if not os.path.exists(audio_path):
raise FileNotFoundError(f"Audio file not found: {audio_path}")
ext = os.path.splitext(audio_path)[1].lower()
mime_map = {".wav": "audio/wav", ".mp3": "audio/mpeg", ".flac": "audio/flac", ".ogg": "audio/ogg"}
mime_type = mime_map.get(ext, "audio/wav")
with open(audio_path, "rb") as f:
audio_b64 = base64.b64encode(f.read()).decode("utf-8")
return f"data:{mime_type};base64,{audio_b64}"
def run_batch(args) -> None:
"""Send a batch TTS request and save each result to a file."""
items = [{"input": text} for text in args.texts]
payload: dict = {
"items": items,
"response_format": args.response_format,
}
if args.voice:
payload["voice"] = args.voice
if args.language:
payload["language"] = args.language
if args.task_type:
payload["task_type"] = args.task_type
if args.instructions:
payload["instructions"] = args.instructions
if args.max_new_tokens:
payload["max_new_tokens"] = args.max_new_tokens
# Voice cloning parameters (shared across all items)
if args.ref_audio:
if args.ref_audio.startswith(("http://", "https://")):
payload["ref_audio"] = args.ref_audio
else:
payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
if args.ref_text:
payload["ref_text"] = args.ref_text
print(f"Sending batch of {len(items)} item(s) to {args.api_base}")
if args.ref_audio:
print("Voice cloning mode — ref_audio applied to all items")
url = f"{args.api_base}/v1/audio/speech/batch"
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {args.api_key}",
}
with httpx.Client(timeout=300.0) as client:
response = client.post(url, json=payload, headers=headers)
if response.status_code != 200:
print(f"Error {response.status_code}: {response.text}")
return
data = response.json()
print(f"Total: {data['total']} Succeeded: {data['succeeded']} Failed: {data['failed']}")
os.makedirs(args.output_dir, exist_ok=True)
for result in data["results"]:
idx = result["index"]
if result["status"] == "success":
audio_bytes = base64.b64decode(result["audio_data"])
out_path = os.path.join(args.output_dir, f"batch_{idx}.{args.response_format}")
with open(out_path, "wb") as f:
f.write(audio_bytes)
print(f" [{idx}] saved {len(audio_bytes)} bytes -> {out_path}")
else:
print(f" [{idx}] FAILED: {result['error']}")
def parse_args():
parser = argparse.ArgumentParser(
description="Batch speech client for /v1/audio/speech/batch",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=__doc__,
)
parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
# Texts to synthesize
parser.add_argument(
"--texts",
nargs="+",
required=True,
help="One or more texts to synthesize",
)
# Shared voice settings
parser.add_argument("--voice", default="vivian", help="Speaker name (default: vivian)")
parser.add_argument("--language", default=None, help="Language: Auto, Chinese, English, etc.")
parser.add_argument("--instructions", default=None, help="Voice style/emotion instructions")
parser.add_argument(
"--task-type",
default=None,
choices=["CustomVoice", "VoiceDesign", "Base"],
help="TTS task type (default: CustomVoice)",
)
# Voice cloning (Base task)
parser.add_argument("--ref-audio", default=None, help="Reference audio path or URL for voice cloning")
parser.add_argument("--ref-text", default=None, help="Reference audio transcript for voice cloning")
# Generation
parser.add_argument("--max-new-tokens", type=int, default=None, help="Max new tokens per item")
parser.add_argument(
"--response-format",
default="wav",
choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
help="Audio format (default: wav)",
)
parser.add_argument("--output-dir", "-o", default="batch_output", help="Output directory (default: batch_output)")
return parser.parse_args()
if __name__ == "__main__":
args = parse_args()
run_batch(args)
qwen3_tts/gradio_demo.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/gradio_demo.py.
qwen3_tts/openai_speech_client.py
"""OpenAI-compatible client for Qwen3-TTS via /v1/audio/speech endpoint.
This script demonstrates how to use the OpenAI-compatible speech API
to generate audio from text using Qwen3-TTS models.
Examples:
# CustomVoice task (predefined speaker)
python openai_speech_client.py --text "Hello, how are you?" --voice vivian
# CustomVoice with emotion instruction
python openai_speech_client.py --text "I'm so happy!" --voice vivian \
--instructions "Speak with excitement"
# VoiceDesign task (voice from description)
python openai_speech_client.py --text "Hello world" \
--task-type VoiceDesign \
--instructions "A warm, friendly female voice"
# Base task (voice cloning)
python openai_speech_client.py --text "Hello world" \
--task-type Base \
--ref-audio "https://example.com/reference.wav" \
--ref-text "This is the reference transcript"
# Base task with pre-computed speaker embedding
python openai_speech_client.py --text "Hello world" \
--task-type Base \
--speaker-embedding embedding.json
"""
import argparse
import base64
import json
import os
import httpx
# Default server configuration
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
def encode_audio_to_base64(audio_path: str) -> str:
"""Encode a local audio file to base64 data URL."""
if not os.path.exists(audio_path):
raise FileNotFoundError(f"Audio file not found: {audio_path}")
# Detect MIME type from extension
audio_path_lower = audio_path.lower()
if audio_path_lower.endswith(".wav"):
mime_type = "audio/wav"
elif audio_path_lower.endswith((".mp3", ".mpeg")):
mime_type = "audio/mpeg"
elif audio_path_lower.endswith(".flac"):
mime_type = "audio/flac"
elif audio_path_lower.endswith(".ogg"):
mime_type = "audio/ogg"
else:
mime_type = "audio/wav" # Default
with open(audio_path, "rb") as f:
audio_bytes = f.read()
audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
return f"data:{mime_type};base64,{audio_b64}"
def run_tts_generation(args) -> None:
"""Run TTS generation via OpenAI-compatible /v1/audio/speech API."""
# Build request payload
payload = {
"model": args.model,
"input": args.text,
"voice": args.speaker,
"response_format": args.response_format,
}
# Add optional parameters
if args.instructions:
payload["instructions"] = args.instructions
if args.task_type:
payload["task_type"] = args.task_type
if args.language:
payload["language"] = args.language
if args.max_new_tokens:
payload["max_new_tokens"] = args.max_new_tokens
# Voice clone parameters (Base task)
if args.ref_audio:
if args.ref_audio.startswith(("http://", "https://")):
payload["ref_audio"] = args.ref_audio
elif args.ref_audio.startswith("data:"):
payload["ref_audio"] = args.ref_audio
else:
payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
if args.ref_text:
payload["ref_text"] = args.ref_text
if args.x_vector_only:
payload["x_vector_only_mode"] = True
if args.speaker_embedding:
with open(args.speaker_embedding) as f:
payload["speaker_embedding"] = json.load(f)
print(f"Model: {args.model}")
print(f"Task type: {args.task_type or 'CustomVoice'}")
print(f"Text: {args.text}")
print(f"Speaker: {args.speaker}")
print("Generating audio...")
# Make the API call
api_url = f"{args.api_base}/v1/audio/speech"
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {args.api_key}",
}
with httpx.Client(timeout=300.0) as client:
response = client.post(api_url, json=payload, headers=headers)
if response.status_code != 200:
print(f"Error: {response.status_code}")
print(response.text)
return
# Check for JSON error response (only if content is valid UTF-8 text)
try:
text = response.content.decode("utf-8")
if text.startswith('{"error"'):
print(f"Error: {text}")
return
except UnicodeDecodeError:
pass # Binary audio data, not an error
# Save audio response
output_path = args.output or "tts_output.wav"
with open(output_path, "wb") as f:
f.write(response.content)
print(f"Audio saved to: {output_path}")
def parse_args():
"""Parse command line arguments."""
parser = argparse.ArgumentParser(
description="OpenAI-compatible client for Qwen3-TTS via /v1/audio/speech",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=__doc__,
)
# Server configuration
parser.add_argument(
"--api-base",
type=str,
default=DEFAULT_API_BASE,
help=f"API base URL (default: {DEFAULT_API_BASE})",
)
parser.add_argument(
"--api-key",
type=str,
default=DEFAULT_API_KEY,
help="API key (default: EMPTY)",
)
parser.add_argument(
"--model",
"-m",
type=str,
default="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
help="Model name/path",
)
# Task configuration
parser.add_argument(
"--task-type",
"-t",
type=str,
default=None,
choices=["CustomVoice", "VoiceDesign", "Base"],
help="TTS task type (default: CustomVoice)",
)
# Input text
parser.add_argument(
"--text",
type=str,
required=True,
help="Text to synthesize",
)
# Voice/speaker
parser.add_argument(
"--speaker",
"--voice",
type=str,
default="vivian",
help="Speaker name, also accepted as --voice (default: vivian). Options: vivian, ryan, aiden, etc.",
)
parser.add_argument(
"--language",
type=str,
default=None,
help="Language: Auto, Chinese, English, etc.",
)
parser.add_argument(
"--instructions",
type=str,
default=None,
help="Voice style/emotion instructions",
)
# Base (voice clone) parameters
parser.add_argument(
"--ref-audio",
type=str,
default=None,
help="Reference audio file path, URL, or base64 for voice cloning (Base task)",
)
parser.add_argument(
"--ref-text",
type=str,
default=None,
help="Reference audio transcript for voice cloning (Base task)",
)
parser.add_argument(
"--x-vector-only",
action="store_true",
help="Use x-vector only mode for voice cloning (no ICL)",
)
parser.add_argument(
"--speaker-embedding",
type=str,
default=None,
help="Path to JSON file containing a pre-computed speaker embedding vector (1024-dim for 0.6B, 2048-dim for 1.7B)",
)
# Generation parameters
parser.add_argument(
"--max-new-tokens",
type=int,
default=None,
help="Maximum new tokens to generate",
)
# Output
parser.add_argument(
"--response-format",
type=str,
default="wav",
choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
help="Audio output format (default: wav)",
)
parser.add_argument(
"--output",
"-o",
type=str,
default=None,
help="Output audio file path (default: tts_output.wav)",
)
return parser.parse_args()
if __name__ == "__main__":
args = parse_args()
run_tts_generation(args)
qwen3_tts/precompute_custom_voice.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/precompute_custom_voice.py.
qwen3_tts/run_gradio_demo.sh
#!/bin/bash
# Launch both vLLM server and Gradio demo for Qwen3-TTS
#
# Usage:
#   ./run_gradio_demo.sh                                       # Default: CustomVoice
#   ./run_gradio_demo.sh --task-type VoiceDesign               # VoiceDesign model
#   ./run_gradio_demo.sh --task-type Base --gradio-port 7861
#
# Options:
#   --task-type TYPE      Task type: CustomVoice, VoiceDesign, Base (default: CustomVoice)
#   --server-port PORT    Port for vLLM server (default: 8000)
#   --gradio-port PORT    Port for Gradio demo (default: 7860)
#   --server-host HOST    Host for vLLM server (default: 0.0.0.0)
#   --gradio-ip IP        IP for Gradio demo (default: 127.0.0.1)
#   --share               Share Gradio demo publicly

set -e

# Default values
TASK_TYPE="CustomVoice"
SERVER_PORT=8000
GRADIO_PORT=7860
SERVER_HOST="0.0.0.0"
GRADIO_IP="127.0.0.1"
GRADIO_SHARE=false

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --task-type)
            TASK_TYPE="$2"
            shift 2
            ;;
        --server-port)
            SERVER_PORT="$2"
            shift 2
            ;;
        --gradio-port)
            GRADIO_PORT="$2"
            shift 2
            ;;
        --server-host)
            SERVER_HOST="$2"
            shift 2
            ;;
        --gradio-ip)
            GRADIO_IP="$2"
            shift 2
            ;;
        --share)
            GRADIO_SHARE=true
            shift
            ;;
        --help)
            echo "Usage: $0 [OPTIONS]"
            echo ""
            echo "Options:"
            echo "  --task-type TYPE      Task type: CustomVoice, VoiceDesign, Base (default: CustomVoice)"
            echo "  --server-port PORT    Port for vLLM server (default: 8000)"
            echo "  --gradio-port PORT    Port for Gradio demo (default: 7860)"
            echo "  --server-host HOST    Host for vLLM server (default: 0.0.0.0)"
            echo "  --gradio-ip IP        IP for Gradio demo (default: 127.0.0.1)"
            echo "  --share               Share Gradio demo publicly"
            echo ""
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            echo "Use --help for usage information"
            exit 1
            ;;
    esac
done

# Map task type to model
case "$TASK_TYPE" in
    CustomVoice)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
        ;;
    VoiceDesign)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
        ;;
    Base)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-Base"
        ;;
    *)
        echo "Unknown task type: $TASK_TYPE"
        echo "Supported: CustomVoice, VoiceDesign, Base"
        exit 1
        ;;
esac

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
API_BASE="http://localhost:${SERVER_PORT}"

echo "=========================================="
echo "Qwen3-TTS Gradio Demo"
echo "=========================================="
echo "Task Type : $TASK_TYPE"
echo "Model : $MODEL"
echo "Server : http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio : http://${GRADIO_IP}:${GRADIO_PORT}"
echo "=========================================="

# Cleanup on exit
cleanup() {
    echo ""
    echo "Shutting down..."
    if [ -n "$SERVER_PID" ]; then
        echo "Stopping vLLM server (PID: $SERVER_PID)..."
        kill "$SERVER_PID" 2>/dev/null || true
        wait "$SERVER_PID" 2>/dev/null || true
    fi
    if [ -n "$GRADIO_PID" ]; then
        echo "Stopping Gradio demo (PID: $GRADIO_PID)..."
        kill "$GRADIO_PID" 2>/dev/null || true
        wait "$GRADIO_PID" 2>/dev/null || true
    fi
    echo "Cleanup complete"
    exit 0
}
trap cleanup SIGINT SIGTERM
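# Start vLLM server
echo ""
echo "Starting vLLM server..."
LOG_FILE="/tmp/vllm_tts_server_${SERVER_PORT}.log"
# Log through process substitution instead of a pipe so that $! is the
# server's PID. With "cmd | tee file &", $! would be the PID of tee, and
# cleanup would stop tee rather than the server.
vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/qwen3_tts.yaml \
    --host "$SERVER_HOST" \
    --port "$SERVER_PORT" \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --omni > >(tee "$LOG_FILE") 2>&1 &
SERVER_PID=$!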
# Wait for server startup
echo ""
echo "Waiting for vLLM server to be ready..."
STARTUP_FLAG="/tmp/vllm_tts_startup_flag_${SERVER_PORT}.tmp"
rm -f "$STARTUP_FLAG"
(
    tail -f "$LOG_FILE" 2>/dev/null | grep -m 1 "Application startup complete" > /dev/null && touch "$STARTUP_FLAG"
) &
TAIL_PID=$!

MAX_WAIT=300
ELAPSED=0
while [ $ELAPSED -lt $MAX_WAIT ]; do
    if [ -f "$STARTUP_FLAG" ]; then
        kill "$TAIL_PID" 2>/dev/null || true
        wait "$TAIL_PID" 2>/dev/null || true
        echo ""
        echo "vLLM server is ready!"
        break
    fi
    if ! kill -0 "$SERVER_PID" 2>/dev/null; then
        kill "$TAIL_PID" 2>/dev/null || true
        echo ""
        echo "Error: vLLM server failed to start"
        exit 1
    fi
    sleep 1
    ELAPSED=$((ELAPSED + 1))
done
rm -f "$STARTUP_FLAG"

if [ $ELAPSED -ge $MAX_WAIT ]; then
    kill "$TAIL_PID" 2>/dev/null || true
    echo "Error: Server startup timed out after ${MAX_WAIT}s"
    kill "$SERVER_PID" 2>/dev/null || true
    exit 1
fi

# Start Gradio demo
echo ""
echo "Starting Gradio demo..."
cd "$SCRIPT_DIR"
GRADIO_CMD=("python" "gradio_demo.py" "--api-base" "$API_BASE" "--host" "$GRADIO_IP" "--port" "$GRADIO_PORT")
if [ "$GRADIO_SHARE" = true ]; then
    GRADIO_CMD+=("--share")
fi
"${GRADIO_CMD[@]}" &
GRADIO_PID=$!

echo ""
echo "=========================================="
echo "Both services are running!"
echo "=========================================="
echo "vLLM Server : http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio Demo : http://${GRADIO_IP}:${GRADIO_PORT}"
echo ""
echo "Press Ctrl+C to stop both services"
echo "=========================================="
echo ""

wait $SERVER_PID $GRADIO_PID || true
cleanup
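The script above decides the server is ready by watching its log for "Application startup complete", with a 300-second limit. As a rough alternative for clients that cannot read the log, the sketch below polls the TCP port instead. The helper name `wait_for_port` is hypothetical and not part of the example scripts. An open port only shows that something is listening, so it is a weaker signal than the log marker.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 300.0, interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper: it only checks that the port accepts connections,
    not that the model has finished loading.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For the defaults used by `run_gradio_demo.sh`, a call would look like `wait_for_port("localhost", 8000)`.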
qwen3_tts/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for Qwen3-TTS models
#
# Usage:
#   ./run_server.sh                # Default: CustomVoice model
#   ./run_server.sh CustomVoice    # CustomVoice model
#   ./run_server.sh VoiceDesign    # VoiceDesign model
#   ./run_server.sh Base           # Base (voice clone) model

set -e

TASK_TYPE="${1:-CustomVoice}"

case "$TASK_TYPE" in
    CustomVoice)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
        ;;
    VoiceDesign)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
        ;;
    Base)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-Base"
        ;;
    *)
        echo "Unknown task type: $TASK_TYPE"
        echo "Supported: CustomVoice, VoiceDesign, Base"
        exit 1
        ;;
esac

echo "Starting Qwen3-TTS server with model: $MODEL"

vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/qwen3_tts.yaml \
    --host 0.0.0.0 \
    --port 8091 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --omni
qwen3_tts/speaker_embedding_interpolation.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/speaker_embedding_interpolation.py.
qwen3_tts/streaming_speech_client.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/streaming_speech_client.py.
qwen3_tts/tts_common.py
"""Shared constants, helpers, and payload building for Qwen3-TTS Gradio demos."""
import base64
import io

try:
    import gradio as gr
except ImportError:
    raise ImportError("gradio is required to run this demo. Install it with: pip install 'vllm-omni[demo]'") from None

import httpx
import numpy as np
import soundfile as sf

SUPPORTED_LANGUAGES = [
    "Auto",
    "Chinese",
    "English",
    "Japanese",
    "Korean",
    "German",
    "French",
    "Russian",
    "Portuguese",
    "Spanish",
    "Italian",
]
TASK_TYPES = ["CustomVoice", "VoiceDesign", "Base"]
PCM_SAMPLE_RATE = 24000
DEFAULT_API_BASE = "http://localhost:8000"


def fetch_voices(api_base: str) -> list[str]:
    """Fetch available voices from the server."""
    try:
        with httpx.Client(timeout=10.0) as client:
            resp = client.get(
                f"{api_base}/v1/audio/voices",
                headers={"Authorization": "Bearer EMPTY"},
            )
            if resp.status_code == 200:
                data = resp.json()
                voices = data.get("voices") or []
                if voices:
                    return voices
    except Exception:
        pass
    return ["Vivian", "Ryan"]


def encode_audio_to_base64(audio_data: tuple) -> str:
    """Encode Gradio audio input (sample_rate, numpy_array) to base64 data URL."""
    sample_rate, audio_np = audio_data

    if audio_np.dtype != np.int16:
        if audio_np.dtype in (np.float32, np.float64):
            audio_np = np.clip(audio_np, -1.0, 1.0)
            audio_np = (audio_np * 32767).astype(np.int16)
        else:
            audio_np = audio_np.astype(np.int16)

    buf = io.BytesIO()
    sf.write(buf, audio_np, sample_rate, format="WAV")
    wav_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:audio/wav;base64,{wav_b64}"


def build_payload(
    text: str,
    task_type: str,
    voice: str,
    language: str,
    instructions: str,
    ref_audio: tuple | None,
    ref_audio_url: str,
    ref_text: str,
    x_vector_only: bool,
    response_format: str = "pcm",
    speed: float = 1.0,
    stream: bool = True,
) -> dict:
    """Build the /v1/audio/speech request payload.

    Raises gr.Error for invalid input so callers don't need to validate.
    """
    if not text or not text.strip():
        raise gr.Error("Please enter text to synthesize.")

    payload: dict = {
        "input": text.strip(),
        "response_format": "pcm" if stream else response_format,
        "stream": stream,
    }
    if not stream:
        payload["speed"] = speed

    if task_type:
        payload["task_type"] = task_type
    if language:
        payload["language"] = language

    if task_type == "CustomVoice":
        if voice:
            payload["voice"] = voice
        if instructions and instructions.strip():
            payload["instructions"] = instructions.strip()
    elif task_type == "VoiceDesign":
        if not instructions or not instructions.strip():
            raise gr.Error("VoiceDesign task requires voice style instructions.")
        payload["instructions"] = instructions.strip()
    elif task_type == "Base":
        ref_audio_url_stripped = ref_audio_url.strip() if ref_audio_url else ""
        if ref_audio_url_stripped:
            payload["ref_audio"] = ref_audio_url_stripped
        elif ref_audio is not None:
            payload["ref_audio"] = encode_audio_to_base64(ref_audio)
        else:
            raise gr.Error("Base (voice clone) task requires reference audio. Upload a file or provide a URL.")
        if ref_text and ref_text.strip():
            payload["ref_text"] = ref_text.strip()
        if x_vector_only:
            payload["x_vector_only_mode"] = True

    return payload


def on_task_type_change(task_type: str):
    """Update UI visibility based on selected task type."""
    if task_type == "CustomVoice":
        return (
            gr.update(visible=True),  # voice dropdown
            gr.update(visible=True, info="Optional style/emotion instructions"),
            gr.update(visible=False),  # ref_audio
            gr.update(visible=False),  # ref_audio_url
            gr.update(visible=False),  # ref_text
            gr.update(visible=False),  # x_vector_only
        )
    elif task_type == "VoiceDesign":
        return (
            gr.update(visible=False),
            gr.update(visible=True, info="Required: describe the voice style"),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
        )
    elif task_type == "Base":
        return (
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=True),
            gr.update(visible=True),
            gr.update(visible=True),
            gr.update(visible=True),
        )
    return (
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=False),
        gr.update(visible=False),
        gr.update(visible=False),
        gr.update(visible=False),
    )


def stream_pcm_chunks(api_base: str, payload: dict):
    """Stream raw PCM bytes from the server, yielding int16 numpy arrays.

    Handles odd-byte boundaries between network chunks.
    """
    leftover = b""
    with httpx.Client(timeout=300.0) as client:
        with client.stream(
            "POST",
            f"{api_base}/v1/audio/speech",
            json=payload,
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer EMPTY",
            },
        ) as resp:
            if resp.status_code != 200:
                resp.read()
                raise gr.Error(f"Server error ({resp.status_code}): {resp.text}")
            for chunk in resp.iter_bytes():
                if not chunk:
                    continue
                raw = leftover + chunk
                usable = len(raw) - (len(raw) % 2)
                leftover = raw[usable:]
                if usable == 0:
                    continue
                yield np.frombuffer(raw[:usable], dtype=np.int16).copy()


def add_common_args(parser):
    """Add CLI arguments shared by both demos."""
    parser.add_argument(
        "--api-base",
        default=DEFAULT_API_BASE,
        help=f"Base URL for the vLLM API server (default: {DEFAULT_API_BASE}).",
    )
    parser.add_argument(
        "--host",
        default="0.0.0.0",
        help="Host/IP for Gradio server (default: 0.0.0.0).",
    )
    parser.add_argument(
        "--port",
        type=int,
        default=7860,
        help="Port for Gradio server (default: 7860).",
    )
    parser.add_argument(
        "--share",
        action="store_true",
        help="Share the Gradio demo publicly.",
    )
    return parser
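`stream_pcm_chunks` above yields raw int16 samples, so a streamed response has no container header and most players will not open it directly. The sketch below shows one way to apply the same odd-byte handling and wrap the result in a WAV container using only the standard library. The helper names are hypothetical, the 24000 Hz rate mirrors `PCM_SAMPLE_RATE`, and mono output is an assumption rather than something the listing states.

```python
import io
import wave
from collections.abc import Iterable, Iterator

SAMPLE_RATE = 24000  # mirrors PCM_SAMPLE_RATE in tts_common.py
SAMPLE_WIDTH = 2     # bytes per int16 sample


def align_pcm_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Re-chunk a byte stream so every yielded piece holds whole int16 samples."""
    leftover = b""
    for chunk in chunks:
        raw = leftover + chunk
        usable = len(raw) - (len(raw) % SAMPLE_WIDTH)
        leftover = raw[usable:]
        if usable:
            yield raw[:usable]
    # A trailing odd byte cannot form a sample and is dropped.


def pcm_to_wav_bytes(chunks: Iterable[bytes], channels: int = 1) -> bytes:
    """Wrap raw PCM chunks in a WAV container (mono by assumption)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        for piece in align_pcm_chunks(chunks):
            wav.writeframes(piece)
    return buf.getvalue()
```

Any iterable of byte chunks works as input, for example the result of `resp.iter_bytes()` on a streaming response.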
voxcpm2/gradio_demo.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/voxcpm2/gradio_demo.py.
voxcpm2/openai_speech_client.py
"""OpenAI-compatible client for VoxCPM2 TTS via /v1/audio/speech endpoint.
Examples:
    # Zero-shot synthesis
    python openai_speech_client.py --text "Hello, this is VoxCPM2."

    # Voice cloning with a local reference audio file
    python openai_speech_client.py --text "Hello world" \
        --ref-audio /path/to/reference.wav

    # Voice cloning with a URL
    python openai_speech_client.py --text "Hello world" \
        --ref-audio "https://example.com/reference.wav"

Server setup:
    vllm serve openbmb/VoxCPM2 --omni --host 0.0.0.0 --port 8000
"""

from __future__ import annotations

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8000"
DEFAULT_API_KEY = "sk-empty"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime = {
        "wav": "audio/wav",
        "mp3": "audio/mpeg",
        "flac": "audio/flac",
        "ogg": "audio/ogg",
    }.get(ext, "audio/wav")
    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def main() -> None:
    parser = argparse.ArgumentParser(description="VoxCPM2 OpenAI speech client")
    parser.add_argument("--text", type=str, required=True, help="Text to synthesize")
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio for voice cloning (local path, URL, or data: URI)",
    )
    parser.add_argument("--model", type=str, default="voxcpm2")
    parser.add_argument("--output", type=str, default="output.wav")
    parser.add_argument("--api-base", type=str, default=DEFAULT_API_BASE)
    parser.add_argument("--api-key", type=str, default=DEFAULT_API_KEY)
    parser.add_argument("--response-format", type=str, default="wav")
    args = parser.parse_args()

    # VoxCPM2 has no predefined voices. The "voice" field is required by
    # the OpenAI API schema but ignored by VoxCPM2; use any placeholder.
    # For voice cloning, pass --ref-audio instead.
    payload: dict = {
        "model": args.model,
        "input": args.text,
        "voice": "default",
        "response_format": args.response_format,
    }

    if args.ref_audio:
        ref = args.ref_audio
        if ref.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = ref
        else:
            payload["ref_audio"] = encode_audio_to_base64(ref)

    url = f"{args.api_base}/v1/audio/speech"
    print(f"POST {url}")
    print(f"  text: {args.text}")
    if args.ref_audio:
        print(f"  ref_audio: {args.ref_audio[:80]}...")

    with httpx.Client(timeout=300) as client:
        resp = client.post(
            url,
            json=payload,
            headers={"Authorization": f"Bearer {args.api_key}"},
        )

    if resp.status_code != 200:
        print(f"Error {resp.status_code}: {resp.text[:500]}")
        return

    with open(args.output, "wb") as f:
        f.write(resp.content)
    print(f"Saved: {args.output} ({len(resp.content):,} bytes)")


if __name__ == "__main__":
    main()
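The client above sends a local reference file as a base64 data URL and passes `http://`, `https://`, and `data:` values through unchanged. When debugging a cloning request it can help to check what was actually encoded. The following is a minimal sketch of both directions; `to_data_url` and `decode_data_url` are hypothetical helpers and not part of the example client.

```python
import base64


def to_data_url(data: bytes, mime: str = "audio/wav") -> str:
    """Encode raw bytes as a base64 data URL, the shape the client sends."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def decode_data_url(url: str) -> tuple[str, bytes]:
    """Split a base64 data URL into (mime_type, raw_bytes)."""
    header, sep, b64 = url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    mime = header[len("data:"):-len(";base64")]
    # validate=True rejects characters outside the base64 alphabet.
    return mime, base64.b64decode(b64, validate=True)
```

Decoding the `ref_audio` field of a payload and comparing it with the bytes on disk is a quick way to rule out encoding problems before looking at the server.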
voxcpm2/precompute_custom_voice.py
"""Pre-compute VoxCPM2 custom voice profiles.
The generated directory can be passed to the server via
``custom_voice_dir`` in ``vllm_omni/deploy/voxcpm2.yaml``. Requests can then
use ``/v1/audio/speech`` with ``voice="<name>"`` and no per-request ref_audio.
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import Any

import torch
from safetensors.torch import save_file

REPO_ROOT = Path(__file__).resolve().parents[4]
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from vllm_omni.utils.custom_voice_io import safe_voice_stem  # noqa: E402

MANIFEST_NAME = "custom_voice_manifest.json"


def _load_tts(model: str, device: torch.device):
    from vllm_omni.model_executor.models.voxcpm2.voxcpm2_import_utils import import_voxcpm2_core

    VoxCPM = import_voxcpm2_core()
    native = VoxCPM.from_pretrained(model, load_denoiser=False, optimize=False)
    return native.tts_model.to(device).eval()


def _load_manifest(output_dir: Path, model: str) -> dict[str, Any]:
    path = output_dir / MANIFEST_NAME
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {
        "schema_version": 1,
        "model_type": "voxcpm2",
        "model": model,
        "voices": {},
    }


def _write_voice(
    *,
    model: str,
    output_dir: Path,
    voice_name: str,
    ref_audio: str,
    prompt_text: str | None,
    mode: str,
    speaker_description: str | None,
    device: torch.device,
) -> None:
    if mode in ("continuation", "ref_continuation") and not prompt_text:
        raise ValueError("--prompt-text is required for continuation/ref_continuation modes")

    tts = _load_tts(model, device)
    tensors: dict[str, torch.Tensor] = {}
    with torch.inference_mode():
        if mode in ("reference", "ref_continuation"):
            tensors["ref_audio_feat"] = tts._encode_wav(ref_audio, padding_mode="right").float().cpu().contiguous()
        if mode in ("continuation", "ref_continuation"):
            tensors["audio_feat"] = tts._encode_wav(ref_audio, padding_mode="left").float().cpu().contiguous()

    output_dir.mkdir(parents=True, exist_ok=True)
    filename = f"{safe_voice_stem(voice_name)}.safetensors"
    save_file(tensors, str(output_dir / filename))

    manifest = _load_manifest(output_dir, model)
    entry: dict[str, Any] = {
        "name": voice_name,
        "file": filename,
        "mode": mode,
    }
    if "ref_audio_feat" in tensors:
        entry["ref_audio_feat_len"] = int(tensors["ref_audio_feat"].shape[0])
    if "audio_feat" in tensors:
        entry["audio_feat_len"] = int(tensors["audio_feat"].shape[0])
    if prompt_text:
        entry["prompt_text"] = prompt_text
    if speaker_description:
        entry["speaker_description"] = speaker_description
    manifest.setdefault("voices", {})[voice_name] = entry
    (output_dir / MANIFEST_NAME).write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
    print(f"Wrote {output_dir / filename}")
    print(f"Updated {output_dir / MANIFEST_NAME}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Pre-compute VoxCPM2 custom voice profile")
    parser.add_argument("--model", default="openbmb/VoxCPM2", help="VoxCPM2 model path or Hugging Face ID")
    parser.add_argument("--voice-name", required=True)
    parser.add_argument("--ref-audio", required=True)
    parser.add_argument(
        "--prompt-text",
        default=None,
        help="Transcript of ref audio for continuation/ref_continuation modes",
    )
    parser.add_argument(
        "--mode",
        choices=["reference", "continuation", "ref_continuation"],
        default="reference",
    )
    parser.add_argument("--speaker-description", default=None)
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
    args = parser.parse_args()

    _write_voice(
        model=args.model,
        output_dir=Path(args.output_dir),
        voice_name=args.voice_name,
        ref_audio=args.ref_audio,
        prompt_text=args.prompt_text,
        mode=args.mode,
        speaker_description=args.speaker_description,
        device=torch.device(args.device),
    )


if __name__ == "__main__":
    main()
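The script above writes one `.safetensors` file per voice and records each voice under the `voices` key of `custom_voice_manifest.json`. Before pointing the server at the directory, it may be useful to confirm that every manifest entry still has its tensor file. The sketch below does that with the standard library only; `list_voices` is a hypothetical helper, and it reads just the `name`, `file`, and `mode` fields that the script is shown to write.

```python
import json
from pathlib import Path

MANIFEST_NAME = "custom_voice_manifest.json"  # same file name the script writes


def list_voices(voice_dir: Path) -> list[dict]:
    """Return one summary dict per manifest entry, sorted by voice name."""
    manifest = json.loads((voice_dir / MANIFEST_NAME).read_text(encoding="utf-8"))
    summaries = []
    for name, entry in sorted(manifest.get("voices", {}).items()):
        filename = entry.get("file", "")
        summaries.append(
            {
                "name": name,
                "mode": entry.get("mode"),
                "file": filename,
                # An entry without a file name resolves to the directory itself,
                # which is_file() correctly reports as missing.
                "file_exists": (voice_dir / filename).is_file(),
            }
        )
    return summaries
```

Entries with `file_exists` set to `False` point at a profile that needs to be regenerated with the script.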
voxtral_tts/gradio_demo.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/voxtral_tts/gradio_demo.py.
voxtral_tts/text_preprocess.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/voxtral_tts/text_preprocess.py.