Text-To-Speech (Online Serving)¶

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/text_to_speech.

vLLM-Omni exposes TTS models through the OpenAI-compatible POST /v1/audio/speech endpoint, launched with vllm serve <model> --omni. Each TTS model has its own subdirectory containing client snippets, Gradio demos, and helper scripts; this README is the single documentation entry point for all of them.

For offline inference, see examples/offline_inference/text_to_speech. For the full list of supported architectures across all modalities, see Supported Models.

Supported Models¶

Model	HuggingFace repo	Voice cloning	Streaming	Voice presets / upload	Gradio demo
Fish Speech S2 Pro	`fishaudio/s2-pro`	✓ (`ref_audio`+`ref_text`)	✓ (PCM stream)	—	✓
GLM-TTS	`zai-org/GLM-TTS`	✓ (`ref_audio`+`ref_text`, required)	✓ (PCM stream)	—	✓
IndexTTS-2	`IndexTeam/IndexTTS-2`	✓ (`ref_audio` or uploaded `voice`)	compat only, non-chunk	uploaded audio voice only; no presets	—
Ming-omni-tts	`inclusionAI/Ming-omni-tts-0.5B`	✓ (`ref_audio` / `speaker_embedding`)	✓ (PCM stream)	IP labels + structured `instructions`	—
Ming-flash-omni-TTS	`Jonathan1909/Ming-flash-omni-2.0`	— (caption-controlled)	—	caption fields (`instructions`)	—
MOSS-TTS-Nano	`OpenMOSS-Team/MOSS-TTS-Nano`	✓ (`ref_audio` required)	✓ (PCM stream)	—	✓
OmniVoice	`k2-fsa/OmniVoice`	✓	—	—	—
Qwen3-TTS	`Qwen/Qwen3-TTS-12Hz-1.7B-{CustomVoice,VoiceDesign,Base}`	✓ (Base)	✓ (PCM + WebSocket)	✓ (presets + `/v1/audio/voices` upload)	✓ (standard + FastRTC)
VoxCPM2	`openbmb/VoxCPM2`	✓	✓ (AudioWorklet via gradio)	—	✓
Voxtral TTS	`mistralai/Voxtral-4B-TTS-2603`	✓ (gated upstream)	✓	✓ (presets)	✓
SoulX-Singer	`Soul-AILab/SoulX-Singer`	✓ (prompt audio)	— (batch only)	— (prompt + target audio)	— (chat client)

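CosyVoice3 is intentionally absent from this table and has no dedicated section below. Its launch script and client (cosyvoice3/run_server.sh and cosyvoice3/speech_client.py) are listed under Example materials; for model details, see its offline section.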

Common Quick Start¶

Launch the server (defaults shown — adjust --port, --gpu-memory-utilization, etc. as needed):

vllm serve <hf-repo-or-local-path> --omni --port 8091

Send a TTS request via curl. These generic snippets assume a model with a preset/default voice; voice-cloning-only models such as IndexTTS-2 require ref_audio or an uploaded audio voice (see model-specific sections below).

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, how are you?",
        "voice": "default",
        "response_format": "wav"
    }' --output output.wav

Or via Python httpx:

import httpx

response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={
        "input": "Hello, how are you?",
        "voice": "default",
        "response_format": "wav",
    },
    timeout=300.0,
)
response.raise_for_status()  # fail loudly instead of saving an error body as audio

with open("output.wav", "wb") as f:
    f.write(response.content)

Or via the OpenAI SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
response = client.audio.speech.create(
    model="<hf-repo>",
    voice="default",
    input="Hello, how are you?",
    response_format="wav",  # match the .wav file written below
)
response.write_to_file("output.wav")

Streaming PCM output (where supported) — set stream=true, stream_format="audio", and response_format="pcm":

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, how are you?",
        "voice": "default",
        "stream": true,
        "stream_format": "audio",
        "response_format": "pcm"
    }' --no-buffer | play -t raw -r 24000 -e signed -b 16 -c 1 -

Adjust the player's sample rate to match the model (44.1 kHz for Fish Speech, 48 kHz for VoxCPM2, 22.05 kHz for IndexTTS-2, and 24 kHz for many others).
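Raw PCM carries no header, so a saved stream cannot be opened by most players until the sample rate is supplied. The sketch below wraps streamed bytes in a WAV container using Python's standard wave module. It assumes 16-bit signed mono samples, matching the play flags above; SAMPLE_RATES and pcm_to_wav are illustrative names, not part of vLLM-Omni.

```python
import io
import wave

# Sample rates quoted in this README; the keys are illustrative labels.
SAMPLE_RATES = {
    "fish_speech": 44100,
    "voxcpm2": 48000,
    "indextts2": 22050,
    "default": 24000,
}


def pcm_to_wav(pcm: bytes, sample_rate: int) -> bytes:
    """Wrap raw 16-bit signed mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono, as in `play -c 1`
        wav.setsampwidth(2)            # 16-bit samples, as in `play -b 16`
        wav.setframerate(sample_rate)  # must match the model's output rate
        wav.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    # One second of silence stands in for bytes collected from a stream.
    silence = b"\x00\x00" * SAMPLE_RATES["default"]
    wav_bytes = pcm_to_wav(silence, SAMPLE_RATES["default"])
    print(f"WAV size: {len(wav_bytes)} bytes")
```

Picking the wrong rate does not raise an error; the audio simply plays too fast or too slow, which is why the rate is passed explicitly rather than guessed.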

For full request-shape documentation (all parameters, response formats, error codes), see the Speech API reference.

GLM-TTS¶

2-stage TTS (AR + DiT flow-matching) at 24 kHz. Every request requires ref_audio + ref_text.

Launch¶

vllm serve zai-org/GLM-TTS --omni --trust-remote-code --port 8091
# or:
bash examples/online_serving/text_to_speech/glm_tts/run_server.sh /path/to/GLM-TTS

Sending requests¶

# Voice cloning (required)
python examples/online_serving/text_to_speech/glm_tts/openai_speech_client.py \
    --text "你好，这是语音克隆测试。" \
    --ref-audio file:///path/to/ref.wav \
    --ref-text "这是参考音频的文本内容。"

# Custom format
python examples/online_serving/text_to_speech/glm_tts/openai_speech_client.py \
    --text "Hello, this is a voice cloning test." \
    --ref-audio file:///path/to/ref.wav \
    --ref-text "Transcript of the reference audio." \
    --response-format mp3 -o output.mp3

Gradio demo¶

bash examples/online_serving/text_to_speech/glm_tts/run_gradio_demo.sh

Notes¶

Output: 24 kHz mono WAV via HiFT vocoder.
ref_audio + ref_text are required together on every request. Reference audio should be 3-10 seconds.
Voice cloning feature extraction (WhisperVQ, CampPlus, mel) runs on the model side — no external dependency on the serving layer.
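Because the reference clip should be 3-10 seconds long, it can help to check the duration on the client before sending a request. The following is a minimal sketch using only the standard library; it handles uncompressed PCM WAV data, and the helper names (wav_duration_seconds, is_good_reference, make_silent_wav) are hypothetical, not part of the GLM-TTS client.

```python
import io
import wave

MIN_SECONDS = 3.0   # lower bound recommended in the notes above
MAX_SECONDS = 10.0  # upper bound recommended in the notes above


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of an uncompressed PCM WAV payload."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference(data: bytes) -> bool:
    """True when the clip length falls inside the recommended range."""
    return MIN_SECONDS <= wav_duration_seconds(data) <= MAX_SECONDS


def make_silent_wav(seconds: float, sample_rate: int = 24000) -> bytes:
    """Build a silent 16-bit mono WAV, used here only for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


if __name__ == "__main__":
    for seconds in (1.0, 5.0, 12.0):
        clip = make_silent_wav(seconds)
        print(seconds, is_good_reference(clip))
```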

IndexTTS-2¶

2-stage TTS (GPT AR + S2Mel CFM DiT + BigVGAN) at 22.05 kHz. Requests use ref_audio for voice cloning, or an uploaded audio voice from /v1/audio/voices. Supports emotion conditioning via emo_audio, emo_text, or emo_vector passed in extra_params.

Launch¶

vllm serve IndexTeam/IndexTTS-2 --omni --trust-remote-code --port 8092
# or, to pass the bundled deploy config explicitly:
bash examples/online_serving/text_to_speech/indextts2/run_server.sh

Sending requests¶

# Voice cloning (ref_audio required)
python examples/online_serving/text_to_speech/indextts2/speech_client.py \
    --text "你好，世界！" \
    --ref-audio /path/to/reference.wav

# With emotion audio
python examples/online_serving/text_to_speech/indextts2/speech_client.py \
    --text "今天心情很好！" \
    --ref-audio /path/to/ref.wav \
    --emo-audio /path/to/happy.wav

Notes¶

Output: 22.05 kHz mono WAV.
Provide ref_audio on the documented raw request path, or pass voice only when it names an uploaded audio voice; IndexTTS-2 does not provide a built-in text-only preset voice.
Emotion params (emo_audio, emo_text, emo_vector, emo_alpha, use_emo_text, use_random) are passed via the extra_params field. Official precedence is use_emo_text > emo_vector > emo_audio > same emotion as the speaker reference.
stream=true is accepted as an OpenAI-compatible response path, but IndexTTS-2 is not async-chunk streaming; audio is produced after S2Mel receives the full mel-code sequence.
Deploy config: vllm_omni/deploy/indextts2.yaml (auto-loaded).
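The precedence rule in the notes can be made concrete with a small client-side helper that reports which emotion input would take effect for a given extra_params dictionary. This is only an illustration of the documented ordering: the function name, the example values, and the value types (a list of floats for emo_vector, a path string for emo_audio) are assumptions, and the server's own resolution logic may differ in detail.

```python
def effective_emotion_source(extra_params: dict) -> str:
    """Apply the documented order: use_emo_text > emo_vector > emo_audio > speaker reference."""
    if extra_params.get("use_emo_text"):
        return "emo_text"
    if extra_params.get("emo_vector") is not None:
        return "emo_vector"
    if extra_params.get("emo_audio"):
        return "emo_audio"
    return "speaker_reference"


if __name__ == "__main__":
    # Hypothetical request body; only the field names come from this README.
    payload = {
        "input": "今天心情很好！",
        "ref_audio": "/path/to/ref.wav",
        "extra_params": {
            "emo_audio": "/path/to/happy.wav",
            "emo_alpha": 0.6,  # assumed value, for illustration only
        },
    }
    print(effective_emotion_source(payload["extra_params"]))
```

A check like this is mainly useful for catching requests that set several emotion inputs at once and would silently ignore all but one.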

Fish Speech S2 Pro¶

4B dual-AR TTS at 44.1 kHz. Server uses the DAC codec.

Prerequisites¶

pip install fish-speech

KV-cache attention fast path¶

Fish Speech S2 Pro uses a Triton decode-only KV-cache attention fast path by default on CUDA builds. Set VLLM_OMNI_FISH_KVCACHE_ATTN=0 to disable it, or VLLM_OMNI_FISH_KVCACHE_ATTN=required to fail fast if the fast path cannot be installed.

# Verify fast path availability.
python - <<'PY'
from vllm_omni.attention import fish_kvcache_attn

print(fish_kvcache_attn.is_available())
print(fish_kvcache_attn.load_error())
PY

# Optional: disable the runtime fast path.
export VLLM_OMNI_FISH_KVCACHE_ATTN=0

Launch¶

vllm serve fishaudio/s2-pro --omni --port 8091
# or:
./fish_speech/run_server.sh

The deploy config auto-loads from vllm_omni/deploy/fish_qwen3_omni.yaml (the HF model_type on the fishaudio checkpoint is fish_qwen3_omni).

Voice cloning¶

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, this is a cloned voice.",
        "voice": "default",
        "ref_audio": "https://example.com/reference.wav",
        "ref_text": "Transcript of the reference audio."
    }' --output cloned.wav

CLI client¶

cd examples/online_serving/text_to_speech/fish_speech
python speech_client.py --text "Hello, how are you?"
python speech_client.py --text "Hello world" --stream --output output.pcm

Gradio demo¶

./fish_speech/run_gradio_demo.sh             # launches server + Gradio
python fish_speech/gradio_demo.py --api-base http://localhost:8091  # if server already running

Notes¶

Output: 44.1 kHz mono.
Streaming PCM player command must use -r 44100.

Ming-omni-tts¶

Dense 0.5B two-stage TTS served through /v1/audio/speech. Ming uses the standard speech endpoint plus structured controls in instructions, voice, language, ref_audio, ref_text, and speaker_embedding.

Launch¶

bash examples/online_serving/text_to_speech/ming_tts/run_server.sh

Equivalent manual command:

vllm-omni serve inclusionAI/Ming-omni-tts-0.5B \
    --deploy-config vllm_omni/deploy/ming_tts.yaml \
    --host 0.0.0.0 --port 8091 \
    --enforce-eager --omni

Sending requests¶

python examples/online_serving/text_to_speech/ming_tts/openai_speech_client.py \
    --text "你好，这是 Ming 在线语音合成测试。"

Structured dialect control:

python examples/online_serving/text_to_speech/ming_tts/openai_speech_client.py \
    --text "我觉得社会企业同个人都有责任" \
    --instruction-json '{"方言":"广粤话"}' \
    --ref-audio /path/to/yue_prompt.wav

Zero-shot cloning:

python examples/online_serving/text_to_speech/ming_tts/openai_speech_client.py \
    --text "我们的愿景是构建未来服务业的数字化基础设施，为世界带来更多微小而美好的改变。" \
    --ref-audio /path/to/10002287-00000094.wav \
    --ref-text "在此奉劝大家别乱打美白针。"

Notes¶

run_curl.sh keeps a small sanity subset; use the Ming README for the broader request cookbook.
Online serving currently covers speech only; the music-only (bgm) and text-to-audio (tta) modes remain offline examples.
Full request details live in ming_tts/README.md.

Ming-flash-omni-TTS¶

Standalone talker-only deployment of Ming-flash-omni-2.0. Voice is controlled through caption text passed via instructions.

Launch¶

# from repo root
bash examples/online_serving/text_to_speech/ming_flash_omni_tts/run_server.sh

Equivalent manual command:

vllm serve Jonathan1909/Ming-flash-omni-2.0 \
    --deploy-config vllm_omni/deploy/ming_flash_omni_tts.yaml \
    --host 0.0.0.0 --port 8091 \
    --trust-remote-code --omni

Sending requests¶

python examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py \
    --text "我们当迎着阳光辛勤耕作，去摘取，去制作，去品尝，去馈赠。" \
    --output ming_online.wav

ASMR-style caption via instructions:

python examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py \
    --text "我会一直在这里陪着你，直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗？" \
    --instructions "这是一种ASMR耳语，属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语，声音气音成分重。" \
    --output ming_online_asmr.wav

Notes¶

Server uses use_zero_spk_emb=True and the cookbook decode defaults (max_decode_steps=200, cfg=2.0, sigma=0.25, temperature=0.0). For other caption fields (语速, 基频, IP, BGM, etc.) or overriding decode args, use the offline example where additional_information is set explicitly.
This is the online counterpart of examples/offline_inference/text_to_speech/ming_flash_omni_tts/.
For multimodal Ming-flash-omni online serving, see examples/online_serving/ming_flash_omni/.

MOSS-TTS-Nano¶

Single-stage 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec at 48 kHz mono. Every request must include ref_audio; there are no built-in speaker presets.

The OpenAI-schema voice and ref_text fields are accepted but ignored — voice_clone does not consume a transcript, and upstream's continuation mode (the only path that accepts prompt_text) emits near-silent output, so it is not exposed here. Sample reference clips ship in the upstream repo under assets/audio/.

Launch¶

vllm serve OpenMOSS-Team/MOSS-TTS-Nano --omni --port 8091
# or:
./moss_tts_nano/run_server.sh

The deploy config at vllm_omni/deploy/moss_tts_nano.yaml auto-loads; no --stage-configs-path, --trust-remote-code, or --enforce-eager flags are needed.

Sending requests¶

# One-off fetch of a sample reference clip; cache under XDG_CACHE_HOME.
REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
mkdir -p "$REF_DIR"
REF_WAV="$REF_DIR/zh_1.wav"
[ -s "$REF_WAV" ] || curl -L -o "$REF_WAV" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav
REF_AUDIO=$(base64 -w 0 "$REF_WAV")

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d "{
        \"input\": \"你好，这是语音合成测试。\",
        \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
        \"response_format\": \"wav\"
    }" --output output.wav

Streaming PCM¶

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d "{
        \"input\": \"Hello, streaming output from MOSS-TTS-Nano.\",
        \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
        \"stream\": true,
        \"stream_format\": \"audio\",
        \"response_format\": \"pcm\"
    }" --no-buffer | play -t raw -r 48000 -e signed -b 16 -c 1 -

Gradio demo¶

# Option 1: launch server + Gradio together
./moss_tts_nano/run_gradio_demo.sh

# Option 2: server already running
python moss_tts_nano/gradio_demo.py --api-base http://localhost:8091

Then open http://localhost:7860 in your browser.

Notes¶

Output is 48 kHz mono PCM (the upstream tokenizer is internally stereo at 48 kHz; the wrapper averages to mono before reaching the engine).
Standard /v1/audio/speech request shape: input, ref_audio (base64 data URL), response_format, stream, max_new_tokens. The voice and ref_text fields from the OpenAI schema are accepted but ignored.
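The request shape above can also be assembled in Python, which avoids relying on the -w flag of base64 (it may not be available on non-GNU systems). The sketch below mirrors the two curl bodies shown earlier; build_moss_payload is a hypothetical helper name, and the resulting dictionary would be sent as the JSON body of a POST to /v1/audio/speech.

```python
import base64
import json


def build_moss_payload(text: str, wav_bytes: bytes, stream: bool = False) -> dict:
    """Build a /v1/audio/speech body with ref_audio as a base64 data URL."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    payload = {
        "input": text,
        "ref_audio": f"data:audio/wav;base64,{encoded}",
        "response_format": "wav",
    }
    if stream:
        # Streaming uses raw PCM, as in the streaming curl example.
        payload.update(stream=True, stream_format="audio", response_format="pcm")
    return payload


if __name__ == "__main__":
    # Placeholder bytes; a real request would read a reference WAV from disk.
    body = build_moss_payload("Hello from MOSS-TTS-Nano.", b"RIFF....WAVE")
    print(json.dumps(body, ensure_ascii=False)[:80])
```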

OmniVoice¶

Zero-shot multilingual TTS (600+ languages). Online serving currently exposes auto voice only; voice cloning and voice design are available offline.

Prerequisites¶

huggingface-cli download k2-fsa/OmniVoice

Voice cloning (offline) needs transformers>=5.3.0; auto voice works with transformers>=4.57.0.

Launch¶

vllm serve k2-fsa/OmniVoice --omni --port 8091 --trust-remote-code
# or:
./omnivoice/run_server.sh

CLI client¶

cd examples/online_serving/text_to_speech/omnivoice
# Text-only (auto voice)
python speech_client.py --text "Hello, how are you?"

# Language hint
python speech_client.py --text "Bonjour, comment allez-vous?" --language French

# Voice cloning (reference audio + optional ref_text)
python speech_client.py \
    --text "Bonjour, comment allez-vous?" \
    --ref-audio /path/to/ref_audio.wav \
    --ref-text "Bonjour, comment allez-vous?"

# Style instruction (voice design-style control)
python speech_client.py \
    --text "Bonjour, comment allez-vous?" \
    --language French \
    --instructions "loud voice"

# Deterministic output with seed parameter
python speech_client.py --text "Hello, how are you?" --seed 42

The client supports --api-base, --model, --text, --response-format, --language, --voice, --ref-audio, --ref-text, --instructions, --seed, and --output.

Qwen3-TTS¶

Three model variants exposed via separate checkpoints:

Variant	HF repo	Use
CustomVoice	`Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`	Predefined speakers (`vivian`, `ryan`, …) with optional style instructions
VoiceDesign	`Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign`	Natural-language voice style description
Base	`Qwen/Qwen3-TTS-12Hz-1.7B-Base`	Voice cloning from a reference audio

Each variant ships smaller 0.6B companions where available.

Launch¶

vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni --port 8091
# or:
./qwen3_tts/run_server.sh                # default: CustomVoice
./qwen3_tts/run_server.sh VoiceDesign
./qwen3_tts/run_server.sh Base

Executor backend¶

Single-GPU deployments now default to the uniproc executor, which has lower IPC overhead (see the Base cloning use case in #2603 / #2604). vllm_omni/deploy/qwen3_tts.yaml is the only Qwen3-TTS deploy config; pass --deploy-config <path> to override it.

To opt out of chunked streaming, pass --no-async-chunk — the pipeline auto-dispatches to the end-to-end codec processor.

Sending requests¶

# CustomVoice with a predefined speaker
python qwen3_tts/openai_speech_client.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --text "今天天气真好" \
    --speaker ryan \
    --instructions "用开心的语气说"

# VoiceDesign with a style description
python qwen3_tts/openai_speech_client.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
    --task-type VoiceDesign \
    --text "哥哥，你回来啦" \
    --instructions "体现撒娇稚嫩的萝莉女声，音调偏高"

# Base voice cloning
python qwen3_tts/openai_speech_client.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
    --task-type Base \
    --text "Hello, this is a cloned voice" \
    --ref-audio /path/to/reference.wav \
    --ref-text "Original transcript of the reference audio"

Voices endpoint¶

List available voices, or upload a custom one for Base cloning:

# List
curl http://localhost:8091/v1/audio/voices

# Upload
curl -X POST http://localhost:8091/v1/audio/voices \
    -F "audio_sample=@/path/to/voice_sample.wav" \
    -F "consent=user_consent_id" \
    -F "name=custom_voice_1" \
    -F "ref_text=The exact transcript of the audio sample." \
    -F "speaker_description=warm narrator"

Uploaded voices are then usable as voice="custom_voice_1" on subsequent requests.

Precomputed custom voices¶

For reused Base voice-cloning speakers, precompute the reference artifacts once and load them at server startup:

python qwen3_tts/precompute_custom_voice.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
    --voice-name alice \
    --ref-audio /path/to/reference.wav \
    --ref-text "Original transcript of the reference audio" \
    --mode icl \
    --output-dir /path/to/custom_voices

--mode icl stores both speaker_embedding and ref_code; --mode xvec stores only the speaker embedding. Add the output directory to a deploy config:

custom_voice_dir: /path/to/custom_voices

Then start the server with that config and call the Speech API with only the voice name:

vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base --omni --deploy-config /path/to/qwen3_tts_custom_voice.yaml

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"input":"Hello from a precomputed voice.","voice":"alice","task_type":"Base"}' \
    --output alice.wav

Streaming PCM¶

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, how are you?",
        "voice": "vivian",
        "language": "English",
        "stream": true,
        "stream_format": "audio",
        "response_format": "pcm"
    }' --no-buffer | play -t raw -r 24000 -e signed -b 16 -c 1 -

Raw PCM streaming requires stream_format="audio", response_format="pcm", and async_chunk: true on the stage config (default in qwen3_tts.yaml). speed is not supported when streaming.
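The client-side constraints in the paragraph above can be checked before a request is sent. The helper below is a hypothetical pre-flight check, not part of vLLM-Omni; it only encodes the rules stated here, and the server may reject requests for other reasons (for example, if async_chunk is disabled in the stage config, which a client cannot see).

```python
def validate_streaming_request(payload: dict) -> list[str]:
    """Return a list of problems with a streaming /v1/audio/speech body."""
    problems: list[str] = []
    if not payload.get("stream"):
        return problems  # non-streaming requests are not checked here
    if payload.get("stream_format") != "audio":
        problems.append('stream_format must be "audio"')
    if payload.get("response_format") != "pcm":
        problems.append('response_format must be "pcm"')
    if "speed" in payload:
        problems.append("speed is not supported when streaming")
    return problems


if __name__ == "__main__":
    request = {
        "input": "Hello, how are you?",
        "voice": "vivian",
        "stream": True,
        "stream_format": "audio",
        "response_format": "wav",  # deliberately wrong for the demo
        "speed": 1.2,              # deliberately unsupported for the demo
    }
    for problem in validate_streaming_request(request):
        print(problem)
```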

Streaming WebSocket¶

The /v1/audio/speech/stream endpoint accepts text incrementally, splits it at sentence boundaries, and emits one PCM stream per sentence:

python qwen3_tts/streaming_speech_client.py --text "Hello world. How are you? I am fine."
python qwen3_tts/streaming_speech_client.py --text "..." --simulate-stt --stt-delay 0.1
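To illustrate what splitting incrementally arriving text at sentence boundaries means, here is a naive buffer that emits a sentence as soon as its closing punctuation arrives. It is a sketch of the idea only: the server's actual splitter is not documented here, SentenceBuffer is a hypothetical name, and the punctuation set is an assumption (it would, for instance, split inside a decimal such as 3.14).

```python
import re

# A run of non-terminator characters followed by one or more terminators.
_SENTENCE = re.compile(r"[^.!?。！？]*[.!?。！？]+")


class SentenceBuffer:
    """Accumulate text fragments and release complete sentences."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, text: str) -> list[str]:
        """Add a fragment; return sentences completed by it."""
        self._pending += text
        sentences = []
        consumed = 0
        for match in _SENTENCE.finditer(self._pending):
            sentences.append(match.group().strip())
            consumed = match.end()
        self._pending = self._pending[consumed:]  # keep the unfinished tail
        return sentences

    def flush(self) -> list[str]:
        """Return whatever is left when the input ends."""
        rest = self._pending.strip()
        self._pending = ""
        return [rest] if rest else []


if __name__ == "__main__":
    buf = SentenceBuffer()
    for fragment in ["Hello world. How ", "are you? I am fine"]:
        print(buf.feed(fragment))
    print(buf.flush())
```

Each returned sentence would correspond to one PCM stream in the behaviour described above; text without closing punctuation stays buffered until more input arrives or the session ends.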

To receive word-level timestamps, launch the server with a forced aligner:

vllm-omni serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --omni \
    --deploy-config vllm_omni/deploy/qwen3_tts.yaml \
    --trust-remote-code \
    --forced-aligner Qwen/Qwen3-ForcedAligner-0.6B

Then request PCM audio with JSON timestamp sidecars:

python qwen3_tts/streaming_speech_client.py \
    --text "Hello world. How are you?" \
    --stream-audio \
    --response-format pcm \
    --word-timestamps

The client writes one PCM file per sentence and a matching sentence_XXX_timestamps.json sidecar.

To see the alignment instead of reading a JSON sidecar, run the word-timestamp Gradio demo (server must be launched with --forced-aligner):

python qwen3_tts/word_timestamps_demo.py --api-base http://localhost:8091

Each sentence's audio plays in an <audio> element while its text is rendered as inline word spans; the current word highlights as audio.currentTime crosses each start_ms. The Stop (barge-in) button cuts playback and reports the last-spoken word, useful for the voice-agent barge-in case.
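The highlighting logic described above reduces to finding the last word whose start_ms is at or before the current playback position. A minimal Python sketch of that lookup follows. Only the start_ms field name comes from this README; the surrounding entry layout (a list of objects with a word key) and the function name are assumptions made for illustration.

```python
from bisect import bisect_right


def current_word_index(start_times_ms: list[int], position_ms: float) -> int:
    """Index of the word being spoken at position_ms, or -1 before the first word.

    start_times_ms must be sorted in ascending order.
    """
    return bisect_right(start_times_ms, position_ms) - 1


if __name__ == "__main__":
    # Hypothetical sidecar content for one sentence.
    words = [
        {"word": "Hello", "start_ms": 0},
        {"word": "world", "start_ms": 420},
    ]
    starts = [w["start_ms"] for w in words]
    index = current_word_index(starts, position_ms=500)
    print(words[index]["word"] if index >= 0 else None)
```

The same lookup gives the last-spoken word for barge-in: call it with the playback position at the moment the Stop button is pressed.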

Gradio demos¶

./qwen3_tts/run_gradio_demo.sh                              # CustomVoice (default)
./qwen3_tts/run_gradio_demo.sh --task-type VoiceDesign
./qwen3_tts/run_gradio_demo.sh --task-type Base

Speaker embedding interpolation¶

qwen3_tts/speaker_embedding_interpolation.py blends two predefined speakers' embeddings to produce intermediate voices. See the script for usage.
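As a rough picture of what blending two speaker embeddings involves, the sketch below performs a plain linear interpolation between two vectors. This is an assumption about the general technique, not a description of the script: the real implementation may normalize the vectors or use a different scheme such as spherical interpolation, and the function name and toy values here are hypothetical.

```python
def interpolate(a: list[float], b: list[float], alpha: float) -> list[float]:
    """Linear blend: alpha=0 returns a, alpha=1 returns b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real speaker embeddings are much longer.
    speaker_a = [0.0, 2.0]
    speaker_b = [4.0, 2.0]
    for alpha in (0.0, 0.25, 1.0):
        print(alpha, interpolate(speaker_a, speaker_b, alpha))
```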

Batch client¶

qwen3_tts/batch_speech_client.py issues many concurrent requests for throughput measurement.
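The general pattern for this kind of throughput measurement is to issue requests from a pool of workers and divide the request count by the wall-clock time. The sketch below shows that pattern with a thread pool and an injectable send function. It is not the batch client's implementation: measure_throughput and fake_send are hypothetical, and fake_send stands in for a real HTTP POST to /v1/audio/speech so the example runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_throughput(
    send: Callable[[str], bytes], texts: list[str], concurrency: int = 8
) -> dict:
    """Send every text through `send` concurrently and summarize the run."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(send, texts))  # preserves input order
    elapsed = time.perf_counter() - start
    return {
        "requests": len(results),
        "total_bytes": sum(len(r) for r in results),
        "seconds": elapsed,
        "requests_per_second": len(results) / elapsed if elapsed > 0 else float("inf"),
    }


def fake_send(text: str) -> bytes:
    """Stand-in for a speech request; sleeps briefly and returns dummy bytes."""
    time.sleep(0.01)
    return text.encode("utf-8")


if __name__ == "__main__":
    stats = measure_throughput(fake_send, ["Hello."] * 16, concurrency=4)
    print(stats["requests"], f"{stats['requests_per_second']:.1f} req/s")
```

Threads are sufficient here because each worker spends its time waiting on network I/O rather than computing.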

Notes¶

Base voice cloning has uniproc-vs-mp tradeoffs depending on per-request reference audio cost; see the executor-backend section above.
With async chunking, Qwen3-TTS Base voice cloning sends the full reference context in the first Code2Wav packet, then caches that prefix on the Code2Wav stage for follow-up chunks in the same request.
vllm_omni/deploy/qwen3_tts.yaml is the default deploy config (loaded by HF model_type); per-stage runtime overrides are available via --stage-N-<field> <value>.

VoxCPM2¶

Single-stage native AR TTS at 48 kHz.

Launch¶

vllm serve openbmb/VoxCPM2 --omni --host 0.0.0.0 --port 8000

Deploy config auto-loads from vllm_omni/deploy/voxcpm2.yaml. Pass --deploy-config <path> to override or --stage-N-<field> <value> for per-stage runtime tweaks.

Sending requests¶

# Zero-shot synthesis
python voxcpm2/openai_speech_client.py --text "Hello, this is VoxCPM2."

# Voice cloning
python voxcpm2/openai_speech_client.py \
    --text "This should sound like the reference speaker." \
    --ref-audio /path/to/reference.wav

The ref_audio field accepts local file paths (auto-base64), HTTP URLs, or data:audio/wav;base64,... data URIs.
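The three accepted forms can be handled uniformly on the client: URLs and data URIs pass through unchanged, while a local path is read and encoded. The following sketch shows one way to do that. The function name is hypothetical and the MIME map is an assumption modelled on the example clients in this directory; it is not the VoxCPM2 client's actual code.

```python
import base64
import tempfile
from pathlib import Path

# Extension-to-MIME map; unknown extensions fall back to audio/wav.
_MIME = {".wav": "audio/wav", ".mp3": "audio/mpeg", ".flac": "audio/flac", ".ogg": "audio/ogg"}


def normalize_ref_audio(value: str) -> str:
    """Return a value suitable for the ref_audio request field."""
    if value.startswith(("http://", "https://", "data:")):
        return value  # already a URL or data URI
    path = Path(value)
    mime = _MIME.get(path.suffix.lower(), "audio/wav")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


if __name__ == "__main__":
    print(normalize_ref_audio("https://example.com/reference.wav"))
    # Placeholder bytes written to a temporary file to exercise the path branch.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "reference.wav"
        sample.write_bytes(b"abc")
        print(normalize_ref_audio(str(sample)))
```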

Precomputed custom voices¶

For repeated VoxCPM2 speakers, precompute the prompt cache and load it through custom_voice_dir:

python voxcpm2/precompute_custom_voice.py \
    --model openbmb/VoxCPM2 \
    --voice-name alice \
    --ref-audio /path/to/reference.wav \
    --mode ref_continuation \
    --prompt-text "Original transcript of the reference audio" \
    --output-dir /path/to/custom_voices

Add the output directory to the deploy config:

custom_voice_dir: /path/to/custom_voices

After startup, /v1/audio/voices lists alice, and /v1/audio/speech can use voice="alice" without sending ref_audio.

Gradio demo (gapless streaming via AudioWorklet)¶

python voxcpm2/gradio_demo.py

Uses an AudioWorklet-based player adapted from the Qwen3-TTS demo for gap-free playback. Raw PCM audio is streamed from the OpenAI Speech endpoint with stream=true and stream_format="audio".

Voxtral TTS¶

Voxtral-4B-TTS (Mistral). Uses the mistral_common SpeechRequest protocol; voice presets are model-specific.

Prerequisites¶

Latest mistral_common with SpeechRequest support:

pip install -e /path/to/mistral-common  # or upgrade from PyPI when available

Launch¶

vllm serve mistralai/Voxtral-4B-TTS-2603 --omni --port 8091

Deploy config auto-loads from vllm_omni/deploy/voxtral_tts.yaml.

Gradio demo¶

python voxtral_tts/gradio_demo.py

The demo handles voice-preset selection and reference-audio upload. voxtral_tts/text_preprocess.py provides the text-normalization helpers used by the demo (also available for other clients).

Notes¶

Voice presets are listed on the HF model card (mistralai/Voxtral-4B-TTS-2603).
Voice cloning is gated upstream and may require a recent mistral_common.
A standalone CLI client is not yet shipped; the Gradio demo is the canonical reference for now.

SoulX-Singer¶

Singing voice synthesis (SVS) and conversion (SVC) at 24 kHz. Single-stage DiT with inline preprocess. Uses the /v1/chat/completions endpoint with multimodal input (prompt_audio + target_audio).

Prerequisites¶

Download the DiT and preprocess weights, then set up separate SVS / SVC view directories and install dependencies as described in the offline README. The architectures field in config.json is the single source of truth for SVS vs SVC, so point MODEL at the matching directory.

Launch¶

# SVS (default)
export MODEL=/path/to/SoulX-Singer
export PREPROCESS=/path/to/SoulX-Singer-Preprocess
bash examples/online_serving/text_to_speech/soulxsinger/run_server.sh

# SVC
export MODE=svc
export MODEL=/path/to/SoulX-Singer-svc
bash examples/online_serving/text_to_speech/soulxsinger/run_server.sh

Or equivalently, set SOULX_PREPROCESS_WEIGHTS_DIR and launch directly:

export SOULX_PREPROCESS_WEIGHTS_DIR=$PREPROCESS
vllm serve $MODEL --omni \
    --deploy-config vllm_omni/deploy/soulxsinger_${MODE}.yaml \
    --port 8192 --trust-remote-code --enforce-eager

Sending requests¶

Audio paths must be reachable from the server host (local filesystem or data URL). The client sends the prompt vocal via input_audio and the target accompaniment via extra_args['target_audio'].

# Default demo audio: tests/assets/soulxsinger/zh_prompt.mp3 + music.mp3
python examples/online_serving/text_to_speech/soulxsinger/openai_chat_client.py \
    --prompt-audio /path/on/server/zh_prompt.mp3 \
    --target-audio /path/on/server/music.mp3 \
    --preprocess-weights-dir /path/on/server/SoulX-Singer-Preprocess \
    -o output.wav

To skip online preprocessing, pass precomputed metadata instead:

python examples/online_serving/text_to_speech/soulxsinger/openai_chat_client.py \
    --prompt-metadata-path /path/on/server/zh_prompt.json \
    --target-metadata-path /path/on/server/music.json \
    --audio-path /path/on/server/zh_prompt.mp3 \
    -o output.wav

SOULX_PREPROCESS_WEIGHTS_DIR makes --preprocess-weights-dir optional. See openai_chat_client.py --help for --vocal-sep, --language, --num-inference-steps, --guidance-scale, and --seed.

Notes¶

Output: 24 kHz mono WAV; batch only.
Defaults match upstream: --guidance-scale 3.0, --seed 42, --auto-shift on.
SVS accepts --control score or --control melody. MIDI and lyric QC is available only through the upstream midi_editor.

Example materials¶

cosyvoice3/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for CosyVoice3 TTS
#
# Usage:
#   ./run_server.sh
#   CUDA_VISIBLE_DEVICES=0 ./run_server.sh
#
# Streaming (async-chunk) is on by default via vllm_omni/deploy/cosyvoice3.yaml.
# Set NO_ASYNC_CHUNK=1 to use the legacy synchronous path.

set -e

MODEL="${MODEL:-FunAudioLLM/Fun-CosyVoice3-0.5B-2512}"
PORT="${PORT:-8091}"

EXTRA_ARGS=()
if [[ -n "${NO_ASYNC_CHUNK:-}" ]]; then
    EXTRA_ARGS+=(--no-async-chunk)
fi

echo "Starting CosyVoice3 server with model: $MODEL"

vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --trust-remote-code \
    --omni \
    "${EXTRA_ARGS[@]}"

cosyvoice3/speech_client.py

"""Client for CosyVoice3 TTS via /v1/audio/speech endpoint.

CosyVoice3 has no built-in voice presets: every request is voice cloning
driven by ``ref_audio`` + ``ref_text``. The defaults below point at the
official upstream zero-shot prompt so the script runs out of the box.

Examples:
    # Voice cloning with the default upstream prompt
    python speech_client.py --text "收到好友从远方寄来的生日礼物。"

    # Custom reference clip + transcript
    python speech_client.py --text "Hello, this is a cloned voice." \
        --ref-audio /path/to/reference.wav \
        --ref-text "Transcript of the reference audio."

    # Streaming PCM output
    python speech_client.py --text "Hello world" --stream --output output.pcm
"""

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
DEFAULT_MODEL = "FunAudioLLM/Fun-CosyVoice3-0.5B-2512"

# Official CosyVoice zero-shot prompt and its transcript.
DEFAULT_REF_AUDIO = "https://raw.githubusercontent.com/FunAudioLLM/CosyVoice/main/asset/zero_shot_prompt.wav"
DEFAULT_REF_TEXT = "希望你以后能够做的比我还好呦。"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "flac": "audio/flac", "ogg": "audio/ogg"}
    mime_type = mime_map.get(ext, "audio/wav")
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"


def run_tts(args) -> None:
    """Generate speech via the /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }

    if args.ref_audio.startswith(("http://", "https://")):
        payload["ref_audio"] = args.ref_audio
    else:
        payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    payload["ref_text"] = args.ref_text

    if args.stream:
        payload["stream"] = True
        payload["stream_format"] = "audio"
        payload["response_format"] = "pcm"

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print(f"Voice cloning: ref_audio={args.ref_audio}, ref_text={args.ref_text}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    if args.stream:
        output_path = args.output or "output.pcm"
        with httpx.Client(timeout=300.0) as client:
            with client.stream("POST", api_url, json=payload, headers=headers) as resp:
                if resp.status_code != 200:
                    print(f"Error: {resp.status_code}")
                    print(resp.read().decode())
                    return
                total_bytes = 0
                with open(output_path, "wb") as f:
                    for chunk in resp.iter_bytes():
                        f.write(chunk)
                        total_bytes += len(chunk)
                print(f"Streamed {total_bytes} bytes to: {output_path}")
    else:
        with httpx.Client(timeout=300.0) as client:
            response = client.post(api_url, json=payload, headers=headers)

        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.text)
            return

        try:
            text = response.content.decode("utf-8")
            if text.startswith('{"error"'):
                print(f"Error: {text}")
                return
        except UnicodeDecodeError:
            pass

        output_path = args.output or "output.wav"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Audio saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="CosyVoice3 TTS client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default=DEFAULT_MODEL, help="Model name")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument(
        "--ref-audio",
        default=DEFAULT_REF_AUDIO,
        help="Reference audio for voice cloning (path or URL)",
    )
    parser.add_argument(
        "--ref-text",
        default=DEFAULT_REF_TEXT,
        help="Transcript of the reference audio",
    )
    parser.add_argument("--stream", action="store_true", help="Enable streaming (PCM output)")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    args = parser.parse_args()
    run_tts(args)


if __name__ == "__main__":
    main()

fish_speech/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/fish_speech/gradio_demo.py.

fish_speech/run_gradio_demo.sh

#!/bin/bash
# Launch Fish Speech S2 Pro server + Gradio demo together.
#
# Usage:
#   ./run_gradio_demo.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8091 GRADIO_PORT=7860 ./run_gradio_demo.sh

set -e

MODEL="${MODEL:-fishaudio/s2-pro}"
PORT="${PORT:-8091}"
GRADIO_PORT="${GRADIO_PORT:-7860}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

echo "Starting Fish Speech S2 Pro server (port $PORT)..."
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --omni \
    --host 0.0.0.0 \
    --port "$PORT" &
SERVER_PID=$!

cleanup() {
    echo "Stopping server (PID $SERVER_PID)..."
    kill $SERVER_PID 2>/dev/null
    wait $SERVER_PID 2>/dev/null
}
trap cleanup EXIT

# Wait for server to be ready.
echo "Waiting for server to start..."
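for i in $(seq 1 120); do
    if curl -s "http://localhost:$PORT/health" > /dev/null 2>&1; then
        echo "Server ready."
        break
    fi
    # Give up after the last attempt instead of starting the demo against a dead server.
    if [ "$i" -eq 120 ]; then
        echo "Server did not become ready in time." >&2
        exit 1
    fi
    sleep 2
done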

echo "Starting Gradio demo (port $GRADIO_PORT)..."
python "$SCRIPT_DIR/gradio_demo.py" \
    --api-base "http://localhost:$PORT" \
    --port "$GRADIO_PORT"
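
The readiness loop in the script above polls `/health` up to 120 times at 2-second intervals. The same pattern can be sketched in Python for clients that start the server themselves. This is a minimal sketch, not part of the example directory: `http_ok` and `wait_for_health` are hypothetical helper names, and unlike `curl -s` the probe only counts an HTTP 200 as ready.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_health(
    url: str,
    attempts: int = 120,
    delay_s: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until `probe` succeeds; return False if every attempt fails."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay_s)
    return False
```

A caller would typically run `wait_for_health("http://localhost:8091/health")` and abort when it returns `False`. The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.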

fish_speech/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for Fish Speech S2 Pro
#
# Usage:
#   ./run_server.sh
#   CUDA_VISIBLE_DEVICES=0 ./run_server.sh

set -e

MODEL="${MODEL:-fishaudio/s2-pro}"
PORT="${PORT:-8091}"

echo "Starting Fish Speech S2 Pro server with model: $MODEL"

FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --omni \
    --host 0.0.0.0 \
    --port "$PORT"

fish_speech/speech_client.py

"""Client for Fish Speech S2 Pro via /v1/audio/speech endpoint.

Examples:
    # Basic TTS
    python speech_client.py --text "Hello, how are you?"

    # Voice cloning
    python speech_client.py --text "Hello, how are you?" \
        --ref-audio ref.wav --ref-text "This is the reference transcript."

    # Streaming PCM output
    python speech_client.py --text "Hello world" --stream --output output.pcm
"""

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "flac": "audio/flac", "ogg": "audio/ogg"}
    mime_type = mime_map.get(ext, "audio/wav")
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"


def run_tts(args) -> None:
    """Generate speech via /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }

    # Voice cloning parameters.
    if args.ref_audio:
        if args.ref_audio.startswith(("http://", "https://")):
            payload["ref_audio"] = args.ref_audio
        else:
            payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    if args.ref_text:
        payload["ref_text"] = args.ref_text

    if args.stream:
        payload["stream"] = True
        payload["stream_format"] = "audio"
        payload["response_format"] = "pcm"

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    if args.ref_audio:
        print(f"Voice cloning: ref_audio={args.ref_audio}, ref_text={args.ref_text}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    if args.stream:
        output_path = args.output or "output.pcm"
        with httpx.Client(timeout=300.0) as client:
            with client.stream("POST", api_url, json=payload, headers=headers) as resp:
                if resp.status_code != 200:
                    print(f"Error: {resp.status_code}")
                    print(resp.read().decode())
                    return
                total_bytes = 0
                with open(output_path, "wb") as f:
                    for chunk in resp.iter_bytes():
                        f.write(chunk)
                        total_bytes += len(chunk)
                print(f"Streamed {total_bytes} bytes to: {output_path}")
    else:
        with httpx.Client(timeout=300.0) as client:
            response = client.post(api_url, json=payload, headers=headers)

        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.text)
            return

        try:
            text = response.content.decode("utf-8")
            if text.startswith('{"error"'):
                print(f"Error: {text}")
                return
        except UnicodeDecodeError:
            pass

        output_path = args.output or "output.wav"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Audio saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="Fish Speech S2 Pro TTS client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default="fishaudio/s2-pro", help="Model name")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument("--ref-audio", default=None, help="Reference audio for voice cloning (path or URL)")
    parser.add_argument("--ref-text", default=None, help="Transcript of reference audio")
    parser.add_argument("--stream", action="store_true", help="Enable streaming (PCM output)")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    args = parser.parse_args()
    run_tts(args)


if __name__ == "__main__":
    main()

glm_tts/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/glm_tts/gradio_demo.py.

glm_tts/openai_speech_client.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""OpenAI-compatible client for GLM-TTS via /v1/audio/speech endpoint.

GLM-TTS is a two-stage TTS system (AR + DiT) that generates audio from text
conditioned on reference speech. Each request requires ref_audio + ref_text.

Usage:
    # Voice cloning
    python openai_speech_client.py --text "你好" --ref-audio file:///path/to/ref.wav --ref-text "参考文本"

    # Streaming response, for async_chunk server mode
    python openai_speech_client.py --text "你好" --stream --ref-audio file:///path/to/ref.wav --ref-text "参考文本"

    # Specify output format
    python openai_speech_client.py --text "你好" --ref-audio file:///path/to/ref.wav \
        --ref-text "参考文本" --response-format mp3 -o output.mp3
"""

import argparse

import httpx

# Default server configuration
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def run_tts_generation(args) -> None:
    """Run TTS generation via OpenAI-compatible /v1/audio/speech API."""
    if not args.ref_audio or not args.ref_text:
        raise ValueError("GLM-TTS requires --ref-audio and --ref-text for voice cloning.")

    payload = {
        "model": args.model,
        "voice": "default",
        "input": args.text,
        "response_format": args.response_format,
        "stream": bool(args.stream),
        "ref_audio": args.ref_audio,
        "ref_text": args.ref_text,
    }
    if args.stream:
        payload["stream_format"] = "audio"
        payload["response_format"] = "pcm"
    if args.max_new_tokens:
        payload["max_new_tokens"] = args.max_new_tokens

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print(f"Voice cloning: ref_audio={args.ref_audio}, ref_text={args.ref_text}")
    print(f"Stream: {args.stream}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    if args.stream:
        output_path = args.output or "tts_output.pcm"
        with httpx.Client(timeout=300.0) as client, open(output_path, "wb") as f:
            with client.stream("POST", api_url, json=payload, headers=headers) as response:
                if response.status_code != 200:
                    print(f"Error: {response.status_code}")
                    response.read()
                    print(response.text)
                    return
                for chunk in response.iter_bytes():
                    f.write(chunk)
        print(f"Streaming audio saved to: {output_path}")
    else:
        with httpx.Client(timeout=300.0) as client:
            response = client.post(api_url, json=payload, headers=headers)
        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.text)
            return
        try:
            text = response.content.decode("utf-8")
            if text.startswith('{"error"'):
                print(f"Error: {text}")
                return
        except UnicodeDecodeError:
            pass
        output_path = args.output or f"tts_output.{args.response_format}"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Audio saved to: {output_path}")


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="OpenAI-compatible client for GLM-TTS via /v1/audio/speech",
    )

    # Server configuration
    parser.add_argument(
        "--api-base",
        type=str,
        default=DEFAULT_API_BASE,
        help=f"API base URL (default: {DEFAULT_API_BASE})",
    )
    parser.add_argument(
        "--api-key",
        type=str,
        default=DEFAULT_API_KEY,
        help="API key (default: EMPTY)",
    )
    parser.add_argument(
        "--model",
        "-m",
        type=str,
        default="glm-tts",
        help="Model name/path",
    )

    # Input text
    parser.add_argument(
        "--text",
        type=str,
        required=True,
        help="Text to synthesize",
    )

    # Generation parameters
    parser.add_argument(
        "--max-new-tokens",
        type=int,
        default=None,
        help="Maximum new tokens to generate (default: model default)",
    )

    # Output
    parser.add_argument(
        "--response-format",
        type=str,
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio output format (default: wav)",
    )
    parser.add_argument(
        "--stream",
        action="store_true",
        help="Request a streaming audio response (use with async_chunk server mode).",
    )
    parser.add_argument(
        "--output",
        "-o",
        type=str,
        default=None,
        help="Output audio file path (default: tts_output.<format>)",
    )

    # Voice cloning parameters
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio URL, file:// URI, or base64 data URL for voice cloning (required)",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcript of the reference audio (required)",
    )

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    run_tts_generation(args)
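
The usage examples in the docstring of the client above pass `--ref-audio` as a `file://` URI. A minimal sketch for deriving one from a local path follows. `to_file_uri` is a hypothetical helper, and it is an assumption that the server resolves `file://` URIs on its own filesystem, in which case the path must exist on the server host.

```python
from pathlib import Path


def to_file_uri(path: str) -> str:
    """Turn a local path into an absolute, percent-encoded file:// URI."""
    # as_uri() requires an absolute path, so resolve relative paths first.
    return Path(path).expanduser().resolve().as_uri()
```

The result can be passed directly, for example `--ref-audio "$(python -c '...')"` or by building the command line in Python. Spaces and non-ASCII characters in the path are percent-encoded by `as_uri()`.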

glm_tts/run_gradio_demo.sh

#!/bin/bash
# Launch GLM-TTS server + Gradio demo together.
#
# Usage:
#   ./run_gradio_demo.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8091 GRADIO_PORT=7860 ./run_gradio_demo.sh

set -e

MODEL="${MODEL:-zai-org/GLM-TTS}"
PORT="${PORT:-8091}"
GRADIO_PORT="${GRADIO_PORT:-7860}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"

echo "Starting GLM-TTS server (port $PORT)..."
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm-omni serve "$MODEL" \
    --deploy-config "$REPO_ROOT/vllm_omni/deploy/glm_tts.yaml" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --omni &
SERVER_PID=$!

cleanup() {
    echo "Stopping server (PID $SERVER_PID)..."
    kill $SERVER_PID 2>/dev/null
    wait $SERVER_PID 2>/dev/null
}
trap cleanup EXIT

# Wait for server to be ready.
echo "Waiting for server to start..."
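for i in $(seq 1 120); do
    if curl -s "http://localhost:$PORT/health" > /dev/null 2>&1; then
        echo "Server ready."
        break
    fi
    # Give up after the last attempt instead of starting the demo against a dead server.
    if [ "$i" -eq 120 ]; then
        echo "Server did not become ready in time." >&2
        exit 1
    fi
    sleep 2
done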

echo "Starting Gradio demo (port $GRADIO_PORT)..."
python "$SCRIPT_DIR/gradio_demo.py" \
    --api-base "http://localhost:$PORT" \
    --port "$GRADIO_PORT"

glm_tts/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for GLM-TTS models
#
# Usage:
#   ./run_server.sh                           # Default model path, async_chunk mode
#   ./run_server.sh /path/to/GLM-TTS          # Custom model path, async_chunk mode
#   ./run_server.sh /path/to/GLM-TTS sync     # Sync two-stage mode
#
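# NOTE: The model path should point to the repo ROOT (not the llm/ subdirectory);
# model_subdir/tokenizer_subdir in the pipeline config resolve the subdirectories.
# --deploy-config below is a relative path, so run this script from the
# vLLM-Omni repository root.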

set -e

MODEL="${1:-zai-org/GLM-TTS}"
MODE="${2:-async}"

EXTRA_ARGS=()
case "$MODE" in
    async|async_chunk)
        ;;
    sync|no_async_chunk)
        EXTRA_ARGS+=("--no-async-chunk")
        ;;
    *)
        echo "Unknown mode: $MODE (expected async or sync)" >&2
        exit 1
        ;;
esac

echo "Starting GLM-TTS server with model: $MODEL (mode: $MODE)"

vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/glm_tts.yaml \
    --host 0.0.0.0 \
    --port 8091 \
    --trust-remote-code \
    --omni \
    "${EXTRA_ARGS[@]}"

higgs_audio_v2/README.md

higgs-audio v2 online example

This directory contains the online-serving entry points for boson-ai's higgs-audio v2 as integrated by vllm-omni: a 2-stage TTS pipeline (Llama-3.2-3B talker with DualFFN audio expert + HiggsAudio codec decoder) emitting 24 kHz mono speech.

Prerequisites

Voice cloning uses Hugging Face's HiggsAudioV2TokenizerModel, loaded from the audio_tokenizer/ subdirectory of k2-fsa/OmniVoice. (The model.safetensors in boson-ai's standalone tokenizer Hub repo is the 3B talker LM, not the codec.) Only that ~806 MB subdirectory is downloaded.

pip install -U "transformers>=5.3.0"

Files

run_server.sh — launch the vllm-omni server with the bundled vllm_omni/deploy/higgs_audio_v2.yaml deploy config.
batch_speech_client.py — send a list of prompts to /v1/audio/speech and save the returned WAV / PCM bytes to a directory; optionally passes --ref-audio + --ref-text for shallow voice clone.

Launching the server

GPUS=6,7 PORT=8094 bash examples/online_serving/text_to_speech/higgs_audio_v2/run_server.sh

Environment overrides:

MODEL — HF id of the talker (default bosonai/higgs-audio-v2-generation-3B-base).
PORT — server port (default 8094).
GPUS — CUDA_VISIBLE_DEVICES value (default 6,7).
GPU_UTIL — --gpu-memory-utilization (default 0.4).

The script also sets VLLM_USE_DEEP_GEMM=0 and VLLM_MOE_USE_DEEP_GEMM=0 for the server process, so the example works on images without the optional deep_gemm backend.

The deploy YAML ships with async_chunk: false and codec_streaming: true, i.e. Stage 0 finishes its codec frames before Stage 1 starts decoding, and Stage 1 streams WAV/PCM bytes to the client chunk-by-chunk.
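
When PCM is requested, the returned bytes carry no header, so most players cannot open them directly. A minimal sketch for wrapping raw PCM in a WAV container with the standard library is shown here. The 24 kHz mono defaults come from the pipeline description above; the 16-bit sample width is an assumption that this README does not confirm, and `pcm_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav(
    pcm: bytes,
    sample_rate: int = 24000,  # 24 kHz, per the pipeline description
    channels: int = 1,         # mono
    sample_width: int = 2,     # bytes per sample; 16-bit is an assumption
) -> bytes:
    """Wrap headerless PCM samples in a WAV container and return the bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

If the audio plays back too fast, too slow, or as noise, the sample rate or width assumed here probably does not match what the server emitted.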

Driving the server

Plain TTS:

python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
    --base-url http://localhost:8094 \
    --model bosonai/higgs-audio-v2-generation-3B-base \
    --output-dir /tmp/higgs_audio_v2_batch \
    --prompts "Hello world." \
              "The quick brown fox jumps over the lazy dog."

Voice clone — pass a reference clip and its transcript (both required together):

python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
    --base-url http://localhost:8094 \
    --model bosonai/higgs-audio-v2-generation-3B-base \
    --output-dir /tmp/higgs_audio_v2_clone \
    --ref-audio /path/to/reference.wav \
    --ref-text  "Exact transcript spoken in reference.wav." \
    --prompts "Hello, this is a cloned voice."
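
The request body behind these commands can be sketched as follows. The field names (`model`, `input`, `response_format`, `ref_audio`, `ref_text`) are the ones batch_speech_client.py sends; `build_speech_payload` is a hypothetical helper that takes the reference clip as WAV bytes and enforces the rule that the clip and its transcript are supplied together.

```python
from __future__ import annotations

import base64


def build_speech_payload(
    model: str,
    text: str,
    ref_wav: bytes | None = None,
    ref_text: str | None = None,
) -> dict:
    """Build a /v1/audio/speech JSON body, optionally with voice-clone fields."""
    if (ref_wav is None) != (ref_text is None):
        raise ValueError("ref_wav and ref_text must be supplied together")
    payload = {"model": model, "input": text, "response_format": "wav"}
    if ref_wav is not None:
        # The reference clip travels inline as a base64 data URL.
        b64 = base64.b64encode(ref_wav).decode("ascii")
        payload["ref_audio"] = f"data:audio/wav;base64,{b64}"
        payload["ref_text"] = ref_text
    return payload
```

The returned dict can be posted as JSON to `/v1/audio/speech` with any HTTP client.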

Notes

--ref-text must be the real transcript of --ref-audio; mismatched text degrades cloned-voice quality.
The following are out of scope and are rejected by the request validator with an explicit 4xx: multi-speaker [SPEAKERn] tags inside input; profile: text-only speaker descriptions; the ref_audio_in_system_message system-block variant; chunked long-form generation; and the per-request fields voice, instructions, task_type, language, speed (any value other than 1.0), x_vector_only_mode, and speaker_embedding.

higgs_audio_v2/batch_speech_client.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Batch client for the higgs-audio v2 online server.

Sends a fixed list of prompts to ``/v1/audio/speech`` and saves the returned
WAV files (or raw PCM bytes when ``--format pcm``) into ``--output-dir``.

Usage (plain text -> speech):

  python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
      --base-url http://localhost:8094 \
      --output-dir /tmp/higgs_audio_v2_batch \
      --prompts "Hello world." "The quick brown fox jumps over the lazy dog."

Usage (shallow voice clone — pass a reference clip + its transcript):

  python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
      --base-url http://localhost:8094 \
      --output-dir /tmp/higgs_audio_v2_clone \
      --ref-audio path/to/reference.wav \
      --ref-text "the transcript of the reference clip" \
      --prompts "Hello world."
"""

from __future__ import annotations

import argparse
import base64
import sys
from pathlib import Path

DEFAULT_PROMPTS = (
    "Hello world.",
    "The quick brown fox jumps over the lazy dog.",
    "It was the night before my birthday.",
    "Innovation distinguishes between a leader and a follower.",
)


def _slug(text: str) -> str:
    import re

    s = re.sub(r"\s+", "_", text.strip().lower())
    return re.sub(r"[^a-z0-9_]+", "", s)[:32] or "prompt"


def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("--base-url", default="http://localhost:8094")
    parser.add_argument("--model", default="higgs_audio_v2")
    parser.add_argument("--prompts", nargs="+", default=list(DEFAULT_PROMPTS))
    parser.add_argument("--output-dir", type=Path, default=Path("/tmp/higgs_audio_v2_batch"))
    parser.add_argument("--format", choices=("wav", "pcm"), default="wav")
    parser.add_argument("--max-new-tokens", type=int, default=300)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--timeout-s", type=float, default=120.0)
    parser.add_argument(
        "--ref-audio",
        type=Path,
        default=None,
        help="Reference clip for voice clone (path to a WAV file). Must be paired with --ref-text.",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcript of the reference clip. Required when --ref-audio is set.",
    )
    args = parser.parse_args()

    if (args.ref_audio is None) != (args.ref_text is None):
        print("--ref-audio and --ref-text must be supplied together", file=sys.stderr)
        return 2

    ref_audio_data_url: str | None = None
    if args.ref_audio is not None:
        if not args.ref_audio.exists():
            print(f"ref-audio file not found: {args.ref_audio}", file=sys.stderr)
            return 2
        mime = "audio/wav" if args.ref_audio.suffix.lower() == ".wav" else "audio/mpeg"
        ref_b64 = base64.b64encode(args.ref_audio.read_bytes()).decode("ascii")
        ref_audio_data_url = f"data:{mime};base64,{ref_b64}"

    try:
        import httpx
    except ImportError:
        print(
            "this client needs `httpx`. Install with `pip install httpx`.",
            file=sys.stderr,
        )
        return 2

    args.output_dir.mkdir(parents=True, exist_ok=True)
    url = args.base_url.rstrip("/") + "/v1/audio/speech"
    failures = 0
    with httpx.Client(timeout=args.timeout_s) as client:
        for prompt in args.prompts:
            payload = {
                "model": args.model,
                "input": prompt,
                "response_format": args.format,
                "max_new_tokens": args.max_new_tokens,
                "seed": args.seed,
            }
            if ref_audio_data_url is not None:
                payload["ref_audio"] = ref_audio_data_url
                payload["ref_text"] = args.ref_text
            resp = client.post(url, json=payload)
            if resp.status_code != 200:
                print(f"[FAIL] {prompt!r} -> {resp.status_code}: {resp.text[:200]}", file=sys.stderr)
                failures += 1
                continue
            suffix = ".wav" if args.format == "wav" else ".pcm"
            out = args.output_dir / f"{_slug(prompt)}{suffix}"
            out.write_bytes(resp.content)
            print(f"[ ok ] {prompt!r} -> {out} ({len(resp.content)} bytes)")

    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())

higgs_audio_v2/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for higgs-audio v2.
#
# Scope: plain text -> 24 kHz speech, plus shallow voice cloning via
# ref_audio + ref_text. Multi-speaker tags, ChatML rich content, and language
# overrides are rejected by the validator with explicit 4xx
# (see vllm_omni/entrypoints/openai/serving_speech.py).
#
# Usage:
#   ./run_server.sh                 # default port 8094, GPUs 6 and 7
#   PORT=8095 GPUS=6,7 ./run_server.sh
#   MODEL=bosonai/higgs-audio-v2-generation-3B-base ./run_server.sh

set -e

MODEL="${MODEL:-bosonai/higgs-audio-v2-generation-3B-base}"
PORT="${PORT:-8094}"
GPUS="${GPUS:-6,7}"
GPU_UTIL="${GPU_UTIL:-0.4}"

echo "Starting higgs-audio v2 server"
echo "  MODEL=$MODEL"
echo "  PORT=$PORT"
echo "  CUDA_VISIBLE_DEVICES=$GPUS"

# DeepGEMM FP8 kernels are optional and trip warmup on builds without
# the deep_gemm backend; disable them so the example works out of the box.
# Users with deep_gemm installed can re-enable via the same env vars.
CUDA_VISIBLE_DEVICES="$GPUS" \
VLLM_USE_DEEP_GEMM=0 \
VLLM_MOE_USE_DEEP_GEMM=0 \
vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/higgs_audio_v2.yaml \
    --host 0.0.0.0 \
    --port "$PORT" \
    --gpu-memory-utilization "$GPU_UTIL" \
    --trust-remote-code \
    --omni

higgs_audio_v3/README.md

Higgs-Audio V3 Online Serving

Start the server

# Default: GPU 0, port 8095
./examples/online_serving/text_to_speech/higgs_audio_v3/run_server.sh

# Custom GPU / port
PORT=8096 GPUS=0,1 ./examples/online_serving/text_to_speech/higgs_audio_v3/run_server.sh

Plain text TTS

python examples/online_serving/text_to_speech/higgs_audio_v3/batch_speech_client.py \
    --base-url http://localhost:8095 \
    --output-dir /tmp/higgs_v3_batch \
    --prompts "Hello world." "The quick brown fox jumps over the lazy dog."

Voice clone

python examples/online_serving/text_to_speech/higgs_audio_v3/batch_speech_client.py \
    --base-url http://localhost:8095 \
    --output-dir /tmp/higgs_v3_clone \
    --ref-audio path/to/reference.wav \
    --ref-text "transcript of the reference clip" \
    --prompts "Text to synthesize in the cloned voice."

curl example

curl -X POST http://localhost:8095/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model": "higgs_audio_v3", "input": "Hello world."}' \
    --output hello.wav
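
The same request can be issued from Python with only the standard library. This is a minimal sketch mirroring the curl call above: `build_speech_request` and `save_speech` are hypothetical helper names, and the URL, model name, and JSON fields are taken from that call.

```python
import json
import urllib.request


def build_speech_request(base_url: str, model: str, text: str) -> urllib.request.Request:
    """Build the POST /v1/audio/speech request without sending it."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(req: urllib.request.Request, out_path: str, timeout: float = 120.0) -> int:
    """Send the request and write the audio bytes to `out_path`; return the byte count."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the server running, `save_speech(build_speech_request("http://localhost:8095", "higgs_audio_v3", "Hello world."), "hello.wav")` writes the same file as the curl command. A 4xx or 5xx response surfaces as `urllib.error.HTTPError`.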

higgs_audio_v3/batch_speech_client.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Batch client for the higgs-audio v3 online server.

Sends prompts to ``/v1/audio/speech`` and saves the returned WAV files.

Usage (plain text -> speech):

  python examples/online_serving/text_to_speech/higgs_audio_v3/batch_speech_client.py \
      --base-url http://localhost:8095 \
      --output-dir /tmp/higgs_v3_batch \
      --prompts "Hello world." "The quick brown fox jumps over the lazy dog."

Usage (voice clone):

  python examples/online_serving/text_to_speech/higgs_audio_v3/batch_speech_client.py \
      --base-url http://localhost:8095 \
      --output-dir /tmp/higgs_v3_clone \
      --ref-audio path/to/reference.wav \
      --ref-text "the transcript of the reference clip" \
      --prompts "Hello world."
"""

from __future__ import annotations

import argparse
import base64
import sys
from pathlib import Path

DEFAULT_PROMPTS = (
    "Hello world.",
    "The quick brown fox jumps over the lazy dog.",
    "Today is a beautiful day for a walk in the park.",
    "Innovation distinguishes between a leader and a follower.",
)


def _slug(text: str) -> str:
    import re

    s = re.sub(r"\s+", "_", text.strip().lower())
    return re.sub(r"[^a-z0-9_]+", "", s)[:32] or "prompt"


def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("--base-url", default="http://localhost:8095")
    parser.add_argument("--model", default="higgs_audio_v3")
    parser.add_argument("--prompts", nargs="+", default=list(DEFAULT_PROMPTS))
    parser.add_argument("--output-dir", type=Path, default=Path("/tmp/higgs_v3_batch"))
    parser.add_argument("--format", choices=("wav", "pcm"), default="wav")
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--timeout-s", type=float, default=120.0)
    parser.add_argument(
        "--ref-audio",
        type=Path,
        default=None,
        help="Reference clip for voice clone (WAV/FLAC/MP3 path). Optionally pair with --ref-text.",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcript of the reference clip. Optional but improves fidelity.",
    )
    args = parser.parse_args()

    ref_audio_data_url: str | None = None
    if args.ref_audio is not None:
        if not args.ref_audio.exists():
            print(f"ref-audio file not found: {args.ref_audio}", file=sys.stderr)
            return 2
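        # Map the extension to a MIME type; anything else falls back to audio/mpeg as before.
        mime = {".wav": "audio/wav", ".flac": "audio/flac"}.get(args.ref_audio.suffix.lower(), "audio/mpeg")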
        ref_b64 = base64.b64encode(args.ref_audio.read_bytes()).decode("ascii")
        ref_audio_data_url = f"data:{mime};base64,{ref_b64}"

    try:
        import httpx
    except ImportError:
        print("this client needs `httpx`. Install with `pip install httpx`.", file=sys.stderr)
        return 2

    args.output_dir.mkdir(parents=True, exist_ok=True)
    url = args.base_url.rstrip("/") + "/v1/audio/speech"
    failures = 0
    with httpx.Client(timeout=args.timeout_s) as client:
        for prompt in args.prompts:
            payload = {
                "model": args.model,
                "input": prompt,
                "response_format": args.format,
                "max_new_tokens": args.max_new_tokens,
                "seed": args.seed,
            }
            if ref_audio_data_url is not None:
                payload["ref_audio"] = ref_audio_data_url
                if args.ref_text:
                    payload["ref_text"] = args.ref_text
            resp = client.post(url, json=payload)
            if resp.status_code != 200:
                print(f"[FAIL] {prompt!r} -> {resp.status_code}: {resp.text[:200]}", file=sys.stderr)
                failures += 1
                continue
            suffix = ".wav" if args.format == "wav" else ".pcm"
            out = args.output_dir / f"{_slug(prompt)}{suffix}"
            out.write_bytes(resp.content)
            print(f"[ ok ] {prompt!r} -> {out} ({len(resp.content)} bytes)")

    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())

higgs_audio_v3/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for higgs-audio v3.
#
# Supports plain text TTS and voice cloning via /v1/audio/speech.
#
# Usage:
#   ./run_server.sh                 # default port 8095, GPU 0
#   PORT=8096 GPUS=0,1 ./run_server.sh
#   MODEL=/path/to/local/checkpoint ./run_server.sh

set -e

MODEL="${MODEL:-bosonai/higgs-audio-v3-tts-4b}"
PORT="${PORT:-8095}"
GPUS="${GPUS:-0}"
GPU_UTIL="${GPU_UTIL:-0.6}"

echo "Starting higgs-audio v3 server"
echo "  MODEL=$MODEL"
echo "  PORT=$PORT"
echo "  CUDA_VISIBLE_DEVICES=$GPUS"

CUDA_VISIBLE_DEVICES="$GPUS" \
VLLM_USE_DEEP_GEMM=0 \
VLLM_MOE_USE_DEEP_GEMM=0 \
vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/higgs_multimodal_qwen3.yaml \
    --host 0.0.0.0 \
    --port "$PORT" \
    --gpu-memory-utilization "$GPU_UTIL" \
    --trust-remote-code \
    --omni

indextts2/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for IndexTTS2
#
# Usage from repository root:
#   examples/online_serving/text_to_speech/indextts2/run_server.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8092 MODEL=/path/to/IndexTeam/IndexTTS-2 examples/online_serving/text_to_speech/indextts2/run_server.sh

set -e

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
ROOT_DIR="$(cd -- "$SCRIPT_DIR/../../../.." && pwd)"

MODEL="${MODEL:-IndexTeam/IndexTTS-2}"
PORT="${PORT:-8092}"
DEPLOY_CONFIG="${DEPLOY_CONFIG:-$ROOT_DIR/vllm_omni/deploy/indextts2.yaml}"

echo "Starting IndexTTS2 server with model: $MODEL"

FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --omni \
    --trust-remote-code \
    --deploy-config "$DEPLOY_CONFIG"

indextts2/speech_client.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""OpenAI-compatible client for IndexTTS2 TTS via /v1/audio/speech endpoint.

Examples:
    # With reference audio for voice cloning
    python speech_client.py --text "你好，世界！" \
        --ref-audio /path/to/reference.wav

    # With emotion audio
    python speech_client.py --text "今天心情很好！" \
        --ref-audio /path/to/ref.wav \
        --emo-audio /path/to/happy.wav

Server setup:
    vllm serve IndexTeam/IndexTTS-2 --omni --host 0.0.0.0 --port 8092
"""

from __future__ import annotations

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8092"
DEFAULT_API_KEY = "sk-empty"


def encode_audio_to_base64(audio_path: str) -> str:
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime = {"wav": "audio/wav", "mp3": "audio/mpeg", "flac": "audio/flac"}.get(ext, "audio/wav")
    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def main() -> None:
    parser = argparse.ArgumentParser(description="IndexTTS2 OpenAI speech client")
    parser.add_argument("--text", type=str, required=True)
    parser.add_argument("--ref-audio", type=str, default=None, help="Reference audio for voice cloning")
    parser.add_argument("--emo-audio", type=str, default=None, help="Emotion reference audio")
    parser.add_argument("--emo-text", type=str, default=None, help="Emotion description text")
    parser.add_argument(
        "--emo-vector",
        type=float,
        nargs=8,
        default=None,
        help="8-dim emotion vector: happy angry sad afraid disgusted melancholic surprised calm",
    )
    parser.add_argument("--emo-alpha", type=float, default=None, help="Emotion weight in [0, 1]")
    parser.add_argument("--use-emo-text", action="store_true", help="Infer emotion vector from emo-text or text")
    parser.add_argument("--use-random", action="store_true", help="Use random emotion prototypes")
    parser.add_argument("--model", type=str, default="IndexTeam/IndexTTS-2")
    parser.add_argument("--voice", type=str, default=None, help="Uploaded voice name to use instead of --ref-audio")
    parser.add_argument("--output", type=str, default="output.wav")
    parser.add_argument("--api-base", type=str, default=DEFAULT_API_BASE)
    parser.add_argument("--api-key", type=str, default=DEFAULT_API_KEY)
    parser.add_argument("--response-format", type=str, default="wav")
    args = parser.parse_args()

    if not args.ref_audio and not args.voice:
        parser.error("IndexTTS2 requires --ref-audio or --voice for voice cloning")

    payload: dict = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }
    if args.voice:
        payload["voice"] = args.voice

    if args.ref_audio:
        ref = args.ref_audio
        if ref.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = ref
        else:
            payload["ref_audio"] = encode_audio_to_base64(ref)

    extra_params = {}
    if args.emo_audio:
        emo = args.emo_audio
        if emo.startswith(("http://", "https://", "data:")):
            extra_params["emo_audio"] = emo
        else:
            extra_params["emo_audio"] = encode_audio_to_base64(emo)
    if args.emo_text:
        extra_params["emo_text"] = args.emo_text
    if args.emo_vector is not None:
        extra_params["emo_vector"] = args.emo_vector
    if args.emo_alpha is not None:
        extra_params["emo_alpha"] = args.emo_alpha
    if args.use_emo_text:
        extra_params["use_emo_text"] = True
    if args.use_random:
        extra_params["use_random"] = True
    if extra_params:
        payload["extra_params"] = extra_params

    url = f"{args.api_base}/v1/audio/speech"
    print(f"POST {url}")
    print(f"  text: {args.text}")
    if args.ref_audio:
        print(f"  ref_audio: {args.ref_audio[:80]}...")

    with httpx.Client(timeout=300) as client:
        resp = client.post(
            url,
            json=payload,
            headers={"Authorization": f"Bearer {args.api_key}"},
        )

    if resp.status_code != 200:
        print(f"Error {resp.status_code}: {resp.text[:500]}")
        return

    with open(args.output, "wb") as f:
        f.write(resp.content)
    print(f"Saved: {args.output} ({len(resp.content):,} bytes)")


if __name__ == "__main__":
    main()

ming_flash_omni_tts/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for Ming-flash-omni-2.0 standalone talker (TTS).
#
# Usage:
#   ./run_server.sh
#   MODEL=/path/to/local/model ./run_server.sh
#   PORT=8091 ./run_server.sh
#   HOST=127.0.0.1 ./run_server.sh   # bind only to loopback

set -e

MODEL="${MODEL:-Jonathan1909/Ming-flash-omni-2.0}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-8091}"
DEPLOY_CONFIG="${DEPLOY_CONFIG:-vllm_omni/deploy/ming_flash_omni_tts.yaml}"

echo "Starting Ming standalone TTS server with model: $MODEL"
echo "Deploy config: $DEPLOY_CONFIG"

vllm serve "$MODEL" \
    --deploy-config "$DEPLOY_CONFIG" \
    --host "$HOST" \
    --port "$PORT" \
    --trust-remote-code \
    --omni

ming_flash_omni_tts/speech_client.py

"""Client for Ming standalone TTS via /v1/audio/speech endpoint."""

import argparse
import json
import sys

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
DEFAULT_MODEL = "Jonathan1909/Ming-flash-omni-2.0"


def run_tts(args) -> None:
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }

    instructions = args.instructions
    if args.instruction_json:
        if instructions:
            sys.exit("--instructions and --instruction-json are mutually exclusive")

        try:
            parsed = json.loads(args.instruction_json)
        except json.JSONDecodeError as exc:
            sys.exit(f"--instruction-json must be valid JSON: {exc}")
        if not isinstance(parsed, dict):
            sys.exit("--instruction-json must decode to a JSON object")
        # Re-encode with ensure_ascii=False so UTF-8 Chinese keys/values
        # arrive at the server intact rather than as \uXXXX escapes.
        instructions = json.dumps(parsed, ensure_ascii=False)
    if instructions:
        payload["instructions"] = instructions

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    with httpx.Client(timeout=300.0) as client:
        response = client.post(api_url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return

    output_path = args.output or "ming_tts_output.wav"
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Audio saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="Ming standalone TTS speech client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default=DEFAULT_MODEL, help="Model name or local path")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    parser.add_argument(
        "--instructions",
        default=None,
        help="Free-form style description (mapped to caption 风格 on the server).",
    )
    parser.add_argument(
        "--instruction-json",
        default=None,
        help=(
            "Structured caption JSON forwarded as `instructions`. Accepts Ming "
            "caption keys: 方言, 风格, 语速, 基频, 音量, 情感, IP, 说话人, BGM."
        ),
    )
    args = parser.parse_args()
    run_tts(args)


if __name__ == "__main__":
    main()

ming_tts/README.md

Ming-omni-tts Online Serving¶

Serve the dense inclusionAI/Ming-omni-tts-0.5B two-stage TTS model through the OpenAI-compatible /v1/audio/speech endpoint.

Start Server¶

vllm-omni serve inclusionAI/Ming-omni-tts-0.5B \
    --deploy-config vllm_omni/deploy/ming_tts.yaml \
    --omni \
    --port 8091 \
    --enforce-eager

Or:

cd examples/online_serving/text_to_speech/ming_tts
./run_server.sh

The tested ROCm environment is summarized in the Ming recipe.

Send Requests¶

The Python client targets http://localhost:8091/v1 with api_key=EMPTY; it does not call OpenAI's hosted API.

python openai_speech_client.py \
    --text "你好，这是 Ming 在线语音合成测试。" \
    --max-new-tokens 200

Style or dialect controls can be plain text or Ming JSON. The upstream dialect example also uses yue_prompt.wav for speaker conditioning:

python openai_speech_client.py \
    --text "我觉得社会企业同个人都有责任" \
    --instruction-json '{"方言":"广粤话"}' \
    --ref-audio /path/to/yue_prompt.wav \
    --max-new-tokens 200

When --ref-audio is supplied without --ref-text, the server extracts the Ming speaker embedding, matching upstream use_spk_emb=True, without using the audio as a zero-shot prompt.
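
The rule above can be restated as a small decision function. This is only an illustration of the documented behaviour, not server code; `conditioning_mode` and the labels it returns are hypothetical names chosen for this sketch.

```python
def conditioning_mode(payload):
    """Label how a request's reference fields would be used, per the rule above."""
    has_audio = bool(payload.get("ref_audio"))
    has_text = bool(payload.get("ref_text"))
    if has_audio and has_text:
        # Audio plus transcript: the audio also serves as the zero-shot prompt.
        return "zero_shot_prompt"
    if has_audio:
        # Audio alone: only the speaker embedding is extracted from it.
        return "speaker_embedding_only"
    # No reference audio, so there is nothing to condition on.
    return "no_reference"


print(conditioning_mode({"ref_audio": "data:audio/wav;base64,AAAA"}))
```

Applied to the dialect example above, which passes `--ref-audio` without `--ref-text`, the function returns the speaker-embedding label.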

Reference-audio cloning:

python openai_speech_client.py \
    --text "我们的愿景是构建未来服务业的数字化基础设施，为世界带来更多微小而美好的改变。" \
    --ref-audio /path/to/10002287-00000094.wav \
    --ref-text "在此奉劝大家别乱打美白针。" \
    --max-new-tokens 200

Podcast-style multi-speaker prompt:

python openai_speech_client.py \
    --text " speaker_1:你可以说一下，就大概说一下，可能虽然我也不知道，我看过那部电影没有。
 speaker_2:就是那个叫什么，变相一节课的嘛。
 speaker_1:嗯。
 speaker_2:一部搞笑的电影。
 speaker_1:一部搞笑的。" \
    --ref-audio /path/to/CTS-CN-F2F-2019-11-11-423-012-A.wav \
    --ref-audio /path/to/CTS-CN-F2F-2019-11-11-423-012-B.wav \
    --ref-text " speaker_1:并且我们还要进行每个月还要考核 笔试的话还要进行笔试，做个，当服务员还要去笔试了
 speaker_2:对啊，这真的很奇怪，就是 单纯的因，单纯自己工资不高，只是因为可能人家那个店比较出名一点，就对你苛刻要求"

Streaming PCM:

python openai_speech_client.py \
    --text "你好，这是流式输出测试。" \
    --stream \
    --output ming_output.pcm
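
For readers who want to stream without the bundled client, the following is a minimal sketch. The request fields mirror the `stream` mode of run_curl.sh; `stream_request_body` and `save_pcm_stream` are hypothetical helpers, and the commented usage assumes a running server plus the httpx library.

```python
def stream_request_body(model, text):
    """Request body for raw PCM streaming, using the fields from run_curl.sh."""
    return {
        "model": model,
        "input": text,
        "stream": True,
        "stream_format": "audio",
        "response_format": "pcm",
    }


def save_pcm_stream(chunks, output_path):
    """Write an iterable of byte chunks to output_path and return the byte count."""
    total = 0
    with open(output_path, "wb") as f:
        for chunk in chunks:
            if not chunk:
                continue  # skip empty keep-alive chunks
            f.write(chunk)
            total += len(chunk)
    return total


# Usage sketch (needs a live server, so it is left commented out):
# import httpx
# body = stream_request_body("inclusionAI/Ming-omni-tts-0.5B", "你好，这是流式输出测试。")
# with httpx.Client(timeout=300) as client:
#     with client.stream("POST", "http://localhost:8091/v1/audio/speech", json=body) as resp:
#         resp.raise_for_status()
#         save_pcm_stream(resp.iter_bytes(), "ming_output.pcm")
```

Writing chunks as they arrive keeps memory use flat regardless of how long the utterance is.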

run_curl.sh provides a few small smoke checks:

./run_curl.sh basic
REF_AUDIO=/path/to/reference.wav REF_TEXT="在此奉劝大家别乱打美白针。" ./run_curl.sh zero_shot
./run_curl.sh stream

Request Fields¶

Field	Ming meaning
`input`	target text
`instructions`	plain style text, or JSON object for structured Ming controls
`voice`	Ming IP voice label unless it resolves to an uploaded speaker
`language`	Ming `方言` control
`ref_audio`	speaker reference; with `ref_text`, also supplies the prompt waveform
`ref_text`	transcript enabling zero-shot or podcast prompt-latent conditioning
`speaker_embedding`	192-d Ming speaker embedding
`max_new_tokens`	Ming `max_decode_steps`
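
As an illustration of how the fields in the table combine into one request body, here is a minimal sketch. The field names come from the table; the helper name `build_speech_payload`, its validation rules, and the choice to send structured controls as a JSON-encoded string (as the Ming-flash client in this directory does) are assumptions.

```python
import json

# Optional field names taken from the Request Fields table.
OPTIONAL_FIELDS = (
    "voice", "language", "ref_audio", "ref_text",
    "speaker_embedding", "max_new_tokens",
)


def build_speech_payload(text, model="inclusionAI/Ming-omni-tts-0.5B",
                         instructions=None, **fields):
    """Assemble a /v1/audio/speech request body (hypothetical helper)."""
    unknown = set(fields) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    payload = {"model": model, "input": text}

    if isinstance(instructions, dict):
        # Assumption: structured controls travel as a JSON-encoded string.
        # ensure_ascii=False keeps keys such as 方言 readable.
        payload["instructions"] = json.dumps(instructions, ensure_ascii=False)
    elif instructions:
        payload["instructions"] = instructions

    embedding = fields.get("speaker_embedding")
    if embedding is not None and len(embedding) != 192:
        raise ValueError("speaker_embedding must have 192 values")

    # Drop unset fields so the server applies its own defaults.
    payload.update({k: v for k, v in fields.items() if v is not None})
    return payload


payload = build_speech_payload(
    "我觉得社会企业同个人都有责任",
    instructions={"方言": "广粤话"},
    max_new_tokens=200,
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Rejecting unknown keys on the client side surfaces typos in field names before a request is sent.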

Notes¶

ref_audio accepts local paths through the client, remote URLs, file://, or data: URLs.
Non-streaming responses return WAV bytes; streaming responses return PCM.
Music-only BGM generation is offline-only until the API exposes Ming prompt-mode selection.
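
Because streamed PCM has no header, most players cannot open it directly. The sketch below wraps a raw PCM file in a WAV container using Python's standard `wave` module. This README does not state the stream's sample rate, channel count, or sample width, so the rate is a required argument and the mono, 16-bit defaults are assumptions; `pcm_to_wav` is a hypothetical helper.

```python
import wave


def pcm_to_wav(pcm_path, wav_path, sample_rate, channels=1, sample_width=2):
    """Wrap headerless PCM in a WAV container and return the frame count.

    sample_rate is required because this README does not state it; the
    mono, 16-bit defaults are assumptions to adjust for your deployment.
    """
    with open(pcm_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return len(pcm) // (channels * sample_width)
```

For example, `pcm_to_wav("ming_output.pcm", "ming_output.wav", sample_rate=24000)` converts the streaming output, where 24000 is only a placeholder to replace with the rate your deployment actually produces.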

ming_tts/openai_speech_client.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/ming_tts/openai_speech_client.py.

ming_tts/run_curl.sh

#!/bin/bash
set -euo pipefail

MODE="${1:-basic}"
HOST="${HOST:-localhost}"
PORT="${PORT:-8091}"
MODEL="${MODEL:-inclusionAI/Ming-omni-tts-0.5B}"
API_URL="http://${HOST}:${PORT}/v1/audio/speech"
TEXT="${TEXT:-你好，这是 Ming 在线语音合成测试。}"
OUTPUT="${OUTPUT:-ming_output.wav}"
STREAM_OUTPUT="${STREAM_OUTPUT:-ming_output.pcm}"
REF_AUDIO="${REF_AUDIO:-}"
REF_TEXT="${REF_TEXT:-}"

post_json() {
    local payload="$1"
    local output_path="$2"
    curl -X POST "$API_URL" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer EMPTY" \
        -d "$payload" \
        --output "$output_path"
}

case "$MODE" in
    basic)
        post_json "{
            \"model\": \"${MODEL}\",
            \"input\": \"${TEXT}\",
            \"response_format\": \"wav\"
        }" "$OUTPUT"
        ;;
    zero_shot)
        if [ -z "$REF_AUDIO" ] || [ -z "$REF_TEXT" ]; then
            echo "zero_shot requires REF_AUDIO and REF_TEXT" >&2
            exit 1
        fi
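        # MODEL and TEXT are plain shell variables unless the caller exported them,
        # so pass every value the Python snippet reads through its environment.
        MODEL="$MODEL" TEXT="$TEXT" REF_AUDIO="$REF_AUDIO" REF_TEXT="$REF_TEXT" \
        python - <<'PY' > /tmp/ming_zero_shot_payload.json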
import base64
import json
import mimetypes
import os
from pathlib import Path

path = Path(os.environ["REF_AUDIO"])
mime_type = mimetypes.guess_type(path.name)[0] or "audio/wav"
payload = {
    "model": os.environ["MODEL"],
    "input": os.environ["TEXT"],
    "ref_audio": f"data:{mime_type};base64,{base64.b64encode(path.read_bytes()).decode('utf-8')}",
    "ref_text": os.environ["REF_TEXT"],
    "response_format": "wav",
}
print(json.dumps(payload, ensure_ascii=False))
PY
        curl -X POST "$API_URL" \
            -H "Content-Type: application/json" \
            -H "Authorization: Bearer EMPTY" \
            --data-binary @/tmp/ming_zero_shot_payload.json \
            --output "$OUTPUT"
        rm -f /tmp/ming_zero_shot_payload.json
        ;;
    stream)
        post_json "{
            \"model\": \"${MODEL}\",
            \"input\": \"${TEXT}\",
            \"stream\": true,
            \"stream_format\": \"audio\",
            \"response_format\": \"pcm\"
        }" "$STREAM_OUTPUT"
        ;;
    *)
        echo "Unknown mode: $MODE" >&2
        echo "Supported sanity checks: basic, zero_shot, stream" >&2
        exit 1
        ;;
esac

ming_tts/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for Ming-omni-tts.
#
# Usage:
#   ./run_server.sh
#   PORT=8000 ./run_server.sh

set -e

DIR="$(cd "$(dirname "$0")" && pwd)"
ROOT="$(cd "$DIR/../../../.." && pwd)"

MODEL="${MODEL:-inclusionAI/Ming-omni-tts-0.5B}"
PORT="${PORT:-8091}"
DEPLOY_CONFIG="${DEPLOY_CONFIG:-$ROOT/vllm_omni/deploy/ming_tts.yaml}"

echo "Starting Ming-omni-tts server with model: $MODEL"
echo "Deploy config: $DEPLOY_CONFIG"

vllm-omni serve "$MODEL" \
    --deploy-config "$DEPLOY_CONFIG" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --enforce-eager \
    --omni

moss_tts_nano/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/moss_tts_nano/gradio_demo.py.

moss_tts_nano/run_gradio_demo.sh

#!/bin/bash
# Launch MOSS-TTS-Nano server + Gradio demo together.
#
# Usage:
#   ./run_gradio_demo.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8091 GRADIO_PORT=7860 ./run_gradio_demo.sh

set -e

MODEL="${MODEL:-OpenMOSS-Team/MOSS-TTS-Nano}"
PORT="${PORT:-8091}"
GRADIO_PORT="${GRADIO_PORT:-7860}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

echo "Starting MOSS-TTS-Nano server (port $PORT)..."
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --omni &
SERVER_PID=$!

cleanup() {
    echo "Stopping server (PID $SERVER_PID)..."
    kill $SERVER_PID 2>/dev/null
    wait $SERVER_PID 2>/dev/null
}
trap cleanup EXIT

# Wait for server to be ready.
echo "Waiting for server to start..."
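READY=0
for i in $(seq 1 120); do
    # -f makes curl fail on HTTP error responses, not only on connection errors.
    if curl -sf "http://localhost:$PORT/health" > /dev/null 2>&1; then
        echo "Server ready."
        READY=1
        break
    fi
    sleep 2
done
if [ "$READY" -ne 1 ]; then
    echo "Server did not become ready after 120 attempts." >&2
    exit 1
fi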

echo "Starting Gradio demo (port $GRADIO_PORT)..."
python "$SCRIPT_DIR/gradio_demo.py" \
    --api-base "http://localhost:$PORT" \
    --port "$GRADIO_PORT"

moss_tts_nano/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for MOSS-TTS-Nano
#
# Usage:
#   ./run_server.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8091 ./run_server.sh

set -e

MODEL="${MODEL:-OpenMOSS-Team/MOSS-TTS-Nano}"
PORT="${PORT:-8091}"

echo "Starting MOSS-TTS-Nano server with model: $MODEL"

FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --omni

omnivoice/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for OmniVoice TTS
#
# Usage:
#   ./run_server.sh
#   CUDA_VISIBLE_DEVICES=0 ./run_server.sh

set -e

MODEL="${MODEL:-k2-fsa/OmniVoice}"
PORT="${PORT:-8091}"

echo "Starting OmniVoice server with model: $MODEL"

vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --trust-remote-code \
    --omni

omnivoice/speech_client.py

"""Client for OmniVoice TTS via /v1/audio/speech endpoint.

Examples:
    # Basic TTS (auto voice)
    python speech_client.py --text "Hello, how are you?"

    # Specify language
    python speech_client.py --text "Bonjour, comment allez-vous?" --language French

    # Use a specific uploaded/supported voice
    python speech_client.py --text "Hello" --voice my_uploaded_voice
"""

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime = {
        "wav": "audio/wav",
        "mp3": "audio/mpeg",
        "flac": "audio/flac",
        "ogg": "audio/ogg",
    }.get(ext, "audio/wav")

    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def run_tts(args) -> None:
    """Generate speech via /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }
    if args.seed is not None:
        payload["extra_params"] = {}
        payload["extra_params"]["seed"] = args.seed

    if args.voice:
        payload["voice"] = args.voice
    if args.language:
        payload["language"] = args.language

    if args.ref_audio:
        ref = args.ref_audio
        if ref.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = ref
        else:
            payload["ref_audio"] = encode_audio_to_base64(ref)

    if args.ref_text:
        payload["ref_text"] = args.ref_text

    if args.instructions:
        payload["instructions"] = args.instructions

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    if args.seed is not None:
        print(f"Seed: {args.seed}")

    if args.voice:
        print(f"Voice: {args.voice}")

    if args.language:
        print(f"Language: {args.language}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }
    with httpx.Client(timeout=300.0) as client:
        response = client.post(api_url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return

    try:
        text = response.content.decode("utf-8")
        if text.startswith('{"error"'):
            print(f"Error: {text}")
            return
    except UnicodeDecodeError:
        pass

    output_path = args.output or "omnivoice_output.wav"
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Audio saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="OmniVoice TTS client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default="k2-fsa/OmniVoice", help="Model name")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument(
        "--voice",
        default=None,
        help="Voice name (omit for auto voice; must match a supported or uploaded speaker if set)",
    )
    parser.add_argument("--language", default=None, help="Language hint (e.g., English, Chinese, French)")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio for voice cloning (local path, URL, or data: URI)",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Reference text for voice cloning",
    )
    parser.add_argument(
        "--instructions",
        type=str,
        default=None,
        help="Voice style/emotion instructions",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=None,
        help="Random seed for generation (default: None, for stochastic output)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    args = parser.parse_args()
    run_tts(args)


if __name__ == "__main__":
    main()

qwen3_tts/batch_speech_client.py

"""Batch speech client for Qwen3-TTS via /v1/audio/speech/batch endpoint.

This script demonstrates how to synthesize multiple texts in a single request.
A particularly useful scenario is voice cloning: set ref_audio once at the
batch level and generate many utterances in the cloned voice without repeating
the reference for each item.

Start the server (with batch-optimized stage settings for best throughput):

    vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
        --omni \
        --trust-remote-code \
        --stage-overrides '{"0":{"max_num_seqs":4,"gpu_memory_utilization":0.2},
                            "1":{"max_num_seqs":4,"gpu_memory_utilization":0.2}}'

Examples:
    # Batch with a predefined voice
    python batch_speech_client.py \
        --texts "Hello, how are you?" "Goodbye, see you later!"

    # Voice cloning: one ref_audio, many outputs
    python batch_speech_client.py \
        --task-type Base \
        --ref-audio /path/to/reference.wav \
        --ref-text "Transcript of the reference audio" \
        --texts "First cloned sentence." "Second cloned sentence." \
               "Third cloned sentence."
"""

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    ext = os.path.splitext(audio_path)[1].lower()
    mime_map = {".wav": "audio/wav", ".mp3": "audio/mpeg", ".flac": "audio/flac", ".ogg": "audio/ogg"}
    mime_type = mime_map.get(ext, "audio/wav")

    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"


def run_batch(args) -> None:
    """Send a batch TTS request and save each result to a file."""
    items = [{"input": text} for text in args.texts]

    payload: dict = {
        "items": items,
        "response_format": args.response_format,
    }
    if args.voice:
        payload["voice"] = args.voice
    if args.language:
        payload["language"] = args.language
    if args.task_type:
        payload["task_type"] = args.task_type
    if args.instructions:
        payload["instructions"] = args.instructions
    if args.max_new_tokens:
        payload["max_new_tokens"] = args.max_new_tokens

    # Voice cloning parameters (shared across all items)
    if args.ref_audio:
        if args.ref_audio.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = args.ref_audio
        else:
            payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    if args.ref_text:
        payload["ref_text"] = args.ref_text

    print(f"Sending batch of {len(items)} item(s) to {args.api_base}")
    if args.ref_audio:
        print("Voice cloning mode — ref_audio applied to all items")

    url = f"{args.api_base}/v1/audio/speech/batch"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    with httpx.Client(timeout=300.0) as client:
        response = client.post(url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error {response.status_code}: {response.text}")
        return

    data = response.json()
    print(f"Total: {data['total']}  Succeeded: {data['succeeded']}  Failed: {data['failed']}")

    os.makedirs(args.output_dir, exist_ok=True)
    for result in data["results"]:
        idx = result["index"]
        if result["status"] == "success":
            audio_bytes = base64.b64decode(result["audio_data"])
            out_path = os.path.join(args.output_dir, f"batch_{idx}.{args.response_format}")
            with open(out_path, "wb") as f:
                f.write(audio_bytes)
            print(f"  [{idx}] saved {len(audio_bytes)} bytes -> {out_path}")
        else:
            print(f"  [{idx}] FAILED: {result['error']}")


def parse_args():
    parser = argparse.ArgumentParser(
        description="Batch speech client for /v1/audio/speech/batch",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )

    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")

    # Texts to synthesize
    parser.add_argument(
        "--texts",
        nargs="+",
        required=True,
        help="One or more texts to synthesize",
    )

    # Shared voice settings
    parser.add_argument("--voice", default="vivian", help="Speaker name (default: vivian)")
    parser.add_argument("--language", default=None, help="Language: Auto, Chinese, English, etc.")
    parser.add_argument("--instructions", default=None, help="Voice style/emotion instructions")
    parser.add_argument(
        "--task-type",
        default=None,
        choices=["CustomVoice", "VoiceDesign", "Base"],
        help="TTS task type (default: CustomVoice)",
    )

    # Voice cloning (Base task)
    parser.add_argument("--ref-audio", default=None, help="Reference audio path or URL for voice cloning")
    parser.add_argument("--ref-text", default=None, help="Reference audio transcript for voice cloning")

    # Generation
    parser.add_argument("--max-new-tokens", type=int, default=None, help="Max new tokens per item")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output-dir", "-o", default="batch_output", help="Output directory (default: batch_output)")

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    run_batch(args)

qwen3_tts/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/gradio_demo.py.

qwen3_tts/openai_speech_client.py

"""OpenAI-compatible client for Qwen3-TTS via /v1/audio/speech endpoint.

This script demonstrates how to use the OpenAI-compatible speech API
to generate audio from text using Qwen3-TTS models.

Examples:
    # CustomVoice task (predefined speaker)
    python openai_speech_client.py --text "Hello, how are you?" --speaker vivian

    # CustomVoice with emotion instruction
    python openai_speech_client.py --text "I'm so happy!" --speaker vivian \
        --instructions "Speak with excitement"

    # VoiceDesign task (voice from description)
    python openai_speech_client.py --text "Hello world" \
        --task-type VoiceDesign \
        --instructions "A warm, friendly female voice"

    # Base task (voice cloning)
    python openai_speech_client.py --text "Hello world" \
        --task-type Base \
        --ref-audio "https://example.com/reference.wav" \
        --ref-text "This is the reference transcript"

    # Base task with pre-computed speaker embedding
    python openai_speech_client.py --text "Hello world" \
        --task-type Base \
        --speaker-embedding embedding.json
"""

import argparse
import base64
import json
import os

import httpx

# Default server configuration
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    # Detect MIME type from extension
    audio_path_lower = audio_path.lower()
    if audio_path_lower.endswith(".wav"):
        mime_type = "audio/wav"
    elif audio_path_lower.endswith((".mp3", ".mpeg")):
        mime_type = "audio/mpeg"
    elif audio_path_lower.endswith(".flac"):
        mime_type = "audio/flac"
    elif audio_path_lower.endswith(".ogg"):
        mime_type = "audio/ogg"
    else:
        mime_type = "audio/wav"  # Default

    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"


def run_tts_generation(args) -> None:
    """Run TTS generation via OpenAI-compatible /v1/audio/speech API."""

    # Build request payload
    payload = {
        "model": args.model,
        "input": args.text,
        "voice": args.speaker,
        "response_format": args.response_format,
    }

    # Add optional parameters
    if args.instructions:
        payload["instructions"] = args.instructions
    if args.task_type:
        payload["task_type"] = args.task_type
    if args.language:
        payload["language"] = args.language
    if args.max_new_tokens:
        payload["max_new_tokens"] = args.max_new_tokens

    # Voice clone parameters (Base task)
    if args.ref_audio:
        if args.ref_audio.startswith(("http://", "https://")):
            payload["ref_audio"] = args.ref_audio
        elif args.ref_audio.startswith("data:"):
            payload["ref_audio"] = args.ref_audio
        else:
            payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    if args.ref_text:
        payload["ref_text"] = args.ref_text
    if args.x_vector_only:
        payload["x_vector_only_mode"] = True
    if args.speaker_embedding:
        with open(args.speaker_embedding) as f:
            payload["speaker_embedding"] = json.load(f)

    print(f"Model: {args.model}")
    print(f"Task type: {args.task_type or 'CustomVoice'}")
    print(f"Text: {args.text}")
    print(f"Speaker: {args.speaker}")
    print("Generating audio...")

    # Make the API call
    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    with httpx.Client(timeout=300.0) as client:
        response = client.post(api_url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return

    # Check for JSON error response (only if content is valid UTF-8 text)
    try:
        text = response.content.decode("utf-8")
        if text.startswith('{"error"'):
            print(f"Error: {text}")
            return
    except UnicodeDecodeError:
        pass  # Binary audio data, not an error

    # Save audio response
    output_path = args.output or "tts_output.wav"
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Audio saved to: {output_path}")


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="OpenAI-compatible client for Qwen3-TTS via /v1/audio/speech",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )

    # Server configuration
    parser.add_argument(
        "--api-base",
        type=str,
        default=DEFAULT_API_BASE,
        help=f"API base URL (default: {DEFAULT_API_BASE})",
    )
    parser.add_argument(
        "--api-key",
        type=str,
        default=DEFAULT_API_KEY,
        help="API key (default: EMPTY)",
    )
    parser.add_argument(
        "--model",
        "-m",
        type=str,
        default="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
        help="Model name/path",
    )

    # Task configuration
    parser.add_argument(
        "--task-type",
        "-t",
        type=str,
        default=None,
        choices=["CustomVoice", "VoiceDesign", "Base"],
        help="TTS task type (default: CustomVoice)",
    )

    # Input text
    parser.add_argument(
        "--text",
        type=str,
        required=True,
        help="Text to synthesize",
    )

    # Voice/speaker
    parser.add_argument(
        "--speaker",
        type=str,
        default="vivian",
        help="Speaker name (default: vivian). Options: vivian, ryan, aiden, etc.",
    )
    parser.add_argument(
        "--language",
        type=str,
        default=None,
        help="Language: Auto, Chinese, English, etc.",
    )
    parser.add_argument(
        "--instructions",
        type=str,
        default=None,
        help="Voice style/emotion instructions",
    )

    # Base (voice clone) parameters
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio file path, URL, or base64 data: URL for voice cloning (Base task)",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Reference audio transcript for voice cloning (Base task)",
    )
    parser.add_argument(
        "--x-vector-only",
        action="store_true",
        help="Use x-vector only mode for voice cloning (no ICL)",
    )
    parser.add_argument(
        "--speaker-embedding",
        type=str,
        default=None,
        help="Path to JSON file containing a pre-computed speaker embedding vector (1024-dim for 0.6B, 2048-dim for 1.7B)",
    )

    # Generation parameters
    parser.add_argument(
        "--max-new-tokens",
        type=int,
        default=None,
        help="Maximum new tokens to generate",
    )

    # Output
    parser.add_argument(
        "--response-format",
        type=str,
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio output format (default: wav)",
    )
    parser.add_argument(
        "--output",
        "-o",
        type=str,
        default=None,
        help="Output audio file path (default: tts_output.wav)",
    )

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    run_tts_generation(args)

qwen3_tts/precompute_custom_voice.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/precompute_custom_voice.py.

qwen3_tts/run_gradio_demo.sh

#!/bin/bash
# Launch both vLLM server and Gradio demo for Qwen3-TTS
#
# Usage:
#   ./run_gradio_demo.sh                                    # Default: CustomVoice
#   ./run_gradio_demo.sh --task-type VoiceDesign            # VoiceDesign model
#   ./run_gradio_demo.sh --task-type Base --gradio-port 7861
#
# Options:
#   --task-type TYPE        Task type: CustomVoice, VoiceDesign, Base (default: CustomVoice)
#   --server-port PORT      Port for vLLM server (default: 8000)
#   --gradio-port PORT      Port for Gradio demo (default: 7860)
#   --server-host HOST      Host for vLLM server (default: 0.0.0.0)
#   --gradio-ip IP          IP for Gradio demo (default: 127.0.0.1)
#   --share                 Share Gradio demo publicly

set -e

# Default values
TASK_TYPE="CustomVoice"
SERVER_PORT=8000
GRADIO_PORT=7860
SERVER_HOST="0.0.0.0"
GRADIO_IP="127.0.0.1"
GRADIO_SHARE=false

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --task-type)
            TASK_TYPE="$2"
            shift 2
            ;;
        --server-port)
            SERVER_PORT="$2"
            shift 2
            ;;
        --gradio-port)
            GRADIO_PORT="$2"
            shift 2
            ;;
        --server-host)
            SERVER_HOST="$2"
            shift 2
            ;;
        --gradio-ip)
            GRADIO_IP="$2"
            shift 2
            ;;
        --share)
            GRADIO_SHARE=true
            shift
            ;;
        --help)
            echo "Usage: $0 [OPTIONS]"
            echo ""
            echo "Options:"
            echo "  --task-type TYPE        Task type: CustomVoice, VoiceDesign, Base (default: CustomVoice)"
            echo "  --server-port PORT      Port for vLLM server (default: 8000)"
            echo "  --gradio-port PORT      Port for Gradio demo (default: 7860)"
            echo "  --server-host HOST      Host for vLLM server (default: 0.0.0.0)"
            echo "  --gradio-ip IP          IP for Gradio demo (default: 127.0.0.1)"
            echo "  --share                 Share Gradio demo publicly"
            echo ""
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            echo "Use --help for usage information"
            exit 1
            ;;
    esac
done

# Map task type to model
case "$TASK_TYPE" in
    CustomVoice)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
        ;;
    VoiceDesign)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
        ;;
    Base)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-Base"
        ;;
    *)
        echo "Unknown task type: $TASK_TYPE"
        echo "Supported: CustomVoice, VoiceDesign, Base"
        exit 1
        ;;
esac

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
API_BASE="http://localhost:${SERVER_PORT}"

echo "=========================================="
echo "Qwen3-TTS Gradio Demo"
echo "=========================================="
echo "Task Type : $TASK_TYPE"
echo "Model     : $MODEL"
echo "Server    : http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio    : http://${GRADIO_IP}:${GRADIO_PORT}"
echo "=========================================="

# Cleanup on exit
cleanup() {
    echo ""
    echo "Shutting down..."
    if [ -n "$SERVER_PID" ]; then
        echo "Stopping vLLM server (PID: $SERVER_PID)..."
        kill "$SERVER_PID" 2>/dev/null || true
        wait "$SERVER_PID" 2>/dev/null || true
    fi
    if [ -n "$GRADIO_PID" ]; then
        echo "Stopping Gradio demo (PID: $GRADIO_PID)..."
        kill "$GRADIO_PID" 2>/dev/null || true
        wait "$GRADIO_PID" 2>/dev/null || true
    fi
    echo "Cleanup complete"
    exit 0
}
trap cleanup SIGINT SIGTERM

# Start vLLM server
echo ""
echo "Starting vLLM server..."
LOG_FILE="/tmp/vllm_tts_server_${SERVER_PORT}.log"

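# Create the log up front so the readiness check below can tail it immediately.
: > "$LOG_FILE"

# Log through process substitution rather than a pipe: with `cmd | tee ... &`,
# $! would be tee's PID, so cleanup and the liveness check would miss the server.
vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/qwen3_tts.yaml \
    --host "$SERVER_HOST" \
    --port "$SERVER_PORT" \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --omni > >(tee "$LOG_FILE") 2>&1 &
SERVER_PID=$!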

# Wait for server startup
echo ""
echo "Waiting for vLLM server to be ready..."
STARTUP_FLAG="/tmp/vllm_tts_startup_flag_${SERVER_PORT}.tmp"
rm -f "$STARTUP_FLAG"

(
    tail -f "$LOG_FILE" 2>/dev/null | grep -m 1 "Application startup complete" > /dev/null && touch "$STARTUP_FLAG"
) &
TAIL_PID=$!

MAX_WAIT=300
ELAPSED=0
while [ $ELAPSED -lt $MAX_WAIT ]; do
    if [ -f "$STARTUP_FLAG" ]; then
        kill "$TAIL_PID" 2>/dev/null || true
        wait "$TAIL_PID" 2>/dev/null || true
        echo ""
        echo "vLLM server is ready!"
        break
    fi
    if ! kill -0 "$SERVER_PID" 2>/dev/null; then
        kill "$TAIL_PID" 2>/dev/null || true
        echo ""
        echo "Error: vLLM server failed to start"
        exit 1
    fi
    sleep 1
    ELAPSED=$((ELAPSED + 1))
done

rm -f "$STARTUP_FLAG"

if [ $ELAPSED -ge $MAX_WAIT ]; then
    kill "$TAIL_PID" 2>/dev/null || true
    echo "Error: Server startup timed out after ${MAX_WAIT}s"
    kill "$SERVER_PID" 2>/dev/null || true
    exit 1
fi

# Start Gradio demo
echo ""
echo "Starting Gradio demo..."
cd "$SCRIPT_DIR"
GRADIO_CMD=("python" "gradio_demo.py" "--api-base" "$API_BASE" "--host" "$GRADIO_IP" "--port" "$GRADIO_PORT")
if [ "$GRADIO_SHARE" = true ]; then
    GRADIO_CMD+=("--share")
fi

"${GRADIO_CMD[@]}" &
GRADIO_PID=$!

echo ""
echo "=========================================="
echo "Both services are running!"
echo "=========================================="
echo "vLLM Server : http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio Demo : http://${GRADIO_IP}:${GRADIO_PORT}"
echo ""
echo "Press Ctrl+C to stop both services"
echo "=========================================="
echo ""

wait $SERVER_PID $GRADIO_PID || true
cleanup
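
The script above treats the server as ready once "Application startup complete" appears in the log, and gives up after MAX_WAIT seconds. The same wait-with-timeout loop can be sketched in Python. This is only an illustration: `wait_until_ready` and its `probe` callable are hypothetical names, not part of vLLM-Omni, and the probe can be any check you trust, for example a GET against `/v1/audio/voices`.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    max_wait: float = 300.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns True or max_wait seconds have elapsed.

    Hypothetical helper mirroring the MAX_WAIT/ELAPSED loop in the shell script.
    Exceptions raised by probe() (e.g. connection refused) count as "not ready".
    """
    elapsed = 0.0
    while elapsed < max_wait:
        try:
            if probe():
                return True
        except Exception:
            pass  # server not reachable yet; keep waiting
        sleep(interval)
        elapsed += interval
    return False
```

Unlike the shell loop, this sketch does not check whether the server process has already exited; a caller would need to add that separately.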

qwen3_tts/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for Qwen3-TTS models
#
# Usage:
#   ./run_server.sh                           # Default: CustomVoice model
#   ./run_server.sh CustomVoice               # CustomVoice model
#   ./run_server.sh VoiceDesign               # VoiceDesign model
#   ./run_server.sh Base                      # Base (voice clone) model

set -e

TASK_TYPE="${1:-CustomVoice}"

case "$TASK_TYPE" in
    CustomVoice)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
        ;;
    VoiceDesign)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
        ;;
    Base)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-Base"
        ;;
    *)
        echo "Unknown task type: $TASK_TYPE"
        echo "Supported: CustomVoice, VoiceDesign, Base"
        exit 1
        ;;
esac

echo "Starting Qwen3-TTS server with model: $MODEL"

vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/qwen3_tts.yaml \
    --host 0.0.0.0 \
    --port 8091 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --omni
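
Once this server is listening on port 8091, a non-streaming request can be sketched with the standard library alone. The field names follow `build_payload` in qwen3_tts/tts_common.py; `build_speech_request` and `save_speech` are hypothetical helper names, and the defaults (voice `vivian`, format `wav`) are assumptions borrowed from the example client's defaults.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    voice: str = "vivian",
    task_type: str = "CustomVoice",
    api_base: str = "http://localhost:8091",  # port used by run_server.sh
    response_format: str = "wav",
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/audio/speech request."""
    payload = {
        "input": text,
        "voice": voice,
        "task_type": task_type,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{api_base}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer EMPTY"},
        method="POST",
    )


def save_speech(request: urllib.request.Request, path: str, timeout: float = 300.0) -> int:
    """Send the request and write the returned audio bytes to path."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        audio = resp.read()
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)
```

Calling `save_speech(build_speech_request("Hello"), "tts_output.wav")` would then write the audio to disk, assuming the server is reachable.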

qwen3_tts/speaker_embedding_interpolation.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/speaker_embedding_interpolation.py.

qwen3_tts/streaming_speech_client.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/streaming_speech_client.py.

qwen3_tts/tts_common.py

"""Shared constants, helpers, and payload building for Qwen3-TTS Gradio demos."""

import base64
import io

try:
    import gradio as gr
except ImportError:
    raise ImportError("gradio is required to run this demo. Install it with: pip install 'vllm-omni[demo]'") from None
import httpx
import numpy as np
import soundfile as sf

SUPPORTED_LANGUAGES = [
    "Auto",
    "Chinese",
    "English",
    "Japanese",
    "Korean",
    "German",
    "French",
    "Russian",
    "Portuguese",
    "Spanish",
    "Italian",
]

TASK_TYPES = ["CustomVoice", "VoiceDesign", "Base"]

PCM_SAMPLE_RATE = 24000

DEFAULT_API_BASE = "http://localhost:8000"


def fetch_voices(api_base: str) -> list[str]:
    """Fetch available voices from the server."""
    try:
        with httpx.Client(timeout=10.0) as client:
            resp = client.get(
                f"{api_base}/v1/audio/voices",
                headers={"Authorization": "Bearer EMPTY"},
            )
        if resp.status_code == 200:
            data = resp.json()
            voices = data.get("voices") or []
            if voices:
                return voices
    except Exception:
        pass
    return ["Vivian", "Ryan"]


def encode_audio_to_base64(audio_data: tuple) -> str:
    """Encode Gradio audio input (sample_rate, numpy_array) to base64 data URL."""
    sample_rate, audio_np = audio_data

    if audio_np.dtype != np.int16:
        if audio_np.dtype in (np.float32, np.float64):
            audio_np = np.clip(audio_np, -1.0, 1.0)
            audio_np = (audio_np * 32767).astype(np.int16)
        else:
            audio_np = audio_np.astype(np.int16)

    buf = io.BytesIO()
    sf.write(buf, audio_np, sample_rate, format="WAV")
    wav_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:audio/wav;base64,{wav_b64}"


def build_payload(
    text: str,
    task_type: str,
    voice: str,
    language: str,
    instructions: str,
    ref_audio: tuple | None,
    ref_audio_url: str,
    ref_text: str,
    x_vector_only: bool,
    response_format: str = "pcm",
    speed: float = 1.0,
    stream: bool = True,
) -> dict:
    """Build the /v1/audio/speech request payload.

    Raises gr.Error for invalid input so callers don't need to validate.
    """
    if not text or not text.strip():
        raise gr.Error("Please enter text to synthesize.")

    payload: dict = {
        "input": text.strip(),
        "response_format": "pcm" if stream else response_format,
        "stream": stream,
    }
    if stream:
        payload["stream_format"] = "audio"
    if not stream:
        payload["speed"] = speed

    if task_type:
        payload["task_type"] = task_type
    if language:
        payload["language"] = language

    if task_type == "CustomVoice":
        if voice:
            payload["voice"] = voice
        if instructions and instructions.strip():
            payload["instructions"] = instructions.strip()

    elif task_type == "VoiceDesign":
        if not instructions or not instructions.strip():
            raise gr.Error("VoiceDesign task requires voice style instructions.")
        payload["instructions"] = instructions.strip()

    elif task_type == "Base":
        ref_audio_url_stripped = ref_audio_url.strip() if ref_audio_url else ""
        if ref_audio_url_stripped:
            payload["ref_audio"] = ref_audio_url_stripped
        elif ref_audio is not None:
            payload["ref_audio"] = encode_audio_to_base64(ref_audio)
        else:
            raise gr.Error("Base (voice clone) task requires reference audio. Upload a file or provide a URL.")
        if ref_text and ref_text.strip():
            payload["ref_text"] = ref_text.strip()
        if x_vector_only:
            payload["x_vector_only_mode"] = True

    return payload


def on_task_type_change(task_type: str):
    """Update UI visibility based on selected task type."""
    if task_type == "CustomVoice":
        return (
            gr.update(visible=True),  # voice dropdown
            gr.update(visible=True, info="Optional style/emotion instructions"),
            gr.update(visible=False),  # ref_audio
            gr.update(visible=False),  # ref_audio_url
            gr.update(visible=False),  # ref_text
            gr.update(visible=False),  # x_vector_only
        )
    elif task_type == "VoiceDesign":
        return (
            gr.update(visible=False),
            gr.update(visible=True, info="Required: describe the voice style"),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
        )
    elif task_type == "Base":
        return (
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=True),
            gr.update(visible=True),
            gr.update(visible=True),
            gr.update(visible=True),
        )
    return (
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=False),
        gr.update(visible=False),
        gr.update(visible=False),
        gr.update(visible=False),
    )


def stream_pcm_chunks(api_base: str, payload: dict):
    """Stream raw PCM bytes from the server, yielding int16 numpy arrays.

    Handles odd-byte boundaries between network chunks.
    """
    leftover = b""
    with httpx.Client(timeout=300.0) as client:
        with client.stream(
            "POST",
            f"{api_base}/v1/audio/speech",
            json=payload,
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer EMPTY",
            },
        ) as resp:
            if resp.status_code != 200:
                resp.read()
                raise gr.Error(f"Server error ({resp.status_code}): {resp.text}")
            for chunk in resp.iter_bytes():
                if not chunk:
                    continue
                raw = leftover + chunk
                usable = len(raw) - (len(raw) % 2)
                leftover = raw[usable:]
                if usable == 0:
                    continue
                yield np.frombuffer(raw[:usable], dtype=np.int16).copy()


def add_common_args(parser):
    """Add CLI arguments shared by both demos."""
    parser.add_argument(
        "--api-base",
        default=DEFAULT_API_BASE,
        help=f"Base URL for the vLLM API server (default: {DEFAULT_API_BASE}).",
    )
    parser.add_argument(
        "--host",
        default="0.0.0.0",
        help="Host/IP for Gradio server (default: 0.0.0.0).",
    )
    parser.add_argument(
        "--port",
        type=int,
        default=7860,
        help="Port for Gradio server (default: 7860).",
    )
    parser.add_argument(
        "--share",
        action="store_true",
        help="Share the Gradio demo publicly.",
    )
    return parser
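
The odd-byte handling in `stream_pcm_chunks` matters because the network can split the stream anywhere, including in the middle of a 2-byte sample. A dependency-free sketch of the same carry-over logic follows; `iter_int16_samples` is a hypothetical name, and it uses the standard-library `array` module in place of NumPy purely to stay self-contained.

```python
from array import array
from typing import Iterable, Iterator


def iter_int16_samples(chunks: Iterable[bytes]) -> Iterator[array]:
    """Yield native-endian int16 sample arrays from arbitrarily split byte chunks."""
    leftover = b""
    for chunk in chunks:
        if not chunk:
            continue
        raw = leftover + chunk
        usable = len(raw) - (len(raw) % 2)  # longest even-length prefix
        leftover = raw[usable:]  # zero or one trailing byte, carried to the next chunk
        if usable == 0:
            continue
        samples = array("h")  # signed 16-bit integers
        samples.frombytes(raw[:usable])
        yield samples
```

A dangling final byte is dropped when the stream ends, which matches the behavior of the function above.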

qwen3_tts/word_timestamps_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/word_timestamps_demo.py.

soulxsinger/openai_chat_client.py

#!/usr/bin/env python3
"""SoulX-Singer OpenAI-compatible chat client (SVS / SVC).

Sends prompt audio via ``input_audio`` and target accompaniment via
``extra_args['target_audio']`` (server-local path). For integrated preprocess,
also pass ``preprocess_weights_dir`` in ``extra_args``.

Usage:
  python openai_chat_client.py \\
      --prompt-audio /path/on/server/zh_prompt.mp3 \\
      --target-audio /path/on/server/music.mp3 \\
      --preprocess-weights-dir /path/on/server/SoulX-Singer-Preprocess \\
      -o output.wav
"""

from __future__ import annotations

import argparse
import base64
import io
import sys
from pathlib import Path

import requests
import soundfile
import torch


def _audio_to_data_url(path: Path) -> str:
    with path.open("rb") as handle:
        data = base64.b64encode(handle.read()).decode("ascii")
    return f"data:audio/mpeg;base64,{data}"


def _save_wav(audio: torch.Tensor, path: Path, sample_rate: int) -> None:
    audio = audio.to(torch.float32)
    peak = audio.abs().max().clamp(min=1e-8)
    audio = audio / peak
    path.parent.mkdir(parents=True, exist_ok=True)
    soundfile.write(str(path), audio.clamp(-1.0, 1.0).cpu().T.numpy(), sample_rate, subtype="PCM_16")


def _decode_audio_from_response(body: dict) -> tuple[torch.Tensor, int]:
    for choice in body.get("choices", []):
        audio_obj = choice.get("message", {}).get("audio")
        if isinstance(audio_obj, dict) and audio_obj.get("data"):
            data, sr = soundfile.read(
                io.BytesIO(base64.b64decode(audio_obj["data"])),
                dtype="float32",
                always_2d=True,
            )
            return torch.from_numpy(data).transpose(0, 1), sr
    brief = {k: v for k, v in body.items() if k != "choices"}
    raise RuntimeError(f"no audio in response message.audio: {brief}")


def main() -> int:
    repo_root = Path(__file__).resolve().parents[4]
    default_assets = repo_root / "tests" / "assets" / "soulxsinger"

    parser = argparse.ArgumentParser(description="SoulX-Singer online chat client")
    parser.add_argument("--port", type=int, default=8192)
    parser.add_argument("--model", default="Soul-AILab/SoulX-Singer")
    parser.add_argument(
        "--prompt-audio",
        default=str(default_assets / "zh_prompt.mp3"),
        help="Prompt vocal audio (path on server if using extra_args, or local for input_audio)",
    )
    parser.add_argument(
        "--target-audio",
        default=str(default_assets / "music.mp3"),
        help="Target accompaniment path on the server (extra_args['target_audio'])",
    )
    parser.add_argument(
        "--prompt-metadata-path",
        default=None,
        help="SVS precomputed prompt metadata.json",
    )
    parser.add_argument(
        "--target-metadata-path",
        default=None,
        help="SVS precomputed target metadata.json",
    )
    parser.add_argument(
        "--audio-path",
        default=None,
        help="SVS prompt vocal wav for precomputed metadata",
    )
    parser.add_argument("--preprocess-weights-dir", default=None)
    parser.add_argument("--output", "-o", default="soulxsinger_out.wav")
    parser.add_argument("--svc", action="store_true", help="Use SVC mode knobs")
    parser.add_argument("--language", default="Mandarin")
    parser.add_argument("--num-inference-steps", type=int, default=32)
    parser.add_argument("--guidance-scale", type=float, default=3.0)
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="CFM sampling seed (default: 42, for reproducible output).",
    )
    parser.add_argument(
        "--auto-shift",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Auto pitch shift (default: on, original upstream infer.sh)",
    )
    parser.add_argument(
        "--control",
        default="melody",
        choices=["melody", "score"],
        help="SVS control mode",
    )
    parser.add_argument("--vocal-sep", action="store_true")
    args = parser.parse_args()

    meta_paths = (args.prompt_metadata_path, args.target_metadata_path, args.audio_path)
    if any(meta_paths) and not all(meta_paths):
        print(
            "ERROR: precomputed metadata requires --prompt-metadata-path, "
            "--target-metadata-path, and --audio-path together.",
            file=sys.stderr,
        )
        return 2

    extra_args: dict = {
        "vocal_sep": args.vocal_sep,
        "auto_shift": args.auto_shift,
        "pitch_shift": 0,
    }
    if all(meta_paths):
        extra_args.update(
            {
                "prompt_metadata_path": str(Path(args.prompt_metadata_path).expanduser().resolve()),
                "target_metadata_path": str(Path(args.target_metadata_path).expanduser().resolve()),
                "audio_path": str(Path(args.audio_path).expanduser().resolve()),
            }
        )
        content = [{"type": "text", "text": "soulx-singer"}]
    else:
        prompt_path = Path(args.prompt_audio).expanduser().resolve()
        if not prompt_path.is_file():
            print(f"ERROR: prompt audio not found: {prompt_path}", file=sys.stderr)
            return 2
        extra_args["prompt_audio"] = str(prompt_path)
        extra_args["target_audio"] = str(Path(args.target_audio).expanduser().resolve())
        if args.preprocess_weights_dir:
            extra_args["preprocess_weights_dir"] = str(Path(args.preprocess_weights_dir).expanduser().resolve())
        content = [
            {"type": "text", "text": "soulx-singer"},
            {
                "type": "input_audio",
                "input_audio": {"data": _audio_to_data_url(prompt_path), "format": "mp3"},
            },
        ]
    if not args.svc:
        extra_args["language"] = args.language
        extra_args["control"] = args.control

    payload = {
        "model": args.model,
        "modalities": ["audio"],
        "messages": [{"role": "user", "content": content}],
        "num_inference_steps": args.num_inference_steps,
        "guidance_scale": args.guidance_scale,
        "extra_args": extra_args,
    }
    if args.seed is not None:
        payload["seed"] = args.seed

    print(f"POST http://localhost:{args.port}/v1/chat/completions")
    response = requests.post(
        f"http://localhost:{args.port}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        json=payload,
        timeout=1800,
    )
    response.raise_for_status()
    audio, sample_rate = _decode_audio_from_response(response.json())
    _save_wav(audio, Path(args.output), sample_rate)
    duration = audio.shape[-1] / sample_rate
    print(f"saved {args.output}  sr={sample_rate}Hz  duration={duration:.2f}s")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

soulxsinger/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for SoulX-Singer (single-stage DiT, preprocess inline).
#
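# Usage:
#   MODEL=/path/to/SoulX-Singer PREPROCESS=/path/to/Preprocess \
#       ./run_server.sh
#
# Optional environment overrides: MODE (svs or svc), PORT, GPUS, DEPLOY_CONFIG.
#
# Audio paths in client extra_args must be readable on the server host.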

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"

MODEL="${MODEL:-Soul-AILab/SoulX-Singer}"
MODE="${MODE:-svs}"
PORT="${PORT:-8192}"
GPUS="${GPUS:-0}"

if [[ "$MODE" == "svc" ]]; then
  DEPLOY_CONFIG="${DEPLOY_CONFIG:-$REPO_ROOT/vllm_omni/deploy/soulxsinger_svc.yaml}"
else
  DEPLOY_CONFIG="${DEPLOY_CONFIG:-$REPO_ROOT/vllm_omni/deploy/soulxsinger_svs.yaml}"
fi

echo "Starting SoulX-Singer server"
echo "  MODEL=$MODEL"
echo "  MODE=$MODE"
echo "  PORT=$PORT"
echo "  DEPLOY_CONFIG=$DEPLOY_CONFIG"
echo "  CUDA_VISIBLE_DEVICES=$GPUS"

CUDA_VISIBLE_DEVICES="$GPUS" \
vllm serve "$MODEL" \
    --omni \
    --deploy-config "$DEPLOY_CONFIG" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --trust-remote-code \
    --enforce-eager

voxcpm2/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/voxcpm2/gradio_demo.py.

voxcpm2/openai_speech_client.py

"""OpenAI-compatible client for VoxCPM2 TTS via /v1/audio/speech endpoint.

Examples:
    # Zero-shot synthesis
    python openai_speech_client.py --text "Hello, this is VoxCPM2."

    # Voice cloning with a local reference audio file
    python openai_speech_client.py --text "Hello world" \\
        --ref-audio /path/to/reference.wav

    # Voice cloning with a URL
    python openai_speech_client.py --text "Hello world" \\
        --ref-audio "https://example.com/reference.wav"

Server setup:
    vllm serve openbmb/VoxCPM2 --omni --host 0.0.0.0 --port 8000
"""

from __future__ import annotations

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8000"
DEFAULT_API_KEY = "sk-empty"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime = {
        "wav": "audio/wav",
        "mp3": "audio/mpeg",
        "flac": "audio/flac",
        "ogg": "audio/ogg",
    }.get(ext, "audio/wav")

    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def main() -> None:
    parser = argparse.ArgumentParser(description="VoxCPM2 OpenAI speech client")
    parser.add_argument("--text", type=str, required=True, help="Text to synthesize")
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio for voice cloning (local path, URL, or data: URI)",
    )
    parser.add_argument("--model", type=str, default="voxcpm2")
    parser.add_argument("--output", type=str, default="output.wav")
    parser.add_argument("--api-base", type=str, default=DEFAULT_API_BASE)
    parser.add_argument("--api-key", type=str, default=DEFAULT_API_KEY)
    parser.add_argument("--response-format", type=str, default="wav")
    args = parser.parse_args()

    # VoxCPM2 has no predefined voices. The "voice" field is required by
    # the OpenAI API schema but ignored by VoxCPM2 — use any placeholder.
    # For voice cloning, pass --ref-audio instead.
    payload: dict = {
        "model": args.model,
        "input": args.text,
        "voice": "default",
        "response_format": args.response_format,
    }

    if args.ref_audio:
        ref = args.ref_audio
        if ref.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = ref
        else:
            payload["ref_audio"] = encode_audio_to_base64(ref)

    url = f"{args.api_base}/v1/audio/speech"
    print(f"POST {url}")
    print(f"  text: {args.text}")
    if args.ref_audio:
        print(f"  ref_audio: {args.ref_audio[:80]}...")

    with httpx.Client(timeout=300) as client:
        resp = client.post(
            url,
            json=payload,
            headers={"Authorization": f"Bearer {args.api_key}"},
        )

    if resp.status_code != 200:
        # Exit non-zero so shell callers can detect the failure.
        raise SystemExit(f"Error {resp.status_code}: {resp.text[:500]}")

    with open(args.output, "wb") as f:
        f.write(resp.content)
    print(f"Saved: {args.output} ({len(resp.content):,} bytes)")


if __name__ == "__main__":
    main()

voxcpm2/precompute_custom_voice.py

"""Pre-compute VoxCPM2 custom voice profiles.

The generated directory can be passed to the server via
``custom_voice_dir`` in ``vllm_omni/deploy/voxcpm2.yaml``. Requests can then
use ``/v1/audio/speech`` with ``voice="<name>"`` and no per-request ref_audio.
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import Any

import torch
from safetensors.torch import save_file

REPO_ROOT = Path(__file__).resolve().parents[4]
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from vllm_omni.utils.custom_voice_io import safe_voice_stem  # noqa: E402

MANIFEST_NAME = "custom_voice_manifest.json"


def _load_tts(model: str, device: torch.device):
    from vllm_omni.model_executor.models.voxcpm2.voxcpm2_import_utils import import_voxcpm2_core

    VoxCPM = import_voxcpm2_core()
    native = VoxCPM.from_pretrained(model, load_denoiser=False, optimize=False)
    return native.tts_model.to(device).eval()


def _load_manifest(output_dir: Path, model: str) -> dict[str, Any]:
    path = output_dir / MANIFEST_NAME
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {
        "schema_version": 1,
        "model_type": "voxcpm2",
        "model": model,
        "voices": {},
    }


def _write_voice(
    *,
    model: str,
    output_dir: Path,
    voice_name: str,
    ref_audio: str,
    prompt_text: str | None,
    mode: str,
    speaker_description: str | None,
    device: torch.device,
) -> None:
    if mode in ("continuation", "ref_continuation") and not prompt_text:
        raise ValueError("--prompt-text is required for continuation/ref_continuation modes")

    tts = _load_tts(model, device)
    tensors: dict[str, torch.Tensor] = {}
    with torch.inference_mode():
        if mode in ("reference", "ref_continuation"):
            tensors["ref_audio_feat"] = tts._encode_wav(ref_audio, padding_mode="right").float().cpu().contiguous()
        if mode in ("continuation", "ref_continuation"):
            tensors["audio_feat"] = tts._encode_wav(ref_audio, padding_mode="left").float().cpu().contiguous()

    output_dir.mkdir(parents=True, exist_ok=True)
    filename = f"{safe_voice_stem(voice_name)}.safetensors"
    save_file(tensors, str(output_dir / filename))

    manifest = _load_manifest(output_dir, model)
    entry: dict[str, Any] = {
        "name": voice_name,
        "file": filename,
        "mode": mode,
    }
    if "ref_audio_feat" in tensors:
        entry["ref_audio_feat_len"] = int(tensors["ref_audio_feat"].shape[0])
    if "audio_feat" in tensors:
        entry["audio_feat_len"] = int(tensors["audio_feat"].shape[0])
    if prompt_text:
        entry["prompt_text"] = prompt_text
    if speaker_description:
        entry["speaker_description"] = speaker_description

    manifest.setdefault("voices", {})[voice_name] = entry
    (output_dir / MANIFEST_NAME).write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
    print(f"Wrote {output_dir / filename}")
    print(f"Updated {output_dir / MANIFEST_NAME}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Pre-compute VoxCPM2 custom voice profile")
    parser.add_argument("--model", default="openbmb/VoxCPM2", help="VoxCPM2 model path or Hugging Face ID")
    parser.add_argument("--voice-name", required=True)
    parser.add_argument("--ref-audio", required=True)
    parser.add_argument(
        "--prompt-text",
        default=None,
        help="Transcript of ref audio for continuation/ref_continuation modes",
    )
    parser.add_argument(
        "--mode",
        choices=["reference", "continuation", "ref_continuation"],
        default="reference",
    )
    parser.add_argument("--speaker-description", default=None)
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
    args = parser.parse_args()

    _write_voice(
        model=args.model,
        output_dir=Path(args.output_dir),
        voice_name=args.voice_name,
        ref_audio=args.ref_audio,
        prompt_text=args.prompt_text,
        mode=args.mode,
        speaker_description=args.speaker_description,
        device=torch.device(args.device),
    )


if __name__ == "__main__":
    main()
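
The manifest written by this script can be inspected without loading any model. The sketch below assumes only the layout that `_load_manifest` and `_write_voice` produce (a top-level `voices` mapping whose entries carry a `file` key); `list_custom_voices` is a hypothetical helper, not part of vLLM-Omni.

```python
from __future__ import annotations

import json
from pathlib import Path

MANIFEST_NAME = "custom_voice_manifest.json"  # same file name the script writes


def list_custom_voices(voice_dir: str | Path) -> dict[str, Path]:
    """Map each voice name in the manifest to its .safetensors file path."""
    voice_dir = Path(voice_dir)
    manifest = json.loads((voice_dir / MANIFEST_NAME).read_text(encoding="utf-8"))
    return {
        name: voice_dir / entry["file"]
        for name, entry in manifest.get("voices", {}).items()
    }
```

This can serve as a quick check that a voice name exists before sending a request with `voice="<name>"`.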

voxtral_tts/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/voxtral_tts/gradio_demo.py.

voxtral_tts/text_preprocess.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/voxtral_tts/text_preprocess.py.