Text-To-Speech (Online Serving)

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/text_to_speech

vLLM-Omni exposes TTS models through the OpenAI-compatible POST /v1/audio/speech endpoint, launched with vllm serve <model> --omni. Each TTS model has its own subdirectory containing client snippets, gradio demos, and helper scripts; this README is the single doc entry point for all of them.

For offline inference, see examples/offline_inference/text_to_speech. For the full list of supported architectures across all modalities, see Supported Models.

Supported Models

| Model | HuggingFace repo | Voice cloning | Streaming | Voice presets / upload | Gradio demo |
| --- | --- | --- | --- | --- | --- |
| Fish Speech S2 Pro | fishaudio/s2-pro | ✓ (ref_audio+ref_text) | ✓ (PCM stream) | | |
| GLM-TTS | zai-org/GLM-TTS | ✓ (ref_audio+ref_text, required) | ✓ (PCM stream) | | |
| IndexTTS-2 | IndexTeam/IndexTTS-2 | ✓ (ref_audio or uploaded voice) | compat only, non-chunk | uploaded audio voice only; no presets | |
| Ming-omni-tts | inclusionAI/Ming-omni-tts-0.5B | ✓ (ref_audio / speaker_embedding) | ✓ (PCM stream) | IP labels + structured instructions | |
| Ming-flash-omni-TTS | Jonathan1909/Ming-flash-omni-2.0 | — (caption-controlled) | | caption fields (instructions) | |
| MOSS-TTS-Nano | OpenMOSS-Team/MOSS-TTS-Nano | ✓ (ref_audio required) | ✓ (PCM stream) | | |
| OmniVoice | k2-fsa/OmniVoice | | | | |
| Qwen3-TTS | Qwen/Qwen3-TTS-12Hz-1.7B-{CustomVoice,VoiceDesign,Base} | ✓ (Base) | ✓ (PCM + WebSocket) | ✓ (presets + /v1/audio/voices upload) | ✓ (standard + FastRTC) |
| VoxCPM2 | openbmb/VoxCPM2 | | | | ✓ (AudioWorklet via gradio) |
| Voxtral TTS | mistralai/Voxtral-4B-TTS-2603 | ✓ (gated upstream) | | ✓ (presets) | |
| SoulX-Singer | Soul-AILab/SoulX-Singer | ✓ (prompt audio) | — (batch only) | — (prompt + target audio) | — (chat client) |

CosyVoice3 is intentionally absent: no online example exists for it yet. See its offline section instead.

Common Quick Start

Launch the server (defaults shown — adjust --port, --gpu-memory-utilization, etc. as needed):

vllm serve <hf-repo-or-local-path> --omni --port 8091

Send a TTS request via curl. These generic snippets assume a model with a preset/default voice; voice-cloning-only models such as IndexTTS-2 require ref_audio or an uploaded audio voice (see model-specific sections below).

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, how are you?",
        "voice": "default",
        "response_format": "wav"
    }' --output output.wav

Or via Python httpx:

import httpx

response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={
        "input": "Hello, how are you?",
        "voice": "default",
        "response_format": "wav",
    },
    timeout=300.0,
)
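# Fail loudly on HTTP errors instead of writing an error body to disk.
response.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(response.content)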

Or via the OpenAI SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
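# with_streaming_response writes the body to disk as it arrives;
# calling stream_to_file on a plain create() response is deprecated in the SDK.
with client.audio.speech.with_streaming_response.create(
    model="<hf-repo>",
    voice="default",
    input="Hello, how are you?",
) as response:
    response.stream_to_file("output.wav")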

Streaming PCM output (where supported) — set stream=true, stream_format="audio", and response_format="pcm":

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, how are you?",
        "voice": "default",
        "stream": true,
        "stream_format": "audio",
        "response_format": "pcm"
    }' --no-buffer | play -t raw -r 24000 -e signed -b 16 -c 1 -

Adjust the player's sample rate to match the model (44.1 kHz for Fish Speech, 48 kHz for VoxCPM2, 22.05 kHz for IndexTTS-2, and 24 kHz for many others).
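
Raw PCM carries no header, so the sample rate also has to be supplied again when a stream is saved to disk rather than piped into a player. The following is a minimal sketch that wraps raw bytes in a WAV container using Python's standard wave module. It assumes 16-bit signed little-endian mono samples, matching the play flags above; the dictionary keys are illustrative labels only, not identifiers the server recognizes.

```python
import io
import wave

# Sample rates quoted in this README. The keys are hypothetical labels
# chosen for this sketch; extend the table for other models.
SAMPLE_RATES_HZ = {
    "fish-speech": 44100,
    "voxcpm2": 48000,
    "indextts2": 22050,
}
DEFAULT_SAMPLE_RATE_HZ = 24000


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = DEFAULT_SAMPLE_RATE_HZ) -> bytes:
    """Wrap raw 16-bit signed mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono, as in `play -c 1`
        wav.setsampwidth(2)   # 16-bit samples, as in `play -b 16`
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

For example, a stream captured from Fish Speech into output.pcm could be converted with pcm_to_wav_bytes(data, SAMPLE_RATES_HZ["fish-speech"]).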

For full request-shape documentation (all parameters, response formats, error codes), see the Speech API reference.


GLM-TTS

2-stage TTS (AR + DiT flow-matching) at 24 kHz. Every request requires ref_audio + ref_text.

Launch

vllm serve zai-org/GLM-TTS --omni --trust-remote-code --port 8091
# or:
bash examples/online_serving/text_to_speech/glm_tts/run_server.sh /path/to/GLM-TTS

Sending requests

# Voice cloning (required)
python examples/online_serving/text_to_speech/glm_tts/openai_speech_client.py \
    --text "你好,这是语音克隆测试。" \
    --ref-audio file:///path/to/ref.wav \
    --ref-text "这是参考音频的文本内容。"

# Custom format
python examples/online_serving/text_to_speech/glm_tts/openai_speech_client.py \
    --text "Hello, this is a voice cloning test." \
    --ref-audio file:///path/to/ref.wav \
    --ref-text "Transcript of the reference audio." \
    --response-format mp3 -o output.mp3

Gradio demo

bash examples/online_serving/text_to_speech/glm_tts/run_gradio_demo.sh

Notes

  • Output: 24 kHz mono WAV via HiFT vocoder.
  • ref_audio + ref_text are required together on every request. Reference audio should be 3-10 seconds.
  • Voice cloning feature extraction (WhisperVQ, CampPlus, mel) runs on the model side — no external dependency on the serving layer.
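
The 3-10 second guideline for reference audio can be checked on the client before a request is sent. The sketch below uses Python's standard wave module, so it only handles uncompressed WAV input; the function names and the decision to raise an error are illustrative choices, not server behavior.

```python
import wave
from typing import BinaryIO, Union

# Recommended reference-clip length from the notes above.
MIN_REF_SECONDS = 3.0
MAX_REF_SECONDS = 10.0


def wav_duration_seconds(source: Union[str, BinaryIO]) -> float:
    """Return the duration of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(source: Union[str, BinaryIO]) -> float:
    """Raise ValueError if the clip falls outside the recommended range."""
    duration = wav_duration_seconds(source)
    if not MIN_REF_SECONDS <= duration <= MAX_REF_SECONDS:
        raise ValueError(
            f"Reference audio is {duration:.2f}s; "
            f"expected {MIN_REF_SECONDS:.0f}-{MAX_REF_SECONDS:.0f}s"
        )
    return duration
```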

IndexTTS-2

2-stage TTS (GPT AR + S2Mel CFM DiT + BigVGAN) at 22.05 kHz. Requests use ref_audio for voice cloning, or an uploaded audio voice from /v1/audio/voices. Supports emotion conditioning via emo_audio, emo_text, or emo_vector passed in extra_params.

Launch

vllm serve IndexTeam/IndexTTS-2 --omni --trust-remote-code --port 8092
# or, to pass the bundled deploy config explicitly:
bash examples/online_serving/text_to_speech/indextts2/run_server.sh

Sending requests

# Voice cloning (ref_audio required)
python examples/online_serving/text_to_speech/indextts2/speech_client.py \
    --text "你好,世界!" \
    --ref-audio /path/to/reference.wav

# With emotion audio
python examples/online_serving/text_to_speech/indextts2/speech_client.py \
    --text "今天心情很好!" \
    --ref-audio /path/to/ref.wav \
    --emo-audio /path/to/happy.wav

Notes

  • Output: 22.05 kHz mono WAV.
  • Provide ref_audio on the documented raw request path, or pass voice only when it names an uploaded audio voice; IndexTTS-2 does not provide a built-in text-only preset voice.
  • Emotion params (emo_audio, emo_text, emo_vector, emo_alpha, use_emo_text, use_random) are passed via the extra_params field. Official precedence is use_emo_text > emo_vector > emo_audio > same emotion as the speaker reference.
  • stream=true is accepted as an OpenAI-compatible response path, but IndexTTS-2 is not async-chunk streaming; audio is produced after S2Mel receives the full mel-code sequence.
  • Deploy config: vllm_omni/deploy/indextts2.yaml (auto-loaded).
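
The emotion precedence described above can be sketched as a small client-side helper. This is an illustration of the documented ordering, not the server's own logic; the value types shown for each emotion parameter and the two helper names are assumptions made for this example.

```python
from typing import Any


def effective_emotion_source(extra_params: dict[str, Any]) -> str:
    """Return which emotion input wins under the documented precedence:
    use_emo_text > emo_vector > emo_audio > speaker reference."""
    if extra_params.get("use_emo_text"):
        return "emo_text"
    if extra_params.get("emo_vector") is not None:
        return "emo_vector"
    if extra_params.get("emo_audio") is not None:
        return "emo_audio"
    return "speaker_reference"


def build_payload(text: str, ref_audio: str, **emotion: Any) -> dict[str, Any]:
    """Build a /v1/audio/speech request body with emotion params in extra_params."""
    payload: dict[str, Any] = {"input": text, "ref_audio": ref_audio}
    if emotion:
        payload["extra_params"] = dict(emotion)
    return payload


# Example: emo_text is used because use_emo_text is set, even though
# emo_audio is also present.
payload = build_payload(
    "今天心情很好!",
    "https://example.com/ref.wav",
    use_emo_text=True,
    emo_text="excited",
    emo_audio="https://example.com/happy.wav",
)
winner = effective_emotion_source(payload["extra_params"])
```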

Fish Speech S2 Pro

4B dual-AR TTS at 44.1 kHz. Server uses the DAC codec.

Prerequisites

pip install fish-speech

Kvcache attention fast path

Fish Speech S2 Pro uses a Triton decode-only kvcache attention fast path by default on CUDA builds. Set VLLM_OMNI_FISH_KVCACHE_ATTN=0 to disable it, or VLLM_OMNI_FISH_KVCACHE_ATTN=required to fail fast if the fast path cannot be installed.

# Verify fast path availability.
python - <<'PY'
from vllm_omni.attention import fish_kvcache_attn

print(fish_kvcache_attn.is_available())
print(fish_kvcache_attn.load_error())
PY

# Optional: disable the runtime fast path.
export VLLM_OMNI_FISH_KVCACHE_ATTN=0

Launch

vllm serve fishaudio/s2-pro --omni --port 8091
# or:
./fish_speech/run_server.sh
The deploy config auto-loads from vllm_omni/deploy/fish_qwen3_omni.yaml (the HF model_type on the fishaudio checkpoint is fish_qwen3_omni).

Voice cloning

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, this is a cloned voice.",
        "voice": "default",
        "ref_audio": "https://example.com/reference.wav",
        "ref_text": "Transcript of the reference audio."
    }' --output cloned.wav

CLI client

cd examples/online_serving/text_to_speech/fish_speech
python speech_client.py --text "Hello, how are you?"
python speech_client.py --text "Hello world" --stream --output output.pcm

Gradio demo

./fish_speech/run_gradio_demo.sh             # launches server + Gradio
python fish_speech/gradio_demo.py --api-base http://localhost:8091  # if server already running

Notes

  • Output: 44.1 kHz mono.
  • Streaming PCM player command must use -r 44100.

Ming-omni-tts

Dense 0.5B two-stage TTS served through /v1/audio/speech. Ming uses the standard speech endpoint plus structured controls in instructions, voice, language, ref_audio, ref_text, and speaker_embedding.

Launch

bash examples/online_serving/text_to_speech/ming_tts/run_server.sh
Equivalent manual command:
vllm-omni serve inclusionAI/Ming-omni-tts-0.5B \
    --deploy-config vllm_omni/deploy/ming_tts.yaml \
    --host 0.0.0.0 --port 8091 \
    --enforce-eager --omni

Sending requests

python examples/online_serving/text_to_speech/ming_tts/openai_speech_client.py \
    --text "你好,这是 Ming 在线语音合成测试。"

Structured dialect control:

python examples/online_serving/text_to_speech/ming_tts/openai_speech_client.py \
    --text "我觉得社会企业同个人都有责任" \
    --instruction-json '{"方言":"广粤话"}' \
    --ref-audio /path/to/yue_prompt.wav

Zero-shot cloning:

python examples/online_serving/text_to_speech/ming_tts/openai_speech_client.py \
    --text "我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。" \
    --ref-audio /path/to/10002287-00000094.wav \
    --ref-text "在此奉劝大家别乱打美白针。"

Notes

  • run_curl.sh keeps a small sanity subset; use the Ming README for the broader request cookbook.
  • Online serving is speech-shaped today; music-only bgm and text-to-audio tta remain offline examples.
  • Full request details live in ming_tts/README.md.

Ming-flash-omni-TTS

Standalone talker-only deployment of Ming-flash-omni-2.0. Voice is controlled through caption text passed via instructions.

Launch

# from repo root
bash examples/online_serving/text_to_speech/ming_flash_omni_tts/run_server.sh
Equivalent manual command:
vllm serve Jonathan1909/Ming-flash-omni-2.0 \
    --deploy-config vllm_omni/deploy/ming_flash_omni_tts.yaml \
    --host 0.0.0.0 --port 8091 \
    --trust-remote-code --omni

Sending requests

python examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py \
    --text "我们当迎着阳光辛勤耕作,去摘取,去制作,去品尝,去馈赠。" \
    --output ming_online.wav

ASMR-style caption via instructions:

python examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py \
    --text "我会一直在这里陪着你,直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗?" \
    --instructions "这是一种ASMR耳语,属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语,声音气音成分重。" \
    --output ming_online_asmr.wav


MOSS-TTS-Nano

Single-stage 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec at 48 kHz mono. Every request must include ref_audio; there are no built-in speaker presets.

The OpenAI-schema voice and ref_text fields are accepted but ignored — voice_clone does not consume a transcript, and upstream's continuation mode (the only path that accepts prompt_text) emits near-silent output, so it is not exposed here. Sample reference clips ship in the upstream repo under assets/audio/.

Launch

vllm serve OpenMOSS-Team/MOSS-TTS-Nano --omni --port 8091
# or:
./moss_tts_nano/run_server.sh
The deploy config at vllm_omni/deploy/moss_tts_nano.yaml auto-loads; no --stage-configs-path, --trust-remote-code, or --enforce-eager flags are needed.

Sending requests

# One-off fetch of a sample reference clip; cache under XDG_CACHE_HOME.
REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
mkdir -p "$REF_DIR"
REF_WAV="$REF_DIR/zh_1.wav"
[ -s "$REF_WAV" ] || curl -L -o "$REF_WAV" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav
REF_AUDIO=$(base64 -w 0 "$REF_WAV")

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d "{
        \"input\": \"你好,这是语音合成测试。\",
        \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
        \"response_format\": \"wav\"
    }" --output output.wav

Streaming PCM

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d "{
        \"input\": \"Hello, streaming output from MOSS-TTS-Nano.\",
        \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
        \"stream\": true,
        \"stream_format\": \"audio\",
        \"response_format\": \"pcm\"
    }" --no-buffer | play -t raw -r 48000 -e signed -b 16 -c 1 -

Gradio demo

# Option 1: launch server + Gradio together
./moss_tts_nano/run_gradio_demo.sh

# Option 2: server already running
python moss_tts_nano/gradio_demo.py --api-base http://localhost:8091
Then open http://localhost:7860 in your browser.

Notes

  • Output is 48 kHz mono PCM (the upstream tokenizer is internally stereo at 48 kHz; the wrapper averages to mono before reaching the engine).
  • Standard /v1/audio/speech request shape: input, ref_audio (base64 data URL), response_format, stream, max_new_tokens. The voice and ref_text fields from the OpenAI schema are accepted but ignored.

OmniVoice

Zero-shot multilingual TTS (600+ languages). Online serving currently exposes auto voice only; voice cloning and voice design are available offline.

Prerequisites

huggingface-cli download k2-fsa/OmniVoice
Voice cloning (offline) needs transformers>=5.3.0; auto voice works with transformers>=4.57.0.

Launch

vllm serve k2-fsa/OmniVoice --omni --port 8091 --trust-remote-code
# or:
./omnivoice/run_server.sh

CLI client

cd examples/online_serving/text_to_speech/omnivoice
# Text-only (auto voice)
python speech_client.py --text "Hello, how are you?"

# Language hint
python speech_client.py --text "Bonjour, comment allez-vous?" --language French

# Voice cloning (reference audio + optional ref_text)
python speech_client.py \
    --text "Bonjour, comment allez-vous?" \
    --ref-audio /path/to/ref_audio.wav \
    --ref-text "Bonjour, comment allez-vous?"

# Style instruction (voice design-style control)
python speech_client.py \
    --text "Bonjour, comment allez-vous?" \
    --language French \
    --instructions "loud voice"

# Deterministic output with seed parameter
python speech_client.py --text "Hello, how are you?" --seed 42

The client supports --api-base, --model, --text, --response-format, --language, --voice, --ref-audio, --ref-text, --instructions, --seed, and --output.

Qwen3-TTS

Three model variants exposed via separate checkpoints:

| Variant | HF repo | Use |
| --- | --- | --- |
| CustomVoice | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | Predefined speakers (vivian, ryan, …) with optional style instructions |
| VoiceDesign | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | Natural-language voice style description |
| Base | Qwen/Qwen3-TTS-12Hz-1.7B-Base | Voice cloning from a reference audio |

Each variant ships smaller 0.6B companions where available.

Launch

vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni --port 8091
# or:
./qwen3_tts/run_server.sh                # default: CustomVoice
./qwen3_tts/run_server.sh VoiceDesign
./qwen3_tts/run_server.sh Base

Executor backend

Single-GPU deployments now default to the uniproc executor, which has lower IPC overhead and addresses the Base cloning use case from #2603 / #2604. vllm_omni/deploy/qwen3_tts.yaml is the only Qwen3-TTS deploy config; pass --deploy-config <path> to override it.

To opt out of chunked streaming, pass --no-async-chunk — the pipeline auto-dispatches to the end-to-end codec processor.

Sending requests

# CustomVoice with a predefined speaker
python qwen3_tts/openai_speech_client.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --text "今天天气真好" \
    --speaker ryan \
    --instructions "用开心的语气说"

# VoiceDesign with a style description
python qwen3_tts/openai_speech_client.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
    --task-type VoiceDesign \
    --text "哥哥,你回来啦" \
    --instructions "体现撒娇稚嫩的萝莉女声,音调偏高"

# Base voice cloning
python qwen3_tts/openai_speech_client.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
    --task-type Base \
    --text "Hello, this is a cloned voice" \
    --ref-audio /path/to/reference.wav \
    --ref-text "Original transcript of the reference audio"

Voices endpoint

List available voices, or upload a custom one for Base cloning:

# List
curl http://localhost:8091/v1/audio/voices

# Upload
curl -X POST http://localhost:8091/v1/audio/voices \
    -F "audio_sample=@/path/to/voice_sample.wav" \
    -F "consent=user_consent_id" \
    -F "name=custom_voice_1" \
    -F "ref_text=The exact transcript of the audio sample." \
    -F "speaker_description=warm narrator"
Uploaded voices are then usable as voice="custom_voice_1" on subsequent requests.

Precomputed custom voices

For reused Base voice-cloning speakers, precompute the reference artifacts once and load them at server startup:

python qwen3_tts/precompute_custom_voice.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
    --voice-name alice \
    --ref-audio /path/to/reference.wav \
    --ref-text "Original transcript of the reference audio" \
    --mode icl \
    --output-dir /path/to/custom_voices
--mode icl stores both speaker_embedding and ref_code; --mode xvec stores only the speaker embedding. Add the output directory to a deploy config:
custom_voice_dir: /path/to/custom_voices
Then start the server with that config and call the Speech API with only the voice name:
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base --omni --deploy-config /path/to/qwen3_tts_custom_voice.yaml

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"input":"Hello from a precomputed voice.","voice":"alice","task_type":"Base"}' \
    --output alice.wav

Streaming PCM

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, how are you?",
        "voice": "vivian",
        "language": "English",
        "stream": true,
        "stream_format": "audio",
        "response_format": "pcm"
    }' --no-buffer | play -t raw -r 24000 -e signed -b 16 -c 1 -
Raw PCM streaming requires stream_format="audio", response_format="pcm", and async_chunk: true on the stage config (default in qwen3_tts.yaml). speed is not supported when streaming.

Streaming WebSocket

The /v1/audio/speech/stream endpoint accepts text incrementally, splits it at sentence boundaries, and emits one PCM stream per sentence:

python qwen3_tts/streaming_speech_client.py --text "Hello world. How are you? I am fine."
python qwen3_tts/streaming_speech_client.py --text "..." --simulate-stt --stt-delay 0.1
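
The idea of splitting incrementally arriving text at sentence boundaries can be sketched as follows. This is a simplified client-side illustration, not the server's splitter: the SentenceBuffer class and the terminator set are assumptions, and the sketch does not handle abbreviations or decimal numbers.

```python
# Characters treated as sentence terminators in this sketch (an assumption).
SENTENCE_TERMINATORS = ".!?。!?"


class SentenceBuffer:
    """Accumulate text fragments and release complete sentences."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, text: str) -> list[str]:
        """Add a fragment and return any sentences it completed."""
        sentences = []
        for ch in text:
            self._pending += ch
            if ch in SENTENCE_TERMINATORS:
                sentence = self._pending.strip()
                # Skip fragments made only of punctuation, e.g. the tail of "...".
                if any(c.isalnum() for c in sentence):
                    sentences.append(sentence)
                self._pending = ""
        return sentences

    def flush(self) -> str:
        """Return whatever is left when the input ends without a terminator."""
        rest, self._pending = self._pending.strip(), ""
        return rest
```

Each sentence returned by feed() corresponds to one unit that would be synthesized as its own PCM stream; flush() covers trailing text when the input ends mid-sentence.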

To receive word-level timestamps, launch the server with a forced aligner:

vllm-omni serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --omni \
    --deploy-config vllm_omni/deploy/qwen3_tts.yaml \
    --trust-remote-code \
    --forced-aligner Qwen/Qwen3-ForcedAligner-0.6B
Then request PCM audio with JSON timestamp sidecars:
python qwen3_tts/streaming_speech_client.py \
    --text "Hello world. How are you?" \
    --stream-audio \
    --response-format pcm \
    --word-timestamps
The client writes one PCM file per sentence and a matching sentence_XXX_timestamps.json sidecar.

To see the alignment instead of reading a JSON sidecar, run the word-timestamp Gradio demo (server must be launched with --forced-aligner):

python qwen3_tts/word_timestamps_demo.py --api-base http://localhost:8091
Each sentence's audio plays in an <audio> element while its text is rendered as inline word spans; the current word highlights as audio.currentTime crosses each start_ms. The Stop (barge-in) button cuts playback and reports the last-spoken word, useful for the voice-agent barge-in case.
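
The highlighting rule (the current word is the last one whose start_ms has been crossed) can be expressed as a binary search. The sketch below assumes the timestamps for a sentence are available as a sorted list of start_ms values; the function name and the shape of the example data are hypothetical, and the actual sidecar schema may differ.

```python
from bisect import bisect_right


def current_word_index(start_ms: list[int], position_ms: float) -> int:
    """Return the index of the last word whose start_ms <= position_ms.

    Returns -1 when playback has not yet reached the first word.
    start_ms must be sorted in ascending order.
    """
    return bisect_right(start_ms, position_ms) - 1


# Hypothetical timestamps for a three-word sentence.
words = ["Hello", "world", "again"]
starts = [0, 320, 610]

# On barge-in, the last-spoken word is the one active at the stop position.
stop_position_ms = 450
last_spoken = words[current_word_index(starts, stop_position_ms)]
```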

Gradio demos

./qwen3_tts/run_gradio_demo.sh                              # CustomVoice (default)
./qwen3_tts/run_gradio_demo.sh --task-type VoiceDesign
./qwen3_tts/run_gradio_demo.sh --task-type Base

Speaker embedding interpolation

qwen3_tts/speaker_embedding_interpolation.py blends two predefined speakers' embeddings to produce intermediate voices. See the script for usage.
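
As a rough illustration of what blending two embeddings means, the sketch below performs plain linear interpolation between two vectors. It is an assumption that the script uses linear blending; it may normalize the vectors or use another scheme, so treat this only as a conceptual example.

```python
def interpolate_embeddings(a: list[float], b: list[float], alpha: float) -> list[float]:
    """Linearly blend two speaker embeddings.

    alpha=0.0 returns a, alpha=1.0 returns b, values in between mix the two.
    """
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be within [0, 1]")
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]
```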

Batch client

qwen3_tts/batch_speech_client.py issues many concurrent requests for throughput measurement.
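
A throughput measurement of this kind can be sketched with asyncio and a semaphore that caps the number of in-flight requests. This is not the code in batch_speech_client.py: run_batch and its send parameter are hypothetical, with send standing in for whatever coroutine performs the real HTTP call and returns the audio bytes.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_batch(
    send: Callable[[str], Awaitable[bytes]],
    texts: list[str],
    concurrency: int = 8,
) -> dict:
    """Send all texts with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(text: str) -> bytes:
        async with semaphore:
            return await send(text)

    started = time.perf_counter()
    results = await asyncio.gather(*(worker(t) for t in texts))
    elapsed = time.perf_counter() - started
    return {
        "requests": len(results),
        "seconds": elapsed,
        "requests_per_second": len(results) / elapsed if elapsed > 0 else float("inf"),
        "bytes": sum(len(r) for r in results),
    }
```

Injecting send keeps the timing logic independent of the HTTP library, so the same harness can be exercised with a stub before pointing it at a live server.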

Notes

  • Base voice cloning has uniproc-vs-mp tradeoffs depending on per-request reference audio cost; see the executor-backend section above.
  • With async chunking, Qwen3-TTS Base voice cloning sends the full reference context in the first Code2Wav packet, then caches that prefix on the Code2Wav stage for follow-up chunks in the same request.
  • vllm_omni/deploy/qwen3_tts.yaml is the default deploy config (loaded by HF model_type); per-stage runtime overrides are available via --stage-N-<field> <value>.

VoxCPM2

Single-stage native AR TTS at 48 kHz.

Launch

vllm serve openbmb/VoxCPM2 --omni --host 0.0.0.0 --port 8000
Deploy config auto-loads from vllm_omni/deploy/voxcpm2.yaml. Pass --deploy-config <path> to override or --stage-N-<field> <value> for per-stage runtime tweaks.

Sending requests

# Zero-shot synthesis
python voxcpm2/openai_speech_client.py --text "Hello, this is VoxCPM2."

# Voice cloning
python voxcpm2/openai_speech_client.py \
    --text "This should sound like the reference speaker." \
    --ref-audio /path/to/reference.wav
The ref_audio field accepts local file paths (auto-base64), HTTP URLs, or data:audio/wav;base64,... data URIs.
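
The three accepted forms can be handled by a small normalization step on the client. The sketch below passes URLs and data URIs through unchanged and converts a local path into a base64 data URI; normalize_ref_audio is a hypothetical helper, and the MIME table mirrors the one used by the example clients in this directory.

```python
import base64
from pathlib import Path

# Extension-to-MIME mapping; unknown extensions fall back to audio/wav.
MIME_BY_EXTENSION = {
    "wav": "audio/wav",
    "mp3": "audio/mpeg",
    "flac": "audio/flac",
    "ogg": "audio/ogg",
}


def normalize_ref_audio(value: str) -> str:
    """Return a value suitable for the ref_audio request field."""
    # HTTP(S) URLs and data URIs are already in an accepted form.
    if value.startswith(("http://", "https://", "data:")):
        return value
    # Anything else is treated as a local file path and inlined as base64.
    path = Path(value)
    mime_type = MIME_BY_EXTENSION.get(path.suffix.lower().lstrip("."), "audio/wav")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```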

Precomputed custom voices

For repeated VoxCPM2 speakers, precompute the prompt cache and load it through custom_voice_dir:

python voxcpm2/precompute_custom_voice.py \
    --model openbmb/VoxCPM2 \
    --voice-name alice \
    --ref-audio /path/to/reference.wav \
    --mode ref_continuation \
    --prompt-text "Original transcript of the reference audio" \
    --output-dir /path/to/custom_voices
Add the output directory to the deploy config:
custom_voice_dir: /path/to/custom_voices
After startup, /v1/audio/voices lists alice, and /v1/audio/speech can use voice="alice" without sending ref_audio.

Gradio demo (gapless streaming via AudioWorklet)

python voxcpm2/gradio_demo.py
Uses an AudioWorklet-based player adapted from the Qwen3-TTS demo for gap-free playback. Raw PCM audio is streamed from the OpenAI Speech endpoint with stream=true and stream_format="audio".


Voxtral TTS

Voxtral-4B-TTS (Mistral). Uses the mistral_common SpeechRequest protocol; voice presets are model-specific.

Prerequisites

Latest mistral_common with SpeechRequest support:

pip install -e /path/to/mistral-common  # or upgrade from PyPI when available

Launch

vllm serve mistralai/Voxtral-4B-TTS-2603 --omni --port 8091
Deploy config auto-loads from vllm_omni/deploy/voxtral_tts.yaml.

Gradio demo

python voxtral_tts/gradio_demo.py
The demo handles voice-preset selection and reference-audio upload. voxtral_tts/text_preprocess.py provides the text-normalization helpers used by the demo (also available for other clients).

Notes

  • Voice presets are listed on the HF model card (mistralai/Voxtral-4B-TTS-2603).
  • Voice cloning is gated upstream and may require a recent mistral_common.
  • A standalone CLI client is not yet shipped; the gradio demo is the canonical reference for now.

SoulX-Singer

Singing voice synthesis (SVS) and conversion (SVC) at 24 kHz. Single-stage DiT with inline preprocessing. Uses the /v1/chat/completions endpoint with multimodal input (prompt_audio + target_audio).

Prerequisites

Download the DiT and preprocess weights, set up separate SVS / SVC view directories, and install dependencies as described in the offline README. The architectures field in config.json is the single source of truth for SVS vs SVC, so point MODEL at the matching directory.

Launch

# SVS (default)
export MODEL=/path/to/SoulX-Singer
export PREPROCESS=/path/to/SoulX-Singer-Preprocess
bash examples/online_serving/text_to_speech/soulxsinger/run_server.sh

# SVC
export MODE=svc
export MODEL=/path/to/SoulX-Singer-svc
bash examples/online_serving/text_to_speech/soulxsinger/run_server.sh

Or equivalently, set SOULX_PREPROCESS_WEIGHTS_DIR and launch directly:

export SOULX_PREPROCESS_WEIGHTS_DIR=$PREPROCESS
vllm serve $MODEL --omni \
    --deploy-config vllm_omni/deploy/soulxsinger_${MODE}.yaml \
    --port 8192 --trust-remote-code --enforce-eager

Sending requests

Audio paths must be reachable from the server host (local filesystem or data URL). The client sends the prompt vocal via input_audio and the target accompaniment via extra_args['target_audio'].

# Default demo audio: tests/assets/soulxsinger/zh_prompt.mp3 + music.mp3
python examples/online_serving/text_to_speech/soulxsinger/openai_chat_client.py \
    --prompt-audio /path/on/server/zh_prompt.mp3 \
    --target-audio /path/on/server/music.mp3 \
    --preprocess-weights-dir /path/on/server/SoulX-Singer-Preprocess \
    -o output.wav

To skip online preprocessing, pass precomputed metadata instead:

python examples/online_serving/text_to_speech/soulxsinger/openai_chat_client.py \
    --prompt-metadata-path /path/on/server/zh_prompt.json \
    --target-metadata-path /path/on/server/music.json \
    --audio-path /path/on/server/zh_prompt.mp3 \
    -o output.wav

SOULX_PREPROCESS_WEIGHTS_DIR makes --preprocess-weights-dir optional. See openai_chat_client.py --help for --vocal-sep, --language, --num-inference-steps, --guidance-scale, and --seed.

Notes

  • Output: 24 kHz mono WAV; batch only.
  • Defaults match upstream: --guidance-scale 3.0, --seed 42, --auto-shift on.
  • SVS --control: score or melody. MIDI / lyric QC: upstream midi_editor only.

Example materials

cosyvoice3/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for CosyVoice3 TTS
#
# Usage:
#   ./run_server.sh
#   CUDA_VISIBLE_DEVICES=0 ./run_server.sh
#
# Streaming (async-chunk) is on by default via vllm_omni/deploy/cosyvoice3.yaml.
# Set NO_ASYNC_CHUNK=1 to use the legacy synchronous path.

set -e

MODEL="${MODEL:-FunAudioLLM/Fun-CosyVoice3-0.5B-2512}"
PORT="${PORT:-8091}"

EXTRA_ARGS=()
if [[ -n "${NO_ASYNC_CHUNK:-}" ]]; then
    EXTRA_ARGS+=(--no-async-chunk)
fi

echo "Starting CosyVoice3 server with model: $MODEL"

vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --trust-remote-code \
    --omni \
    "${EXTRA_ARGS[@]}"
cosyvoice3/speech_client.py
"""Client for CosyVoice3 TTS via /v1/audio/speech endpoint.

CosyVoice3 has no built-in voice presets: every request is voice cloning
driven by ``ref_audio`` + ``ref_text``. The defaults below point at the
official upstream zero-shot prompt so the script runs out of the box.

Examples:
    # Voice cloning with the default upstream prompt
    python speech_client.py --text "收到好友从远方寄来的生日礼物。"

    # Custom reference clip + transcript
    python speech_client.py --text "Hello, this is a cloned voice." \
        --ref-audio /path/to/reference.wav \
        --ref-text "Transcript of the reference audio."

    # Streaming PCM output
    python speech_client.py --text "Hello world" --stream --output output.pcm
"""

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
DEFAULT_MODEL = "FunAudioLLM/Fun-CosyVoice3-0.5B-2512"

# Official CosyVoice zero-shot prompt and its transcript.
DEFAULT_REF_AUDIO = "https://raw.githubusercontent.com/FunAudioLLM/CosyVoice/main/asset/zero_shot_prompt.wav"
DEFAULT_REF_TEXT = "希望你以后能够做的比我还好呦。"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "flac": "audio/flac", "ogg": "audio/ogg"}
    mime_type = mime_map.get(ext, "audio/wav")
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"


def run_tts(args) -> None:
    """Generate speech via the /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }

    if args.ref_audio.startswith(("http://", "https://")):
        payload["ref_audio"] = args.ref_audio
    else:
        payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    payload["ref_text"] = args.ref_text

    if args.stream:
        payload["stream"] = True
        payload["stream_format"] = "audio"
        payload["response_format"] = "pcm"

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print(f"Voice cloning: ref_audio={args.ref_audio}, ref_text={args.ref_text}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    if args.stream:
        output_path = args.output or "output.pcm"
        with httpx.Client(timeout=300.0) as client:
            with client.stream("POST", api_url, json=payload, headers=headers) as resp:
                if resp.status_code != 200:
                    print(f"Error: {resp.status_code}")
                    print(resp.read().decode())
                    return
                total_bytes = 0
                with open(output_path, "wb") as f:
                    for chunk in resp.iter_bytes():
                        f.write(chunk)
                        total_bytes += len(chunk)
                print(f"Streamed {total_bytes} bytes to: {output_path}")
    else:
        with httpx.Client(timeout=300.0) as client:
            response = client.post(api_url, json=payload, headers=headers)

        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.text)
            return

        try:
            text = response.content.decode("utf-8")
            if text.startswith('{"error"'):
                print(f"Error: {text}")
                return
        except UnicodeDecodeError:
            pass

        output_path = args.output or "output.wav"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Audio saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="CosyVoice3 TTS client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default=DEFAULT_MODEL, help="Model name")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument(
        "--ref-audio",
        default=DEFAULT_REF_AUDIO,
        help="Reference audio for voice cloning (path or URL)",
    )
    parser.add_argument(
        "--ref-text",
        default=DEFAULT_REF_TEXT,
        help="Transcript of the reference audio",
    )
    parser.add_argument("--stream", action="store_true", help="Enable streaming (PCM output)")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    args = parser.parse_args()
    run_tts(args)


if __name__ == "__main__":
    main()
fish_speech/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/fish_speech/gradio_demo.py.

fish_speech/run_gradio_demo.sh
#!/bin/bash
# Launch Fish Speech S2 Pro server + Gradio demo together.
#
# Usage:
#   ./run_gradio_demo.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8091 GRADIO_PORT=7860 ./run_gradio_demo.sh

set -e

MODEL="${MODEL:-fishaudio/s2-pro}"
PORT="${PORT:-8091}"
GRADIO_PORT="${GRADIO_PORT:-7860}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

echo "Starting Fish Speech S2 Pro server (port $PORT)..."
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --omni \
    --host 0.0.0.0 \
    --port "$PORT" &
SERVER_PID=$!

cleanup() {
    echo "Stopping server (PID $SERVER_PID)..."
    kill $SERVER_PID 2>/dev/null
    wait $SERVER_PID 2>/dev/null
}
trap cleanup EXIT

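# Wait for server to be ready; abort if it never responds.
echo "Waiting for server to start..."
READY=0
for i in $(seq 1 120); do
    if curl -s "http://localhost:$PORT/health" > /dev/null 2>&1; then
        echo "Server ready."
        READY=1
        break
    fi
    sleep 2
done
if [[ "$READY" -ne 1 ]]; then
    echo "Server did not become ready in time." >&2
    exit 1
fi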

echo "Starting Gradio demo (port $GRADIO_PORT)..."
python "$SCRIPT_DIR/gradio_demo.py" \
    --api-base "http://localhost:$PORT" \
    --port "$GRADIO_PORT"
fish_speech/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for Fish Speech S2 Pro
#
# Usage:
#   ./run_server.sh
#   CUDA_VISIBLE_DEVICES=0 ./run_server.sh

set -e

MODEL="${MODEL:-fishaudio/s2-pro}"
PORT="${PORT:-8091}"

echo "Starting Fish Speech S2 Pro server with model: $MODEL"

FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --omni \
    --host 0.0.0.0 \
    --port "$PORT"
fish_speech/speech_client.py
"""Client for Fish Speech S2 Pro via /v1/audio/speech endpoint.

Examples:
    # Basic TTS
    python speech_client.py --text "Hello, how are you?"

    # Voice cloning
    python speech_client.py --text "Hello, how are you?" \
        --ref-audio ref.wav --ref-text "This is the reference transcript."

    # Streaming PCM output
    python speech_client.py --text "Hello world" --stream --output output.pcm
"""

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "flac": "audio/flac", "ogg": "audio/ogg"}
    mime_type = mime_map.get(ext, "audio/wav")
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"


def run_tts(args) -> None:
    """Generate speech via /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }

    # Voice cloning parameters.
    if args.ref_audio:
        if args.ref_audio.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = args.ref_audio
        else:
            payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    if args.ref_text:
        payload["ref_text"] = args.ref_text

    if args.stream:
        payload["stream"] = True
        payload["stream_format"] = "audio"
        payload["response_format"] = "pcm"

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    if args.ref_audio:
        print(f"Voice cloning: ref_audio={args.ref_audio}, ref_text={args.ref_text}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    if args.stream:
        output_path = args.output or "output.pcm"
        with httpx.Client(timeout=300.0) as client:
            with client.stream("POST", api_url, json=payload, headers=headers) as resp:
                if resp.status_code != 200:
                    print(f"Error: {resp.status_code}")
                    print(resp.read().decode())
                    return
                total_bytes = 0
                with open(output_path, "wb") as f:
                    for chunk in resp.iter_bytes():
                        f.write(chunk)
                        total_bytes += len(chunk)
                print(f"Streamed {total_bytes} bytes to: {output_path}")
    else:
        with httpx.Client(timeout=300.0) as client:
            response = client.post(api_url, json=payload, headers=headers)

        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.text)
            return

        try:
            text = response.content.decode("utf-8")
            if text.startswith('{"error"'):
                print(f"Error: {text}")
                return
        except UnicodeDecodeError:
            pass

        output_path = args.output or "output.wav"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Audio saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="Fish Speech S2 Pro TTS client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default="fishaudio/s2-pro", help="Model name")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument("--ref-audio", default=None, help="Reference audio for voice cloning (path or URL)")
    parser.add_argument("--ref-text", default=None, help="Transcript of reference audio")
    parser.add_argument("--stream", action="store_true", help="Enable streaming (PCM output)")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    args = parser.parse_args()
    run_tts(args)


if __name__ == "__main__":
    main()
glm_tts/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/glm_tts/gradio_demo.py.

glm_tts/openai_speech_client.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""OpenAI-compatible client for GLM-TTS via /v1/audio/speech endpoint.

GLM-TTS is a two-stage TTS system (AR + DiT) that generates audio from text
conditioned on reference speech. Each request requires ref_audio + ref_text.

Usage:
    # Voice cloning
    python openai_speech_client.py --text "你好" --ref-audio file:///path/to/ref.wav --ref-text "参考文本"

    # Streaming response, for async_chunk server mode
    python openai_speech_client.py --text "你好" --stream --ref-audio file:///path/to/ref.wav --ref-text "参考文本"

    # Specify output format
    python openai_speech_client.py --text "你好" --ref-audio file:///path/to/ref.wav \
        --ref-text "参考文本" --response-format mp3 -o output.mp3
"""

import argparse

import httpx

# Default server configuration
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def run_tts_generation(args) -> None:
    """Run TTS generation via OpenAI-compatible /v1/audio/speech API."""
    if not args.ref_audio or not args.ref_text:
        raise ValueError("GLM-TTS requires --ref-audio and --ref-text for voice cloning.")

    payload = {
        "model": args.model,
        "voice": "default",
        "input": args.text,
        "response_format": args.response_format,
        "stream": bool(args.stream),
        "ref_audio": args.ref_audio,
        "ref_text": args.ref_text,
    }
    if args.stream:
        payload["stream_format"] = "audio"
        payload["response_format"] = "pcm"
    if args.max_new_tokens:
        payload["max_new_tokens"] = args.max_new_tokens

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print(f"Voice cloning: ref_audio={args.ref_audio}, ref_text={args.ref_text}")
    print(f"Stream: {args.stream}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    if args.stream:
        output_path = args.output or "tts_output.pcm"
        with httpx.Client(timeout=300.0) as client, open(output_path, "wb") as f:
            with client.stream("POST", api_url, json=payload, headers=headers) as response:
                if response.status_code != 200:
                    print(f"Error: {response.status_code}")
                    response.read()
                    print(response.text)
                    return
                for chunk in response.iter_bytes():
                    f.write(chunk)
        print(f"Streaming audio saved to: {output_path}")
    else:
        with httpx.Client(timeout=300.0) as client:
            response = client.post(api_url, json=payload, headers=headers)
        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.text)
            return
        try:
            text = response.content.decode("utf-8")
            if text.startswith('{"error"'):
                print(f"Error: {text}")
                return
        except UnicodeDecodeError:
            pass
        output_path = args.output or f"tts_output.{args.response_format}"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Audio saved to: {output_path}")


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="OpenAI-compatible client for GLM-TTS via /v1/audio/speech",
    )

    # Server configuration
    parser.add_argument(
        "--api-base",
        type=str,
        default=DEFAULT_API_BASE,
        help=f"API base URL (default: {DEFAULT_API_BASE})",
    )
    parser.add_argument(
        "--api-key",
        type=str,
        default=DEFAULT_API_KEY,
        help="API key (default: EMPTY)",
    )
    parser.add_argument(
        "--model",
        "-m",
        type=str,
        default="glm-tts",
        help="Model name/path",
    )

    # Input text
    parser.add_argument(
        "--text",
        type=str,
        required=True,
        help="Text to synthesize",
    )

    # Generation parameters
    parser.add_argument(
        "--max-new-tokens",
        type=int,
        default=None,
        help="Maximum new tokens to generate (default: model default)",
    )

    # Output
    parser.add_argument(
        "--response-format",
        type=str,
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio output format (default: wav)",
    )
    parser.add_argument(
        "--stream",
        action="store_true",
        help="Request a streaming audio response (use with async_chunk server mode).",
    )
    parser.add_argument(
        "--output",
        "-o",
        type=str,
        default=None,
        help="Output audio file path (default: tts_output.<format>)",
    )

    # Voice cloning parameters
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio URL, file:// URI, or base64 data URL for voice cloning",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcript of the reference audio (required with --ref-audio)",
    )

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    run_tts_generation(args)
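
The client above forwards --ref-audio to the server unchanged, so a bare local path has to be turned into a file:// URI (or a data URL) before it is passed in. A minimal sketch of that conversion, using only the standard library; the helper name to_ref_audio is hypothetical and not part of the example scripts:

```python
from pathlib import Path


def to_ref_audio(value: str) -> str:
    """Return value unchanged if it already looks like a URL, else a file:// URI."""
    if value.startswith(("http://", "https://", "file://", "data:")):
        return value
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(value).resolve().as_uri()
```

A file:// URI is only useful when the server process can read the same path. For a remote server, a base64 data URL such as the one built by encode_audio_to_base64 in the Fish Speech client avoids that dependency.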
glm_tts/run_gradio_demo.sh
#!/bin/bash
# Launch GLM-TTS server + Gradio demo together.
#
# Usage:
#   ./run_gradio_demo.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8091 GRADIO_PORT=7860 ./run_gradio_demo.sh

set -e

MODEL="${MODEL:-zai-org/GLM-TTS}"
PORT="${PORT:-8091}"
GRADIO_PORT="${GRADIO_PORT:-7860}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"

echo "Starting GLM-TTS server (port $PORT)..."
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm-omni serve "$MODEL" \
    --deploy-config "$REPO_ROOT/vllm_omni/deploy/glm_tts.yaml" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --omni &
SERVER_PID=$!

cleanup() {
    echo "Stopping server (PID $SERVER_PID)..."
    kill $SERVER_PID 2>/dev/null
    wait $SERVER_PID 2>/dev/null
}
trap cleanup EXIT

# Wait for server to be ready.
echo "Waiting for server to start..."
for i in $(seq 1 120); do
    if curl -s "http://localhost:$PORT/health" > /dev/null 2>&1; then
        echo "Server ready."
        break
    fi
    sleep 2
done

echo "Starting Gradio demo (port $GRADIO_PORT)..."
python "$SCRIPT_DIR/gradio_demo.py" \
    --api-base "http://localhost:$PORT" \
    --port "$GRADIO_PORT"
glm_tts/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for GLM-TTS models
#
# Usage:
#   ./run_server.sh                           # Default model path, async_chunk mode
#   ./run_server.sh /path/to/GLM-TTS          # Custom model path, async_chunk mode
#   ./run_server.sh /path/to/GLM-TTS sync     # Sync two-stage mode
#
# NOTE: The model path should point to the repo ROOT (not llm/ subdirectory).
# model_subdir/tokenizer_subdir in the pipeline config resolve subdirectories.

set -e

MODEL="${1:-zai-org/GLM-TTS}"
MODE="${2:-async}"

EXTRA_ARGS=()
case "$MODE" in
    async|async_chunk)
        ;;
    sync|no_async_chunk)
        EXTRA_ARGS+=("--no-async-chunk")
        ;;
    *)
        echo "Unknown mode: $MODE (expected async or sync)" >&2
        exit 1
        ;;
esac

echo "Starting GLM-TTS server with model: $MODEL (mode: $MODE)"

vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/glm_tts.yaml \
    --host 0.0.0.0 \
    --port 8091 \
    --trust-remote-code \
    --omni \
    "${EXTRA_ARGS[@]}"
higgs_audio_v2/README.md

higgs-audio v2 online example

This directory contains the online-serving entry points for boson-ai's higgs-audio v2 as integrated by vllm-omni: a 2-stage TTS pipeline (Llama-3.2-3B talker with DualFFN audio expert + HiggsAudio codec decoder) emitting 24 kHz mono speech.

Prerequisites

Voice cloning uses Hugging Face's HiggsAudioV2TokenizerModel, loaded from the audio_tokenizer/ subdirectory of k2-fsa/OmniVoice. (The model.safetensors in boson-ai's standalone tokenizer Hub repo is the 3B talker LM, not the codec.) Only the audio_tokenizer/ subdirectory, about 806 MB, is downloaded.

pip install -U "transformers>=5.3.0"

Files

  • run_server.sh — launch the vllm-omni server with the bundled vllm_omni/deploy/higgs_audio_v2.yaml deploy config.
  • batch_speech_client.py — send a list of prompts to /v1/audio/speech and save the returned WAV / PCM bytes to a directory; optionally passes --ref-audio + --ref-text for shallow voice clone.

Launching the server

GPUS=6,7 PORT=8094 bash examples/online_serving/text_to_speech/higgs_audio_v2/run_server.sh

Environment overrides:

  • MODEL: HF id of the talker (default bosonai/higgs-audio-v2-generation-3B-base).
  • PORT: server port (default 8094).
  • GPUS: CUDA_VISIBLE_DEVICES value (default 6,7).
  • GPU_UTIL: value passed to --gpu-memory-utilization (default 0.4).

The script also exports VLLM_USE_DEEP_GEMM=0 / VLLM_MOE_USE_DEEP_GEMM=0 so the example works on images without the optional deep_gemm backend.

The deploy YAML ships with async_chunk: false and codec_streaming: true, i.e. Stage 0 finishes its codec frames before Stage 1 starts decoding, and Stage 1 streams WAV/PCM bytes to the client chunk-by-chunk.
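
When a client asks for raw PCM (for example with --format pcm in the batch client), the returned bytes carry no header, so most players cannot open them directly. The sketch below wraps such bytes in a WAV container with Python's standard wave module. The 24 kHz mono layout comes from the description at the top of this README; the 16-bit signed little-endian sample format is an assumption that this README does not confirm, and pcm_to_wav is an illustrative name, not part of vllm-omni.

```python
import io
import wave


def pcm_to_wav(
    pcm: bytes,
    sample_rate: int = 24000,
    channels: int = 1,
    sample_width: int = 2,
) -> bytes:
    """Wrap headerless PCM samples in a WAV container.

    Assumption: samples are 16-bit signed little-endian (sample_width=2 bytes).
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

For a saved response this could be used as `Path("out.wav").write_bytes(pcm_to_wav(Path("out.pcm").read_bytes()))`. If playback sounds too fast, too slow, or like noise, the assumed sample rate or sample width is the first thing to check.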

Driving the server

Plain TTS:

python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
    --base-url http://localhost:8094 \
    --model bosonai/higgs-audio-v2-generation-3B-base \
    --output-dir /tmp/higgs_audio_v2_batch \
    --prompts "Hello world." \
              "The quick brown fox jumps over the lazy dog."

Voice clone — pass a reference clip and its transcript (both required together):

python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
    --base-url http://localhost:8094 \
    --model bosonai/higgs-audio-v2-generation-3B-base \
    --output-dir /tmp/higgs_audio_v2_clone \
    --ref-audio /path/to/reference.wav \
    --ref-text  "Exact transcript spoken in reference.wav." \
    --prompts "Hello, this is a cloned voice."

Notes

  • --ref-text must be the real transcript of --ref-audio; mismatched text degrades cloned-voice quality.
  • Out of scope (rejected with explicit 4xx by the request validator): multi-speaker [SPEAKERn] tags inside input, profile: text-only speaker descriptions, the ref_audio_in_system_message system-block variant, chunked long-form generation, and per-request voice / instructions / task_type / language / speed != 1.0 / x_vector_only_mode / speaker_embedding.
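
The out-of-scope list above can be checked on the client before a request is sent, which gives a clearer message than a 4xx response. The following is a rough sketch only: the server-side validator remains the authority, preflight_problems is a hypothetical helper, the [SPEAKERn] pattern assumes n is a decimal number, and any non-None value for a listed field is flagged, which may be stricter than the server. It does not cover the profile: descriptions, the ref_audio_in_system_message variant, or long-form chunking.

```python
import re

# Per-request fields the Notes above list as rejected.
# speed is handled separately because only values other than 1.0 are rejected.
UNSUPPORTED_FIELDS = (
    "voice",
    "instructions",
    "task_type",
    "language",
    "x_vector_only_mode",
    "speaker_embedding",
)

# Assumption: multi-speaker tags look like [SPEAKER0], [SPEAKER1], ...
SPEAKER_TAG = re.compile(r"\[SPEAKER\d+\]")


def preflight_problems(payload: dict) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    for field in UNSUPPORTED_FIELDS:
        if payload.get(field) is not None:
            problems.append(f"unsupported field: {field}")
    if payload.get("speed", 1.0) != 1.0:
        problems.append("speed must be 1.0")
    if SPEAKER_TAG.search(payload.get("input", "")):
        problems.append("multi-speaker [SPEAKERn] tags are not supported")
    # Voice clone needs the reference clip and its transcript together.
    if ("ref_audio" in payload) != ("ref_text" in payload):
        problems.append("ref_audio and ref_text must be supplied together")
    return problems
```

A caller would typically print the returned list and skip the request when it is non-empty.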
higgs_audio_v2/batch_speech_client.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Batch client for the higgs-audio v2 online server.

Sends a fixed list of prompts to ``/v1/audio/speech`` and saves the returned
WAV files (or raw PCM bytes when ``--format pcm``) into ``--output-dir``.

Usage (plain text -> speech):

  python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
      --base-url http://localhost:8094 \
      --output-dir /tmp/higgs_audio_v2_batch \
      --prompts "Hello world." "The quick brown fox jumps over the lazy dog."

Usage (shallow voice clone — pass a reference clip + its transcript):

  python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
      --base-url http://localhost:8094 \
      --output-dir /tmp/higgs_audio_v2_clone \
      --ref-audio path/to/reference.wav \
      --ref-text "the transcript of the reference clip" \
      --prompts "Hello world."
"""

from __future__ import annotations

import argparse
import base64
import sys
from pathlib import Path

DEFAULT_PROMPTS = (
    "Hello world.",
    "The quick brown fox jumps over the lazy dog.",
    "It was the night before my birthday.",
    "Innovation distinguishes between a leader and a follower.",
)


def _slug(text: str) -> str:
    import re

    s = re.sub(r"\s+", "_", text.strip().lower())
    return re.sub(r"[^a-z0-9_]+", "", s)[:32] or "prompt"


def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("--base-url", default="http://localhost:8094")
    parser.add_argument("--model", default="higgs_audio_v2")
    parser.add_argument("--prompts", nargs="+", default=list(DEFAULT_PROMPTS))
    parser.add_argument("--output-dir", type=Path, default=Path("/tmp/higgs_audio_v2_batch"))
    parser.add_argument("--format", choices=("wav", "pcm"), default="wav")
    parser.add_argument("--max-new-tokens", type=int, default=300)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--timeout-s", type=float, default=120.0)
    parser.add_argument(
        "--ref-audio",
        type=Path,
        default=None,
        help="Reference clip for voice clone (path to a WAV file). Must be paired with --ref-text.",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcript of the reference clip. Required when --ref-audio is set.",
    )
    args = parser.parse_args()

    if (args.ref_audio is None) != (args.ref_text is None):
        print("--ref-audio and --ref-text must be supplied together", file=sys.stderr)
        return 2

    ref_audio_data_url: str | None = None
    if args.ref_audio is not None:
        if not args.ref_audio.exists():
            print(f"ref-audio file not found: {args.ref_audio}", file=sys.stderr)
            return 2
        mime = "audio/wav" if args.ref_audio.suffix.lower() == ".wav" else "audio/mpeg"
        ref_b64 = base64.b64encode(args.ref_audio.read_bytes()).decode("ascii")
        ref_audio_data_url = f"data:{mime};base64,{ref_b64}"

    try:
        import httpx
    except ImportError:
        print(
            "this client needs `httpx`. Install with `pip install httpx`.",
            file=sys.stderr,
        )
        return 2

    args.output_dir.mkdir(parents=True, exist_ok=True)
    url = args.base_url.rstrip("/") + "/v1/audio/speech"
    failures = 0
    with httpx.Client(timeout=args.timeout_s) as client:
        for prompt in args.prompts:
            payload = {
                "model": args.model,
                "input": prompt,
                "response_format": args.format,
                "max_new_tokens": args.max_new_tokens,
                "seed": args.seed,
            }
            if ref_audio_data_url is not None:
                payload["ref_audio"] = ref_audio_data_url
                payload["ref_text"] = args.ref_text
            resp = client.post(url, json=payload)
            if resp.status_code != 200:
                print(f"[FAIL] {prompt!r} -> {resp.status_code}: {resp.text[:200]}", file=sys.stderr)
                failures += 1
                continue
            suffix = ".wav" if args.format == "wav" else ".pcm"
            out = args.output_dir / f"{_slug(prompt)}{suffix}"
            out.write_bytes(resp.content)
            print(f"[ ok ] {prompt!r} -> {out} ({len(resp.content)} bytes)")

    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
higgs_audio_v2/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for higgs-audio v2.
#
# Scope: plain text -> 24 kHz speech, plus voice cloning via ref_audio +
# ref_text (see README.md). Multi-speaker tags, ChatML rich content, and
# language overrides are rejected by the validator with explicit 4xx
# (see vllm_omni/entrypoints/openai/serving_speech.py).
#
# Usage:
#   ./run_server.sh                 # default port 8094, GPUs 6 and 7
#   PORT=8095 GPUS=6,7 ./run_server.sh
#   MODEL=bosonai/higgs-audio-v2-generation-3B-base ./run_server.sh

set -e

MODEL="${MODEL:-bosonai/higgs-audio-v2-generation-3B-base}"
PORT="${PORT:-8094}"
GPUS="${GPUS:-6,7}"
GPU_UTIL="${GPU_UTIL:-0.4}"

echo "Starting higgs-audio v2 server"
echo "  MODEL=$MODEL"
echo "  PORT=$PORT"
echo "  CUDA_VISIBLE_DEVICES=$GPUS"

# DeepGEMM FP8 kernels are optional and trip warmup on builds without
# the deep_gemm backend; disable them so the example works out of the box.
# Users with deep_gemm installed can re-enable via the same env vars.
CUDA_VISIBLE_DEVICES="$GPUS" \
VLLM_USE_DEEP_GEMM=0 \
VLLM_MOE_USE_DEEP_GEMM=0 \
vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/higgs_audio_v2.yaml \
    --host 0.0.0.0 \
    --port "$PORT" \
    --gpu-memory-utilization "$GPU_UTIL" \
    --trust-remote-code \
    --omni
higgs_audio_v3/README.md

Higgs-Audio V3 Online Serving

Start the server

# Default: GPU 0, port 8095
./examples/online_serving/text_to_speech/higgs_audio_v3/run_server.sh

# Custom GPU / port
PORT=8096 GPUS=0,1 ./examples/online_serving/text_to_speech/higgs_audio_v3/run_server.sh

Plain text TTS

python examples/online_serving/text_to_speech/higgs_audio_v3/batch_speech_client.py \
    --base-url http://localhost:8095 \
    --output-dir /tmp/higgs_v3_batch \
    --prompts "Hello world." "The quick brown fox jumps over the lazy dog."

Voice clone

python examples/online_serving/text_to_speech/higgs_audio_v3/batch_speech_client.py \
    --base-url http://localhost:8095 \
    --output-dir /tmp/higgs_v3_clone \
    --ref-audio path/to/reference.wav \
    --ref-text "transcript of the reference clip" \
    --prompts "Text to synthesize in the cloned voice."

curl example

curl -X POST http://localhost:8095/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model": "higgs_audio_v3", "input": "Hello world."}' \
    --output hello.wav
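
The same request can be made from Python without extra dependencies. This is a sketch under a few assumptions: the model name and port are taken from the curl example above, the optional ref_audio / ref_text fields mirror what batch_speech_client.py sends, and the helper names (audio_data_url, build_payload, synthesize) are illustrative. It uses urllib from the standard library rather than httpx, which the bundled clients use.

```python
import base64
import json
import urllib.request
from pathlib import Path
from typing import Optional

MIME_BY_SUFFIX = {".wav": "audio/wav", ".flac": "audio/flac", ".mp3": "audio/mpeg"}


def audio_data_url(path: Path) -> str:
    """Encode a local audio file as a base64 data URL."""
    mime = MIME_BY_SUFFIX.get(path.suffix.lower(), "audio/wav")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_payload(
    text: str,
    ref_audio: Optional[Path] = None,
    ref_text: Optional[str] = None,
    model: str = "higgs_audio_v3",
) -> dict:
    """Build the JSON body; the reference clip is only added for voice clone."""
    payload = {"model": model, "input": text}
    if ref_audio is not None:
        payload["ref_audio"] = audio_data_url(ref_audio)
        if ref_text:
            payload["ref_text"] = ref_text
    return payload


def synthesize(payload: dict, base_url: str = "http://localhost:8095", timeout_s: float = 120.0) -> bytes:
    """POST the payload to /v1/audio/speech and return the audio bytes."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read()


# Example (needs a running server):
# Path("hello.wav").write_bytes(synthesize(build_payload("Hello world.")))
```

urlopen raises urllib.error.HTTPError on a non-2xx status, so a rejected request surfaces as an exception rather than as an error body written to the output file.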
higgs_audio_v3/batch_speech_client.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Batch client for the higgs-audio v3 online server.

Sends prompts to ``/v1/audio/speech`` and saves the returned WAV files.

Usage (plain text -> speech):

  python examples/online_serving/text_to_speech/higgs_audio_v3/batch_speech_client.py \
      --base-url http://localhost:8095 \
      --output-dir /tmp/higgs_v3_batch \
      --prompts "Hello world." "The quick brown fox jumps over the lazy dog."

Usage (voice clone):

  python examples/online_serving/text_to_speech/higgs_audio_v3/batch_speech_client.py \
      --base-url http://localhost:8095 \
      --output-dir /tmp/higgs_v3_clone \
      --ref-audio path/to/reference.wav \
      --ref-text "the transcript of the reference clip" \
      --prompts "Hello world."
"""

from __future__ import annotations

import argparse
import base64
import sys
from pathlib import Path

DEFAULT_PROMPTS = (
    "Hello world.",
    "The quick brown fox jumps over the lazy dog.",
    "Today is a beautiful day for a walk in the park.",
    "Innovation distinguishes between a leader and a follower.",
)


def _slug(text: str) -> str:
    import re

    s = re.sub(r"\s+", "_", text.strip().lower())
    return re.sub(r"[^a-z0-9_]+", "", s)[:32] or "prompt"


def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("--base-url", default="http://localhost:8095")
    parser.add_argument("--model", default="higgs_audio_v3")
    parser.add_argument("--prompts", nargs="+", default=list(DEFAULT_PROMPTS))
    parser.add_argument("--output-dir", type=Path, default=Path("/tmp/higgs_v3_batch"))
    parser.add_argument("--format", choices=("wav", "pcm"), default="wav")
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--timeout-s", type=float, default=120.0)
    parser.add_argument(
        "--ref-audio",
        type=Path,
        default=None,
        help="Reference clip for voice clone (WAV/FLAC/MP3 path). Pair with --ref-text.",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcript of the reference clip. Optional but improves fidelity.",
    )
    args = parser.parse_args()

    ref_audio_data_url: str | None = None
    if args.ref_audio is not None:
        if not args.ref_audio.exists():
            print(f"ref-audio file not found: {args.ref_audio}", file=sys.stderr)
            return 2
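        # Map the suffix to a MIME type; --ref-audio accepts WAV, FLAC, and MP3.
        mime_by_suffix = {".wav": "audio/wav", ".flac": "audio/flac", ".mp3": "audio/mpeg"}
        mime = mime_by_suffix.get(args.ref_audio.suffix.lower(), "audio/mpeg")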
        ref_b64 = base64.b64encode(args.ref_audio.read_bytes()).decode("ascii")
        ref_audio_data_url = f"data:{mime};base64,{ref_b64}"

    try:
        import httpx
    except ImportError:
        print("this client needs `httpx`. Install with `pip install httpx`.", file=sys.stderr)
        return 2

    args.output_dir.mkdir(parents=True, exist_ok=True)
    url = args.base_url.rstrip("/") + "/v1/audio/speech"
    failures = 0
    with httpx.Client(timeout=args.timeout_s) as client:
        for prompt in args.prompts:
            payload = {
                "model": args.model,
                "input": prompt,
                "response_format": args.format,
                "max_new_tokens": args.max_new_tokens,
                "seed": args.seed,
            }
            if ref_audio_data_url is not None:
                payload["ref_audio"] = ref_audio_data_url
                if args.ref_text:
                    payload["ref_text"] = args.ref_text
            resp = client.post(url, json=payload)
            if resp.status_code != 200:
                print(f"[FAIL] {prompt!r} -> {resp.status_code}: {resp.text[:200]}", file=sys.stderr)
                failures += 1
                continue
            suffix = ".wav" if args.format == "wav" else ".pcm"
            out = args.output_dir / f"{_slug(prompt)}{suffix}"
            out.write_bytes(resp.content)
            print(f"[ ok ] {prompt!r} -> {out} ({len(resp.content)} bytes)")

    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
higgs_audio_v3/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for higgs-audio v3.
#
# Supports plain text TTS and voice cloning via /v1/audio/speech.
#
# Usage:
#   ./run_server.sh                 # default port 8095, GPU 0
#   PORT=8096 GPUS=0,1 ./run_server.sh
#   MODEL=/path/to/local/checkpoint ./run_server.sh

set -e

MODEL="${MODEL:-bosonai/higgs-audio-v3-tts-4b}"
PORT="${PORT:-8095}"
GPUS="${GPUS:-0}"
GPU_UTIL="${GPU_UTIL:-0.6}"

echo "Starting higgs-audio v3 server"
echo "  MODEL=$MODEL"
echo "  PORT=$PORT"
echo "  CUDA_VISIBLE_DEVICES=$GPUS"

CUDA_VISIBLE_DEVICES="$GPUS" \
VLLM_USE_DEEP_GEMM=0 \
VLLM_MOE_USE_DEEP_GEMM=0 \
vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/higgs_multimodal_qwen3.yaml \
    --host 0.0.0.0 \
    --port "$PORT" \
    --gpu-memory-utilization "$GPU_UTIL" \
    --trust-remote-code \
    --omni
indextts2/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for IndexTTS2
#
# Usage from repository root:
#   examples/online_serving/text_to_speech/indextts2/run_server.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8092 MODEL=/path/to/IndexTeam/IndexTTS-2 examples/online_serving/text_to_speech/indextts2/run_server.sh

set -e

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
ROOT_DIR="$(cd -- "$SCRIPT_DIR/../../../.." && pwd)"

MODEL="${MODEL:-IndexTeam/IndexTTS-2}"
PORT="${PORT:-8092}"
DEPLOY_CONFIG="${DEPLOY_CONFIG:-$ROOT_DIR/vllm_omni/deploy/indextts2.yaml}"

echo "Starting IndexTTS2 server with model: $MODEL"

FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --omni \
    --trust-remote-code \
    --deploy-config "$DEPLOY_CONFIG"
indextts2/speech_client.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""OpenAI-compatible client for IndexTTS2 TTS via /v1/audio/speech endpoint.

Examples:
    # With reference audio for voice cloning
    python speech_client.py --text "你好,世界!" \
        --ref-audio /path/to/reference.wav

    # With emotion audio
    python speech_client.py --text "今天心情很好!" \
        --ref-audio /path/to/ref.wav \
        --emo-audio /path/to/happy.wav

Server setup:
    vllm serve IndexTeam/IndexTTS-2 --omni --host 0.0.0.0 --port 8092
"""

from __future__ import annotations

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8092"
DEFAULT_API_KEY = "sk-empty"


def encode_audio_to_base64(audio_path: str) -> str:
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime = {"wav": "audio/wav", "mp3": "audio/mpeg", "flac": "audio/flac"}.get(ext, "audio/wav")
    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def main() -> None:
    parser = argparse.ArgumentParser(description="IndexTTS2 OpenAI speech client")
    parser.add_argument("--text", type=str, required=True)
    parser.add_argument("--ref-audio", type=str, default=None, help="Reference audio for voice cloning")
    parser.add_argument("--emo-audio", type=str, default=None, help="Emotion reference audio")
    parser.add_argument("--emo-text", type=str, default=None, help="Emotion description text")
    parser.add_argument(
        "--emo-vector",
        type=float,
        nargs=8,
        default=None,
        help="8-dim emotion vector: happy angry sad afraid disgusted melancholic surprised calm",
    )
    parser.add_argument("--emo-alpha", type=float, default=None, help="Emotion weight in [0, 1]")
    parser.add_argument("--use-emo-text", action="store_true", help="Infer emotion vector from emo-text or text")
    parser.add_argument("--use-random", action="store_true", help="Use random emotion prototypes")
    parser.add_argument("--model", type=str, default="IndexTeam/IndexTTS-2")
    parser.add_argument("--voice", type=str, default=None, help="Uploaded voice name to use instead of --ref-audio")
    parser.add_argument("--output", type=str, default="output.wav")
    parser.add_argument("--api-base", type=str, default=DEFAULT_API_BASE)
    parser.add_argument("--api-key", type=str, default=DEFAULT_API_KEY)
    parser.add_argument("--response-format", type=str, default="wav")
    args = parser.parse_args()

    if not args.ref_audio and not args.voice:
        parser.error("IndexTTS2 requires --ref-audio or --voice for voice cloning")

    payload: dict = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }
    if args.voice:
        payload["voice"] = args.voice

    if args.ref_audio:
        ref = args.ref_audio
        if ref.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = ref
        else:
            payload["ref_audio"] = encode_audio_to_base64(ref)

    extra_params = {}
    if args.emo_audio:
        emo = args.emo_audio
        if emo.startswith(("http://", "https://", "data:")):
            extra_params["emo_audio"] = emo
        else:
            extra_params["emo_audio"] = encode_audio_to_base64(emo)
    if args.emo_text:
        extra_params["emo_text"] = args.emo_text
    if args.emo_vector is not None:
        extra_params["emo_vector"] = args.emo_vector
    if args.emo_alpha is not None:
        extra_params["emo_alpha"] = args.emo_alpha
    if args.use_emo_text:
        extra_params["use_emo_text"] = True
    if args.use_random:
        extra_params["use_random"] = True
    if extra_params:
        payload["extra_params"] = extra_params

    url = f"{args.api_base}/v1/audio/speech"
    print(f"POST {url}")
    print(f"  text: {args.text}")
    if args.ref_audio:
        print(f"  ref_audio: {args.ref_audio[:80]}...")

    with httpx.Client(timeout=300) as client:
        resp = client.post(
            url,
            json=payload,
            headers={"Authorization": f"Bearer {args.api_key}"},
        )

    if resp.status_code != 200:
        print(f"Error {resp.status_code}: {resp.text[:500]}")
        return

    with open(args.output, "wb") as f:
        f.write(resp.content)
    print(f"Saved: {args.output} ({len(resp.content):,} bytes)")


if __name__ == "__main__":
    main()
ming_flash_omni_tts/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for Ming-flash-omni-2.0 standalone talker (TTS).
#
# Usage:
#   ./run_server.sh
#   MODEL=/path/to/local/model ./run_server.sh
#   PORT=8091 ./run_server.sh
#   HOST=127.0.0.1 ./run_server.sh   # bind only to loopback

set -e

MODEL="${MODEL:-Jonathan1909/Ming-flash-omni-2.0}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-8091}"
DEPLOY_CONFIG="${DEPLOY_CONFIG:-vllm_omni/deploy/ming_flash_omni_tts.yaml}"

echo "Starting Ming standalone TTS server with model: $MODEL"
echo "Deploy config: $DEPLOY_CONFIG"

vllm serve "$MODEL" \
    --deploy-config "$DEPLOY_CONFIG" \
    --host "$HOST" \
    --port "$PORT" \
    --trust-remote-code \
    --omni
ming_flash_omni_tts/speech_client.py
"""Client for Ming standalone TTS via /v1/audio/speech endpoint."""

import argparse
import json
import sys

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
DEFAULT_MODEL = "Jonathan1909/Ming-flash-omni-2.0"


def run_tts(args) -> None:
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }

    instructions = args.instructions
    if args.instruction_json:
        if instructions:
            sys.exit("--instructions and --instruction-json are mutually exclusive")

        try:
            parsed = json.loads(args.instruction_json)
        except json.JSONDecodeError as exc:
            sys.exit(f"--instruction-json must be valid JSON: {exc}")
        if not isinstance(parsed, dict):
            sys.exit("--instruction-json must decode to a JSON object")
        # Re-encode with ensure_ascii=False so UTF-8 Chinese keys/values
        # arrive at the server intact rather than as \\uXXXX escapes.
        instructions = json.dumps(parsed, ensure_ascii=False)
    if instructions:
        payload["instructions"] = instructions

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    with httpx.Client(timeout=300.0) as client:
        response = client.post(api_url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return

    output_path = args.output or "ming_tts_output.wav"
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Audio saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="Ming standalone TTS speech client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default=DEFAULT_MODEL, help="Model name or local path")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    parser.add_argument(
        "--instructions",
        default=None,
        help="Free-form style description (mapped to caption 风格 on the server).",
    )
    parser.add_argument(
        "--instruction-json",
        default=None,
        help=(
            "Structured caption JSON forwarded as `instructions`. Accepts Ming "
            "caption keys: 方言, 风格, 语速, 基频, 音量, 情感, IP, 说话人, BGM. "
        ),
    )
    args = parser.parse_args()
    run_tts(args)


if __name__ == "__main__":
    main()
ming_tts/README.md

Ming-omni-tts Online Serving

Serve the dense inclusionAI/Ming-omni-tts-0.5B two-stage TTS model through the OpenAI-compatible /v1/audio/speech endpoint.

Start Server

vllm-omni serve inclusionAI/Ming-omni-tts-0.5B \
    --deploy-config vllm_omni/deploy/ming_tts.yaml \
    --omni \
    --port 8091 \
    --enforce-eager

Or:

cd examples/online_serving/text_to_speech/ming_tts
./run_server.sh

The tested ROCm environment is summarized in the Ming recipe.

Send Requests

The Python client targets http://localhost:8091/v1 with api_key=EMPTY; it does not call OpenAI's hosted API.

python openai_speech_client.py \
    --text "你好,这是 Ming 在线语音合成测试。" \
    --max-new-tokens 200
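The same request can also be sketched with only the Python standard library. This is an illustration, not the bundled client's code: `build_speech_request` is a hypothetical helper, the model name, port, and `EMPTY` key are the defaults quoted on this page, and sending `max_new_tokens` as a top-level field is an assumption based on the Request Fields list.

```python
import json
import urllib.request

API_BASE = "http://localhost:8091/v1"  # local vLLM-Omni server, not OpenAI's hosted API
API_KEY = "EMPTY"  # placeholder key used by the examples on this page


def build_speech_request(text: str, max_new_tokens: int = 200) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) a POST to /v1/audio/speech."""
    payload = {
        "model": "inclusionAI/Ming-omni-tts-0.5B",
        "input": text,
        "response_format": "wav",
        "max_new_tokens": max_new_tokens,  # assumed to be a top-level field
    }
    # ensure_ascii=False keeps Chinese text as UTF-8 rather than \uXXXX escapes.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/audio/speech",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request("你好,这是 Ming 在线语音合成测试。")
    print(req.get_method(), req.full_url)
    # Sending requires a running server:
    # with urllib.request.urlopen(req, timeout=300) as resp:
    #     with open("ming_output.wav", "wb") as f:
    #         f.write(resp.read())
```

Building the request separately from sending it makes it easy to inspect the URL and body before pointing the script at a live server.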

Style or dialect controls can be plain text or Ming JSON. The upstream dialect example also uses yue_prompt.wav for speaker conditioning:

python openai_speech_client.py \
    --text "我觉得社会企业同个人都有责任" \
    --instruction-json '{"方言":"广粤话"}' \
    --ref-audio /path/to/yue_prompt.wav \
    --max-new-tokens 200

When --ref-audio is supplied without --ref-text, the server extracts the Ming speaker embedding, matching upstream use_spk_emb=True, without using the audio as a zero-shot prompt.
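On the client side, the two conditioning modes differ only in whether `ref_text` accompanies `ref_audio`. The sketch below illustrates that with a hypothetical helper, `build_ref_payload`; it encodes local audio as a `data:` URL in the same way as the other clients on this page and makes no claim about how the server processes the fields internally.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Optional


def to_data_url(audio_path: str) -> str:
    """Encode a local audio file as a base64 data: URL."""
    path = Path(audio_path)
    mime = mimetypes.guess_type(path.name)[0] or "audio/wav"
    b64 = base64.b64encode(path.read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def build_ref_payload(text: str, ref_audio: str, ref_text: Optional[str] = None) -> dict:
    """Hypothetical helper: request body with reference audio.

    Without ref_text the audio serves only as a speaker reference;
    with ref_text it is also used as a zero-shot prompt.
    """
    passthrough = ref_audio.startswith(("http://", "https://", "file://", "data:"))
    payload = {
        "model": "inclusionAI/Ming-omni-tts-0.5B",
        "input": text,
        "ref_audio": ref_audio if passthrough else to_data_url(ref_audio),
        "response_format": "wav",
    }
    if ref_text:
        payload["ref_text"] = ref_text
    return payload
```

Calling `build_ref_payload(text, "/path/to/yue_prompt.wav")` yields a body with no `ref_text` key, while passing `ref_text="..."` adds it and switches the request to zero-shot cloning.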

Reference-audio cloning:

python openai_speech_client.py \
    --text "我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。" \
    --ref-audio /path/to/10002287-00000094.wav \
    --ref-text "在此奉劝大家别乱打美白针。" \
    --max-new-tokens 200

Podcast-style multi-speaker prompt:

python openai_speech_client.py \
    --text " speaker_1:你可以说一下,就大概说一下,可能虽然我也不知道,我看过那部电影没有。
 speaker_2:就是那个叫什么,变相一节课的嘛。
 speaker_1:嗯。
 speaker_2:一部搞笑的电影。
 speaker_1:一部搞笑的。" \
    --ref-audio /path/to/CTS-CN-F2F-2019-11-11-423-012-A.wav \
    --ref-audio /path/to/CTS-CN-F2F-2019-11-11-423-012-B.wav \
    --ref-text " speaker_1:并且我们还要进行每个月还要考核 笔试的话还要进行笔试,做个,当服务员还要去笔试了
 speaker_2:对啊,这真的很奇怪,就是 单纯的因,单纯自己工资不高,只是因为可能人家那个店比较出名一点,就对你苛刻要求"

Streaming PCM:

python openai_speech_client.py \
    --text "你好,这是流式输出测试。" \
    --stream \
    --output ming_output.pcm

run_curl.sh provides a few small smoke checks:

./run_curl.sh basic
REF_AUDIO=/path/to/reference.wav REF_TEXT="在此奉劝大家别乱打美白针。" ./run_curl.sh zero_shot
./run_curl.sh stream

Request Fields

| Field | Ming meaning |
| --- | --- |
| input | target text |
| instructions | plain style text, or JSON object for structured Ming controls |
| voice | Ming IP voice label unless it resolves to an uploaded speaker |
| language | Ming 方言 (dialect) control |
| ref_audio | speaker reference; with ref_text, also supplies the prompt waveform |
| ref_text | transcript enabling zero-shot or podcast prompt-latent conditioning |
| speaker_embedding | 192-d Ming speaker embedding |
| max_new_tokens | Ming max_decode_steps |
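As an illustration of the `instructions` field, the sketch below accepts either plain style text or a dict of structured controls and serializes the dict with `ensure_ascii=False`, which is how the Ming-flash-omni client on this page forwards its caption JSON. Two assumptions are worth flagging: that Ming-omni-tts accepts the structured form as a JSON string in `instructions`, and that a client-side helper like the hypothetical `encode_instructions` is where that conversion belongs.

```python
import json
from typing import Union


def encode_instructions(controls: Union[str, dict]) -> str:
    """Hypothetical helper: return the string to place in `instructions`."""
    if isinstance(controls, str):
        return controls  # plain style text is passed through unchanged
    if not isinstance(controls, dict):
        raise TypeError("instructions must be a string or a JSON object (dict)")
    # ensure_ascii=False keeps Chinese keys/values as UTF-8 text.
    return json.dumps(controls, ensure_ascii=False)


payload = {
    "model": "inclusionAI/Ming-omni-tts-0.5B",
    "input": "我觉得社会企业同个人都有责任",
    "instructions": encode_instructions({"方言": "广粤话"}),
    "response_format": "wav",
}
```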

Notes

  • ref_audio accepts local paths through the client, remote URLs, file://, or data: URLs.
  • Non-streaming responses return WAV bytes; streaming responses return PCM.
  • Music-only BGM generation is offline-only until the API exposes Ming prompt-mode selection.
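
Because a streamed response is raw PCM with no header, most players cannot open a file such as ming_output.pcm directly. The following is a minimal sketch of wrapping the bytes in a WAV container with the standard `wave` module. The sample rate, channel count, and sample width used as defaults (24 kHz, mono, 16-bit) are assumptions, not values stated in this README; confirm the model's actual output format before relying on them. `pcm_to_wav` is a hypothetical helper name.

```python
import wave


def pcm_to_wav(
    pcm_bytes: bytes,
    wav_path: str,
    sample_rate: int = 24000,  # assumption: verify against the model's output
    channels: int = 1,         # assumption: mono
    sample_width: int = 2,     # assumption: 16-bit signed samples
) -> None:
    """Write raw PCM bytes to wav_path with a WAV header."""
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)


# Example usage with a file saved by the streaming command:
# with open("ming_output.pcm", "rb") as f:
#     pcm_to_wav(f.read(), "ming_output_from_stream.wav")
```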
ming_tts/openai_speech_client.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/ming_tts/openai_speech_client.py.

ming_tts/run_curl.sh
#!/bin/bash
set -euo pipefail

MODE="${1:-basic}"
HOST="${HOST:-localhost}"
PORT="${PORT:-8091}"
MODEL="${MODEL:-inclusionAI/Ming-omni-tts-0.5B}"
API_URL="http://${HOST}:${PORT}/v1/audio/speech"
TEXT="${TEXT:-你好,这是 Ming 在线语音合成测试。}"
OUTPUT="${OUTPUT:-ming_output.wav}"
STREAM_OUTPUT="${STREAM_OUTPUT:-ming_output.pcm}"
REF_AUDIO="${REF_AUDIO:-}"
REF_TEXT="${REF_TEXT:-}"

post_json() {
    local payload="$1"
    local output_path="$2"
    curl -X POST "$API_URL" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer EMPTY" \
        -d "$payload" \
        --output "$output_path"
}

case "$MODE" in
    basic)
        post_json "{
            \"model\": \"${MODEL}\",
            \"input\": \"${TEXT}\",
            \"response_format\": \"wav\"
        }" "$OUTPUT"
        ;;
    zero_shot)
        if [ -z "$REF_AUDIO" ] || [ -z "$REF_TEXT" ]; then
            echo "zero_shot requires REF_AUDIO and REF_TEXT" >&2
            exit 1
        fi
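        # MODEL and TEXT are plain shell variables unless the caller exported them,
        # so pass everything the Python snippet reads through its environment.
        MODEL="$MODEL" TEXT="$TEXT" REF_AUDIO="$REF_AUDIO" REF_TEXT="$REF_TEXT" \
        python - <<'PY' > /tmp/ming_zero_shot_payload.json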
import base64
import json
import mimetypes
import os
from pathlib import Path

path = Path(os.environ["REF_AUDIO"])
mime_type = mimetypes.guess_type(path.name)[0] or "audio/wav"
payload = {
    "model": os.environ["MODEL"],
    "input": os.environ["TEXT"],
    "ref_audio": f"data:{mime_type};base64,{base64.b64encode(path.read_bytes()).decode('utf-8')}",
    "ref_text": os.environ["REF_TEXT"],
    "response_format": "wav",
}
print(json.dumps(payload, ensure_ascii=False))
PY
        curl -X POST "$API_URL" \
            -H "Content-Type: application/json" \
            -H "Authorization: Bearer EMPTY" \
            --data-binary @/tmp/ming_zero_shot_payload.json \
            --output "$OUTPUT"
        rm -f /tmp/ming_zero_shot_payload.json
        ;;
    stream)
        post_json "{
            \"model\": \"${MODEL}\",
            \"input\": \"${TEXT}\",
            \"stream\": true,
            \"stream_format\": \"audio\",
            \"response_format\": \"pcm\"
        }" "$STREAM_OUTPUT"
        ;;
    *)
        echo "Unknown mode: $MODE" >&2
        echo "Supported sanity checks: basic, zero_shot, stream" >&2
        exit 1
        ;;
esac
ming_tts/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for Ming-omni-tts.
#
# Usage:
#   ./run_server.sh
#   PORT=8000 ./run_server.sh

set -e

DIR="$(cd "$(dirname "$0")" && pwd)"
ROOT="$(cd "$DIR/../../../.." && pwd)"

MODEL="${MODEL:-inclusionAI/Ming-omni-tts-0.5B}"
PORT="${PORT:-8091}"
DEPLOY_CONFIG="${DEPLOY_CONFIG:-$ROOT/vllm_omni/deploy/ming_tts.yaml}"

echo "Starting Ming-omni-tts server with model: $MODEL"
echo "Deploy config: $DEPLOY_CONFIG"

vllm-omni serve "$MODEL" \
    --deploy-config "$DEPLOY_CONFIG" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --enforce-eager \
    --omni
moss_tts_nano/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/moss_tts_nano/gradio_demo.py.

moss_tts_nano/run_gradio_demo.sh
#!/bin/bash
# Launch MOSS-TTS-Nano server + Gradio demo together.
#
# Usage:
#   ./run_gradio_demo.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8091 GRADIO_PORT=7860 ./run_gradio_demo.sh

set -e

MODEL="${MODEL:-OpenMOSS-Team/MOSS-TTS-Nano}"
PORT="${PORT:-8091}"
GRADIO_PORT="${GRADIO_PORT:-7860}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

echo "Starting MOSS-TTS-Nano server (port $PORT)..."
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --omni &
SERVER_PID=$!

cleanup() {
    echo "Stopping server (PID $SERVER_PID)..."
    # Ignore failures here: the server may already have exited.
    kill "$SERVER_PID" 2>/dev/null || true
    wait "$SERVER_PID" 2>/dev/null || true
}
trap cleanup EXIT

# Wait for server to be ready.
echo "Waiting for server to start..."
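ready=0
for i in $(seq 1 120); do
    if curl -s "http://localhost:$PORT/health" > /dev/null 2>&1; then
        echo "Server ready."
        ready=1
        break
    fi
    sleep 2
done
# Do not launch the demo against a server that never came up.
if [ "$ready" -ne 1 ]; then
    echo "Server did not become ready within 240 seconds." >&2
    exit 1
fi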

echo "Starting Gradio demo (port $GRADIO_PORT)..."
python "$SCRIPT_DIR/gradio_demo.py" \
    --api-base "http://localhost:$PORT" \
    --port "$GRADIO_PORT"
moss_tts_nano/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for MOSS-TTS-Nano
#
# Usage:
#   ./run_server.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8091 ./run_server.sh

set -e

MODEL="${MODEL:-OpenMOSS-Team/MOSS-TTS-Nano}"
PORT="${PORT:-8091}"

echo "Starting MOSS-TTS-Nano server with model: $MODEL"

FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --omni
omnivoice/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for OmniVoice TTS
#
# Usage:
#   ./run_server.sh
#   CUDA_VISIBLE_DEVICES=0 ./run_server.sh

set -e

MODEL="${MODEL:-k2-fsa/OmniVoice}"
PORT="${PORT:-8091}"

echo "Starting OmniVoice server with model: $MODEL"

vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --trust-remote-code \
    --omni
omnivoice/speech_client.py
"""Client for OmniVoice TTS via /v1/audio/speech endpoint.

Examples:
    # Basic TTS (auto voice)
    python speech_client.py --text "Hello, how are you?"

    # Specify language
    python speech_client.py --text "Bonjour, comment allez-vous?" --language French

    # Use a specific uploaded/supported voice
    python speech_client.py --text "Hello" --voice my_uploaded_voice
"""

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime = {
        "wav": "audio/wav",
        "mp3": "audio/mpeg",
        "flac": "audio/flac",
        "ogg": "audio/ogg",
    }.get(ext, "audio/wav")

    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def run_tts(args) -> None:
    """Generate speech via /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }
    if args.seed is not None:
        payload["extra_params"] = {}
        payload["extra_params"]["seed"] = args.seed

    if args.voice:
        payload["voice"] = args.voice
    if args.language:
        payload["language"] = args.language

    if args.ref_audio:
        ref = args.ref_audio
        if ref.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = ref
        else:
            payload["ref_audio"] = encode_audio_to_base64(ref)

    if args.ref_text:
        payload["ref_text"] = args.ref_text

    if args.instructions:
        payload["instructions"] = args.instructions

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    if args.seed is not None:
        print(f"Seed: {args.seed}")

    if args.voice:
        print(f"Voice: {args.voice}")

    if args.language:
        print(f"Language: {args.language}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }
    with httpx.Client(timeout=300.0) as client:
        response = client.post(api_url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return

    try:
        text = response.content.decode("utf-8")
        if text.startswith('{"error"'):
            print(f"Error: {text}")
            return
    except UnicodeDecodeError:
        pass

    output_path = args.output or "omnivoice_output.wav"
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Audio saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="OmniVoice TTS client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default="k2-fsa/OmniVoice", help="Model name")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument(
        "--voice",
        default=None,
        help="Voice name (omit for auto voice; must match a supported or uploaded speaker if set)",
    )
    parser.add_argument("--language", default=None, help="Language hint (e.g., English, Chinese, French)")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio for voice cloning (local path, URL, or data: URI)",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Reference text for voice cloning",
    )
    parser.add_argument(
        "--instructions",
        type=str,
        default=None,
        help="Voice style/emotion instructions",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=None,
        help="Random seed for generation (default: None, for stochastic output)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    args = parser.parse_args()
    run_tts(args)


if __name__ == "__main__":
    main()
qwen3_tts/batch_speech_client.py
"""Batch speech client for Qwen3-TTS via /v1/audio/speech/batch endpoint.

This script demonstrates how to synthesize multiple texts in a single request.
A particularly useful scenario is voice cloning: set ref_audio once at the
batch level and generate many utterances in the cloned voice without repeating
the reference for each item.

Start the server (with batch-optimized stage settings for best throughput):

    vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
        --omni \
        --trust-remote-code \
        --stage-overrides '{"0":{"max_num_seqs":4,"gpu_memory_utilization":0.2},
                            "1":{"max_num_seqs":4,"gpu_memory_utilization":0.2}}'

Examples:
    # Batch with a predefined voice
    python batch_speech_client.py \
        --texts "Hello, how are you?" "Goodbye, see you later!"

    # Voice cloning: one ref_audio, many outputs
    python batch_speech_client.py \
        --task-type Base \
        --ref-audio /path/to/reference.wav \
        --ref-text "Transcript of the reference audio" \
        --texts "First cloned sentence." "Second cloned sentence." \
               "Third cloned sentence."
"""

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    ext = os.path.splitext(audio_path)[1].lower()
    mime_map = {".wav": "audio/wav", ".mp3": "audio/mpeg", ".flac": "audio/flac", ".ogg": "audio/ogg"}
    mime_type = mime_map.get(ext, "audio/wav")

    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"


def run_batch(args) -> None:
    """Send a batch TTS request and save each result to a file."""
    items = [{"input": text} for text in args.texts]

    payload: dict = {
        "items": items,
        "response_format": args.response_format,
    }
    if args.voice:
        payload["voice"] = args.voice
    if args.language:
        payload["language"] = args.language
    if args.task_type:
        payload["task_type"] = args.task_type
    if args.instructions:
        payload["instructions"] = args.instructions
    if args.max_new_tokens:
        payload["max_new_tokens"] = args.max_new_tokens

    # Voice cloning parameters (shared across all items)
    if args.ref_audio:
        if args.ref_audio.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = args.ref_audio
        else:
            payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    if args.ref_text:
        payload["ref_text"] = args.ref_text

    print(f"Sending batch of {len(items)} item(s) to {args.api_base}")
    if args.ref_audio:
        print("Voice cloning mode — ref_audio applied to all items")

    url = f"{args.api_base}/v1/audio/speech/batch"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    with httpx.Client(timeout=300.0) as client:
        response = client.post(url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error {response.status_code}: {response.text}")
        return

    data = response.json()
    print(f"Total: {data['total']}  Succeeded: {data['succeeded']}  Failed: {data['failed']}")

    os.makedirs(args.output_dir, exist_ok=True)
    for result in data["results"]:
        idx = result["index"]
        if result["status"] == "success":
            audio_bytes = base64.b64decode(result["audio_data"])
            out_path = os.path.join(args.output_dir, f"batch_{idx}.{args.response_format}")
            with open(out_path, "wb") as f:
                f.write(audio_bytes)
            print(f"  [{idx}] saved {len(audio_bytes)} bytes -> {out_path}")
        else:
            print(f"  [{idx}] FAILED: {result['error']}")


def parse_args():
    parser = argparse.ArgumentParser(
        description="Batch speech client for /v1/audio/speech/batch",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )

    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")

    # Texts to synthesize
    parser.add_argument(
        "--texts",
        nargs="+",
        required=True,
        help="One or more texts to synthesize",
    )

    # Shared voice settings
    parser.add_argument("--voice", default="vivian", help="Speaker name (default: vivian)")
    parser.add_argument("--language", default=None, help="Language: Auto, Chinese, English, etc.")
    parser.add_argument("--instructions", default=None, help="Voice style/emotion instructions")
    parser.add_argument(
        "--task-type",
        default=None,
        choices=["CustomVoice", "VoiceDesign", "Base"],
        help="TTS task type (default: CustomVoice)",
    )

    # Voice cloning (Base task)
    parser.add_argument("--ref-audio", default=None, help="Reference audio path or URL for voice cloning")
    parser.add_argument("--ref-text", default=None, help="Reference audio transcript for voice cloning")

    # Generation
    parser.add_argument("--max-new-tokens", type=int, default=None, help="Max new tokens per item")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output-dir", "-o", default="batch_output", help="Output directory (default: batch_output)")

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    run_batch(args)
qwen3_tts/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/gradio_demo.py.

qwen3_tts/openai_speech_client.py
"""OpenAI-compatible client for Qwen3-TTS via /v1/audio/speech endpoint.

This script demonstrates how to use the OpenAI-compatible speech API
to generate audio from text using Qwen3-TTS models.

Examples:
    # CustomVoice task (predefined speaker)
    python openai_speech_client.py --text "Hello, how are you?" --speaker vivian

    # CustomVoice with emotion instruction
    python openai_speech_client.py --text "I'm so happy!" --speaker vivian \
        --instructions "Speak with excitement"

    # VoiceDesign task (voice from description)
    python openai_speech_client.py --text "Hello world" \
        --task-type VoiceDesign \
        --instructions "A warm, friendly female voice"

    # Base task (voice cloning)
    python openai_speech_client.py --text "Hello world" \
        --task-type Base \
        --ref-audio "https://example.com/reference.wav" \
        --ref-text "This is the reference transcript"

    # Base task with pre-computed speaker embedding
    python openai_speech_client.py --text "Hello world" \
        --task-type Base \
        --speaker-embedding embedding.json
"""

import argparse
import base64
import json
import os

import httpx

# Default server configuration
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    # Detect MIME type from extension
    audio_path_lower = audio_path.lower()
    if audio_path_lower.endswith(".wav"):
        mime_type = "audio/wav"
    elif audio_path_lower.endswith((".mp3", ".mpeg")):
        mime_type = "audio/mpeg"
    elif audio_path_lower.endswith(".flac"):
        mime_type = "audio/flac"
    elif audio_path_lower.endswith(".ogg"):
        mime_type = "audio/ogg"
    else:
        mime_type = "audio/wav"  # Default

    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"


def run_tts_generation(args) -> None:
    """Run TTS generation via OpenAI-compatible /v1/audio/speech API."""

    # Build request payload
    payload = {
        "model": args.model,
        "input": args.text,
        "voice": args.speaker,
        "response_format": args.response_format,
    }

    # Add optional parameters
    if args.instructions:
        payload["instructions"] = args.instructions
    if args.task_type:
        payload["task_type"] = args.task_type
    if args.language:
        payload["language"] = args.language
    if args.max_new_tokens:
        payload["max_new_tokens"] = args.max_new_tokens

    # Voice clone parameters (Base task)
    if args.ref_audio:
        if args.ref_audio.startswith(("http://", "https://")):
            payload["ref_audio"] = args.ref_audio
        elif args.ref_audio.startswith("data:"):
            payload["ref_audio"] = args.ref_audio
        else:
            payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    if args.ref_text:
        payload["ref_text"] = args.ref_text
    if args.x_vector_only:
        payload["x_vector_only_mode"] = True
    if args.speaker_embedding:
        with open(args.speaker_embedding) as f:
            payload["speaker_embedding"] = json.load(f)

    print(f"Model: {args.model}")
    print(f"Task type: {args.task_type or 'CustomVoice'}")
    print(f"Text: {args.text}")
    print(f"Speaker: {args.speaker}")
    print("Generating audio...")

    # Make the API call
    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    with httpx.Client(timeout=300.0) as client:
        response = client.post(api_url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return

    # Check for JSON error response (only if content is valid UTF-8 text)
    try:
        text = response.content.decode("utf-8")
        if text.startswith('{"error"'):
            print(f"Error: {text}")
            return
    except UnicodeDecodeError:
        pass  # Binary audio data, not an error

    # Save audio response
    output_path = args.output or "tts_output.wav"
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Audio saved to: {output_path}")


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="OpenAI-compatible client for Qwen3-TTS via /v1/audio/speech",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )

    # Server configuration
    parser.add_argument(
        "--api-base",
        type=str,
        default=DEFAULT_API_BASE,
        help=f"API base URL (default: {DEFAULT_API_BASE})",
    )
    parser.add_argument(
        "--api-key",
        type=str,
        default=DEFAULT_API_KEY,
        help="API key (default: EMPTY)",
    )
    parser.add_argument(
        "--model",
        "-m",
        type=str,
        default="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
        help="Model name/path",
    )

    # Task configuration
    parser.add_argument(
        "--task-type",
        "-t",
        type=str,
        default=None,
        choices=["CustomVoice", "VoiceDesign", "Base"],
        help="TTS task type (default: CustomVoice)",
    )

    # Input text
    parser.add_argument(
        "--text",
        type=str,
        required=True,
        help="Text to synthesize",
    )

    # Voice/speaker
    parser.add_argument(
        "--speaker",
        type=str,
        default="vivian",
        help="Speaker name (default: vivian). Options: vivian, ryan, aiden, etc.",
    )
    parser.add_argument(
        "--language",
        type=str,
        default=None,
        help="Language: Auto, Chinese, English, etc.",
    )
    parser.add_argument(
        "--instructions",
        type=str,
        default=None,
        help="Voice style/emotion instructions",
    )

    # Base (voice clone) parameters
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio file path, URL, or base64 data: URI for voice cloning (Base task)",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Reference audio transcript for voice cloning (Base task)",
    )
    parser.add_argument(
        "--x-vector-only",
        action="store_true",
        help="Use x-vector only mode for voice cloning (no ICL)",
    )
    parser.add_argument(
        "--speaker-embedding",
        type=str,
        default=None,
        help="Path to JSON file containing a pre-computed speaker embedding vector (1024-dim for 0.6B, 2048-dim for 1.7B)",
    )

    # Generation parameters
    parser.add_argument(
        "--max-new-tokens",
        type=int,
        default=None,
        help="Maximum new tokens to generate",
    )

    # Output
    parser.add_argument(
        "--response-format",
        type=str,
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio output format (default: wav)",
    )
    parser.add_argument(
        "--output",
        "-o",
        type=str,
        default=None,
        help="Output audio file path (default: tts_output.wav)",
    )

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    run_tts_generation(args)
qwen3_tts/precompute_custom_voice.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/precompute_custom_voice.py.

qwen3_tts/run_gradio_demo.sh
#!/bin/bash
# Launch both vLLM server and Gradio demo for Qwen3-TTS
#
# Usage:
#   ./run_gradio_demo.sh                                    # Default: CustomVoice
#   ./run_gradio_demo.sh --task-type VoiceDesign            # VoiceDesign model
#   ./run_gradio_demo.sh --task-type Base --gradio-port 7861
#
# Options:
#   --task-type TYPE        Task type: CustomVoice, VoiceDesign, Base (default: CustomVoice)
#   --server-port PORT      Port for vLLM server (default: 8000)
#   --gradio-port PORT      Port for Gradio demo (default: 7860)
#   --server-host HOST      Host for vLLM server (default: 0.0.0.0)
#   --gradio-ip IP          IP for Gradio demo (default: 127.0.0.1)
#   --share                 Share Gradio demo publicly

set -e

# Default values
TASK_TYPE="CustomVoice"
SERVER_PORT=8000
GRADIO_PORT=7860
SERVER_HOST="0.0.0.0"
GRADIO_IP="127.0.0.1"
GRADIO_SHARE=false

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --task-type)
            TASK_TYPE="$2"
            shift 2
            ;;
        --server-port)
            SERVER_PORT="$2"
            shift 2
            ;;
        --gradio-port)
            GRADIO_PORT="$2"
            shift 2
            ;;
        --server-host)
            SERVER_HOST="$2"
            shift 2
            ;;
        --gradio-ip)
            GRADIO_IP="$2"
            shift 2
            ;;
        --share)
            GRADIO_SHARE=true
            shift
            ;;
        --help)
            echo "Usage: $0 [OPTIONS]"
            echo ""
            echo "Options:"
            echo "  --task-type TYPE        Task type: CustomVoice, VoiceDesign, Base (default: CustomVoice)"
            echo "  --server-port PORT      Port for vLLM server (default: 8000)"
            echo "  --gradio-port PORT      Port for Gradio demo (default: 7860)"
            echo "  --server-host HOST      Host for vLLM server (default: 0.0.0.0)"
            echo "  --gradio-ip IP          IP for Gradio demo (default: 127.0.0.1)"
            echo "  --share                 Share Gradio demo publicly"
            echo ""
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            echo "Use --help for usage information"
            exit 1
            ;;
    esac
done

# Map task type to model
case "$TASK_TYPE" in
    CustomVoice)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
        ;;
    VoiceDesign)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
        ;;
    Base)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-Base"
        ;;
    *)
        echo "Unknown task type: $TASK_TYPE"
        echo "Supported: CustomVoice, VoiceDesign, Base"
        exit 1
        ;;
esac

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
API_BASE="http://localhost:${SERVER_PORT}"

echo "=========================================="
echo "Qwen3-TTS Gradio Demo"
echo "=========================================="
echo "Task Type : $TASK_TYPE"
echo "Model     : $MODEL"
echo "Server    : http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio    : http://${GRADIO_IP}:${GRADIO_PORT}"
echo "=========================================="

# Cleanup on exit
cleanup() {
    echo ""
    echo "Shutting down..."
    if [ -n "$SERVER_PID" ]; then
        echo "Stopping vLLM server (PID: $SERVER_PID)..."
        kill "$SERVER_PID" 2>/dev/null || true
        wait "$SERVER_PID" 2>/dev/null || true
    fi
    if [ -n "$GRADIO_PID" ]; then
        echo "Stopping Gradio demo (PID: $GRADIO_PID)..."
        kill "$GRADIO_PID" 2>/dev/null || true
        wait "$GRADIO_PID" 2>/dev/null || true
    fi
    echo "Cleanup complete"
    exit 0
}
trap cleanup SIGINT SIGTERM

# Start vLLM server
echo ""
echo "Starting vLLM server..."
LOG_FILE="/tmp/vllm_tts_server_${SERVER_PORT}.log"

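# Send output to tee through process substitution so that $! is the server's
# PID. With "cmd | tee file &", $! would be the PID of tee, and cleanup would
# stop tee instead of the server.
vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/qwen3_tts.yaml \
    --host "$SERVER_HOST" \
    --port "$SERVER_PORT" \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --omni > >(tee "$LOG_FILE") 2>&1 &
SERVER_PID=$!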

# Wait for server startup
echo ""
echo "Waiting for vLLM server to be ready..."
STARTUP_FLAG="/tmp/vllm_tts_startup_flag_${SERVER_PORT}.tmp"
rm -f "$STARTUP_FLAG"

(
    tail -f "$LOG_FILE" 2>/dev/null | grep -m 1 "Application startup complete" > /dev/null && touch "$STARTUP_FLAG"
) &
TAIL_PID=$!

MAX_WAIT=300
ELAPSED=0
while [ $ELAPSED -lt $MAX_WAIT ]; do
    if [ -f "$STARTUP_FLAG" ]; then
        kill "$TAIL_PID" 2>/dev/null || true
        wait "$TAIL_PID" 2>/dev/null || true
        echo ""
        echo "vLLM server is ready!"
        break
    fi
    if ! kill -0 "$SERVER_PID" 2>/dev/null; then
        kill "$TAIL_PID" 2>/dev/null || true
        echo ""
        echo "Error: vLLM server failed to start"
        exit 1
    fi
    sleep 1
    ELAPSED=$((ELAPSED + 1))
done

rm -f "$STARTUP_FLAG"

if [ $ELAPSED -ge $MAX_WAIT ]; then
    kill "$TAIL_PID" 2>/dev/null || true
    echo "Error: Server startup timed out after ${MAX_WAIT}s"
    kill "$SERVER_PID" 2>/dev/null || true
    exit 1
fi

# Start Gradio demo
echo ""
echo "Starting Gradio demo..."
cd "$SCRIPT_DIR"
GRADIO_CMD=("python" "gradio_demo.py" "--api-base" "$API_BASE" "--host" "$GRADIO_IP" "--port" "$GRADIO_PORT")
if [ "$GRADIO_SHARE" = true ]; then
    GRADIO_CMD+=("--share")
fi

"${GRADIO_CMD[@]}" &
GRADIO_PID=$!

echo ""
echo "=========================================="
echo "Both services are running!"
echo "=========================================="
echo "vLLM Server : http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio Demo : http://${GRADIO_IP}:${GRADIO_PORT}"
echo ""
echo "Press Ctrl+C to stop both services"
echo "=========================================="
echo ""

wait $SERVER_PID $GRADIO_PID || true
cleanup
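
The script above detects readiness by grepping the server log for "Application startup complete". As a hedged alternative, the sketch below polls an HTTP route until it answers. The helper names `http_probe` and `wait_until_ready` are hypothetical, and using the `/v1/audio/voices` route (the one `tts_common.py` queries) as a readiness signal is an assumption, not something the script does.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError/HTTPError are OSError subclasses: connection refused, 4xx, 5xx, ...
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    max_wait: float = 300.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` every `interval` seconds until it succeeds or `max_wait` elapses."""
    elapsed = 0.0
    while elapsed < max_wait:
        if probe():
            return True
        sleep(interval)
        elapsed += interval
    return False
```

A caller could run `wait_until_ready(lambda: http_probe("http://localhost:8000/v1/audio/voices"))`, where port 8000 is the script's default `--server-port`. Passing `sleep` as a parameter keeps the loop testable without real delays.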
qwen3_tts/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for Qwen3-TTS models
#
# Usage:
#   ./run_server.sh                           # Default: CustomVoice model
#   ./run_server.sh CustomVoice               # CustomVoice model
#   ./run_server.sh VoiceDesign               # VoiceDesign model
#   ./run_server.sh Base                      # Base (voice clone) model

set -e

TASK_TYPE="${1:-CustomVoice}"

case "$TASK_TYPE" in
    CustomVoice)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
        ;;
    VoiceDesign)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
        ;;
    Base)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-Base"
        ;;
    *)
        echo "Unknown task type: $TASK_TYPE"
        echo "Supported: CustomVoice, VoiceDesign, Base"
        exit 1
        ;;
esac

echo "Starting Qwen3-TTS server with model: $MODEL"

vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/qwen3_tts.yaml \
    --host 0.0.0.0 \
    --port 8091 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --omni
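
With a server started by `run_server.sh` on port 8091, a non-streaming request could look like the sketch below. It uses only the standard library. The field names (`input`, `voice`, `task_type`, `response_format`) mirror `build_payload` in `tts_common.py`; the helper names `build_speech_request` and `synthesize` are hypothetical, and `"Vivian"` is assumed to be an available voice because it is the fallback in `fetch_voices`.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    voice: str = "Vivian",
    api_base: str = "http://localhost:8091",
    response_format: str = "wav",
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/audio/speech request for a CustomVoice model."""
    payload = {
        "input": text,
        "voice": voice,
        "task_type": "CustomVoice",
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{api_base}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer EMPTY"},
        method="POST",
    )


def synthesize(text: str, output_path: str = "tts_output.wav") -> int:
    """Send the request and write the returned audio bytes; returns the byte count."""
    with urllib.request.urlopen(build_speech_request(text), timeout=300) as resp:
        audio = resp.read()
    with open(output_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

Separating request construction from sending makes the payload easy to inspect before any network call is made.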
qwen3_tts/speaker_embedding_interpolation.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/speaker_embedding_interpolation.py.

qwen3_tts/streaming_speech_client.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/streaming_speech_client.py.

qwen3_tts/tts_common.py
"""Shared constants, helpers, and payload building for Qwen3-TTS Gradio demos."""

import base64
import io

try:
    import gradio as gr
except ImportError:
    raise ImportError("gradio is required to run this demo. Install it with: pip install 'vllm-omni[demo]'") from None
import httpx
import numpy as np
import soundfile as sf

SUPPORTED_LANGUAGES = [
    "Auto",
    "Chinese",
    "English",
    "Japanese",
    "Korean",
    "German",
    "French",
    "Russian",
    "Portuguese",
    "Spanish",
    "Italian",
]

TASK_TYPES = ["CustomVoice", "VoiceDesign", "Base"]

PCM_SAMPLE_RATE = 24000

DEFAULT_API_BASE = "http://localhost:8000"


def fetch_voices(api_base: str) -> list[str]:
    """Fetch available voices from the server."""
    try:
        with httpx.Client(timeout=10.0) as client:
            resp = client.get(
                f"{api_base}/v1/audio/voices",
                headers={"Authorization": "Bearer EMPTY"},
            )
        if resp.status_code == 200:
            data = resp.json()
            voices = data.get("voices") or []
            if voices:
                return voices
    except Exception:
        pass
    return ["Vivian", "Ryan"]


def encode_audio_to_base64(audio_data: tuple) -> str:
    """Encode Gradio audio input (sample_rate, numpy_array) to base64 data URL."""
    sample_rate, audio_np = audio_data

    if audio_np.dtype != np.int16:
        if audio_np.dtype in (np.float32, np.float64):
            audio_np = np.clip(audio_np, -1.0, 1.0)
            audio_np = (audio_np * 32767).astype(np.int16)
        else:
            audio_np = audio_np.astype(np.int16)

    buf = io.BytesIO()
    sf.write(buf, audio_np, sample_rate, format="WAV")
    wav_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:audio/wav;base64,{wav_b64}"


def build_payload(
    text: str,
    task_type: str,
    voice: str,
    language: str,
    instructions: str,
    ref_audio: tuple | None,
    ref_audio_url: str,
    ref_text: str,
    x_vector_only: bool,
    response_format: str = "pcm",
    speed: float = 1.0,
    stream: bool = True,
) -> dict:
    """Build the /v1/audio/speech request payload.

    Raises gr.Error for invalid input so callers don't need to validate.
    """
    if not text or not text.strip():
        raise gr.Error("Please enter text to synthesize.")

    payload: dict = {
        "input": text.strip(),
        "response_format": "pcm" if stream else response_format,
        "stream": stream,
    }
    if stream:
        payload["stream_format"] = "audio"
    if not stream:
        payload["speed"] = speed

    if task_type:
        payload["task_type"] = task_type
    if language:
        payload["language"] = language

    if task_type == "CustomVoice":
        if voice:
            payload["voice"] = voice
        if instructions and instructions.strip():
            payload["instructions"] = instructions.strip()

    elif task_type == "VoiceDesign":
        if not instructions or not instructions.strip():
            raise gr.Error("VoiceDesign task requires voice style instructions.")
        payload["instructions"] = instructions.strip()

    elif task_type == "Base":
        ref_audio_url_stripped = ref_audio_url.strip() if ref_audio_url else ""
        if ref_audio_url_stripped:
            payload["ref_audio"] = ref_audio_url_stripped
        elif ref_audio is not None:
            payload["ref_audio"] = encode_audio_to_base64(ref_audio)
        else:
            raise gr.Error("Base (voice clone) task requires reference audio. Upload a file or provide a URL.")
        if ref_text and ref_text.strip():
            payload["ref_text"] = ref_text.strip()
        if x_vector_only:
            payload["x_vector_only_mode"] = True

    return payload


def on_task_type_change(task_type: str):
    """Update UI visibility based on selected task type."""
    if task_type == "CustomVoice":
        return (
            gr.update(visible=True),  # voice dropdown
            gr.update(visible=True, info="Optional style/emotion instructions"),
            gr.update(visible=False),  # ref_audio
            gr.update(visible=False),  # ref_audio_url
            gr.update(visible=False),  # ref_text
            gr.update(visible=False),  # x_vector_only
        )
    elif task_type == "VoiceDesign":
        return (
            gr.update(visible=False),
            gr.update(visible=True, info="Required: describe the voice style"),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
        )
    elif task_type == "Base":
        return (
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=True),
            gr.update(visible=True),
            gr.update(visible=True),
            gr.update(visible=True),
        )
    return (
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=False),
        gr.update(visible=False),
        gr.update(visible=False),
        gr.update(visible=False),
    )


def stream_pcm_chunks(api_base: str, payload: dict):
    """Stream raw PCM bytes from the server, yielding int16 numpy arrays.

    Handles odd-byte boundaries between network chunks.
    """
    leftover = b""
    with httpx.Client(timeout=300.0) as client:
        with client.stream(
            "POST",
            f"{api_base}/v1/audio/speech",
            json=payload,
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer EMPTY",
            },
        ) as resp:
            if resp.status_code != 200:
                resp.read()
                raise gr.Error(f"Server error ({resp.status_code}): {resp.text}")
            for chunk in resp.iter_bytes():
                if not chunk:
                    continue
                raw = leftover + chunk
                usable = len(raw) - (len(raw) % 2)
                leftover = raw[usable:]
                if usable == 0:
                    continue
                yield np.frombuffer(raw[:usable], dtype=np.int16).copy()


def add_common_args(parser):
    """Add CLI arguments shared by both demos."""
    parser.add_argument(
        "--api-base",
        default=DEFAULT_API_BASE,
        help=f"Base URL for the vLLM API server (default: {DEFAULT_API_BASE}).",
    )
    parser.add_argument(
        "--host",
        default="0.0.0.0",
        help="Host/IP for Gradio server (default: 0.0.0.0).",
    )
    parser.add_argument(
        "--port",
        type=int,
        default=7860,
        help="Port for Gradio server (default: 7860).",
    )
    parser.add_argument(
        "--share",
        action="store_true",
        help="Share the Gradio demo publicly.",
    )
    return parser
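
The odd-byte handling in `stream_pcm_chunks` can be checked in isolation. The sketch below reproduces the same leftover-byte logic without a server or NumPy, using `struct` to decode the samples. The function name `realign_pcm16` is hypothetical, and little-endian 16-bit samples are an assumption about the wire format.

```python
import struct
from typing import Iterable, Iterator


def realign_pcm16(chunks: Iterable[bytes]) -> Iterator[tuple[int, ...]]:
    """Yield tuples of little-endian int16 samples from arbitrarily split byte chunks."""
    leftover = b""
    for chunk in chunks:
        if not chunk:
            continue
        raw = leftover + chunk
        usable = len(raw) - (len(raw) % 2)  # largest even-length prefix
        leftover = raw[usable:]  # zero or one trailing byte, carried to the next chunk
        if usable == 0:
            continue
        yield struct.unpack(f"<{usable // 2}h", raw[:usable])


samples = [0, 1, -1, 32767, -32768, 1234]
stream = struct.pack("<6h", *samples)
# Split at odd offsets on purpose: 3 + 1 + 0 + 5 + 3 bytes.
pieces = [stream[:3], stream[3:4], b"", stream[4:9], stream[9:]]
decoded = [s for block in realign_pcm16(pieces) for s in block]
print(decoded == samples)  # prints True
```

Without the `leftover` carry, the first 3-byte chunk would either fail to decode or shift every later sample by one byte.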
qwen3_tts/word_timestamps_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/word_timestamps_demo.py.

soulxsinger/openai_chat_client.py
#!/usr/bin/env python3
"""SoulX-Singer OpenAI-compatible chat client (SVS / SVC).

Sends prompt audio via ``input_audio`` and target accompaniment via
``extra_args['target_audio']`` (server-local path). For integrated preprocess,
also pass ``preprocess_weights_dir`` in ``extra_args``.

Usage:
  python openai_chat_client.py \\
      --prompt-audio /path/on/server/zh_prompt.mp3 \\
      --target-audio /path/on/server/music.mp3 \\
      --preprocess-weights-dir /path/on/server/SoulX-Singer-Preprocess \\
      -o output.wav
"""

from __future__ import annotations

import argparse
import base64
import io
import sys
from pathlib import Path

import requests
import soundfile
import torch


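def _audio_to_data_url(path: Path) -> str:
    """Encode an audio file as a base64 data URL.

    The MIME type is always audio/mpeg, so the input is assumed to be MP3,
    matching the "format": "mp3" field sent with input_audio in main().
    """
    with path.open("rb") as handle:
        data = base64.b64encode(handle.read()).decode("ascii")
    return f"data:audio/mpeg;base64,{data}"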


def _save_wav(audio: torch.Tensor, path: Path, sample_rate: int) -> None:
    audio = audio.to(torch.float32)
    peak = audio.abs().max().clamp(min=1e-8)
    audio = audio / peak
    path.parent.mkdir(parents=True, exist_ok=True)
    soundfile.write(str(path), audio.clamp(-1.0, 1.0).cpu().T.numpy(), sample_rate, subtype="PCM_16")


def _decode_audio_from_response(body: dict) -> tuple[torch.Tensor, int]:
    for choice in body.get("choices", []):
        audio_obj = choice.get("message", {}).get("audio")
        if isinstance(audio_obj, dict) and audio_obj.get("data"):
            data, sr = soundfile.read(
                io.BytesIO(base64.b64decode(audio_obj["data"])),
                dtype="float32",
                always_2d=True,
            )
            return torch.from_numpy(data).transpose(0, 1), sr
    brief = {k: v for k, v in body.items() if k != "choices"}
    raise RuntimeError(f"no audio in response message.audio: {brief}")


def main() -> int:
    repo_root = Path(__file__).resolve().parents[4]
    default_assets = repo_root / "tests" / "assets" / "soulxsinger"

    parser = argparse.ArgumentParser(description="SoulX-Singer online chat client")
    parser.add_argument("--port", type=int, default=8192)
    parser.add_argument("--model", default="Soul-AILab/SoulX-Singer")
    parser.add_argument(
        "--prompt-audio",
        default=str(default_assets / "zh_prompt.mp3"),
        help="Prompt vocal audio (path on server if using extra_args, or local for input_audio)",
    )
    parser.add_argument(
        "--target-audio",
        default=str(default_assets / "music.mp3"),
        help="Target accompaniment path on the server (extra_args['target_audio'])",
    )
    parser.add_argument(
        "--prompt-metadata-path",
        default=None,
        help="SVS precomputed prompt metadata.json",
    )
    parser.add_argument(
        "--target-metadata-path",
        default=None,
        help="SVS precomputed target metadata.json",
    )
    parser.add_argument(
        "--audio-path",
        default=None,
        help="SVS prompt vocal wav for precomputed metadata",
    )
    parser.add_argument("--preprocess-weights-dir", default=None)
    parser.add_argument("--output", "-o", default="soulxsinger_out.wav")
    parser.add_argument("--svc", action="store_true", help="Use SVC mode knobs")
    parser.add_argument("--language", default="Mandarin")
    parser.add_argument("--num-inference-steps", type=int, default=32)
    parser.add_argument("--guidance-scale", type=float, default=3.0)
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="CFM seed (default: 42). Runs are deterministic unless a different seed is passed.",
    )
    parser.add_argument(
        "--auto-shift",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Auto pitch shift (default: on, matching the original upstream infer.sh)",
    )
    parser.add_argument(
        "--control",
        default="melody",
        choices=["melody", "score"],
        help="SVS control mode",
    )
    parser.add_argument("--vocal-sep", action="store_true")
    args = parser.parse_args()

    meta_paths = (args.prompt_metadata_path, args.target_metadata_path, args.audio_path)
    if any(meta_paths) and not all(meta_paths):
        print(
            "ERROR: precomputed metadata requires --prompt-metadata-path, "
            "--target-metadata-path, and --audio-path together.",
            file=sys.stderr,
        )
        return 2

    extra_args: dict = {
        "vocal_sep": args.vocal_sep,
        "auto_shift": args.auto_shift,
        "pitch_shift": 0,
    }
    if all(meta_paths):
        extra_args.update(
            {
                "prompt_metadata_path": str(Path(args.prompt_metadata_path).expanduser().resolve()),
                "target_metadata_path": str(Path(args.target_metadata_path).expanduser().resolve()),
                "audio_path": str(Path(args.audio_path).expanduser().resolve()),
            }
        )
        content = [{"type": "text", "text": "soulx-singer"}]
    else:
        prompt_path = Path(args.prompt_audio).expanduser().resolve()
        if not prompt_path.is_file():
            print(f"ERROR: prompt audio not found: {prompt_path}", file=sys.stderr)
            return 2
        extra_args["prompt_audio"] = str(prompt_path)
        extra_args["target_audio"] = str(Path(args.target_audio).expanduser().resolve())
        if args.preprocess_weights_dir:
            extra_args["preprocess_weights_dir"] = str(Path(args.preprocess_weights_dir).expanduser().resolve())
        content = [
            {"type": "text", "text": "soulx-singer"},
            {
                "type": "input_audio",
                "input_audio": {"data": _audio_to_data_url(prompt_path), "format": "mp3"},
            },
        ]
    if not args.svc:
        extra_args["language"] = args.language
        extra_args["control"] = args.control

    payload = {
        "model": args.model,
        "modalities": ["audio"],
        "messages": [{"role": "user", "content": content}],
        "num_inference_steps": args.num_inference_steps,
        "guidance_scale": args.guidance_scale,
        "extra_args": extra_args,
    }
    if args.seed is not None:
        payload["seed"] = args.seed

    print(f"POST http://localhost:{args.port}/v1/chat/completions")
    response = requests.post(
        f"http://localhost:{args.port}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        json=payload,
        timeout=1800,
    )
    response.raise_for_status()
    audio, sample_rate = _decode_audio_from_response(response.json())
    _save_wav(audio, Path(args.output), sample_rate)
    duration = audio.shape[-1] / sample_rate
    print(f"saved {args.output}  sr={sample_rate}Hz  duration={duration:.2f}s")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
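
The response shape that `_decode_audio_from_response` expects can be illustrated without a running server. The sketch below builds a fake chat-completion body and extracts the audio with the standard library only. The helper names are hypothetical, and the assumption that `message.audio.data` holds a base64-encoded WAV file is made for the demo; the client above relies on `soundfile` and may accept other containers.

```python
import base64
import io
import wave


def extract_audio_bytes(body: dict) -> bytes:
    """Return decoded bytes from the first choice that carries message.audio.data."""
    for choice in body.get("choices", []):
        audio_obj = choice.get("message", {}).get("audio")
        if isinstance(audio_obj, dict) and audio_obj.get("data"):
            return base64.b64decode(audio_obj["data"])
    raise RuntimeError("no audio in response message.audio")


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of a WAV payload, computed as frame count / sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


# Fake response: 0.5 s of mono 16-bit silence at 24 kHz (illustrative values).
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(b"\x00\x00" * 12000)
encoded = base64.b64encode(buf.getvalue()).decode("ascii")
fake_body = {"choices": [{"message": {"audio": {"data": encoded}}}]}

print(wav_duration_seconds(extract_audio_bytes(fake_body)))  # prints 0.5
```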
soulxsinger/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for SoulX-Singer (single-stage DiT, preprocess inline).
#
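# Usage:
#   MODEL=/path/to/SoulX-Singer MODE=svs PORT=8192 GPUS=0 ./run_server.sh
#
# Environment variables read below: MODEL, MODE (svs or svc), PORT, GPUS,
# DEPLOY_CONFIG. Preprocess weights are supplied per request by the client
# (preprocess_weights_dir in extra_args), not by this script.
#
# Audio paths in client extra_args must be readable on the server host.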

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"

MODEL="${MODEL:-Soul-AILab/SoulX-Singer}"
MODE="${MODE:-svs}"
PORT="${PORT:-8192}"
GPUS="${GPUS:-0}"

if [[ "$MODE" == "svc" ]]; then
  DEPLOY_CONFIG="${DEPLOY_CONFIG:-$REPO_ROOT/vllm_omni/deploy/soulxsinger_svc.yaml}"
else
  DEPLOY_CONFIG="${DEPLOY_CONFIG:-$REPO_ROOT/vllm_omni/deploy/soulxsinger_svs.yaml}"
fi

echo "Starting SoulX-Singer server"
echo "  MODEL=$MODEL"
echo "  MODE=$MODE"
echo "  PORT=$PORT"
echo "  DEPLOY_CONFIG=$DEPLOY_CONFIG"
echo "  CUDA_VISIBLE_DEVICES=$GPUS"

CUDA_VISIBLE_DEVICES="$GPUS" \
vllm serve "$MODEL" \
    --omni \
    --deploy-config "$DEPLOY_CONFIG" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --trust-remote-code \
    --enforce-eager
voxcpm2/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/voxcpm2/gradio_demo.py.

voxcpm2/openai_speech_client.py
"""OpenAI-compatible client for VoxCPM2 TTS via /v1/audio/speech endpoint.

Examples:
    # Zero-shot synthesis
    python openai_speech_client.py --text "Hello, this is VoxCPM2."

    # Voice cloning with a local reference audio file
    python openai_speech_client.py --text "Hello world" \\
        --ref-audio /path/to/reference.wav

    # Voice cloning with a URL
    python openai_speech_client.py --text "Hello world" \\
        --ref-audio "https://example.com/reference.wav"

Server setup:
    vllm serve openbmb/VoxCPM2 --omni --host 0.0.0.0 --port 8000
"""

from __future__ import annotations

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8000"
DEFAULT_API_KEY = "sk-empty"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime = {
        "wav": "audio/wav",
        "mp3": "audio/mpeg",
        "flac": "audio/flac",
        "ogg": "audio/ogg",
    }.get(ext, "audio/wav")

    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def main() -> None:
    parser = argparse.ArgumentParser(description="VoxCPM2 OpenAI speech client")
    parser.add_argument("--text", type=str, required=True, help="Text to synthesize")
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio for voice cloning (local path, URL, or data: URI)",
    )
    parser.add_argument("--model", type=str, default="voxcpm2")
    parser.add_argument("--output", type=str, default="output.wav")
    parser.add_argument("--api-base", type=str, default=DEFAULT_API_BASE)
    parser.add_argument("--api-key", type=str, default=DEFAULT_API_KEY)
    parser.add_argument("--response-format", type=str, default="wav")
    args = parser.parse_args()

    # VoxCPM2 has no predefined voices. The "voice" field is required by
    # the OpenAI API schema but ignored by VoxCPM2 — use any placeholder.
    # For voice cloning, pass --ref-audio instead.
    payload: dict = {
        "model": args.model,
        "input": args.text,
        "voice": "default",
        "response_format": args.response_format,
    }

    if args.ref_audio:
        ref = args.ref_audio
        if ref.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = ref
        else:
            payload["ref_audio"] = encode_audio_to_base64(ref)

    url = f"{args.api_base}/v1/audio/speech"
    print(f"POST {url}")
    print(f"  text: {args.text}")
    if args.ref_audio:
        print(f"  ref_audio: {args.ref_audio[:80]}...")

    with httpx.Client(timeout=300) as client:
        resp = client.post(
            url,
            json=payload,
            headers={"Authorization": f"Bearer {args.api_key}"},
        )

    if resp.status_code != 200:
        print(f"Error {resp.status_code}: {resp.text[:500]}")
        # Exit non-zero so shell scripts and CI can detect the failure.
        raise SystemExit(1)

    with open(args.output, "wb") as f:
        f.write(resp.content)
    print(f"Saved: {args.output} ({len(resp.content):,} bytes)")


if __name__ == "__main__":
    main()
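
The `ref_audio` value sent for a local file is a base64 data URL. The sketch below shows that format and a small parser that inverts it, which can help when debugging what a client actually sent. The function names `to_data_url` and `parse_data_url` are hypothetical; the extension-to-MIME table is copied from `encode_audio_to_base64` above.

```python
import base64

MIME_BY_EXT = {
    "wav": "audio/wav",
    "mp3": "audio/mpeg",
    "flac": "audio/flac",
    "ogg": "audio/ogg",
}


def to_data_url(data: bytes, ext: str) -> str:
    """Encode raw audio bytes as a base64 data URL; unknown extensions fall back to audio/wav."""
    mime = MIME_BY_EXT.get(ext.lower(), "audio/wav")
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def parse_data_url(url: str) -> tuple[str, bytes]:
    """Split a base64 data URL into (MIME type, decoded bytes)."""
    header, sep, b64 = url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    mime = header[len("data:"):-len(";base64")]
    return mime, base64.b64decode(b64)


payload_bytes = b"RIFF\x00\x00\x00\x00WAVE"  # placeholder bytes, not a valid audio file
url = to_data_url(payload_bytes, "wav")
mime, decoded = parse_data_url(url)
print(mime, decoded == payload_bytes)  # prints: audio/wav True
```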
voxcpm2/precompute_custom_voice.py
"""Pre-compute VoxCPM2 custom voice profiles.

The generated directory can be passed to the server via
``custom_voice_dir`` in ``vllm_omni/deploy/voxcpm2.yaml``. Requests can then
use ``/v1/audio/speech`` with ``voice="<name>"`` and no per-request ref_audio.
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import Any

import torch
from safetensors.torch import save_file

REPO_ROOT = Path(__file__).resolve().parents[4]
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from vllm_omni.utils.custom_voice_io import safe_voice_stem  # noqa: E402

MANIFEST_NAME = "custom_voice_manifest.json"


def _load_tts(model: str, device: torch.device):
    from vllm_omni.model_executor.models.voxcpm2.voxcpm2_import_utils import import_voxcpm2_core

    VoxCPM = import_voxcpm2_core()
    native = VoxCPM.from_pretrained(model, load_denoiser=False, optimize=False)
    return native.tts_model.to(device).eval()


def _load_manifest(output_dir: Path, model: str) -> dict[str, Any]:
    path = output_dir / MANIFEST_NAME
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {
        "schema_version": 1,
        "model_type": "voxcpm2",
        "model": model,
        "voices": {},
    }


def _write_voice(
    *,
    model: str,
    output_dir: Path,
    voice_name: str,
    ref_audio: str,
    prompt_text: str | None,
    mode: str,
    speaker_description: str | None,
    device: torch.device,
) -> None:
    if mode in ("continuation", "ref_continuation") and not prompt_text:
        raise ValueError("--prompt-text is required for continuation/ref_continuation modes")

    tts = _load_tts(model, device)
    tensors: dict[str, torch.Tensor] = {}
    with torch.inference_mode():
        if mode in ("reference", "ref_continuation"):
            tensors["ref_audio_feat"] = tts._encode_wav(ref_audio, padding_mode="right").float().cpu().contiguous()
        if mode in ("continuation", "ref_continuation"):
            tensors["audio_feat"] = tts._encode_wav(ref_audio, padding_mode="left").float().cpu().contiguous()

    output_dir.mkdir(parents=True, exist_ok=True)
    filename = f"{safe_voice_stem(voice_name)}.safetensors"
    save_file(tensors, str(output_dir / filename))

    manifest = _load_manifest(output_dir, model)
    entry: dict[str, Any] = {
        "name": voice_name,
        "file": filename,
        "mode": mode,
    }
    if "ref_audio_feat" in tensors:
        entry["ref_audio_feat_len"] = int(tensors["ref_audio_feat"].shape[0])
    if "audio_feat" in tensors:
        entry["audio_feat_len"] = int(tensors["audio_feat"].shape[0])
    if prompt_text:
        entry["prompt_text"] = prompt_text
    if speaker_description:
        entry["speaker_description"] = speaker_description

    manifest.setdefault("voices", {})[voice_name] = entry
    (output_dir / MANIFEST_NAME).write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
    print(f"Wrote {output_dir / filename}")
    print(f"Updated {output_dir / MANIFEST_NAME}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Pre-compute VoxCPM2 custom voice profile")
    parser.add_argument("--model", default="openbmb/VoxCPM2", help="VoxCPM2 model path or Hugging Face ID")
    parser.add_argument("--voice-name", required=True)
    parser.add_argument("--ref-audio", required=True)
    parser.add_argument(
        "--prompt-text",
        default=None,
        help="Transcript of ref audio for continuation/ref_continuation modes",
    )
    parser.add_argument(
        "--mode",
        choices=["reference", "continuation", "ref_continuation"],
        default="reference",
    )
    parser.add_argument("--speaker-description", default=None)
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
    args = parser.parse_args()

    _write_voice(
        model=args.model,
        output_dir=Path(args.output_dir),
        voice_name=args.voice_name,
        ref_audio=args.ref_audio,
        prompt_text=args.prompt_text,
        mode=args.mode,
        speaker_description=args.speaker_description,
        device=torch.device(args.device),
    )


if __name__ == "__main__":
    main()
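
After running the script, the manifest can be inspected to see which voice names are registered. The sketch below reads a manifest shaped like the one `_write_voice` produces and maps each voice name to its mode. The function name `list_custom_voices` is hypothetical, and the demo entry (`alice`, `alice.safetensors`) is invented for illustration; the real file name depends on `safe_voice_stem`.

```python
import json
import tempfile
from pathlib import Path

MANIFEST_NAME = "custom_voice_manifest.json"


def list_custom_voices(voice_dir: Path) -> dict[str, str]:
    """Map each voice name in the manifest to its mode; empty if no manifest exists."""
    path = voice_dir / MANIFEST_NAME
    if not path.exists():
        return {}
    manifest = json.loads(path.read_text(encoding="utf-8"))
    return {name: entry.get("mode", "") for name, entry in manifest.get("voices", {}).items()}


# Demonstrate on a throwaway manifest with the same top-level keys as above.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    demo_manifest = {
        "schema_version": 1,
        "model_type": "voxcpm2",
        "model": "openbmb/VoxCPM2",
        "voices": {
            "alice": {"name": "alice", "file": "alice.safetensors", "mode": "reference"},
        },
    }
    (demo_dir / MANIFEST_NAME).write_text(json.dumps(demo_manifest, indent=2), encoding="utf-8")
    voices = list_custom_voices(demo_dir)

print(voices)  # prints {'alice': 'reference'}
```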
voxtral_tts/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/voxtral_tts/gradio_demo.py.

voxtral_tts/text_preprocess.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/voxtral_tts/text_preprocess.py.