Text-To-Speech (Online Serving)

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/text_to_speech

vLLM-Omni serves TTS models through the OpenAI-compatible POST /v1/audio/speech endpoint; start the server with vllm serve <model> --omni. Each TTS model has its own subdirectory containing client snippets, Gradio demos, and helper scripts; this README is the single documentation entry point for all of them.

For offline inference, see examples/offline_inference/text_to_speech. For the full list of supported architectures across all modalities, see Supported Models.

Supported Models

Model HuggingFace repo Voice cloning Streaming Voice presets / upload Gradio demo
Fish Speech S2 Pro fishaudio/s2-pro ✓ (ref_audio+ref_text) ✓ (PCM stream)
GLM-TTS zai-org/GLM-TTS ✓ (ref_audio+ref_text, required) ✓ (PCM stream)
Ming-flash-omni-TTS Jonathan1909/Ming-flash-omni-2.0 — (caption-controlled) caption fields (instructions)
MOSS-TTS-Nano OpenMOSS-Team/MOSS-TTS-Nano ✓ (ref_audio required) ✓ (PCM stream)
OmniVoice k2-fsa/OmniVoice
Qwen3-TTS Qwen/Qwen3-TTS-12Hz-1.7B-{CustomVoice,VoiceDesign,Base} ✓ (Base) ✓ (PCM + WebSocket) ✓ (presets + /v1/audio/voices upload) ✓ (standard + FastRTC)
VoxCPM2 openbmb/VoxCPM2 ✓ (AudioWorklet via gradio)
Voxtral TTS mistralai/Voxtral-4B-TTS-2603 ✓ (gated upstream) ✓ (presets)

CosyVoice3 has no section in this README yet. Its launch script and client are listed under Example materials; for everything else, see its offline section.

Common Quick Start

Launch the server (defaults shown — adjust --port, --gpu-memory-utilization, etc. as needed):

vllm serve <hf-repo-or-local-path> --omni --port 8091

Send a TTS request via curl:

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, how are you?",
        "voice": "default",
        "response_format": "wav"
    }' --output output.wav

Or via Python httpx:

import httpx

response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={
        "input": "Hello, how are you?",
        "voice": "default",
        "response_format": "wav",
    },
    timeout=300.0,
)
response.raise_for_status()  # fail loudly instead of saving an error body as audio
with open("output.wav", "wb") as f:
    f.write(response.content)

Or via the OpenAI SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
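# Request WAV explicitly so the payload matches the output file name, and use
# the streaming-response helper (calling stream_to_file on a plain response is
# deprecated in the OpenAI SDK).
with client.audio.speech.with_streaming_response.create(
    model="<hf-repo>",
    voice="default",
    input="Hello, how are you?",
    response_format="wav",
) as response:
    response.stream_to_file("output.wav")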

Streaming PCM output (where supported) — set stream=true with response_format="pcm":

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, how are you?",
        "voice": "default",
        "stream": true,
        "response_format": "pcm"
    }' --no-buffer | play -t raw -r 24000 -e signed -b 16 -c 1 -

Adjust the player's sample rate to match the model: 44.1 kHz for Fish Speech, 48 kHz for MOSS-TTS-Nano and VoxCPM2, and 24 kHz for the others.
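When a streamed response is saved to disk instead of piped into a player, the raw PCM carries no header, so the sample rate has to be supplied by the client. The sketch below wraps raw bytes in a WAV container using the standard-library wave module. It assumes 16-bit signed little-endian mono samples, matching the play flags above; the function name is illustrative and not part of vLLM-Omni.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container.

    Assumption: samples are signed 16-bit little-endian, one channel.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono, like `-c 1`
        wav_file.setsampwidth(2)            # 2 bytes per sample, like `-b 16`
        wav_file.setframerate(sample_rate)  # must match the model's output rate
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```

For example, pcm_to_wav_bytes(data, 44100) would suit a Fish Speech stream, and 48000 a VoxCPM2 or MOSS-TTS-Nano stream.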

For full request-shape documentation (all parameters, response formats, error codes), see the Speech API reference.


GLM-TTS

2-stage TTS (AR + DiT flow-matching) at 24 kHz. Every request requires ref_audio + ref_text.

Launch

vllm serve zai-org/GLM-TTS --omni --trust-remote-code --port 8091
# or:
bash examples/online_serving/text_to_speech/glm_tts/run_server.sh /path/to/GLM-TTS

Sending requests

# Voice cloning (required)
python examples/online_serving/text_to_speech/glm_tts/openai_speech_client.py \
    --text "你好,这是语音克隆测试。" \
    --ref-audio file:///path/to/ref.wav \
    --ref-text "这是参考音频的文本内容。"

# Custom format
python examples/online_serving/text_to_speech/glm_tts/openai_speech_client.py \
    --text "Hello, this is a voice cloning test." \
    --ref-audio file:///path/to/ref.wav \
    --ref-text "Transcript of the reference audio." \
    --response-format mp3 -o output.mp3
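The --ref-audio values above are file:// URIs. A small helper can build one from a relative or home-relative path; this is a sketch, and the function name is hypothetical. Note that the client passes the URI through unchanged, so the path presumably has to be readable by the server process rather than by the client.

```python
from pathlib import Path


def to_file_uri(path: str) -> str:
    """Return an absolute file:// URI for a local path."""
    # as_uri() requires an absolute path, so resolve first.
    return Path(path).expanduser().resolve().as_uri()
```

The result can be passed directly as the --ref-audio argument.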

Gradio demo

bash examples/online_serving/text_to_speech/glm_tts/run_gradio_demo.sh

Notes

  • Output: 24 kHz mono WAV via HiFT vocoder.
  • ref_audio + ref_text are required together on every request. Reference audio should be 3-10 seconds.
  • Voice cloning feature extraction (WhisperVQ, CampPlus, mel) runs on the model side — no external dependency on the serving layer.
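The 3-10 second recommendation for reference audio can be checked on the client before sending a request. The following is a minimal sketch using the standard-library wave module, which only reads PCM WAV files; both function names are illustrative and not part of the GLM-TTS client.

```python
import wave


def reference_duration_seconds(wav_path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_recommended_reference(wav_path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Check the clip against the 3-10 second guidance from the notes above."""
    return min_s <= reference_duration_seconds(wav_path) <= max_s
```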

Fish Speech S2 Pro

4B dual-AR TTS at 44.1 kHz. Server uses the DAC codec.

Prerequisites

pip install fish-speech

Kvcache attention fast path

Fish Speech S2 Pro uses a Triton decode-only kvcache attention fast path by default on CUDA builds. Set VLLM_OMNI_FISH_KVCACHE_ATTN=0 to disable it, or VLLM_OMNI_FISH_KVCACHE_ATTN=required to fail fast if the fast path cannot be installed.

# Verify fast path availability.
python - <<'PY'
from vllm_omni.attention import fish_kvcache_attn

print(fish_kvcache_attn.is_available())
print(fish_kvcache_attn.load_error())
PY

# Optional: disable the runtime fast path.
export VLLM_OMNI_FISH_KVCACHE_ATTN=0

Launch

vllm serve fishaudio/s2-pro --omni --port 8091
# or:
./fish_speech/run_server.sh

The deploy config auto-loads from vllm_omni/deploy/fish_qwen3_omni.yaml (the HF model_type on the fishaudio checkpoint is fish_qwen3_omni).

Voice cloning

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, this is a cloned voice.",
        "voice": "default",
        "ref_audio": "https://example.com/reference.wav",
        "ref_text": "Transcript of the reference audio."
    }' --output cloned.wav

CLI client

cd examples/online_serving/text_to_speech/fish_speech
python speech_client.py --text "Hello, how are you?"
python speech_client.py --text "Hello world" --stream --output output.pcm

Gradio demo

./fish_speech/run_gradio_demo.sh             # launches server + Gradio
python fish_speech/gradio_demo.py --api-base http://localhost:8091  # if server already running

Notes

  • Output: 44.1 kHz mono.
  • Streaming PCM player command must use -r 44100.
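Because the sample rate determines how many bytes make up one second of audio, a quick sanity check on a saved stream is to convert its size to a duration. This sketch assumes 16-bit mono PCM at 44.1 kHz, as described above; the helper name is illustrative.

```python
def pcm_duration_seconds(
    num_bytes: int,
    sample_rate: int = 44100,
    channels: int = 1,
    sample_width: int = 2,
) -> float:
    """Duration of a raw PCM buffer given its size in bytes."""
    bytes_per_second = sample_rate * channels * sample_width
    return num_bytes / bytes_per_second
```

With os.path.getsize("output.pcm") as input, a result that is roughly double or half the expected length usually points to a mismatched sample rate.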

Ming-flash-omni-TTS

Standalone talker-only deployment of Ming-flash-omni-2.0. Voice is controlled through caption text passed via instructions.

Launch

# from repo root
bash examples/online_serving/text_to_speech/ming_flash_omni_tts/run_server.sh

Equivalent manual command:

vllm serve Jonathan1909/Ming-flash-omni-2.0 \
    --deploy-config vllm_omni/deploy/ming_flash_omni_tts.yaml \
    --host 0.0.0.0 --port 8091 \
    --trust-remote-code --omni

Sending requests

python examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py \
    --text "我们当迎着阳光辛勤耕作,去摘取,去制作,去品尝,去馈赠。" \
    --output ming_online.wav

ASMR-style caption via instructions:

python examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py \
    --text "我会一直在这里陪着你,直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗?" \
    --instructions "这是一种ASMR耳语,属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语,声音气音成分重。" \
    --output ming_online_asmr.wav

Notes


MOSS-TTS-Nano

Single-stage 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec at 48 kHz mono. Every request must include ref_audio; there are no built-in speaker presets.

The OpenAI-schema voice and ref_text fields are accepted but ignored — voice_clone does not consume a transcript, and upstream's continuation mode (the only path that accepts prompt_text) emits near-silent output, so it is not exposed here. Sample reference clips ship in the upstream repo under assets/audio/.

Launch

vllm serve OpenMOSS-Team/MOSS-TTS-Nano --omni --port 8091
# or:
./moss_tts_nano/run_server.sh

The deploy config at vllm_omni/deploy/moss_tts_nano.yaml auto-loads; no --stage-configs-path, --trust-remote-code, or --enforce-eager flags are needed.

Sending requests

# One-off fetch of a sample reference clip; cache under XDG_CACHE_HOME.
REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
mkdir -p "$REF_DIR"
REF_WAV="$REF_DIR/zh_1.wav"
[ -s "$REF_WAV" ] || curl -L -o "$REF_WAV" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav
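# `base64 -w 0` is GNU-specific; stripping newlines with tr also works with BSD/macOS base64.
REF_AUDIO=$(base64 < "$REF_WAV" | tr -d '\n')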

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d "{
        \"input\": \"你好,这是语音合成测试。\",
        \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
        \"response_format\": \"wav\"
    }" --output output.wav
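Escaping JSON inside a double-quoted shell string is easy to get wrong, so the same request body can also be built in Python. This sketch mirrors the fields of the curl call above; the function name is hypothetical, and sending the body (for example with httpx.post) is left out.

```python
import base64
import json


def build_moss_request(text: str, ref_wav_bytes: bytes) -> str:
    """Return the JSON body for a non-streaming MOSS-TTS-Nano request."""
    ref_b64 = base64.b64encode(ref_wav_bytes).decode("ascii")
    payload = {
        "input": text,
        # ref_audio is sent as a base64 data URL, as in the curl example.
        "ref_audio": f"data:audio/wav;base64,{ref_b64}",
        "response_format": "wav",
    }
    # ensure_ascii=False keeps non-ASCII input text readable in the body.
    return json.dumps(payload, ensure_ascii=False)
```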

Streaming PCM

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d "{
        \"input\": \"Hello, streaming output from MOSS-TTS-Nano.\",
        \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
        \"stream\": true,
        \"response_format\": \"pcm\"
    }" --no-buffer | play -t raw -r 48000 -e signed -b 16 -c 1 -

Gradio demo

# Option 1: launch server + Gradio together
./moss_tts_nano/run_gradio_demo.sh

# Option 2: server already running
python moss_tts_nano/gradio_demo.py --api-base http://localhost:8091

Then open http://localhost:7860 in your browser.

Notes

  • Output is 48 kHz mono PCM (the upstream tokenizer is internally stereo at 48 kHz; the wrapper averages to mono before reaching the engine).
  • Standard /v1/audio/speech request shape: input, ref_audio (base64 data URL), response_format, stream, max_new_tokens. The voice and ref_text fields from the OpenAI schema are accepted but ignored.

OmniVoice

Zero-shot multilingual TTS (600+ languages). Online serving currently exposes auto voice only; voice cloning and voice design are available offline.

Prerequisites

huggingface-cli download k2-fsa/OmniVoice

Voice cloning (offline) needs transformers>=5.3.0; auto voice works with transformers>=4.57.0.

Launch

vllm serve k2-fsa/OmniVoice --omni --port 8091 --trust-remote-code
# or:
./omnivoice/run_server.sh

CLI client

cd examples/online_serving/text_to_speech/omnivoice
# Text-only (auto voice)
python speech_client.py --text "Hello, how are you?"

# Language hint
python speech_client.py --text "Bonjour, comment allez-vous?" --language French

# Voice cloning (reference audio + optional ref_text)
python speech_client.py \
    --text "Bonjour, comment allez-vous?" \
    --ref-audio /path/to/ref_audio.wav \
    --ref-text "Bonjour, comment allez-vous?"

# Style instruction (voice design-style control)
python speech_client.py \
    --text "Bonjour, comment allez-vous?" \
    --language French \
    --instructions "loud voice"

# Deterministic output with seed parameter
python speech_client.py --text "Hello, how are you?" --seed 42

The client supports --api-base, --model, --text, --response-format, --language, --voice, --ref-audio, --ref-text, --instructions, --seed, and --output.

Qwen3-TTS

Three model variants exposed via separate checkpoints:

Variant HF repo Use
CustomVoice Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice Predefined speakers (vivian, ryan, …) with optional style instructions
VoiceDesign Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign Natural-language voice style description
Base Qwen/Qwen3-TTS-12Hz-1.7B-Base Voice cloning from a reference audio

Each variant ships smaller 0.6B companions where available.

Launch

vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni --port 8091
# or:
./qwen3_tts/run_server.sh                # default: CustomVoice
./qwen3_tts/run_server.sh VoiceDesign
./qwen3_tts/run_server.sh Base

Executor backend

Single-GPU deployments now default to the uniproc executor, which has lower IPC overhead (see the Base cloning use case in #2603 / #2604). vllm_omni/deploy/qwen3_tts.yaml is the only Qwen3-TTS deploy config; pass --deploy-config <path> to override it.

To opt out of chunked streaming, pass --no-async-chunk — the pipeline auto-dispatches to the end-to-end codec processor.

Sending requests

# CustomVoice with a predefined speaker
python qwen3_tts/openai_speech_client.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --text "今天天气真好" \
    --speaker ryan \
    --instructions "用开心的语气说"

# VoiceDesign with a style description
python qwen3_tts/openai_speech_client.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
    --task-type VoiceDesign \
    --text "哥哥,你回来啦" \
    --instructions "体现撒娇稚嫩的萝莉女声,音调偏高"

# Base voice cloning
python qwen3_tts/openai_speech_client.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
    --task-type Base \
    --text "Hello, this is a cloned voice" \
    --ref-audio /path/to/reference.wav \
    --ref-text "Original transcript of the reference audio"

Voices endpoint

List available voices, or upload a custom one for Base cloning:

# List
curl http://localhost:8091/v1/audio/voices

# Upload
curl -X POST http://localhost:8091/v1/audio/voices \
    -F "audio_sample=@/path/to/voice_sample.wav" \
    -F "consent=user_consent_id" \
    -F "name=custom_voice_1" \
    -F "ref_text=The exact transcript of the audio sample." \
    -F "speaker_description=warm narrator"

Uploaded voices are then usable as voice="custom_voice_1" on subsequent requests.

Precomputed custom voices

For reused Base voice-cloning speakers, precompute the reference artifacts once and load them at server startup:

python qwen3_tts/precompute_custom_voice.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
    --voice-name alice \
    --ref-audio /path/to/reference.wav \
    --ref-text "Original transcript of the reference audio" \
    --mode icl \
    --output-dir /path/to/custom_voices

--mode icl stores both speaker_embedding and ref_code; --mode xvec stores only the speaker embedding. Add the output directory to a deploy config:

custom_voice_dir: /path/to/custom_voices

Then start the server with that config and call the Speech API with only the voice name:

vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base --omni --deploy-config /path/to/qwen3_tts_custom_voice.yaml

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"input":"Hello from a precomputed voice.","voice":"alice","task_type":"Base"}' \
    --output alice.wav

Streaming PCM

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, how are you?",
        "voice": "vivian",
        "language": "English",
        "stream": true,
        "response_format": "pcm"
    }' --no-buffer | play -t raw -r 24000 -e signed -b 16 -c 1 -

Streaming requires response_format="pcm" and async_chunk: true on the stage config (default in qwen3_tts.yaml). The speed parameter is not supported when streaming.
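The request-side constraints above can be checked on the client before a request is sent. This is a minimal sketch with a hypothetical helper name; it covers only the two request-level rules stated here, since async_chunk is a server-side setting that a client cannot inspect.

```python
def check_streaming_request(payload: dict) -> list[str]:
    """Return a list of problems with a streaming request payload (empty if none)."""
    problems = []
    if payload.get("stream"):
        if payload.get("response_format") != "pcm":
            problems.append('streaming requires response_format="pcm"')
        if "speed" in payload:
            problems.append("speed is not supported when streaming")
    return problems
```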

Streaming WebSocket

The /v1/audio/speech/stream endpoint accepts text incrementally, splits it at sentence boundaries, and emits one PCM stream per sentence:

python qwen3_tts/streaming_speech_client.py --text "Hello world. How are you? I am fine."
python qwen3_tts/streaming_speech_client.py --text "..." --simulate-stt --stt-delay 0.1
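To illustrate what splitting incremental text at sentence boundaries involves, here is a small buffer that accumulates partial text and releases complete sentences. It is an illustrative sketch, not the server's implementation: the punctuation set and the class name are assumptions, and a naive rule like this also splits on decimals such as "3.14".

```python
import re

# Assumed boundary rule: ASCII or CJK sentence-final punctuation plus trailing whitespace.
_SENTENCE_END = re.compile(r"[.!?。!?]+\s*")


class SentenceBuffer:
    """Accumulate incremental text and emit sentences once they are complete."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, text: str) -> list[str]:
        """Add a text fragment and return any sentences it completes."""
        self._pending += text
        sentences = []
        last_end = 0
        for match in _SENTENCE_END.finditer(self._pending):
            sentence = self._pending[last_end:match.end()].strip()
            if sentence:
                sentences.append(sentence)
            last_end = match.end()
        # Keep the unfinished tail for the next fragment.
        self._pending = self._pending[last_end:]
        return sentences

    def flush(self) -> str:
        """Return whatever text remains when the input ends."""
        rest, self._pending = self._pending.strip(), ""
        return rest
```

Feeding fragments as they arrive from a speech-to-text source, then calling flush() at the end, yields one unit of text per synthesized sentence.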

Gradio demos

./qwen3_tts/run_gradio_demo.sh                              # CustomVoice (default)
./qwen3_tts/run_gradio_demo.sh --task-type VoiceDesign
./qwen3_tts/run_gradio_demo.sh --task-type Base

Speaker embedding interpolation

qwen3_tts/speaker_embedding_interpolation.py blends two predefined speakers' embeddings to produce intermediate voices. See the script for usage.
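As a rough picture of what blending two embeddings means, the sketch below linearly interpolates between two vectors. Linear interpolation is an assumption made for illustration; the script may use a different blending method, and the function name is hypothetical.

```python
def interpolate_embeddings(a: list[float], b: list[float], weight: float) -> list[float]:
    """Blend two embeddings: weight=0 returns a, weight=1 returns b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return [(1.0 - weight) * x + weight * y for x, y in zip(a, b)]
```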

Batch client

qwen3_tts/batch_speech_client.py issues many concurrent requests for throughput measurement.
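The general pattern behind such a client is a bounded pool of in-flight requests with per-request timing. The sketch below shows that pattern with asyncio; it is not the shipped script, and send stands in for whatever coroutine performs one request to /v1/audio/speech.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_batch(
    send: Callable[[int], Awaitable[None]],
    num_requests: int,
    concurrency: int,
) -> tuple[float, list[float]]:
    """Run send(i) for each index with at most `concurrency` requests in flight.

    Returns (total wall-clock seconds, per-request latencies in seconds).
    """
    semaphore = asyncio.Semaphore(concurrency)
    latencies: list[float] = []

    async def worker(index: int) -> None:
        async with semaphore:
            start = time.perf_counter()
            await send(index)
            latencies.append(time.perf_counter() - start)

    batch_start = time.perf_counter()
    await asyncio.gather(*(worker(i) for i in range(num_requests)))
    return time.perf_counter() - batch_start, latencies
```

Throughput is then num_requests divided by the total time, and the latency list can be sorted to read off percentiles.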

Notes

  • Base voice cloning has uniproc-vs-mp tradeoffs depending on per-request reference audio cost; see the executor-backend section above.
  • With async chunking, Qwen3-TTS Base voice cloning sends the full reference context in the first Code2Wav packet, then caches that prefix on the Code2Wav stage for follow-up chunks in the same request.
  • vllm_omni/deploy/qwen3_tts.yaml is the default deploy config (loaded by HF model_type); per-stage runtime overrides are available via --stage-N-<field> <value>.

VoxCPM2

Single-stage native AR TTS at 48 kHz.

Launch

vllm serve openbmb/VoxCPM2 --omni --host 0.0.0.0 --port 8000

Deploy config auto-loads from vllm_omni/deploy/voxcpm2.yaml. Pass --deploy-config <path> to override or --stage-N-<field> <value> for per-stage runtime tweaks.

Sending requests

# Zero-shot synthesis
python voxcpm2/openai_speech_client.py --text "Hello, this is VoxCPM2."

# Voice cloning
python voxcpm2/openai_speech_client.py \
    --text "This should sound like the reference speaker." \
    --ref-audio /path/to/reference.wav

The ref_audio field accepts local file paths (auto-base64), HTTP URLs, or data:audio/wav;base64,... data URIs.
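The three accepted input forms can be normalized with a small dispatch: URLs and data URIs pass through unchanged, and local paths are read and base64-encoded. This is a sketch of that idea rather than the client's actual code; the function name and the extension-to-MIME table are assumptions.

```python
import base64
from pathlib import Path

# Assumed mapping; unknown extensions fall back to audio/wav.
_MIME_TYPES = {".wav": "audio/wav", ".mp3": "audio/mpeg", ".flac": "audio/flac", ".ogg": "audio/ogg"}


def normalize_ref_audio(value: str) -> str:
    """Return a value suitable for the ref_audio request field."""
    if value.startswith(("http://", "https://", "data:")):
        return value  # already a URL or data URI
    path = Path(value)
    mime_type = _MIME_TYPES.get(path.suffix.lower(), "audio/wav")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```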

Precomputed custom voices

For repeated VoxCPM2 speakers, precompute the prompt cache and load it through custom_voice_dir:

python voxcpm2/precompute_custom_voice.py \
    --model openbmb/VoxCPM2 \
    --voice-name alice \
    --ref-audio /path/to/reference.wav \
    --mode ref_continuation \
    --prompt-text "Original transcript of the reference audio" \
    --output-dir /path/to/custom_voices

Add the output directory to the deploy config:

custom_voice_dir: /path/to/custom_voices

After startup, /v1/audio/voices lists alice, and /v1/audio/speech can use voice="alice" without sending ref_audio.

Gradio demo (gapless streaming via AudioWorklet)

python voxcpm2/gradio_demo.py

Uses an AudioWorklet-based player adapted from the Qwen3-TTS demo for gap-free playback. Audio is streamed from the OpenAI Speech endpoint with stream=true.


Voxtral TTS

Voxtral-4B-TTS (Mistral). Uses the mistral_common SpeechRequest protocol; voice presets are model-specific.

Prerequisites

Requires a recent mistral_common with SpeechRequest support:

pip install -e /path/to/mistral-common  # or upgrade from PyPI when available

Launch

vllm serve mistralai/Voxtral-4B-TTS-2603 --omni --port 8091

Deploy config auto-loads from vllm_omni/deploy/voxtral_tts.yaml.

Gradio demo

python voxtral_tts/gradio_demo.py

The demo handles voice-preset selection and reference-audio upload. voxtral_tts/text_preprocess.py provides the text-normalization helpers used by the demo (also available for other clients).

Notes

  • Voice presets are listed on the HF model card (mistralai/Voxtral-4B-TTS-2603).
  • Voice cloning is gated upstream and may require a recent mistral_common.
  • A standalone CLI client is not yet shipped; the Gradio demo is the canonical reference for now.

Example materials

cosyvoice3/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for CosyVoice3 TTS
#
# Usage:
#   ./run_server.sh
#   CUDA_VISIBLE_DEVICES=0 ./run_server.sh
#
# Streaming (async-chunk) is on by default via vllm_omni/deploy/cosyvoice3.yaml.
# Set NO_ASYNC_CHUNK=1 to use the legacy synchronous path.

set -e

MODEL="${MODEL:-FunAudioLLM/Fun-CosyVoice3-0.5B-2512}"
PORT="${PORT:-8091}"

EXTRA_ARGS=()
if [[ -n "${NO_ASYNC_CHUNK:-}" ]]; then
    EXTRA_ARGS+=(--no-async-chunk)
fi

echo "Starting CosyVoice3 server with model: $MODEL"

vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --trust-remote-code \
    --omni \
    "${EXTRA_ARGS[@]}"

cosyvoice3/speech_client.py

"""Client for CosyVoice3 TTS via /v1/audio/speech endpoint.

CosyVoice3 has no built-in voice presets: every request is voice cloning
driven by ``ref_audio`` + ``ref_text``. The defaults below point at the
official upstream zero-shot prompt so the script runs out of the box.

Examples:
    # Voice cloning with the default upstream prompt
    python speech_client.py --text "收到好友从远方寄来的生日礼物。"

    # Custom reference clip + transcript
    python speech_client.py --text "Hello, this is a cloned voice." \
        --ref-audio /path/to/reference.wav \
        --ref-text "Transcript of the reference audio."

    # Streaming PCM output
    python speech_client.py --text "Hello world" --stream --output output.pcm
"""

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
DEFAULT_MODEL = "FunAudioLLM/Fun-CosyVoice3-0.5B-2512"

# Official CosyVoice zero-shot prompt and its transcript.
DEFAULT_REF_AUDIO = "https://raw.githubusercontent.com/FunAudioLLM/CosyVoice/main/asset/zero_shot_prompt.wav"
DEFAULT_REF_TEXT = "希望你以后能够做的比我还好呦。"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "flac": "audio/flac", "ogg": "audio/ogg"}
    mime_type = mime_map.get(ext, "audio/wav")
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"


def run_tts(args) -> None:
    """Generate speech via the /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }

    if args.ref_audio.startswith(("http://", "https://")):
        payload["ref_audio"] = args.ref_audio
    else:
        payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    payload["ref_text"] = args.ref_text

    if args.stream:
        payload["stream"] = True
        payload["response_format"] = "pcm"

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print(f"Voice cloning: ref_audio={args.ref_audio}, ref_text={args.ref_text}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    if args.stream:
        output_path = args.output or "output.pcm"
        with httpx.Client(timeout=300.0) as client:
            with client.stream("POST", api_url, json=payload, headers=headers) as resp:
                if resp.status_code != 200:
                    print(f"Error: {resp.status_code}")
                    print(resp.read().decode())
                    return
                total_bytes = 0
                with open(output_path, "wb") as f:
                    for chunk in resp.iter_bytes():
                        f.write(chunk)
                        total_bytes += len(chunk)
                print(f"Streamed {total_bytes} bytes to: {output_path}")
    else:
        with httpx.Client(timeout=300.0) as client:
            response = client.post(api_url, json=payload, headers=headers)

        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.text)
            return

        try:
            text = response.content.decode("utf-8")
            if text.startswith('{"error"'):
                print(f"Error: {text}")
                return
        except UnicodeDecodeError:
            pass

        output_path = args.output or "output.wav"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Audio saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="CosyVoice3 TTS client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default=DEFAULT_MODEL, help="Model name")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument(
        "--ref-audio",
        default=DEFAULT_REF_AUDIO,
        help="Reference audio for voice cloning (path or URL)",
    )
    parser.add_argument(
        "--ref-text",
        default=DEFAULT_REF_TEXT,
        help="Transcript of the reference audio",
    )
    parser.add_argument("--stream", action="store_true", help="Enable streaming (PCM output)")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    args = parser.parse_args()
    run_tts(args)


if __name__ == "__main__":
    main()

fish_speech/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/fish_speech/gradio_demo.py.

fish_speech/run_gradio_demo.sh

#!/bin/bash
# Launch Fish Speech S2 Pro server + Gradio demo together.
#
# Usage:
#   ./run_gradio_demo.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8091 GRADIO_PORT=7860 ./run_gradio_demo.sh

set -e

MODEL="${MODEL:-fishaudio/s2-pro}"
PORT="${PORT:-8091}"
GRADIO_PORT="${GRADIO_PORT:-7860}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

echo "Starting Fish Speech S2 Pro server (port $PORT)..."
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --omni \
    --host 0.0.0.0 \
    --port "$PORT" &
SERVER_PID=$!

cleanup() {
    echo "Stopping server (PID $SERVER_PID)..."
    kill $SERVER_PID 2>/dev/null
    wait $SERVER_PID 2>/dev/null
}
trap cleanup EXIT

# Wait for server to be ready.
echo "Waiting for server to start..."
for i in $(seq 1 120); do
    if curl -s "http://localhost:$PORT/health" > /dev/null 2>&1; then
        echo "Server ready."
        break
    fi
    sleep 2
done

echo "Starting Gradio demo (port $GRADIO_PORT)..."
python "$SCRIPT_DIR/gradio_demo.py" \
    --api-base "http://localhost:$PORT" \
    --port "$GRADIO_PORT"

fish_speech/run_server.sh

#!/bin/bash
# Launch vLLM-Omni server for Fish Speech S2 Pro
#
# Usage:
#   ./run_server.sh
#   CUDA_VISIBLE_DEVICES=0 ./run_server.sh

set -e

MODEL="${MODEL:-fishaudio/s2-pro}"
PORT="${PORT:-8091}"

echo "Starting Fish Speech S2 Pro server with model: $MODEL"

FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --omni \
    --host 0.0.0.0 \
    --port "$PORT"

fish_speech/speech_client.py

"""Client for Fish Speech S2 Pro via /v1/audio/speech endpoint.

Examples:
    # Basic TTS
    python speech_client.py --text "Hello, how are you?"

    # Voice cloning
    python speech_client.py --text "Hello, how are you?" \
        --ref-audio ref.wav --ref-text "This is the reference transcript."

    # Streaming PCM output
    python speech_client.py --text "Hello world" --stream --output output.pcm
"""

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "flac": "audio/flac", "ogg": "audio/ogg"}
    mime_type = mime_map.get(ext, "audio/wav")
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"


def run_tts(args) -> None:
    """Generate speech via /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "voice": "default",
        "response_format": args.response_format,
    }

    # Voice cloning parameters.
    if args.ref_audio:
        if args.ref_audio.startswith(("http://", "https://")):
            payload["ref_audio"] = args.ref_audio
        else:
            payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    if args.ref_text:
        payload["ref_text"] = args.ref_text

    if args.stream:
        payload["stream"] = True
        payload["response_format"] = "pcm"

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    if args.ref_audio:
        print(f"Voice cloning: ref_audio={args.ref_audio}, ref_text={args.ref_text}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    if args.stream:
        output_path = args.output or "output.pcm"
        with httpx.Client(timeout=300.0) as client:
            with client.stream("POST", api_url, json=payload, headers=headers) as resp:
                if resp.status_code != 200:
                    print(f"Error: {resp.status_code}")
                    print(resp.read().decode())
                    return
                total_bytes = 0
                with open(output_path, "wb") as f:
                    for chunk in resp.iter_bytes():
                        f.write(chunk)
                        total_bytes += len(chunk)
                print(f"Streamed {total_bytes} bytes to: {output_path}")
    else:
        with httpx.Client(timeout=300.0) as client:
            response = client.post(api_url, json=payload, headers=headers)

        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.text)
            return

        try:
            text = response.content.decode("utf-8")
            if text.startswith('{"error"'):
                print(f"Error: {text}")
                return
        except UnicodeDecodeError:
            pass

        output_path = args.output or "output.wav"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Audio saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="Fish Speech S2 Pro TTS client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default="fishaudio/s2-pro", help="Model name")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument("--ref-audio", default=None, help="Reference audio for voice cloning (path or URL)")
    parser.add_argument("--ref-text", default=None, help="Transcript of reference audio")
    parser.add_argument("--stream", action="store_true", help="Enable streaming (PCM output)")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    args = parser.parse_args()
    run_tts(args)


if __name__ == "__main__":
    main()

glm_tts/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/glm_tts/gradio_demo.py.

glm_tts/openai_speech_client.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""OpenAI-compatible client for GLM-TTS via /v1/audio/speech endpoint.

GLM-TTS is a two-stage TTS system (AR + DiT) that generates audio from text
conditioned on reference speech. Each request requires ref_audio + ref_text.

Usage:
    # Voice cloning
    python openai_speech_client.py --text "你好" --ref-audio file:///path/to/ref.wav --ref-text "参考文本"

    # Streaming response, for async_chunk server mode
    python openai_speech_client.py --text "你好" --stream --ref-audio file:///path/to/ref.wav --ref-text "参考文本"

    # Specify output format
    python openai_speech_client.py --text "你好" --ref-audio file:///path/to/ref.wav \
        --ref-text "参考文本" --response-format mp3 -o output.mp3
"""

import argparse

import httpx

# Default server configuration
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def run_tts_generation(args) -> None:
    """Run TTS generation via OpenAI-compatible /v1/audio/speech API."""
    if not args.ref_audio or not args.ref_text:
        raise ValueError("GLM-TTS requires --ref-audio and --ref-text for voice cloning.")

    payload = {
        "model": args.model,
        "voice": "default",
        "input": args.text,
        "response_format": args.response_format,
        "stream": bool(args.stream),
        "ref_audio": args.ref_audio,
        "ref_text": args.ref_text,
    }
    if args.max_new_tokens:
        payload["max_new_tokens"] = args.max_new_tokens

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print(f"Voice cloning: ref_audio={args.ref_audio}, ref_text={args.ref_text}")
    print(f"Stream: {args.stream}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    if args.stream:
        output_path = args.output or "tts_output.pcm"
        with httpx.Client(timeout=300.0) as client, open(output_path, "wb") as f:
            with client.stream("POST", api_url, json=payload, headers=headers) as response:
                if response.status_code != 200:
                    print(f"Error: {response.status_code}")
                    response.read()
                    print(response.text)
                    return
                for chunk in response.iter_bytes():
                    f.write(chunk)
        print(f"Streaming audio saved to: {output_path}")
    else:
        with httpx.Client(timeout=300.0) as client:
            response = client.post(api_url, json=payload, headers=headers)
        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            print(response.text)
            return
        try:
            text = response.content.decode("utf-8")
            if text.startswith('{"error"'):
                print(f"Error: {text}")
                return
        except UnicodeDecodeError:
            pass
        output_path = args.output or f"tts_output.{args.response_format}"
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Audio saved to: {output_path}")


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="OpenAI-compatible client for GLM-TTS via /v1/audio/speech",
    )

    # Server configuration
    parser.add_argument(
        "--api-base",
        type=str,
        default=DEFAULT_API_BASE,
        help=f"API base URL (default: {DEFAULT_API_BASE})",
    )
    parser.add_argument(
        "--api-key",
        type=str,
        default=DEFAULT_API_KEY,
        help="API key (default: EMPTY)",
    )
    parser.add_argument(
        "--model",
        "-m",
        type=str,
        default="glm-tts",
        help="Model name/path",
    )

    # Input text
    parser.add_argument(
        "--text",
        type=str,
        required=True,
        help="Text to synthesize",
    )

    # Generation parameters
    parser.add_argument(
        "--max-new-tokens",
        type=int,
        default=None,
        help="Maximum new tokens to generate (default: model default)",
    )

    # Output
    parser.add_argument(
        "--response-format",
        type=str,
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio output format (default: wav)",
    )
    parser.add_argument(
        "--stream",
        action="store_true",
        help="Request a streaming audio response (use with async_chunk server mode).",
    )
    parser.add_argument(
        "--output",
        "-o",
        type=str,
        default=None,
        help="Output audio file path (default: tts_output.<format>, or tts_output.pcm with --stream)",
    )

    # Voice cloning parameters
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio URL, file:// URI, or base64 data URL for voice cloning (required for GLM-TTS)",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcript of the reference audio (required for GLM-TTS, together with --ref-audio)",
    )

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    run_tts_generation(args)
glm_tts/run_gradio_demo.sh
#!/bin/bash
# Launch GLM-TTS server + Gradio demo together.
#
# Usage:
#   ./run_gradio_demo.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8091 GRADIO_PORT=7860 ./run_gradio_demo.sh

set -e

MODEL="${MODEL:-zai-org/GLM-TTS}"
PORT="${PORT:-8091}"
GRADIO_PORT="${GRADIO_PORT:-7860}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"

echo "Starting GLM-TTS server (port $PORT)..."
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm-omni serve "$MODEL" \
    --deploy-config "$REPO_ROOT/vllm_omni/deploy/glm_tts.yaml" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --omni &
SERVER_PID=$!

cleanup() {
    echo "Stopping server (PID $SERVER_PID)..."
    kill $SERVER_PID 2>/dev/null
    wait $SERVER_PID 2>/dev/null
}
trap cleanup EXIT

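# Wait for the server to be ready (up to 120 x 2 s = 240 s).
echo "Waiting for server to start..."
READY=0
for i in $(seq 1 120); do
    # -f makes curl fail on HTTP error statuses, not only on connection errors.
    if curl -sf "http://localhost:$PORT/health" > /dev/null 2>&1; then
        echo "Server ready."
        READY=1
        break
    fi
    sleep 2
done
if [ "$READY" -ne 1 ]; then
    echo "Server did not become ready within 240 s." >&2
    exit 1
fi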

echo "Starting Gradio demo (port $GRADIO_PORT)..."
python "$SCRIPT_DIR/gradio_demo.py" \
    --api-base "http://localhost:$PORT" \
    --port "$GRADIO_PORT"
glm_tts/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for GLM-TTS models
#
# Usage:
#   ./run_server.sh                           # Default model path, async_chunk mode
#   ./run_server.sh /path/to/GLM-TTS          # Custom model path, async_chunk mode
#   ./run_server.sh /path/to/GLM-TTS sync     # Sync two-stage mode
#
# NOTE: The model path should point to the repo ROOT (not llm/ subdirectory).
# model_subdir/tokenizer_subdir in the pipeline config resolve subdirectories.

set -e

MODEL="${1:-zai-org/GLM-TTS}"
MODE="${2:-async}"

EXTRA_ARGS=()
case "$MODE" in
    async|async_chunk)
        ;;
    sync|no_async_chunk)
        EXTRA_ARGS+=("--no-async-chunk")
        ;;
    *)
        echo "Unknown mode: $MODE (expected async or sync)" >&2
        exit 1
        ;;
esac

echo "Starting GLM-TTS server with model: $MODEL (mode: $MODE)"

vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/glm_tts.yaml \
    --host 0.0.0.0 \
    --port 8091 \
    --trust-remote-code \
    --omni \
    "${EXTRA_ARGS[@]}"
higgs_audio_v2/README.md

higgs-audio v2 online example

This directory contains the online-serving entry points for boson-ai's higgs-audio v2 as integrated by vllm-omni: a 2-stage TTS pipeline (Llama-3.2-3B talker with DualFFN audio expert + HiggsAudio codec decoder) emitting 24 kHz mono speech.

Prerequisites

Voice cloning uses Hugging Face's HiggsAudioV2TokenizerModel, loaded from k2-fsa/OmniVoice/audio_tokenizer/. (The model.safetensors in the boson-ai standalone tokenizer Hub repo is the 3B talker LM, not the codec.) Only that ~806 MB subdirectory is downloaded.

pip install -U "transformers>=5.3.0"

Files

  • run_server.sh — launch the vllm-omni server with the bundled vllm_omni/deploy/higgs_audio_v2.yaml deploy config.
  • batch_speech_client.py — send a list of prompts to /v1/audio/speech and save the returned WAV / PCM bytes to a directory; optionally passes --ref-audio + --ref-text for shallow voice clone.

Launching the server

GPUS=6,7 PORT=8094 bash examples/online_serving/text_to_speech/higgs_audio_v2/run_server.sh

Environment overrides:

  • MODEL — HF id of the talker (default bosonai/higgs-audio-v2-generation-3B-base).
  • PORT — server port (default 8094).
  • GPUS — CUDA_VISIBLE_DEVICES value (default 6,7).
  • GPU_UTIL — --gpu-memory-utilization (default 0.4).

The script also exports VLLM_USE_DEEP_GEMM=0 / VLLM_MOE_USE_DEEP_GEMM=0 so the example works on images without the optional deep_gemm backend.

The deploy YAML ships with async_chunk: false and codec_streaming: true, i.e. Stage 0 finishes its codec frames before Stage 1 starts decoding, and Stage 1 streams WAV/PCM bytes to the client chunk-by-chunk.
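
Because Stage 1 streams bytes chunk by chunk, a client can collect audio as it arrives and add a WAV header afterwards. The following is a minimal sketch, not part of the example directory. It assumes the server accepts "stream": true with "response_format": "pcm" (the request shape the GLM-TTS client on this page uses) and that the PCM is 16-bit mono at 24 kHz; the 16-bit sample width is an assumption, so check it against your deployment. wrap_pcm_as_wav and stream_speech are hypothetical helper names.

```python
import io
import wave

SAMPLE_RATE_HZ = 24_000   # 24 kHz mono, as stated in this README
SAMPLE_WIDTH_BYTES = 2    # assumption: 16-bit signed PCM


def wrap_pcm_as_wav(pcm: bytes, sample_rate: int = SAMPLE_RATE_HZ) -> bytes:
    """Add a WAV header to raw mono PCM so ordinary players can open it."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def stream_speech(base_url: str, model: str, text: str) -> bytes:
    """Collect streamed PCM chunks from /v1/audio/speech (hypothetical helper)."""
    import httpx  # imported lazily so wrap_pcm_as_wav works without httpx

    payload = {
        "model": model,
        "input": text,
        "response_format": "pcm",
        "stream": True,  # assumption: same flag the GLM-TTS client sends
    }
    chunks = []
    with httpx.Client(timeout=120.0) as client:
        with client.stream("POST", f"{base_url}/v1/audio/speech", json=payload) as resp:
            resp.raise_for_status()
            for chunk in resp.iter_bytes():
                chunks.append(chunk)
    return b"".join(chunks)
```

With a server running, passing the result of stream_speech("http://localhost:8094", "bosonai/higgs-audio-v2-generation-3B-base", "Hello world.") to wrap_pcm_as_wav and writing the bytes to a .wav file should give a playable clip, provided the sample-format assumption holds.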

Driving the server

Plain TTS:

python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
    --base-url http://localhost:8094 \
    --model bosonai/higgs-audio-v2-generation-3B-base \
    --output-dir /tmp/higgs_audio_v2_batch \
    --prompts "Hello world." \
              "The quick brown fox jumps over the lazy dog."

Voice clone — pass a reference clip and its transcript (both required together):

python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
    --base-url http://localhost:8094 \
    --model bosonai/higgs-audio-v2-generation-3B-base \
    --output-dir /tmp/higgs_audio_v2_clone \
    --ref-audio /path/to/reference.wav \
    --ref-text  "Exact transcript spoken in reference.wav." \
    --prompts "Hello, this is a cloned voice."

Notes

  • --ref-text must be the real transcript of --ref-audio; mismatched text degrades cloned-voice quality.
  • Out of scope (rejected with explicit 4xx by the request validator): multi-speaker [SPEAKERn] tags inside input, profile: text-only speaker descriptions, the ref_audio_in_system_message system-block variant, chunked long-form generation, and per-request voice / instructions / task_type / language / speed != 1.0 / x_vector_only_mode / speaker_embedding.
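
The out-of-scope list above can be turned into a client-side pre-check, so that a batch job fails fast instead of collecting 4xx responses. This is a sketch under assumptions: find_out_of_scope is a hypothetical helper, it covers only the per-request fields named in the note (not [SPEAKERn] tags or long-form chunking), and the server's validator remains the authority on what is accepted.

```python
# Per-request fields the note above lists as out of scope for higgs-audio v2.
UNSUPPORTED_FIELDS = (
    "voice",
    "instructions",
    "task_type",
    "language",
    "x_vector_only_mode",
    "speaker_embedding",
)


def find_out_of_scope(payload: dict) -> list[str]:
    """Return reasons this payload would likely be rejected (hypothetical pre-check)."""
    problems = [
        f"unsupported field: {name}"
        for name in UNSUPPORTED_FIELDS
        if payload.get(name) is not None
    ]
    # Only the default speed is in scope.
    if payload.get("speed", 1.0) != 1.0:
        problems.append("speed must be 1.0")
    # Voice clone needs the reference clip and its transcript together.
    if ("ref_audio" in payload) != ("ref_text" in payload):
        problems.append("ref_audio and ref_text must be supplied together")
    return problems
```

An empty list means the payload passes this local check; it does not guarantee that the server accepts the request.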
higgs_audio_v2/batch_speech_client.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Batch client for the higgs-audio v2 online server.

Sends a fixed list of prompts to ``/v1/audio/speech`` and saves the returned
WAV files (or raw PCM bytes when ``--format pcm``) into ``--output-dir``.

Usage (plain text -> speech):

  python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
      --base-url http://localhost:8094 \
      --output-dir /tmp/higgs_audio_v2_batch \
      --prompts "Hello world." "The quick brown fox jumps over the lazy dog."

Usage (shallow voice clone — pass a reference clip + its transcript):

  python examples/online_serving/text_to_speech/higgs_audio_v2/batch_speech_client.py \
      --base-url http://localhost:8094 \
      --output-dir /tmp/higgs_audio_v2_clone \
      --ref-audio path/to/reference.wav \
      --ref-text "the transcript of the reference clip" \
      --prompts "Hello world."
"""

from __future__ import annotations

import argparse
import base64
import sys
from pathlib import Path

DEFAULT_PROMPTS = (
    "Hello world.",
    "The quick brown fox jumps over the lazy dog.",
    "It was the night before my birthday.",
    "Innovation distinguishes between a leader and a follower.",
)


def _slug(text: str) -> str:
    import re

    s = re.sub(r"\s+", "_", text.strip().lower())
    return re.sub(r"[^a-z0-9_]+", "", s)[:32] or "prompt"


def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("--base-url", default="http://localhost:8094")
    parser.add_argument("--model", default="higgs_audio_v2")
    parser.add_argument("--prompts", nargs="+", default=list(DEFAULT_PROMPTS))
    parser.add_argument("--output-dir", type=Path, default=Path("/tmp/higgs_audio_v2_batch"))
    parser.add_argument("--format", choices=("wav", "pcm"), default="wav")
    parser.add_argument("--max-new-tokens", type=int, default=300)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--timeout-s", type=float, default=120.0)
    parser.add_argument(
        "--ref-audio",
        type=Path,
        default=None,
        help="Reference clip for voice clone (path to a WAV file). Must be paired with --ref-text.",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcript of the reference clip. Required when --ref-audio is set.",
    )
    args = parser.parse_args()

    if (args.ref_audio is None) != (args.ref_text is None):
        print("--ref-audio and --ref-text must be supplied together", file=sys.stderr)
        return 2

    ref_audio_data_url: str | None = None
    if args.ref_audio is not None:
        if not args.ref_audio.exists():
            print(f"ref-audio file not found: {args.ref_audio}", file=sys.stderr)
            return 2
        mime = "audio/wav" if args.ref_audio.suffix.lower() == ".wav" else "audio/mpeg"
        ref_b64 = base64.b64encode(args.ref_audio.read_bytes()).decode("ascii")
        ref_audio_data_url = f"data:{mime};base64,{ref_b64}"

    try:
        import httpx
    except ImportError:
        print(
            "this client needs `httpx`. Install with `pip install httpx`.",
            file=sys.stderr,
        )
        return 2

    args.output_dir.mkdir(parents=True, exist_ok=True)
    url = args.base_url.rstrip("/") + "/v1/audio/speech"
    failures = 0
    with httpx.Client(timeout=args.timeout_s) as client:
        for prompt in args.prompts:
            payload = {
                "model": args.model,
                "input": prompt,
                "response_format": args.format,
                "max_new_tokens": args.max_new_tokens,
                "seed": args.seed,
            }
            if ref_audio_data_url is not None:
                payload["ref_audio"] = ref_audio_data_url
                payload["ref_text"] = args.ref_text
            resp = client.post(url, json=payload)
            if resp.status_code != 200:
                print(f"[FAIL] {prompt!r} -> {resp.status_code}: {resp.text[:200]}", file=sys.stderr)
                failures += 1
                continue
            suffix = ".wav" if args.format == "wav" else ".pcm"
            out = args.output_dir / f"{_slug(prompt)}{suffix}"
            out.write_bytes(resp.content)
            print(f"[ ok ] {prompt!r} -> {out} ({len(resp.content)} bytes)")

    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
higgs_audio_v2/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for higgs-audio v2.
#
# v1 scope: plain text -> 24 kHz speech, plus shallow voice cloning via
# ref_audio + ref_text (see README.md). Multi-speaker, ChatML rich content,
# and language overrides are rejected by the validator with explicit 4xx
# (see vllm_omni/entrypoints/openai/serving_speech.py).
#
# Usage:
#   ./run_server.sh                 # default port 8094, GPUs 6 and 7
#   PORT=8095 GPUS=6,7 ./run_server.sh
#   MODEL=bosonai/higgs-audio-v2-generation-3B-base ./run_server.sh

set -e

MODEL="${MODEL:-bosonai/higgs-audio-v2-generation-3B-base}"
PORT="${PORT:-8094}"
GPUS="${GPUS:-6,7}"
GPU_UTIL="${GPU_UTIL:-0.4}"

echo "Starting higgs-audio v2 server"
echo "  MODEL=$MODEL"
echo "  PORT=$PORT"
echo "  CUDA_VISIBLE_DEVICES=$GPUS"

# DeepGEMM FP8 kernels are optional and trip warmup on builds without
# the deep_gemm backend; disable them so the example works out of the box.
# Users with deep_gemm installed can re-enable via the same env vars.
CUDA_VISIBLE_DEVICES="$GPUS" \
VLLM_USE_DEEP_GEMM=0 \
VLLM_MOE_USE_DEEP_GEMM=0 \
vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/higgs_audio_v2.yaml \
    --host 0.0.0.0 \
    --port "$PORT" \
    --gpu-memory-utilization "$GPU_UTIL" \
    --trust-remote-code \
    --omni
ming_flash_omni_tts/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for Ming-flash-omni-2.0 standalone talker (TTS).
#
# Usage:
#   ./run_server.sh
#   MODEL=/path/to/local/model ./run_server.sh
#   PORT=8091 ./run_server.sh
#   HOST=127.0.0.1 ./run_server.sh   # bind only to loopback

set -e

MODEL="${MODEL:-Jonathan1909/Ming-flash-omni-2.0}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-8091}"
DEPLOY_CONFIG="${DEPLOY_CONFIG:-vllm_omni/deploy/ming_flash_omni_tts.yaml}"

echo "Starting Ming standalone TTS server with model: $MODEL"
echo "Deploy config: $DEPLOY_CONFIG"

vllm serve "$MODEL" \
    --deploy-config "$DEPLOY_CONFIG" \
    --host "$HOST" \
    --port "$PORT" \
    --trust-remote-code \
    --omni
ming_flash_omni_tts/speech_client.py
"""Client for Ming standalone TTS via /v1/audio/speech endpoint."""

import argparse
import json
import sys

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"
DEFAULT_MODEL = "Jonathan1909/Ming-flash-omni-2.0"


def run_tts(args) -> None:
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }

    instructions = args.instructions
    if args.instruction_json:
        if instructions:
            sys.exit("--instructions and --instruction-json are mutually exclusive")

        try:
            parsed = json.loads(args.instruction_json)
        except json.JSONDecodeError as exc:
            sys.exit(f"--instruction-json must be valid JSON: {exc}")
        if not isinstance(parsed, dict):
            sys.exit("--instruction-json must decode to a JSON object")
        # Re-encode with ensure_ascii=False so UTF-8 Chinese keys/values
        # arrive at the server intact rather than as \\uXXXX escapes.
        instructions = json.dumps(parsed, ensure_ascii=False)
    if instructions:
        payload["instructions"] = instructions

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    with httpx.Client(timeout=300.0) as client:
        response = client.post(api_url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return

    output_path = args.output or "ming_tts_output.wav"
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Audio saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="Ming standalone TTS speech client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default=DEFAULT_MODEL, help="Model name or local path")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    parser.add_argument(
        "--instructions",
        default=None,
        help="Free-form style description (mapped to caption 风格 on the server).",
    )
    parser.add_argument(
        "--instruction-json",
        default=None,
        help=(
            "Structured caption JSON forwarded as `instructions`. Accepts Ming "
            "caption keys: 方言, 风格, 语速, 基频, 音量, 情感, IP, 说话人, BGM. "
        ),
    )
    args = parser.parse_args()
    run_tts(args)


if __name__ == "__main__":
    main()
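
The value passed to --instruction-json is a JSON object keyed by the caption names in the help text above. The sketch below shows one way to build that string in Python. build_caption_instructions is a hypothetical helper, the key check is a client-side convenience rather than documented server behaviour, and the example values are illustrative only, since the accepted value vocabulary is not described on this page.

```python
import json

# Caption keys listed in the --instruction-json help text.
CAPTION_KEYS = {"方言", "风格", "语速", "基频", "音量", "情感", "IP", "说话人", "BGM"}


def build_caption_instructions(caption: dict) -> str:
    """Check caption keys and serialize them for the `instructions` field."""
    unknown = sorted(set(caption) - CAPTION_KEYS)
    if unknown:
        raise ValueError(f"unknown caption keys: {unknown}")
    # ensure_ascii=False keeps Chinese keys/values as UTF-8 text, not escapes.
    return json.dumps(caption, ensure_ascii=False)


# Illustrative values: 情感 (emotion) = 高兴 (happy), 语速 (speaking rate) = 快 (fast).
instructions = build_caption_instructions({"情感": "高兴", "语速": "快"})
print(instructions)
```

The resulting string can be passed on the command line as the --instruction-json argument or placed directly in the request payload under "instructions".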
moss_tts_nano/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/moss_tts_nano/gradio_demo.py.

moss_tts_nano/run_gradio_demo.sh
#!/bin/bash
# Launch MOSS-TTS-Nano server + Gradio demo together.
#
# Usage:
#   ./run_gradio_demo.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8091 GRADIO_PORT=7860 ./run_gradio_demo.sh

set -e

MODEL="${MODEL:-OpenMOSS-Team/MOSS-TTS-Nano}"
PORT="${PORT:-8091}"
GRADIO_PORT="${GRADIO_PORT:-7860}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

echo "Starting MOSS-TTS-Nano server (port $PORT)..."
FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --omni &
SERVER_PID=$!

cleanup() {
    echo "Stopping server (PID $SERVER_PID)..."
    kill $SERVER_PID 2>/dev/null
    wait $SERVER_PID 2>/dev/null
}
trap cleanup EXIT

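# Wait for the server to be ready (up to 120 x 2 s = 240 s).
echo "Waiting for server to start..."
READY=0
for i in $(seq 1 120); do
    # -f makes curl fail on HTTP error statuses, not only on connection errors.
    if curl -sf "http://localhost:$PORT/health" > /dev/null 2>&1; then
        echo "Server ready."
        READY=1
        break
    fi
    sleep 2
done
if [ "$READY" -ne 1 ]; then
    echo "Server did not become ready within 240 s." >&2
    exit 1
fi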

echo "Starting Gradio demo (port $GRADIO_PORT)..."
python "$SCRIPT_DIR/gradio_demo.py" \
    --api-base "http://localhost:$PORT" \
    --port "$GRADIO_PORT"
moss_tts_nano/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for MOSS-TTS-Nano
#
# Usage:
#   ./run_server.sh
#   CUDA_VISIBLE_DEVICES=0 PORT=8091 ./run_server.sh

set -e

MODEL="${MODEL:-OpenMOSS-Team/MOSS-TTS-Nano}"
PORT="${PORT:-8091}"

echo "Starting MOSS-TTS-Nano server with model: $MODEL"

FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --omni
omnivoice/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for OmniVoice TTS
#
# Usage:
#   ./run_server.sh
#   CUDA_VISIBLE_DEVICES=0 ./run_server.sh

set -e

MODEL="${MODEL:-k2-fsa/OmniVoice}"
PORT="${PORT:-8091}"

echo "Starting OmniVoice server with model: $MODEL"

vllm serve "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --trust-remote-code \
    --omni
omnivoice/speech_client.py
"""Client for OmniVoice TTS via /v1/audio/speech endpoint.

Examples:
    # Basic TTS (auto voice)
    python speech_client.py --text "Hello, how are you?"

    # Specify language
    python speech_client.py --text "Bonjour, comment allez-vous?" --language French

    # Use a specific uploaded/supported voice
    python speech_client.py --text "Hello" --voice my_uploaded_voice
"""

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime = {
        "wav": "audio/wav",
        "mp3": "audio/mpeg",
        "flac": "audio/flac",
        "ogg": "audio/ogg",
    }.get(ext, "audio/wav")

    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def run_tts(args) -> None:
    """Generate speech via /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "response_format": args.response_format,
    }
    if args.seed is not None:
        payload["extra_params"] = {"seed": args.seed}

    if args.voice:
        payload["voice"] = args.voice
    if args.language:
        payload["language"] = args.language

    if args.ref_audio:
        ref = args.ref_audio
        if ref.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = ref
        else:
            payload["ref_audio"] = encode_audio_to_base64(ref)

    if args.ref_text:
        payload["ref_text"] = args.ref_text

    if args.instructions:
        payload["instructions"] = args.instructions

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    if args.seed is not None:
        print(f"Seed: {args.seed}")

    if args.voice:
        print(f"Voice: {args.voice}")

    if args.language:
        print(f"Language: {args.language}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }
    with httpx.Client(timeout=300.0) as client:
        response = client.post(api_url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return

    try:
        text = response.content.decode("utf-8")
        if text.startswith('{"error"'):
            print(f"Error: {text}")
            return
    except UnicodeDecodeError:
        pass

    output_path = args.output or "omnivoice_output.wav"
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Audio saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="OmniVoice TTS client")
    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")
    parser.add_argument("--model", "-m", default="k2-fsa/OmniVoice", help="Model name")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    parser.add_argument(
        "--voice",
        default=None,
        help="Voice name (omit for auto voice; must match a supported or uploaded speaker if set)",
    )
    parser.add_argument("--language", default=None, help="Language hint (e.g., English, Chinese, French)")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio for voice cloning (local path, URL, or data: URI)",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Reference text for voice cloning",
    )
    parser.add_argument(
        "--instructions",
        type=str,
        default=None,
        help="Voice style/emotion instructions",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=None,
        help="Random seed for generation (default: None, i.e. stochastic output)",
    )
    parser.add_argument("--output", "-o", default=None, help="Output file path")
    args = parser.parse_args()
    run_tts(args)


if __name__ == "__main__":
    main()
qwen3_tts/batch_speech_client.py
"""Batch speech client for Qwen3-TTS via /v1/audio/speech/batch endpoint.

This script demonstrates how to synthesize multiple texts in a single request.
A particularly useful scenario is voice cloning: set ref_audio once at the
batch level and generate many utterances in the cloned voice without repeating
the reference for each item.

Start the server (with batch-optimized stage settings for best throughput):

    vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
        --omni \
        --trust-remote-code \
        --stage-overrides '{"0":{"max_num_seqs":4,"gpu_memory_utilization":0.2},
                            "1":{"max_num_seqs":4,"gpu_memory_utilization":0.2}}'

Examples:
    # Batch with a predefined voice
    python batch_speech_client.py \
        --texts "Hello, how are you?" "Goodbye, see you later!"

    # Voice cloning: one ref_audio, many outputs
    python batch_speech_client.py \
        --task-type Base \
        --ref-audio /path/to/reference.wav \
        --ref-text "Transcript of the reference audio" \
        --texts "First cloned sentence." "Second cloned sentence." \
               "Third cloned sentence."
"""

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    ext = os.path.splitext(audio_path)[1].lower()
    mime_map = {".wav": "audio/wav", ".mp3": "audio/mpeg", ".flac": "audio/flac", ".ogg": "audio/ogg"}
    mime_type = mime_map.get(ext, "audio/wav")

    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"


def run_batch(args) -> None:
    """Send a batch TTS request and save each result to a file."""
    items = [{"input": text} for text in args.texts]

    payload: dict = {
        "items": items,
        "response_format": args.response_format,
    }
    if args.voice:
        payload["voice"] = args.voice
    if args.language:
        payload["language"] = args.language
    if args.task_type:
        payload["task_type"] = args.task_type
    if args.instructions:
        payload["instructions"] = args.instructions
    if args.max_new_tokens:
        payload["max_new_tokens"] = args.max_new_tokens

    # Voice cloning parameters (shared across all items)
    if args.ref_audio:
        if args.ref_audio.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = args.ref_audio
        else:
            payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    if args.ref_text:
        payload["ref_text"] = args.ref_text

    print(f"Sending batch of {len(items)} item(s) to {args.api_base}")
    if args.ref_audio:
        print("Voice cloning mode — ref_audio applied to all items")

    url = f"{args.api_base}/v1/audio/speech/batch"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    with httpx.Client(timeout=300.0) as client:
        response = client.post(url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error {response.status_code}: {response.text}")
        return

    data = response.json()
    print(f"Total: {data['total']}  Succeeded: {data['succeeded']}  Failed: {data['failed']}")

    os.makedirs(args.output_dir, exist_ok=True)
    for result in data["results"]:
        idx = result["index"]
        if result["status"] == "success":
            audio_bytes = base64.b64decode(result["audio_data"])
            out_path = os.path.join(args.output_dir, f"batch_{idx}.{args.response_format}")
            with open(out_path, "wb") as f:
                f.write(audio_bytes)
            print(f"  [{idx}] saved {len(audio_bytes)} bytes -> {out_path}")
        else:
            print(f"  [{idx}] FAILED: {result['error']}")


def parse_args():
    parser = argparse.ArgumentParser(
        description="Batch speech client for /v1/audio/speech/batch",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )

    parser.add_argument("--api-base", default=DEFAULT_API_BASE, help="API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="API key")

    # Texts to synthesize
    parser.add_argument(
        "--texts",
        nargs="+",
        required=True,
        help="One or more texts to synthesize",
    )

    # Shared voice settings
    parser.add_argument("--voice", default="vivian", help="Speaker name (default: vivian)")
    parser.add_argument("--language", default=None, help="Language: Auto, Chinese, English, etc.")
    parser.add_argument("--instructions", default=None, help="Voice style/emotion instructions")
    parser.add_argument(
        "--task-type",
        default=None,
        choices=["CustomVoice", "VoiceDesign", "Base"],
        help="TTS task type (omitted from the request if unset; server default: CustomVoice)",
    )

    # Voice cloning (Base task)
    parser.add_argument("--ref-audio", default=None, help="Reference audio path or URL for voice cloning")
    parser.add_argument("--ref-text", default=None, help="Reference audio transcript for voice cloning")

    # Generation
    parser.add_argument("--max-new-tokens", type=int, default=None, help="Max new tokens per item")
    parser.add_argument(
        "--response-format",
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio format (default: wav)",
    )
    parser.add_argument("--output-dir", "-o", default="batch_output", help="Output directory (default: batch_output)")

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    run_batch(args)
qwen3_tts/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/gradio_demo.py.

qwen3_tts/openai_speech_client.py
"""OpenAI-compatible client for Qwen3-TTS via /v1/audio/speech endpoint.

This script demonstrates how to use the OpenAI-compatible speech API
to generate audio from text using Qwen3-TTS models.

Examples:
    # CustomVoice task (predefined speaker)
    python openai_speech_client.py --text "Hello, how are you?" --voice vivian

    # CustomVoice with emotion instruction
    python openai_speech_client.py --text "I'm so happy!" --voice vivian \
        --instructions "Speak with excitement"

    # VoiceDesign task (voice from description)
    python openai_speech_client.py --text "Hello world" \
        --task-type VoiceDesign \
        --instructions "A warm, friendly female voice"

    # Base task (voice cloning)
    python openai_speech_client.py --text "Hello world" \
        --task-type Base \
        --ref-audio "https://example.com/reference.wav" \
        --ref-text "This is the reference transcript"

    # Base task with pre-computed speaker embedding
    python openai_speech_client.py --text "Hello world" \
        --task-type Base \
        --speaker-embedding embedding.json
"""

import argparse
import base64
import json
import os

import httpx

# Default server configuration
DEFAULT_API_BASE = "http://localhost:8091"
DEFAULT_API_KEY = "EMPTY"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    # Detect MIME type from extension
    audio_path_lower = audio_path.lower()
    if audio_path_lower.endswith(".wav"):
        mime_type = "audio/wav"
    elif audio_path_lower.endswith((".mp3", ".mpeg")):
        mime_type = "audio/mpeg"
    elif audio_path_lower.endswith(".flac"):
        mime_type = "audio/flac"
    elif audio_path_lower.endswith(".ogg"):
        mime_type = "audio/ogg"
    else:
        mime_type = "audio/wav"  # Default

    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{audio_b64}"


def run_tts_generation(args) -> None:
    """Run TTS generation via OpenAI-compatible /v1/audio/speech API."""

    # Build request payload
    payload = {
        "model": args.model,
        "input": args.text,
        "voice": args.speaker,
        "response_format": args.response_format,
    }

    # Add optional parameters
    if args.instructions:
        payload["instructions"] = args.instructions
    if args.task_type:
        payload["task_type"] = args.task_type
    if args.language:
        payload["language"] = args.language
    if args.max_new_tokens:
        payload["max_new_tokens"] = args.max_new_tokens

    # Voice clone parameters (Base task)
    if args.ref_audio:
        if args.ref_audio.startswith(("http://", "https://")):
            payload["ref_audio"] = args.ref_audio
        elif args.ref_audio.startswith("data:"):
            payload["ref_audio"] = args.ref_audio
        else:
            payload["ref_audio"] = encode_audio_to_base64(args.ref_audio)
    if args.ref_text:
        payload["ref_text"] = args.ref_text
    if args.x_vector_only:
        payload["x_vector_only_mode"] = True
    if args.speaker_embedding:
        with open(args.speaker_embedding) as f:
            payload["speaker_embedding"] = json.load(f)

    print(f"Model: {args.model}")
    print(f"Task type: {args.task_type or 'CustomVoice'}")
    print(f"Text: {args.text}")
    print(f"Speaker: {args.speaker}")
    print("Generating audio...")

    # Make the API call
    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    with httpx.Client(timeout=300.0) as client:
        response = client.post(api_url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return

    # Check for JSON error response (only if content is valid UTF-8 text)
    try:
        text = response.content.decode("utf-8")
        if text.startswith('{"error"'):
            print(f"Error: {text}")
            return
    except UnicodeDecodeError:
        pass  # Binary audio data, not an error

    # Save audio response
    output_path = args.output or "tts_output.wav"
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Audio saved to: {output_path}")


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="OpenAI-compatible client for Qwen3-TTS via /v1/audio/speech",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )

    # Server configuration
    parser.add_argument(
        "--api-base",
        type=str,
        default=DEFAULT_API_BASE,
        help=f"API base URL (default: {DEFAULT_API_BASE})",
    )
    parser.add_argument(
        "--api-key",
        type=str,
        default=DEFAULT_API_KEY,
        help="API key (default: EMPTY)",
    )
    parser.add_argument(
        "--model",
        "-m",
        type=str,
        default="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
        help="Model name/path",
    )

    # Task configuration
    parser.add_argument(
        "--task-type",
        "-t",
        type=str,
        default=None,
        choices=["CustomVoice", "VoiceDesign", "Base"],
        help="TTS task type (default: CustomVoice)",
    )

    # Input text
    parser.add_argument(
        "--text",
        type=str,
        required=True,
        help="Text to synthesize",
    )

    # Voice/speaker
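    parser.add_argument(
        "--speaker",
        "--voice",  # Alias so the --voice examples in the module docstring work
        dest="speaker",
        type=str,
        default="vivian",
        help="Speaker name, also accepted as --voice (default: vivian). Options: vivian, ryan, aiden, etc.",
    )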
    parser.add_argument(
        "--language",
        type=str,
        default=None,
        help="Language: Auto, Chinese, English, etc.",
    )
    parser.add_argument(
        "--instructions",
        type=str,
        default=None,
        help="Voice style/emotion instructions",
    )

    # Base (voice clone) parameters
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio file path, URL, or base64 for voice cloning (Base task)",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Reference audio transcript for voice cloning (Base task)",
    )
    parser.add_argument(
        "--x-vector-only",
        action="store_true",
        help="Use x-vector only mode for voice cloning (no ICL)",
    )
    parser.add_argument(
        "--speaker-embedding",
        type=str,
        default=None,
        help="Path to JSON file containing a pre-computed speaker embedding vector (1024-dim for 0.6B, 2048-dim for 1.7B)",
    )

    # Generation parameters
    parser.add_argument(
        "--max-new-tokens",
        type=int,
        default=None,
        help="Maximum new tokens to generate",
    )

    # Output
    parser.add_argument(
        "--response-format",
        type=str,
        default="wav",
        choices=["wav", "mp3", "flac", "pcm", "aac", "opus"],
        help="Audio output format (default: wav)",
    )
    parser.add_argument(
        "--output",
        "-o",
        type=str,
        default=None,
        help="Output audio file path (default: tts_output.wav)",
    )

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    run_tts_generation(args)
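
The `--ref-audio` handling in `run_tts_generation` accepts three input shapes: an `http(s)` URL, a `data:` URI, or a local path that is base64-encoded into a data URL. The following is a minimal standalone sketch of that normalization using only the standard library. The helper name `normalize_ref_audio` and the `_MIME_BY_EXT` table are illustrative assumptions and are not part of the example script.

```python
import base64
import os
import tempfile

# Assumed extension-to-MIME table in the spirit of encode_audio_to_base64;
# unknown extensions fall back to WAV, as in the client above.
_MIME_BY_EXT = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".flac": "audio/flac",
    ".ogg": "audio/ogg",
}


def normalize_ref_audio(ref_audio: str) -> str:
    """Return a value usable as the `ref_audio` request field (hypothetical helper)."""
    # URLs and data URIs are passed through untouched.
    if ref_audio.startswith(("http://", "https://", "data:")):
        return ref_audio
    if not os.path.exists(ref_audio):
        raise FileNotFoundError(f"Audio file not found: {ref_audio}")
    ext = os.path.splitext(ref_audio)[1].lower()
    mime = _MIME_BY_EXT.get(ext, "audio/wav")
    with open(ref_audio, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


if __name__ == "__main__":
    # Throwaway file with placeholder bytes (not real audio), just to show the shape.
    with tempfile.NamedTemporaryFile(suffix=".flac", delete=False) as tmp:
        tmp.write(b"fake-audio")
    try:
        print(normalize_ref_audio("https://example.com/reference.wav"))
        print(normalize_ref_audio(tmp.name)[:40])
    finally:
        os.remove(tmp.name)
```

Keeping the normalization in one function makes it easy to reuse the same rule for any client that sends `ref_audio`.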
qwen3_tts/precompute_custom_voice.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/precompute_custom_voice.py.

qwen3_tts/run_gradio_demo.sh
#!/bin/bash
# Launch both vLLM server and Gradio demo for Qwen3-TTS
#
# Usage:
#   ./run_gradio_demo.sh                                    # Default: CustomVoice
#   ./run_gradio_demo.sh --task-type VoiceDesign            # VoiceDesign model
#   ./run_gradio_demo.sh --task-type Base --gradio-port 7861
#
# Options:
#   --task-type TYPE        Task type: CustomVoice, VoiceDesign, Base (default: CustomVoice)
#   --server-port PORT      Port for vLLM server (default: 8000)
#   --gradio-port PORT      Port for Gradio demo (default: 7860)
#   --server-host HOST      Host for vLLM server (default: 0.0.0.0)
#   --gradio-ip IP          IP for Gradio demo (default: 127.0.0.1)
#   --share                 Share Gradio demo publicly

set -e

# Default values
TASK_TYPE="CustomVoice"
SERVER_PORT=8000
GRADIO_PORT=7860
SERVER_HOST="0.0.0.0"
GRADIO_IP="127.0.0.1"
GRADIO_SHARE=false

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --task-type)
            TASK_TYPE="$2"
            shift 2
            ;;
        --server-port)
            SERVER_PORT="$2"
            shift 2
            ;;
        --gradio-port)
            GRADIO_PORT="$2"
            shift 2
            ;;
        --server-host)
            SERVER_HOST="$2"
            shift 2
            ;;
        --gradio-ip)
            GRADIO_IP="$2"
            shift 2
            ;;
        --share)
            GRADIO_SHARE=true
            shift
            ;;
        --help)
            echo "Usage: $0 [OPTIONS]"
            echo ""
            echo "Options:"
            echo "  --task-type TYPE        Task type: CustomVoice, VoiceDesign, Base (default: CustomVoice)"
            echo "  --server-port PORT      Port for vLLM server (default: 8000)"
            echo "  --gradio-port PORT      Port for Gradio demo (default: 7860)"
            echo "  --server-host HOST      Host for vLLM server (default: 0.0.0.0)"
            echo "  --gradio-ip IP          IP for Gradio demo (default: 127.0.0.1)"
            echo "  --share                 Share Gradio demo publicly"
            echo ""
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            echo "Use --help for usage information"
            exit 1
            ;;
    esac
done

# Map task type to model
case "$TASK_TYPE" in
    CustomVoice)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
        ;;
    VoiceDesign)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
        ;;
    Base)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-Base"
        ;;
    *)
        echo "Unknown task type: $TASK_TYPE"
        echo "Supported: CustomVoice, VoiceDesign, Base"
        exit 1
        ;;
esac

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
API_BASE="http://localhost:${SERVER_PORT}"

echo "=========================================="
echo "Qwen3-TTS Gradio Demo"
echo "=========================================="
echo "Task Type : $TASK_TYPE"
echo "Model     : $MODEL"
echo "Server    : http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio    : http://${GRADIO_IP}:${GRADIO_PORT}"
echo "=========================================="

# Cleanup on exit
cleanup() {
    echo ""
    echo "Shutting down..."
    if [ -n "$SERVER_PID" ]; then
        echo "Stopping vLLM server (PID: $SERVER_PID)..."
        kill "$SERVER_PID" 2>/dev/null || true
        wait "$SERVER_PID" 2>/dev/null || true
    fi
    if [ -n "$GRADIO_PID" ]; then
        echo "Stopping Gradio demo (PID: $GRADIO_PID)..."
        kill "$GRADIO_PID" 2>/dev/null || true
        wait "$GRADIO_PID" 2>/dev/null || true
    fi
    echo "Cleanup complete"
    exit 0
}
trap cleanup SIGINT SIGTERM

# Start vLLM server
echo ""
echo "Starting vLLM server..."
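LOG_FILE="/tmp/vllm_tts_server_${SERVER_PORT}.log"
# Create/truncate the log up front so the `tail -f` below never sees a missing or stale file
: > "$LOG_FILE"

# Use process substitution instead of `| tee ... &`: with a backgrounded
# pipeline, $! would be the PID of tee (the last command), not the server.
vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/qwen3_tts.yaml \
    --host "$SERVER_HOST" \
    --port "$SERVER_PORT" \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --omni > >(tee -a "$LOG_FILE") 2>&1 &
SERVER_PID=$!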

# Wait for server startup
echo ""
echo "Waiting for vLLM server to be ready..."
STARTUP_FLAG="/tmp/vllm_tts_startup_flag_${SERVER_PORT}.tmp"
rm -f "$STARTUP_FLAG"

(
    tail -f "$LOG_FILE" 2>/dev/null | grep -m 1 "Application startup complete" > /dev/null && touch "$STARTUP_FLAG"
) &
TAIL_PID=$!

MAX_WAIT=300
ELAPSED=0
while [ $ELAPSED -lt $MAX_WAIT ]; do
    if [ -f "$STARTUP_FLAG" ]; then
        kill "$TAIL_PID" 2>/dev/null || true
        wait "$TAIL_PID" 2>/dev/null || true
        echo ""
        echo "vLLM server is ready!"
        break
    fi
    if ! kill -0 "$SERVER_PID" 2>/dev/null; then
        kill "$TAIL_PID" 2>/dev/null || true
        echo ""
        echo "Error: vLLM server failed to start"
        exit 1
    fi
    sleep 1
    ELAPSED=$((ELAPSED + 1))
done

rm -f "$STARTUP_FLAG"

if [ $ELAPSED -ge $MAX_WAIT ]; then
    kill "$TAIL_PID" 2>/dev/null || true
    echo "Error: Server startup timed out after ${MAX_WAIT}s"
    kill "$SERVER_PID" 2>/dev/null || true
    exit 1
fi

# Start Gradio demo
echo ""
echo "Starting Gradio demo..."
cd "$SCRIPT_DIR"
GRADIO_CMD=("python" "gradio_demo.py" "--api-base" "$API_BASE" "--host" "$GRADIO_IP" "--port" "$GRADIO_PORT")
if [ "$GRADIO_SHARE" = true ]; then
    GRADIO_CMD+=("--share")
fi

"${GRADIO_CMD[@]}" &
GRADIO_PID=$!

echo ""
echo "=========================================="
echo "Both services are running!"
echo "=========================================="
echo "vLLM Server : http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio Demo : http://${GRADIO_IP}:${GRADIO_PORT}"
echo ""
echo "Press Ctrl+C to stop both services"
echo "=========================================="
echo ""

wait $SERVER_PID $GRADIO_PID || true
cleanup
qwen3_tts/run_server.sh
#!/bin/bash
# Launch vLLM-Omni server for Qwen3-TTS models
#
# Usage:
#   ./run_server.sh                           # Default: CustomVoice model
#   ./run_server.sh CustomVoice               # CustomVoice model
#   ./run_server.sh VoiceDesign               # VoiceDesign model
#   ./run_server.sh Base                      # Base (voice clone) model

set -e

TASK_TYPE="${1:-CustomVoice}"

case "$TASK_TYPE" in
    CustomVoice)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
        ;;
    VoiceDesign)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
        ;;
    Base)
        MODEL="Qwen/Qwen3-TTS-12Hz-1.7B-Base"
        ;;
    *)
        echo "Unknown task type: $TASK_TYPE"
        echo "Supported: CustomVoice, VoiceDesign, Base"
        exit 1
        ;;
esac

echo "Starting Qwen3-TTS server with model: $MODEL"

vllm-omni serve "$MODEL" \
    --deploy-config vllm_omni/deploy/qwen3_tts.yaml \
    --host 0.0.0.0 \
    --port 8091 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --omni
qwen3_tts/speaker_embedding_interpolation.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/speaker_embedding_interpolation.py.

qwen3_tts/streaming_speech_client.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/qwen3_tts/streaming_speech_client.py.

qwen3_tts/tts_common.py
"""Shared constants, helpers, and payload building for Qwen3-TTS Gradio demos."""

import base64
import io

try:
    import gradio as gr
except ImportError:
    raise ImportError("gradio is required to run this demo. Install it with: pip install 'vllm-omni[demo]'") from None
import httpx
import numpy as np
import soundfile as sf

SUPPORTED_LANGUAGES = [
    "Auto",
    "Chinese",
    "English",
    "Japanese",
    "Korean",
    "German",
    "French",
    "Russian",
    "Portuguese",
    "Spanish",
    "Italian",
]

TASK_TYPES = ["CustomVoice", "VoiceDesign", "Base"]

PCM_SAMPLE_RATE = 24000

DEFAULT_API_BASE = "http://localhost:8000"


def fetch_voices(api_base: str) -> list[str]:
    """Fetch available voices from the server."""
    try:
        with httpx.Client(timeout=10.0) as client:
            resp = client.get(
                f"{api_base}/v1/audio/voices",
                headers={"Authorization": "Bearer EMPTY"},
            )
        if resp.status_code == 200:
            data = resp.json()
            voices = data.get("voices") or []
            if voices:
                return voices
    except Exception:
        pass
    return ["Vivian", "Ryan"]


def encode_audio_to_base64(audio_data: tuple) -> str:
    """Encode Gradio audio input (sample_rate, numpy_array) to base64 data URL."""
    sample_rate, audio_np = audio_data

    if audio_np.dtype != np.int16:
        if audio_np.dtype in (np.float32, np.float64):
            audio_np = np.clip(audio_np, -1.0, 1.0)
            audio_np = (audio_np * 32767).astype(np.int16)
        else:
            audio_np = audio_np.astype(np.int16)

    buf = io.BytesIO()
    sf.write(buf, audio_np, sample_rate, format="WAV")
    wav_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:audio/wav;base64,{wav_b64}"


def build_payload(
    text: str,
    task_type: str,
    voice: str,
    language: str,
    instructions: str,
    ref_audio: tuple | None,
    ref_audio_url: str,
    ref_text: str,
    x_vector_only: bool,
    response_format: str = "pcm",
    speed: float = 1.0,
    stream: bool = True,
) -> dict:
    """Build the /v1/audio/speech request payload.

    Raises gr.Error for invalid input so callers don't need to validate.
    """
    if not text or not text.strip():
        raise gr.Error("Please enter text to synthesize.")

    payload: dict = {
        "input": text.strip(),
        "response_format": "pcm" if stream else response_format,
        "stream": stream,
    }
    if not stream:
        payload["speed"] = speed

    if task_type:
        payload["task_type"] = task_type
    if language:
        payload["language"] = language

    if task_type == "CustomVoice":
        if voice:
            payload["voice"] = voice
        if instructions and instructions.strip():
            payload["instructions"] = instructions.strip()

    elif task_type == "VoiceDesign":
        if not instructions or not instructions.strip():
            raise gr.Error("VoiceDesign task requires voice style instructions.")
        payload["instructions"] = instructions.strip()

    elif task_type == "Base":
        ref_audio_url_stripped = ref_audio_url.strip() if ref_audio_url else ""
        if ref_audio_url_stripped:
            payload["ref_audio"] = ref_audio_url_stripped
        elif ref_audio is not None:
            payload["ref_audio"] = encode_audio_to_base64(ref_audio)
        else:
            raise gr.Error("Base (voice clone) task requires reference audio. Upload a file or provide a URL.")
        if ref_text and ref_text.strip():
            payload["ref_text"] = ref_text.strip()
        if x_vector_only:
            payload["x_vector_only_mode"] = True

    return payload


def on_task_type_change(task_type: str):
    """Update UI visibility based on selected task type."""
    if task_type == "CustomVoice":
        return (
            gr.update(visible=True),  # voice dropdown
            gr.update(visible=True, info="Optional style/emotion instructions"),
            gr.update(visible=False),  # ref_audio
            gr.update(visible=False),  # ref_audio_url
            gr.update(visible=False),  # ref_text
            gr.update(visible=False),  # x_vector_only
        )
    elif task_type == "VoiceDesign":
        return (
            gr.update(visible=False),
            gr.update(visible=True, info="Required: describe the voice style"),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
        )
    elif task_type == "Base":
        return (
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=True),
            gr.update(visible=True),
            gr.update(visible=True),
            gr.update(visible=True),
        )
    return (
        gr.update(visible=True),
        gr.update(visible=True),
        gr.update(visible=False),
        gr.update(visible=False),
        gr.update(visible=False),
        gr.update(visible=False),
    )


def stream_pcm_chunks(api_base: str, payload: dict):
    """Stream raw PCM bytes from the server, yielding int16 numpy arrays.

    Handles odd-byte boundaries between network chunks.
    """
    leftover = b""
    with httpx.Client(timeout=300.0) as client:
        with client.stream(
            "POST",
            f"{api_base}/v1/audio/speech",
            json=payload,
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer EMPTY",
            },
        ) as resp:
            if resp.status_code != 200:
                resp.read()
                raise gr.Error(f"Server error ({resp.status_code}): {resp.text}")
            for chunk in resp.iter_bytes():
                if not chunk:
                    continue
                raw = leftover + chunk
                usable = len(raw) - (len(raw) % 2)
                leftover = raw[usable:]
                if usable == 0:
                    continue
                yield np.frombuffer(raw[:usable], dtype=np.int16).copy()


def add_common_args(parser):
    """Add CLI arguments shared by both demos."""
    parser.add_argument(
        "--api-base",
        default=DEFAULT_API_BASE,
        help=f"Base URL for the vLLM API server (default: {DEFAULT_API_BASE}).",
    )
    parser.add_argument(
        "--host",
        default="0.0.0.0",
        help="Host/IP for Gradio server (default: 0.0.0.0).",
    )
    parser.add_argument(
        "--port",
        type=int,
        default=7860,
        help="Port for Gradio server (default: 7860).",
    )
    parser.add_argument(
        "--share",
        action="store_true",
        help="Share the Gradio demo publicly.",
    )
    return parser
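
The odd-byte handling in `stream_pcm_chunks` can be exercised without a server. The sketch below reproduces the same leftover-byte logic over an in-memory list of chunks, using `struct` instead of NumPy so it runs with the standard library alone. The name `iter_int16_samples` is hypothetical, and little-endian 16-bit samples are an assumption (the original relies on `np.int16` in native byte order).

```python
import struct
from collections.abc import Iterable, Iterator


def iter_int16_samples(chunks: Iterable[bytes]) -> Iterator[tuple[int, ...]]:
    """Yield tuples of int16 samples from arbitrarily split byte chunks."""
    leftover = b""
    for chunk in chunks:
        if not chunk:
            continue
        raw = leftover + chunk
        usable = len(raw) - (len(raw) % 2)  # largest even-length prefix
        leftover = raw[usable:]  # at most one dangling byte, carried forward
        if usable == 0:
            continue
        # "<h" = little-endian signed 16-bit (assumed sample format)
        yield struct.unpack(f"<{usable // 2}h", raw[:usable])


if __name__ == "__main__":
    # Three samples split at awkward byte offsets, including an empty chunk.
    data = struct.pack("<3h", 1, -2, 300)
    parts = [data[:1], data[1:4], b"", data[4:]]
    samples = [s for block in iter_int16_samples(parts) for s in block]
    print(samples)  # prints [1, -2, 300]
```

Carrying the single dangling byte into the next chunk is what keeps samples aligned; dropping it would shift every later sample by one byte and turn the audio into noise.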
voxcpm2/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/voxcpm2/gradio_demo.py.

voxcpm2/openai_speech_client.py
"""OpenAI-compatible client for VoxCPM2 TTS via /v1/audio/speech endpoint.

Examples:
    # Zero-shot synthesis
    python openai_speech_client.py --text "Hello, this is VoxCPM2."

    # Voice cloning with a local reference audio file
    python openai_speech_client.py --text "Hello world" \
        --ref-audio /path/to/reference.wav

    # Voice cloning with a URL
    python openai_speech_client.py --text "Hello world" \
        --ref-audio "https://example.com/reference.wav"

Server setup:
    vllm serve openbmb/VoxCPM2 --omni --host 0.0.0.0 --port 8000
"""

from __future__ import annotations

import argparse
import base64
import os

import httpx

DEFAULT_API_BASE = "http://localhost:8000"
DEFAULT_API_KEY = "sk-empty"


def encode_audio_to_base64(audio_path: str) -> str:
    """Encode a local audio file to a base64 data URL."""
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    ext = audio_path.lower().rsplit(".", 1)[-1]
    mime = {
        "wav": "audio/wav",
        "mp3": "audio/mpeg",
        "flac": "audio/flac",
        "ogg": "audio/ogg",
    }.get(ext, "audio/wav")

    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def main() -> None:
    parser = argparse.ArgumentParser(description="VoxCPM2 OpenAI speech client")
    parser.add_argument("--text", type=str, required=True, help="Text to synthesize")
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio for voice cloning (local path, URL, or data: URI)",
    )
    parser.add_argument("--model", type=str, default="voxcpm2")
    parser.add_argument("--output", type=str, default="output.wav")
    parser.add_argument("--api-base", type=str, default=DEFAULT_API_BASE)
    parser.add_argument("--api-key", type=str, default=DEFAULT_API_KEY)
    parser.add_argument("--response-format", type=str, default="wav")
    args = parser.parse_args()

    # VoxCPM2 has no predefined voices. The "voice" field is required by
    # the OpenAI API schema; unless the server is configured with a custom
    # voice profile (see precompute_custom_voice.py), it is ignored, so any
    # placeholder works. For per-request voice cloning, pass --ref-audio.
    payload: dict = {
        "model": args.model,
        "input": args.text,
        "voice": "default",
        "response_format": args.response_format,
    }

    if args.ref_audio:
        ref = args.ref_audio
        if ref.startswith(("http://", "https://", "data:")):
            payload["ref_audio"] = ref
        else:
            payload["ref_audio"] = encode_audio_to_base64(ref)

    url = f"{args.api_base}/v1/audio/speech"
    print(f"POST {url}")
    print(f"  text: {args.text}")
    if args.ref_audio:
        print(f"  ref_audio: {args.ref_audio[:80]}...")

    with httpx.Client(timeout=300) as client:
        resp = client.post(
            url,
            json=payload,
            headers={"Authorization": f"Bearer {args.api_key}"},
        )

    if resp.status_code != 200:
        print(f"Error {resp.status_code}: {resp.text[:500]}")
        return

    with open(args.output, "wb") as f:
        f.write(resp.content)
    print(f"Saved: {args.output} ({len(resp.content):,} bytes)")


if __name__ == "__main__":
    main()
voxcpm2/precompute_custom_voice.py
"""Pre-compute VoxCPM2 custom voice profiles.

The generated directory can be passed to the server via
``custom_voice_dir`` in ``vllm_omni/deploy/voxcpm2.yaml``. Requests can then
use ``/v1/audio/speech`` with ``voice="<name>"`` and no per-request ref_audio.
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import Any

import torch
from safetensors.torch import save_file

REPO_ROOT = Path(__file__).resolve().parents[4]
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from vllm_omni.utils.custom_voice_io import safe_voice_stem  # noqa: E402

MANIFEST_NAME = "custom_voice_manifest.json"


def _load_tts(model: str, device: torch.device):
    from vllm_omni.model_executor.models.voxcpm2.voxcpm2_import_utils import import_voxcpm2_core

    VoxCPM = import_voxcpm2_core()
    native = VoxCPM.from_pretrained(model, load_denoiser=False, optimize=False)
    return native.tts_model.to(device).eval()


def _load_manifest(output_dir: Path, model: str) -> dict[str, Any]:
    path = output_dir / MANIFEST_NAME
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {
        "schema_version": 1,
        "model_type": "voxcpm2",
        "model": model,
        "voices": {},
    }


def _write_voice(
    *,
    model: str,
    output_dir: Path,
    voice_name: str,
    ref_audio: str,
    prompt_text: str | None,
    mode: str,
    speaker_description: str | None,
    device: torch.device,
) -> None:
    if mode in ("continuation", "ref_continuation") and not prompt_text:
        raise ValueError("--prompt-text is required for continuation/ref_continuation modes")

    tts = _load_tts(model, device)
    tensors: dict[str, torch.Tensor] = {}
    with torch.inference_mode():
        if mode in ("reference", "ref_continuation"):
            tensors["ref_audio_feat"] = tts._encode_wav(ref_audio, padding_mode="right").float().cpu().contiguous()
        if mode in ("continuation", "ref_continuation"):
            tensors["audio_feat"] = tts._encode_wav(ref_audio, padding_mode="left").float().cpu().contiguous()

    output_dir.mkdir(parents=True, exist_ok=True)
    filename = f"{safe_voice_stem(voice_name)}.safetensors"
    save_file(tensors, str(output_dir / filename))

    manifest = _load_manifest(output_dir, model)
    entry: dict[str, Any] = {
        "name": voice_name,
        "file": filename,
        "mode": mode,
    }
    if "ref_audio_feat" in tensors:
        entry["ref_audio_feat_len"] = int(tensors["ref_audio_feat"].shape[0])
    if "audio_feat" in tensors:
        entry["audio_feat_len"] = int(tensors["audio_feat"].shape[0])
    if prompt_text:
        entry["prompt_text"] = prompt_text
    if speaker_description:
        entry["speaker_description"] = speaker_description

    manifest.setdefault("voices", {})[voice_name] = entry
    (output_dir / MANIFEST_NAME).write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
    print(f"Wrote {output_dir / filename}")
    print(f"Updated {output_dir / MANIFEST_NAME}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Pre-compute VoxCPM2 custom voice profile")
    parser.add_argument("--model", default="openbmb/VoxCPM2", help="VoxCPM2 model path or Hugging Face ID")
    parser.add_argument("--voice-name", required=True)
    parser.add_argument("--ref-audio", required=True)
    parser.add_argument(
        "--prompt-text",
        default=None,
        help="Transcript of ref audio for continuation/ref_continuation modes",
    )
    parser.add_argument(
        "--mode",
        choices=["reference", "continuation", "ref_continuation"],
        default="reference",
    )
    parser.add_argument("--speaker-description", default=None)
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
    args = parser.parse_args()

    _write_voice(
        model=args.model,
        output_dir=Path(args.output_dir),
        voice_name=args.voice_name,
        ref_audio=args.ref_audio,
        prompt_text=args.prompt_text,
        mode=args.mode,
        speaker_description=args.speaker_description,
        device=torch.device(args.device),
    )


if __name__ == "__main__":
    main()
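
The manifest bookkeeping in `_write_voice` (load the existing manifest or start a fresh one, then overwrite the entry keyed by voice name) can be tried in isolation, without loading the model. The sketch below covers only that step. The helper name `upsert_voice_entry` and the voice names `alice` and `bob` are hypothetical; the manifest fields follow the script above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any

MANIFEST_NAME = "custom_voice_manifest.json"


def upsert_voice_entry(output_dir: Path, model: str, entry: dict[str, Any]) -> dict[str, Any]:
    """Insert or overwrite one voice entry and persist the manifest (illustrative helper)."""
    path = output_dir / MANIFEST_NAME
    if path.exists():
        manifest = json.loads(path.read_text(encoding="utf-8"))
    else:
        # Same skeleton that _load_manifest returns for a new directory.
        manifest = {"schema_version": 1, "model_type": "voxcpm2", "model": model, "voices": {}}
    manifest.setdefault("voices", {})[entry["name"]] = entry
    output_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
    return manifest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "voices"
        upsert_voice_entry(
            out, "openbmb/VoxCPM2", {"name": "alice", "file": "alice.safetensors", "mode": "reference"}
        )
        manifest = upsert_voice_entry(
            out,
            "openbmb/VoxCPM2",
            {"name": "bob", "file": "bob.safetensors", "mode": "continuation", "prompt_text": "Hello."},
        )
        print(sorted(manifest["voices"]))  # prints ['alice', 'bob']
```

Because entries are keyed by voice name, re-running the script for an existing name replaces that voice's entry rather than appending a duplicate.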
voxtral_tts/gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/voxtral_tts/gradio_demo.py.

voxtral_tts/text_preprocess.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/text_to_speech/voxtral_tts/text_preprocess.py.