Text-To-Speech

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/text_to_speech.

vLLM-Omni supports several autoregressive TTS models. They share a mostly common CLI shape (--text, --ref-audio, --ref-text, plus an output-path flag — --output-dir for most, --output for OmniVoice) and live together in this hub. Each model has its own subdirectory containing a single end2end.py script; this README is the single doc entry point.

For online serving, see examples/online_serving/text_to_speech/. For the full list of supported architectures across all modalities, see Supported Models.

Supported Models

Model HuggingFace repo Stages Voice cloning Streaming Special modes Sample rate
CosyVoice3 FunAudioLLM/Fun-CosyVoice3-0.5B-2512 2 (talker + code2wav) 24 kHz
Fish Speech S2 Pro fishaudio/s2-pro dual-AR 44.1 kHz
GLM-TTS zai-org/GLM-TTS 2 (AR + DiT) ✓ (required) 24 kHz
Ming-flash-omni-TTS Jonathan1909/Ming-flash-omni-2.0 single (talker only) — (caption-controlled) style / IP / basic captions 44.1 kHz
MOSS-TTS-Nano OpenMOSS-Team/MOSS-TTS-Nano single (AR + codec) ✓ (required) voice_clone, continuation 48 kHz
OmniVoice k2-fsa/OmniVoice 2 (gen + dec) voice design, language hint 24 kHz
Qwen3-TTS Qwen/Qwen3-TTS-12Hz-1.7B-{CustomVoice,VoiceDesign,Base} 2 (talker + code2wav) ✓ (Base) 3 task variants 24 kHz
VoxCPM2 openbmb/VoxCPM2 single (native AR) ✓ (online) continuation 48 kHz
Voxtral TTS mistralai/Voxtral-4B-TTS-2603 varies voice presets 24 kHz

Common Quick Start

Most models share this invocation shape:

python examples/offline_inference/text_to_speech/<model>/end2end.py \
    --text "Hello, this is a test." \
    --ref-audio /path/to/reference.wav \
    --ref-text  "Transcript of the reference audio."

For models that follow the common shape, --ref-audio and --ref-text are optional (text-only synthesis works without them), and voice cloning generally expects both. GLM-TTS and MOSS-TTS-Nano are exceptions: both require reference audio on every request, and each model's section below states its exact rules. Three scripts (Qwen3-TTS, Voxtral TTS, and CosyVoice3) accept additional model-specific flags documented in their sections. Qwen3-TTS in particular uses its own argparse surface (--query-type, --audio-path, etc.) and does not follow the common shape.
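
As a rough illustration of the pairing rule for scripts that follow the common shape, the sketch below enforces it with plain argparse. The `parse_common_args` helper is hypothetical and is not taken from any of the example scripts; only the flag names come from this README.

```python
import argparse


def parse_common_args(argv=None):
    """Parse the common TTS flags and enforce the ref-audio/ref-text pairing.

    Hypothetical helper for illustration; the real end2end.py scripts may
    validate their arguments differently.
    """
    parser = argparse.ArgumentParser(description="Common TTS CLI shape (sketch)")
    parser.add_argument("--text", type=str, required=True)
    parser.add_argument("--ref-audio", type=str, default=None)
    parser.add_argument("--ref-text", type=str, default=None)
    args = parser.parse_args(argv)

    # Cloning needs both the clip and its transcript; text-only needs neither.
    if (args.ref_audio is None) != (args.ref_text is None):
        parser.error("--ref-audio and --ref-text must be provided together")
    return args
```

Models with stricter or looser rules (GLM-TTS, MOSS-TTS-Nano, VoxCPM2) would need a different check; see their sections.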


CosyVoice3

2-stage TTS pipeline (talker + code2wav) at 24 kHz.

Prerequisites

uv pip install -e .
# Includes soundfile, onnxruntime, x-transformers, einops via requirements.

Download the model snapshot:

from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512',
                  local_dir='pretrained_models/Fun-CosyVoice3-0.5B')

If your downloaded checkpoint lacks config.json, add it:

{
    "model_type": "cosyvoice3",
    "architectures": ["CosyVoice3Model"]
}
This is required because AutoConfig.register("cosyvoice3", CosyVoice3Config) only registers the class mapping; the loader still reads model_type from config.json to select the class.
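
If you prefer to script this step, a minimal sketch along the following lines writes the file only when it is missing. The `ensure_config_json` helper is hypothetical; the JSON contents are the ones shown above.

```python
import json
from pathlib import Path

# Contents copied from the config.json snippet in this README.
COSYVOICE3_CONFIG = {
    "model_type": "cosyvoice3",
    "architectures": ["CosyVoice3Model"],
}


def ensure_config_json(model_dir):
    """Create config.json in model_dir if absent; return True if it was written."""
    config_path = Path(model_dir) / "config.json"
    if config_path.exists():
        return False  # never overwrite a config shipped with the checkpoint
    config_path.write_text(
        json.dumps(COSYVOICE3_CONFIG, indent=4) + "\n", encoding="utf-8"
    )
    return True
```

For the download location used above, this would be called as `ensure_config_json("pretrained_models/Fun-CosyVoice3-0.5B")`.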

Quick start

python examples/offline_inference/text_to_speech/cosyvoice3/end2end.py \
    --model pretrained_models/Fun-CosyVoice3-0.5B \
    --tokenizer pretrained_models/Fun-CosyVoice3-0.5B/CosyVoice-BlankEN

Voice cloning

If --ref-audio is omitted, the script downloads the upstream zero_shot_prompt.wav from the CosyVoice repo into the current directory. To use your own clip, pass --ref-audio /path/to/reference.wav and update --prompt-text so that the text after <|endofprompt|> is the transcript of that clip.

python examples/offline_inference/text_to_speech/cosyvoice3/end2end.py \
    --model pretrained_models/Fun-CosyVoice3-0.5B \
    --tokenizer pretrained_models/Fun-CosyVoice3-0.5B/CosyVoice-BlankEN \
    --ref-audio /path/to/reference.wav \
    --prompt-text "You are a helpful assistant.<|endofprompt|>Transcript of your reference audio clip"
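
The --prompt-text value in these examples is an instruction, the <|endofprompt|> marker, and then the reference transcript. A small helper can assemble it; `build_prompt_text` is a hypothetical name, and the default instruction is simply the one used in the examples here.

```python
# Marker and instruction as they appear in the --prompt-text examples.
END_OF_PROMPT = "<|endofprompt|>"
DEFAULT_INSTRUCTION = "You are a helpful assistant."


def build_prompt_text(ref_transcript, instruction=DEFAULT_INSTRUCTION):
    """Return a --prompt-text value for the given reference transcript."""
    transcript = ref_transcript.strip()
    if not transcript:
        raise ValueError("reference transcript must not be empty")
    return f"{instruction}{END_OF_PROMPT}{transcript}"
```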

Streaming

Streaming is enabled by default via async_chunk: true in vllm_omni/deploy/cosyvoice3.yaml. Pass --no-async-chunk on vllm serve to switch to the legacy synchronous path.

Notes

  • Stage 0 (talker) emits speech tokens; stage 1 (code2wav) runs flow matching + HiFiGAN to synthesize waveform.
  • Deploy config auto-loads from vllm_omni/deploy/cosyvoice3.yaml based on HF model_type. Pass --deploy-config <path> to override.

GLM-TTS

2-stage TTS pipeline (AR + DiT flow-matching) at 24 kHz. Every request requires reference audio and its transcript for zero-shot voice cloning.

Quick start

python examples/offline_inference/text_to_speech/glm_tts/end2end.py \
    --model zai-org/GLM-TTS \
    --text "你好,这是语音合成测试。" \
    --ref-audio /path/to/reference.wav \
    --ref-text "这是参考音频的文本内容。" \
    --output-dir ./output

Architecture

Text → [Stage 0: AR] → Speech Tokens → [Stage 1: DiT + HiFT] → Audio (24 kHz)
        (Llama-based)    (32k vocab)      (Flow Matching)

Notes

  • --ref-audio and --ref-text are required together; GLM-TTS does not support text-only synthesis.
  • Reference audio should be 3-10 seconds.
  • First run may be slow due to lazy loading of WhisperVQ tokenizer and CampPlus ONNX speaker embedder.
  • Default sampling: temperature=1.0, top_k=25, top_p=0.8 (RAS method).
  • The --model path should point to the repository root (not llm/ subdirectory).
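
To check a clip against the 3-10 second recommendation before running the model, a sketch using only the standard library could look like this. Both helper names are hypothetical, and the `wave` module reads uncompressed PCM WAV only, so other formats would need a library such as soundfile.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_length(path, min_s=3.0, max_s=10.0):
    """Return True if the clip falls inside the recommended 3-10 s window."""
    return min_s <= wav_duration_seconds(path) <= max_s
```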

Fish Speech S2 Pro

4B dual-AR text-to-speech model from FishAudio with the DAC codec at 44.1 kHz.

Prerequisites

pip install fish-speech

Quick start

python examples/offline_inference/text_to_speech/fish_speech/end2end.py \
    --text "Hello, this is a test of the Fish Speech text to speech system."

Voice cloning

python examples/offline_inference/text_to_speech/fish_speech/end2end.py \
    --text "Hello, this is a cloned voice." \
    --ref-audio /path/to/reference.wav \
    --ref-text  "Transcript of the reference audio."

Streaming

python examples/offline_inference/text_to_speech/fish_speech/end2end.py \
    --text "Hello, this is a streaming test." \
    --streaming
Streaming requires async_chunk: true in the stage config.

Notes

  • Output: 44.1 kHz mono WAV.
  • DAC codec weights (codec.pth) are loaded lazily from the model directory.

Ming-flash-omni-TTS

Standalone talker-only deployment of Ming-flash-omni-2.0 at 44.1 kHz. Voice is controlled through caption fields (风格 [style] / IP / 语速 [speed] / 基频 [pitch] / 音量 [volume]) rather than reference audio.

Prerequisites

The example calls into vllm_omni.model_executor.models.ming_flash_omni.prompt_utils for the default prompt and instruction builder; no extra pip install on top of the base vLLM-Omni install.

Quick start

python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case style

Cases

# ASMR-style whisper (caption-driven)
python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case style

# IP voice (preset character voice via caption)
python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case ip

# Basic speed/pitch/volume control
python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case basic

Override the default text per case with --text, write to a custom path with --output.

Notes

  • Talker-only deployment — for the multimodal Ming-flash-omni example, see examples/offline_inference/ming_flash_omni/.
  • Deploy config: vllm_omni/deploy/ming_flash_omni_tts.yaml (single GPU, enforce_eager, max_num_seqs: 1).
  • Decode defaults from the Ming cookbook: max_decode_steps=200, cfg=2.0, sigma=0.25, temperature=0.0, use_zero_spk_emb=True.

MOSS-TTS-Nano

Single-stage 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec at 48 kHz mono (mixed down from upstream stereo). ZH / EN / JA. Every request requires a reference clip via --ref-audio.

There are no built-in speaker presets. The default --mode voice_clone matches upstream's recommended workflow. --mode continuation is exposed for completeness, but upstream's continuation-with-prompt path emits very short or near-silent output, so it is rarely useful in practice. Sample reference clips ship in the upstream repo under assets/audio/ (e.g. zh_1.wav, en_2.wav, jp_2.wav).

Quick start

# Fetch a sample reference clip (one-off, user-scoped cache).
REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
mkdir -p "$REF_DIR"
[ -s "$REF_DIR/zh_1.wav" ] || \
    curl -L -o "$REF_DIR/zh_1.wav" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav

python examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py \
    --text "你好,这是MOSS-TTS-Nano的语音合成演示。" \
    --ref-audio "$REF_DIR/zh_1.wav"
The first run downloads OpenMOSS-Team/MOSS-TTS-Nano and OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano from Hugging Face.

Reproducible runs

python examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py \
    --text "Deterministic test." \
    --ref-audio "$REF_DIR/en_2.wav" \
    --seed 42

Notes

  • Output: 48 kHz mono WAV (the tokenizer is internally stereo at 48 kHz; the wrapper averages to mono before reaching the engine).
  • Deploy config: vllm_omni/deploy/moss_tts_nano.yaml (auto-loaded; override with --deploy-config).
  • Default --max-new-frames 375 ≈ 14 s of audio; raise for longer outputs.
  • --ref-text is rejected in voice_clone mode and required only with --mode continuation.
  • Run --help for the full sampling-knob surface (--audio-temperature, --audio-top-k, --audio-top-p, --text-temperature).
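
The 375 frames ≈ 14 s figure gives a rough way to size --max-new-frames for longer outputs. The sketch below scales that ratio linearly; `frames_for_duration` is a hypothetical helper, and the result is only an estimate because the ratio itself is approximate.

```python
import math

# Ratio taken from the note above: 375 frames is roughly 14 s of audio.
DEFAULT_FRAMES = 375
DEFAULT_SECONDS = 14.0


def frames_for_duration(seconds):
    """Estimate a --max-new-frames value for a target duration in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return math.ceil(seconds * DEFAULT_FRAMES / DEFAULT_SECONDS)
```

For example, a target of about 30 s would suggest passing a value near `frames_for_duration(30)`.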

OmniVoice

Zero-shot multilingual TTS supporting 600+ languages, with three modes (auto / clone / design).

Prerequisites

huggingface-cli download k2-fsa/OmniVoice
Voice cloning requires transformers>=5.3.0. Auto and design modes work with transformers>=4.57.0.
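
A quick pre-flight check of the installed transformers version against these minimums might look like the sketch below. The helper names are hypothetical, and the version parsing is deliberately simple: it compares only the numeric release parts and ignores pre-release suffixes.

```python
import re

# Minimum transformers versions per OmniVoice mode, as stated above.
MIN_TRANSFORMERS = {"auto": "4.57.0", "design": "4.57.0", "clone": "5.3.0"}


def _release_tuple(version):
    """Turn '5.3.0' or '5.3.0.dev0' into (5, 3, 0), ignoring suffixes."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)  # pad short versions such as '5.3'
    return tuple(parts[:3])


def mode_supported(mode, installed_version):
    """True if installed_version meets the minimum for the given mode."""
    return _release_tuple(installed_version) >= _release_tuple(MIN_TRANSFORMERS[mode])
```

The installed version string can be obtained with `importlib.metadata.version("transformers")`.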

Quick start (auto voice)

python examples/offline_inference/text_to_speech/omnivoice/end2end.py \
    --model k2-fsa/OmniVoice \
    --text "Hello, this is a test."

Voice cloning

python examples/offline_inference/text_to_speech/omnivoice/end2end.py \
    --model k2-fsa/OmniVoice \
    --text "Hello, this is a test." \
    --ref-audio ref.wav \
    --ref-text  "This is the reference transcription."

Voice design

python examples/offline_inference/text_to_speech/omnivoice/end2end.py \
    --model k2-fsa/OmniVoice \
    --text "Hello, this is a test." \
    --instruct "female, low pitch, british accent"

Language hint

python examples/offline_inference/text_to_speech/omnivoice/end2end.py \
    --model k2-fsa/OmniVoice \
    --text "你好,这是一个测试。" \
    --lang zh

Seed for reproducibility

python examples/offline_inference/text_to_speech/omnivoice/end2end.py \
    --model k2-fsa/OmniVoice \
    --text "Hello, this is a test." \
    --seed 42

Notes

  • Stage 0 (Generator): Qwen3-0.6B with 32-step iterative unmasking.
  • Stage 1 (Decoder): HiggsAudioV2 RVQ + DAC at 24 kHz.

Qwen3-TTS

TTS with three task variants and 24 kHz output. The script has its own argparse surface and does not follow the common --text / --ref-audio shape.

Prerequisites

For ROCm builds, replace onnxruntime with onnxruntime-rocm:

pip uninstall onnxruntime
pip install onnxruntime-rocm

Task variants

  • CustomVoice: predefined speaker (speaker ID) with optional style instruction.
  • VoiceDesign: text + descriptive instruction designs a new voice.
  • Base: voice cloning from reference audio + transcript.
# Single sample
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py --query-type CustomVoice
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py --query-type VoiceDesign
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py --query-type Base

# Base with a custom reference audio (Qwen3-TTS uses --audio-path, not --ref-audio):
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py \
    --query-type Base --audio-path /path/to/reference.wav

# Base variant has an additional mode flag:
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py --query-type Base --mode-tag icl       # default
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py --query-type Base --mode-tag xvec_only # x_vector_only_mode

# Batch (multiple prompts in one run)
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py --query-type CustomVoice --use-batch-sample

Streaming

python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py \
    --query-type CustomVoice \
    --streaming \
    --output-dir /tmp/out_stream
Streaming requires async_chunk: true in the stage config.

Batched decoding

The Code2Wav stage supports batched decoding through the SpeechTokenizer. Pass multiple prompts via --txt-prompts and set --batch-size accordingly. To raise max_num_seqs on either stage, point --stage-configs-path at a stage configs YAML with the desired values (see vllm_omni/model_executor/stage_configs/ for templates):

python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py \
    --query-type CustomVoice \
    --txt-prompts examples/offline_inference/text_to_speech/qwen3_tts/benchmark_prompts.txt \
    --batch-size 4 \
    --stage-configs-path /path/to/qwen3_tts_batched.yaml
--batch-size must match a CUDA-graph capture size (1, 2, 4, 8, 16…).
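
The listed capture sizes are powers of two. Assuming that pattern continues (the sizes actually captured depend on the engine configuration), a candidate --batch-size can be checked or rounded up as sketched below; both helper names are hypothetical.

```python
def is_capture_size(batch_size):
    """True if batch_size is a power of two (1, 2, 4, 8, 16, ...)."""
    return batch_size >= 1 and (batch_size & (batch_size - 1)) == 0


def next_capture_size(num_prompts):
    """Smallest power of two that is >= num_prompts."""
    if num_prompts < 1:
        raise ValueError("num_prompts must be at least 1")
    return 1 << (num_prompts - 1).bit_length()
```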

Notes

  • Run --help for the full argument surface.
  • See qwen3_tts/end2end.py for the prompt-length-estimation logic the Talker uses.

VoxCPM2

Single-stage native AR TTS at 48 kHz. Pipeline: feat_encoder → MiniCPM4 → FSQ → residual_lm → LocDiT → AudioVAE.

Prerequisites

pip install voxcpm
# or, for a local source checkout:
export VLLM_OMNI_VOXCPM_CODE_PATH=/path/to/voxcpm

Quick start

python examples/offline_inference/text_to_speech/voxcpm2/end2end.py \
    --model openbmb/VoxCPM2 \
    --text "Hello, this is a VoxCPM2 demo."

Voice cloning

Pass --ref-audio alone for isolated cloning, or both --ref-audio and --ref-text for prompt continuation:

python examples/offline_inference/text_to_speech/voxcpm2/end2end.py \
    --text "Hello, this is a voice clone demo." \
    --ref-audio /path/to/reference.wav \
    --ref-text  "Transcript of the reference audio."

Streaming

Streaming is exposed through the online OpenAI Speech API (stream=true). See examples/online_serving/text_to_speech/voxcpm2/gradio_demo.py for an AudioWorklet-based gapless streaming player; the offline end2end.py script does not expose a streaming path.

Notes

  • Output: 48 kHz mono WAV.
  • Deploy config: vllm_omni/deploy/voxcpm2.yaml (auto-loaded by HF model_type).

Voxtral TTS

Voxtral-4B-TTS (Mistral). Has its own argparse surface; uses voice presets and the mistral_common SpeechRequest protocol.

Prerequisites

Latest mistral_common with SpeechRequest support:

pip install -e /path/to/mistral-common  # or upgrade from PyPI when available

Quick start (voice preset)

python examples/offline_inference/text_to_speech/voxtral_tts/end2end.py \
    --write-audio --voice cheerful_female \
    --model mistralai/Voxtral-4B-TTS-2603 \
    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"

Voice cloning (capability gated upstream)

python examples/offline_inference/text_to_speech/voxtral_tts/end2end.py \
    --write-audio \
    --model mistralai/Voxtral-4B-TTS-2603 \
    --text "This is a test message." \
    --ref-audio path/to/reference_audio.wav

Streaming + concurrency

python examples/offline_inference/text_to_speech/voxtral_tts/end2end.py \
    --num-prompts 32 --concurrency 8 --streaming --write-audio --voice neutral_female \
    --model mistralai/Voxtral-4B-TTS-2603 \
    --text "..."
Available voice presets are listed on the HF model card (mistralai/Voxtral-4B-TTS-2603).

Notes

  • --num-prompts N replicates the prompt for performance measurement.
  • --concurrency M requires --streaming and must evenly divide --num-prompts.
  • Run --help for the full argument surface.
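
The two constraints on --concurrency can be checked before launching a long benchmark run. This is a hypothetical stand-alone check, not the script's own validation code.

```python
def validate_load_flags(num_prompts, concurrency=None, streaming=False):
    """Raise ValueError if the flag combination violates the notes above."""
    if num_prompts < 1:
        raise ValueError("--num-prompts must be at least 1")
    if concurrency is None:
        return  # no concurrency requested, nothing more to check
    if not streaming:
        raise ValueError("--concurrency requires --streaming")
    if concurrency < 1 or num_prompts % concurrency != 0:
        raise ValueError("--concurrency must evenly divide --num-prompts")
```

With the values from the example above, `validate_load_flags(32, concurrency=8, streaming=True)` passes, while a concurrency of 5 would be rejected.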

Example materials

cosyvoice3/end2end.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import os
import urllib.request
from pathlib import Path

import numpy as np
import soundfile as sf
from vllm import SamplingParams
from vllm.multimodal.media.audio import load_audio

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.model_executor.models.cosyvoice3.tokenizer import get_qwen_tokenizer
from vllm_omni.model_executor.models.cosyvoice3.utils import extract_text_token
from vllm_omni.transformers_utils.configs.cosyvoice3 import CosyVoice3Config

# Upstream zero-shot reference clip
ZERO_SHOT_PROMPT_URL = "https://raw.githubusercontent.com/FunAudioLLM/CosyVoice/main/asset/zero_shot_prompt.wav"


def _default_ref_audio() -> str:
    # Download the upstream zero_shot_prompt.wav into the current dir
    dest = Path("zero_shot_prompt.wav")
    if not dest.exists() or dest.stat().st_size == 0:
        print(f"Downloading default reference audio to {dest}")
        urllib.request.urlretrieve(ZERO_SHOT_PROMPT_URL, dest)

    return str(dest)


def run_e2e():
    parser = argparse.ArgumentParser()
    # Hugging Face repo: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
    parser.add_argument(
        "--model",
        type=str,
        required=True,
        help="Path to CosyVoice3 model directory (e.g., pretrained_models/Fun-CosyVoice3-0.5B/).",
    )
    parser.add_argument(
        "--deploy-config",
        type=str,
        default=None,
        help="Override the deploy config path. If unset, auto-loads "
        "vllm_omni/deploy/cosyvoice3.yaml based on the HF model_type.",
    )
    parser.add_argument("--text", type=str, default="Hello, this is a test of the CosyVoice system capability.")
    parser.add_argument(
        "--prompt-text",
        type=str,
        default="You are a helpful assistant.<|endofprompt|>希望你以后,能够做的比我还好呦!",
    )
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Path to reference audio for voice cloning. "
        "If unset, downloads the upstream CosyVoice3 zero-shot prompt audio clip",
    )
    parser.add_argument(
        "--tokenizer",
        type=str,
        required=True,
        help="Path to tokenizer directory (e.g., <model_path>/CosyVoice-BlankEN).",
    )
    args = parser.parse_args()
    # Ensure tokenizer directory exists
    if not os.path.exists(args.tokenizer):
        raise FileNotFoundError(f"{args.tokenizer} does not exist!")

    if args.deploy_config is not None and not os.path.exists(args.deploy_config):
        raise FileNotFoundError(f"{args.deploy_config} does not exist!")

    print(f"Initializing cosyvoice E2E with model={args.model}")

    omni = Omni(
        model=args.model,
        deploy_config=args.deploy_config,
        tokenizer=args.tokenizer,
        log_stats=True,
    )

    sampling_cfg = {"top_p": 0.8, "top_k": 25, "eos_token_id": 6561 + 1}

    print("Model initialized. Preparing inputs...")
    ref_audio_path = args.ref_audio or _default_ref_audio()
    if not os.path.exists(ref_audio_path):
        raise FileNotFoundError(f"Audio file not found: {ref_audio_path}")
    # Load at native sample rate
    audio_signal, sr = load_audio(ref_audio_path, sr=None)

    # Validate sample rate before processing (similar to original CosyVoice)
    min_sr = 16000
    if sr < min_sr:
        raise ValueError(
            f"Audio sample rate {sr} Hz is too low. "
            f"Minimum required: {min_sr} Hz. "
            f"Please provide audio with sample rate >= {min_sr} Hz."
        )

    audio_data = (audio_signal.astype(np.float32), sr)

    prompts = {
        "prompt": args.text,
        "multi_modal_data": {
            "audio": audio_data,
        },
        "mm_processor_kwargs": {
            "prompt_text": args.prompt_text,
            "sample_rate": audio_data[1],
        },
    }

    print(f"Generating for prompt: {args.text}")

    config = CosyVoice3Config()
    tokenizer = get_qwen_tokenizer(
        token_path=args.tokenizer,
        skip_special_tokens=config.skip_special_tokens,
        version=config.version,
    )
    _, text_token_len = extract_text_token(args.text, tokenizer, config.allowed_special)
    base_len = int(text_token_len)
    min_len = int(base_len * config.min_token_text_ratio)
    max_len = int(base_len * config.max_token_text_ratio)

    # Build SamplingParams for each stage (stage 0: talker, stage 1: code2wav)
    gpt_sampling = SamplingParams(
        temperature=1.0,
        top_p=sampling_cfg["top_p"],
        top_k=sampling_cfg["top_k"],
        repetition_penalty=2.0,
        min_tokens=min_len,
        max_tokens=max_len,
        stop_token_ids=[sampling_cfg["eos_token_id"]],
        # allowed_token_ids=list(range(6561+3)),
        detokenize=False,
    )
    # Not used for sampling; the orchestrator still expects one entry per stage
    s2mel_sampling = SamplingParams(
        temperature=1.0,
        top_p=1.0,
        top_k=-1,
        repetition_penalty=2.0,
        max_tokens=256,
        detokenize=False,
    )

    sampling_params_list = [gpt_sampling, s2mel_sampling]

    # Start profiling (requires VLLM_TORCH_PROFILER_DIR env var)
    if os.environ.get("VLLM_TORCH_PROFILER_DIR"):
        print("Starting profiler...")
        omni.start_profile()

    # Generate (Omni orchestrator requires a per-stage SamplingParams list)
    outputs = list(omni.generate(prompts, sampling_params_list=sampling_params_list[:2]))

    # Stop profiling and get results
    if os.environ.get("VLLM_TORCH_PROFILER_DIR"):
        print("Stopping profiler...")
        profile_results = omni.stop_profile()
        print(f"Profile traces saved to: {profile_results}")

    print(outputs)
    # Verify outputs
    print(f"Received {len(outputs)} outputs.")
    for i, output in enumerate(outputs):
        try:
            ro = output.request_output
            if ro is None:
                print("No request_output found.")
                continue

            # Multimodal output may be attached to RequestOutput or CompletionOutput.
            mm = getattr(ro, "multimodal_output", None)
            if not mm and ro.outputs:
                mm = getattr(ro.outputs[0], "multimodal_output", None)

            if mm:
                print(f"Multimodal output keys: {mm.keys()}")
                if "audio" in mm:
                    audio_out = mm["audio"]
                    print(f"Generated Audio Shape: {audio_out.shape}")
                    out_path = f"output_{i}.wav"
                    # Write at 24 kHz, the CosyVoice3 output rate stated in this README.
                    sf.write(out_path, audio_out.cpu().numpy().squeeze(), 24000)
                    print(f"Saved audio to {out_path}")
            else:
                print("No multimodal output found.")
        except Exception as e:
            print(f"Error inspecting output: {e}")
    omni.close()


if __name__ == "__main__":
    run_e2e()
fish_speech/end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_speech/fish_speech/end2end.py.

glm_tts/end2end.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""GLM-TTS End-to-End Offline Inference Example.

GLM-TTS is a two-stage TTS system:
  - Stage 0 (AR): Llama-based model generates speech tokens from text
  - Stage 1 (DiT): Flow matching model converts speech tokens to audio

Usage:
    # Sync two-stage (default)
    python examples/offline_inference/text_to_speech/glm_tts/end2end.py \
        --model /path/to/GLM-TTS \
        --text "你好,这是一个语音合成测试。" \
        --ref-audio /path/to/reference.wav \
        --ref-text "参考音频的转录文本。" \
        --output-dir ./output

    # Async chunk mode (streaming DiT)
    python examples/offline_inference/text_to_speech/glm_tts/end2end.py \
        --model /path/to/GLM-TTS --async-chunk \
        --text "你好,这是一个语音合成测试。" \
        --ref-audio /path/to/reference.wav \
        --ref-text "参考音频的转录文本。" \
        --output-dir ./output
"""

import base64
import io
import logging
import os
import tempfile
import time
from typing import Any
from urllib.request import urlopen

import numpy as np
import soundfile as sf
import torch
import yaml

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm.utils.argparse_utils import FlexibleArgumentParser

from vllm_omni import Omni
from vllm_omni.model_executor.models.glm_tts.glm_tts import build_glm_tts_prefill_metadata

logger = logging.getLogger(__name__)

DEFAULT_DEPLOY_CONFIG = os.path.join(
    os.path.dirname(__file__),
    "..",
    "..",
    "..",
    "..",
    "vllm_omni",
    "deploy",
    "glm_tts.yaml",
)
SAMPLE_RATE = 24000


def _load_ref_audio(ref_audio: str) -> tuple[torch.Tensor, int]:
    """Load reference audio from file path, URL, or data URI."""
    if ref_audio.startswith(("http://", "https://")):
        with urlopen(ref_audio, timeout=60) as response:
            audio_obj: Any = io.BytesIO(response.read())
    elif ref_audio.startswith("data:"):
        _, _, encoded = ref_audio.partition(",")
        audio_obj = io.BytesIO(base64.b64decode(encoded))
    else:
        audio_obj = ref_audio
    wav_np, sr = sf.read(audio_obj, dtype="float32")
    if wav_np.ndim > 1:
        wav_np = wav_np.mean(axis=1)
    return torch.from_numpy(wav_np), int(sr)


def _concat_audio(audio_val: Any) -> np.ndarray:
    """Concatenate audio tensors from multimodal output."""
    if isinstance(audio_val, list):
        tensors = [torch.as_tensor(t).float().reshape(-1) for t in audio_val if t is not None]
        if not tensors:
            return np.zeros((0,), dtype=np.float32)
        return torch.cat(tensors, dim=-1).cpu().numpy().astype(np.float32, copy=False)
    if isinstance(audio_val, torch.Tensor):
        return audio_val.float().cpu().numpy().reshape(-1)
    return np.asarray(audio_val, dtype=np.float32).reshape(-1)


def _extract_sample_rate(audio_mm: dict) -> int:
    """Extract sample rate from multimodal output dict."""
    sr_raw = audio_mm.get("sr", SAMPLE_RATE)
    if isinstance(sr_raw, list):
        sr_raw = sr_raw[-1] if sr_raw else SAMPLE_RATE
    if hasattr(sr_raw, "item"):
        return int(sr_raw.item())
    return int(sr_raw)


def _modify_deploy_config(base_path: str, async_chunk: bool) -> str:
    """Build deploy config with explicit sync/async mode and eager execution.

    Mirrors the logic in ``tests/e2e/offline_inference/test_glm_tts.py``
    (``_get_deploy_config``) so that example runs match CI behavior.
    """
    with open(base_path) as f:
        cfg = yaml.safe_load(f)
    cfg["async_chunk"] = async_chunk
    for stage in cfg.get("stages", []):
        stage["enforce_eager"] = True
        if stage.get("stage_id") == 0:
            stage["async_scheduling"] = bool(async_chunk)
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False, prefix="glm_tts_")
    yaml.dump(cfg, tmp)
    tmp.close()
    return tmp.name


def main(args):
    """Run offline GLM-TTS inference."""
    os.makedirs(args.output_dir, exist_ok=True)
    base_deploy_config = args.deploy_config or DEFAULT_DEPLOY_CONFIG
    deploy_config_path = _modify_deploy_config(base_deploy_config, args.async_chunk)

    ref_audio_wav, ref_audio_sr = _load_ref_audio(args.ref_audio)
    if not args.ref_text:
        raise ValueError("GLM-TTS requires --ref-audio and --ref-text.")

    inputs = [
        {
            "prompt": args.text,
            "multi_modal_data": {
                "audio": (ref_audio_wav.float().cpu().numpy(), ref_audio_sr),
            },
            "modalities": ["audio"],
            "mm_processor_kwargs": {"prompt_text": args.ref_text},
            "additional_information": build_glm_tts_prefill_metadata(
                args.model,
                args.text,
                args.ref_text,
            ),
        }
    ]

    omni = Omni(
        model=args.model,
        stage_configs_path=deploy_config_path,
        log_stats=args.log_stats,
        stage_init_timeout=args.stage_init_timeout,
    )

    t_start = time.perf_counter()
    outputs = omni.generate(inputs)
    elapsed = (time.perf_counter() - t_start) * 1000

    assert outputs, "No outputs returned"
    audio_mm = outputs[0].multimodal_output
    assert "audio" in audio_mm, "No audio output found"

    audio = _concat_audio(audio_mm["audio"])
    sr = _extract_sample_rate(audio_mm)
    out_path = os.path.join(args.output_dir, "output.wav")
    sf.write(out_path, audio, samplerate=sr, format="WAV")

    logger.info("Saved %s (%.2fs @ %dHz)", out_path, len(audio) / sr, sr)
    logger.info("Total inference: %.1f ms", elapsed)


def parse_args():
    parser = FlexibleArgumentParser(description="GLM-TTS Text-to-Speech Example")
    parser.add_argument("--model", type=str, required=True, help="Model path")
    parser.add_argument("--text", type=str, default="你好,这是一个语音合成测试。")
    parser.add_argument("--output-dir", type=str, default="./output")
    parser.add_argument("--ref-audio", type=str, required=True, help="Reference WAV path/URL")
    parser.add_argument("--ref-text", type=str, required=True, help="Transcript of ref audio")
    parser.add_argument("--deploy-config", type=str, default=None)
    parser.add_argument(
        "--async-chunk",
        action="store_true",
        default=False,
        help="Enable async_chunk mode (streaming DiT). Default: sync two-stage.",
    )
    parser.add_argument("--log-stats", action="store_true")
    parser.add_argument("--stage-init-timeout", type=int, default=600)
    return parser.parse_args()


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    main(parse_args())
higgs_audio_v2/README.md

higgs-audio v2 — offline example

Drives Stage 0 (DualFFN talker) + Stage 1 (HiggsAudio codec) for bosonai/higgs-audio-v2-generation-3B-base end-to-end through the vLLM-Omni engine and writes a 24 kHz mono WAV per prompt.

Prerequisites

Voice clone needs transformers>=5.3.0 — vllm-omni loads the audio codec via HF's HiggsAudioV2TokenizerModel, instantiated from the k2-fsa/OmniVoice/audio_tokenizer/ subdirectory (only that ~806 MB subdir is downloaded). The boson-ai standalone tokenizer repo's model.safetensors is actually a copy of the 3B talker LM, so HF can't load it directly; the k2 bundle ships the same codec weights repackaged with HF-compatible key naming.

pip install -U "transformers>=5.3.0"

Quick start

Plain TTS:

python examples/offline_inference/text_to_speech/higgs_audio_v2/end2end.py \
    --texts "Hello world." "The quick brown fox jumps over the lazy dog." \
    --output-dir results/wavs

Voice cloning

Pass both --ref-audio and --ref-text together:

python examples/offline_inference/text_to_speech/higgs_audio_v2/end2end.py \
    --texts "Hello, this is a cloned voice." \
    --ref-audio /path/to/reference.wav \
    --ref-text  "Exact transcript spoken in reference.wav." \
    --output-dir results/wavs

Notes

  • Output: 24 kHz mono WAV.
  • Deploy config: vllm_omni/deploy/higgs_audio_v2.yaml (auto-loaded by HF model_type).
  • --ref-text must be the real transcript of --ref-audio; mismatched text degrades cloned-voice quality.
  • For online serving, see examples/online_serving/text_to_speech/higgs_audio_v2/.
higgs_audio_v2/end2end.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Offline higgs-audio v2 inference example.

Runs Stage 0 (DualFFN talker) + Stage 1 (HiggsAudio codec) end-to-end
through the vLLM-Omni engine without going through the HTTP server, and
saves a 24 kHz mono WAV per prompt.

Example:

    python examples/offline_inference/text_to_speech/higgs_audio_v2/end2end.py \\
        --texts "Hello world." \\
                "The quick brown fox jumps over the lazy dog." \\
        --output-dir results/wavs
"""

from __future__ import annotations

import os

# DeepGEMM FP8 kernels require an optional backend that may not be installed.
# Disable the warmup before importing vLLM so engine startup falls back to the
# regular gemm path. Users with deep_gemm installed can override these.
os.environ.setdefault("VLLM_USE_DEEP_GEMM", "0")
os.environ.setdefault("VLLM_MOE_USE_DEEP_GEMM", "0")

import time
from pathlib import Path

import numpy as np
import soundfile as sf
import torch

from vllm_omni import Omni
from vllm_omni.utils.tracking_parser import TrackingArgumentParser

SAMPLE_RATE = 24_000
DEFAULT_TEXTS = (
    "Hello world.",
    "The quick brown fox jumps over the lazy dog.",
)


def parse_args():
    parser = TrackingArgumentParser(description="Offline higgs-audio v2 inference")
    parser.add_argument(
        "--model",
        type=str,
        default="bosonai/higgs-audio-v2-generation-3B-base",
        help="Stage-0 talker model id or path.",
    )
    parser.add_argument(
        "--texts",
        type=str,
        nargs="+",
        default=list(DEFAULT_TEXTS),
        help="One or more plain-text prompts to synthesize.",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="results/wavs",
        help="Directory to write per-prompt WAV files.",
    )
    parser.add_argument(
        "--deploy-config",
        type=str,
        default=None,
        help="Override the deploy config path. Auto-loads "
        "vllm_omni/deploy/higgs_audio_v2.yaml from the HF model_type by default.",
    )
    parser.add_argument(
        "--max-new-tokens",
        type=int,
        default=500,
        help="Cap on Stage-0 codec frames per request.",
    )
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference clip for voice clone (path to a WAV file). Paired with --ref-text.",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcript of the reference clip. Required when --ref-audio is set.",
    )
    return parser.parse_args()


def _slugify(text: str) -> str:
    import re

    slug = re.sub(r"\s+", "_", text.strip().lower())
    slug = re.sub(r"[^a-z0-9_]+", "", slug)
    return slug[:48] or "prompt"


def _extract_pcm(multimodal_output: dict) -> torch.Tensor:
    """Pull the final concatenated PCM tensor out of a request's multimodal_output."""
    audio = multimodal_output.get("model_outputs")
    if audio is None:
        audio = multimodal_output.get("audio")
    if audio is None:
        raise ValueError(f"no audio key in multimodal_output: {list(multimodal_output.keys())}")
    if isinstance(audio, list):
        valid = [torch.as_tensor(a).float().cpu().reshape(-1) for a in audio if a is not None]
        if not valid:
            raise ValueError("audio list is empty")
        return torch.cat(valid, dim=0) if len(valid) > 1 else valid[0]
    return torch.as_tensor(audio).float().cpu().reshape(-1)


def _pcm_to_int16(pcm: torch.Tensor) -> np.ndarray:
    arr = pcm.numpy()
    if arr.dtype.kind == "f":
        arr = np.clip(arr, -1.0, 1.0)
        arr = (arr * 32767.0).astype(np.int16)
    else:
        arr = arr.astype(np.int16)
    return arr


def main():
    args = parse_args()
    if (args.ref_audio is None) != (args.ref_text is None):
        raise SystemExit("--ref-audio and --ref-text must be supplied together")
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    engine = Omni(model=args.model, deploy_config=args.deploy_config)

    # Build prompt_token_ids using the same path serving_speech.py takes online.
    from transformers import AutoProcessor

    from vllm_omni.model_executor.models.higgs_audio_v2.higgs_audio_v2_tokenizer import (
        build_plain_text_prompt,
        build_voice_clone_prompt,
        input_ids_to_python_list,
    )

    processor = AutoProcessor.from_pretrained(args.model, trust_remote_code=True)

    # Voice-clone path: load the reference clip once. The HF processor will
    # encode it via the bundled HiggsAudioV2TokenizerModel each time we build
    # a prompt below. This is cheap on CPU for a few-second clip.
    ref_wav: np.ndarray | None = None
    ref_sr: int | None = None
    if args.ref_audio is not None:
        ref_wav, ref_sr = sf.read(args.ref_audio, always_2d=False)
        if ref_wav.ndim == 2:
            ref_wav = ref_wav.mean(axis=1)

    print(f"Model       : {args.model}")
    print(f"Prompts     : {len(args.texts)}")
    print(f"Output dir  : {output_dir}")
    print(f"Voice clone : {'yes' if ref_wav is not None else 'no'}")

    # Run one prompt at a time. The Stage-0 talker's per-slot audio state is
    # request-scoped; submitting multiple prompts in the same engine.generate()
    # call would batch them in the AR runner and exercise a code path that is
    # not validated for this model yet.
    total_elapsed = 0.0
    total_dur = 0.0
    for text in args.texts:
        if ref_wav is not None:
            out = build_voice_clone_prompt(processor, text, ref_wav, int(ref_sr), args.ref_text)
            prompt = {
                "prompt_token_ids": out["prompt_token_ids"],
                # Bare tensors (NOT list-wrapped): the msgspec serializer in
                # vllm_omni.data_entry_keys routes torch.Tensor → tensor_data
                # and list[Tensor] → list_data (which silently strips tensors).
                "additional_information": {
                    "audio_input_ids": out["audio_input_ids"],
                    "audio_input_ids_mask": out["audio_input_ids_mask"],
                },
            }
        else:
            inputs = build_plain_text_prompt(processor, text)
            prompt = {"prompt_token_ids": input_ids_to_python_list(inputs)}
        t_start = time.perf_counter()
        outputs = engine.generate([prompt])
        elapsed = time.perf_counter() - t_start
        total_elapsed += elapsed

        mm = outputs[0].outputs[0].multimodal_output
        pcm = _extract_pcm(mm)
        slug = _slugify(text)
        out_path = output_dir / f"{slug}.wav"
        sf.write(str(out_path), _pcm_to_int16(pcm), SAMPLE_RATE, format="WAV", subtype="PCM_16")
        dur = pcm.numel() / SAMPLE_RATE
        total_dur += dur
        print(f"  {slug:<50} dur={dur:5.2f}s  -> {out_path}")

    rtf = total_elapsed / total_dur if total_dur > 0 else float("inf")
    print(f"Total infer : {total_elapsed:.2f}s  total audio: {total_dur:.2f}s  RTF: {rtf:.3f}")


if __name__ == "__main__":
    os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")
    main()
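
One consequence of naming output files with _slugify: two prompts whose slugs share the same first 48 characters (or the same prompt passed twice) resolve to the same path, so the later WAV overwrites the earlier one. The sketch below shows one way to keep names distinct. slugify reproduces the script's helper; unique_slug is a hypothetical addition and is not part of the example script.

```python
import re


def slugify(text: str) -> str:
    # Same rules as the script: whitespace -> "_", drop other characters,
    # truncate to 48 characters, fall back to "prompt" when nothing is left.
    slug = re.sub(r"\s+", "_", text.strip().lower())
    slug = re.sub(r"[^a-z0-9_]+", "", slug)
    return slug[:48] or "prompt"


def unique_slug(text: str, used: set[str]) -> str:
    """Return a slug not yet in `used`, appending _1, _2, ... on collision."""
    base = slugify(text)
    candidate, n = base, 1
    while candidate in used:
        candidate = f"{base}_{n}"
        n += 1
    used.add(candidate)
    return candidate
```

In the per-prompt loop, a single set created before the loop would be passed to every unique_slug call, so repeated prompts land in separate files.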
ming_flash_omni_tts/end2end.py
"""Offline e2e example for Ming-flash-omni-2.0 standalone talker (TTS)."""

import os
from typing import Any

import soundfile as sf
import torch

from vllm_omni.utils.tracking_parser import TrackingArgumentParser

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniTokensPrompt
from vllm_omni.model_executor.models.ming_flash_omni.prompt_utils import (
    DEFAULT_PROMPT,
    create_instruction,
)

MODEL_NAME = "Jonathan1909/Ming-flash-omni-2.0"


def get_messages(case: str, text_override: str | None) -> dict[str, Any]:
    if case == "style":
        text = text_override or "我会一直在这里陪着你,直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗?"
        instruction = create_instruction(
            {
                "风格": "这是一种ASMR耳语,属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语,声音气音成分重。音量极低,紧贴麦克风,语速极慢,旨在制造触发听者颅内快感的声学刺激。",
            }
        )
        return {
            "prompt": DEFAULT_PROMPT,
            "text": text,
            "instruction": instruction,
            "use_zero_spk_emb": True,
        }
    if case == "ip":
        text = text_override or "这款产品的名字,叫变态坑爹牛肉丸。"
        return {
            "prompt": DEFAULT_PROMPT,
            "text": text,
            "instruction": create_instruction({"IP": "灵小甄"}),
            "use_zero_spk_emb": True,
        }
    if case == "basic":
        text = text_override or "我们当迎着阳光辛勤耕作,去摘取,去制作,去品尝,去馈赠。"
        return {
            "prompt": DEFAULT_PROMPT,
            "text": text,
            "instruction": create_instruction({"语速": "快速", "基频": "中", "音量": "中"}),
            "use_zero_spk_emb": True,
        }
    raise ValueError(f"Unknown case: {case}")


def save_audio(mm: dict[str, Any], output_path: str) -> None:
    if not mm or "audio" not in mm:
        raise RuntimeError("No audio found in model output")
    audio = mm["audio"]
    sr_raw = mm.get("sr", 44100)
    if isinstance(sr_raw, torch.Tensor):
        sample_rate = int(sr_raw.item())
    else:
        sample_rate = int(sr_raw)
    waveform = audio.squeeze().float().cpu().numpy()
    sf.write(output_path, waveform, sample_rate)
    print(f"Saved {output_path} ({len(waveform) / sample_rate:.2f}s, {sample_rate}Hz)")


def parse_args():
    parser = TrackingArgumentParser(description="Ming-flash-omni standalone talker offline e2e example")
    parser.add_argument("--model", type=str, default=MODEL_NAME, help="Model name or local path.")
    parser.add_argument(
        "--deploy-config",
        type=str,
        default="vllm_omni/deploy/ming_flash_omni_tts.yaml",
        help="Path to a custom deploy YAML for the TTS deployment.",
    )
    parser.add_argument(
        "--case",
        type=str,
        default="style",
        choices=["style", "ip", "basic"],
        help="Example case.",
    )
    parser.add_argument("--text", type=str, default=None, help="Override default text for the selected case.")
    parser.add_argument("--output", type=str, default=None, help="Output wav path.")
    parser.add_argument("--log-stats", action="store_true", default=False, help="Enable stats logging.")
    parser.add_argument("--init-timeout", type=int, default=600, help="Engine init timeout in seconds.")
    parser.add_argument("--stage-init-timeout", type=int, default=300, help="Single stage init timeout in seconds.")

    return parser.parse_args()


def main():
    args = parse_args()

    omni = Omni(**vars(args))

    messages = get_messages(args.case, args.text)
    decode_args = {
        # Standalone TTS deployment
        "ming_task": "instruct",
        "max_decode_steps": 200,
        "cfg": 2.0,
        "sigma": 0.25,
        "temperature": 0.0,
    }
    req = OmniTokensPrompt(
        prompt_token_ids=[0],
        additional_information={**messages, **decode_args},
    )

    outputs = omni.generate(req)
    mm = outputs[0].outputs[0].multimodal_output

    output_path = args.output or f"output_{args.case}.wav"
    save_audio(mm, output_path)
    omni.close()


if __name__ == "__main__":
    main()
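
save_audio above accepts the "sr" entry either as a tensor or as a plain number. A duck-typed variant of that branch, sketched below, avoids importing torch just for the isinstance check; normalize_sample_rate is a hypothetical helper name, and the 44100 default is taken from the script.

```python
from typing import Any


def normalize_sample_rate(sr_raw: Any, default: int = 44100) -> int:
    """Coerce a sample rate that may be None, a number, or a 1-element tensor."""
    if sr_raw is None:
        return default
    # torch tensors and NumPy scalars both expose .item(); plain ints do not.
    if hasattr(sr_raw, "item"):
        return int(sr_raw.item())
    return int(sr_raw)
```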
moss_tts/end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_speech/moss_tts/end2end.py.

moss_tts_nano/end2end.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Offline inference example for MOSS-TTS-Nano via vLLM-Omni.

Single-stage pipeline: the 0.1B AR LM and MOSS-Audio-Tokenizer-Nano codec
both run inside one generation stage. Output is 48 kHz mono WAV (the
upstream tokenizer is stereo at 48 kHz; the wrapper mixes down to mono so
the existing single-channel audio writer in vLLM-Omni stays correct).

MOSS-TTS-Nano upstream supports two modes (matching ``infer.py``):

* ``voice_clone`` (recommended): only ``--ref-audio`` is required.
* ``continuation``: ``--ref-audio`` + ``--ref-text`` together.

Usage:
  # Voice clone (recommended): ref audio only, no transcript needed.
  python end2end.py \\
    --text "Hello!" \\
    --ref-audio /path/to/ref.wav

  # Continuation: ref audio + its transcript.
  python end2end.py \\
    --text "Hello!" \\
    --ref-audio /path/to/ref.wav \\
    --ref-text "Transcript of the reference clip." \\
    --mode continuation

  # Sample reference clips ship in the upstream repo:
  #   https://github.com/OpenMOSS/MOSS-TTS-Nano/tree/main/assets/audio
  # e.g. zh_1.wav (Chinese), en_2.wav (English), jp_2.wav (Japanese).
"""

from __future__ import annotations

import os
from pathlib import Path

import soundfile as sf
import torch
from vllm import SamplingParams

from vllm_omni.utils.tracking_parser import TrackingArgumentParser

# Prevent multiprocessing from re-importing CUDA in the wrong context.
os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

from vllm_omni import Omni  # noqa: E402

MODEL = "OpenMOSS-Team/MOSS-TTS-Nano"


def build_request(
    text: str,
    prompt_audio_path: str,
    prompt_text: str | None = None,
    mode: str = "voice_clone",
    max_new_frames: int = 375,
    seed: int | None = None,
    audio_temperature: float = 0.8,
    audio_top_k: int = 25,
    audio_top_p: float = 0.95,
    text_temperature: float = 1.0,
) -> dict:
    """Build an Omni request payload for MOSS-TTS-Nano.

    Upstream's ``_resolve_inference_mode`` forbids ``prompt_text`` in
    ``voice_clone`` mode and requires it in ``continuation`` mode (with
    ``prompt_audio_path``), so we only forward ``prompt_text`` when it is
    actually supplied.
    """
    additional: dict = {
        "text": [text],
        "mode": [mode],
        "prompt_audio_path": [str(prompt_audio_path)],
        "max_new_frames": [max_new_frames],
        "audio_temperature": [audio_temperature],
        "audio_top_k": [audio_top_k],
        "audio_top_p": [audio_top_p],
        "text_temperature": [text_temperature],
    }
    if prompt_text is not None and prompt_text.strip():
        additional["prompt_text"] = [prompt_text]
    if seed is not None:
        additional["seed"] = [seed]

    return {
        "prompt": "<|im_start|>assistant\n",  # minimal placeholder prompt
        "additional_information": additional,
    }


def save_audio(waveform: torch.Tensor, path: str, sample_rate: int = 48000) -> None:
    """Write the model's mono waveform to ``path`` at ``sample_rate``.

    The model wrapper mixes the upstream tokenizer's stereo output down to
    mono before reaching the engine, so ``waveform`` is always 1-D here —
    no extra interleave/reshape is needed.
    """
    audio_np = waveform.float().numpy()
    sf.write(path, audio_np, sample_rate)
    print(f"  Saved {path} ({audio_np.shape}, {sample_rate} Hz)")


def main(args) -> None:
    omni = Omni(
        model=MODEL,
        deploy_config=args.deploy_config,
        stage_init_timeout=args.stage_init_timeout,
    )

    sampling_params = SamplingParams(
        temperature=1.0,
        top_p=1.0,
        top_k=50,
        max_tokens=4096,
        seed=args.seed if args.seed is not None else 42,
        detokenize=False,
    )

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    print(f"Synthesizing: {args.text!r}")
    print(f"  ref_audio: {args.ref_audio}")
    inputs = build_request(
        text=args.text,
        prompt_audio_path=args.ref_audio,
        prompt_text=args.ref_text,
        mode=args.mode,
        max_new_frames=args.max_new_frames,
        seed=args.seed,
        audio_temperature=args.audio_temperature,
        audio_top_k=args.audio_top_k,
        audio_top_p=args.audio_top_p,
        text_temperature=args.text_temperature,
    )
    params_list = sampling_params

    for stage_outputs in omni.generate(inputs, params_list):
        for i, req_output in enumerate(stage_outputs.request_output):
            for j, out in enumerate(req_output.outputs):
                mm = out.multimodal_output
                if mm is None:
                    print(f"  [req {i}] No audio output.")
                    continue
                audio = mm.get("audio")
                sr_tensor = mm.get("sr")
                if audio is None:
                    print(f"  [req {i}] No waveform in multimodal_output.")
                    continue
                sr = int(sr_tensor.item()) if sr_tensor is not None else 48000
                out_path = str(output_dir / f"output_{i}_{j}.wav")
                save_audio(audio.cpu(), out_path, sr)

    print("Done.")


def parse_args():
    parser = TrackingArgumentParser(description="MOSS-TTS-Nano offline inference")
    parser.add_argument("--text", default="Hello, this is MOSS-TTS-Nano speaking.", help="Text to synthesize.")
    parser.add_argument(
        "--ref-audio",
        required=True,
        help="Path to reference audio for voice cloning / continuation (required).",
    )
    parser.add_argument(
        "--ref-text",
        default=None,
        help=(
            "Optional transcript of --ref-audio. Required (and only meaningful) "
            "in --mode continuation; rejected by upstream in --mode voice_clone."
        ),
    )
    parser.add_argument("--mode", default="voice_clone", choices=["voice_clone", "continuation"])
    parser.add_argument("--max-new-frames", type=int, default=375, help="Max AR frames (~14s at default).")
    parser.add_argument("--seed", type=int, default=None, help="Random seed.")
    parser.add_argument("--audio-temperature", type=float, default=0.8)
    parser.add_argument("--audio-top-k", type=int, default=25)
    parser.add_argument("--audio-top-p", type=float, default=0.95)
    parser.add_argument("--text-temperature", type=float, default=1.0)
    parser.add_argument(
        "--output-dir",
        default=os.path.join(
            os.environ.get("XDG_CACHE_HOME", os.path.join(os.path.expanduser("~"), ".cache")),
            "moss_tts_nano_output",
        ),
        help="Directory for WAV outputs (default: ~/.cache/moss_tts_nano_output).",
    )
    parser.add_argument(
        "--deploy-config",
        default=None,
        help="Path to a deploy YAML; leave unset to auto-load vllm_omni/deploy/moss_tts_nano.yaml.",
    )
    parser.add_argument("--stage-init-timeout", type=int, default=120)
    return parser.parse_args()


if __name__ == "__main__":
    main(parse_args())
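
The script forwards --mode and --ref-text as given and leaves the consistency check to upstream. According to the build_request docstring, upstream forbids a transcript in voice_clone mode and requires one in continuation mode. A pre-flight check along those lines could look like the sketch below; check_mode is a hypothetical helper, and the error messages are illustrative, not upstream's.

```python
from typing import Optional


def check_mode(mode: str, ref_text: Optional[str]) -> None:
    """Raise ValueError when --mode and --ref-text are inconsistent."""
    if mode not in ("voice_clone", "continuation"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Treat empty or whitespace-only text as absent, as build_request does.
    has_text = bool(ref_text and ref_text.strip())
    if mode == "voice_clone" and has_text:
        raise ValueError("--ref-text is not accepted in voice_clone mode")
    if mode == "continuation" and not has_text:
        raise ValueError("--mode continuation requires --ref-text")
```

Running check_mode(args.mode, args.ref_text) before constructing Omni would surface a mismatch before the engine starts.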
omnivoice/end2end.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""End-to-end OmniVoice TTS inference via vLLM-Omni.

Supports:
- Auto voice mode: text only → generated speech
- Voice cloning mode: text + reference audio → cloned voice speech

Usage:
    # Auto voice
    python end2end.py --model k2-fsa/OmniVoice --text "Hello world"

    # Voice cloning
    python end2end.py --model k2-fsa/OmniVoice --text "Hello" \\
        --ref-audio ref.wav --ref-text "reference transcription"
"""

import argparse
import os

import numpy as np
import soundfile as sf

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams


def run_e2e():
    parser = argparse.ArgumentParser(description="OmniVoice E2E TTS inference")
    parser.add_argument(
        "--model",
        type=str,
        default="k2-fsa/OmniVoice",
        help="Model name or path (HuggingFace or local)",
    )
    parser.add_argument(
        "--stage-config",
        type=str,
        default="vllm_omni/deploy/omnivoice.yaml",
    )
    parser.add_argument(
        "--text",
        type=str,
        default="Hello, this is a test of the OmniVoice text to speech system.",
    )
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio for voice cloning (WAV file)",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcription of reference audio",
    )
    parser.add_argument(
        "--lang",
        type=str,
        default=None,
        help="Language code (e.g., 'en', 'zh')",
    )
    parser.add_argument(
        "--instruct",
        type=str,
        default=None,
        help="Voice design instruction (e.g., 'female, low pitch, british accent')",
    )
    parser.add_argument(
        "--output",
        type=str,
        default="output.wav",
        help="Output audio file path",
    )
    parser.add_argument(
        "--stage-init-timeout",
        type=int,
        default=600,
        help="Stage initialization timeout in seconds",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=None,
        help="Random seed for generation",
    )
    args = parser.parse_args()

    if not os.path.exists(args.stage_config):
        raise FileNotFoundError(f"Stage config not found: {args.stage_config}")

    print(f"Initializing OmniVoice with model={args.model}")

    omni = Omni(
        model=args.model,
        stage_configs_path=args.stage_config,
        log_stats=True,
    )

    print("Model initialized. Preparing inputs...")

    # Build prompt
    mm_processor_kwargs = {}
    multi_modal_data = {}

    if args.ref_audio:
        if not os.path.exists(args.ref_audio):
            raise FileNotFoundError(f"Reference audio not found: {args.ref_audio}")

        from vllm.multimodal.media.audio import load_audio

        audio_signal, sr = load_audio(args.ref_audio, sr=None)
        multi_modal_data["audio"] = (audio_signal.astype(np.float32), sr)
        mm_processor_kwargs["ref_text"] = args.ref_text or ""
        mm_processor_kwargs["sample_rate"] = sr

    if args.lang:
        mm_processor_kwargs["lang"] = args.lang
    if args.instruct:
        mm_processor_kwargs["instruct"] = args.instruct

    prompts = {"prompt": args.text}
    if multi_modal_data:
        prompts["multi_modal_data"] = multi_modal_data
    if mm_processor_kwargs:
        prompts["mm_processor_kwargs"] = mm_processor_kwargs

    sampling_params_list = [OmniDiffusionSamplingParams(extra_args={"seed": args.seed})]

    print(f"Generating speech for: {args.text}")

    outputs = list(omni.generate(prompts, sampling_params_list=sampling_params_list))

    print(f"Received {len(outputs)} outputs.")
    for i, output in enumerate(outputs):
        try:
            ro = output.request_output
            if ro is None:
                print("No request_output found.")
                continue

            mm = getattr(ro, "multimodal_output", None)
            if not mm and ro.outputs:
                mm = getattr(ro.outputs[0], "multimodal_output", None)

            if mm:
                print(f"Multimodal output keys: {mm.keys()}")
                if "audio" in mm:
                    audio_out = mm["audio"]
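                    sr = mm.get("sr", 24000)
                    # Other scripts in this hub receive "sr" as a tensor; coerce
                    # defensively so soundfile always gets a plain int.
                    sr = int(sr.item()) if hasattr(sr, "item") else int(sr)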
                    if isinstance(audio_out, np.ndarray):
                        audio_np = audio_out
                    else:
                        audio_np = audio_out.cpu().numpy().squeeze()
                    out_path = args.output if i == 0 else f"output_{i}.wav"
                    sf.write(out_path, audio_np, sr)
                    print(f"Saved audio to {out_path} ({sr}Hz, {len(audio_np) / sr:.2f}s)")
            else:
                print("No multimodal output found.")
        except Exception as e:
            print(f"Error inspecting output: {e}")

    omni.close()
    print("Done.")


if __name__ == "__main__":
    run_e2e()
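
In the loop above, the first output is written to --output, while any further output goes to output_{i}.wav in the current working directory regardless of where --output points. Whether OmniVoice returns more than one output for a single prompt is not stated here; if it does, the sketch below keeps the extra files next to the first one. indexed_output_path is a hypothetical helper, not part of the script.

```python
from pathlib import Path


def indexed_output_path(output: str, index: int) -> Path:
    """Return `output` for index 0, else insert _<index> before the suffix."""
    path = Path(output)
    if index == 0:
        return path
    return path.with_name(f"{path.stem}_{index}{path.suffix}")
```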
qwen3_tts/benchmark_prompts.txt
Hello, welcome to the voice synthesis benchmark test.
She said she would be here by noon, but nobody showed up.
The quick brown fox jumps over the lazy dog near the riverbank.
I can't believe how beautiful the sunset looks from up here on the mountain.
Please remember to bring your identification documents to the appointment tomorrow morning.
Have you ever wondered what it would be like to travel through time and visit ancient civilizations?
The restaurant on the corner serves the best pasta I have ever tasted in my entire life.
After the meeting, we should discuss the quarterly results and plan for the next phase.
Learning a new language takes patience, practice, and a genuine curiosity about other cultures.
The train leaves at half past seven, so we need to arrive at the station before then.
Could you please turn down the music a little bit, I'm trying to concentrate on my work.
It was a dark and stormy night when the old lighthouse keeper heard a knock at the door.
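
The file holds one prompt per line. How qwen3_tts/end2end.py consumes it is not shown here (that script is omitted from the rendered docs), so the loader below is only a sketch under the one-prompt-per-line assumption; load_prompts is a hypothetical helper.

```python
from pathlib import Path


def load_prompts(path: str) -> list[str]:
    """Read one prompt per line, skipping blank lines and trimming whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```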
qwen3_tts/end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_speech/qwen3_tts/end2end.py.

voxcpm2/end2end.py
"""Offline VoxCPM2 inference example (native AR pipeline).

Uses the single-stage native AR config (voxcpm2.yaml).
Requires the `voxcpm` package or VLLM_OMNI_VOXCPM_CODE_PATH env var.
"""

from __future__ import annotations

import os
import time
from pathlib import Path

import soundfile as sf
import torch

from vllm_omni import Omni
from vllm_omni.utils.tracking_parser import TrackingArgumentParser

REPO_ROOT = Path(__file__).resolve().parents[4]
SAMPLE_RATE = 48_000


def parse_args():
    parser = TrackingArgumentParser(description="Offline VoxCPM2 native AR inference")
    parser.add_argument(
        "--model",
        type=str,
        default="openbmb/VoxCPM2",
        help="VoxCPM2 model path or HuggingFace repo ID.",
    )
    parser.add_argument(
        "--text",
        type=str,
        default="This is a VoxCPM2 native AR synthesis example running on vLLM Omni.",
        help="Text to synthesize.",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="output_audio",
        help="Directory for output WAV files.",
    )
    parser.add_argument(
        "--deploy-config",
        type=str,
        default=None,
        help="Override the deploy config path. If unset, auto-loads "
        "vllm_omni/deploy/voxcpm2.yaml based on the HF model_type.",
    )
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Path to reference audio for voice cloning.",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Optional transcript of --ref-audio (enables continuation mode).",
    )
    return parser.parse_args()


def extract_audio(multimodal_output: dict) -> torch.Tensor:
    """Extract the final complete audio tensor from multimodal output.

    The output processor concatenates per-step delta tensors under
    ``model_outputs``.  Falls back to ``audio`` for backwards compat.
    """
    audio = multimodal_output.get("model_outputs")
    if audio is None:
        audio = multimodal_output.get("audio")
    if audio is None:
        raise ValueError(f"No audio key in multimodal_output: {list(multimodal_output.keys())}")

    if isinstance(audio, list):
        # Defensive: usually the output processor consolidates into a single
        # tensor at request completion, but concatenate here too in case the
        # caller consumes intermediate (pre-consolidation) outputs.
        valid = [torch.as_tensor(a).float().cpu().reshape(-1) for a in audio if a is not None]
        if not valid:
            raise ValueError("Audio list is empty or all elements are None.")
        return torch.cat(valid, dim=0) if len(valid) > 1 else valid[0]

    return torch.as_tensor(audio).float().cpu().reshape(-1)


def main():
    args = parse_args()

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    engine = Omni(
        model=args.model,
        deploy_config=args.deploy_config,
    )

    from transformers import AutoTokenizer

    from vllm_omni.model_executor.models.voxcpm2.voxcpm2_talker import (
        build_cjk_split_map,
        build_voxcpm2_prompt,
    )

    tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=True)
    split_map = build_cjk_split_map(tokenizer)
    hf_config = engine.engine.stage_vllm_configs[0].model_config.hf_config

    ref_audio_arg = args.ref_audio
    ref_text_arg = args.ref_text
    ref_wav, ref_sr = (None, None)
    if ref_audio_arg:
        ref_wav_arr, ref_sr = sf.read(ref_audio_arg)
        ref_wav = ref_wav_arr.mean(axis=-1).tolist() if ref_wav_arr.ndim > 1 else ref_wav_arr.tolist()

    prompt = build_voxcpm2_prompt(
        hf_config=hf_config,
        tokenizer=tokenizer,
        split_map=split_map,
        text=args.text,
        ref_audio=ref_wav,
        ref_sr=ref_sr,
        ref_text=ref_text_arg,
    )

    print(f"Model       : {args.model}")
    print(f"Text        : {args.text}")
    if ref_audio_arg:
        print(f"Ref audio   : {ref_audio_arg}")
    if ref_text_arg:
        print(f"Ref text    : {ref_text_arg}")
    print(f"Output dir  : {output_dir}")

    t_start = time.perf_counter()
    outputs = engine.generate([prompt])
    elapsed = time.perf_counter() - t_start

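    # The waveform is stored under multimodal_output["model_outputs"] (with
    # "audio" as a fallback) and may be a single tensor or a list of tensors;
    # extract_audio() handles all of these cases.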
    request_output = outputs[0]
    mm = request_output.outputs[0].multimodal_output
    audio = extract_audio(mm)

    duration = audio.numel() / SAMPLE_RATE
    rtf = elapsed / duration if duration > 0 else float("inf")

    output_path = output_dir / "output.wav"
    sf.write(str(output_path), audio.numpy(), SAMPLE_RATE, format="WAV")

    print(f"Saved       : {output_path}")
    print(f"Duration    : {duration:.2f}s")
    print(f"Inference   : {elapsed:.2f}s")
    print(f"RTF         : {rtf:.3f}")


if __name__ == "__main__":
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    main()
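
The RTF printed by this script (and by the higgs-audio v2 script) is the real-time factor: wall-clock inference time divided by the duration of the generated audio, so values below 1 mean synthesis ran faster than playback. The calculation can be isolated as in the sketch below; real_time_factor is a hypothetical helper name.

```python
def real_time_factor(elapsed_s: float, num_samples: int, sample_rate: int) -> float:
    """Inference seconds per second of generated audio (lower is faster)."""
    duration_s = num_samples / sample_rate
    # Match the scripts' convention: an empty waveform yields infinity
    # instead of raising ZeroDivisionError.
    return elapsed_s / duration_s if duration_s > 0 else float("inf")
```

For example, 192,000 samples at 48 kHz is 4 seconds of audio; if generation took 2 seconds, the factor is 0.5.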
voxtral_tts/end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_speech/voxtral_tts/end2end.py.