Text-To-Speech¶

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/text_to_speech.

vLLM-Omni supports several autoregressive TTS models. They share a mostly common CLI shape (--text, --ref-audio, --ref-text, plus an output-path flag — --output-dir for most, --output for OmniVoice) and live together in this hub. Each model has its own subdirectory containing a single end2end.py script; this README is the single doc entry point.

For online serving, see examples/online_serving/text_to_speech/. For the full list of supported architectures across all modalities, see Supported Models.

Supported Models¶

Model	HuggingFace repo	Stages	Voice cloning	Streaming	Special modes	Sample rate
CosyVoice3	`FunAudioLLM/Fun-CosyVoice3-0.5B-2512`	2 (talker + code2wav)	✓	✓	—	24 kHz
Fish Speech S2 Pro	`fishaudio/s2-pro`	dual-AR	✓	✓	—	44.1 kHz
GLM-TTS	`zai-org/GLM-TTS`	2 (AR + DiT)	✓ (required)	✓	—	24 kHz
Ming-omni-tts	`inclusionAI/Ming-omni-tts-0.5B`	2 (AR + audio VAE)	✓	✓	style / IP / dialect / TTA / podcast	44.1 kHz
Ming-flash-omni-TTS	`Jonathan1909/Ming-flash-omni-2.0`	single (talker only)	— (caption-controlled)	—	style / IP / basic captions	44.1 kHz
MOSS-TTS-Nano	`OpenMOSS-Team/MOSS-TTS-Nano`	single (AR + codec)	✓ (required)	✓	voice_clone, continuation	48 kHz
OmniVoice	`k2-fsa/OmniVoice`	2 (gen + dec)	✓	—	voice design, language hint	24 kHz
Qwen3-TTS	`Qwen/Qwen3-TTS-12Hz-1.7B-{CustomVoice,VoiceDesign,Base}`	2 (talker + code2wav)	✓ (Base)	✓	3 task variants	24 kHz
VoxCPM2	`openbmb/VoxCPM2`	single (native AR)	✓	✓ (online)	continuation	48 kHz
IndexTTS-2	`IndexTeam/IndexTTS-2`	2 (AR talker + S2Mel DiT + BigVGAN)	✓ (required)	—	emotion control (`--emo-audio`, `--emo-text`, `--emo-vector`)	22.05 kHz
Voxtral TTS	`mistralai/Voxtral-4B-TTS-2603`	varies	✓	✓	voice presets	24 kHz

Common Quick Start¶

Most models share this invocation shape:

python examples/offline_inference/text_to_speech/<model>/end2end.py \
    --text "Hello, this is a test." \
    --ref-audio /path/to/reference.wav \
    --ref-text  "Transcript of the reference audio."

For most models, --ref-audio and --ref-text are optional (text-only synthesis works without them) and must be provided together for voice cloning. GLM-TTS, MOSS-TTS-Nano, and IndexTTS-2 are exceptions that require reference audio on every request; see their sections. The scripts for Qwen3-TTS, Voxtral TTS, and CosyVoice3 accept additional model-specific flags documented in their per-model sections below. Qwen3-TTS in particular uses its own argparse surface (--query-type, --audio-path, etc.) and does not follow the common shape; see its section.
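When driving these scripts from Python, the common shape can be captured in a small wrapper. The sketch below assembles the argument list and enforces the pairing of --ref-audio and --ref-text for scripts that follow the common shape. build_tts_command is a hypothetical helper, not part of vLLM-Omni, and it only builds the command; it does not run it.

```python
from __future__ import annotations

import shlex


def build_tts_command(
    model_dir: str,
    text: str,
    ref_audio: str | None = None,
    ref_text: str | None = None,
) -> list[str]:
    """Assemble argv for an end2end.py script that follows the common CLI shape.

    Hypothetical helper: only the flag names and the path layout come from
    this README.
    """
    # Cloning needs both flags; text-only synthesis needs neither.
    if (ref_audio is None) != (ref_text is None):
        raise ValueError("--ref-audio and --ref-text must be provided together")
    script = f"examples/offline_inference/text_to_speech/{model_dir}/end2end.py"
    cmd = ["python", script, "--text", text]
    if ref_audio is not None:
        cmd += ["--ref-audio", ref_audio, "--ref-text", ref_text]
    return cmd


if __name__ == "__main__":
    # "fish_speech" is one of the subdirectories used in this README.
    print(shlex.join(build_tts_command("fish_speech", "Hello, this is a test.")))
```

The returned list can be handed to subprocess.run as-is. Models with different rules (for example, those that always require reference audio) need their own checks.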

CosyVoice3¶

2-stage TTS pipeline (talker + code2wav) at 24 kHz.

Prerequisites¶

uv pip install -e .
# Includes soundfile, onnxruntime, x-transformers, einops via requirements.

Download the model snapshot:

from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512',
                  local_dir='pretrained_models/Fun-CosyVoice3-0.5B')

If your downloaded checkpoint lacks config.json, add it:

{
    "model_type": "cosyvoice3",
    "architectures": ["CosyVoice3Model"]
}

This is required because AutoConfig.register("cosyvoice3", CosyVoice3Config) only registers the class mapping; the loader still reads model_type from config.json to select the class.
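If you prefer to script this step, the following sketch writes the file only when it is absent, so a config.json that ships with the checkpoint is never overwritten. ensure_config is a hypothetical helper name; the JSON content is the snippet shown above.

```python
import json
from pathlib import Path

# Same content as the config.json snippet in this section.
COSYVOICE3_CONFIG = {
    "model_type": "cosyvoice3",
    "architectures": ["CosyVoice3Model"],
}


def ensure_config(model_dir: str) -> bool:
    """Create config.json in model_dir if missing. Returns True if a file was written."""
    path = Path(model_dir) / "config.json"
    if path.exists():
        return False  # leave an existing config untouched
    path.write_text(json.dumps(COSYVOICE3_CONFIG, indent=4) + "\n", encoding="utf-8")
    return True
```

Call it with the local_dir used for snapshot_download, for example ensure_config("pretrained_models/Fun-CosyVoice3-0.5B").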

Quick start¶

python examples/offline_inference/text_to_speech/cosyvoice3/end2end.py \
    --model pretrained_models/Fun-CosyVoice3-0.5B \
    --tokenizer pretrained_models/Fun-CosyVoice3-0.5B/CosyVoice-BlankEN

Voice cloning¶

If --ref-audio is omitted, the script downloads the upstream zero_shot_prompt.wav from the CosyVoice repo into the current directory. To use your own clip, pass --ref-audio /path/to/reference.wav and update --prompt-text so that the text after <|endofprompt|> is the transcript of that clip.

python examples/offline_inference/text_to_speech/cosyvoice3/end2end.py \
    --model pretrained_models/Fun-CosyVoice3-0.5B \
    --tokenizer pretrained_models/Fun-CosyVoice3-0.5B/CosyVoice-BlankEN \
    --ref-audio /path/to/reference.wav \
    --prompt-text "You are a helpful assistant.<|endofprompt|>Transcript of your reference audio clip"

Streaming¶

Streaming is enabled by default via async_chunk: true in vllm_omni/deploy/cosyvoice3.yaml. Pass --no-async-chunk on vllm serve to switch to the legacy synchronous path.

Notes¶

Stage 0 (talker) emits speech tokens; stage 1 (code2wav) runs flow matching + HiFiGAN to synthesize waveform.
Deploy config auto-loads from vllm_omni/deploy/cosyvoice3.yaml based on HF model_type. Pass --deploy-config <path> to override.

GLM-TTS¶

2-stage TTS pipeline (AR + DiT flow-matching) at 24 kHz. Every request requires reference audio and its transcript for zero-shot voice cloning.

Quick start¶

python examples/offline_inference/text_to_speech/glm_tts/end2end.py \
    --model zai-org/GLM-TTS \
    --text "你好，这是语音合成测试。" \
    --ref-audio /path/to/reference.wav \
    --ref-text "这是参考音频的文本内容。" \
    --output-dir ./output

Architecture¶

Text → [Stage 0: AR] → Speech Tokens → [Stage 1: DiT + HiFT] → Audio (24 kHz)
        (Llama-based)    (32k vocab)      (Flow Matching)

Notes¶

--ref-audio and --ref-text are required together; GLM-TTS does not support text-only synthesis.
Reference audio should be 3-10 seconds.
First run may be slow due to lazy loading of WhisperVQ tokenizer and CampPlus ONNX speaker embedder.
Default sampling: temperature=1.0, top_k=25, top_p=0.8 (RAS method).
The --model path should point to the repository root (not llm/ subdirectory).
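The 3-10 second guideline for reference audio can be checked before launching a run. This is a minimal sketch using only the standard-library wave module, so it assumes the clip is an uncompressed PCM WAV file; check_ref_audio is a hypothetical helper, not something the GLM-TTS script provides.

```python
import wave


def ref_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_ref_audio(path: str, min_s: float = 3.0, max_s: float = 10.0) -> float:
    """Raise ValueError if the clip is outside the recommended duration range."""
    duration = ref_duration_seconds(path)
    if not min_s <= duration <= max_s:
        raise ValueError(f"{path}: {duration:.2f} s is outside {min_s}-{max_s} s")
    return duration
```

For other container formats, the same check can be done with a library such as soundfile, which the example scripts already use for reading audio.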

Fish Speech S2 Pro¶

4B dual-AR text-to-speech model from FishAudio with the DAC codec at 44.1 kHz.

Prerequisites¶

pip install fish-speech

Quick start¶

python examples/offline_inference/text_to_speech/fish_speech/end2end.py \
    --text "Hello, this is a test of the Fish Speech text to speech system."

Voice cloning¶

python examples/offline_inference/text_to_speech/fish_speech/end2end.py \
    --text "Hello, this is a cloned voice." \
    --ref-audio /path/to/reference.wav \
    --ref-text  "Transcript of the reference audio."

Streaming¶

python examples/offline_inference/text_to_speech/fish_speech/end2end.py \
    --text "Hello, this is a streaming test." \
    --streaming

Streaming requires async_chunk: true in the stage config.

Notes¶

Output: 44.1 kHz mono WAV.
DAC codec weights (codec.pth) are loaded lazily from the model directory.

Ming-omni-tts¶

Dense 0.5B two-stage TTS pipeline (AR + flow + audio VAE) at 44.1 kHz. The example covers style, IP voice, music-only generation, text-to-audio events, emotion, dialect, zero-shot cloning, podcast, speech+BGM, and speech+environment-sound cases.

Quick start¶

python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
    --case style \
    --deploy-config vllm_omni/deploy/ming_tts.yaml \
    --enforce-eager

Voice cloning¶

python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
    --case zero_shot \
    --ref-audio /path/to/reference.wav \
    --ref-text "在此奉劝大家别乱打美白针。" \
    --deploy-config vllm_omni/deploy/ming_tts.yaml \
    --enforce-eager

Streaming¶

python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
    --case basic \
    --ref-audio /path/to/reference.wav \
    --streaming \
    --deploy-config vllm_omni/deploy/ming_tts.yaml \
    --enforce-eager

Notes¶

style, ip, bgm, and tta do not require reference audio.
Reference-audio cases use --ref-audio; zero_shot also requires --ref-text.
podcast uses multiple references via --ref-audio-paths.
Full case details live in ming_tts/README.md.

Ming-flash-omni-TTS¶

Standalone talker-only deployment of Ming-flash-omni-2.0 at 44.1 kHz. Voice is controlled through caption fields (风格 / IP / 语速/基频/音量, i.e. style / IP / speed, pitch, volume) rather than reference audio.

Prerequisites¶

The example calls into vllm_omni.model_executor.models.ming_flash_omni.prompt_utils for the default prompt and instruction builder; no extra pip install on top of the base vLLM-Omni install.

Quick start¶

python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case style

Cases¶

# ASMR-style whisper (caption-driven)
python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case style

# IP voice (preset character voice via caption)
python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case ip

# Basic speed/pitch/volume control
python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case basic

Override the default text for a case with --text, and write to a custom path with --output.

Notes¶

Talker-only deployment — for the multimodal Ming-flash-omni example, see examples/offline_inference/ming_flash_omni/.
Deploy config: vllm_omni/deploy/ming_flash_omni_tts.yaml (single GPU, enforce_eager, max_num_seqs: 1).
Decode defaults from the Ming cookbook: max_decode_steps=200, cfg=2.0, sigma=0.25, temperature=0.0, use_zero_spk_emb=True.

MOSS-TTS-Nano¶

Single-stage 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec at 48 kHz mono (mixed down from upstream stereo). Supports Chinese, English, and Japanese. Every request requires a reference clip via --ref-audio; there are no built-in speaker presets.

The default --mode voice_clone matches upstream's recommended workflow. --mode continuation is exposed for completeness, but upstream's continuation-with-prompt path emits very short or near-silent output, so it is rarely useful in practice. Sample reference clips ship in the upstream repo under assets/audio/ (e.g. zh_1.wav, en_2.wav, jp_2.wav).

Quick start¶

# Fetch a sample reference clip (one-off, user-scoped cache).
REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
mkdir -p "$REF_DIR"
[ -s "$REF_DIR/zh_1.wav" ] || \
    curl -L -o "$REF_DIR/zh_1.wav" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav

python examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py \
    --text "你好，这是MOSS-TTS-Nano的语音合成演示。" \
    --ref-audio "$REF_DIR/zh_1.wav"

The first run downloads OpenMOSS-Team/MOSS-TTS-Nano and OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano from Hugging Face.

Reproducible runs¶

python examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py \
    --text "Deterministic test." \
    --ref-audio "$REF_DIR/en_2.wav" \
    --seed 42

Notes¶

Output: 48 kHz mono WAV (the tokenizer is internally stereo at 48 kHz; the wrapper averages to mono before reaching the engine).
Deploy config: vllm_omni/deploy/moss_tts_nano.yaml (auto-loaded; override with --deploy-config).
Default --max-new-frames 375 ≈ 14 s of audio; raise for longer outputs.
--ref-text is rejected in voice_clone mode and required only with --mode continuation.
Run --help for the full sampling-knob surface (--audio-temperature, --audio-top-k, --audio-top-p, --text-temperature).
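To pick a --max-new-frames value for a longer output, the documented default can be scaled linearly. The sketch below assumes that the "375 frames ≈ 14 s" figure above holds proportionally for other durations; that ratio is approximate, and frames_for_duration is a hypothetical helper.

```python
import math

DEFAULT_MAX_NEW_FRAMES = 375  # default quoted in the notes above
DEFAULT_SECONDS = 14          # approximate audio length those frames produce


def frames_for_duration(seconds: float) -> int:
    """Scale the documented 375-frames-per-14-s ratio to a target duration."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return math.ceil(seconds * DEFAULT_MAX_NEW_FRAMES / DEFAULT_SECONDS)
```

For a 30-second target this gives 804 frames. Because the ratio is only approximate, leaving some headroom above the computed value is reasonable.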

OmniVoice¶

Zero-shot multilingual TTS supporting 600+ languages, with three modes (auto / clone / design).

Prerequisites¶

huggingface-cli download k2-fsa/OmniVoice

Voice cloning requires transformers>=5.3.0. Auto and design modes work with transformers>=4.57.0.
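The version requirements can be checked up front rather than discovered at load time. This is a sketch under the assumption that comparing the numeric release segment is enough (pre-release suffixes are ignored); mode_supported is a hypothetical helper, and the mode names follow the auto / clone / design list above.

```python
from __future__ import annotations

# Minimum transformers versions per mode, as stated in this section.
MIN_TRANSFORMERS = {
    "auto": (4, 57, 0),
    "design": (4, 57, 0),
    "clone": (5, 3, 0),
}


def parse_version(version: str) -> tuple[int, ...]:
    """Parse the leading numeric part of a version, e.g. '4.57.1' -> (4, 57, 1)."""
    parts: list[int] = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '5.3' compares equal to '5.3.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def mode_supported(transformers_version: str, mode: str) -> bool:
    """Return True if the given transformers version meets the mode's minimum."""
    return parse_version(transformers_version) >= MIN_TRANSFORMERS[mode]
```

The installed version string can be obtained with importlib.metadata.version("transformers") and passed to mode_supported.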

Quick start (auto voice)¶

python examples/offline_inference/text_to_speech/omnivoice/end2end.py \
    --model k2-fsa/OmniVoice \
    --text "Hello, this is a test."

Voice cloning¶

python examples/offline_inference/text_to_speech/omnivoice/end2end.py \
    --model k2-fsa/OmniVoice \
    --text "Hello, this is a test." \
    --ref-audio ref.wav \
    --ref-text  "This is the reference transcription."

Voice design¶

python examples/offline_inference/text_to_speech/omnivoice/end2end.py \
    --model k2-fsa/OmniVoice \
    --text "Hello, this is a test." \
    --instruct "female, low pitch, british accent"

Language hint¶

python examples/offline_inference/text_to_speech/omnivoice/end2end.py \
    --model k2-fsa/OmniVoice \
    --text "你好，这是一个测试。" \
    --lang zh

Seed for Reproducibility¶

python examples/offline_inference/text_to_speech/omnivoice/end2end.py \
    --model k2-fsa/OmniVoice \
    --text "Hello, this is a test." \
    --seed 42

Notes¶

Stage 0 (Generator): Qwen3-0.6B with 32-step iterative unmasking.
Stage 1 (Decoder): HiggsAudioV2 RVQ + DAC at 24 kHz.

Qwen3-TTS¶

TTS at 24 kHz with three task variants (CustomVoice, VoiceDesign, Base). The script has its own argparse surface and does not follow the common --text / --ref-audio shape.

Prerequisites¶

For ROCm builds, replace onnxruntime with onnxruntime-rocm:

pip uninstall onnxruntime
pip install onnxruntime-rocm

Task variants¶

CustomVoice: predefined speaker (speaker ID) with optional style instruction.
VoiceDesign: text + descriptive instruction designs a new voice.
Base: voice cloning from reference audio + transcript.

# Single sample
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py --query-type CustomVoice
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py --query-type VoiceDesign
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py --query-type Base

# Base with a custom reference audio (Qwen3-TTS uses --audio-path, not --ref-audio):
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py \
    --query-type Base --audio-path /path/to/reference.wav

# Base variant has an additional mode flag:
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py --query-type Base --mode-tag icl       # default
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py --query-type Base --mode-tag xvec_only # x_vector_only_mode

# Batch (multiple prompts in one run)
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py --query-type CustomVoice --use-batch-sample

Streaming¶

python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py \
    --query-type CustomVoice \
    --streaming \
    --output-dir /tmp/out_stream

Streaming requires async_chunk: true in the stage config.

Word Timestamps¶

Generate a WAV offline and a JSON sidecar with word-level timestamps from Qwen/Qwen3-ForcedAligner-0.6B:

python examples/offline_inference/text_to_speech/qwen3_tts/word_timestamps.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --forced-aligner Qwen/Qwen3-ForcedAligner-0.6B \
    --text "Hello world." \
    --output-dir /tmp/qwen3_tts_timestamps

The script writes qwen3_tts_word_timestamps.wav and qwen3_tts_word_timestamps.json. On machines without a local CUDA toolkit, set VLLM_USE_FLASHINFER_SAMPLER=0 to avoid FlashInfer sampler JIT.

Batched decoding¶

The Code2Wav stage supports batched decoding through the SpeechTokenizer. Pass multiple prompts via --txt-prompts and set --batch-size accordingly. To raise max_num_seqs on either stage, point --stage-configs-path at a stage configs YAML with the desired values (see vllm_omni/model_executor/stage_configs/ for templates):

python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py \
    --query-type CustomVoice \
    --txt-prompts examples/offline_inference/text_to_speech/qwen3_tts/benchmark_prompts.txt \
    --batch-size 4 \
    --stage-configs-path /path/to/qwen3_tts_batched.yaml

--batch-size must match a CUDA-graph capture size (1, 2, 4, 8, 16…).
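One way to choose a valid --batch-size for a given number of prompts is to round up to the next size in that list. The sketch below assumes the capture sizes are the powers of two suggested by "1, 2, 4, 8, 16…"; the sizes actually captured depend on the engine configuration, and next_capture_size is a hypothetical helper.

```python
def next_capture_size(num_prompts: int) -> int:
    """Return the smallest power of two that is >= num_prompts."""
    if num_prompts < 1:
        raise ValueError("num_prompts must be at least 1")
    # (n - 1).bit_length() is the exponent of the next power of two.
    return 1 << (num_prompts - 1).bit_length()
```

With three prompts this suggests --batch-size 4, and with four prompts it stays at 4.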

Notes¶

Run --help for the full argument surface.
See qwen3_tts/end2end.py for the prompt-length-estimation logic the Talker uses.

VoxCPM2¶

Single-stage native AR TTS at 48 kHz. Pipeline: feat_encoder → MiniCPM4 → FSQ → residual_lm → LocDiT → AudioVAE.

Prerequisites¶

pip install voxcpm
# or, for a local source checkout:
export VLLM_OMNI_VOXCPM_CODE_PATH=/path/to/voxcpm

Quick start¶

python examples/offline_inference/text_to_speech/voxcpm2/end2end.py \
    --model openbmb/VoxCPM2 \
    --text "Hello, this is a VoxCPM2 demo."

Voice cloning¶

Pass only --ref-audio for isolated cloning, or both --ref-audio and --ref-text for prompt continuation:

python examples/offline_inference/text_to_speech/voxcpm2/end2end.py \
    --text "Hello, this is a voice clone demo." \
    --ref-audio /path/to/reference.wav \
    --ref-text  "Transcript of the reference audio."

Streaming¶

Streaming is exposed through the online OpenAI Speech API (stream=true). See examples/online_serving/text_to_speech/voxcpm2/gradio_demo.py for an AudioWorklet-based gapless streaming player; the offline end2end.py script does not expose a streaming path.

Notes¶

Output: 48 kHz mono WAV.
Deploy config: vllm_omni/deploy/voxcpm2.yaml (auto-loaded by HF model_type).

IndexTTS-2¶

2-stage TTS pipeline (GPT AR talker + S2Mel CFM DiT + BigVGAN vocoder) at 22.05 kHz. Every request requires reference audio for zero-shot voice cloning. Supports emotion conditioning via audio, text, or 8-dim vector.

Quick start¶

python examples/offline_inference/text_to_speech/indextts2/end2end.py \
    --model IndexTeam/IndexTTS-2 \
    --text "你好，这是一个语音合成测试。" \
    --ref-audio /path/to/reference.wav

Emotion control¶

# Emotion from reference audio
python examples/offline_inference/text_to_speech/indextts2/end2end.py \
    --model IndexTeam/IndexTTS-2 \
    --text "今天天气真好！" \
    --ref-audio /path/to/ref.wav \
    --emo-audio /path/to/happy.wav

# Emotion from 8-dim vector (happy angry sad afraid disgusted melancholy surprised calm)
python examples/offline_inference/text_to_speech/indextts2/end2end.py \
    --model IndexTeam/IndexTTS-2 \
    --text "今天天气真好！" \
    --ref-audio /path/to/ref.wav \
    --emo-vector 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

# Emotion from text description
python examples/offline_inference/text_to_speech/indextts2/end2end.py \
    --model IndexTeam/IndexTTS-2 \
    --text "今天天气真好！" \
    --ref-audio /path/to/ref.wav \
    --emo-text "happy and excited"
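Because --emo-vector is positional, it is easy to put a weight in the wrong slot. The sketch below builds the vector from named emotions, using the order given in the comment above (happy, angry, sad, afraid, disgusted, melancholy, surprised, calm). emo_vector and emo_vector_args are hypothetical helpers, not part of the example script.

```python
# Slot order as listed in the --emo-vector comment above.
EMOTIONS = (
    "happy", "angry", "sad", "afraid",
    "disgusted", "melancholy", "surprised", "calm",
)


def emo_vector(**weights: float) -> list[float]:
    """Build the 8-dim emotion vector; unspecified emotions default to 0.0."""
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    return [float(weights.get(name, 0.0)) for name in EMOTIONS]


def emo_vector_args(**weights: float) -> list[str]:
    """Return the CLI arguments for --emo-vector as a list of strings."""
    return ["--emo-vector", *(str(v) for v in emo_vector(**weights))]
```

For example, emo_vector_args(happy=1.0) reproduces the arguments used in the vector example above.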

Notes¶

--ref-audio is required — IndexTTS-2 does not support text-only synthesis.
Stage 0 (AR Talker): GPT-2 generates mel codes from text + reference audio.
Stage 1 (S2Mel + BigVGAN): CFM DiT converts mel codes to waveform at 22.05 kHz.
Deploy config: vllm_omni/deploy/indextts2.yaml. Stage 1 runs with enforce_eager: true (DiT has dynamic shapes).

Voxtral TTS¶

Voxtral-4B-TTS (Mistral). Has its own argparse surface; uses voice presets and the mistral_common SpeechRequest protocol.

Prerequisites¶

Latest mistral_common with SpeechRequest support:

pip install -e /path/to/mistral-common  # or upgrade from PyPI when available

Quick start (voice preset)¶

python examples/offline_inference/text_to_speech/voxtral_tts/end2end.py \
    --write-audio --voice cheerful_female \
    --model mistralai/Voxtral-4B-TTS-2603 \
    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"

Voice cloning (capability gated upstream)¶

python examples/offline_inference/text_to_speech/voxtral_tts/end2end.py \
    --write-audio \
    --model mistralai/Voxtral-4B-TTS-2603 \
    --text "This is a test message." \
    --ref-audio path/to/reference_audio.wav

Streaming + concurrency¶

python examples/offline_inference/text_to_speech/voxtral_tts/end2end.py \
    --num-prompts 32 --concurrency 8 --streaming --write-audio --voice neutral_female \
    --model mistralai/Voxtral-4B-TTS-2603 \
    --text "..."

Available voice presets are listed on the HF model card (mistralai/Voxtral-4B-TTS-2603).

Notes¶

--num-prompts N replicates the prompt for performance measurement.
--concurrency M requires --streaming and must evenly divide --num-prompts.
Run --help for the full argument surface.
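The two constraints on --concurrency can be validated before starting a long benchmark run. This is a minimal sketch of those rules as stated in the notes; check_benchmark_args is a hypothetical helper and does not reflect how the script itself validates its arguments.

```python
from __future__ import annotations


def check_benchmark_args(num_prompts: int, concurrency: int | None, streaming: bool) -> None:
    """Raise ValueError if the flag combination violates the documented rules."""
    if concurrency is None:
        return  # no concurrency requested, nothing to check
    if not streaming:
        raise ValueError("--concurrency requires --streaming")
    if concurrency < 1 or num_prompts % concurrency != 0:
        raise ValueError("--concurrency must evenly divide --num-prompts")
```

The combination from the example above, check_benchmark_args(32, 8, True), passes; a concurrency of 5 with 32 prompts is rejected.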

SoulX-Singer¶

Singing voice synthesis (SVS) and conversion (SVC) at 24 kHz. Script: soulxsinger/end2end.py. Deploy: vllm_omni/deploy/soulxsinger_svs.yaml or soulxsinger_svc.yaml.

Prerequisites¶

Download DiT and preprocess weights, then set up separate SVS / SVC view directories. Copy soulxsinger/utils/phoneme/phone_set.json from upstream SoulX-Singer into the model weights dir as phoneme/phone_set.json — HuggingFace does not ship it.

# 1. DiT weights
export BASE=path/to/SoulX-Singer
export PREPROCESS=path/to/SoulX-Singer-Preprocess
export SVC_DIR=path/to/SoulX-Singer-svc

huggingface-cli download Soul-AILab/SoulX-Singer --local-dir "$BASE"

# 2. Preprocess weights (required)
huggingface-cli download Soul-AILab/SoulX-Singer-Preprocess --local-dir "$PREPROCESS"
export SOULX_PREPROCESS_WEIGHTS_DIR="$PREPROCESS"

# 3. SVS / SVC view directories
mkdir -p "$SVC_DIR"
cp -r "$BASE"/{config.yaml,README.md,assets} "$SVC_DIR"/
mv "$BASE/model-svc.pt" "$SVC_DIR/model-svc.pt"

cat > "$BASE/config.json" <<'EOF'
{
  "model_type": "soulxsinger",
  "architectures": ["SoulXSingerPipeline"],
  "max_num_seqs": 1
}
EOF

cat > "$SVC_DIR/config.json" <<'EOF'
{
  "model_type": "soulxsinger",
  "architectures": ["SoulXSingerSVCPipeline"],
  "max_num_seqs": 1
}
EOF

config.yaml hyper-parameters live under $BASE; each view's config.json architectures field is the single source of truth for SVS vs SVC. Point --model at the matching directory ($BASE for SVS, $SVC_DIR for SVC). Deploy YAML is chosen automatically from config.json; optional --svs / --svc only assert the mode matches.
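Since the architectures field decides the mode, a quick check of a view directory can catch a mismatched --model path early. The sketch below reads config.json and maps the two architecture names shown above to a mode; detect_mode is a hypothetical helper and is not how vLLM-Omni itself resolves the deploy YAML.

```python
import json
from pathlib import Path

# Architecture names from the two config.json files above.
ARCH_TO_MODE = {
    "SoulXSingerPipeline": "svs",
    "SoulXSingerSVCPipeline": "svc",
}


def detect_mode(model_dir: str) -> str:
    """Return 'svs' or 'svc' based on the architectures field of config.json."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    for arch in config.get("architectures", []):
        if arch in ARCH_TO_MODE:
            return ARCH_TO_MODE[arch]
    raise ValueError(f"{config_path}: no SoulX-Singer architecture found")
```

Running detect_mode on $BASE should return "svs" and on $SVC_DIR should return "svc" once both config.json files are in place.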

Online preprocess is the default: pass --prompt-audio and --target-audio, and the worker runs vocal separation, F0, and (for SVS) lyrics/MIDI before DiT. Install only what your run needs:

pip install "BS-RoFormer"   # vocal sep + F0 on GPU — SVS and SVC

Mandarin SVS also needs FunASR and Chinese G2P; ffmpeg must be on PATH:

# install optional dependencies:
pip install -e ".[soulx-svs]"

English SVS adds NeMo ASR and NLTK data; pass --language English:

pip install "nemo_toolkit[asr]==2.6.1" lhotse==1.32.2
python -c "import nltk; nltk.download('cmudict'); nltk.download('averaged_perceptron_tagger_eng')"

Precomputed metadata is the alternative: pass both --prompt-metadata-path and --target-metadata-path and skip online ASR/ROSVOT — none of the packages above are required. JSON can be produced by integrated preprocess on a prior run, or by upstream SoulX-Singer preprocess/ scripts if you prefer to run that outside vLLM-Omni.

Quick start¶

# SVS — default demo audio: tests/assets/soulxsinger/zh_prompt.mp3 + music.mp3
python examples/offline_inference/text_to_speech/soulxsinger/end2end.py \
    --model "$BASE" \
    --preprocess-weights-dir "$PREPROCESS" \
    --control score \
    --num-inference-steps 32 \
    -o output.wav

# SVC (singing voice conversion)
python examples/offline_inference/text_to_speech/soulxsinger/end2end.py \
    --model "$SVC_DIR" \
    --preprocess-weights-dir "$PREPROCESS" \
    --svc \
    --num-inference-steps 32 \
    -o output_svc.wav

SOULX_PREPROCESS_WEIGHTS_DIR makes --preprocess-weights-dir optional. Long SVS targets are handled in one request. See end2end.py --help for --pitch-shift, --vocal-sep, --auto-shift, and language/control options.

Notes¶

Output: 24 kHz mono WAV; batch only.
Defaults match upstream: --guidance-scale 3.0, --seed 42, --auto-shift on.
SVS --control: score or melody. MIDI / lyric QC: upstream midi_editor only.

Example materials¶

cosyvoice3/end2end.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import os
import urllib.request
from pathlib import Path

import numpy as np
import soundfile as sf
from vllm import SamplingParams
from vllm.multimodal.media.audio import load_audio

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.model_executor.models.cosyvoice3.tokenizer import get_qwen_tokenizer
from vllm_omni.model_executor.models.cosyvoice3.utils import extract_text_token
from vllm_omni.transformers_utils.configs.cosyvoice3 import CosyVoice3Config

# Upstream zero-shot reference clip
ZERO_SHOT_PROMPT_URL = "https://raw.githubusercontent.com/FunAudioLLM/CosyVoice/main/asset/zero_shot_prompt.wav"


def _default_ref_audio() -> str:
    # Download the upstream zero_shot_prompt.wav into the current dir
    dest = Path("zero_shot_prompt.wav")
    if not dest.exists() or dest.stat().st_size == 0:
        print(f"Downloading default reference audio to {dest}")
        urllib.request.urlretrieve(ZERO_SHOT_PROMPT_URL, dest)

    return str(dest)


def run_e2e():
    parser = argparse.ArgumentParser()
    # Model repo on Hugging Face: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
    parser.add_argument(
        "--model",
        type=str,
        required=True,
        help="Path to CosyVoice3 model directory (e.g., pretrained_models/Fun-CosyVoice3-0.5B/).",
    )
    parser.add_argument(
        "--deploy-config",
        type=str,
        default=None,
        help="Override the deploy config path. If unset, auto-loads "
        "vllm_omni/deploy/cosyvoice3.yaml based on the HF model_type.",
    )
    parser.add_argument("--text", type=str, default="Hello, this is a test of the CosyVoice system capability.")
    parser.add_argument(
        "--prompt-text",
        type=str,
        default="You are a helpful assistant.<|endofprompt|>希望你以后，能够做的比我还好呦!",
    )
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Path to reference audio for voice cloning. "
        "If unset, downloads the upstream CosyVoice3 zero-shot prompt audio clip",
    )
    parser.add_argument(
        "--tokenizer",
        type=str,
        required=True,
        help="Path to tokenizer directory (e.g., <model_path>/CosyVoice-BlankEN).",
    )
    args = parser.parse_args()
    # Ensure tokenizer directory exists
    if not os.path.exists(args.tokenizer):
        raise FileNotFoundError(f"{args.tokenizer} does not exist!")

    if args.deploy_config is not None and not os.path.exists(args.deploy_config):
        raise FileNotFoundError(f"{args.deploy_config} does not exist!")

    print(f"Initializing cosyvoice E2E with model={args.model}")

    omni = Omni(
        model=args.model,
        deploy_config=args.deploy_config,
        tokenizer=args.tokenizer,
        log_stats=True,
    )

    sampling_cfg = {"top_p": 0.8, "top_k": 25, "eos_token_id": 6561 + 1}

    print("Model initialized. Preparing inputs...")
    ref_audio_path = args.ref_audio or _default_ref_audio()
    if not os.path.exists(ref_audio_path):
        raise FileNotFoundError(f"Audio file not found: {ref_audio_path}")
    # Load at native sample rate
    audio_signal, sr = load_audio(ref_audio_path, sr=None)

    # Validate sample rate before processing (similar to original CosyVoice)
    min_sr = 16000
    if sr < min_sr:
        raise ValueError(
            f"Audio sample rate {sr} Hz is too low. "
            f"Minimum required: {min_sr} Hz. "
            f"Please provide audio with sample rate >= {min_sr} Hz."
        )

    audio_data = (audio_signal.astype(np.float32), sr)

    prompts = {
        "prompt": args.text,
        "multi_modal_data": {
            "audio": audio_data,
        },
        "mm_processor_kwargs": {
            "prompt_text": args.prompt_text,
            "sample_rate": audio_data[1],
        },
    }

    print(f"Generating for prompt: {args.text}")

    config = CosyVoice3Config()
    tokenizer = get_qwen_tokenizer(
        token_path=args.tokenizer,
        skip_special_tokens=config.skip_special_tokens,
        version=config.version,
    )
    _, text_token_len = extract_text_token(args.text, tokenizer, config.allowed_special)
    base_len = int(text_token_len)
    min_len = int(base_len * config.min_token_text_ratio)
    max_len = int(base_len * config.max_token_text_ratio)

    # Build SamplingParams for each stage (stage 0: talker, stage 1: code2wav)
    gpt_sampling = SamplingParams(
        temperature=1.0,
        top_p=sampling_cfg["top_p"],
        top_k=sampling_cfg["top_k"],
        repetition_penalty=2.0,
        min_tokens=min_len,
        max_tokens=max_len,
        stop_token_ids=[sampling_cfg["eos_token_id"]],
        # allowed_token_ids=list(range(6561+3)),
        detokenize=False,
    )
    # Placeholder for stage 1 (code2wav); not used.
    s2mel_sampling = SamplingParams(
        temperature=1.0,
        top_p=1.0,
        top_k=-1,
        repetition_penalty=2.0,
        max_tokens=256,
        detokenize=False,
    )

    sampling_params_list = [gpt_sampling, s2mel_sampling]

    # Start profiling (requires VLLM_TORCH_PROFILER_DIR env var)
    if os.environ.get("VLLM_TORCH_PROFILER_DIR"):
        print("Starting profiler...")
        omni.start_profile()

    # Generate (Omni orchestrator requires a per-stage SamplingParams list)
    outputs = list(omni.generate(prompts, sampling_params_list=sampling_params_list[:2]))

    # Stop profiling and get results
    if os.environ.get("VLLM_TORCH_PROFILER_DIR"):
        print("Stopping profiler...")
        profile_results = omni.stop_profile()
        print(f"Profile traces saved to: {profile_results}")

    print(outputs)
    # Verify outputs
    print(f"Received {len(outputs)} outputs.")
    for i, output in enumerate(outputs):
        try:
            ro = output.request_output
            if ro is None:
                print("No request_output found.")
                continue

            # Multimodal output may be attached to RequestOutput or CompletionOutput.
            mm = getattr(ro, "multimodal_output", None)
            if not mm and ro.outputs:
                mm = getattr(ro.outputs[0], "multimodal_output", None)

            if mm:
                print(f"Multimodal output keys: {mm.keys()}")
                if "audio" in mm:
                    audio_out = mm["audio"]
                    print(f"Generated Audio Shape: {audio_out.shape}")
                    out_path = f"output_{i}.wav"
                    # CosyVoice3 output is 24 kHz (see the Supported Models table).
                    sf.write(out_path, audio_out.cpu().numpy().squeeze(), 24000)
                    print(f"Saved audio to {out_path}")
            else:
                print("No multimodal output found.")
        except Exception as e:
            print(f"Error inspecting output: {e}")
    omni.close()


if __name__ == "__main__":
    run_e2e()

fish_speech/end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_speech/fish_speech/end2end.py.

glm_tts/end2end.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""GLM-TTS End-to-End Offline Inference Example.

GLM-TTS is a two-stage TTS system:
  - Stage 0 (AR): Llama-based model generates speech tokens from text
  - Stage 1 (DiT): Flow matching model converts speech tokens to audio

Usage:
    # Sync two-stage (default)
    python examples/offline_inference/text_to_speech/glm_tts/end2end.py \
        --model /path/to/GLM-TTS \
        --text "你好，这是一个语音合成测试。" \
        --ref-audio /path/to/reference.wav \
        --ref-text "参考音频的转录文本。" \
        --output-dir ./output

    # Async chunk mode (streaming DiT)
    python examples/offline_inference/text_to_speech/glm_tts/end2end.py \
        --model /path/to/GLM-TTS --async-chunk \
        --text "你好，这是一个语音合成测试。" \
        --ref-audio /path/to/reference.wav \
        --ref-text "参考音频的转录文本。" \
        --output-dir ./output
"""

import base64
import io
import logging
import os
import tempfile
import time
from typing import Any
from urllib.request import urlopen

import numpy as np
import soundfile as sf
import torch
import yaml

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm.utils.argparse_utils import FlexibleArgumentParser

from vllm_omni import Omni
from vllm_omni.model_executor.models.glm_tts.glm_tts import build_glm_tts_prefill_metadata

logger = logging.getLogger(__name__)

DEFAULT_DEPLOY_CONFIG = os.path.join(
    os.path.dirname(__file__),
    "..",
    "..",
    "..",
    "..",
    "vllm_omni",
    "deploy",
    "glm_tts.yaml",
)
SAMPLE_RATE = 24000


def _load_ref_audio(ref_audio: str) -> tuple[torch.Tensor, int]:
    """Load reference audio from file path, URL, or data URI."""
    if ref_audio.startswith(("http://", "https://")):
        with urlopen(ref_audio, timeout=60) as response:
            audio_obj: Any = io.BytesIO(response.read())
    elif ref_audio.startswith("data:"):
        _, _, encoded = ref_audio.partition(",")
        audio_obj = io.BytesIO(base64.b64decode(encoded))
    else:
        audio_obj = ref_audio
    wav_np, sr = sf.read(audio_obj, dtype="float32")
    if wav_np.ndim > 1:
        wav_np = wav_np.mean(axis=1)
    return torch.from_numpy(wav_np), int(sr)


def _concat_audio(audio_val: Any) -> np.ndarray:
    """Concatenate audio tensors from multimodal output."""
    if isinstance(audio_val, list):
        tensors = [torch.as_tensor(t).float().reshape(-1) for t in audio_val if t is not None]
        if not tensors:
            return np.zeros((0,), dtype=np.float32)
        return torch.cat(tensors, dim=-1).cpu().numpy().astype(np.float32, copy=False)
    if isinstance(audio_val, torch.Tensor):
        return audio_val.float().cpu().numpy().reshape(-1)
    return np.asarray(audio_val, dtype=np.float32).reshape(-1)


def _extract_sample_rate(audio_mm: dict) -> int:
    """Extract sample rate from multimodal output dict."""
    sr_raw = audio_mm.get("sr", SAMPLE_RATE)
    if isinstance(sr_raw, list):
        sr_raw = sr_raw[-1] if sr_raw else SAMPLE_RATE
    if hasattr(sr_raw, "item"):
        return int(sr_raw.item())
    return int(sr_raw)


def _modify_deploy_config(base_path: str, async_chunk: bool) -> str:
    """Build deploy config with explicit sync/async mode and eager execution.

    Mirrors the logic in ``tests/e2e/offline_inference/test_glm_tts_expansion.py``
    (``_get_deploy_config``) so that example runs match CI behavior.
    """
    with open(base_path) as f:
        cfg = yaml.safe_load(f)
    cfg["async_chunk"] = async_chunk
    for stage in cfg.get("stages", []):
        stage["enforce_eager"] = True
        if stage.get("stage_id") == 0:
            stage["async_scheduling"] = bool(async_chunk)
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False, prefix="glm_tts_")
    yaml.dump(cfg, tmp)
    tmp.close()
    return tmp.name


def main(args):
    """Run offline GLM-TTS inference."""
    os.makedirs(args.output_dir, exist_ok=True)
    base_deploy_config = args.deploy_config or DEFAULT_DEPLOY_CONFIG
    deploy_config_path = _modify_deploy_config(base_deploy_config, args.async_chunk)

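    # Validate the transcript first so a bad invocation fails before the
    # (possibly remote) reference audio is fetched and decoded.
    if not args.ref_text:
        raise ValueError("GLM-TTS requires --ref-audio and --ref-text.")
    ref_audio_wav, ref_audio_sr = _load_ref_audio(args.ref_audio)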

    inputs = [
        {
            "prompt": args.text,
            "multi_modal_data": {
                "audio": (ref_audio_wav.float().cpu().numpy(), ref_audio_sr),
            },
            "modalities": ["audio"],
            "mm_processor_kwargs": {"prompt_text": args.ref_text},
            "additional_information": build_glm_tts_prefill_metadata(
                args.model,
                args.text,
                args.ref_text,
            ),
        }
    ]

    omni = Omni(
        model=args.model,
        stage_configs_path=deploy_config_path,
        log_stats=args.log_stats,
        stage_init_timeout=args.stage_init_timeout,
    )

    t_start = time.perf_counter()
    outputs = omni.generate(inputs)
    elapsed = (time.perf_counter() - t_start) * 1000

    assert outputs, "No outputs returned"
    audio_mm = outputs[0].multimodal_output
    assert "audio" in audio_mm, "No audio output found"

    audio = _concat_audio(audio_mm["audio"])
    sr = _extract_sample_rate(audio_mm)
    out_path = os.path.join(args.output_dir, "output.wav")
    sf.write(out_path, audio, samplerate=sr, format="WAV")

    logger.info("Saved %s (%.2fs @ %dHz)", out_path, len(audio) / sr, sr)
    logger.info("Total inference: %.1f ms", elapsed)


def parse_args():
    parser = FlexibleArgumentParser(description="GLM-TTS Text-to-Speech Example")
    parser.add_argument("--model", type=str, required=True, help="Model path")
    parser.add_argument("--text", type=str, default="你好，这是一个语音合成测试。")
    parser.add_argument("--output-dir", type=str, default="./output")
    parser.add_argument("--ref-audio", type=str, required=True, help="Reference WAV path/URL")
    parser.add_argument("--ref-text", type=str, required=True, help="Transcript of ref audio")
    parser.add_argument("--deploy-config", type=str, default=None)
    parser.add_argument(
        "--async-chunk",
        action="store_true",
        default=False,
        help="Enable async_chunk mode (streaming DiT). Default: sync two-stage.",
    )
    parser.add_argument("--log-stats", action="store_true")
    parser.add_argument("--stage-init-timeout", type=int, default=600)
    return parser.parse_args()


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    main(parse_args())
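
The overrides that `_modify_deploy_config` applies before launch can be examined without touching disk. The following is a minimal sketch of the same rules applied to an in-memory dict. The helper name `apply_glm_tts_overrides` and the two-stage sample layout are illustrative assumptions; the real `glm_tts.yaml` carries more fields than shown here.

```python
import copy
from typing import Any


def apply_glm_tts_overrides(cfg: dict[str, Any], async_chunk: bool) -> dict[str, Any]:
    """Return a copy of cfg with the overrides end2end.py writes to its temp YAML."""
    out = copy.deepcopy(cfg)  # leave the caller's dict untouched
    out["async_chunk"] = async_chunk
    for stage in out.get("stages", []):
        stage["enforce_eager"] = True  # every stage is forced to eager execution
        if stage.get("stage_id") == 0:
            # Only stage 0 has async scheduling tied to the chunk mode.
            stage["async_scheduling"] = bool(async_chunk)
    return out


# Hypothetical minimal layout, used only to show which keys change.
base = {"stages": [{"stage_id": 0}, {"stage_id": 1}]}
sync_cfg = apply_glm_tts_overrides(base, async_chunk=False)
async_cfg = apply_glm_tts_overrides(base, async_chunk=True)
print(async_cfg["stages"][0])
```

Note that the script itself writes the modified config to a temporary file created with `delete=False`, so that file remains on disk after the run.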

higgs_audio_v2/README.md

higgs-audio v2 — offline example¶

Drives Stage 0 (DualFFN talker) + Stage 1 (HiggsAudio codec) for bosonai/higgs-audio-v2-generation-3B-base end-to-end through the vLLM-Omni engine and writes a 24 kHz mono WAV per prompt.

Prerequisites¶

Voice cloning requires transformers>=5.3.0. vllm-omni loads the audio codec through Hugging Face's HiggsAudioV2TokenizerModel, instantiated from the audio_tokenizer/ subdirectory of k2-fsa/OmniVoice (only that ~806 MB subdirectory is downloaded). The boson-ai standalone tokenizer repo ships a model.safetensors that is actually a copy of the 3B talker LM, so HF cannot load it directly; the k2-fsa bundle ships the same codec weights repackaged with HF-compatible key names.

pip install -U "transformers>=5.3.0"
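
To confirm the installed version before attempting voice cloning, a small check like the one below can help. It is a sketch only: `parse_release` and `meets_minimum` are hypothetical helpers written for this example, and the parser looks only at the leading numeric release segment, so a pre-release such as `5.3.0rc1` is treated as `5.3.0`.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = (5, 3, 0)


def parse_release(ver: str) -> tuple[int, ...]:
    """Parse the leading numeric release segment, e.g. '5.3.0.dev0' -> (5, 3, 0)."""
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component
        parts.append(int(digits))
    return tuple(parts)


def _pad(release: tuple[int, ...], width: int) -> tuple[int, ...]:
    return release + (0,) * (width - len(release))


def meets_minimum(ver: str, minimum: tuple[int, ...] = MIN_TRANSFORMERS) -> bool:
    """Compare release tuples after padding, so '5.3' counts as '5.3.0'."""
    release = parse_release(ver)
    width = max(len(release), len(minimum))
    return _pad(release, width) >= _pad(minimum, width)


try:
    installed = version("transformers")
    status = "ok" if meets_minimum(installed) else "too old for voice cloning"
except PackageNotFoundError:
    installed, status = None, "not installed"
print(f"transformers {installed}: {status}")
```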

Quick start¶

Plain TTS:

python examples/offline_inference/text_to_speech/higgs_audio_v2/end2end.py \
    --texts "Hello world." "The quick brown fox jumps over the lazy dog." \
    --output-dir results/wavs
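
Each prompt is written to its own WAV file, named after a slug of the prompt text. The sketch below copies the rules of the `_slugify` helper in `end2end.py` to show which filenames the command above would produce; `wav_name` is a hypothetical wrapper added for illustration.

```python
import re


def slugify(text: str) -> str:
    """Same rules as `_slugify` in end2end.py: lowercase, whitespace to
    underscores, drop anything outside [a-z0-9_], cap at 48 characters."""
    slug = re.sub(r"\s+", "_", text.strip().lower())
    slug = re.sub(r"[^a-z0-9_]+", "", slug)
    return slug[:48] or "prompt"


def wav_name(text: str) -> str:
    return f"{slugify(text)}.wav"


for prompt in ("Hello world.", "The quick brown fox jumps over the lazy dog."):
    print(wav_name(prompt))
```

Two consequences follow from these rules: prompts whose slugs share the same first 48 characters map to the same file, and a prompt with no ASCII letters or digits (for example, Chinese text) falls back to `prompt.wav`. In both cases a later prompt overwrites an earlier one in the same run.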

Voice cloning¶

Pass both --ref-audio and --ref-text together:

python examples/offline_inference/text_to_speech/higgs_audio_v2/end2end.py \
    --texts "Hello, this is a cloned voice." \
    --ref-audio /path/to/reference.wav \
    --ref-text  "Exact transcript spoken in reference.wav." \
    --output-dir results/wavs
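
The pairing rule is enforced with a simple either-both-or-neither check. A minimal sketch of that check follows; it uses `parser.error` from the standard `argparse` module, whereas the script itself raises `SystemExit` with the same message, and `parse_clone_args` is a hypothetical name.

```python
import argparse


def parse_clone_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--ref-audio", default=None)
    parser.add_argument("--ref-text", default=None)
    args = parser.parse_args(argv)
    # Exactly one of the pair being set is an error; both or neither is fine.
    if (args.ref_audio is None) != (args.ref_text is None):
        parser.error("--ref-audio and --ref-text must be supplied together")
    return args


ok = parse_clone_args(["--ref-audio", "ref.wav", "--ref-text", "Exact transcript."])
plain = parse_clone_args([])
print(ok.ref_audio, plain.ref_audio)
```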

Notes¶

Output: 24 kHz mono WAV.
Deploy config: vllm_omni/deploy/higgs_audio_v2.yaml (auto-loaded by HF model_type).
--ref-text must be the real transcript of --ref-audio; mismatched text degrades cloned-voice quality.
For online serving, see examples/online_serving/text_to_speech/higgs_audio_v2/.
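
To confirm that a generated file matches the format stated above, the header can be inspected with the standard-library `wave` module, which reads the 16-bit PCM files the script writes. This is a sketch: `describe_wav` is a hypothetical helper, and the demo file is a stand-in so the example runs without the model.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getnframes() / rate


# Stand-in for a generated file: one second of 16-bit silence at 24 kHz mono.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 2 bytes per sample = PCM_16
    wf.setframerate(24_000)
    wf.writeframes(b"\x00\x00" * 24_000)

rate, channels, seconds = describe_wav(demo_path)
print(rate, channels, seconds)
```

Pointing `describe_wav` at a file under `results/wavs` should report 24000 Hz and 1 channel if the run behaved as documented.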

higgs_audio_v2/end2end.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Offline higgs-audio v2 inference example.

Runs Stage 0 (DualFFN talker) + Stage 1 (HiggsAudio codec) end-to-end
through the vLLM-Omni engine without going through the HTTP server, and
saves a 24 kHz mono WAV per prompt.

Example:

    python examples/offline_inference/text_to_speech/higgs_audio_v2/end2end.py \\
        --texts "Hello world." \\
                "The quick brown fox jumps over the lazy dog." \\
        --output-dir results/wavs
"""

from __future__ import annotations

import os

# DeepGEMM FP8 kernels require an optional backend that may not be installed.
# Disable the warmup before importing vLLM so engine startup falls back to the
# regular gemm path. Users with deep_gemm installed can override these.
os.environ.setdefault("VLLM_USE_DEEP_GEMM", "0")
os.environ.setdefault("VLLM_MOE_USE_DEEP_GEMM", "0")

import time
from pathlib import Path

import numpy as np
import soundfile as sf
import torch

from vllm_omni import Omni
from vllm_omni.utils.tracking_parser import TrackingArgumentParser

SAMPLE_RATE = 24_000
DEFAULT_TEXTS = (
    "Hello world.",
    "The quick brown fox jumps over the lazy dog.",
)


def parse_args():
    parser = TrackingArgumentParser(description="Offline higgs-audio v2 inference")
    parser.add_argument(
        "--model",
        type=str,
        default="bosonai/higgs-audio-v2-generation-3B-base",
        help="Stage-0 talker model id or path.",
    )
    parser.add_argument(
        "--texts",
        type=str,
        nargs="+",
        default=list(DEFAULT_TEXTS),
        help="One or more plain-text prompts to synthesize.",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="results/wavs",
        help="Directory to write per-prompt WAV files.",
    )
    parser.add_argument(
        "--deploy-config",
        type=str,
        default=None,
        help="Override the deploy config path. Auto-loads "
        "vllm_omni/deploy/higgs_audio_v2.yaml from the HF model_type by default.",
    )
    parser.add_argument(
        "--max-new-tokens",
        type=int,
        default=500,
        help="Cap on Stage-0 codec frames per request.",
    )
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference clip for voice clone (path to a WAV file). Paired with --ref-text.",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcript of the reference clip. Required when --ref-audio is set.",
    )
    return parser.parse_args()


def _slugify(text: str) -> str:
    import re

    slug = re.sub(r"\s+", "_", text.strip().lower())
    slug = re.sub(r"[^a-z0-9_]+", "", slug)
    return slug[:48] or "prompt"


def _extract_pcm(multimodal_output: dict) -> torch.Tensor:
    """Pull the final concatenated PCM tensor out of a request's multimodal_output."""
    audio = multimodal_output.get("model_outputs")
    if audio is None:
        audio = multimodal_output.get("audio")
    if audio is None:
        raise ValueError(f"no audio key in multimodal_output: {list(multimodal_output.keys())}")
    if isinstance(audio, list):
        valid = [torch.as_tensor(a).float().cpu().reshape(-1) for a in audio if a is not None]
        if not valid:
            raise ValueError("audio list is empty")
        return torch.cat(valid, dim=0) if len(valid) > 1 else valid[0]
    return torch.as_tensor(audio).float().cpu().reshape(-1)


def _pcm_to_int16(pcm: torch.Tensor) -> np.ndarray:
    arr = pcm.numpy()
    if arr.dtype.kind == "f":
        arr = np.clip(arr, -1.0, 1.0)
        arr = (arr * 32767.0).astype(np.int16)
    else:
        arr = arr.astype(np.int16)
    return arr


def main():
    args = parse_args()
    if (args.ref_audio is None) != (args.ref_text is None):
        raise SystemExit("--ref-audio and --ref-text must be supplied together")
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    engine = Omni(model=args.model, deploy_config=args.deploy_config)

    # Build prompt_token_ids using the same path serving_speech.py takes online.
    from transformers import AutoProcessor

    from vllm_omni.model_executor.models.higgs_audio_v2.higgs_audio_v2_tokenizer import (
        build_plain_text_prompt,
        build_voice_clone_prompt,
        input_ids_to_python_list,
    )

    processor = AutoProcessor.from_pretrained(args.model, trust_remote_code=True)

    # Voice-clone path: load the reference clip once. The HF processor will
    # encode it via the bundled HiggsAudioV2TokenizerModel each time we build
    # a prompt below. This is cheap on CPU for a few-second clip.
    ref_wav: np.ndarray | None = None
    ref_sr: int | None = None
    if args.ref_audio is not None:
        ref_wav, ref_sr = sf.read(args.ref_audio, always_2d=False)
        if ref_wav.ndim == 2:
            ref_wav = ref_wav.mean(axis=1)

    print(f"Model       : {args.model}")
    print(f"Prompts     : {len(args.texts)}")
    print(f"Output dir  : {output_dir}")
    print(f"Voice clone : {'yes' if ref_wav is not None else 'no'}")

    # Run one prompt at a time. The Stage-0 talker's per-slot audio state is
    # request-scoped; submitting multiple prompts in the same engine.generate()
    # call would batch them in the AR runner and exercise a code path that is
    # not validated for this model yet.
    total_elapsed = 0.0
    total_dur = 0.0
    for text in args.texts:
        if ref_wav is not None:
            out = build_voice_clone_prompt(processor, text, ref_wav, int(ref_sr), args.ref_text)
            prompt = {
                "prompt_token_ids": out["prompt_token_ids"],
                # Bare tensors (NOT list-wrapped): the msgspec serializer in
                # vllm_omni.data_entry_keys routes torch.Tensor → tensor_data
                # and list[Tensor] → list_data (which silently strips tensors).
                "additional_information": {
                    "audio_input_ids": out["audio_input_ids"],
                    "audio_input_ids_mask": out["audio_input_ids_mask"],
                },
            }
        else:
            inputs = build_plain_text_prompt(processor, text)
            prompt = {"prompt_token_ids": input_ids_to_python_list(inputs)}
        t_start = time.perf_counter()
        outputs = engine.generate([prompt])
        elapsed = time.perf_counter() - t_start
        total_elapsed += elapsed

        mm = outputs[0].outputs[0].multimodal_output
        pcm = _extract_pcm(mm)
        slug = _slugify(text)
        out_path = output_dir / f"{slug}.wav"
        sf.write(str(out_path), _pcm_to_int16(pcm), SAMPLE_RATE, format="WAV", subtype="PCM_16")
        dur = pcm.numel() / SAMPLE_RATE
        total_dur += dur
        print(f"  {slug:<50} dur={dur:5.2f}s  -> {out_path}")

    rtf = total_elapsed / total_dur if total_dur > 0 else float("inf")
    print(f"Total infer : {total_elapsed:.2f}s  total audio: {total_dur:.2f}s  RTF: {rtf:.3f}")


if __name__ == "__main__":
    os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")
    main()
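
The summary line printed at the end reports a real-time factor (RTF): time spent inside `engine.generate` divided by the duration of the audio produced, so values below 1.0 mean synthesis ran faster than playback. Engine startup is outside the timed region. A minimal sketch of the same arithmetic, with hypothetical timings and a hypothetical helper name:

```python
SAMPLE_RATE = 24_000


def real_time_factor(
    elapsed_s: list[float],
    num_samples: list[int],
    sample_rate: int = SAMPLE_RATE,
) -> float:
    """Total generate() time divided by total audio duration."""
    total_elapsed = sum(elapsed_s)
    total_audio = sum(n / sample_rate for n in num_samples)
    # Mirror the script: report infinity when no audio was produced.
    return total_elapsed / total_audio if total_audio > 0 else float("inf")


# Hypothetical run: two prompts producing 2 s and 4 s of audio.
rtf = real_time_factor([0.6, 1.2], [48_000, 96_000])
print(f"RTF: {rtf:.3f}")
```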

higgs_audio_v3/end2end.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Offline higgs-audio v3 inference example.

Runs Stage 0 (Qwen3 talker) + Stage 1 (HiggsAudio codec) end-to-end
through the vLLM-Omni engine without going through the HTTP server, and
saves a 24 kHz mono WAV per prompt.

Example (plain TTS):

    python examples/offline_inference/text_to_speech/higgs_audio_v3/end2end.py \\
        --texts "Hello world." \\
        --output-dir results/higgs_v3_wavs

Example (voice clone):

    python examples/offline_inference/text_to_speech/higgs_audio_v3/end2end.py \\
        --texts "Hello world." \\
        --ref-audio path/to/reference.wav \\
        --ref-text "Transcript of the reference clip." \\
        --output-dir results/higgs_v3_wavs
"""

from __future__ import annotations

import os

os.environ.setdefault("VLLM_USE_DEEP_GEMM", "0")
os.environ.setdefault("VLLM_MOE_USE_DEEP_GEMM", "0")

import re
import time
from pathlib import Path

import numpy as np
import soundfile as sf
import torch

from vllm_omni import Omni

SAMPLE_RATE = 24_000
DEFAULT_TEXTS = (
    "Hello world.",
    "The quick brown fox jumps over the lazy dog.",
    "Today is a beautiful day for a walk in the park.",
)


def parse_args():
    import argparse

    parser = argparse.ArgumentParser(description="Offline higgs-audio v3 inference")
    parser.add_argument(
        "--model",
        type=str,
        default="bosonai/higgs-audio-v3-tts-4b",
        help="Stage-0 talker model id or path.",
    )
    parser.add_argument(
        "--texts",
        type=str,
        nargs="+",
        default=list(DEFAULT_TEXTS),
        help="One or more plain-text prompts to synthesize.",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="results/higgs_v3_wavs",
        help="Directory to write per-prompt WAV files.",
    )
    parser.add_argument(
        "--deploy-config",
        type=str,
        default=None,
        help="Override the deploy config path.",
    )
    parser.add_argument(
        "--max-new-tokens",
        type=int,
        default=2048,
        help="Cap on Stage-0 codec frames per request.",
    )
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio for voice clone (WAV/FLAC/MP3 path).",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcript of the reference audio. Optional but improves fidelity.",
    )
    return parser.parse_args()


def _slugify(text: str) -> str:
    slug = re.sub(r"\s+", "_", text.strip().lower())
    slug = re.sub(r"[^a-z0-9_]+", "", slug)
    return slug[:48] or "prompt"


def _extract_pcm(multimodal_output: dict) -> torch.Tensor:
    audio = multimodal_output.get("model_outputs")
    if audio is None:
        audio = multimodal_output.get("audio")
    if audio is None:
        raise ValueError(f"no audio key in multimodal_output: {list(multimodal_output.keys())}")
    if isinstance(audio, list):
        valid = [torch.as_tensor(a).float().cpu().reshape(-1) for a in audio if a is not None]
        if not valid:
            raise ValueError("audio list is empty")
        return torch.cat(valid, dim=0) if len(valid) > 1 else valid[0]
    return torch.as_tensor(audio).float().cpu().reshape(-1)


def _pcm_to_int16(pcm: torch.Tensor) -> np.ndarray:
    arr = pcm.numpy()
    if arr.dtype.kind == "f":
        arr = np.clip(arr, -1.0, 1.0)
        arr = (arr * 32767.0).astype(np.int16)
    else:
        arr = arr.astype(np.int16)
    return arr


def main():
    args = parse_args()
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    engine = Omni(model=args.model, deploy_config=args.deploy_config, trust_remote_code=True)

    from transformers import AutoTokenizer

    from vllm_omni.model_executor.models.higgs_audio_v3.higgs_audio_v3_tokenizer import (
        HiggsAudioV3TokenizerAdapter,
        apply_delay_pattern,
        encode_reference_audio,
    )

    tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=True)
    adapter = HiggsAudioV3TokenizerAdapter(tokenizer)

    # Load and encode reference audio once (if voice cloning)
    ref_codes_delayed: torch.Tensor | None = None
    if args.ref_audio is not None:
        ref_wav, ref_sr = sf.read(args.ref_audio, always_2d=False)
        if ref_wav.ndim == 2:
            ref_wav = ref_wav.mean(axis=1)
        ref_codes_raw = encode_reference_audio(ref_wav, int(ref_sr))
        ref_codes_delayed = apply_delay_pattern(ref_codes_raw)
        print(f"Reference   : {args.ref_audio}")
        print(f"Ref codes   : {ref_codes_raw.shape[0]} frames -> {ref_codes_delayed.shape[0]} delayed")
        if args.ref_text:
            print(f"Ref text    : {args.ref_text}")

    print(f"Model       : {args.model}")
    print(f"Prompts     : {len(args.texts)}")
    print(f"Output dir  : {output_dir}")
    print(f"Voice clone : {'yes' if ref_codes_delayed is not None else 'no'}")

    total_elapsed = 0.0
    total_dur = 0.0
    for text in args.texts:
        if ref_codes_delayed is not None:
            prompt_ids = adapter.build_prompt(
                text,
                num_ref_tokens=int(ref_codes_delayed.shape[0]),
                reference_text=args.ref_text,
            )
            prompt = {
                "prompt_token_ids": prompt_ids,
                "additional_information": {
                    "audio_input_ids": ref_codes_delayed.to(torch.long),
                    "audio_input_ids_mask": torch.ones(ref_codes_delayed.shape[0], dtype=torch.bool),
                },
            }
        else:
            prompt_ids = adapter.build_prompt(text)
            prompt = {"prompt_token_ids": prompt_ids}

        t_start = time.perf_counter()
        outputs = engine.generate([prompt])
        elapsed = time.perf_counter() - t_start
        total_elapsed += elapsed

        mm = outputs[0].outputs[0].multimodal_output
        pcm = _extract_pcm(mm)
        slug = _slugify(text)
        out_path = output_dir / f"{slug}.wav"
        sf.write(str(out_path), _pcm_to_int16(pcm), SAMPLE_RATE, format="WAV", subtype="PCM_16")
        dur = pcm.numel() / SAMPLE_RATE
        total_dur += dur
        print(f"  {slug:<50} dur={dur:5.2f}s  -> {out_path}")

    rtf = total_elapsed / total_dur if total_dur > 0 else float("inf")
    print(f"Total infer : {total_elapsed:.2f}s  total audio: {total_dur:.2f}s  RTF: {rtf:.3f}")


if __name__ == "__main__":
    main()
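
Both higgs-audio scripts clip float PCM to [-1, 1] and scale by 32767 before writing PCM_16. The following dependency-free sketch shows that conversion using the standard-library `array` module in place of NumPy; `float_pcm_to_int16` is a hypothetical name, and the per-sample loop is for clarity rather than speed.

```python
from array import array


def float_pcm_to_int16(samples: list[float]) -> array:
    """Clip to [-1.0, 1.0], scale by 32767, and truncate toward zero."""
    out = array("h")  # signed 16-bit integers
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        out.append(int(clipped * 32767.0))  # int() truncates toward zero
    return out


converted = float_pcm_to_int16([0.0, 1.0, -1.0, 2.5, -3.0, 0.5])
print(converted.tolist())
```

Scaling by 32767 rather than 32768 keeps the positive and negative peaks symmetric, so full-scale input never overflows the int16 range.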

indextts2/end2end.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Offline inference example for IndexTTS2 via vLLM-Omni.

Two-stage pipeline: GPT AR (Stage 0) → S2Mel + BigVGAN (Stage 1).
Output is 22050 Hz mono WAV.

Usage:
  python end2end.py \\
    --model /path/to/IndexTeam/IndexTTS-2 \\
    --text "你好，这是一个语音合成测试。" \\
    --ref-audio /path/to/ref.wav

  # With emotion audio:
  python end2end.py \\
    --model /path/to/IndexTeam/IndexTTS-2 \\
    --text "今天天气真好！" \\
    --ref-audio /path/to/ref.wav \\
    --emo-audio /path/to/happy.wav
"""

from __future__ import annotations

import os
from pathlib import Path

import soundfile as sf
import torch
from vllm import SamplingParams
from vllm.utils.argparse_utils import FlexibleArgumentParser

os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

from vllm_omni import Omni  # noqa: E402
from vllm_omni.model_executor.models.indextts2.prompt_utils import (  # noqa: E402
    build_indextts2_prefill_prompt_ids,
)


def build_request(
    model: str,
    text: str,
    ref_audio_path: str | None = None,
    emo_audio_path: str | None = None,
    emo_text: str | None = None,
    emo_vector: list[float] | None = None,
    emo_alpha: float | None = None,
    use_emo_text: bool = False,
    use_random: bool = False,
) -> dict:
    additional: dict = {"text": [text]}
    if ref_audio_path:
        additional["voice"] = [str(ref_audio_path)]
    if emo_audio_path:
        additional["emo_audio"] = [str(emo_audio_path)]
    if emo_text:
        additional["emo_text"] = [emo_text]
    if emo_vector is not None:
        additional["emo_vector"] = [emo_vector]
    if emo_alpha is not None:
        additional["emo_alpha"] = [emo_alpha]
    if use_emo_text:
        additional["use_emo_text"] = [True]
    if use_random:
        additional["use_random"] = [True]
    return {
        "prompt_token_ids": build_indextts2_prefill_prompt_ids(model, text),
        "additional_information": additional,
    }


def save_audio(waveform: torch.Tensor, path: str, sample_rate: int = 22050) -> None:
    audio_np = waveform.float().numpy()
    sf.write(path, audio_np, sample_rate)
    print(f"  Saved {path} ({audio_np.shape}, {sample_rate} Hz)")


def extract_audio(mm: dict) -> tuple[torch.Tensor | None, int]:
    audio = mm.get("audio")
    if audio is None:
        audio = mm.get("model_outputs")
    if isinstance(audio, list):
        chunks = [chunk.reshape(-1) for chunk in audio if isinstance(chunk, torch.Tensor) and chunk.numel() > 0]
        audio = torch.cat(chunks, dim=0) if chunks else None

    sr_val = mm.get("sr")
    if isinstance(sr_val, list):
        sr_val = sr_val[-1] if sr_val else None
    if hasattr(sr_val, "item"):
        sample_rate = int(sr_val.item())
    else:
        sample_rate = int(sr_val) if sr_val is not None else 22050

    return audio if isinstance(audio, torch.Tensor) else None, sample_rate


def main(args) -> None:
    omni = Omni(
        model=args.model,
        deploy_config=args.deploy_config,
        stage_init_timeout=args.stage_init_timeout,
    )

    # Stage 0: GPT AR (text → mel codes)
    gpt_sampling = SamplingParams(
        temperature=0.8,
        top_p=0.8,
        top_k=30,
        max_tokens=1500,
        repetition_penalty=10.0,
        stop_token_ids=[8193],
        seed=args.seed if args.seed is not None else 42,
        detokenize=False,
    )
    # Stage 1: S2Mel + BigVGAN (non-AR, params mostly ignored)
    s2mel_sampling = SamplingParams(
        temperature=0.0,
        max_tokens=65536,
        detokenize=True,
    )
    sampling_params = [gpt_sampling, s2mel_sampling]

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    print(f"Synthesizing: {args.text!r}")
    if args.ref_audio:
        print(f"  ref_audio: {args.ref_audio}")
    inputs = build_request(
        model=args.model,
        text=args.text,
        ref_audio_path=args.ref_audio,
        emo_audio_path=args.emo_audio,
        emo_text=args.emo_text,
        emo_vector=args.emo_vector,
        emo_alpha=args.emo_alpha,
        use_emo_text=args.use_emo_text,
        use_random=args.use_random,
    )

    for i, omni_out in enumerate(omni.generate(inputs, sampling_params_list=sampling_params)):
        mm = omni_out.multimodal_output
        if not mm:
            print(f"  [req {i}] No multimodal output.")
            continue
        audio, sr = extract_audio(mm)
        if audio is None:
            print(f"  [req {i}] No waveform in multimodal_output.")
            continue
        out_path = str(output_dir / f"output_{i}.wav")
        save_audio(audio.cpu(), out_path, sr)

    print("Done.")


def parse_args():
    parser = FlexibleArgumentParser(description="IndexTTS2 offline inference")
    parser.add_argument("--model", required=True, help="HF model path for IndexTTS2.")
    parser.add_argument("--text", default="你好，这是IndexTTS2语音合成测试。")
    parser.add_argument("--ref-audio", required=True, help="Reference audio for voice cloning.")
    parser.add_argument("--emo-audio", default=None, help="Emotion reference audio.")
    parser.add_argument("--emo-text", default=None, help="Emotion description text.")
    parser.add_argument(
        "--emo-vector",
        type=float,
        nargs=8,
        default=None,
        help="8-dim emotion vector: happy angry sad afraid disgusted melancholic surprised calm.",
    )
    parser.add_argument("--emo-alpha", type=float, default=None, help="Emotion weight in [0, 1].")
    parser.add_argument("--use-emo-text", action="store_true", help="Infer emotion vector from emo-text or text.")
    parser.add_argument("--use-random", action="store_true", help="Use random emotion prototypes.")
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument(
        "--output-dir",
        default=os.path.join(
            os.environ.get("XDG_CACHE_HOME", os.path.join(os.path.expanduser("~"), ".cache")),
            "indextts2_output",
        ),
    )
    parser.add_argument("--deploy-config", default=None)
    parser.add_argument("--stage-init-timeout", type=int, default=600)
    return parser.parse_args()


if __name__ == "__main__":
    main(parse_args())
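
The sample-rate handling in `extract_audio` accepts several shapes for the `sr` entry: a plain number, a scalar tensor, a list of either, or nothing at all. The sketch below isolates that normalization so each case can be checked without loading the model. `normalize_sample_rate` and `FakeScalar` are hypothetical names; `FakeScalar` stands in for a 0-dim tensor.

```python
from typing import Any

DEFAULT_SR = 22050


def normalize_sample_rate(sr_val: Any, default: int = DEFAULT_SR) -> int:
    """Reduce the `sr` entry of a multimodal output dict to a plain int."""
    if isinstance(sr_val, list):
        # Per-chunk values: take the most recent one, if any.
        sr_val = sr_val[-1] if sr_val else None
    if hasattr(sr_val, "item"):
        return int(sr_val.item())  # scalar tensor or NumPy scalar
    return int(sr_val) if sr_val is not None else default


class FakeScalar:
    """Stand-in for a 0-dim tensor; only .item() is used."""

    def __init__(self, value: int) -> None:
        self.value = value

    def item(self) -> int:
        return self.value


for raw in (None, [], 22050, [24000, 22050], FakeScalar(22050)):
    print(normalize_sample_rate(raw))
```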

ming_flash_omni_tts/end2end.py

"""Offline e2e example for Ming-flash-omni-2.0 standalone talker (TTS)."""

import os
from typing import Any

import soundfile as sf
import torch

from vllm_omni.utils.tracking_parser import TrackingArgumentParser

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniTokensPrompt
from vllm_omni.model_executor.models.ming_flash_omni.prompt_utils import (
    DEFAULT_PROMPT,
    create_instruction,
)

MODEL_NAME = "Jonathan1909/Ming-flash-omni-2.0"


def get_messages(case: str, text_override: str | None) -> dict[str, Any]:
    if case == "style":
        text = text_override or "我会一直在这里陪着你，直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗？"
        instruction = create_instruction(
            {
                "风格": "这是一种ASMR耳语，属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语，声音气音成分重。音量极低，紧贴麦克风，语速极慢，旨在制造触发听者颅内快感的声学刺激。",
            }
        )
        return {
            "prompt": DEFAULT_PROMPT,
            "text": text,
            "instruction": instruction,
            "use_zero_spk_emb": True,
        }
    if case == "ip":
        text = text_override or "这款产品的名字，叫变态坑爹牛肉丸。"
        return {
            "prompt": DEFAULT_PROMPT,
            "text": text,
            "instruction": create_instruction({"IP": "灵小甄"}),
            "use_zero_spk_emb": True,
        }
    if case == "basic":
        text = text_override or "我们当迎着阳光辛勤耕作，去摘取，去制作，去品尝，去馈赠。"
        return {
            "prompt": DEFAULT_PROMPT,
            "text": text,
            "instruction": create_instruction({"语速": "快速", "基频": "中", "音量": "中"}),
            "use_zero_spk_emb": True,
        }
    raise ValueError(f"Unknown case: {case}")


def save_audio(mm: dict[str, Any], output_path: str) -> None:
    if not mm or "audio" not in mm:
        raise RuntimeError("No audio found in model output")
    audio = mm["audio"]
    sr_raw = mm.get("sr", 44100)
    if isinstance(sr_raw, torch.Tensor):
        sample_rate = int(sr_raw.item())
    else:
        sample_rate = int(sr_raw)
    waveform = audio.squeeze().float().cpu().numpy()
    sf.write(output_path, waveform, sample_rate)
    print(f"Saved {output_path} ({len(waveform) / sample_rate:.2f}s, {sample_rate}Hz)")


def parse_args():
    parser = TrackingArgumentParser(description="Ming-flash-omni standalone talker offline e2e example")
    parser.add_argument("--model", type=str, default=MODEL_NAME, help="Model name or local path.")
    parser.add_argument(
        "--deploy-config",
        type=str,
        default="vllm_omni/deploy/ming_flash_omni_tts.yaml",
        help="Path to a custom deploy YAML for the TTS deployment.",
    )
    parser.add_argument(
        "--case",
        type=str,
        default="style",
        choices=["style", "ip", "basic"],
        help="Example case.",
    )
    parser.add_argument("--text", type=str, default=None, help="Override default text for the selected case.")
    parser.add_argument("--output", type=str, default=None, help="Output wav path.")
    parser.add_argument("--log-stats", action="store_true", default=False, help="Enable stats logging.")
    parser.add_argument("--init-timeout", type=int, default=600, help="Engine init timeout in seconds.")
    parser.add_argument("--stage-init-timeout", type=int, default=300, help="Single stage init timeout in seconds.")

    return parser.parse_args()


def main():
    args = parse_args()

    omni = Omni(**vars(args))

    messages = get_messages(args.case, args.text)
    decode_args = {
        # Standalone TTS deployment
        "ming_task": "instruct",
        "max_decode_steps": 200,
        "cfg": 2.0,
        "sigma": 0.25,
        "temperature": 0.0,
    }
    req = OmniTokensPrompt(
        prompt_token_ids=[0],
        additional_information={**messages, **decode_args},
    )

    outputs = omni.generate(req)
    mm = outputs[0].outputs[0].multimodal_output

    output_path = args.output or f"output_{args.case}.wav"
    save_audio(mm, output_path)
    omni.close()


if __name__ == "__main__":
    main()
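
`main` builds `additional_information` by unpacking two dicts, `{**messages, **decode_args}`. In Python the later dict wins on a key collision, so a decode setting would silently replace a same-named key returned by `get_messages`. The keys do not overlap in the script as written; the sketch below uses placeholder values and a hypothetical per-case override to show what would happen if they did.

```python
# Placeholder values; the real prompt comes from DEFAULT_PROMPT.
messages = {"prompt": "<default prompt>", "text": "hello", "use_zero_spk_emb": True}
decode_args = {"ming_task": "instruct", "max_decode_steps": 200, "temperature": 0.0}

# No shared keys: the merge is a plain union.
merged = {**messages, **decode_args}

# Hypothetical case that tries to set its own step budget.
case_with_override = {**messages, "max_decode_steps": 400}
decode_wins = {**case_with_override, **decode_args}   # script's ordering
case_wins = {**decode_args, **case_with_override}     # reversed ordering

print(decode_wins["max_decode_steps"], case_wins["max_decode_steps"])
```

If per-case overrides are ever added to `get_messages`, reversing the unpacking order is one way to let them take precedence over the shared decode defaults.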

ming_tts/README.md

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_speech/ming_tts/README.md.

ming_tts/cases.yaml

style:
  prompt: "Please generate speech based on the following description.\n"
  text: "我会一直在这里陪着你，直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗？"
  instruction:
    风格: >-
      这是一种ASMR耳语，属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语，声音气音成分重。音量极低，紧贴麦克风，语速极慢，旨在制造触发听者颅内快感的声学刺激。
  use_zero_spk_emb: true
  max_decode_steps: 200

ip:
  prompt: "Please generate speech based on the following description.\n"
  text: "这款产品的名字，叫变态坑爹牛肉丸。"
  instruction:
    IP: "灵小甄"
  use_zero_spk_emb: true
  max_decode_steps: 200

bgm:
  prompt: "Please generate music based on the following description.\n"
  text: " Genre: 电子舞曲. Mood: 自信 / 坚定. Instrument: 架子鼓. Theme: 节日. Duration: 30s."
  instruction: null
  use_zero_spk_emb: false
  max_decode_steps: 400

tta:
  prompt: "Please generate audio events based on given text.\n"
  text: "Thunder and a gentle rain"
  instruction: null
  use_zero_spk_emb: false
  max_decode_steps: 200
  cfg: 4.5
  sigma: 0.3
  temperature: 2.5

emotion:
  prompt: "Please generate speech based on the following description.\n"
  text: "我竟然抢到了陈奕迅的演唱会门票！太棒了！终于可以现场听一听他的歌声了！"
  instruction:
    情感: "高兴"
  requires_ref_audio: true
  auto_extract_speaker_embeddings: true
  max_decode_steps: 200

basic:
  prompt: "Please generate speech based on the following description.\n"
  text: "简单地说，这相当于惠普把消费领域市场拱手相让了。"
  instruction:
    语速: "快速"
    基频: "中"
    音量: "中"
  requires_ref_audio: true
  auto_extract_speaker_embeddings: true
  max_decode_steps: 200

dialect:
  prompt: "Please generate speech based on the following description.\n"
  text: "我觉得社会企业同个人都有责任"
  instruction:
    方言: "广粤话"
  requires_ref_audio: true
  auto_extract_speaker_embeddings: true
  max_decode_steps: 200

zero_shot:
  prompt: "Please generate speech based on the following description.\n"
  text: "我们的愿景是构建未来服务业的数字化基础设施，为世界带来更多微小而美好的改变。"
  instruction: null
  requires_ref_audio: true
  requires_ref_text: true
  auto_extract_speaker_embeddings: true
  max_decode_steps: 200

podcast:
  prompt: "Please generate speech based on the following description.\n"
  text: " speaker_1:你可以说一下，就大概说一下，可能虽然我也不知道，我看过那部电影没有。\n speaker_2:就是那个叫什么，变相一节课的嘛。\n speaker_1:嗯。\n speaker_2:一部搞笑的电影。\n speaker_1:一部搞笑的。\n"
  instruction: null
  prompt_text: " speaker_1:并且我们还要进行每个月还要考核 笔试的话还要进行笔试，做个，当服务员还要去笔试了\n speaker_2:对啊，这真的很奇怪，就是 单纯的因，单纯自己工资不高，只是因为可能人家那个店比较出名一点，就对你苛刻要求\n"
  requires_ref_audio_count: 2
  auto_extract_speaker_embeddings: true
  max_decode_steps: 200

speech_bgm:
  prompt: "Please generate speech based on the following description.\n"
  text: "此次业绩下滑原因，可归结为企业停止服务某些品牌，而带来的负面影响。"
  instruction:
    BGM:
      Genre: "当代古典音乐."
      Mood: "温暖 / 友善."
      Instrument: "电吉他"
      Theme: "节日."
      SNR: 10.0
      ENV: null
  requires_ref_audio: true
  auto_extract_speaker_embeddings: true
  max_decode_steps: 200

speech_sound:
  prompt: "Please generate speech based on the following description.\n"
  text: "此次业绩下滑原因，可归结为企业停止服务某些品牌，而带来的负面影响。"
  instruction:
    BGM:
      ENV: "Birds chirping"
      SNR: 10.0
      Genre: null
      Mood: null
      Instrument: null
      Theme: null
  requires_ref_audio: true
  auto_extract_speaker_embeddings: true
  max_decode_steps: 200
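
The `requires_ref_audio`, `requires_ref_text`, and `requires_ref_audio_count` keys read like preconditions on the reference inputs a case needs. The code that consumes them (`runner.py`) is not shown here, so the following is only a sketch of one plausible interpretation. `required_inputs` is a hypothetical helper, and giving the count key precedence over the boolean flag is an assumption.

```python
from typing import Any, Dict


def required_inputs(case: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize the reference inputs a case entry asks for (hypothetical helper)."""
    # Assumption: an explicit count, when present, wins over the boolean flag.
    count = case.get("requires_ref_audio_count")
    if count is None:
        count = 1 if case.get("requires_ref_audio", False) else 0
    return {
        "ref_audio_count": int(count),
        "ref_text": bool(case.get("requires_ref_text", False)),
    }


# Only the relevant keys are copied from the cases above.
zero_shot = {"requires_ref_audio": True, "requires_ref_text": True}
podcast = {"requires_ref_audio_count": 2}
style = {"use_zero_spk_emb": True}

for name, case in [("zero_shot", zero_shot), ("podcast", podcast), ("style", style)]:
    print(name, required_inputs(case))
```

Read this way, `style`, `ip`, `bgm`, and `tta` need no reference clip, `podcast` needs two, and `zero_shot` is the only case that also needs a transcript.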

ming_tts/end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_speech/ming_tts/end2end.py.

ming_tts/runner.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_speech/ming_tts/runner.py.

moss_tts/end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_speech/moss_tts/end2end.py.

moss_tts_nano/end2end.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Offline inference example for MOSS-TTS-Nano via vLLM-Omni.

Single-stage pipeline: the 0.1B AR LM and MOSS-Audio-Tokenizer-Nano codec
both run inside one generation stage. Output is 48 kHz mono WAV (the
upstream tokenizer is stereo at 48 kHz; the wrapper mixes down to mono so
the existing single-channel audio writer in vLLM-Omni stays correct).

MOSS-TTS-Nano upstream supports two modes (matching ``infer.py``):

* ``voice_clone`` (recommended): only ``--ref-audio`` is required.
* ``continuation``: ``--ref-audio`` + ``--ref-text`` together.

Usage:
  # Voice clone (recommended): ref audio only, no transcript needed.
  python end2end.py \\
    --text "Hello!" \\
    --ref-audio /path/to/ref.wav

  # Continuation: ref audio + its transcript.
  python end2end.py \\
    --text "Hello!" \\
    --ref-audio /path/to/ref.wav \\
    --ref-text "Transcript of the reference clip." \\
    --mode continuation

  # Sample reference clips ship in the upstream repo:
  #   https://github.com/OpenMOSS/MOSS-TTS-Nano/tree/main/assets/audio
  # e.g. zh_1.wav (Chinese), en_2.wav (English), jp_2.wav (Japanese).
"""

from __future__ import annotations

import os
from pathlib import Path

import soundfile as sf
import torch
from vllm import SamplingParams

from vllm_omni.utils.tracking_parser import TrackingArgumentParser

# Prevent multiprocessing from re-importing CUDA in the wrong context.
os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

from vllm_omni import Omni  # noqa: E402

MODEL = "OpenMOSS-Team/MOSS-TTS-Nano"


def build_request(
    text: str,
    prompt_audio_path: str,
    prompt_text: str | None = None,
    mode: str = "voice_clone",
    max_new_frames: int = 375,
    seed: int | None = None,
    audio_temperature: float = 0.8,
    audio_top_k: int = 25,
    audio_top_p: float = 0.95,
    text_temperature: float = 1.0,
) -> dict:
    """Build an Omni request payload for MOSS-TTS-Nano.

    Upstream's ``_resolve_inference_mode`` forbids ``prompt_text`` in
    ``voice_clone`` mode and requires it in ``continuation`` mode (with
    ``prompt_audio_path``), so we only forward ``prompt_text`` when it is
    actually supplied.
    """
    additional: dict = {
        "text": [text],
        "mode": [mode],
        "prompt_audio_path": [str(prompt_audio_path)],
        "max_new_frames": [max_new_frames],
        "audio_temperature": [audio_temperature],
        "audio_top_k": [audio_top_k],
        "audio_top_p": [audio_top_p],
        "text_temperature": [text_temperature],
    }
    if prompt_text is not None and prompt_text.strip():
        additional["prompt_text"] = [prompt_text]
    if seed is not None:
        additional["seed"] = [seed]

    return {
        "prompt": "<|im_start|>assistant\n",  # minimal placeholder prompt
        "additional_information": additional,
    }


def save_audio(waveform: torch.Tensor, path: str, sample_rate: int = 48000) -> None:
    """Write the model's mono waveform to ``path`` at ``sample_rate``.

    The model wrapper mixes the upstream tokenizer's stereo output down to
    mono before reaching the engine, so ``waveform`` is always 1-D here —
    no extra interleave/reshape is needed.
    """
    audio_np = waveform.float().numpy()
    sf.write(path, audio_np, sample_rate)
    print(f"  Saved {path} ({audio_np.shape}, {sample_rate} Hz)")


def main(args) -> None:
    omni = Omni(
        model=MODEL,
        deploy_config=args.deploy_config,
        stage_init_timeout=args.stage_init_timeout,
    )

    sampling_params = SamplingParams(
        temperature=1.0,
        top_p=1.0,
        top_k=50,
        max_tokens=4096,
        seed=args.seed if args.seed is not None else 42,
        detokenize=False,
    )

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    print(f"Synthesizing: {args.text!r}")
    print(f"  ref_audio: {args.ref_audio}")
    inputs = build_request(
        text=args.text,
        prompt_audio_path=args.ref_audio,
        prompt_text=args.ref_text,
        mode=args.mode,
        max_new_frames=args.max_new_frames,
        seed=args.seed,
        audio_temperature=args.audio_temperature,
        audio_top_k=args.audio_top_k,
        audio_top_p=args.audio_top_p,
        text_temperature=args.text_temperature,
    )
    params_list = sampling_params

    for stage_outputs in omni.generate(inputs, params_list):
        for i, req_output in enumerate(stage_outputs.request_output):
            for j, out in enumerate(req_output.outputs):
                mm = out.multimodal_output
                if mm is None:
                    print(f"  [req {i}] No audio output.")
                    continue
                audio = mm.get("audio")
                sr_tensor = mm.get("sr")
                if audio is None:
                    print(f"  [req {i}] No waveform in multimodal_output.")
                    continue
                sr = int(sr_tensor.item()) if sr_tensor is not None else 48000
                out_path = str(output_dir / f"output_{i}_{j}.wav")
                save_audio(audio.cpu(), out_path, sr)

    print("Done.")


def parse_args():
    parser = TrackingArgumentParser(description="MOSS-TTS-Nano offline inference")
    parser.add_argument("--text", default="Hello, this is MOSS-TTS-Nano speaking.", help="Text to synthesize.")
    parser.add_argument(
        "--ref-audio",
        required=True,
        help="Path to reference audio for voice cloning / continuation (required).",
    )
    parser.add_argument(
        "--ref-text",
        default=None,
        help=(
            "Optional transcript of --ref-audio. Required (and only meaningful) "
            "in --mode continuation; rejected by upstream in --mode voice_clone."
        ),
    )
    parser.add_argument("--mode", default="voice_clone", choices=["voice_clone", "continuation"])
    parser.add_argument("--max-new-frames", type=int, default=375, help="Max AR frames (~14s at default).")
    parser.add_argument("--seed", type=int, default=None, help="Random seed.")
    parser.add_argument("--audio-temperature", type=float, default=0.8)
    parser.add_argument("--audio-top-k", type=int, default=25)
    parser.add_argument("--audio-top-p", type=float, default=0.95)
    parser.add_argument("--text-temperature", type=float, default=1.0)
    parser.add_argument(
        "--output-dir",
        default=os.path.join(
            os.environ.get("XDG_CACHE_HOME", os.path.join(os.path.expanduser("~"), ".cache")),
            "moss_tts_nano_output",
        ),
        help="Directory for WAV outputs (default: $XDG_CACHE_HOME/moss_tts_nano_output, or ~/.cache/moss_tts_nano_output if XDG_CACHE_HOME is unset).",
    )
    parser.add_argument(
        "--deploy-config",
        default=None,
        help="Path to a deploy YAML; leave unset to auto-load vllm_omni/deploy/moss_tts_nano.yaml.",
    )
    parser.add_argument("--stage-init-timeout", type=int, default=120)
    return parser.parse_args()


if __name__ == "__main__":
    main(parse_args())
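
The docstrings above state that upstream rejects a transcript in `voice_clone` mode and requires one in `continuation` mode. The script forwards whatever it is given, so a mismatch only surfaces once the model runs. A minimal sketch of an early check follows. `check_mode_args` is a hypothetical helper that is not part of the script, and the error messages are illustrative.

```python
from typing import Optional


def check_mode_args(mode: str, ref_text: Optional[str]) -> None:
    """Raise ValueError when the mode and the transcript disagree (hypothetical helper)."""
    # Mirror build_request: a blank transcript counts as "not supplied".
    has_text = ref_text is not None and bool(ref_text.strip())
    if mode == "voice_clone" and has_text:
        raise ValueError("--ref-text is not accepted in voice_clone mode")
    if mode == "continuation" and not has_text:
        raise ValueError("--mode continuation requires --ref-text")
    if mode not in ("voice_clone", "continuation"):
        raise ValueError(f"unknown mode: {mode!r}")


check_mode_args("voice_clone", None)                     # passes silently
check_mode_args("continuation", "Transcript of clip.")   # passes silently
```

Calling such a check right after argument parsing would fail fast, before the engine is initialized.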

omnivoice/end2end.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""End-to-end OmniVoice TTS inference via vLLM-Omni.

Supports:
- Auto voice mode: text only → generated speech
- Voice cloning mode: text + reference audio → cloned voice speech

Usage:
    # Auto voice
    python end2end.py --model k2-fsa/OmniVoice --text "Hello world"

    # Voice cloning
    python end2end.py --model k2-fsa/OmniVoice --text "Hello" \\
        --ref-audio ref.wav --ref-text "reference transcription"
"""

import argparse
import os

import numpy as np
import soundfile as sf

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams


def run_e2e():
    parser = argparse.ArgumentParser(description="OmniVoice E2E TTS inference")
    parser.add_argument(
        "--model",
        type=str,
        default="k2-fsa/OmniVoice",
        help="Model name or path (HuggingFace or local)",
    )
    parser.add_argument(
        "--stage-config",
        type=str,
        default="vllm_omni/deploy/omnivoice.yaml",
    )
    parser.add_argument(
        "--text",
        type=str,
        default="Hello, this is a test of the OmniVoice text to speech system.",
    )
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Reference audio for voice cloning (WAV file)",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Transcription of reference audio",
    )
    parser.add_argument(
        "--lang",
        type=str,
        default=None,
        help="Language code (e.g., 'en', 'zh')",
    )
    parser.add_argument(
        "--instruct",
        type=str,
        default=None,
        help="Voice design instruction (e.g., 'female, low pitch, british accent')",
    )
    parser.add_argument(
        "--output",
        type=str,
        default="output.wav",
        help="Output audio file path",
    )
    parser.add_argument(
        "--stage-init-timeout",
        type=int,
        default=600,
        help="Stage initialization timeout in seconds",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=None,
        help="Random seed for generation",
    )
    args = parser.parse_args()

    if not os.path.exists(args.stage_config):
        raise FileNotFoundError(f"Stage config not found: {args.stage_config}")

    print(f"Initializing OmniVoice with model={args.model}")

    omni = Omni(
        model=args.model,
        stage_configs_path=args.stage_config,
        log_stats=True,
    )

    print("Model initialized. Preparing inputs...")

    # Build prompt
    mm_processor_kwargs = {}
    multi_modal_data = {}

    if args.ref_audio:
        if not os.path.exists(args.ref_audio):
            raise FileNotFoundError(f"Reference audio not found: {args.ref_audio}")

        from vllm.multimodal.media.audio import load_audio

        audio_signal, sr = load_audio(args.ref_audio, sr=None)
        multi_modal_data["audio"] = (audio_signal.astype(np.float32), sr)
        mm_processor_kwargs["ref_text"] = args.ref_text or ""
        mm_processor_kwargs["sample_rate"] = sr

    if args.lang:
        mm_processor_kwargs["lang"] = args.lang
    if args.instruct:
        mm_processor_kwargs["instruct"] = args.instruct

    prompts = {"prompt": args.text}
    if multi_modal_data:
        prompts["multi_modal_data"] = multi_modal_data
    if mm_processor_kwargs:
        prompts["mm_processor_kwargs"] = mm_processor_kwargs

    sampling_params_list = [OmniDiffusionSamplingParams(extra_args={"seed": args.seed})]

    print(f"Generating speech for: {args.text}")

    outputs = list(omni.generate(prompts, sampling_params_list=sampling_params_list))

    print(f"Received {len(outputs)} outputs.")
    for i, output in enumerate(outputs):
        try:
            ro = output.request_output
            if ro is None:
                print("No request_output found.")
                continue

            mm = getattr(ro, "multimodal_output", None)
            if not mm and ro.outputs:
                mm = getattr(ro.outputs[0], "multimodal_output", None)

            if mm:
                print(f"Multimodal output keys: {mm.keys()}")
                if "audio" in mm:
                    audio_out = mm["audio"]
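                    sr = mm.get("sr", 24000)
                    # "sr" may arrive as a tensor; soundfile expects a plain int.
                    sr = int(sr.item()) if hasattr(sr, "item") else int(sr)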
                    if isinstance(audio_out, np.ndarray):
                        audio_np = audio_out
                    else:
                        audio_np = audio_out.cpu().numpy().squeeze()
                    out_path = args.output if i == 0 else f"output_{i}.wav"
                    sf.write(out_path, audio_np, sr)
                    print(f"Saved audio to {out_path} ({sr}Hz, {len(audio_np) / sr:.2f}s)")
            else:
                print("No multimodal output found.")
        except Exception as e:
            print(f"Error inspecting output: {e}")

    omni.close()
    print("Done.")


if __name__ == "__main__":
    run_e2e()
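
The two modes listed in the docstring differ only in which optional keys end up in the prompt dict. The sketch below pulls the script's inline prompt assembly into a standalone function so the resulting shapes are easy to inspect. `build_prompt` is a hypothetical name, and the reference audio is passed as an opaque `(samples, sample_rate)` pair with made-up values.

```python
from typing import Any, Dict, Optional, Tuple


def build_prompt(
    text: str,
    ref_audio: Optional[Tuple[Any, int]] = None,  # (samples, sample_rate)
    ref_text: Optional[str] = None,
    lang: Optional[str] = None,
    instruct: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble the prompt dict following the same rules as the script."""
    prompt: Dict[str, Any] = {"prompt": text}
    mm_kwargs: Dict[str, Any] = {}
    if ref_audio is not None:
        _, sample_rate = ref_audio
        prompt["multi_modal_data"] = {"audio": ref_audio}
        mm_kwargs["ref_text"] = ref_text or ""
        mm_kwargs["sample_rate"] = sample_rate
    if lang:
        mm_kwargs["lang"] = lang
    if instruct:
        mm_kwargs["instruct"] = instruct
    if mm_kwargs:
        prompt["mm_processor_kwargs"] = mm_kwargs
    return prompt


auto_voice = build_prompt("Hello world")
cloned = build_prompt("Hello", ref_audio=([0.0, 0.1], 16000), ref_text="reference transcription")
print(auto_voice)
print(sorted(cloned))
```

In auto voice mode the dict carries only `prompt`. Supplying reference audio adds both `multi_modal_data` and `mm_processor_kwargs`.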

qwen3_tts/benchmark_prompts.txt

Hello, welcome to the voice synthesis benchmark test.
She said she would be here by noon, but nobody showed up.
The quick brown fox jumps over the lazy dog near the riverbank.
I can't believe how beautiful the sunset looks from up here on the mountain.
Please remember to bring your identification documents to the appointment tomorrow morning.
Have you ever wondered what it would be like to travel through time and visit ancient civilizations?
The restaurant on the corner serves the best pasta I have ever tasted in my entire life.
After the meeting, we should discuss the quarterly results and plan for the next phase.
Learning a new language takes patience, practice, and a genuine curiosity about other cultures.
The train leaves at half past seven, so we need to arrive at the station before then.
Could you please turn down the music a little bit, I'm trying to concentrate on my work.
It was a dark and stormy night when the old lighthouse keeper heard a knock at the door.

qwen3_tts/end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_speech/qwen3_tts/end2end.py.

qwen3_tts/word_timestamps.py

"""Offline Qwen3-TTS example with word-level timestamps.

This demo runs Qwen3-TTS offline, aligns the synthesized audio in memory with
the same shared forced aligner used by the streaming server path, then writes
the WAV and JSON sidecar:

    python examples/offline_inference/text_to_speech/qwen3_tts/word_timestamps.py \\
        --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \\
        --forced-aligner Qwen/Qwen3-ForcedAligner-0.6B

On machines without a local CUDA toolkit, set ``VLLM_USE_FLASHINFER_SAMPLER=0``
to avoid FlashInfer sampler JIT during warmup.
"""

from __future__ import annotations

import asyncio
import json
import os
from pathlib import Path
from typing import Any

os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")
# The in-process offline aligner consumes token_classify outputs. Keeping V1
# multiprocessing off avoids msgspec serialization issues for token-wise logits.
os.environ.setdefault("VLLM_ENABLE_V1_MULTIPROCESSING", "0")

import numpy as np
import soundfile as sf
import torch
from end2end import _estimate_prompt_len
from vllm.utils.argparse_utils import FlexibleArgumentParser

from vllm_omni import Omni
from vllm_omni.engine.arg_utils import nullify_stage_engine_defaults
from vllm_omni.utils.forced_aligner import align, build_forced_aligner_config


def _default_stage_config() -> str:
    repo_root = Path(__file__).resolve().parents[4]
    return str(repo_root / "vllm_omni" / "deploy" / "qwen3_tts.yaml")


def _build_custom_voice_input(args: Any) -> dict[str, Any]:
    additional_information = {
        "task_type": ["CustomVoice"],
        "text": [args.text],
        "language": [args.language],
        "speaker": [args.speaker],
        "instruct": [args.instructions],
        "max_new_tokens": [args.max_new_tokens],
    }
    return {
        "prompt_token_ids": [0] * _estimate_prompt_len(additional_information, args.model),
        "additional_information": additional_information,
    }


def _audio_tensor_and_sample_rate(mm: dict[str, Any]) -> tuple[torch.Tensor, int]:
    audio_data = mm["audio"]
    sr_raw = mm["sr"]
    sr_val = sr_raw[-1] if isinstance(sr_raw, list) and sr_raw else sr_raw
    sample_rate = int(sr_val.item()) if hasattr(sr_val, "item") else int(sr_val)
    audio_tensor = torch.cat(audio_data, dim=-1) if isinstance(audio_data, list) else audio_data
    return audio_tensor.float().detach().cpu().flatten(), sample_rate


def _float_audio_to_pcm16_bytes(audio: torch.Tensor) -> bytes:
    samples = audio.numpy()
    pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")
    return pcm.tobytes()


async def _run_alignment(args: Any, audio: torch.Tensor, sample_rate: int) -> list[dict[str, Any]] | None:
    aligner_config = build_forced_aligner_config(args)
    if aligner_config is None:
        raise ValueError("--forced-aligner or --forced-aligner-config is required")
    timestamps = await align(
        audio=_float_audio_to_pcm16_bytes(audio),
        text=args.text,
        sample_rate=sample_rate,
        config=aligner_config,
        language=args.language,
    )
    if timestamps is None:
        return None
    return [
        {
            "word": item.word,
            "start_ms": item.start_ms,
            "end_ms": item.end_ms,
        }
        for item in timestamps
    ]


def main(args: Any) -> None:
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    omni_kwargs = vars(args).copy()
    for key in (
        "forced_aligner",
        "text",
        "language",
        "speaker",
        "instructions",
        "max_new_tokens",
        "output_dir",
    ):
        omni_kwargs.pop(key, None)
    omni_kwargs["stage_configs_path"] = args.stage_configs_path or _default_stage_config()
    omni_kwargs["log_stats"] = args.log_stats
    omni = Omni(**omni_kwargs)

    prompt = _build_custom_voice_input(args)
    final_output = None
    for stage_outputs in omni.generate([prompt]):
        final_output = stage_outputs.request_output
    if final_output is None:
        raise RuntimeError("Qwen3-TTS did not produce an output.")

    mm = final_output.outputs[0].multimodal_output
    audio, sample_rate = _audio_tensor_and_sample_rate(mm)

    wav_path = output_dir / "qwen3_tts_word_timestamps.wav"
    # Align the in-memory tensor, not the saved WAV. The WAV is only a demo
    # artifact so users can listen to the same audio referenced by the sidecar.
    timestamps = asyncio.run(_run_alignment(args, audio, sample_rate))
    sf.write(wav_path, audio.numpy(), samplerate=sample_rate, format="WAV")

    sidecar = {
        "text": args.text,
        "sample_rate": sample_rate,
        "audio_path": str(wav_path),
        "timestamps": timestamps,
    }
    json_path = output_dir / "qwen3_tts_word_timestamps.json"
    json_path.write_text(json.dumps(sidecar, ensure_ascii=False, indent=2) + "\n", encoding="utf-8")

    print(f"Saved audio: {wav_path}")
    print(f"Saved timestamps: {json_path}")
    print(json.dumps(sidecar["timestamps"], ensure_ascii=False, indent=2))


def parse_args() -> Any:
    parser = FlexibleArgumentParser(description="Offline Qwen3-TTS word timestamp example")
    parser.add_argument("--model", default="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", help="Qwen3-TTS model path/name")
    parser.add_argument(
        "--forced-aligner",
        default=None,
        help="Qwen3 forced aligner model path/name",
    )
    parser.add_argument(
        "--forced-aligner-config",
        default=None,
        help="Optional YAML file for forced aligner settings (incl. gpu_memory_utilization)",
    )
    parser.add_argument("--stage-configs-path", default=None, help="Qwen3-TTS deploy YAML")
    parser.add_argument("--text", default="Hello world.", help="Text to synthesize and align")
    parser.add_argument("--language", default="English", help="Qwen3-TTS language field")
    parser.add_argument("--speaker", default="Vivian", help="CustomVoice speaker name")
    parser.add_argument("--instructions", default="", help="Optional speaking style instruction")
    parser.add_argument("--max-new-tokens", type=int, default=2048, help="TTS max_new_tokens")
    parser.add_argument("--output-dir", default="output_audio", help="Directory for WAV and JSON sidecar")
    parser.add_argument("--log-stats", action="store_true", default=False, help="Enable vLLM-Omni stats logging")
    nullify_stage_engine_defaults(parser)
    return parser.parse_args()


if __name__ == "__main__":
    main(parse_args())
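
The JSON sidecar written by this script stores each word with `start_ms` and `end_ms`, alongside the `sample_rate` of the WAV. To cut or highlight a word in the audio, those millisecond spans have to be mapped to sample indices. A minimal sketch follows, assuming the sidecar layout shown in `main`. The timestamp values are invented for illustration, and `word_sample_ranges` is a hypothetical helper.

```python
import json

# Hypothetical sidecar content; the timestamp values are made up.
sidecar_json = """
{
  "text": "Hello world.",
  "sample_rate": 24000,
  "audio_path": "output_audio/qwen3_tts_word_timestamps.wav",
  "timestamps": [
    {"word": "Hello", "start_ms": 80, "end_ms": 420},
    {"word": "world", "start_ms": 480, "end_ms": 900}
  ]
}
"""


def word_sample_ranges(sidecar: dict) -> list:
    """Map each word's millisecond span to (word, start_sample, end_sample)."""
    sr = sidecar["sample_rate"]
    ranges = []
    # "timestamps" is null in the JSON when alignment returned nothing.
    for item in sidecar.get("timestamps") or []:
        start = int(item["start_ms"] * sr / 1000)
        end = int(item["end_ms"] * sr / 1000)
        ranges.append((item["word"], start, end))
    return ranges


ranges = word_sample_ranges(json.loads(sidecar_json))
for word, start, end in ranges:
    print(f"{word}: samples {start}..{end}")
```

Slicing the loaded waveform with `audio[start:end]` would then isolate a single word.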

soulxsinger/benchmark.py

"""Repeatable SoulX-Singer offline benchmark (preprocess + DiT, RTF + stage timings).

Usage:
    python benchmark.py --model /path/to/SoulX-Singer --svs \\
        --prompt-audio ../SoulX-Singer/example/audio/zh_prompt.mp3 \\
        --target-audio ../SoulX-Singer/example/audio/music.mp3 \\
        --preprocess-weights-dir /path/to/SoulX-Singer-Preprocess \\
        -o benchmark.wav

    python benchmark.py ... --enable-diffusion-pipeline-profiler
"""

import argparse
import statistics
import time
from pathlib import Path

import soundfile as sf
from end2end import (
    SVC_DEPLOY_CONFIG,
    SVS_DEPLOY_CONFIG,
    add_inference_args,
    build_sampling,
    extract_audio,
    resolve_preprocess_weights_dir,
)

from vllm_omni.entrypoints.omni import Omni


def _audio_duration_sec(outputs) -> float:
    audio_np, sr = extract_audio(outputs)
    return audio_np.size / sr


def _stage_durations_from_output(outputs) -> dict[str, float]:
    omni_out = outputs[0]
    durations = getattr(omni_out, "stage_durations", None) or {}
    if durations:
        return dict(durations)
    ro = getattr(omni_out, "request_output", None)
    if ro is not None:
        return dict(getattr(ro, "stage_durations", None) or {})
    return {}


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="SoulX-Singer offline benchmark")
    add_inference_args(parser)
    parser.add_argument("--warmup", type=int, default=1, help="Warmup runs (excluded from stats)")
    parser.add_argument("--runs", type=int, default=3, help="Measured runs")
    parser.add_argument(
        "--enable-diffusion-pipeline-profiler",
        action="store_true",
        help="Enable low-overhead stage timing in DiffusionOutput.stage_durations",
    )
    parser.add_argument(
        "--enforce-eager",
        action="store_true",
        help="Disable torch.compile on diff_estimator (eager baseline for A/B)",
    )
    parser.add_argument(
        "--dtype",
        type=str,
        default=None,
        choices=["bfloat16", "bf16", "float16", "fp16", "half"],
        help="DiT trunk dtype (default from deploy YAML). Use float16 to match upstream --fp16.",
    )
    parser.add_argument(
        "--output",
        "-o",
        type=str,
        default=None,
        help="Save WAV from the last measured run (after warmup)",
    )
    parser.add_argument(
        "--save-each-run",
        action="store_true",
        help="With --output, also write measured runs as <stem>_run1.wav, _run2.wav, ...",
    )
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    preprocess_dir = resolve_preprocess_weights_dir(args.preprocess_weights_dir)
    deploy_config = args.deploy_config or str(SVS_DEPLOY_CONFIG if args.svs else SVC_DEPLOY_CONFIG)
    mode = "SVS" if args.svs else "SVC"

    omni_kwargs: dict = {
        "model": args.model,
        "deploy_config": deploy_config,
        "async_chunk": False,  # SoulX-Singer currently supports only batch mode (pseudo-streaming was stashed)
    }
    if args.enable_diffusion_pipeline_profiler:
        omni_kwargs["enable_diffusion_pipeline_profiler"] = True
    if args.enforce_eager:
        omni_kwargs["enforce_eager"] = True
    if args.dtype is not None:
        omni_kwargs["dtype"] = args.dtype

    compile_mode = "eager" if args.enforce_eager else "torch.compile"
    dtype_label = args.dtype or "from deploy YAML"
    print(f"Loading SoulX-Singer {mode} from {args.model} [{compile_mode}, dtype={dtype_label}]")

    omni = Omni(**omni_kwargs)
    kind = "svs" if args.svs else "svc"
    sampling = build_sampling(args, preprocess_weights_dir=preprocess_dir, kind=kind)
    prompt = {"prompt_token_ids": [0]}

    latencies_ms: list[float] = []
    rtfs: list[float] = []
    last_stages: dict[str, float] = {}
    last_measured_outputs = None

    total_runs = args.warmup + args.runs
    for run_idx in range(total_runs):
        is_warmup = run_idx < args.warmup
        label = "warmup" if is_warmup else f"run {run_idx - args.warmup + 1}/{args.runs}"
        t0 = time.perf_counter()
        outputs = list(omni.generate([prompt], sampling))
        elapsed_ms = (time.perf_counter() - t0) * 1000.0
        audio_sec = _audio_duration_sec(outputs)
        rtf = (elapsed_ms / 1000.0) / audio_sec if audio_sec > 0 else float("inf")
        print(f"[{label}] client={elapsed_ms:.1f} ms, audio={audio_sec:.2f}s, RTF={rtf:.3f}")
        if not is_warmup:
            latencies_ms.append(elapsed_ms)
            rtfs.append(rtf)
            last_stages = _stage_durations_from_output(outputs)
            last_measured_outputs = outputs
            if args.output and args.save_each_run:
                measured_idx = run_idx - args.warmup + 1
                out_path = Path(args.output)
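                run_path = out_path.with_name(f"{out_path.stem}_run{measured_idx}{out_path.suffix}")
                # Create the parent directory before the first per-run write.
                run_path.parent.mkdir(parents=True, exist_ok=True)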
                audio_np, sr = extract_audio(outputs)
                sf.write(str(run_path), audio_np, sr)
                print(f"  saved {run_path} ({sr} Hz, {audio_np.size / sr:.2f}s)")

    omni.close()

    if args.output and last_measured_outputs is not None:
        audio_np, sr = extract_audio(last_measured_outputs)
        out_path = Path(args.output)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        sf.write(str(out_path), audio_np, sr)
        print(f"\nSaved last measured run → {out_path} ({sr} Hz, {audio_np.size / sr:.2f}s)")

    if latencies_ms:
        print("\n=== Summary (measured runs) ===")
        # pstdev is the population standard deviation over the measured runs.
        print(f"client_ms: mean={statistics.mean(latencies_ms):.1f}, pstdev={statistics.pstdev(latencies_ms):.1f}")
        print(f"RTF:       mean={statistics.mean(rtfs):.3f}, pstdev={statistics.pstdev(rtfs):.3f}")
        if last_stages:
            print("\nStage durations (last measured run):")
            for name, value in sorted(last_stages.items()):
                if name.endswith("_ms"):
                    print(f"  {name}: {value:.1f} ms")
                else:
                    print(f"  {name}: {value * 1000.0:.1f} ms")


if __name__ == "__main__":
    main()
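
The benchmark reports a real-time factor (RTF): wall-clock seconds spent per second of generated audio, so values below 1 mean faster than real time. Its summary uses `statistics.pstdev`, which divides by the number of runs rather than by one less. With only a few measured runs the difference from the sample standard deviation is noticeable. A small sketch with hypothetical measurements (the latencies and durations below are not real results):

```python
import statistics


def real_time_factor(elapsed_ms: float, audio_sec: float) -> float:
    """Wall-clock seconds spent per second of generated audio."""
    return (elapsed_ms / 1000.0) / audio_sec if audio_sec > 0 else float("inf")


# Hypothetical measurements: (client latency in ms, audio duration in s).
measured = [(1200.0, 4.0), (1000.0, 4.0), (800.0, 4.0)]
rtfs = [real_time_factor(ms, sec) for ms, sec in measured]

mean_rtf = statistics.mean(rtfs)
pop_sd = statistics.pstdev(rtfs)    # divides by N, as the script does
sample_sd = statistics.stdev(rtfs)  # divides by N - 1

print(f"RTF mean={mean_rtf:.3f} pstdev={pop_sd:.3f} stdev={sample_sd:.3f}")
```

With the default of three measured runs, the population figure is smaller than the sample figure by a factor of sqrt(2/3), which is worth keeping in mind when comparing against numbers from other tools.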

soulxsinger/end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_speech/soulxsinger/end2end.py.

voxcpm2/end2end.py

"""Offline VoxCPM2 inference example (native AR pipeline).

Uses the single-stage native AR config (voxcpm2.yaml).
Requires the `voxcpm` package or VLLM_OMNI_VOXCPM_CODE_PATH env var.
"""

from __future__ import annotations

import os
import time
from pathlib import Path

import soundfile as sf
import torch

from vllm_omni import Omni
from vllm_omni.utils.tracking_parser import TrackingArgumentParser

REPO_ROOT = Path(__file__).resolve().parents[4]
SAMPLE_RATE = 48_000


def parse_args():
    parser = TrackingArgumentParser(description="Offline VoxCPM2 native AR inference")
    parser.add_argument(
        "--model",
        type=str,
        default="openbmb/VoxCPM2",
        help="VoxCPM2 model path or HuggingFace repo ID.",
    )
    parser.add_argument(
        "--text",
        type=str,
        default="This is a VoxCPM2 native AR synthesis example running on vLLM Omni.",
        help="Text to synthesize.",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="output_audio",
        help="Directory for output WAV files.",
    )
    parser.add_argument(
        "--deploy-config",
        type=str,
        default=None,
        help="Override the deploy config path. If unset, auto-loads "
        "vllm_omni/deploy/voxcpm2.yaml based on the HF model_type.",
    )
    parser.add_argument(
        "--ref-audio",
        type=str,
        default=None,
        help="Path to reference audio for voice cloning.",
    )
    parser.add_argument(
        "--ref-text",
        type=str,
        default=None,
        help="Optional transcript of --ref-audio (enables continuation mode).",
    )
    return parser.parse_args()


def extract_audio(multimodal_output: dict) -> torch.Tensor:
    """Extract the final complete audio tensor from multimodal output.

    The output processor concatenates per-step delta tensors under
    ``model_outputs``.  Falls back to ``audio`` for backwards compat.
    """
    audio = multimodal_output.get("model_outputs")
    if audio is None:
        audio = multimodal_output.get("audio")
    if audio is None:
        raise ValueError(f"No audio key in multimodal_output: {list(multimodal_output.keys())}")

    if isinstance(audio, list):
        # Defensive: usually the output processor consolidates into a single
        # tensor at request completion, but concatenate here too in case the
        # caller consumes intermediate (pre-consolidation) outputs.
        valid = [torch.as_tensor(a).float().cpu().reshape(-1) for a in audio if a is not None]
        if not valid:
            raise ValueError("Audio list is empty or all elements are None.")
        return torch.cat(valid, dim=0) if len(valid) > 1 else valid[0]

    return torch.as_tensor(audio).float().cpu().reshape(-1)


def main():
    args = parse_args()

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    engine = Omni(
        model=args.model,
        deploy_config=args.deploy_config,
    )

    from transformers import AutoTokenizer

    from vllm_omni.model_executor.models.voxcpm2.voxcpm2_talker import (
        build_cjk_split_map,
        build_voxcpm2_prompt,
    )

    tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=True)
    split_map = build_cjk_split_map(tokenizer)
    hf_config = engine.engine.stage_vllm_configs[0].model_config.hf_config

    ref_audio_arg = args.ref_audio
    ref_text_arg = args.ref_text
    ref_wav, ref_sr = (None, None)
    if ref_audio_arg:
        ref_wav_arr, ref_sr = sf.read(ref_audio_arg)
        ref_wav = ref_wav_arr.mean(axis=-1).tolist() if ref_wav_arr.ndim > 1 else ref_wav_arr.tolist()

    prompt = build_voxcpm2_prompt(
        hf_config=hf_config,
        tokenizer=tokenizer,
        split_map=split_map,
        text=args.text,
        ref_audio=ref_wav,
        ref_sr=ref_sr,
        ref_text=ref_text_arg,
    )

    print(f"Model       : {args.model}")
    print(f"Text        : {args.text}")
    if ref_audio_arg:
        print(f"Ref audio   : {ref_audio_arg}")
    if ref_text_arg:
        print(f"Ref text    : {ref_text_arg}")
    print(f"Output dir  : {output_dir}")

    t_start = time.perf_counter()
    outputs = engine.generate([prompt])
    elapsed = time.perf_counter() - t_start

    # The waveform lives under "model_outputs" (or "audio" in older outputs); extract_audio() checks both.
    request_output = outputs[0]
    mm = request_output.outputs[0].multimodal_output
    audio = extract_audio(mm)

    duration = audio.numel() / SAMPLE_RATE
    rtf = elapsed / duration if duration > 0 else float("inf")

    output_path = output_dir / "output.wav"
    sf.write(str(output_path), audio.numpy(), SAMPLE_RATE, format="WAV")

    print(f"Saved       : {output_path}")
    print(f"Duration    : {duration:.2f}s")
    print(f"Inference   : {elapsed:.2f}s")
    print(f"RTF         : {rtf:.3f}")


if __name__ == "__main__":
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    main()
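
When `--ref-audio` points at a multi-channel file, `sf.read` returns a 2-D array of shape `(frames, channels)`, and the script averages the channels before passing the samples on. A minimal NumPy sketch of that downmix follows. `to_mono` is a hypothetical name and the sample values are made up.

```python
import numpy as np


def to_mono(wav: np.ndarray) -> np.ndarray:
    """Average the channel axis of a (frames, channels) array; pass 1-D input through."""
    return wav.mean(axis=-1) if wav.ndim > 1 else wav


# Three stereo frames: left and right channels per row.
stereo = np.array([[1.0, 0.0], [0.5, 0.5], [-1.0, 1.0]])
mono = to_mono(stereo)
print(mono.shape, mono.tolist())
```

Averaging keeps the result within the original amplitude range, whereas summing the channels could push samples past full scale.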

voxtral_tts/end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_speech/voxtral_tts/end2end.py.