vllm_omni.model_executor.models.indextts2.preprocess_utils
External model loading, audio I/O, and emotion conditioning for IndexTTS2.
compute_fbank
Compute 80-dim fbank (log mel filterbank) features for CAMPPlus. Input: a 1-D waveform of shape [T], sampled at 16kHz.
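The following is a minimal NumPy sketch of what an 80-bin fbank front end computes, assuming Kaldi-like framing (25 ms windows, 10 ms hop) and per-utterance mean normalization. The helper names are hypothetical, the Hamming window is an approximation of Kaldi's default window, and this is not the module's implementation (the real code may instead call a Kaldi-compatible fbank routine).

```python
import numpy as np


def hz_to_mel(hz):
    # Kaldi-style mel scale: 1127 * ln(1 + f / 700)
    return 1127.0 * np.log1p(np.asarray(hz, dtype=np.float64) / 700.0)


def mel_filterbank(num_bins, n_fft, sample_rate, low_hz=20.0, high_hz=None):
    """Triangular filters on the mel scale, shape [num_bins, n_fft // 2 + 1]."""
    high_hz = sample_rate / 2 if high_hz is None else high_hz
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    fft_mels = hz_to_mel(fft_freqs)
    # num_bins + 2 equally spaced edge points between the low and high mel limits
    edges = np.linspace(hz_to_mel(low_hz), hz_to_mel(high_hz), num_bins + 2)
    fb = np.zeros((num_bins, fft_freqs.size))
    for m in range(num_bins):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_mels - left) / (center - left)
        falling = (right - fft_mels) / (right - center)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def compute_fbank_sketch(wav, sample_rate=16000, num_mel_bins=80,
                         frame_ms=25.0, hop_ms=10.0):
    """Hypothetical helper: 1-D waveform [T] -> log-mel features [num_frames, num_mel_bins]."""
    wav = np.asarray(wav, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16kHz
    hop = int(sample_rate * hop_ms / 1000)           # 160 samples at 16kHz
    if wav.size < frame_len:
        raise ValueError("waveform shorter than one analysis frame")
    num_frames = 1 + (wav.size - frame_len) // hop
    n_fft = 1 << (frame_len - 1).bit_length()        # next power of two (512 here)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = wav[idx]                                 # [num_frames, frame_len]
    frames = frames - frames.mean(axis=1, keepdims=True)  # remove per-frame DC offset
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(num_mel_bins, n_fft, sample_rate).T
    feats = np.log(np.maximum(mel_energy, np.finfo(np.float64).eps))
    # Per-utterance mean normalization (an assumed CAMPPlus-style front end).
    return feats - feats.mean(axis=0, keepdims=True)
```

For one second of 16kHz audio this yields 98 frames of 80 features each.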
load_qwen_emotion
load_reference_audio
load_reference_audio(
audio_path: str | tuple | list,
device: device,
max_audio_length_seconds: float | None = 15,
mode: str = "speaker",
) -> tuple[Tensor, Tensor]
Load reference audio and resample it to both 16kHz and 22.05kHz.
audio_path accepts either a file path (str) or a pre-loaded (wav_list, sr) pair, as a tuple or list, from the serving layer.
mode mirrors the official IndexTTS2 v2 loading paths:
- speaker: the default librosa path first resamples to 22.05kHz, truncates (to max_audio_length_seconds), then derives the 16kHz wav2vec/CAMPPlus input from that 22.05kHz signal.
- emotion: librosa loads directly at 16kHz, then truncates.
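To make the ordering difference between the two modes concrete, here is a hypothetical NumPy helper that takes a pre-loaded (wav_list, sr) pair. It substitutes linear interpolation for librosa's resampler, and it returns None for the 22.05kHz signal in emotion mode; both choices are illustrative assumptions, not the behavior of load_reference_audio.

```python
import numpy as np


def resample_linear(wav, orig_sr, target_sr):
    """Linear-interpolation resampler (a simple stand-in for librosa's resampler)."""
    wav = np.asarray(wav, dtype=np.float64)
    if orig_sr == target_sr:
        return wav
    num_out = int(round(wav.size * target_sr / orig_sr))
    t_in = np.arange(wav.size) / orig_sr
    t_out = np.arange(num_out) / target_sr
    return np.interp(t_out, t_in, wav)


def truncate(wav, sr, max_seconds):
    # None disables truncation, like max_audio_length_seconds: float | None.
    if max_seconds is None:
        return wav
    return wav[: int(sr * max_seconds)]


def reference_waveforms(audio, mode="speaker", max_seconds=15.0):
    """Hypothetical helper: (wav_list, sr) -> (wav_16k, wav_22k or None)."""
    wav_list, sr = audio
    wav = np.asarray(wav_list, dtype=np.float64)
    if mode == "speaker":
        # Resample to 22.05kHz first, truncate there, then derive 16kHz from it.
        wav_22k = truncate(resample_linear(wav, sr, 22050), 22050, max_seconds)
        wav_16k = resample_linear(wav_22k, 22050, 16000)
        return wav_16k, wav_22k
    if mode == "emotion":
        # Go straight to 16kHz, then truncate.
        wav_16k = truncate(resample_linear(wav, sr, 16000), 16000, max_seconds)
        return wav_16k, None
    raise ValueError(f"unknown mode: {mode!r}")
```

In speaker mode the 16kHz signal is derived from the already-truncated 22.05kHz signal, so both outputs cover the same time span.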
resolve_model_file
Resolve an IndexTTS2 asset file from either a local model directory or a Hugging Face (HF) repo ID.
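A minimal sketch of directory-or-repo-ID resolution, assuming the rule is "use the local directory if it exists, otherwise download from the Hub". The function name and this fallback rule are hypothetical; hf_hub_download is the real huggingface_hub download call, imported lazily so local-only use does not need the package.

```python
import os


def resolve_model_file_sketch(model_dir_or_repo: str, filename: str) -> str:
    """Hypothetical helper: return a local path to `filename`."""
    if os.path.isdir(model_dir_or_repo):
        # Local model directory: the asset must already exist inside it.
        path = os.path.join(model_dir_or_repo, filename)
        if not os.path.isfile(path):
            raise FileNotFoundError(path)
        return path
    # Not a local directory: treat the string as an HF repo id ("org/name").
    from huggingface_hub import hf_hub_download

    # Downloads into (or reuses) the local HF cache and returns the cached path.
    return hf_hub_download(repo_id=model_dir_or_repo, filename=filename)
```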