vllm_omni.model_executor.models.indextts2.preprocess_utils

External model loading, audio I/O, and emotion conditioning for IndexTTS2.

logger module-attribute

logger = init_logger(__name__)

compute_fbank

compute_fbank(wav_16k: Tensor, device: device) -> Tensor

Compute 80-dim fbank features for CAMPPlus. Input: [T] at 16kHz.
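As a rough guide to the output size, the sketch below estimates the feature shape for a given input length. It assumes Kaldi-style framing (25 ms window, 10 ms shift, no edge padding), which this page does not confirm. `fbank_shape` is a hypothetical helper for illustration and is not part of this module.

```python
def fbank_shape(num_samples: int, sample_rate: int = 16000,
                frame_ms: float = 25.0, shift_ms: float = 10.0,
                num_mel_bins: int = 80) -> tuple:
    """Estimate (frames, bins) for a Kaldi-style fbank (assumed framing)."""
    window = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)   # 160 samples at 16 kHz
    if num_samples < window:
        # Too short for a single full frame when edges are not padded.
        return (0, num_mel_bins)
    frames = 1 + (num_samples - window) // shift
    return (frames, num_mel_bins)


# One second of 16 kHz audio under the assumed framing.
one_second = fbank_shape(16000)
```

Under these assumptions, one second of audio gives 98 frames of 80 bins each.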

load_campplus

load_campplus(model_path: str, device: device)

load_qwen_emotion

load_qwen_emotion(
    model_path: str,
    device: device,
    *,
    trust_remote_code: bool = True,
)

load_reference_audio

load_reference_audio(
    audio_path: str | tuple | list,
    device: device,
    max_audio_length_seconds: float | None = 15,
    mode: str = "speaker",
) -> tuple[Tensor, Tensor]

Load reference audio and resample to 16kHz and 22.05kHz.

Accepts either a file path (str) or a pre-loaded (wav_list, sr) tuple from the serving layer.
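The two accepted input forms can be told apart with a simple type check. The following is a minimal sketch of that dispatch, not the module's actual code; `split_reference_input` and its return layout are hypothetical.

```python
from typing import Optional, Tuple, Union


def split_reference_input(
    audio: Union[str, tuple, list],
) -> Tuple[str, Optional[str], Optional[list], Optional[int]]:
    """Classify the input as a path or a pre-loaded (wav_list, sr) pair.

    Returns (kind, path, samples, sample_rate); unused fields are None.
    """
    if isinstance(audio, str):
        return ("path", audio, None, None)
    if isinstance(audio, (tuple, list)) and len(audio) == 2:
        wav, sr = audio
        return ("preloaded", None, list(wav), int(sr))
    raise TypeError("expected a file path or a (wav_list, sr) pair")


from_path = split_reference_input("ref.wav")
from_memory = split_reference_input(([0.0, 0.1, -0.1], 24000))
```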

mode mirrors official IndexTTS2 v2:

- speaker: librosa default path first normalizes to 22.05kHz, truncates, then derives the 16kHz wav2vec/CAMPPlus input from that 22.05kHz signal.
- emotion: librosa loads directly at 16kHz, then truncates.
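The order of operations in the two modes can be sketched in plain Python. This is an illustration only, with these assumptions: `resample_linear` is a naive linear-interpolation stand-in for the librosa resampling the real function uses, `reference_pipeline` is a hypothetical name, and the returned dict layout is invented here (this page does not say what the real function returns as its second tensor in emotion mode).

```python
from typing import Optional, Sequence


def resample_linear(samples: Sequence[float], src_sr: int, dst_sr: int) -> list:
    """Naive linear-interpolation resampler (stand-in, not librosa quality)."""
    n_out = int(len(samples) * dst_sr / src_sr)
    out = []
    for i in range(n_out):
        pos = i * src_sr / dst_sr
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def truncate(samples: Sequence[float], sr: int, max_seconds: Optional[float]) -> list:
    """Keep at most max_seconds of audio; None disables truncation."""
    if max_seconds is None:
        return list(samples)
    return list(samples[: int(max_seconds * sr)])


def reference_pipeline(samples, sr, mode="speaker", max_seconds=15):
    if mode == "speaker":
        # Normalize to 22.05 kHz, truncate, then derive 16 kHz from that signal.
        wav_22k = truncate(resample_linear(samples, sr, 22050), 22050, max_seconds)
        wav_16k = resample_linear(wav_22k, 22050, 16000)
        return {"16k": wav_16k, "22k": wav_22k}
    if mode == "emotion":
        # Go directly to 16 kHz, then truncate.
        wav_16k = truncate(resample_linear(samples, sr, 16000), 16000, max_seconds)
        return {"16k": wav_16k}
    raise ValueError(f"unknown mode: {mode!r}")


# One second of silence at 44.1 kHz, truncated to half a second.
silence = [0.0] * 44100
speaker = reference_pipeline(silence, 44100, mode="speaker", max_seconds=0.5)
emotion = reference_pipeline(silence, 44100, mode="emotion", max_seconds=0.5)
```

In speaker mode the 16kHz signal inherits the truncation applied at 22.05kHz, so both outputs cover the same time span.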

load_semantic_codec

load_semantic_codec(
    model_path: str, config: dict, device: device
)

load_wav2vec2

load_wav2vec2(model_path: str, device: device)

resolve_model_file

resolve_model_file(
    model_path: str, filename: str
) -> str | None

Resolve an IndexTTS2 asset from a local model dir or HF repo id.
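One plausible way to implement this kind of lookup is shown below. It is a sketch under assumptions, not the module's code: `resolve_asset_sketch` is a hypothetical name, and the fallback to `huggingface_hub.hf_hub_download` when `model_path` is not a local directory is a guess at the behavior described above.

```python
import os
from typing import Optional


def resolve_asset_sketch(model_path: str, filename: str) -> Optional[str]:
    """Return a local path to filename, or None if it cannot be found."""
    if os.path.isdir(model_path):
        # Local model directory: look for the file directly.
        candidate = os.path.join(model_path, filename)
        return candidate if os.path.isfile(candidate) else None
    try:
        # Assumed fallback: treat model_path as a Hugging Face repo id.
        from huggingface_hub import hf_hub_download
    except ImportError:
        return None
    try:
        return hf_hub_download(repo_id=model_path, filename=filename)
    except Exception:
        # Missing repo, missing file, or no network access.
        return None
```

Returning None rather than raising matches the `str | None` return type in the signature and leaves the caller to decide whether a missing asset is an error.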

wav2vec_extract

wav2vec_extract(
    wav_16k: Tensor,
    model: Any,
    processor: Any,
    device: device,
    w2v_stat: Tensor | None = None,
) -> Tensor

Extract Wav2Vec2-BERT features. Returns [1, T, 1024].