
vllm_omni.model_executor.models.glm_tts.vocoder

Vocoder loading and mel-to-audio conversion for GLM-TTS.

Supports:
  • HiFT (24 kHz), which reuses CosyVoice3's vendored HiFTGenerator
  • Vocos2D JIT (32 kHz), loaded from a TorchScript checkpoint

Extracted from glm_tts_dit_wrapper.py to keep file sizes under 800 lines.

logger module-attribute

logger = init_logger(__name__)

ConvRNNF0Predictor

Bases: Module

Non-causal F0 predictor for GLM-TTS HiFT vocoder.

GLM-TTS ships a non-causal HiFT checkpoint whose F0 predictor uses standard nn.Conv1d layers (kernel_size=3, padding=1). CosyVoice3's vendored hifigan only includes the causal variant (CausalConvRNNF0Predictor), so we provide the non-causal version here. The network structure and state_dict keys are identical to the original cosyvoice.hifigan_cosy2.f0_predictor.ConvRNNF0Predictor.
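
To make the causal versus non-causal distinction concrete, here is a minimal pure-Python sketch. It is not taken from the GLM-TTS or CosyVoice sources: the helper `conv1d` and the all-ones kernel are hypothetical, and the "causal" case is a generic left-padded convolution, which may differ from how CausalConvRNNF0Predictor actually pads. It only illustrates that symmetric padding (kernel_size=3, padding=1) lets each output frame see one future frame.

```python
def conv1d(signal, kernel, pad_left, pad_right):
    """Cross-correlate a 1-D signal with a kernel after zero padding."""
    padded = [0.0] * pad_left + list(signal) + [0.0] * pad_right
    k = len(kernel)
    return [
        sum(padded[t + i] * kernel[i] for i in range(k))
        for t in range(len(padded) - k + 1)
    ]


# A unit impulse at index 3 makes the receptive field visible.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
kernel = [1.0, 1.0, 1.0]  # hypothetical weights, chosen for readability

# Non-causal: symmetric padding, as in Conv1d(kernel_size=3, padding=1).
non_causal = conv1d(impulse, kernel, pad_left=1, pad_right=1)

# Generic causal variant (assumption): all padding on the left,
# so output frame t never depends on input frame t + 1.
causal = conv1d(impulse, kernel, pad_left=2, pad_right=0)

print(non_causal)  # [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
print(causal)      # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```

In the non-causal output, frame 2 already responds to the impulse at frame 3, which is why a non-causal checkpoint cannot simply be loaded into a causal predictor and be expected to behave identically, even when the state_dict keys match.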

classifier instance-attribute

classifier = Linear(
    in_features=cond_channels, out_features=num_class
)

condnet instance-attribute

condnet = Sequential(
    weight_norm(
        Conv1d(
            in_channels,
            cond_channels,
            kernel_size=3,
            padding=1,
        )
    ),
    ELU(),
    weight_norm(
        Conv1d(
            cond_channels,
            cond_channels,
            kernel_size=3,
            padding=1,
        )
    ),
    ELU(),
    weight_norm(
        Conv1d(
            cond_channels,
            cond_channels,
            kernel_size=3,
            padding=1,
        )
    ),
    ELU(),
    weight_norm(
        Conv1d(
            cond_channels,
            cond_channels,
            kernel_size=3,
            padding=1,
        )
    ),
    ELU(),
    weight_norm(
        Conv1d(
            cond_channels,
            cond_channels,
            kernel_size=3,
            padding=1,
        )
    ),
    ELU(),
)
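
As a quick sanity check on the layer settings above, the sketch below applies the output-length formula from the PyTorch Conv1d documentation to five stacked layers with kernel_size=3 and padding=1. The helper name `conv1d_out_len` and the frame count are hypothetical; the point is only that this configuration preserves the number of time frames, so the predictor emits one value per input frame along the time axis.

```python
def conv1d_out_len(length, kernel_size, padding=0, stride=1, dilation=1):
    """Output length of a 1-D convolution (formula from the PyTorch Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


frames = 250  # hypothetical number of mel frames
for _ in range(5):  # condnet stacks five Conv1d layers
    frames = conv1d_out_len(frames, kernel_size=3, padding=1)

print(frames)  # 250: the time dimension is unchanged
```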

num_class instance-attribute

num_class = num_class

forward

forward(x: Tensor) -> Tensor

HiFTWrapper

Thin wrapper around CosyVoice3's vendored HiFTGenerator.

device instance-attribute

device = device

model instance-attribute

model = to(device)

sample_rate instance-attribute

sample_rate = 24000

Vocos2DWrapper

Thin wrapper around Vocos2D TorchScript model.

device instance-attribute

device = device

gen_model instance-attribute

gen_model = load(ckpt_path, map_location=device)

load_hift

load_hift(ckpt_path: str, device: device) -> HiFTWrapper

Load HiFT vocoder from checkpoint.

load_vocoder

load_vocoder(
    model_root: str,
    device: device,
    sample_rate: int = 24000,
) -> tuple[Any | None, int]

Try to load the best available vocoder.

Returns:

Type                    Description
tuple[Any | None, int]  (vocoder, actual_sample_rate) or (None, sample_rate) on failure.
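
The return contract above can be illustrated with a generic fallback pattern. This is a sketch under assumptions, not the actual body of load_vocoder: the helper `first_available`, the stub loaders, and the order in which candidates are tried are all hypothetical, and the real function's error handling is not documented here.

```python
from __future__ import annotations

from typing import Any, Callable


def first_available(
    loaders: list[Callable[[], tuple[Any, int]]],
    fallback_rate: int,
) -> tuple[Any | None, int]:
    """Return the first (vocoder, sample_rate) that loads, else (None, fallback_rate)."""
    for load in loaders:
        try:
            return load()
        except Exception:
            # This candidate is unavailable; try the next one.
            continue
    return None, fallback_rate


def broken_loader() -> tuple[Any, int]:
    # Hypothetical loader that fails, e.g. because a checkpoint is missing.
    raise FileNotFoundError("checkpoint not found")


def stub_loader() -> tuple[Any, int]:
    # Hypothetical loader that succeeds and reports its own sample rate.
    return "stub-vocoder", 32000


vocoder, rate = first_available([broken_loader, stub_loader], fallback_rate=24000)
print(vocoder, rate)

missing, default_rate = first_available([broken_loader], fallback_rate=24000)
print(missing, default_rate)
```

Callers should use the returned sample rate rather than the one they requested, since the vocoder that actually loads may run at a different rate, and they should handle the `None` case explicitly.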

load_vocos2d_jit

load_vocos2d_jit(
    ckpt_path: str, device: device
) -> Vocos2DWrapper

Load Vocos2D JIT vocoder from TorchScript checkpoint.

mel_to_audio

mel_to_audio(vocoder: Any | None, mel: Tensor) -> Tensor

Convert mel-spectrogram to audio waveform.

Parameters:

Name     Type        Description                                               Default
vocoder  Any | None  Vocoder instance (HiFTWrapper, Vocos2DWrapper, or None).  required
mel      Tensor      Mel-spectrogram [B, T, mel_dim].                          required

Returns:

Type    Description
Tensor  Audio waveform [B, 1, samples].
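
Because the two supported vocoders run at different sample rates (24 kHz for HiFT, 32 kHz for Vocos2D), the same sample count corresponds to different durations. The sketch below shows how a caller might interpret the documented [B, 1, samples] output shape. The helper `duration_seconds` and the example shape are hypothetical and are not part of this module.

```python
def duration_seconds(shape, sample_rate):
    """Duration of each waveform in a [B, 1, samples] batch, in seconds."""
    batch, channels, samples = shape
    if channels != 1:
        raise ValueError(f"expected a mono waveform, got {channels} channels")
    return samples / sample_rate


waveform_shape = (1, 1, 48000)  # hypothetical output shape

hift_duration = duration_seconds(waveform_shape, 24000)   # HiFT rate
vocos_duration = duration_seconds(waveform_shape, 32000)  # Vocos2D rate

print(hift_duration)   # 2.0
print(vocos_duration)  # 1.5
```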