vllm_omni.model_executor.models.glm_tts.vocoder ¶
Vocoder loading and mel-to-audio conversion for GLM-TTS.
Supports:
- HiFT (24 kHz): reuses CosyVoice3's vendored HiFTGenerator
- Vocos2D JIT (32 kHz): loaded from a TorchScript checkpoint
Extracted from glm_tts_dit_wrapper.py to keep file sizes under 800 lines.
ConvRNNF0Predictor ¶
Bases: Module
Non-causal F0 predictor for GLM-TTS HiFT vocoder.
GLM-TTS ships a non-causal HiFT checkpoint whose F0 predictor uses standard nn.Conv1d layers (kernel_size=3, padding=1). CosyVoice3's vendored hifigan only includes the causal variant (CausalConvRNNF0Predictor), so we provide the non-causal version here. The network structure and state_dict keys are identical to the original cosyvoice.hifigan_cosy2.f0_predictor.ConvRNNF0Predictor.
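To make the "non-causal" distinction concrete, here is a minimal pure-Python sketch of what symmetric padding (kernel_size=3, padding=1) means for a single layer. The `conv1d` helper and the all-ones kernel are illustrative assumptions. The left-padded variant is a generic causal convolution shown only for contrast; it is not a description of how CausalConvRNNF0Predictor pads its layers.

```python
def conv1d(x, kernel, pad_left, pad_right):
    """Plain 1-D cross-correlation over a list, with zero padding (illustrative helper)."""
    padded = [0.0] * pad_left + list(x) + [0.0] * pad_right
    k = len(kernel)
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(padded) - k + 1)
    ]


kernel = [1.0, 1.0, 1.0]              # illustrative weights, not trained values
x = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]    # single impulse at frame 3

# Symmetric padding, as with kernel_size=3, padding=1: frame t sees t-1, t, t+1.
non_causal = conv1d(x, kernel, pad_left=1, pad_right=1)

# Generic causal padding: everything on the left, so frame t sees t-2, t-1, t.
causal = conv1d(x, kernel, pad_left=2, pad_right=0)

print(non_causal)  # frame 2 already responds to the impulse at frame 3
print(causal)      # the first response is at frame 3 itself
```

Both variants preserve the sequence length; they differ only in whether an output frame may depend on a later input frame.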
classifier instance-attribute ¶
condnet instance-attribute ¶
condnet = Sequential(
    weight_norm(
        Conv1d(
            in_channels,
            cond_channels,
            kernel_size=3,
            padding=1,
        )
    ),
    ELU(),
    weight_norm(
        Conv1d(
            cond_channels,
            cond_channels,
            kernel_size=3,
            padding=1,
        )
    ),
    ELU(),
    weight_norm(
        Conv1d(
            cond_channels,
            cond_channels,
            kernel_size=3,
            padding=1,
        )
    ),
    ELU(),
    weight_norm(
        Conv1d(
            cond_channels,
            cond_channels,
            kernel_size=3,
            padding=1,
        )
    ),
    ELU(),
    weight_norm(
        Conv1d(
            cond_channels,
            cond_channels,
            kernel_size=3,
            padding=1,
        )
    ),
    ELU(),
)
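The stack above chains five kernel-3 layers, each padded by one frame on both sides. As a rough sketch of what that implies, the pure-Python snippet below propagates a single impulse through five such layers and counts how many output frames it reaches. It assumes all-ones weights and ignores bias, weight normalization, and the ELU activations, since only the reach of each layer matters here; `same_conv3` is a hypothetical helper, not part of the module.

```python
def same_conv3(x):
    """Kernel-3 convolution with one frame of zero padding on each side."""
    padded = [0.0] + list(x) + [0.0]
    return [padded[i] + padded[i + 1] + padded[i + 2] for i in range(len(x))]


NUM_LAYERS = 5   # number of Conv1d layers in the condnet listing
T = 21           # illustrative sequence length
centre = T // 2

x = [0.0] * T
x[centre] = 1.0  # impulse at the centre frame

y = x
for _ in range(NUM_LAYERS):
    y = same_conv3(y)

# Frames influenced by the impulse after five layers.
touched = [t for t, v in enumerate(y) if v != 0.0]
receptive_field = len(touched)

print(touched[0], touched[-1], receptive_field)
```

Each layer extends the reach by one frame in both directions, so five layers give a receptive field of 1 + 5 × 2 = 11 frames, five of which lie in the future relative to the output frame.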
HiFTWrapper ¶
Vocos2DWrapper ¶
load_hift ¶
load_hift(ckpt_path: str, device: device) -> HiFTWrapper
Load HiFT vocoder from checkpoint.
load_vocoder ¶
load_vocos2d_jit ¶
load_vocos2d_jit(
ckpt_path: str, device: device
) -> Vocos2DWrapper
Load Vocos2D JIT vocoder from TorchScript checkpoint.
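For readers unfamiliar with TorchScript checkpoints, the sketch below shows the general save and load round trip using PyTorch's public `torch.jit` API. It is not the body of `load_vocos2d_jit`; `TinyVocoder`, its dimensions, and the file name are hypothetical stand-ins chosen only so the example runs.

```python
import os
import tempfile

import torch
from torch import nn


class TinyVocoder(nn.Module):
    """Hypothetical stand-in for a vocoder; NOT the Vocos2D architecture."""

    def __init__(self, mel_dim: int = 80, hop: int = 4) -> None:
        super().__init__()
        self.proj = nn.Conv1d(mel_dim, hop, kernel_size=1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: [B, mel_dim, T] -> frames: [B, hop, T] -> audio: [B, 1, T * hop]
        frames = self.proj(mel)
        return frames.transpose(1, 2).reshape([mel.size(0), 1, -1])


device = torch.device("cpu")
scripted = torch.jit.script(TinyVocoder().eval())

with tempfile.TemporaryDirectory() as tmp:
    ckpt_path = os.path.join(tmp, "tiny_vocoder.pt")  # hypothetical file name
    torch.jit.save(scripted, ckpt_path)
    # A TorchScript archive restores without the original Python class.
    restored = torch.jit.load(ckpt_path, map_location=device).eval()

mel = torch.randn(2, 80, 10)
with torch.no_grad():
    audio = restored(mel)

print(tuple(audio.shape))
```

Passing `map_location` places the restored module on the requested device regardless of where it was saved.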
mel_to_audio ¶
mel_to_audio(vocoder: Any | None, mel: Tensor) -> Tensor
Convert mel-spectrogram to audio waveform.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vocoder | Any \| None | Vocoder instance (HiFTWrapper, Vocos2DWrapper, or None). | required |
| mel | Tensor | Mel-spectrogram [B, T, mel_dim]. | required |
Returns:
| Type | Description |
|---|---|
| Tensor | Audio waveform [B, 1, samples]. |
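Because the two supported vocoders run at different sample rates, the same number of output samples corresponds to different durations. The helper below is a small sketch of that bookkeeping; `audio_duration_seconds` and the `"hift"` / `"vocos2d"` keys are hypothetical names introduced for illustration and are not part of this module.

```python
# Sample rates taken from the module summary: HiFT at 24 kHz, Vocos2D JIT at 32 kHz.
# The dictionary keys are hypothetical labels, not identifiers from the module.
SAMPLE_RATES_HZ = {"hift": 24_000, "vocos2d": 32_000}


def audio_duration_seconds(shape, vocoder_kind):
    """Duration of a [B, 1, samples] waveform for the given vocoder kind."""
    batch, channels, samples = shape
    if channels != 1:
        raise ValueError(f"expected mono audio [B, 1, samples], got {shape}")
    return samples / SAMPLE_RATES_HZ[vocoder_kind]


shape = (1, 1, 48_000)  # illustrative output shape
hift_seconds = audio_duration_seconds(shape, "hift")
vocos_seconds = audio_duration_seconds(shape, "vocos2d")
print(hift_seconds, vocos_seconds)
```

When writing the waveform to disk, the sample rate passed to the audio writer should match the vocoder that produced it; otherwise playback speed and pitch will be off.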