vllm_omni.model_executor.models.cosyvoice3.cosyvoice3_code2wav
CosyVoice3 Code2Wav Stage - Converts speech tokens to audio waveforms.
This module contains the code2wav (token-to-waveform) stage, which uses:
1. DiT (Diffusion Transformer) with optimized attention backends
2. CFM (Conditional Flow Matching) for mel spectrogram generation
3. HiFiGAN vocoder for waveform synthesis
CosyVoice3Code2Wav
Bases: Module
CosyVoice3 Code2Wav stage for token-to-waveform conversion.
This class encapsulates:
- Flow matching decoder with DiT backbone (using diffusion attention)
- HiFiGAN vocoder for mel-to-waveform conversion
flow_model instance-attribute
flow_model = CausalMaskedDiffWithDiT(
    input_size=flow["input_size"],
    output_size=flow["output_size"],
    spk_embed_dim=flow["spk_embed_dim"],
    output_type=flow["output_type"],
    vocab_size=flow["vocab_size"],
    input_frame_rate=flow["input_frame_rate"],
    only_mask_loss=flow["only_mask_loss"],
    token_mel_ratio=flow["token_mel_ratio"],
    pre_lookahead_len=flow["pre_lookahead_len"],
    pre_lookahead_layer=pre_lookahead_layer,
    decoder=decoder,
)
mel_overlap_len instance-attribute
mel_overlap_len = int(
    token_overlap_len / input_frame_rate * 22050 / 256
)
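The expression above converts an overlap measured in speech tokens into an overlap measured in mel frames. The following is a minimal sketch of that arithmetic, not the module's code. It reads the constants 22050 and 256 as a sample rate and a mel hop size, which is an interpretation, and the input values (20 overlap tokens at 25 or 50 tokens per second) are illustrative assumptions rather than the model's configured values.

```python
def mel_overlap_len(token_overlap_len: int, input_frame_rate: int) -> int:
    """Convert a token overlap into a mel-frame overlap (same formula as above)."""
    # tokens / (tokens per second) -> seconds of overlap
    seconds = token_overlap_len / input_frame_rate
    # seconds * 22050 / 256 -> mel frames, truncated toward zero by int()
    return int(seconds * 22050 / 256)


# Illustrative values only (assumptions, not taken from the model config).
print(mel_overlap_len(20, 25))  # 0.8 s of overlap
print(mel_overlap_len(20, 50))  # 0.4 s of overlap
```

Because `int()` truncates, the result is rounded down to a whole number of mel frames.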
forward
forward(
    token: Tensor,
    prompt_token: Tensor,
    prompt_feat: Tensor,
    embedding: Tensor,
    n_timesteps: int = 10,
    token_offset_tokens: int = 0,
) -> Tensor
Generate audio waveform from speech tokens.
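To give a feel for what a step count such as `n_timesteps` typically controls in flow matching, here is a toy sketch of fixed-step Euler integration of a velocity field from t = 0 to t = 1. It is an illustration under assumptions, not this module's solver: the names `euler_flow` and `velocity` are hypothetical, the vectors are plain Python lists instead of tensors, and the constant velocity field stands in for the DiT-based estimator.

```python
from typing import Callable, List

Vector = List[float]


def euler_flow(
    x0: Vector,
    velocity: Callable[[Vector, float], Vector],
    n_timesteps: int = 10,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / n_timesteps
    for step in range(n_timesteps):
        t = step * dt
        v = velocity(x, t)  # hypothetical stand-in for the learned estimator
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup: a straight-line path from "noise" to "target" has constant velocity.
noise = [0.5, -1.0]
target = [2.0, 3.0]
constant_velocity = lambda x, t: [tg - n for tg, n in zip(target, noise)]

result = euler_flow(noise, constant_velocity, n_timesteps=10)
```

In this toy case the path is straight, so any step count lands on the target up to floating-point error. With a learned, curved velocity field, more steps would generally trade speed for accuracy.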
forward_streaming
forward_streaming(
    token: Tensor,
    prompt_token: Tensor,
    prompt_feat: Tensor,
    embedding: Tensor,
    *,
    cache_state: dict[str, Tensor] | None = None,
    n_timesteps: int = 10,
    token_offset_tokens: int = 0,
    finalize: bool = False,
) -> tuple[Tensor, dict[str, Tensor] | None]
Decode streaming audio using a cumulative mel history and an emitted-speech offset.
This mirrors upstream CosyVoice3 streaming semantics more closely than waveform-domain overlap-add. For each request it keeps a cumulative mel history, re-runs causal HiFT on that history, and emits only the newly grown speech suffix. This preserves causal look-right handling without double trimming or duplicated overlap at chunk boundaries.
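The bookkeeping described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the module's implementation: `toy_vocoder`, `StreamState`, `decode_chunk`, `HOP`, and `LOOK_RIGHT` are hypothetical names, mel frames are plain floats, and the look-right handling is modelled simply as withholding the last few frames until `finalize` is set.

```python
from dataclasses import dataclass, field
from typing import List

HOP = 4         # hypothetical samples produced per mel frame (toy value)
LOOK_RIGHT = 2  # hypothetical frames withheld until the stream is finalized


def toy_vocoder(mel_frames: List[float]) -> List[float]:
    """Stand-in for a causal vocoder: each frame expands to HOP samples."""
    return [float(f) for f in mel_frames for _ in range(HOP)]


@dataclass
class StreamState:
    mel_history: List[float] = field(default_factory=list)  # cumulative mel
    emitted_samples: int = 0                                 # speech offset


def decode_chunk(state: StreamState, new_mel: List[float],
                 finalize: bool = False) -> List[float]:
    """Append new mel, re-run the vocoder on the history, emit the new suffix."""
    state.mel_history.extend(new_mel)
    # Hold back trailing frames unless this is the final chunk.
    usable_len = len(state.mel_history)
    if not finalize:
        usable_len = max(0, usable_len - LOOK_RIGHT)
    speech = toy_vocoder(state.mel_history[:usable_len])
    # Emit only what has not been emitted before, then advance the offset.
    suffix = speech[state.emitted_samples:]
    state.emitted_samples += len(suffix)
    return suffix


state = StreamState()
out = decode_chunk(state, [1, 2, 3])
out += decode_chunk(state, [4, 5])
out += decode_chunk(state, [], finalize=True)
```

Concatenating the emitted chunks reproduces what a single pass over the full mel sequence would give, with no sample emitted twice, which is the property the description above is after.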