vllm_omni.model_executor.models.cosyvoice3.cosyvoice3_code2wav

CosyVoice3 Code2Wav Stage - Converts speech tokens to audio waveforms.

This module contains the code2wav (token-to-waveform) stage, which uses:

1. DiT (Diffusion Transformer) with optimized attention backends
2. CFM (Conditional Flow Matching) for mel spectrogram generation
3. HiFiGAN vocoder for waveform synthesis

logger module-attribute

logger = init_logger(__name__)

CosyVoice3Code2Wav

Bases: Module

CosyVoice3 Code2Wav stage for token-to-waveform conversion.

This class encapsulates:

- Flow matching decoder with DiT backbone (using diffusion attention)
- HiFiGAN vocoder for mel-to-waveform conversion

config instance-attribute

config = config

decoder property

decoder: Module

Flow matching decoder.

flow_model instance-attribute

flow_model = CausalMaskedDiffWithDiT(
    input_size=flow["input_size"],
    output_size=flow["output_size"],
    spk_embed_dim=flow["spk_embed_dim"],
    output_type=flow["output_type"],
    vocab_size=flow["vocab_size"],
    input_frame_rate=flow["input_frame_rate"],
    only_mask_loss=flow["only_mask_loss"],
    token_mel_ratio=flow["token_mel_ratio"],
    pre_lookahead_len=flow["pre_lookahead_len"],
    pre_lookahead_layer=pre_lookahead_layer,
    decoder=decoder,
)

hift instance-attribute

hift = float()

input_embedding property

input_embedding: Embedding

Token embedding layer.

input_frame_rate property

input_frame_rate: int

Input frame rate from flow model.

mel_cache_len instance-attribute

mel_cache_len = 20

mel_overlap_len instance-attribute

mel_overlap_len = int(
    token_overlap_len / input_frame_rate * 22050 / 256
)
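As a worked example of the expression above, the sketch below evaluates it for an assumed `input_frame_rate` of 25 tokens per second. That rate is an illustrative value, not one stated on this page, and the helper name `mel_overlap_len_for` is hypothetical. The constants 22050 and 256 are copied from the expression and appear to play the roles of a sample rate and a hop size in samples.

```python
def mel_overlap_len_for(token_overlap_len: int, input_frame_rate: int) -> int:
    """Convert an overlap counted in tokens into an overlap counted in mel frames.

    Hypothetical helper that mirrors the documented expression term by term.
    """
    # tokens / (tokens per second) gives seconds; * 22050 / 256 gives frames.
    return int(token_overlap_len / input_frame_rate * 22050 / 256)


# token_overlap_len = 20 is the documented value; 25 is an assumed frame rate.
print(mel_overlap_len_for(20, 25))  # 68 (68.90625 truncated by int)
```

Note that `int()` truncates rather than rounds, so the overlap is always rounded down to a whole number of mel frames.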

mel_window instance-attribute

mel_window = hamming(2 * mel_overlap_len)

output_size property

output_size: int

Output mel dimension.

pre_lookahead_layer property

pre_lookahead_layer: Module

Pre-lookahead layer.

source_cache_len instance-attribute

source_cache_len = int(mel_cache_len * 256)

speech_window instance-attribute

speech_window = hamming(2 * source_cache_len)
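With `mel_cache_len = 20`, `source_cache_len` evaluates to 5120, so `speech_window` has 10240 points. The sketch below shows one common way a double-length Hamming window is used to cross-fade the head of a new chunk with the cached tail of the previous one. It is an assumption about how these windows are applied, not a copy of this module's code: `hamming` here is a pure-Python stand-in that uses the standard symmetric Hamming formula, and `cross_fade` is a hypothetical helper.

```python
import math


def hamming(m: int) -> list[float]:
    """Symmetric Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (m - 1))."""
    if m == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (m - 1)) for n in range(m)]


def cross_fade(new_chunk: list[float], prev_tail: list[float],
               window: list[float]) -> list[float]:
    """Blend the first `overlap` values of new_chunk with prev_tail.

    The rising half of the window weights the new chunk and the falling
    half weights the cached tail of the previous chunk.
    """
    overlap = len(window) // 2
    if len(prev_tail) != overlap or len(new_chunk) < overlap:
        raise ValueError("prev_tail must match the overlap; new_chunk must cover it")
    rising, falling = window[:overlap], window[overlap:]
    blended = [new_chunk[i] * rising[i] + prev_tail[i] * falling[i]
               for i in range(overlap)]
    # Values past the overlap region are passed through unchanged.
    return blended + list(new_chunk[overlap:])


source_cache_len = int(20 * 256)  # documented mel_cache_len times 256
speech_window = hamming(2 * source_cache_len)
print(source_cache_len, len(speech_window))
```

The two halves of a Hamming window do not sum exactly to one, so this kind of blend is a smooth transition rather than a strictly gain-preserving one.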

spk_embed_affine_layer property

spk_embed_affine_layer: Linear

Speaker embedding affine layer.

token_mel_ratio property

token_mel_ratio: int

Token to mel ratio.

token_overlap_len instance-attribute

token_overlap_len = 20

forward

forward(
    token: Tensor,
    prompt_token: Tensor,
    prompt_feat: Tensor,
    embedding: Tensor,
    n_timesteps: int = 10,
    token_offset_tokens: int = 0,
) -> Tensor

Generate audio waveform from speech tokens.

forward_streaming

forward_streaming(
    token: Tensor,
    prompt_token: Tensor,
    prompt_feat: Tensor,
    embedding: Tensor,
    *,
    cache_state: dict[str, Tensor] | None = None,
    n_timesteps: int = 10,
    token_offset_tokens: int = 0,
    finalize: bool = False,
) -> tuple[Tensor, dict[str, Tensor] | None]

Decode streaming audio using cumulative mel + emitted-speech offset.

This mirrors upstream CosyVoice3 streaming semantics more closely than waveform-domain overlap-add: keep a cumulative mel history per request, re-run causal HiFT on the history, and emit only the newly grown speech suffix. That preserves causal look-right handling without double trimming or duplicated overlap at chunk boundaries.
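The bookkeeping described above can be sketched with plain Python lists. Everything in this example is illustrative: `toy_vocoder` stands in for the causal HiFT call, `HOP`, `holdback_frames`, and the `"mel"` / `"emitted"` keys are hypothetical names, and returning `None` as the state on the final chunk is an assumption suggested by the return type rather than confirmed behavior.

```python
HOP = 4  # hypothetical samples per mel frame, kept tiny for readability


def toy_vocoder(mel_frames: list[int]) -> list[int]:
    """Stand-in vocoder: each mel frame becomes HOP identical samples."""
    return [frame for frame in mel_frames for _ in range(HOP)]


def stream_step(new_mel, cache_state=None, holdback_frames=1, finalize=False):
    """Grow the mel history, re-synthesize it, and emit only unseen samples."""
    state = cache_state or {"mel": [], "emitted": 0}
    mel = state["mel"] + list(new_mel)  # cumulative mel history
    speech = toy_vocoder(mel)           # re-run the vocoder on the full history
    if finalize:
        end = len(speech)               # flush everything on the last chunk
    else:
        # Hold back the newest frames, whose samples could still change
        # once more right-hand context arrives.
        end = max(state["emitted"], len(speech) - holdback_frames * HOP)
    chunk = speech[state["emitted"]:end]  # newly grown speech suffix only
    new_state = None if finalize else {"mel": mel, "emitted": end}
    return chunk, new_state


state, out = None, []
mel_chunks = [[1, 2], [3], [4, 5]]
for i, mel in enumerate(mel_chunks):
    chunk, state = stream_step(mel, state, finalize=(i == len(mel_chunks) - 1))
    out.extend(chunk)

# Concatenated streaming output equals a single offline pass.
print(out == toy_vocoder([1, 2, 3, 4, 5]))  # True
```

Because every sample is emitted exactly once, at the emitted-speech offset, there is no overlap region to blend or trim at chunk boundaries. The cost is re-synthesizing the full history on each step.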

load_weights

load_weights(model_dir: str, device: device) -> None

Load flow.pt and hift.pt weights.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_dir | str | Model directory containing flow.pt and hift.pt | required |
| device | device | Device to load weights to | required |
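Since `load_weights` expects both `flow.pt` and `hift.pt` in `model_dir`, a caller may want to check for them before invoking it. The helper below is a hypothetical pre-flight check, not part of this module. It relies only on the two file names given in the description above and makes no claim about how the weights themselves are read.

```python
from pathlib import Path

# File names taken from the load_weights description.
REQUIRED_CHECKPOINTS = ("flow.pt", "hift.pt")


def missing_checkpoints(model_dir: str) -> list[str]:
    """Return the required checkpoint file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_CHECKPOINTS if not (root / name).is_file()]


missing = missing_checkpoints(".")
if missing:
    print("missing checkpoints:", ", ".join(missing))
```

An empty list means both files are present and `load_weights(model_dir, device)` can be attempted.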