
vllm_omni.model_executor.models.omnivoice.omnivoice_decoder

OmniVoice Decoder (Stage 1) - Audio token to waveform conversion.

Implements the HiggsAudioV2 decode path using transformers' DacModel decoder and a custom RVQ quantizer, compatible with transformers 4.x.

Decode path

audio_codes [B, 8, T]
→ RVQ codebook lookup + project_out → sum → [B, 1024, T]
→ fc2 Linear(1024, 256) → [B, 256, T]
→ DAC acoustic decoder (conv transpose upsampling) → [B, 1, T*960]
→ 24kHz waveform (25fps × 960 samples/frame)
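The frame-to-sample arithmetic in the last step can be checked with a few lines of plain Python. This is only a sketch of the numbers stated above (25 frames per second, 960 samples per frame); the constant and function names are hypothetical and do not come from the module.

```python
# Constants taken from the decode path description; names are illustrative only.
FRAME_RATE_HZ = 25        # token frames per second
SAMPLES_PER_FRAME = 960   # waveform samples produced per token frame
SAMPLE_RATE_HZ = FRAME_RATE_HZ * SAMPLES_PER_FRAME  # 25 * 960 = 24000


def frames_to_samples(num_frames: int) -> int:
    """Number of waveform samples produced for `num_frames` token frames."""
    return num_frames * SAMPLES_PER_FRAME


def frames_to_seconds(num_frames: int) -> float:
    """Audio duration in seconds for `num_frames` token frames."""
    return num_frames / FRAME_RATE_HZ
```

For example, T = 250 frames corresponds to 250 × 960 = 240000 samples, which is 10 seconds of audio at 24kHz.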

logger module-attribute

logger = init_logger(__name__)

HiggsAudioRVQ

Bases: Module

Residual Vector Quantizer with 8 codebook layers.

quantizers instance-attribute

quantizers = ModuleList(
    [
        HiggsAudioVQLayer(codebook_size, codebook_dim, hidden_size)
        for _ in range(num_quantizers)
    ]
)

decode

decode(codes: Tensor) -> Tensor

codes: [num_quantizers, B, T] → [B, hidden_size, T]

HiggsAudioVQLayer

Bases: Module

Single VQ layer: codebook lookup + project_out.

codebook instance-attribute

codebook = Embedding(codebook_size, codebook_dim)

project_out instance-attribute

project_out = Linear(codebook_dim, hidden_size)

decode

decode(indices: Tensor) -> Tensor

indices: [B, T] → [B, hidden_size, T]
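The two classes above can be sketched together to show how the documented shapes line up. This is a minimal illustration, not the module's actual code: the class names `VQLayerSketch` and `RVQSketch` are hypothetical, the values `codebook_size=1024` and `codebook_dim=64` are assumed for the demo, and `hidden_size=1024` is taken from the decode path. Each layer looks up its codes, projects them to the hidden size, and the per-layer outputs are summed.

```python
import torch
from torch import nn


class VQLayerSketch(nn.Module):
    """Illustrative single VQ layer: codebook lookup followed by project_out."""

    def __init__(self, codebook_size: int, codebook_dim: int, hidden_size: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, codebook_dim)
        self.project_out = nn.Linear(codebook_dim, hidden_size)

    def decode(self, indices: torch.Tensor) -> torch.Tensor:
        z = self.codebook(indices)      # [B, T] -> [B, T, codebook_dim]
        z = self.project_out(z)         # [B, T, hidden_size]
        return z.transpose(1, 2)        # [B, hidden_size, T]


class RVQSketch(nn.Module):
    """Illustrative residual VQ: sum of the decoded outputs of every layer."""

    def __init__(
        self,
        num_quantizers: int = 8,
        codebook_size: int = 1024,   # assumed value, not from the source
        codebook_dim: int = 64,      # assumed value, not from the source
        hidden_size: int = 1024,     # matches [B, 1024, T] in the decode path
    ):
        super().__init__()
        self.quantizers = nn.ModuleList(
            [
                VQLayerSketch(codebook_size, codebook_dim, hidden_size)
                for _ in range(num_quantizers)
            ]
        )

    def decode(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: [num_quantizers, B, T]; iterating yields one [B, T] slice per layer.
        total = None
        for layer, layer_codes in zip(self.quantizers, codes):
            decoded = layer.decode(layer_codes)
            total = decoded if total is None else total + decoded
        return total                    # [B, hidden_size, T]


rvq = RVQSketch()
audio_codes = torch.randint(0, 1024, (2, 8, 5))   # [B, 8, T]
# Assumption: the caller moves the codebook axis to the front before decode.
hidden = rvq.decode(audio_codes.permute(1, 0, 2))
print(hidden.shape)
```

The summation is what makes the quantizer residual: each layer contributes a correction on top of the layers before it, so all eight projected embeddings share the same hidden dimension.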

OmniVoiceDecoder

Bases: Module

OmniVoice Stage 1: Token-to-audio decoder.

Uses the DAC acoustic decoder from transformers and a custom HiggsAudio RVQ quantizer to convert 8-codebook tokens into a 24kHz waveform.

acoustic_decoder instance-attribute

acoustic_decoder = None

config instance-attribute

config = config

fc2 instance-attribute

fc2 = None

quantizer instance-attribute

quantizer = None

sample_rate instance-attribute

sample_rate = sample_rate

forward

forward(audio_codes: Tensor) -> Tensor

Decode audio tokens to waveform.

Parameters:

Name         Type    Description                              Default
audio_codes  Tensor  [B, 8, T] - 8-codebook audio token IDs   required

Returns:

Name      Type    Description
waveform  Tensor  [B, 1, audio_samples] at 24kHz
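The shape changes after the quantizer can be traced with stand-in layers. The sketch below is an assumption-laden illustration rather than the real forward pass: `fc2` is built directly as `Linear(1024, 256)` as stated in the decode path, while the DAC acoustic decoder is replaced by a single hypothetical `ConvTranspose1d` with stride 960, chosen only because it reproduces the documented `T*960` output length.

```python
import torch
from torch import nn


def trace_decoder_shapes(hidden: torch.Tensor) -> torch.Tensor:
    """Trace [B, 1024, T] -> [B, 256, T] -> [B, 1, T*960] with stand-in layers."""
    fc2 = nn.Linear(1024, 256)
    # Stand-in for the DAC acoustic decoder (NOT the transformers implementation):
    # with kernel_size == stride == 960, each input frame maps to 960 samples.
    upsample = nn.ConvTranspose1d(256, 1, kernel_size=960, stride=960)

    with torch.no_grad():
        # Linear acts on the last axis, so move channels last and back again.
        x = fc2(hidden.transpose(1, 2)).transpose(1, 2)   # [B, 256, T]
        waveform = upsample(x)                            # [B, 1, T*960]
    return waveform


hidden = torch.randn(2, 1024, 10)     # stand-in for the RVQ output
waveform = trace_decoder_shapes(hidden)
print(waveform.shape)
```

With B = 2 and T = 10 the result has 10 × 960 = 9600 samples per item, matching the `[B, 1, audio_samples]` return shape.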

load_weights

load_weights(model_dir: str, device: device) -> None

Load decoder components from audio_tokenizer/model.safetensors.
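A caller may want to confirm that the checkpoint exists before invoking `load_weights`. The helper below is a hypothetical sketch, not part of the module, and it assumes that `audio_tokenizer/model.safetensors` is resolved relative to `model_dir`.

```python
from pathlib import Path


def decoder_checkpoint_path(model_dir: str) -> Path:
    """Return the expected decoder checkpoint path, or raise if it is missing.

    Assumption: the file lives at <model_dir>/audio_tokenizer/model.safetensors.
    """
    path = Path(model_dir) / "audio_tokenizer" / "model.safetensors"
    if not path.is_file():
        raise FileNotFoundError(f"decoder checkpoint not found: {path}")
    return path
```

Failing early with the full path in the message makes a misconfigured model directory easier to diagnose than an error raised deep inside weight loading.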