vllm_omni.model_executor.models.omnivoice.omnivoice

OmniVoice model for vLLM-Omni two-stage TTS pipeline.

Stage 0 (Generator): Qwen3 backbone + iterative unmasking → 8-codebook tokens

Stage 1 (Decoder): HiggsAudioV2 decoder → 24 kHz waveform
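To make "iterative unmasking" concrete, here is a minimal toy sketch of the general technique: start from a fully masked grid of codebook tokens and reveal the most confident positions over a fixed number of passes. It is not taken from the OmniVoice source. The `MASK` sentinel, the linear reveal schedule, the vocabulary size of 1024, and the `toy_predict` scorer (a stand-in for the Qwen3 backbone) are all assumptions chosen for illustration.

```python
import random

MASK = -1  # hypothetical sentinel for a position that has not been generated yet


def toy_predict(grid, codebook, frame):
    """Stand-in for the backbone: returns (token_id, confidence) for one position."""
    rng = random.Random(codebook * 1000 + frame)  # deterministic toy scores
    return rng.randrange(1024), rng.random()


def iterative_unmask(num_codebooks, num_frames, num_steps, predict):
    """Fill a num_codebooks x num_frames token grid over at most num_steps passes."""
    grid = [[MASK] * num_frames for _ in range(num_codebooks)]
    total = num_codebooks * num_frames
    for step in range(1, num_steps + 1):
        masked = [
            (c, t)
            for c in range(num_codebooks)
            for t in range(num_frames)
            if grid[c][t] == MASK
        ]
        if not masked:
            break
        # Linear schedule (assumption): after step s, s/num_steps of the grid is revealed.
        target_revealed = (total * step) // num_steps
        already_revealed = total - len(masked)
        k = max(1, target_revealed - already_revealed)
        # Score every masked position, then commit only the k most confident ones.
        scored = [(predict(grid, c, t), (c, t)) for (c, t) in masked]
        scored.sort(key=lambda item: item[0][1], reverse=True)
        for (token, _confidence), (c, t) in scored[:k]:
            grid[c][t] = token
    return grid


tokens = iterative_unmask(num_codebooks=8, num_frames=5, num_steps=4, predict=toy_predict)
```

The resulting `tokens` grid has one row per codebook, which is the shape a Stage 1 decoder would consume in this simplified picture.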

logger module-attribute

logger = init_logger(__name__)

OmniVoiceDummyInputsBuilder

Bases: BaseDummyInputsBuilder[OmniVoiceMultiModalProcessingInfo]

get_dummy_mm_data

get_dummy_mm_data(
    seq_len: int,
    mm_counts: Mapping[str, int],
    mm_options: Mapping[str, BaseDummyOptions]
    | None = None,
) -> MultiModalDataDict

get_dummy_processor_inputs

get_dummy_processor_inputs(
    seq_len: int,
    mm_counts: Mapping[str, int],
    mm_options: Mapping[str, BaseDummyOptions]
    | None = None,
) -> ProcessorInputs

get_dummy_text

get_dummy_text(mm_counts: Mapping[str, int]) -> str

OmniVoiceModel

Bases: Module

OmniVoice model for vLLM-Omni two-stage pipeline.

Routes to generator (Stage 0) or decoder (Stage 1) based on model_stage.
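The routing idea can be sketched as follows. This is a simplified illustration, not the actual class: `ToyGenerator`, `ToyDecoder`, `ToyStageRouter`, and the stage names `"generator"` and `"decoder"` are hypothetical, and pointing `model` at the decoder in Stage 1 is an assumption (the attributes listed here only show `model = self.generator`).

```python
class ToyGenerator:
    """Hypothetical stand-in for the Stage 0 generator."""

    def run(self, text):
        # Pretend each character maps to one codec token.
        return [ord(ch) % 1024 for ch in text]


class ToyDecoder:
    """Hypothetical stand-in for the Stage 1 decoder."""

    def run(self, tokens):
        # Pretend each token maps to one waveform sample in [0, 1).
        return [tok / 1024.0 for tok in tokens]


class ToyStageRouter:
    """Builds only the sub-model for the configured stage and forwards to it."""

    def __init__(self, model_stage):
        self.model_stage = model_stage
        self.generator = None
        self.decoder = None
        if model_stage == "generator":  # assumed stage name
            self.generator = ToyGenerator()
            self.model = self.generator
        elif model_stage == "decoder":  # assumed stage name
            self.decoder = ToyDecoder()
            self.model = self.decoder
        else:
            raise ValueError(f"unknown model_stage: {model_stage!r}")

    def forward(self, payload):
        # Delegate to whichever stage was selected at construction time.
        return self.model.run(payload)


stage0 = ToyStageRouter("generator")
stage1 = ToyStageRouter("decoder")
waveform = stage1.forward(stage0.forward("hi"))
```

Selecting the stage once at construction keeps `forward` free of per-call branching, which is one plausible reason for exposing a single `model` attribute.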

config instance-attribute

config = vllm_config.model_config.hf_config

decoder instance-attribute

decoder = OmniVoiceDecoder(self.config)

generator instance-attribute

generator = OmniVoiceGenerator(self.config)

have_multimodal_outputs instance-attribute

have_multimodal_outputs = True

model instance-attribute

model = self.generator

model_dir instance-attribute

model_dir = vllm_config.model_config.model

model_stage instance-attribute

model_stage = vllm_config.model_config.model_stage

requires_raw_input_tokens class-attribute instance-attribute

requires_raw_input_tokens = True

embed_input_ids

embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings=None,
    is_multimodal=None,
) -> Tensor

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    additional_information: dict[str, object] | None = None,
    **kwargs: object,
) -> OmniOutput

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

OmniVoiceMultiModalProcessingInfo

Bases: BaseProcessingInfo

get_data_parser

get_data_parser()

get_hf_config

get_hf_config()

get_supported_mm_limits

get_supported_mm_limits() -> Mapping[str, int | None]

OmniVoiceMultiModalProcessor

Bases: BaseMultiModalProcessor[OmniVoiceMultiModalProcessingInfo]

Processes text + optional reference audio for OmniVoice.

For voice cloning: text + reference audio → tokenized reference

For auto voice: text only
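The two input modes might look like this from the caller's side. This is a hedged sketch: `build_tts_prompt` is a hypothetical helper, and the `"audio"` modality key and the `(samples, sample_rate)` tuple layout are assumptions modeled on vLLM's general `prompt` / `multi_modal_data` prompt dictionaries, not confirmed for OmniVoice.

```python
from __future__ import annotations

from typing import Any


def build_tts_prompt(
    text: str,
    ref_audio: tuple[list[float], int] | None = None,
) -> dict[str, Any]:
    """Build a prompt dict for auto voice (text only) or voice cloning (text + audio)."""
    prompt: dict[str, Any] = {"prompt": text}
    if ref_audio is not None:
        samples, sample_rate = ref_audio
        # Voice cloning: attach the reference clip under an assumed "audio" key.
        prompt["multi_modal_data"] = {"audio": (samples, sample_rate)}
    return prompt


auto_voice = build_tts_prompt("Hello there.")
cloned_voice = build_tts_prompt("Hello there.", ref_audio=([0.0, 0.1, -0.1], 24000))
```

In the auto-voice case no multimodal entry is attached, so a processor would only see the text.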