
vllm_omni.diffusion.models.omnivoice.pipeline_omnivoice

OmniVoice TTS Pipeline for vLLM-Omni diffusion engine.

Single-stage pipeline that runs the full text-to-speech flow:

text → tokenize → 32-step iterative unmasking → 8-codebook tokens → DAC decode → 24 kHz audio

Uses request-mode execution (all steps in one forward() call).
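To illustrate what "iterative unmasking" means here, the following is a toy sketch in plain Python, not the OmniVoiceGenerator implementation. It starts from a fully masked 8-codebook grid and reveals a share of positions at each of 32 steps. The linear schedule, the MASK sentinel, the vocabulary size, and the random "predictions" are all assumptions made for illustration; the real generator uses a model to score positions and is configured by parameters such as t_shift and the temperatures listed below.

```python
import random

MASK = -1  # hypothetical sentinel for a position that is still masked


def unmask_schedule(total: int, num_step: int) -> list[int]:
    """How many positions to reveal at each step (linear schedule, assumed)."""
    counts = []
    revealed = 0
    for step in range(1, num_step + 1):
        target = round(total * step / num_step)
        counts.append(target - revealed)
        revealed = target
    return counts


def toy_iterative_unmask(num_codebooks=8, num_frames=10, num_step=32,
                         vocab_size=1024, seed=0):
    """Fill a (num_codebooks x num_frames) token grid over num_step steps."""
    rng = random.Random(seed)
    tokens = [[MASK] * num_frames for _ in range(num_codebooks)]
    masked = [(c, t) for c in range(num_codebooks) for t in range(num_frames)]
    for k in unmask_schedule(len(masked), num_step):
        # A real model would rank masked positions by confidence;
        # this sketch simply picks k of them at random.
        rng.shuffle(masked)
        chosen, masked = masked[:k], masked[k:]
        for c, t in chosen:
            tokens[c][t] = rng.randrange(vocab_size)
    return tokens


grid = toy_iterative_unmask()
```

After the final step no position remains masked, and the resulting grid has one row per codebook, which is the shape the decoder stage would consume.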

logger module-attribute

logger = init_logger(__name__)

OmniVoicePipeline

Bases: Module, SupportAudioOutput

OmniVoice text-to-speech pipeline for the diffusion engine.

Wraps OmniVoiceGenerator (32-step iterative unmasking) and OmniVoiceDecoder (HiggsAudioV2 RVQ + DAC) into a single forward() call.

audio_tokenizer instance-attribute

audio_tokenizer = (
    HiggsAudioV2TokenizerModel.from_pretrained(
        audio_tokenizer_path, device_map=self.device
    ).eval()
)

class_temperature instance-attribute

class_temperature = self.config.class_temperature

config instance-attribute

config = OmniVoiceConfig(**hf_config)

decoder instance-attribute

decoder = OmniVoiceDecoder(self.config)

device instance-attribute

device = get_local_device()

duration_estimator instance-attribute

duration_estimator = RuleDurationEstimator()

generator instance-attribute

generator = OmniVoiceGenerator(self.config)

guidance_scale instance-attribute

guidance_scale = self.config.guidance_scale

layer_penalty_factor instance-attribute

layer_penalty_factor = self.config.layer_penalty_factor

model_path instance-attribute

model_path = od_config.model

num_step instance-attribute

num_step = self.config.num_step

od_config instance-attribute

od_config = od_config

position_temperature instance-attribute

position_temperature = self.config.position_temperature

sample_rate instance-attribute

sample_rate = self.config.sample_rate

support_audio_output class-attribute

support_audio_output: bool = True

t_shift instance-attribute

t_shift = self.config.t_shift

tokenizer instance-attribute

tokenizer = HFTokenizer.from_file(tokenizer_path)

forward

Generate speech audio from text, optionally with voice cloning.

Accepts either a plain text prompt or a structured dict of the form:

{"text": "...", "ref_audio": (samples, sr), "ref_text": "...", "lang": "...", "instruct": "..."}
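The two accepted prompt shapes can be reduced to one dict before any further processing. The helper below is a hypothetical sketch (normalize_prompt is not part of the documented API); it only uses the keys shown above and assumes that "text" is required while the other keys are optional.

```python
from __future__ import annotations

from typing import Any

# Optional keys, taken from the structured prompt shown above.
OPTIONAL_KEYS = ("ref_audio", "ref_text", "lang", "instruct")


def normalize_prompt(prompt: str | dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: turn either prompt form into a plain dict."""
    if isinstance(prompt, str):
        return {"text": prompt}
    if "text" not in prompt:
        raise ValueError("structured prompt needs a 'text' entry")
    out: dict[str, Any] = {"text": prompt["text"]}
    for key in OPTIONAL_KEYS:
        # Keep only the optional fields that were actually provided.
        if prompt.get(key) is not None:
            out[key] = prompt[key]
    return out


plain = normalize_prompt("Hello there.")
cloned = normalize_prompt({
    "text": "Hello there.",
    "ref_audio": ([0.0, 0.1, -0.1], 24000),  # (samples, sr) as in the dict above
    "ref_text": "Reference transcript.",
})
```

Here `plain` carries only the text, while `cloned` also carries the reference audio and transcript that the voice-cloning path would use.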

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights from model directory (not from the iterator).

The diffusion model loader passes HF safetensors weights, but OmniVoice uses custom weight-name prefixes (llm. → generator., audio_tokenizer. → decoder.). The weights are therefore loaded directly from model_path, and the names of all parameters are returned to satisfy the loader's "all weights initialized" check.
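The prefix mapping described above can be sketched as a small renaming function. This is an illustration only: remap_weight_name and PREFIX_MAP are hypothetical names, and how the actual load_weights treats names outside these two prefixes is not stated in this page.

```python
# Prefix pairs as given in the description above.
PREFIX_MAP = {
    "llm.": "generator.",
    "audio_tokenizer.": "decoder.",
}


def remap_weight_name(name: str) -> str:
    """Hypothetical helper: rewrite a checkpoint key to the pipeline's naming."""
    for old, new in PREFIX_MAP.items():
        if name.startswith(old):
            return new + name[len(old):]
    # Assumption: names with no known prefix are passed through unchanged.
    return name


checkpoint_keys = ["llm.layers.0.weight", "audio_tokenizer.quantizer.codebook"]
remapped = [remap_weight_name(k) for k in checkpoint_keys]
```

Only a leading prefix is rewritten, so a key that merely contains "llm." somewhere in the middle is left as it is.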

get_omnivoice_post_process_func

get_omnivoice_post_process_func(
    od_config: OmniDiffusionConfig,
)

Post-processing: convert audio tensor to numpy for WAV encoding.
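For context on the WAV encoding step that follows this conversion, here is a minimal standard-library sketch. It is not the function returned by get_omnivoice_post_process_func: it assumes mono float samples in the range [-1, 1] given as a plain Python sequence, 16-bit PCM output, and the 24 kHz sample rate mentioned at the top of this page. The name floats_to_wav_bytes is hypothetical.

```python
import io
import struct
import wave


def floats_to_wav_bytes(samples, sample_rate=24000):
    """Hypothetical helper: encode mono float samples as 16-bit PCM WAV bytes."""
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        pcm += struct.pack("<h", int(round(s * 32767)))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(pcm))
    return buf.getvalue()


wav_bytes = floats_to_wav_bytes([0.0, 0.5, -0.5, 1.5])
```

In the real pipeline the input would be an audio tensor converted to a NumPy array; the clipping and 16-bit scaling shown here are one common choice, not a documented behaviour of the post-processing function.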