vllm_omni.model_executor.models.moss_tts_nano

Modules:

Name                          Description
configuration_moss_tts_nano   Configuration for MOSS-TTS-Nano in vLLM-Omni single-stage pipeline.
modeling_moss_tts_nano        MOSS-TTS-Nano single-stage model for vLLM-Omni.
pipeline                      MOSS-TTS-Nano pipeline topology (frozen).

MossTTSNanoForGeneration

Bases: Module

Single-stage MOSS-TTS-Nano model with streaming audio output.

Uses the VoxCPM-style generator pattern: the generator returned by inference_stream() is stored per request and yields one audio chunk per forward() call. The autoregressive (AR) scheduler keeps the request alive until compute_logits() emits EOS.
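
The per-request generator pattern described above can be illustrated with a minimal, framework-free sketch. Everything here is hypothetical and not taken from the vLLM-Omni source: the `StreamingModelSketch` class, the `_generators` dictionary, the simplified `forward(req_id, text)` signature, and the stand-in `inference_stream()` that yields string chunks instead of audio tensors.

```python
from typing import Dict, Iterator, Optional, Tuple


def inference_stream(text: str, chunk_size: int = 4) -> Iterator[Tuple[str, bool]]:
    """Hypothetical stand-in for the real stream: yields (chunk, is_last) pairs."""
    pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)] or [""]
    for index, piece in enumerate(pieces):
        yield piece, index == len(pieces) - 1


class StreamingModelSketch:
    """Keeps one generator per request and advances it once per forward() call."""

    def __init__(self) -> None:
        # Assumed bookkeeping: request ID -> live generator.
        self._generators: Dict[str, Iterator[Tuple[str, bool]]] = {}

    def forward(self, req_id: str, text: str) -> Tuple[Optional[str], bool]:
        gen = self._generators.get(req_id)
        if gen is None:
            # First call for this request: create and store the generator.
            gen = inference_stream(text)
            self._generators[req_id] = gen
        try:
            chunk, is_last = next(gen)
        except StopIteration:
            # Generator exhausted: drop it and report completion.
            self._generators.pop(req_id, None)
            return None, True
        if is_last:
            # Normal completion: no further chunks, so release the generator.
            self._generators.pop(req_id, None)
        return chunk, is_last
```

In this sketch a caller keeps invoking `forward()` with the same request ID until the returned flag is true, which plays the role that the EOS signal from compute_logits() plays for the AR scheduler.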

config instance-attribute

config = hf_config

enable_update_additional_information class-attribute instance-attribute

enable_update_additional_information = True

has_postprocess class-attribute instance-attribute

has_postprocess = False

has_preprocess class-attribute instance-attribute

has_preprocess = False

have_multimodal_outputs class-attribute instance-attribute

have_multimodal_outputs = True

inject_omni_request_id_into_runtime_info class-attribute instance-attribute

inject_omni_request_id_into_runtime_info = True

model_path instance-attribute

model_path: str = model

requires_raw_input_tokens class-attribute instance-attribute

requires_raw_input_tokens = True

vllm_config instance-attribute

vllm_config = vllm_config

compute_logits

compute_logits(
    hidden_states: Tensor | OmniOutput,
    sampling_metadata: Any = None,
) -> Tensor

Emit per-row EOS / non-EOS logits to control AR scheduler lifetime.

Rows whose _ar_last_chunk_flags entry is True get EOS-dominant logits so the scheduler finishes that request; other rows get a non-EOS token so they stay alive for the next streaming chunk.
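
As a rough illustration of the per-row behavior described above, the following sketch builds one logits row per request using plain Python lists in place of tensors. The function name `eos_control_logits`, the token IDs, the vocabulary size, and the `high`/`low` magnitudes are all assumptions made for this example; the actual implementation may choose them differently.

```python
from typing import List, Sequence


def eos_control_logits(
    last_chunk_flags: Sequence[bool],
    vocab_size: int,
    eos_token_id: int,
    keep_alive_token_id: int,
    high: float = 1.0e4,
    low: float = -1.0e4,
) -> List[List[float]]:
    """Return one logits row per request (hypothetical helper).

    Rows flagged as the last chunk peak at the EOS token; all other rows
    peak at a non-EOS token so greedy sampling keeps the request running.
    """
    if eos_token_id == keep_alive_token_id:
        raise ValueError("keep-alive token must differ from EOS")
    rows: List[List[float]] = []
    for is_last in last_chunk_flags:
        row = [low] * vocab_size
        winner = eos_token_id if is_last else keep_alive_token_id
        row[winner] = high
        rows.append(row)
    return rows
```

With flags `[False, True]`, the first row's argmax is the keep-alive token and the second row's argmax is EOS, so only the second request would be finished by a sampler that picks the highest logit.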

embed_input_ids

embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings=None,
    is_multimodal=None,
) -> Tensor

forward

forward(
    input_ids: Tensor | None = None,
    positions: Tensor | None = None,
    intermediate_tensors: Any = None,
    inputs_embeds: Tensor | None = None,
    runtime_additional_information: list[dict[str, Any]]
    | None = None,
    **kwargs: Any,
) -> OmniOutput

get_dummy_runtime_additional_information

get_dummy_runtime_additional_information(
    num_reqs: int,
) -> list[dict]

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

on_requests_finished

on_requests_finished(
    finished_req_ids: set[str] | list[str],
) -> None

Release streaming generators for requests the scheduler finished.

forward() only pops generators on normal completion (is_last or StopIteration). On abnormal termination (cancellation, timeout, or preemption) the generator would otherwise leak and its finally block would not run, stranding temporary WAV files. Closing the generator here raises GeneratorExit inside it so the cleanup block runs.
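
The cleanup mechanism relies on standard Python generator semantics: calling `close()` on a suspended generator raises GeneratorExit at the paused `yield`, which runs any enclosing `finally` block. The sketch below shows that behavior in isolation. The `GeneratorRegistrySketch` class, the `start()` helper, and the `cleaned_up` list (standing in for temp-file deletion) are hypothetical names for illustration only.

```python
from typing import Dict, Generator, Iterable, List

# Records which requests ran their cleanup (stand-in for deleting temp files).
cleaned_up: List[str] = []


def stream_with_cleanup(req_id: str) -> Generator[int, None, None]:
    """Hypothetical streaming generator with a cleanup block."""
    try:
        for chunk_index in range(1000):
            yield chunk_index
    finally:
        # Runs on normal exhaustion and when close() raises GeneratorExit.
        cleaned_up.append(req_id)


class GeneratorRegistrySketch:
    def __init__(self) -> None:
        self._generators: Dict[str, Generator[int, None, None]] = {}

    def start(self, req_id: str) -> int:
        gen = stream_with_cleanup(req_id)
        self._generators[req_id] = gen
        # Advance once so the generator is suspended inside its try block.
        return next(gen)

    def on_requests_finished(self, finished_req_ids: Iterable[str]) -> None:
        for req_id in finished_req_ids:
            # pop() with a default tolerates IDs that already completed normally.
            gen = self._generators.pop(req_id, None)
            if gen is not None:
                gen.close()  # raises GeneratorExit at the paused yield
```

Note that `close()` on a generator that was never advanced does not enter the function body, so its `finally` block does not run; this sketch avoids that case by advancing the generator once in `start()`.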