vllm_omni.model_executor.models.moss_tts_nano ¶
Modules:
| Name | Description |
|---|---|
| configuration_moss_tts_nano | Configuration for MOSS-TTS-Nano in vLLM-Omni single-stage pipeline. |
| modeling_moss_tts_nano | MOSS-TTS-Nano single-stage model for vLLM-Omni. |
| pipeline | MOSS-TTS-Nano pipeline topology (frozen). |
MossTTSNanoForGeneration ¶
Bases: Module
Single-stage MOSS-TTS-Nano model with streaming audio output.
Uses the VoxCPM-style generator pattern: the generator returned by inference_stream() is stored per request and yields one audio chunk per forward() call. The AR scheduler keeps the request alive until compute_logits() emits EOS.
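The generator-per-request pattern described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the actual model code: `fake_inference_stream`, `GeneratorPerRequest`, the `_generators` dict, and the `(chunk, is_last)` tuple shape are all assumptions made for the example.

```python
from typing import Dict, Iterator, List, Tuple

Chunk = Tuple[List[float], bool]  # (audio samples, is_last flag); assumed shape


def fake_inference_stream(n_chunks: int) -> Iterator[Chunk]:
    """Hypothetical stand-in for inference_stream(): yields one chunk at a time."""
    for i in range(n_chunks):
        yield [float(i)] * 4, i == n_chunks - 1


class GeneratorPerRequest:
    """Toy model: one stored generator per request, one chunk per forward()."""

    def __init__(self) -> None:
        # Hypothetical per-request registry keyed by request id.
        self._generators: Dict[str, Iterator[Chunk]] = {}

    def forward(self, request_id: str, n_chunks: int = 3) -> Chunk:
        gen = self._generators.get(request_id)
        if gen is None:
            # First call for this request: create and remember the generator.
            gen = fake_inference_stream(n_chunks)
            self._generators[request_id] = gen
        try:
            chunk, is_last = next(gen)  # advance by exactly one chunk
        except StopIteration:
            # Generator ran dry: treat as normal completion.
            self._generators.pop(request_id, None)
            return [], True
        if is_last:
            # Normal completion: drop the generator so it can be collected.
            self._generators.pop(request_id, None)
        return chunk, is_last


model = GeneratorPerRequest()
flags = [model.forward("req-0")[1] for _ in range(3)]
```

In this sketch each `forward()` call advances the stored generator by one step, and the `is_last` flag is what a later stage would use to decide when the request should end.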
enable_update_additional_information class-attribute instance-attribute ¶
inject_omni_request_id_into_runtime_info class-attribute instance-attribute ¶
compute_logits ¶
compute_logits(
hidden_states: Tensor | OmniOutput,
sampling_metadata: Any = None,
) -> Tensor
Emit per-row EOS / non-EOS logits to control AR scheduler lifetime.
Rows whose _ar_last_chunk_flags entry is True get EOS-dominant logits so the scheduler finishes that request; other rows get a non-EOS token so they stay alive for the next streaming chunk.
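As a rough illustration of the per-row EOS control described above, the sketch below builds one logits row per request from a list of last-chunk flags. It uses plain Python lists instead of tensors, and `VOCAB_SIZE`, `EOS_TOKEN_ID`, `KEEP_ALIVE_ID`, and the helper names are assumptions for the example, not values taken from the model.

```python
from typing import List

VOCAB_SIZE = 8     # hypothetical toy vocabulary size
EOS_TOKEN_ID = 2   # hypothetical EOS token id
KEEP_ALIVE_ID = 1  # hypothetical non-EOS token id
NEG = -1e9         # large negative logit for every other token


def eos_control_logits(last_chunk_flags: List[bool]) -> List[List[float]]:
    """Return one logits row per request; exactly one token dominates each row."""
    rows: List[List[float]] = []
    for is_last in last_chunk_flags:
        row = [NEG] * VOCAB_SIZE
        # Finished rows favor EOS; unfinished rows favor a non-EOS token.
        row[EOS_TOKEN_ID if is_last else KEEP_ALIVE_ID] = 0.0
        rows.append(row)
    return rows


def greedy_pick(row: List[float]) -> int:
    """Index of the largest logit, as a greedy sampler would choose."""
    return max(range(len(row)), key=row.__getitem__)


logits = eos_control_logits([True, False])
picked = [greedy_pick(row) for row in logits]
```

Under these assumptions the first row samples EOS, so the scheduler would finish that request, while the second row samples the keep-alive token and stays scheduled for another chunk.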
embed_input_ids ¶
forward ¶
forward(
input_ids: Tensor | None = None,
positions: Tensor | None = None,
intermediate_tensors: Any = None,
inputs_embeds: Tensor | None = None,
runtime_additional_information: list[dict[str, Any]] | None = None,
**kwargs: Any,
) -> OmniOutput
get_dummy_runtime_additional_information ¶
on_requests_finished ¶
Release streaming generators for requests the scheduler finished.
forward() only pops generators on normal completion (is_last or StopIteration). Abnormal termination (cancellation, timeout, preemption) would otherwise leak the generator and skip its finally block, leaving temporary WAV files behind. Closing the generator here raises GeneratorExit inside it so the cleanup block runs.
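The cleanup behavior relies on standard Python generator semantics: calling `close()` on a suspended generator raises `GeneratorExit` at the paused `yield`, which lets a surrounding `finally` block run. A minimal sketch, assuming a hypothetical module-level `generators` registry and a `cleanup_log` list standing in for temp-file removal:

```python
from typing import Dict, Generator, Iterable, List

cleanup_log: List[str] = []  # stands in for "temp WAV file deleted"
generators: Dict[str, Generator[int, None, None]] = {}  # hypothetical registry


def streaming_generator(request_id: str) -> Generator[int, None, None]:
    """Endless chunk stream whose finally block performs cleanup."""
    try:
        chunk = 0
        while True:
            yield chunk
            chunk += 1
    finally:
        # Runs on normal exit and when close() raises GeneratorExit here.
        cleanup_log.append(request_id)


def on_requests_finished(finished_ids: Iterable[str]) -> None:
    """Close and drop generators for requests the scheduler has finished."""
    for request_id in finished_ids:
        gen = generators.pop(request_id, None)
        if gen is not None:
            gen.close()  # raises GeneratorExit inside the suspended generator


generators["req-a"] = streaming_generator("req-a")
# Advance once so the generator is suspended inside its try block.
next(generators["req-a"])
# Unknown ids are ignored, so the call is safe for already-popped requests.
on_requests_finished(["req-a", "req-unknown"])
```

Note that `close()` on a generator that was never started does not execute its body, so in this sketch the `finally` block only runs for generators that have yielded at least once.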