vllm_omni.model_executor.stage_input_processors.moss_tts
Stage input processors: MOSS-TTS talker (Stage 0) → codec (Stage 1).
talker2codec
talker2codec(
    stage_list: list[Any],
    engine_input_source: list[int],
    prompt: Any = None,
    requires_multimodal_data: bool = False,
) -> list[Any]
Convert all talker codes to a single Stage-1 token sequence.
The Stage 0 output contains codes["audio"] with shape (T, NQ), where T is the number of generated audio frames and NQ is n_vq. The codes are flattened into a 1-D sequence of length NQ * T and passed as the Stage-1 input_ids, so that the codec can reshape the sequence back to (NQ, T) for decoding.
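The round trip described above can be illustrated with a minimal pure-Python sketch. The helper names flatten_codes and unflatten_codes are hypothetical and not part of this module, and the codebook-major ordering (all frames of codebook 0 first, then codebook 1, and so on) is an assumption chosen so that a plain reshape to (NQ, T) recovers the codes; the actual implementation may order elements differently.

```python
# Hypothetical helpers for illustration only; not the module's implementation.
# Assumption: the flat sequence is codebook-major, i.e. the (T, NQ) input is
# transposed to (NQ, T) before being flattened.

def flatten_codes(codes):
    """Flatten a (T, NQ) nested list into a length NQ * T list."""
    if not codes:
        return []
    num_frames, n_vq = len(codes), len(codes[0])
    # Walk codebook by codebook, emitting every frame of each codebook.
    return [codes[t][q] for q in range(n_vq) for t in range(num_frames)]


def unflatten_codes(flat, n_vq):
    """Reshape a length NQ * T list back into NQ rows of T codes each."""
    if len(flat) % n_vq != 0:
        raise ValueError("sequence length must be a multiple of n_vq")
    num_frames = len(flat) // n_vq
    return [flat[q * num_frames:(q + 1) * num_frames] for q in range(n_vq)]


# Three frames (T = 3) with two codebooks (NQ = 2).
codes = [[10, 20], [11, 21], [12, 22]]
flat = flatten_codes(codes)
print(flat)                           # [10, 11, 12, 20, 21, 22]
print(unflatten_codes(flat, n_vq=2))  # [[10, 11, 12], [20, 21, 22]]
```

With this ordering, the Stage-1 side needs only n_vq to recover the frame count, because T equals the sequence length divided by NQ.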
talker2codec_async_chunk
talker2codec_async_chunk(
    transfer_manager: Any,
    pooling_output: dict[str, Any] | None,
    request: Any,
    is_finished: bool = False,
) -> OmniPayloadStruct | None
Emit accumulated audio codes to Stage 1 as they arrive from Stage 0.
State is maintained in transfer_manager keyed by request ID. A chunk is forwarded to Stage 1 when either: (a) is_finished is True (flush all remaining codes), or (b) the accumulated frame count reaches chunk_frames (default 25).
Returns a payload compatible with the Stage-1 input format (an OmniPayloadStruct, per the signature), or None to signal that not enough frames have accumulated yet and the caller should wait for more.
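The buffering rule can be sketched as a small standalone class. FrameAccumulator and its push method are hypothetical names used only for illustration; the real function keeps its state on transfer_manager and returns an OmniPayloadStruct. Two behaviours here are assumptions rather than documented facts: when the threshold is reached, everything accumulated so far is forwarded (not exactly chunk_frames frames), and a final flush with an empty buffer returns None.

```python
from collections import defaultdict


class FrameAccumulator:
    """Hypothetical per-request frame buffer; not the module's implementation."""

    def __init__(self, chunk_frames=25):
        self.chunk_frames = chunk_frames
        # Pending frames, keyed by request ID.
        self._pending = defaultdict(list)

    def push(self, request_id, frames, is_finished=False):
        """Add frames; return a chunk to forward, or None to keep waiting."""
        buf = self._pending[request_id]
        buf.extend(frames)
        if is_finished:
            # Flush whatever remains and drop the per-request state.
            self._pending.pop(request_id, None)
            return buf or None  # assumption: nothing left means no chunk
        if len(buf) >= self.chunk_frames:
            # Assumption: forward all accumulated frames and start a new buffer.
            self._pending[request_id] = []
            return buf
        return None


acc = FrameAccumulator(chunk_frames=3)
print(acc.push("req-1", [[1, 2]]))                    # None
print(acc.push("req-1", [[3, 4], [5, 6]]))            # [[1, 2], [3, 4], [5, 6]]
print(acc.push("req-1", [[7, 8]], is_finished=True))  # [[7, 8]]
```

Keying the buffer by request ID keeps concurrent requests independent, which matches the description of state being held per request.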