vllm_omni.model_executor.models.moss_tts ¶
Modules:
| Name | Description |
|---|---|
| audio_tokenizer | MOSS Audio Tokenizer: inference-only codec (encode waveform ↔ RVQ codes). |
| configuration_moss_tts | MOSS-TTS model configuration. |
| modeling_moss_tts_codec | MOSS-TTS Stage-1 codec decoder: RVQ codes → 24 kHz waveform. |
| modeling_moss_tts_local | Local depth transformer for MossTTSRealtime. |
| modeling_moss_tts_talker | MOSS-TTS Stage-0 talker: Qwen3 backbone + (n_vq+1) parallel AR heads. |
| pipeline | Pipeline topology for all MOSS-TTS variants (2-stage: talker → codec). |
| reference_encoder | Reference-audio encoding + speaker cache for the MOSS-TTS-family talker. |
MossTTSCodecDecoder ¶
Bases: Module
Stage-1 decoder for all MOSS-TTS variants.
Consumes (T, NQ) audio VQ codes emitted by Stage 0 and decodes them to a 24 kHz mono waveform using the vendored MossAudioTokenizerModel.
All five variants share the same codec checkpoint OpenMOSS-Team/MOSS-Audio-Tokenizer. The number of quantizers (n_vq) is passed as a num_quantizers argument to batch_decode so one codec instance handles both n_vq=32 (MOSS-TTS) and n_vq=16 (all other variants) without swapping weights.
The codec checkpoint path defaults to the value stored in vllm_config.model_config.hf_config.codec_model_name_or_path but can be overridden by setting the environment variable MOSS_TTS_CODEC_PATH.
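The override rule above can be sketched as a small lookup. This is a minimal illustration, not the module's actual code: the helper name `resolve_codec_path` and the choice to treat an empty variable as unset are assumptions; only the variable name MOSS_TTS_CODEC_PATH and the config default come from the text.

```python
import os
from typing import Mapping, Optional


def resolve_codec_path(
    config_default: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Return the codec checkpoint path, letting MOSS_TTS_CODEC_PATH win.

    `config_default` stands in for
    vllm_config.model_config.hf_config.codec_model_name_or_path.
    """
    env = os.environ if env is None else env
    override = env.get("MOSS_TTS_CODEC_PATH")
    # Assumption: an empty string counts as "not set".
    return override if override else config_default
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.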
enable_update_additional_information class-attribute instance-attribute ¶
enable_update_additional_information: bool = True
requires_raw_input_tokens class-attribute instance-attribute ¶
requires_raw_input_tokens: bool = True
compute_logits ¶
compute_logits(
hidden_states: Tensor | OmniOutput,
sampling_metadata: Any = None,
) -> None
forward ¶
forward(
input_ids: Tensor | None = None,
positions: Tensor | None = None,
intermediate_tensors: Any = None,
inputs_embeds: Tensor | None = None,
runtime_additional_information: list[dict[str, Any]] | None = None,
**kwargs: Any,
) -> OmniOutput
Decode audio VQ codes to waveform.
Stage 0 emits flat codebook-major [NQ * T_chunk] audio codes. The chunk transfer adapter assigns those to request.prompt_token_ids, which arrives here as input_ids. Per-request offsets are derived from the request metadata attached via runtime_additional_information; metadata such as left_context_size (overlap context for the causal decoder) also lives there.
Returns¶
OmniOutput with:
* multimodal_outputs["model_outputs"]: list of (T_wav,) float32 tensors
* multimodal_outputs["sr"]: list of scalar int32 tensors
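To make the flat codebook-major input layout concrete, the sketch below unpacks a flat list of NQ * T codes into T frames of NQ codes. It assumes "codebook-major" means all frames of codebook 0 come first, then all frames of codebook 1, and so on; the function name is hypothetical and plain lists stand in for tensors.

```python
def codebook_major_to_frames(flat: list, nq: int) -> list:
    """Turn a flat codebook-major sequence [NQ * T] into T rows of NQ codes."""
    if nq <= 0 or len(flat) % nq != 0:
        raise ValueError("length of flat must be a positive multiple of nq")
    t = len(flat) // nq
    # Assumed layout: flat[q * t + i] is the code of codebook q at frame i.
    return [[flat[q * t + i] for q in range(nq)] for i in range(t)]


# Two codebooks, three frames: codebook 0 holds 0,1,2 and codebook 1 holds 10,11,12.
frames = codebook_major_to_frames([0, 1, 2, 10, 11, 12], nq=2)
```

On a tensor the same reordering would presumably be a reshape to (NQ, T) followed by a transpose.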
MossTTSDelayTalkerForGeneration ¶
Bases: Module
Stage-0 talker for MossTTSDelayModel variants.
Covers all four repos that ship architectures: ["MossTTSDelayModel"]:
- MOSS-TTS (8B, n_vq=32)
- MOSS-TTSD-v1.0 (8B, n_vq=16)
- MOSS-SoundEffect (8B, n_vq=16)
- MOSS-VoiceGenerator (1.7B, n_vq=16)
Architecture
~~~~~~~~~~~~
* Backbone: Qwen3 transformer (hidden_size, num_hidden_layers, etc. from config.language_config).
* Embedding: text_embed(t) + Σᵢ audio_embed_i(aᵢ), i.e. additive fusion with no cross-attention or concatenation.
* Heads: (n_vq + 1) parallel linear heads over the final hidden state.
  - Head 0 → text logits (drives AR scheduler)
  - Heads 1…n_vq → audio VQ logits (one per RVQ codebook)
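The additive fusion in the Embedding item amounts to an element-wise sum of one text vector and n_vq audio vectors. A minimal sketch with plain Python lists in place of tensors follows; the function name and the toy vectors are illustrative assumptions, not part of the module.

```python
def fuse_embeddings(text_vec: list, audio_vecs: list) -> list:
    """Compute text_embed(t) + sum_i audio_embed_i(a_i) element-wise."""
    fused = list(text_vec)
    for vec in audio_vecs:
        if len(vec) != len(fused):
            raise ValueError("all embeddings must share hidden_size")
        fused = [a + b for a, b in zip(fused, vec)]
    return fused


# hidden_size = 2, n_vq = 2 (toy values)
fused = fuse_embeddings([1.0, 2.0], [[0.5, 0.5], [1.0, -1.0]])
```

Because the result keeps the backbone's hidden size regardless of n_vq, the same Qwen3 backbone can serve both the 32-codebook and 16-codebook variants.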
Delay pattern
~~~~~~~~~~~~~
Audio heads are only active after the model emits the delay-slot token (audio_assistant_delay_slot_token_id). Before the slot fires, all audio heads output a pad token (audio_pad_code). After the slot:
    step t:   collect audio_codebook_0 for frame t
    step t+1: collect audio_codebook_0 for frame t+1
              collect audio_codebook_1 for frame t (1-step lag)
    …
    step t+k: all k codebooks active; emit frame t
The per-request delay_step counter (stored in the per-request info dict) tracks this. Stage-1 receives the codes in (T, NQ) shape and the codec's batch_decode handles the de-interleaving internally.
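The staggered schedule can be illustrated with a pair of helpers that apply and undo the delay on a small grid. This is a sketch only: it assumes codebook q lags q steps (a generalisation of the 1-step lag shown for codebook 1), uses a placeholder pad value instead of the real audio_pad_code, and does not reproduce what batch_decode does internally.

```python
def apply_delay(frames: list, pad: int) -> list:
    """Shift codebook q down by q steps; empty slots are filled with pad."""
    if not frames:
        return []
    t, nq = len(frames), len(frames[0])
    return [
        [frames[s - q][q] if 0 <= s - q < t else pad for q in range(nq)]
        for s in range(t + nq - 1)
    ]


def undo_delay(delayed: list, nq: int) -> list:
    """Recover aligned (T, NQ) frames from the delayed grid."""
    t = len(delayed) - nq + 1
    return [[delayed[i + q][q] for q in range(nq)] for i in range(t)]


PAD = -1  # placeholder, not the real audio_pad_code
frames = [[1, 2, 3], [4, 5, 6]]  # T = 2 frames, NQ = 3 codebooks
delayed = apply_delay(frames, PAD)
```

Under this assumption a sequence of T frames occupies T + NQ - 1 decode steps, which is why the last codebooks keep emitting after codebook 0 has finished.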
audio_assistant_delay_slot_token_id instance-attribute ¶
audio_assistant_delay_slot_token_id: int = (
audio_assistant_delay_slot_token_id
)
audio_assistant_gen_slot_token_id instance-attribute ¶
audio_assistant_gen_slot_token_id: int = (
audio_assistant_gen_slot_token_id
)
audio_embeddings instance-attribute ¶
audio_embeddings = ModuleList(
[
(Embedding(audio_vocab_size + 1, hidden_size))
for _ in (range(n_vq))
]
)
audio_heads instance-attribute ¶
audio_heads = ModuleList(
[
(
Linear(
hidden_size,
audio_vocab_size + 1,
bias=False,
)
)
for _ in (range(n_vq))
]
)
gpu_resident_buffer_keys instance-attribute ¶
gpu_resident_buffer_keys: set[tuple[str, str]] = {
("audio_codes", "current"),
("audio_codes", "accumulated"),
("hidden_states", "last"),
}
im_end_token_id instance-attribute ¶
model instance-attribute ¶
text_lm_head instance-attribute ¶
text_lm_head = ParallelLMHead(
vocab_size,
hidden_size,
bias=False,
prefix=_maybe_prefix(prefix, "text_lm_head"),
)
compute_logits ¶
compute_logits(
hidden_states: Tensor | OmniOutput,
sampling_metadata: SamplingMetadata | None = None,
) -> Tensor | None
Return text-head logits with delay-pattern constraints applied.
The mask follows upstream MOSS-TTS' generate loop:
* Forced tokens override the sampler when delayed_lengths is in the audio-emit window (delay_slot for [0, n_vq), audio_end at n_vq).
* Outside that window, mask audio control tokens unless the model is currently emitting audio (is_audio).
* Mask delay_slot at step 0 and im_end during the first n_vq steps, matching upstream's anti-collapse heuristics.
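The forced-token rule in the first item can be sketched as follows. This is one reading of the docstring, not the actual implementation: the helper names, the use of None for "outside the window", the token ids, and forcing via -inf logits are all assumptions.

```python
import math
from typing import Optional


def forced_text_token(
    delayed_length: Optional[int], n_vq: int, delay_slot_id: int, audio_end_id: int
) -> Optional[int]:
    """Return the token to force, or None when the sampler is left free."""
    if delayed_length is None:  # assumed sentinel: not in the audio-emit window
        return None
    if 0 <= delayed_length < n_vq:
        return delay_slot_id
    if delayed_length == n_vq:
        return audio_end_id
    return None


def force_logits(logits: list, token_id: int) -> list:
    """Leave exactly one token selectable by masking the rest to -inf."""
    return [0.0 if i == token_id else -math.inf for i in range(len(logits))]


DELAY_SLOT, AUDIO_END = 5, 6  # hypothetical token ids
token = forced_text_token(3, n_vq=16, delay_slot_id=DELAY_SLOT, audio_end_id=AUDIO_END)
row = force_logits([0.1] * 8, token)
```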
forward ¶
forward(
input_ids: Tensor,
positions: Tensor,
intermediate_tensors: IntermediateTensors | None = None,
inputs_embeds: Tensor | None = None,
**_: Any,
) -> Tensor | IntermediateTensors
load_weights ¶
Map HF weight names to vLLM-Omni module names.
HF layout (MossTTSDelayModel):
* language_model.model. → model.
* language_model.lm_head.weight (if present)
* emb_ext.{i}.weight → audio_embeddings.{i}.weight
* lm_heads.0.weight → text_lm_head.weight
* lm_heads.{i+1}.weight → audio_heads.{i}.weight (i ≥ 0)
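The renaming rules above can be expressed as a pure string function. The sketch below is illustrative: the function name is hypothetical, and because the listing gives no target for language_model.lm_head.weight, the sketch returns None for it (and for any other unlisted name) rather than guessing.

```python
import re
from typing import Optional


def remap_delay_weight_name(name: str) -> Optional[str]:
    """Map an HF MossTTSDelayModel weight name to the layout listed above."""
    prefix = "language_model.model."
    if name.startswith(prefix):
        return "model." + name[len(prefix):]
    m = re.fullmatch(r"emb_ext\.(\d+)\.weight", name)
    if m:
        return f"audio_embeddings.{m.group(1)}.weight"
    m = re.fullmatch(r"lm_heads\.(\d+)\.weight", name)
    if m:
        idx = int(m.group(1))
        # Head 0 is the text head; heads 1..n_vq shift down by one.
        return "text_lm_head.weight" if idx == 0 else f"audio_heads.{idx - 1}.weight"
    return None  # no rule listed for this name
```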
make_omni_output ¶
make_omni_output(
model_outputs: Tensor | OmniOutput, **kwargs: Any
) -> OmniOutput
Sample audio codes per request and stash text-mask state.
Per-request state lives in info["audio_state"]. Audio codes are accumulated in info["audio_codes"]["accumulated"] (T_acc, NQ) and the most recent row is stored in info["audio_codes"]["current"] for the next preprocess step.
preprocess ¶
preprocess(
input_ids: Tensor,
input_embeds: Tensor | None,
**info_dict: Any,
) -> tuple[Tensor, Tensor, dict[str, Any]]
Build per-step input embeddings (text + audio additive fusion).
* Prefill: initialise the per-request state machine from the prompt.
* Decode: update the state with the just-sampled text token, then build the combined text+audio embedding using the previous step's codes.
MossTTSRealtimeTalkerForGeneration ¶
Bases: Module
Stage-0 talker for MossTTSRealtime (1.7B, TTFB ~180 ms).
Architecture differs from the delay model:
* Backbone (Qwen3) consumes embed_tokens[0](text) + Σᵢ embed_tokens[i+1](audio_i).
* The model does NOT have a text LM head: the text column at every decode step is forced to text_pad (or eos when the audio EOS token has been emitted), so we synthesise a deterministic logit row to feed the vLLM sampler.
* Per-step audio generation runs the small local_transformer (4-layer Qwen3-style decoder, rvq=16 codebooks) inside make_omni_output.
Stop condition: codebook-0 token equals audio_eos_token (1026).
audio_eos_id instance-attribute ¶
gpu_resident_buffer_keys instance-attribute ¶
gpu_resident_buffer_keys: set[tuple[str, str]] = {
("audio_codes", "current"),
("audio_codes", "accumulated"),
("hidden_states", "last"),
}
local_lm_heads instance-attribute ¶
local_lm_heads = ModuleList(
[
(
Linear(
int(hidden_size),
audio_vocab_size,
bias=False,
)
)
for _ in (range(n_vq))
]
)
local_transformer instance-attribute ¶
local_transformer = MossTTSRealtimeLocalTransformer(
local_cfg
)
model instance-attribute ¶
compute_logits ¶
compute_logits(
hidden_states: Tensor | OmniOutput,
sampling_metadata: SamplingMetadata | None = None,
) -> Tensor | None
Synthesise a one-hot text logit row per request.
The realtime model has no text LM head — text is always either text_pad (continue) or eos (stop because audio EOS just fired).
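A deterministic logit row of this kind can be sketched as a vector that is -inf everywhere except at the forced token, so any sampler must pick that token. The helper names, the token ids, and the -inf/0.0 encoding are assumptions for illustration; the real method may build the row differently.

```python
import math


def one_hot_logit_row(vocab_size: int, token_id: int) -> list:
    """Logits that make token_id the only selectable token."""
    if not 0 <= token_id < vocab_size:
        raise ValueError("token_id out of range")
    row = [-math.inf] * vocab_size
    row[token_id] = 0.0
    return row


def realtime_text_row(
    vocab_size: int, audio_eos_fired: bool, text_pad_id: int, eos_id: int
) -> list:
    """Choose eos once audio EOS has fired, otherwise text_pad."""
    return one_hot_logit_row(vocab_size, eos_id if audio_eos_fired else text_pad_id)


TEXT_PAD, EOS = 2, 7  # hypothetical token ids
cont_row = realtime_text_row(10, audio_eos_fired=False, text_pad_id=TEXT_PAD, eos_id=EOS)
stop_row = realtime_text_row(10, audio_eos_fired=True, text_pad_id=TEXT_PAD, eos_id=EOS)
```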
forward ¶
forward(
input_ids: Tensor,
positions: Tensor,
intermediate_tensors: IntermediateTensors | None = None,
inputs_embeds: Tensor | None = None,
**_: Any,
) -> Tensor | IntermediateTensors
load_weights ¶
Remap upstream MossTTSRealtime checkpoint names → vendored layout.
Mapping:
* embed_tokens.{i}. → embed_tokens.{i}. (kept)
* language_model.embed_tokens. → model.embed_tokens. (Qwen3 inner)
* language_model.layers. → model.layers.
* language_model.norm. → model.norm.
* local_transformer.model.embed_tokens. → local_transformer.model.codec_embedding.
* local_transformer.model. → local_transformer.model. (shared body, kept)
* local_transformer.local_lm_heads. → local_lm_heads. (top-level)
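The prefix mapping above can be sketched as an ordered rule table. The function and table names are hypothetical; the sketch assumes first-match-wins prefix replacement, so the more specific local_transformer.model.embed_tokens. rule is handled before names that fall through as "kept".

```python
# (old prefix, new prefix); only rules that change the name are listed.
_REALTIME_PREFIX_RULES = [
    ("language_model.embed_tokens.", "model.embed_tokens."),
    ("language_model.layers.", "model.layers."),
    ("language_model.norm.", "model.norm."),
    ("local_transformer.model.embed_tokens.", "local_transformer.model.codec_embedding."),
    ("local_transformer.local_lm_heads.", "local_lm_heads."),
]


def remap_realtime_weight_name(name: str) -> str:
    """Rename an upstream MossTTSRealtime weight; unmatched names are kept."""
    for old, new in _REALTIME_PREFIX_RULES:
        if name.startswith(old):
            return new + name[len(old):]
    # embed_tokens.{i}. and the rest of local_transformer.model. stay as-is.
    return name
```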
make_omni_output ¶
make_omni_output(
model_outputs: Tensor | OmniOutput, **kwargs: Any
) -> OmniOutput