vllm_omni.model_executor.models.moss_tts

Modules:

Name                      Description
audio_tokenizer           MOSS Audio Tokenizer: inference-only codec (encode waveform ↔ RVQ codes).
configuration_moss_tts    MOSS-TTS model configuration.
modeling_moss_tts_codec   MOSS-TTS Stage-1 codec decoder: RVQ codes → 24 kHz waveform.
modeling_moss_tts_local   Local depth transformer for MossTTSRealtime.
modeling_moss_tts_talker  MOSS-TTS Stage-0 talker: Qwen3 backbone + (n_vq+1) parallel AR heads.
pipeline                  Pipeline topology for all MOSS-TTS variants (2-stage: talker → codec).
reference_encoder         Reference-audio encoding + speaker cache for the MOSS-TTS-family talker.
MossTTSCodecDecoder

Bases: Module

Stage-1 decoder for all MOSS-TTS variants.

Consumes (T, NQ) audio VQ codes emitted by Stage 0 and decodes them to a 24 kHz mono waveform using the vendored MossAudioTokenizerModel.

All five variants share the same codec checkpoint OpenMOSS-Team/MOSS-Audio-Tokenizer. The number of quantizers (n_vq) is passed as a num_quantizers argument to batch_decode so one codec instance handles both n_vq=32 (MOSS-TTS) and n_vq=16 (all other variants) without swapping weights.
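To illustrate why a single codec can serve both quantizer counts, here is a toy sketch of residual vector quantization (RVQ) decoding. It assumes the standard RVQ scheme in which a frame's latent is the sum of the vectors selected from each codebook, so decoding with fewer quantizers just stops the sum early. The function name, the tiny codebooks, and the list-based layout are hypothetical; this is not the vendored batch_decode.

```python
def rvq_decode_frame(codes, codebooks, num_quantizers):
    """Sum the codebook vectors selected by the first `num_quantizers` codes.

    Toy illustration only: `codebooks[q][code]` is a plain list of floats,
    not the real codec's tensors.
    """
    dim = len(codebooks[0][0])
    latent = [0.0] * dim
    for q in range(num_quantizers):
        vector = codebooks[q][codes[q]]
        latent = [acc + v for acc, v in zip(latent, vector)]
    return latent


# Hypothetical toy codec: 4 codebooks, 2 entries each, 2-dimensional latents.
toy_codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.0], [0.0, 0.5]],
    [[0.25, 0.0], [0.0, 0.25]],
    [[0.125, 0.0], [0.0, 0.125]],
]
frame_codes = [0, 1, 0, 1]

# Same weights, two different quantizer counts.
coarse = rvq_decode_frame(frame_codes, toy_codebooks, num_quantizers=2)
full = rvq_decode_frame(frame_codes, toy_codebooks, num_quantizers=4)
```

In this toy, `coarse` uses only the first two codebooks and `full` uses all four, which mirrors how one checkpoint could handle n_vq=16 and n_vq=32 without swapping weights.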

The codec checkpoint path defaults to the value stored in vllm_config.model_config.hf_config.codec_model_name_or_path but can be overridden by setting the environment variable MOSS_TTS_CODEC_PATH.
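The override described above can be sketched as follows. The helper name resolve_codec_path is hypothetical, and treating an empty environment variable as unset is an assumption rather than documented behaviour.

```python
import os


def resolve_codec_path(config_default: str) -> str:
    """Return MOSS_TTS_CODEC_PATH if it is set and non-empty, else the config value.

    Assumption: an empty string in the environment counts as "not set".
    """
    return os.environ.get("MOSS_TTS_CODEC_PATH") or config_default


# The default would normally come from
# vllm_config.model_config.hf_config.codec_model_name_or_path.
codec_path = resolve_codec_path("OpenMOSS-Team/MOSS-Audio-Tokenizer")
```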

enable_update_additional_information class-attribute instance-attribute

enable_update_additional_information: bool = True

has_postprocess class-attribute instance-attribute

has_postprocess: bool = False

has_preprocess class-attribute instance-attribute

has_preprocess: bool = False

have_multimodal_outputs class-attribute instance-attribute

have_multimodal_outputs: bool = True

input_modalities class-attribute instance-attribute

input_modalities = 'audio'

requires_raw_input_tokens class-attribute instance-attribute

requires_raw_input_tokens: bool = True

vllm_config instance-attribute

vllm_config = vllm_config

compute_logits

compute_logits(
    hidden_states: Tensor | OmniOutput,
    sampling_metadata: Any = None,
) -> None

embed_input_ids

embed_input_ids(input_ids: Tensor, **_: Any) -> Tensor

forward

forward(
    input_ids: Tensor | None = None,
    positions: Tensor | None = None,
    intermediate_tensors: Any = None,
    inputs_embeds: Tensor | None = None,
    runtime_additional_information: list[dict[str, Any]]
    | None = None,
    **kwargs: Any,
) -> OmniOutput

Decode audio VQ codes to waveform.

Stage 0 emits audio codes as a flat, codebook-major sequence of length NQ * T_chunk. The chunk transfer adapter assigns those codes to request.prompt_token_ids, which arrives here as input_ids. Per-request offsets are derived from the request metadata attached to runtime_additional_information, which also carries fields such as left_context_size (the overlap context for the causal decoder).
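As a rough illustration of the codebook-major layout, the sketch below regroups a flat sequence into per-frame rows of shape (T, NQ). It uses plain Python lists and a hypothetical helper name; it assumes "codebook-major" means all T codes of codebook 0 come first, then codebook 1, and so on. With tensors the same regrouping would typically be a reshape to (NQ, T) followed by a transpose.

```python
def unflatten_codebook_major(flat_codes, num_quantizers):
    """Regroup a flat codebook-major sequence into T rows of NQ codes each."""
    if len(flat_codes) % num_quantizers != 0:
        raise ValueError("flat length must be a multiple of num_quantizers")
    num_frames = len(flat_codes) // num_quantizers
    return [
        # Code of codebook q for frame t sits at offset q * T + t.
        [flat_codes[q * num_frames + t] for q in range(num_quantizers)]
        for t in range(num_frames)
    ]


# NQ = 2, T = 3: codebook 0 holds 10, 11, 12 and codebook 1 holds 20, 21, 22.
flat = [10, 11, 12, 20, 21, 22]
frames = unflatten_codebook_major(flat, num_quantizers=2)
# frames == [[10, 20], [11, 21], [12, 22]]
```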

Returns

OmniOutput with:

* multimodal_outputs["model_outputs"]: list of (T_wav,) float32 tensors
* multimodal_outputs["sr"]: list of scalar int32 tensors

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Drain the Stage-0 weights iterator, then load the codec from its own checkpoint.

The codec lives in a separate HuggingFace repo (OpenMOSS-Team/MOSS-Audio-Tokenizer) and is loaded independently of the talker weights.

MossTTSDelayTalkerForGeneration

Bases: Module

Stage-0 talker for MossTTSDelayModel variants.

Covers all four repos that ship architectures: ["MossTTSDelayModel"]:

- MOSS-TTS (8B, n_vq=32)
- MOSS-TTSD-v1.0 (8B, n_vq=16)
- MOSS-SoundEffect (8B, n_vq=16)
- MOSS-VoiceGenerator (1.7B, n_vq=16)

Architecture
~~~~~~~~~~~~

* Backbone: Qwen3 transformer (hidden_size, num_hidden_layers, etc. from config.language_config).
* Embedding: text_embed(t) + Σᵢ audio_embed_i(aᵢ), i.e. additive fusion, no cross-attention or concatenation.
* Heads: (n_vq + 1) parallel linear heads over the final hidden state.
  - Head 0 → text logits (drives AR scheduler)
  - Heads 1…n_vq → audio VQ logits (one per RVQ codebook)
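The additive fusion in the Embedding item can be sketched with plain lists standing in for embedding tables. The function name and the toy tables are hypothetical; the real module uses the text embedding of the Qwen3 backbone and the audio_embeddings ModuleList.

```python
def fuse_embeddings(text_id, audio_codes, text_table, audio_tables):
    """Return text_table[text_id] plus one looked-up vector per audio codebook."""
    fused = list(text_table[text_id])
    for table, code in zip(audio_tables, audio_codes):
        # Element-wise sum: every codebook contributes to the same hidden vector.
        fused = [acc + v for acc, v in zip(fused, table[code])]
    return fused


# Toy tables with hidden size 2 and two audio codebooks (n_vq = 2).
text_table = [[1.0, 2.0], [3.0, 4.0]]
audio_tables = [
    [[0.5, 0.5], [0.0, 0.0]],
    [[0.25, 0.0], [0.0, 0.25]],
]

step_embedding = fuse_embeddings(1, [0, 1], text_table, audio_tables)
# step_embedding == [3.5, 4.75]
```

Because the vectors are summed rather than concatenated, the backbone input keeps the same hidden size regardless of n_vq.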

Delay pattern
~~~~~~~~~~~~~

Audio heads are only active after the model emits the delay-slot token (audio_assistant_delay_slot_token_id). Before the slot fires all audio heads output a pad token (audio_pad_code). After the slot:

step  t:     collect audio_codebook_0  for frame t
step t+1:    collect audio_codebook_0  for frame t+1
             collect audio_codebook_1  for frame t     (1-step lag)
…
step t+k:    all k codebooks active; emit frame t

The per-request delay_step counter (stored in the per-request info dict) tracks this. Stage-1 receives the codes in (T, NQ) shape and the codec's batch_decode handles the de-interleaving internally.
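The staggering described above, and its inverse, can be sketched generically as follows. This assumes codebook q lags codebook 0 by exactly q steps, as in the diagram; the PAD value and both helper names are hypothetical, and the vendored code may organise the shift differently.

```python
PAD = -1  # hypothetical stand-in for audio_pad_code


def apply_delay(frames, pad=PAD):
    """Turn (T, NQ) frames into a (T + NQ - 1, NQ) grid with codebook q shifted by q steps."""
    num_frames = len(frames)
    num_q = len(frames[0])
    grid = [[pad] * num_q for _ in range(num_frames + num_q - 1)]
    for t, frame in enumerate(frames):
        for q, code in enumerate(frame):
            grid[t + q][q] = code  # codebook q for frame t is emitted at step t + q
    return grid


def undo_delay(grid):
    """Inverse of apply_delay: realign the staggered grid into (T, NQ) frames."""
    num_q = len(grid[0])
    num_frames = len(grid) - num_q + 1
    return [[grid[t + q][q] for q in range(num_q)] for t in range(num_frames)]


frames = [[1, 2, 3], [4, 5, 6]]  # T = 2 frames, NQ = 3 codebooks
delayed = apply_delay(frames)
# step 0: [ 1, -1, -1]
# step 1: [ 4,  2, -1]
# step 2: [-1,  5,  3]
# step 3: [-1, -1,  6]
restored = undo_delay(delayed)
```

A frame is complete only once its last codebook has been emitted, which is why the grid needs NQ - 1 extra steps beyond the T frames.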

audio_assistant_delay_slot_token_id instance-attribute

audio_assistant_delay_slot_token_id: int = (
    audio_assistant_delay_slot_token_id
)

audio_assistant_gen_slot_token_id instance-attribute

audio_assistant_gen_slot_token_id: int = (
    audio_assistant_gen_slot_token_id
)

audio_embeddings instance-attribute

audio_embeddings = ModuleList(
    [
        Embedding(audio_vocab_size + 1, hidden_size)
        for _ in range(n_vq)
    ]
)

audio_end_token_id instance-attribute

audio_end_token_id: int = audio_end_token_id

audio_heads instance-attribute

audio_heads = ModuleList(
    [
        Linear(
            hidden_size,
            audio_vocab_size + 1,
            bias=False,
        )
        for _ in range(n_vq)
    ]
)

audio_pad_code instance-attribute

audio_pad_code: int = audio_pad_code

audio_start_token_id instance-attribute

audio_start_token_id: int = audio_start_token_id

audio_vocab_size instance-attribute

audio_vocab_size: int = audio_vocab_size

config instance-attribute

config: MossTTSDelayConfig = hf_config

gpu_resident_buffer_keys instance-attribute

gpu_resident_buffer_keys: set[tuple[str, str]] = {
    ("audio_codes", "current"),
    ("audio_codes", "accumulated"),
    ("hidden_states", "last"),
}

has_postprocess class-attribute instance-attribute

has_postprocess: bool = True

has_preprocess class-attribute instance-attribute

has_preprocess: bool = True

have_multimodal_outputs class-attribute instance-attribute

have_multimodal_outputs: bool = True

hidden_size instance-attribute

hidden_size = hidden_size

im_end_token_id instance-attribute

im_end_token_id: int = getattr(
    config, "im_end_token_id", 151645
)

logits_processor instance-attribute

logits_processor = LogitsProcessor(vocab_size)

model instance-attribute

model = Qwen3Model(
    vllm_config=vllm_config,
    prefix=_maybe_prefix(prefix, "model"),
)

n_vq instance-attribute

n_vq: int = n_vq

pad_token_id instance-attribute

pad_token_id: int = getattr(config, 'pad_token_id', 151643)

text_lm_head instance-attribute

text_lm_head = ParallelLMHead(
    vocab_size,
    hidden_size,
    bias=False,
    prefix=_maybe_prefix(prefix, "text_lm_head"),
)

vllm_config instance-attribute

vllm_config = vllm_config

compute_logits

compute_logits(
    hidden_states: Tensor | OmniOutput,
    sampling_metadata: SamplingMetadata | None = None,
) -> Tensor | None

Return text-head logits with delay-pattern constraints applied.

The mask follows upstream MOSS-TTS' generate loop:

* Forced tokens override the sampler when delayed_lengths is in the audio-emit window (delay_slot for [0, n_vq), audio_end at n_vq).
* Outside that window, mask audio control tokens unless the model is currently emitting audio (is_audio).
* Mask delay_slot at step 0 and im_end during the first n_vq steps, matching upstream's anti-collapse heuristics.
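The forced-token rule in the first item can be sketched as a small pure function. Only the window boundaries come from the description above; the function name, the token ids, and the convention that any value outside [0, n_vq] means "no forcing" are assumptions for illustration.

```python
def forced_text_token(delayed_length, n_vq, delay_slot_id, audio_end_id):
    """Return the token the text head is forced to emit, or None if the sampler is free."""
    if 0 <= delayed_length < n_vq:
        return delay_slot_id  # still flushing the staggered codebooks
    if delayed_length == n_vq:
        return audio_end_id  # every codebook has caught up: close the audio span
    return None  # assumption: values outside the window mean "not forced"


# Hypothetical ids; the real ones come from the model config.
DELAY_SLOT, AUDIO_END = 9001, 9002
schedule = [forced_text_token(step, 4, DELAY_SLOT, AUDIO_END) for step in range(6)]
# schedule == [9001, 9001, 9001, 9001, 9002, None]
```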

embed_input_ids

embed_input_ids(input_ids: Tensor, **_: Any) -> Tensor

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    **_: Any,
) -> Tensor | IntermediateTensors

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Map HF weight names to vLLM-Omni module names.

HF layout (MossTTSDelayModel):

language_model.model.           → model.
language_model.lm_head.weight   (if present)
emb_ext.{i}.weight              → audio_embeddings.{i}.weight
lm_heads.0.weight               → text_lm_head.weight
lm_heads.{i+1}.weight           → audio_heads.{i}.weight  (i ≥ 0)
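A minimal sketch of the name mapping above, written as a standalone function. The function name is hypothetical, and returning None for names the mapping does not cover (including language_model.lm_head.weight, whose target is not stated) is an assumption, not the documented behaviour of load_weights.

```python
import re


def remap_delay_weight_name(hf_name):
    """Translate a MossTTSDelayModel checkpoint name to the module layout listed above."""
    backbone_prefix = "language_model.model."
    if hf_name.startswith(backbone_prefix):
        return "model." + hf_name[len(backbone_prefix):]

    match = re.fullmatch(r"emb_ext\.(\d+)\.weight", hf_name)
    if match:
        return f"audio_embeddings.{match.group(1)}.weight"

    match = re.fullmatch(r"lm_heads\.(\d+)\.weight", hf_name)
    if match:
        index = int(match.group(1))
        if index == 0:
            return "text_lm_head.weight"
        # Head i+1 in the checkpoint is audio head i in the module.
        return f"audio_heads.{index - 1}.weight"

    return None  # assumption: names not covered by the mapping are left to the caller


example = remap_delay_weight_name("lm_heads.3.weight")  # "audio_heads.2.weight"
```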

make_omni_output

make_omni_output(
    model_outputs: Tensor | OmniOutput, **kwargs: Any
) -> OmniOutput

Sample audio codes per request and stash text-mask state.

Per-request state lives in info["audio_state"]. Audio codes are accumulated in info["audio_codes"]["accumulated"] (T_acc, NQ) and the most recent row is stored in info["audio_codes"]["current"] for the next preprocess step.

postprocess

postprocess(
    hidden_states: Tensor, **_: Any
) -> dict[str, Any]

preprocess

preprocess(
    input_ids: Tensor,
    input_embeds: Tensor | None,
    **info_dict: Any,
) -> tuple[Tensor, Tensor, dict[str, Any]]

Build per-step input embeddings (text + audio additive fusion).

* Prefill: initialise the per-request state machine from the prompt.
* Decode: update the state with the just-sampled text token, then build the combined text+audio embedding using the previous step's codes.

MossTTSRealtimeTalkerForGeneration

Bases: Module

Stage-0 talker for MossTTSRealtime (1.7B, TTFB ~180 ms).

Architecture differs from the delay model:

* Backbone (Qwen3) consumes embed_tokens[0](text) + Σᵢ embed_tokens[i+1](audio_i).
* The model does NOT have a text LM head. The text column at every decode step is forced to text_pad (or eos when the audio EOS token has been emitted), so we synthesise a deterministic logit row to feed the vLLM sampler.
* Per-step audio generation runs the small local_transformer (4-layer Qwen3-style decoder, rvq=16 codebooks) inside make_omni_output. Stop condition: codebook-0 token equals audio_eos_token (1026).

AUDIO_BOS class-attribute instance-attribute

AUDIO_BOS = 1025

AUDIO_EOS class-attribute instance-attribute

AUDIO_EOS = 1026

audio_eos_id instance-attribute

audio_eos_id: int = int(
    getattr(config, "eos_token_id", 151645)
)

audio_pad_token instance-attribute

audio_pad_token: int = int(audio_pad_token)

audio_vocab_size instance-attribute

audio_vocab_size: int = int(audio_vocab_size)

config instance-attribute

config: MossTTSRealtimeConfig = hf_config

embed_tokens instance-attribute

embed_tokens = ModuleList()

gpu_resident_buffer_keys instance-attribute

gpu_resident_buffer_keys: set[tuple[str, str]] = {
    ("audio_codes", "current"),
    ("audio_codes", "accumulated"),
    ("hidden_states", "last"),
}

has_postprocess class-attribute instance-attribute

has_postprocess: bool = True

has_preprocess class-attribute instance-attribute

has_preprocess: bool = True

have_multimodal_outputs class-attribute instance-attribute

have_multimodal_outputs: bool = True

hidden_size instance-attribute

hidden_size: int = int(hidden_size)

local_lm_heads instance-attribute

local_lm_heads = ModuleList(
    [
        Linear(
            int(hidden_size),
            audio_vocab_size,
            bias=False,
        )
        for _ in range(n_vq)
    ]
)

local_transformer instance-attribute

local_transformer = MossTTSRealtimeLocalTransformer(
    local_cfg
)

logits_processor instance-attribute

logits_processor = LogitsProcessor(text_vocab_size)

model instance-attribute

model = Qwen3Model(
    vllm_config=backbone_vllm_config,
    prefix=_maybe_prefix(prefix, "model"),
)

n_vq instance-attribute

n_vq: int = int(rvq)

text_pad_id instance-attribute

text_pad_id: int = int(text_pad)

text_vocab_size instance-attribute

text_vocab_size: int = int(vocab_size)

vllm_config instance-attribute

vllm_config = vllm_config

compute_logits

compute_logits(
    hidden_states: Tensor | OmniOutput,
    sampling_metadata: SamplingMetadata | None = None,
) -> Tensor | None

Synthesise a one-hot text logit row per request.

The realtime model has no text LM head — text is always either text_pad (continue) or eos (stop because audio EOS just fired).
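One way to picture the synthesised row is a vector that is negative infinity everywhere except at the forced token, so any sampler can only pick that token. The sketch below uses plain lists; the helper names, the toy vocabulary size, and the choice of -inf and 0.0 as fill values are assumptions, not the values used by the actual implementation.

```python
NEG_INF = float("-inf")


def one_hot_logits(vocab_size, token_id):
    """Build a logit row whose only finite entry is at token_id."""
    row = [NEG_INF] * vocab_size
    row[token_id] = 0.0
    return row


def realtime_text_logits(vocab_size, text_pad_id, eos_id, audio_eos_fired):
    """Force eos once audio EOS has fired, otherwise force text_pad."""
    target = eos_id if audio_eos_fired else text_pad_id
    return one_hot_logits(vocab_size, target)


# Toy vocabulary of 8 tokens with hypothetical ids for pad and eos.
continuing = realtime_text_logits(8, text_pad_id=3, eos_id=7, audio_eos_fired=False)
stopping = realtime_text_logits(8, text_pad_id=3, eos_id=7, audio_eos_fired=True)
```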

embed_input_ids

embed_input_ids(input_ids: Tensor, **_: Any) -> Tensor

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    **_: Any,
) -> Tensor | IntermediateTensors

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Remap upstream MossTTSRealtime checkpoint names → vendored layout.

Mapping

embed_tokens.{i}.                       → embed_tokens.{i}.  (kept)
language_model.embed_tokens.            → model.embed_tokens.  (Qwen3 inner)
language_model.layers.                  → model.layers.
language_model.norm.                    → model.norm.
local_transformer.model.embed_tokens.   → local_transformer.model.codec_embedding.
local_transformer.model.                → local_transformer.model.  (shared body, kept)
local_transformer.local_lm_heads.       → local_lm_heads.  (top-level)
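The mapping above amounts to a short list of prefix rewrites, sketched below. The rule table and function name are hypothetical. Names marked "kept" fall through unchanged, and the more specific local_transformer.model.embed_tokens. rule has to be checked before the name is treated as part of the kept local_transformer.model. body.

```python
# (old prefix, new prefix) pairs; only the prefixes that actually change are listed.
PREFIX_RULES = [
    ("language_model.embed_tokens.", "model.embed_tokens."),
    ("language_model.layers.", "model.layers."),
    ("language_model.norm.", "model.norm."),
    ("local_transformer.model.embed_tokens.", "local_transformer.model.codec_embedding."),
    ("local_transformer.local_lm_heads.", "local_lm_heads."),
]


def remap_realtime_weight_name(name):
    """Apply the first matching prefix rewrite; return the name unchanged otherwise."""
    for old_prefix, new_prefix in PREFIX_RULES:
        if name.startswith(old_prefix):
            return new_prefix + name[len(old_prefix):]
    return name  # "kept" entries such as embed_tokens.{i}. and local_transformer.model.


renamed = remap_realtime_weight_name("local_transformer.model.embed_tokens.2.weight")
# renamed == "local_transformer.model.codec_embedding.2.weight"
```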

make_omni_output

make_omni_output(
    model_outputs: Tensor | OmniOutput, **kwargs: Any
) -> OmniOutput

postprocess

postprocess(
    hidden_states: Tensor, **_: Any
) -> dict[str, Any]

preprocess

preprocess(
    input_ids: Tensor,
    input_embeds: Tensor | None,
    **info_dict: Any,
) -> tuple[Tensor, Tensor, dict[str, Any]]