vllm_omni.model_executor.models.glm_tts

GLM-TTS model support for vLLM-Omni.

GLM-TTS is a two-stage text-to-speech system:
  • Stage 0 (LLM): a Llama-based autoregressive (AR) model generates speech tokens from text.
  • Stage 1 (DiT): a flow-matching model converts speech tokens to a mel-spectrogram.

Reference: https://github.com/zai-org/GLM-TTS

Modules:

Name Description
glm_tts

GLM-TTS AR Model (Stage 0): Text → Speech Tokens.

glm_tts_dit

GLM-TTS DiT (Diffusion Transformer) Model.

glm_tts_dit_wrapper

GLM-TTS DiT wrapper: LLM_GENERATION-compatible flow-matching + vocoder.

pipeline

GLM-TTS pipeline: Stage 0 (AR) → Stage 1 (DiT).

text_frontend

Lightweight text frontend aligned with the official GLM-TTS preprocessing.

vocoder

Vocoder loading and mel-to-audio conversion for GLM-TTS.

voice_clone

GLM-TTS voice cloning frontend: speech tokenizer, speaker embedding, mel features.

GLMTTSForConditionalGeneration

Bases: Module, SupportsMultiModal

vLLM model for GLM-TTS.

Handles both stages by branching on model_stage:
  • glm_tts (Stage 0): Text → Speech tokens (LLM AR, Llama backbone).
  • glm_tts_dit (Stage 1): Speech tokens → Audio (DiT flow-matching + vocoder).

Attributes:

Name Type Description
have_multimodal_outputs

Signals the scheduler to collect multimodal outputs.

has_preprocess

The model has a preprocess hook for input preparation (stage 0 only).

has_postprocess

The model has a postprocess hook for hidden state caching (stage 0 only).

allow_patterns_overrides instance-attribute

allow_patterns_overrides = ['flow/flow.pt']

config instance-attribute

config = config

enable_update_additional_information instance-attribute

enable_update_additional_information = True

fall_back_to_pt_during_load instance-attribute

fall_back_to_pt_during_load = False

gpu_resident_buffer_keys instance-attribute

gpu_resident_buffer_keys: set[tuple[str, str]] = {
    ("last_hidden", "last")
}

has_postprocess instance-attribute

has_postprocess = True

has_preprocess instance-attribute

has_preprocess = True

have_multimodal_outputs instance-attribute

have_multimodal_outputs = True

hf_to_vllm_mapper class-attribute instance-attribute

hf_to_vllm_mapper = WeightsMapper(
    orig_to_new_prefix={
        "llama_embedding.": "model.embed_tokens.",
        "llama.model.": "model.",
        "llama.": "model.",
    }
)
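The prefix table above can be read as a rename rule for checkpoint tensor names. The following is a minimal pure-Python sketch of that idea, not the actual WeightsMapper implementation: `remap_name` is a hypothetical helper, the first matching prefix wins (an assumption), and the example tensor names are made up for illustration.

```python
# Same prefix table as hf_to_vllm_mapper above; dict order is preserved.
ORIG_TO_NEW_PREFIX = {
    "llama_embedding.": "model.embed_tokens.",
    "llama.model.": "model.",
    "llama.": "model.",
}


def remap_name(name, prefix_map=ORIG_TO_NEW_PREFIX):
    """Rewrite a checkpoint tensor name using the first matching prefix.

    Hypothetical helper: the real mapper may apply its rules differently.
    """
    for old, new in prefix_map.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name  # names without a known prefix pass through unchanged


# Illustrative (made-up) checkpoint names.
for original in (
    "llama_embedding.weight",
    "llama.model.layers.0.self_attn.q_proj.weight",
    "llama.norm.weight",
    "lm_head.weight",
):
    print(original, "->", remap_name(original))
```

In this sketch the order of the entries matters: if "llama." were tried before "llama.model.", a name such as "llama.model.layers.0..." would become "model.model.layers.0...".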

lm_head instance-attribute

lm_head = ParallelLMHead(
    vocab_size,
    hidden_size,
    quant_config=quant_config,
    prefix=maybe_prefix(prefix, "lm_head"),
)

logits_processor instance-attribute

logits_processor = LogitsProcessor(vocab_size)

make_empty_intermediate_tensors instance-attribute

make_empty_intermediate_tensors = (
    make_empty_intermediate_tensors
)

model instance-attribute

model = LlamaModel(
    vllm_config=vllm_config,
    prefix=maybe_prefix(prefix, "model"),
)

model_dir instance-attribute

model_dir = resolve_glm_tts_model_dir(
    model_dir,
    tokenizer_path=getattr(model_config, "tokenizer", None),
)

model_path instance-attribute

model_path = model

model_stage instance-attribute

model_stage = getattr(
    model_config, "model_stage", "glm_tts"
)
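The getattr call above means a config without a model_stage field falls back to Stage 0. A small sketch of that default, using a stand-in config object: `resolve_stage` and the validation against the two known stage names are assumptions for illustration, not part of this class.

```python
from types import SimpleNamespace

KNOWN_STAGES = ("glm_tts", "glm_tts_dit")


def resolve_stage(model_config):
    """Return the configured stage, defaulting to Stage 0 ("glm_tts").

    Hypothetical helper; the validation step is an assumption.
    """
    stage = getattr(model_config, "model_stage", "glm_tts")
    if stage not in KNOWN_STAGES:
        raise ValueError(f"unknown model_stage: {stage!r}")
    return stage


# A config with no model_stage attribute selects the AR stage.
print(resolve_stage(SimpleNamespace()))
# An explicit value selects the DiT stage.
print(resolve_stage(SimpleNamespace(model_stage="glm_tts_dit")))
```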

prefer_model_sampler class-attribute instance-attribute

prefer_model_sampler = True

requires_raw_input_tokens class-attribute instance-attribute

requires_raw_input_tokens = True

supports_multimodal class-attribute instance-attribute

supports_multimodal = True

supports_multimodal_raw_input_only class-attribute instance-attribute

supports_multimodal_raw_input_only = True

vllm_config instance-attribute

vllm_config = vllm_config

compute_logits

compute_logits(
    hidden_states: Tensor | OmniOutput,
    sampling_metadata: Any = None,
) -> Tensor | None

embed_input_ids

embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings: Any | None = None,
    is_multimodal: Any | None = None,
    **kwargs: Any,
) -> Tensor

embed_multimodal

embed_multimodal(**kwargs: Any) -> list[Tensor] | None

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    **kwargs: Any,
) -> Tensor | IntermediateTensors | OmniOutput

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights from checkpoint.

  • Stage 0 (glm_tts): Hugging Face Llama-format checkpoint from the llm/ subdir.
  • Stage 1 (glm_tts_dit): DiT flow.pt + vocoder.

make_omni_output

make_omni_output(
    model_outputs: Tensor | OmniOutput, **kwargs: Any
) -> OmniOutput

Package hidden states, speech tokens, and voice clone data into OmniOutput.

Streaming contract: delta. Each decode step emits exactly one speech token (or a prefill placeholder). The engine's output processor concatenates per-step deltas into the final tensor.
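To make the delta contract concrete, here is a minimal sketch of how a consumer might assemble the final token sequence from per-step outputs. It uses plain Python lists instead of tensors, `accumulate_deltas` is a hypothetical name, and modeling the prefill placeholder as an empty delta is an assumption; the real placeholder representation is not specified here.

```python
def accumulate_deltas(deltas):
    """Concatenate per-step speech-token deltas into one sequence.

    Each delta holds only the tokens produced by that step, so plain
    concatenation reconstructs the full output without duplicates.
    """
    tokens = []
    for delta in deltas:
        tokens.extend(delta)  # an empty delta (assumed prefill) adds nothing
    return tokens


# One assumed prefill placeholder followed by three decode steps,
# each emitting exactly one (made-up) speech token id.
steps = [[], [512], [87], [1033]]
print(accumulate_deltas(steps))
```

If each step instead returned the cumulative sequence so far, the same concatenation would repeat earlier tokens, which is why the contract is stated explicitly.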

postprocess

postprocess(
    hidden_states: Tensor, **kwargs: Any
) -> dict[str, Any]

Cache last hidden state for next decode step.

preprocess

preprocess(
    input_ids: Tensor,
    input_embeds: Tensor | None,
    **info_dict: Any,
) -> tuple[Tensor, Tensor, dict[str, Any]]

Prepare inputs for GLM-TTS AR model.

GLM-TTS only supports the multimodal processor path: text prompt + multi_modal_data["audio"] + mm_processor_kwargs["prompt_text"]. Legacy placeholder prompts via additional_information are rejected.
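The three required pieces named above can be pictured as a single prompt dictionary. The sketch below is illustrative only: `check_glm_tts_request` is a hypothetical validator, the (waveform, sample_rate) audio tuple and the reading of prompt_text as the transcript of the reference audio are assumptions, and placing additional_information at the top level of the dictionary is also an assumption.

```python
def check_glm_tts_request(request):
    """Check that a request uses the multimodal processor path.

    Hypothetical validator; field placement is assumed, not confirmed.
    """
    if "additional_information" in request:
        raise ValueError("legacy additional_information prompts are rejected")
    missing = []
    if not request.get("prompt"):
        missing.append("prompt")
    if "audio" not in request.get("multi_modal_data", {}):
        missing.append('multi_modal_data["audio"]')
    if "prompt_text" not in request.get("mm_processor_kwargs", {}):
        missing.append('mm_processor_kwargs["prompt_text"]')
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return request


# Placeholder reference audio: one second of silence at an assumed 16 kHz.
reference_audio = ([0.0] * 16000, 16000)

request = {
    "prompt": "Text to synthesize.",
    "multi_modal_data": {"audio": reference_audio},
    "mm_processor_kwargs": {"prompt_text": "Transcript of the reference audio."},
}
check_glm_tts_request(request)
```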

sample

sample(
    logits: Tensor, sampling_metadata: SamplingMetadata
) -> SamplerOutput | None

Repetition-aware sampling (RAS) sampler, following the CosyVoice3 pattern.

Uses vLLM Sampler for logits processing (logit_bias_state handles min_tokens/max_tokens/stop_token_ids). When RAS is enabled, applies per-request nucleus+repetition-aware sampling; otherwise falls back to standard vLLM sampling.
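As a rough illustration of the nucleus plus repetition-aware step, the pure-Python sketch below draws a token with top-p/top-k sampling and, if that token already appears too often in a recent window, redraws from the full distribution. The function names and the default values for top_p, top_k, win_size, and tau_r are assumptions for illustration; this is not the sampler implemented by this class, which also routes logits through the vLLM Sampler first.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_sample(probs, rng, top_p=0.8, top_k=25):
    """Sample from the smallest top-ranked set whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for idx in ranked[:top_k]:
        kept.append(idx)
        mass += probs[idx]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def ras_sample(logits, history, rng, top_p=0.8, top_k=25, win_size=10, tau_r=0.1):
    """Nucleus sample, then redraw from the full distribution on repetition."""
    probs = softmax(logits)
    token = nucleus_sample(probs, rng, top_p, top_k)
    recent = history[-win_size:]
    # Too many recent occurrences of the candidate: fall back to an
    # unrestricted draw so the model can escape a repetition loop.
    if recent.count(token) >= win_size * tau_r:
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


rng = random.Random(0)
logits = [math.log(0.9), math.log(0.1)]  # token 0 dominates the nucleus
print(ras_sample(logits, history=[], rng=rng))        # nucleus draw only
print(ras_sample(logits, history=[0] * 10, rng=rng))  # repetition fallback path
```

With an empty history the nucleus here contains only token 0, so the first call always returns 0. The second call takes the fallback path, where either token can be drawn.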