vllm_omni.model_executor.models.glm_tts ¶
GLM-TTS model support for vLLM-Omni.
GLM-TTS is a two-stage text-to-speech system:
- Stage 0 (LLM): Llama-based AR model generates speech tokens from text
- Stage 1 (DiT): Flow matching model converts speech tokens to mel-spectrogram
Reference: https://github.com/zai-org/GLM-TTS
Modules:
| Name | Description |
|---|---|
| glm_tts | GLM-TTS AR Model (Stage 0): Text → Speech Tokens. |
| glm_tts_dit | GLM-TTS DiT (Diffusion Transformer) Model. |
| glm_tts_dit_wrapper | GLM-TTS DiT wrapper: LLM_GENERATION-compatible flow-matching + vocoder. |
| pipeline | GLM-TTS pipeline: Stage 0 (AR) → Stage 1 (DiT). |
| text_frontend | Lightweight text frontend aligned with the official GLM-TTS preprocessing. |
| vocoder | Vocoder loading and mel-to-audio conversion for GLM-TTS. |
| voice_clone | GLM-TTS voice cloning frontend: speech tokenizer, speaker embedding, mel features. |
GLMTTSForConditionalGeneration ¶
Bases: Module, SupportsMultiModal
vLLM model for GLM-TTS.
Handles both stages via model_stage branching:
- glm_tts (Stage 0): Text → Speech tokens (LLM AR, Llama backbone).
- glm_tts_dit (Stage 1): Speech tokens → Audio (DiT flow-matching + vocoder).
Attributes:
| Name | Type | Description |
|---|---|---|
| have_multimodal_outputs | | Signals scheduler to collect multimodal outputs. |
| has_preprocess | | Model has preprocess hook for input preparation (stage 0 only). |
| has_postprocess | | Model has postprocess hook for hidden state caching (stage 0 only). |
enable_update_additional_information instance-attribute ¶
gpu_resident_buffer_keys instance-attribute ¶
hf_to_vllm_mapper class-attribute instance-attribute ¶
hf_to_vllm_mapper = WeightsMapper(
    orig_to_new_prefix={
        "llama_embedding.": "model.embed_tokens.",
        "llama.model.": "model.",
        "llama.": "model.",
    }
)
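To illustrate how the prefix rules above rename checkpoint keys, here is a minimal pure-Python sketch. It is not vLLM's `WeightsMapper` implementation: the helper `remap_prefix` and the sample weight names are hypothetical, and the sketch assumes rules are tried in insertion order with only the first matching prefix applied.

```python
# Prefix rules copied from hf_to_vllm_mapper above.
ORIG_TO_NEW_PREFIX = {
    "llama_embedding.": "model.embed_tokens.",
    "llama.model.": "model.",
    "llama.": "model.",
}


def remap_prefix(name: str, rules: dict[str, str] = ORIG_TO_NEW_PREFIX) -> str:
    """Hypothetical helper: rewrite the first matching prefix of a weight name."""
    for old, new in rules.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name  # no rule matched; keep the name unchanged


# Sample names are illustrative, not read from a real checkpoint.
for key in ("llama_embedding.weight", "llama.model.layers.0.mlp.up_proj.weight"):
    print(key, "->", remap_prefix(key))
```

Under this first-match assumption the order of the rules matters: if `"llama."` were tried before `"llama.model."`, a key such as `llama.model.norm.weight` would become `model.model.norm.weight`.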
lm_head instance-attribute ¶
lm_head = ParallelLMHead(
    vocab_size,
    hidden_size,
    quant_config=quant_config,
    prefix=maybe_prefix(prefix, "lm_head"),
)
make_empty_intermediate_tensors instance-attribute ¶
model instance-attribute ¶
model_dir instance-attribute ¶
model_dir = resolve_glm_tts_model_dir(
    model_dir,
    tokenizer_path=getattr(model_config, "tokenizer", None),
)
supports_multimodal_raw_input_only class-attribute instance-attribute ¶
compute_logits ¶
compute_logits(
    hidden_states: Tensor | OmniOutput,
    sampling_metadata: Any = None,
) -> Tensor | None
embed_input_ids ¶
embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings: Any | None = None,
    is_multimodal: Any | None = None,
    **kwargs: Any,
) -> Tensor
forward ¶
forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    **kwargs: Any,
) -> Tensor | IntermediateTensors | OmniOutput
load_weights ¶
Load weights from checkpoint.
- Stage 0 (glm_tts): HuggingFace Llama-format checkpoint from the llm/ subdirectory.
- Stage 1 (glm_tts_dit): DiT flow.pt + vocoder.
make_omni_output ¶
make_omni_output(
    model_outputs: Tensor | OmniOutput, **kwargs: Any
) -> OmniOutput
Package hidden states, speech tokens, and voice clone data into OmniOutput.
Streaming contract: delta. Each decode step emits exactly one speech token (or a prefill placeholder). The engine's output processor concatenates per-step deltas into the final tensor.
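The delta contract described above can be sketched as follows. This is a simplified illustration using plain Python lists rather than tensors, not the engine's output processor. The function name `concat_deltas` is hypothetical, and representing the prefill placeholder as `None` and dropping it from the result are assumptions made only for this example.

```python
PREFILL_PLACEHOLDER = None  # assumption: how a prefill step is marked in this sketch


def concat_deltas(deltas):
    """Join per-step deltas (one speech token id each) into the final sequence."""
    tokens = []
    for delta in deltas:
        if delta is PREFILL_PLACEHOLDER:
            continue  # assumption: a placeholder carries no speech token
        tokens.append(delta)
    return tokens


# One prefill step followed by three decode steps (token ids are made up).
steps = [PREFILL_PLACEHOLDER, 11, 42, 7]
print(concat_deltas(steps))
```

Because each step contributes only its own new token, the consumer never has to deduplicate against earlier steps, which is the main difference from a cumulative contract.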
postprocess ¶
Cache last hidden state for next decode step.
preprocess ¶
preprocess(
    input_ids: Tensor,
    input_embeds: Tensor | None,
    **info_dict: Any,
) -> tuple[Tensor, Tensor, dict[str, Any]]
Prepare inputs for GLM-TTS AR model.
GLM-TTS only supports the multimodal processor path: text prompt + multi_modal_data["audio"] + mm_processor_kwargs["prompt_text"]. Legacy placeholder prompts via additional_information are rejected.
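A request in the shape described above might be assembled as in the sketch below. The helper `build_glm_tts_request` is hypothetical. The `"prompt"` key, the `(samples, sample_rate)` audio payload, and the reading of `prompt_text` as the transcript of the reference audio are assumptions. Only the `multi_modal_data["audio"]` and `mm_processor_kwargs["prompt_text"]` keys come from the description above.

```python
from typing import Any


def build_glm_tts_request(text: str, ref_audio: Any, ref_text: str) -> dict[str, Any]:
    """Hypothetical helper: assemble a prompt in the multimodal-processor shape."""
    return {
        "prompt": text,  # assumption: key name for the text to synthesize
        "multi_modal_data": {"audio": ref_audio},
        "mm_processor_kwargs": {"prompt_text": ref_text},
    }


# Placeholder audio: one second of silence at an assumed 16 kHz sample rate.
silence = ([0.0] * 16000, 16000)
request = build_glm_tts_request("Hello there.", silence, "Reference transcript.")
print(sorted(request))
```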
sample ¶
RAS (repetition-aware sampling) sampler following the CosyVoice3 pattern.
Uses vLLM Sampler for logits processing (logit_bias_state handles min_tokens/max_tokens/stop_token_ids). When RAS is enabled, applies per-request nucleus+repetition-aware sampling; otherwise falls back to standard vLLM sampling.
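For intuition, one common formulation of nucleus plus repetition-aware sampling can be sketched in pure Python. This is not the vLLM-Omni sampler: the function names are hypothetical, and the parameter defaults (`top_p`, `top_k`, `win_size`, `tau_r`) are illustrative assumptions. The idea is to draw a candidate from the nucleus and, if it already appears too often in the recent history, redraw from the full distribution.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_sample(probs, top_p, top_k, rng):
    # Rank token ids by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in ranked[:top_k]:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break  # the nucleus now covers at least top_p of the mass
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def ras_sample(logits, history, rng, top_p=0.8, top_k=25, win_size=10, tau_r=0.1):
    """Hypothetical sketch; default values are assumptions, not project settings."""
    probs = softmax(logits)
    candidate = nucleus_sample(probs, top_p, top_k, rng)
    # Count how often the candidate occurs in the last win_size tokens.
    repeats = history[-win_size:].count(candidate)
    if repeats >= win_size * tau_r:
        # Too repetitive: redraw from the full, untruncated distribution.
        candidate = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return candidate


rng = random.Random(0)
history = []
for _ in range(5):
    history.append(ras_sample([2.0, 1.0, 0.5, 0.1], history, rng))
print(history)
```

The fallback widens the candidate set only when the recent window shows repetition, so ordinary steps keep the sharper nucleus distribution.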