vllm_omni.model_executor.models.qwen3_omni ¶
Modules:
| Name | Description |
|---|---|
| pipeline | Qwen3-Omni-MoE pipeline topology (frozen). |
| qwen3_moe | |
| qwen3_omni | Inference-only Qwen3-Omni-MoE unified model (thinker + talker + code2wav). |
| qwen3_omni_code2wav | Inference-only Qwen3-Omni-MoE Code2Wav model. |
| qwen3_omni_moe_code_predictor_mtp | Qwen3-Omni Code Predictor: thin wrapper over CodePredictorWrapper. |
| qwen3_omni_moe_talker | |
| qwen3_omni_moe_thinker | Inference-only Qwen3-Omni-MoE model (thinker part). |
Qwen3OmniMoeForConditionalGeneration ¶
Bases: Module, SupportsMultiModal, SupportsPP, Qwen3OmniMoeConditionalGenerationMixin, CustomProcessMixin, SupportsMRoPE, SupportsRealtime
Unified Qwen3 Omni MoE model combining thinker, talker, and code2wav.
Architecture:
- Thinker: Multimodal understanding (text + audio + video) → text generation
- Talker: Text embeddings → RVQ codec codes
- Code2Wav: RVQ codes → audio waveform
Usage
Set model_stage in vllm_config to one of: "thinker", "talker", "code2wav"
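As a minimal sketch of the stage selection described above, the snippet below validates a stage name against the three documented values. `StageConfig` and `validate_stage` are hypothetical names invented for illustration; they are not part of vllm_omni, and the real `vllm_config` object carries far more than this one field.

```python
from dataclasses import dataclass

# The three stage names documented above.
VALID_STAGES = ("thinker", "talker", "code2wav")


@dataclass
class StageConfig:
    """Hypothetical stand-in for the part of vllm_config that holds model_stage."""

    model_stage: str


def validate_stage(config: StageConfig) -> str:
    """Return the configured stage, or raise if it is not a documented value."""
    if config.model_stage not in VALID_STAGES:
        raise ValueError(
            f"model_stage must be one of {VALID_STAGES}, got {config.model_stage!r}"
        )
    return config.model_stage
```

Checking the value up front gives a clear error for a misspelled stage name rather than a failure later during model construction.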
enable_update_additional_information instance-attribute ¶
gpu_resident_buffer_keys instance-attribute ¶
gpu_resident_buffer_keys: set[tuple[str, str]] = {
    ("hidden_states", "last"),
    ("hidden_states", "trailing_text"),
    ("embed", "tts_pad_projected"),
    ("codes", "audio"),
}
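This page does not say how the key pairs are consumed. The sketch below assumes each pair is `(category, name)` and shows one way such a set could partition a nested payload into entries that stay on the GPU and entries that do not. `split_by_residency` and the payload layout are hypothetical, not the library's actual mechanism.

```python
# Copied from the attribute value shown above.
GPU_RESIDENT_KEYS: set[tuple[str, str]] = {
    ("hidden_states", "last"),
    ("hidden_states", "trailing_text"),
    ("embed", "tts_pad_projected"),
    ("codes", "audio"),
}


def split_by_residency(payload: dict[str, dict[str, object]]):
    """Split a {category: {name: value}} payload by membership in the key set.

    Assumption: keys are (category, name) pairs. Returns two flat dicts
    keyed by those pairs: (resident, other).
    """
    resident: dict[tuple[str, str], object] = {}
    other: dict[tuple[str, str], object] = {}
    for category, entries in payload.items():
        for name, value in entries.items():
            key = (category, name)
            target = resident if key in GPU_RESIDENT_KEYS else other
            target[key] = value
    return resident, other
```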
make_empty_intermediate_tensors instance-attribute ¶
make_empty_intermediate_tensors = (
    make_empty_intermediate_tensors
    if model_stage == "thinker"
    else (lambda: None)
)
thinker instance-attribute ¶
thinker = init_vllm_registered_model(
    vllm_config=thinker_vllm_config,
    prefix=maybe_prefix(prefix, "thinker"),
    hf_config=thinker_config,
    architectures=[
        "Qwen3OmniMoeThinkerForConditionalGeneration"
    ],
)
tts_tokens instance-attribute ¶
tts_tokens = tensor(
    [
        [
            tts_bos_token_id,
            tts_eos_token_id,
            tts_pad_token_id,
        ]
    ],
    device=_module_device(thinker),
    dtype=long,
)
buffer_realtime_audio async classmethod ¶
buffer_realtime_audio(
    audio_stream: AsyncGenerator[ndarray, None],
    input_stream: Queue[list[int]],
    model_config: ModelConfig,
) -> AsyncGenerator[PromptType, None]
compute_logits ¶
compute_logits(
    hidden_states: Tensor | OmniOutput,
    sampling_metadata: SamplingMetadata = None,
) -> Tensor | None
Compute logits from hidden states.
embed_input_ids ¶
forward ¶
forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    generate_audio: bool = True,
    voice_type: str = "ethan",
    codec: Tensor | None = None,
    sampling_metadata: SamplingMetadata | None = None,
    logits_index: int | None = None,
    runtime_additional_information: list[dict[str, Any]] | None = None,
    **kwargs: object,
) -> Tensor | IntermediateTensors | OmniOutput
Unified forward pass for all model stages.
Workflow:
1) Thinker: multimodal understanding → text hidden states
2) Talker → Code Predictor: text embeddings → codec codes (layer 0 from the talker + residual layers from code_predictor)
3) Code2Wav: 8-layer RVQ codes → audio waveform
Returns:
| Type | Description |
|---|---|
| Tensor \| IntermediateTensors \| OmniOutput | OmniOutput with text_hidden_states and optional audio |
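To make the code layout in the workflow concrete, here is a toy sketch of how a layer-0 code sequence and the residual layers might be stacked into an 8-layer RVQ code array before waveform decoding. It assumes the talker supplies layer 0 and the code predictor supplies the remaining seven layers. All function names and the code values are hypothetical placeholders, and plain lists stand in for tensors.

```python
NUM_QUANTIZERS = 8  # the workflow above mentions 8-layer RVQ codes


def fake_talker_layer0(num_frames: int) -> list[int]:
    """Hypothetical stand-in for the talker: one layer-0 code per frame."""
    return [t % 4 for t in range(num_frames)]


def fake_code_predictor(layer0: list[int]) -> list[list[int]]:
    """Hypothetical stand-in for the code predictor: the residual layers.

    Returns NUM_QUANTIZERS - 1 lists, each with one code per frame.
    """
    return [[(c + k) % 4 for c in layer0] for k in range(1, NUM_QUANTIZERS)]


def assemble_rvq_codes(
    layer0: list[int], residual: list[list[int]]
) -> list[list[int]]:
    """Stack layer 0 on top of the residual layers: [num_quantizers, T]."""
    codes = [layer0] + residual
    if len(codes) != NUM_QUANTIZERS:
        raise ValueError(f"expected {NUM_QUANTIZERS} layers, got {len(codes)}")
    return codes
```

With five frames, `assemble_rvq_codes` yields eight rows of five codes each, which matches the `[num_quantizers, T]` layout a per-request slice of the `code` argument to `generate_audio` would have.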
generate_audio ¶
generate_audio(
    code: Tensor,
    left_context_size: list[int] | None = None,
    seq_token_counts: list[int] | None = None,
) -> list[Tensor]
Generate audio waveform from codec codes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| code | Tensor | RVQ codec codes of shape [batch, num_quantizers, T] | required |
| left_context_size | list[int] \| None | Left context size for streaming decode | None |
| seq_token_counts | list[int] \| None | Token count for each request in batch | None |
Returns:
| Type | Description |
|---|---|
| list[Tensor] | List of audio waveforms |
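The exact semantics of `left_context_size` and `seq_token_counts` are not spelled out here. The sketch below shows one plausible reading using a generic left-context streaming pattern: each request keeps only its first `seq_token_counts[i]` frames, the frames are decoded, and the samples produced by the leading `left_context_size[i]` context frames are discarded. `toy_decode`, `decode_batch`, and `SAMPLES_PER_FRAME` are hypothetical, and nested lists stand in for tensors.

```python
SAMPLES_PER_FRAME = 4  # hypothetical upsampling factor, chosen only for this toy


def toy_decode(codes: list[list[int]]) -> list[float]:
    """Toy stand-in for a codec decoder: SAMPLES_PER_FRAME samples per frame."""
    num_frames = len(codes[0])
    waveform: list[float] = []
    for t in range(num_frames):
        frame_value = float(sum(layer[t] for layer in codes))
        waveform.extend([frame_value] * SAMPLES_PER_FRAME)
    return waveform


def decode_batch(code, left_context_size=None, seq_token_counts=None):
    """Decode nested-list codes shaped [batch, num_quantizers, T].

    Assumed semantics (not confirmed by the source): seq_token_counts[i] is
    the number of valid frames for request i, and left_context_size[i] is the
    number of leading frames supplied only as context.
    """
    outputs = []
    for i, request_codes in enumerate(code):
        total = len(request_codes[0])
        count = seq_token_counts[i] if seq_token_counts is not None else total
        # Keep only the frames that belong to this request (drop padding).
        valid = [layer[:count] for layer in request_codes]
        waveform = toy_decode(valid)
        # Discard samples that came from the left-context frames.
        context = left_context_size[i] if left_context_size is not None else 0
        outputs.append(waveform[context * SAMPLES_PER_FRAME:])
    return outputs
```

Decoding with extra left context and then trimming it is a common way to avoid audible seams between streamed chunks, since the decoder sees the frames preceding each chunk boundary.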
get_language_model ¶
Delegate to the active stage's language model for upstream MoE resolution.
get_mrope_input_positions ¶
get_mrope_input_positions(
    input_tokens: list[int],
    mm_features: list[MultiModalFeatureSpec] | None = None,
    **kwargs: object,
) -> tuple[Tensor, int]
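For intuition about the `tuple[Tensor, int]` return value, the sketch below covers only the text-only case of multimodal RoPE (M-RoPE) as used in the Qwen2-VL family, where all three position axes (temporal, height, width) share the same running index. It assumes the second element is a position delta defined as the largest position plus one, minus the number of input tokens. `text_only_mrope_positions` is a hypothetical helper using nested lists instead of tensors; it does not handle image, audio, or video features.

```python
def text_only_mrope_positions(
    input_tokens: list[int],
) -> tuple[list[list[int]], int]:
    """Return (positions, delta) for a prompt with no multimodal features.

    positions has shape [3, len(input_tokens)]: one row per M-RoPE axis.
    delta is assumed to be (max position + 1) - len(input_tokens).
    """
    n = len(input_tokens)
    base = list(range(n))
    # For pure text, the three axes carry identical indices.
    positions = [base.copy() for _ in range(3)]
    max_position = n - 1  # -1 for an empty prompt, which still gives delta 0
    delta = max_position + 1 - n
    return positions, delta
```

For text-only input the delta is always 0. Under this definition it becomes nonzero only when multimodal tokens make the largest position differ from the token count minus one.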
load_weights ¶
Load weights for all components of the omni model.
make_omni_output ¶
make_omni_output(
    model_outputs: Tensor | OmniOutput, **kwargs
) -> OmniOutput
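Make an OmniOutput object from model outputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_outputs | Tensor \| OmniOutput | Model outputs | required |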
sample ¶
Sample from logits.
talker_mtp ¶
talker_mtp(
    input_ids: Tensor,
    input_embeds: Tensor,
    last_talker_hidden: Tensor,
    text_step: Tensor,
    **kwargs: Any,
)
talker_postprocess ¶
talker_postprocess(
    hidden_states: Tensor, **info_dict: object
)
Postprocess the talker hidden states.
talker_preprocess ¶
talker_preprocess(
    input_ids: Tensor,
    input_embeds: Tensor,
    **info_dict: dict,
)
Preprocess the talker embeddings. Note that the MTP is set here.
talker_preprocess_decode ¶
talker_preprocess_decode(
    input_ids: Tensor,
    input_embeds: Tensor,
    update_dict: OmniPayload,
    payload: OmniPayload,
)
talker_preprocess_prefill ¶
talker_preprocess_prefill(
    input_ids: Tensor,
    input_embeds: Tensor,
    payload: OmniPayload,
)