vllm_omni.model_executor.models.qwen3_omni

Modules:

- pipeline: Qwen3-Omni-MoE pipeline topology (frozen).
- qwen3_moe
- qwen3_omni: Inference-only Qwen3-Omni-MoE unified model (thinker + talker + code2wav).
- qwen3_omni_code2wav: Inference-only Qwen3-Omni-MoE Code2Wav model.
- qwen3_omni_moe_code_predictor_mtp: Qwen3-Omni Code Predictor, a thin wrapper over CodePredictorWrapper.
- qwen3_omni_moe_talker
- qwen3_omni_moe_thinker: Inference-only Qwen3-Omni-MoE model (thinker part).

Qwen3OmniMoeForConditionalGeneration

Bases: Module, SupportsMultiModal, SupportsPP, Qwen3OmniMoeConditionalGenerationMixin, CustomProcessMixin, SupportsMRoPE, SupportsRealtime

Unified Qwen3 Omni MoE model combining thinker, talker, and code2wav.

Architecture:

- Thinker: Multimodal understanding (text + audio + video) → text generation
- Talker: Text embeddings → RVQ codec codes
- Code2Wav: RVQ codes → audio waveform

Usage

Set model_stage in vllm_config to one of: "thinker", "talker", "code2wav"
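The stage selection described above can be sketched with plain Python stand-ins. `StageConfig` and `select_stage` are hypothetical names used only for illustration; they are not part of vllm_omni, and the real model reads model_stage from its vllm_config.

```python
from dataclasses import dataclass

# The three stage names listed in the usage note above.
VALID_STAGES = ("thinker", "talker", "code2wav")


@dataclass
class StageConfig:
    """Hypothetical stand-in for the part of a config that carries model_stage."""

    model_stage: str


def select_stage(config: StageConfig) -> str:
    """Return the stage name, rejecting anything outside the three known stages."""
    if config.model_stage not in VALID_STAGES:
        raise ValueError(
            f"model_stage must be one of {VALID_STAGES}, got {config.model_stage!r}"
        )
    return config.model_stage
```

Validating the stage name up front, as in this sketch, makes a misspelled stage fail at construction time rather than at the first forward pass.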

code2wav instance-attribute

code2wav = None

code2wav_config instance-attribute

code2wav_config = code2wav_config

config instance-attribute

config = config

enable_update_additional_information instance-attribute

enable_update_additional_information = True

gpu_resident_buffer_keys instance-attribute

gpu_resident_buffer_keys: set[tuple[str, str]] = {
    ("hidden_states", "last"),
    ("hidden_states", "trailing_text"),
    ("embed", "tts_pad_projected"),
    ("codes", "audio"),
}
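As a rough illustration of how a set of two-part keys like this can be used, the sketch below partitions the entries of a nested payload into keys that appear in the set and keys that do not. The nested `payload` layout and the helper `split_by_residency` are assumptions made for the example; the source does not state how the keys are consumed.

```python
# Keys copied from the attribute listing above.
GPU_RESIDENT_BUFFER_KEYS: set[tuple[str, str]] = {
    ("hidden_states", "last"),
    ("hidden_states", "trailing_text"),
    ("embed", "tts_pad_projected"),
    ("codes", "audio"),
}


def split_by_residency(payload: dict[str, dict[str, object]]):
    """Hypothetical helper: split (group, name) keys into listed and unlisted ones."""
    resident, other = [], []
    for group, entries in payload.items():
        for name in entries:
            key = (group, name)
            if key in GPU_RESIDENT_BUFFER_KEYS:
                resident.append(key)
            else:
                other.append(key)
    # Sort for a deterministic result regardless of dict ordering.
    return sorted(resident), sorted(other)
```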

has_postprocess instance-attribute

has_postprocess = False

has_preprocess instance-attribute

has_preprocess = False

have_multimodal_outputs instance-attribute

have_multimodal_outputs = True

make_empty_intermediate_tensors instance-attribute

make_empty_intermediate_tensors = (
    make_empty_intermediate_tensors
    if model_stage == "thinker"
    else (lambda: None)
)
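The conditional above binds the real factory only for the thinker stage and a no-op returning None for every other stage. A minimal sketch of the same pattern, where `pick_factory` and its arguments are hypothetical names:

```python
from typing import Callable, Optional


def pick_factory(
    model_stage: str, thinker_factory: Callable[[], object]
) -> Callable[[], Optional[object]]:
    """Return the thinker's factory for the thinker stage, else a no-op factory."""
    if model_stage == "thinker":
        return thinker_factory
    # Other stages get a callable with the same shape that always returns None.
    return lambda: None
```

Returning a callable in both branches lets callers invoke the attribute unconditionally and check the result for None.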

model instance-attribute

model = thinker

model_stage instance-attribute

model_stage = model_stage

multimodal_config instance-attribute

multimodal_config = multimodal_config

realtime_max_tokens class-attribute instance-attribute

realtime_max_tokens = 64

requires_raw_input_tokens instance-attribute

requires_raw_input_tokens = True

sampler cached property

sampler

Get sampler from active model.

suppressed_tokens instance-attribute

suppressed_tokens = _get_talker_suppressed_tokens()

talker instance-attribute

talker = None

talker_config instance-attribute

talker_config = talker_config

thinker instance-attribute

thinker = init_vllm_registered_model(
    vllm_config=thinker_vllm_config,
    prefix=maybe_prefix(prefix, "thinker"),
    hf_config=thinker_config,
    architectures=[
        "Qwen3OmniMoeThinkerForConditionalGeneration"
    ],
)

thinker_config instance-attribute

thinker_config = thinker_config

tts_tokens instance-attribute

tts_tokens = tensor(
    [
        [
            tts_bos_token_id,
            tts_eos_token_id,
            tts_pad_token_id,
        ]
    ],
    device=_module_device(thinker),
    dtype=long,
)

vllm_config instance-attribute

vllm_config = vllm_config

buffer_realtime_audio async classmethod

buffer_realtime_audio(
    audio_stream: AsyncGenerator[ndarray, None],
    input_stream: Queue[list[int]],
    model_config: ModelConfig,
) -> AsyncGenerator[PromptType, None]

compute_logits

compute_logits(
    hidden_states: Tensor | OmniOutput,
    sampling_metadata: SamplingMetadata = None,
) -> Tensor | None

Compute logits from hidden states.

embed_input_ids

embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings=None,
    is_multimodal=None,
) -> Tensor

embed_multimodal

embed_multimodal(**kwargs)

Delegate to active model for multimodal processing.

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    generate_audio: bool = True,
    voice_type: str = "ethan",
    codec: Tensor | None = None,
    sampling_metadata: SamplingMetadata | None = None,
    logits_index: int | None = None,
    runtime_additional_information: list[dict[str, Any]]
    | None = None,
    **kwargs: object,
) -> Tensor | IntermediateTensors | OmniOutput

Unified forward pass for all model stages.

Workflow:

1) Thinker: multimodal understanding → text hidden states
2) Talker → Code Predictor: text embeddings → codec codes (layer 0 from the talker, residual layers from the code predictor)
3) Code2Wav: 8-layer RVQ codes → audio waveform
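The three-step workflow can be sketched with toy stand-ins to show how the data is handed from one stage to the next. Every function below is a hypothetical stub operating on plain lists; none of them reflects the real thinker, talker, code predictor, or Code2Wav computations. The only detail taken from the text is that layer 0 plus the residual layers add up to 8 RVQ layers.

```python
NUM_QUANTIZERS = 8  # "8-layer RVQ codes" in the workflow above


def thinker_stub(prompt: str) -> list[float]:
    """Stand-in for the thinker: one fake hidden value per word."""
    return [float(len(word)) for word in prompt.split()]


def talker_stub(hidden: list[float]) -> list[int]:
    """Stand-in for the talker: one layer-0 codec code per step."""
    return [int(h) % 16 for h in hidden]


def code_predictor_stub(layer0: list[int]) -> list[list[int]]:
    """Stand-in for the code predictor: fills the remaining residual layers."""
    return [[(c + k) % 16 for c in layer0] for k in range(1, NUM_QUANTIZERS)]


def code2wav_stub(codes: list[list[int]]) -> list[float]:
    """Stand-in for Code2Wav: averages all layers into one sample per step."""
    steps = len(codes[0])
    return [sum(layer[t] for layer in codes) / len(codes) for t in range(steps)]


def run_pipeline(prompt: str) -> tuple[list[list[int]], list[float]]:
    """Chain the stubs in workflow order and return (codes, waveform)."""
    hidden = thinker_stub(prompt)
    layer0 = talker_stub(hidden)
    codes = [layer0] + code_predictor_stub(layer0)
    return codes, code2wav_stub(codes)
```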

Returns:

- Tensor | IntermediateTensors | OmniOutput: OmniOutput with text_hidden_states and optional audio

generate_audio

generate_audio(
    code: Tensor,
    left_context_size: list[int] | None = None,
    seq_token_counts: list[int] | None = None,
) -> list[Tensor]

Generate audio waveform from codec codes.

Parameters:

- code (Tensor, required): [batch, num_quantizers, T] - RVQ codec codes
- left_context_size (list[int] | None, default None): Left context size for streaming decode
- seq_token_counts (list[int] | None, default None): Token count for each request in batch

Returns:

- list[Tensor]: list of audio waveforms
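To make the [batch, num_quantizers, T] layout of code concrete, the sketch below checks that layout on nested Python lists rather than tensors. `describe_codes` is a hypothetical helper for illustration and is not part of the model's API.

```python
def describe_codes(code: list[list[list[int]]]) -> tuple[int, int, int]:
    """Return (batch, num_quantizers, T) for nested-list codes, or raise if ragged."""
    if not code or not code[0]:
        raise ValueError("code must contain at least one request and one quantizer")
    batch = len(code)
    num_quantizers = len(code[0])
    steps = len(code[0][0])
    for request in code:
        # Every request must have the same number of quantizer layers.
        if len(request) != num_quantizers:
            raise ValueError("inconsistent number of quantizers across the batch")
        # Every layer must cover the same number of time steps.
        if any(len(layer) != steps for layer in request):
            raise ValueError("inconsistent number of time steps across layers")
    return batch, num_quantizers, steps
```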

get_language_model

get_language_model() -> Module

Delegate to the active stage's language model for upstream MoE resolution.

get_mrope_input_positions

get_mrope_input_positions(
    input_tokens: list[int],
    mm_features: list[MultiModalFeatureSpec] | None = None,
    **kwargs: object,
) -> tuple[Tensor, int]

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights for all components of the omni model.

make_omni_output

make_omni_output(
    model_outputs: Tensor | OmniOutput, **kwargs
) -> OmniOutput

Make an OmniOutput object from model outputs.

Args:

- model_outputs: Model outputs

sample

sample(
    logits: Tensor, sampling_metadata: SamplingMetadata
) -> SamplerOutput | None

Sample from logits.

talker_mtp

talker_mtp(
    input_ids: Tensor,
    input_embeds: Tensor,
    last_talker_hidden: Tensor,
    text_step: Tensor,
    **kwargs: Any,
)

talker_postprocess

talker_postprocess(
    hidden_states: Tensor, **info_dict: object
)

Postprocess the talker hidden states.

talker_preprocess

talker_preprocess(
    input_ids: Tensor,
    input_embeds: Tensor,
    **info_dict: dict,
)

Preprocess the talker embeddings. Note that the MTP is also set here.

talker_preprocess_decode

talker_preprocess_decode(
    input_ids: Tensor,
    input_embeds: Tensor,
    update_dict: OmniPayload,
    payload: OmniPayload,
)

talker_preprocess_prefill

talker_preprocess_prefill(
    input_ids: Tensor,
    input_embeds: Tensor,
    payload: OmniPayload,
)