vllm_omni.model_executor.models.mimo_audio.pipeline

MiMo Audio pipeline topology (frozen).

Stage 0 (fused thinker+talker): multimodal understanding; produces text and RVQ codes.

Stage 1 (Code2Wav): RVQ codes → waveform.

MiMoAudioConfig inherits from Qwen2Config, so the HF model_type field reports qwen2, and the registry's model_type-based auto-detection cannot identify this pipeline from that field alone. Setting hf_architectures lets StageConfigFactory fall back to matching hf_config.architectures instead.
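The fallback described above can be illustrated with a minimal sketch. It is not the StageConfigFactory implementation: `PipelineEntry` and `resolve_pipeline` are hypothetical names, and the lookup order (model_type first, then architectures) is an assumption drawn from the description.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PipelineEntry:
    # Hypothetical stand-in for a registered pipeline; not the vllm_omni class.
    model_type: str
    hf_architectures: tuple[str, ...] = ()


def resolve_pipeline(
    registry: Sequence[PipelineEntry],
    hf_model_type: str,
    hf_architectures: Sequence[str],
) -> Optional[PipelineEntry]:
    # First pass: direct match on the model_type reported by the HF config.
    for entry in registry:
        if entry.model_type == hf_model_type:
            return entry
    # Fallback (assumed order): match any architecture name listed by the entry.
    wanted = set(hf_architectures)
    for entry in registry:
        if wanted.intersection(entry.hf_architectures):
            return entry
    return None


registry = [
    PipelineEntry("mimo_audio", ("MiMoAudioModel", "MiMoV2ASRForCausalLM")),
]

# The HF config reports "qwen2", so only the architecture fallback can match.
match = resolve_pipeline(registry, "qwen2", ["MiMoAudioModel"])
print(match.model_type if match else None)
```

In this sketch a checkpoint whose config reports qwen2 but lists MiMoAudioModel among its architectures still resolves to the mimo_audio entry, while a config with unrelated architectures resolves to nothing.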

MIMO_AUDIO_PIPELINE module-attribute

MIMO_AUDIO_PIPELINE = PipelineConfig(
    model_type="mimo_audio",
    model_arch="MiMoAudioModel",
    hf_architectures=(
        "MiMoAudioModel",
        "MiMoV2ASRForCausalLM",
    ),
    stages=(
        StagePipelineConfig(
            stage_id=0,
            model_stage="fused_thinker_talker",
            execution_type=LLM_AR,
            input_sources=(),
            final_output=True,
            final_output_type="text",
            owns_tokenizer=True,
            engine_output_type="latent",
            async_chunk_process_next_stage_input_func=f"{_PROC}.llm2code2wav_async_chunk",
            custom_process_next_stage_input_func=f"{_PROC}.llm2code2wav_full_payload",
            sampling_constraints={
                "detokenize": True,
                "stop_token_ids": [
                    NO_INTERLEAVE_NEXT_TOKEN_ID,
                    _IM_END_TOKEN_ID,
                ],
            },
        ),
        StagePipelineConfig(
            stage_id=1,
            model_stage="code2wav",
            execution_type=LLM_GENERATION,
            input_sources=(0,),
            final_output=True,
            final_output_type="audio",
            engine_output_type="audio",
            custom_process_input_func=f"{_PROC}.llm2code2wav",
            sync_process_input_func=f"{_PROC}.llm2code2wav_token_only",
            sampling_constraints={"detokenize": False},
        ),
    ),
)
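To make the topology easier to read, the sketch below mirrors a few of the fields above in a hypothetical `Stage` dataclass and derives the stage graph from input_sources. The helper names are illustrative only and are not part of vllm_omni; the real PipelineConfig may expose this information differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    # Hypothetical, trimmed-down mirror of four StagePipelineConfig fields.
    stage_id: int
    input_sources: tuple[int, ...]
    final_output: bool
    final_output_type: str


# Values copied from MIMO_AUDIO_PIPELINE above.
STAGES = (
    Stage(stage_id=0, input_sources=(), final_output=True, final_output_type="text"),
    Stage(stage_id=1, input_sources=(0,), final_output=True, final_output_type="audio"),
)


def entry_stages(stages):
    # Stages with no upstream sources consume the user request directly.
    return [s.stage_id for s in stages if not s.input_sources]


def downstream_of(stages, stage_id):
    # Stages that list `stage_id` among their input sources.
    return [s.stage_id for s in stages if stage_id in s.input_sources]


def final_outputs(stages):
    # Map each stage that produces a final output to its output type.
    return {s.stage_id: s.final_output_type for s in stages if s.final_output}


print(entry_stages(STAGES))      # [0]
print(downstream_of(STAGES, 0))  # [1]
print(final_outputs(STAGES))     # {0: 'text', 1: 'audio'}
```

Read this way, stage 0 is the only entry point, stage 1 consumes stage 0's output, and both stages are marked as final outputs: text from stage 0 and audio from stage 1.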