vllm_omni.model_executor.models.mimo_audio.pipeline
MiMo Audio pipeline topology (frozen).
Stage 0: fused thinker+talker (multimodal understanding, text, and RVQ codes).
Stage 1: Code2Wav (RVQ codes → waveform).
MiMoAudioConfig inherits from Qwen2Config, so the HF model_type field reports qwen2 and the registry's model_type-based auto-detection cannot tell MiMo Audio apart from a plain Qwen2 model. Setting hf_architectures lets StageConfigFactory fall back to matching hf_config.architectures instead.
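The fallback described above can be pictured with a minimal sketch. It is not the StageConfigFactory implementation: `PipelineEntry` and `resolve_pipeline` are hypothetical names introduced only for illustration, and the lookup order (model_type first, architectures second) is an assumption drawn from the description.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PipelineEntry:
    """Hypothetical stand-in for a registered pipeline (not a vllm_omni class)."""
    model_type: str
    hf_architectures: tuple = ()


def resolve_pipeline(
    model_type: str,
    architectures: Sequence[str],
    registry: Sequence[PipelineEntry],
) -> Optional[PipelineEntry]:
    """Assumed lookup order: exact model_type match, then architecture match."""
    # Primary path: the HF model_type names a registered pipeline directly.
    for entry in registry:
        if entry.model_type == model_type:
            return entry
    # Fallback path: any overlap between the checkpoint's architectures
    # and the architectures a pipeline declares.
    wanted = set(architectures)
    for entry in registry:
        if wanted.intersection(entry.hf_architectures):
            return entry
    return None


registry = [
    PipelineEntry(
        model_type="mimo_audio",
        hf_architectures=("MiMoAudioModel", "MiMoV2ASRForCausalLM"),
    ),
]

# A MiMo Audio checkpoint reports model_type "qwen2", so only the
# architecture fallback can find its pipeline.
match = resolve_pipeline("qwen2", ["MiMoAudioModel"], registry)
print(match.model_type if match else None)
```

In this sketch a plain Qwen2 checkpoint (architectures such as `["Qwen2ForCausalLM"]`) resolves to `None`, which is the disambiguation that matching on model_type alone cannot provide.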
MIMO_AUDIO_PIPELINE module-attribute
MIMO_AUDIO_PIPELINE = PipelineConfig(
    model_type="mimo_audio",
    model_arch="MiMoAudioModel",
    hf_architectures=(
        "MiMoAudioModel",
        "MiMoV2ASRForCausalLM",
    ),
    stages=(
        StagePipelineConfig(
            stage_id=0,
            model_stage="fused_thinker_talker",
            execution_type=LLM_AR,
            input_sources=(),
            final_output=True,
            final_output_type="text",
            owns_tokenizer=True,
            engine_output_type="latent",
            async_chunk_process_next_stage_input_func=f"{_PROC}.llm2code2wav_async_chunk",
            custom_process_next_stage_input_func=f"{_PROC}.llm2code2wav_full_payload",
            sampling_constraints={
                "detokenize": True,
                "stop_token_ids": [
                    NO_INTERLEAVE_NEXT_TOKEN_ID,
                    _IM_END_TOKEN_ID,
                ],
            },
        ),
        StagePipelineConfig(
            stage_id=1,
            model_stage="code2wav",
            execution_type=LLM_GENERATION,
            input_sources=(0,),
            final_output=True,
            final_output_type="audio",
            engine_output_type="audio",
            custom_process_input_func=f"{_PROC}.llm2code2wav",
            sync_process_input_func=f"{_PROC}.llm2code2wav_token_only",
            sampling_constraints={"detokenize": False},
        ),
    ),
)
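To make the two-stage topology easier to read, the sketch below mirrors a few of the fields above in a plain dataclass and checks the wiring between stages. `Stage` and `check_topology` are hypothetical helpers written for this illustration; they are not part of vllm_omni, and the real PipelineConfig may validate its stages differently or not at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical stand-in mirroring a subset of StagePipelineConfig fields."""
    stage_id: int
    model_stage: str
    input_sources: tuple = ()
    final_output: bool = False
    final_output_type: str = ""


def check_topology(stages):
    """Check that each stage reads only from earlier stages.

    Returns a mapping from final output type to the stage that produces it.
    """
    seen = set()
    outputs = {}
    for stage in stages:
        missing = [src for src in stage.input_sources if src not in seen]
        if missing:
            raise ValueError(
                f"stage {stage.stage_id} reads from undeclared stages {missing}"
            )
        seen.add(stage.stage_id)
        if stage.final_output:
            outputs[stage.final_output_type] = stage.stage_id
    return outputs


# Same shape as MIMO_AUDIO_PIPELINE: stage 0 has no upstream stage and
# emits text; stage 1 consumes stage 0 and emits audio.
STAGES = (
    Stage(0, "fused_thinker_talker", (), True, "text"),
    Stage(1, "code2wav", (0,), True, "audio"),
)

print(check_topology(STAGES))  # {'text': 0, 'audio': 1}
```

Both stages set final_output=True, so a single request yields two user-visible results: text from stage 0 and audio from stage 1.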