MiMo-Audio Offline Inference

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/mimo_audio

This directory contains an offline demo for running MiMo-Audio models with vLLM Omni. It builds task-specific inputs and generates WAV files or text outputs locally.

Model Overview

MiMo-Audio provides multiple task variants for audio understanding and generation:

  • tts_sft: Basic text-to-speech generation from text input.
  • tts_sft_with_instruct: TTS generation with explicit voice style instructions.
  • tts_sft_with_audio: TTS generation with audio reference for voice cloning.
  • tts_sft_with_natural_instruction: TTS generation from natural language descriptions embedded in text.
  • audio_trancribing_sft: Transcribe audio to text (speech-to-text). The upstream task name is spelled "trancribing", so pass it exactly as written.
  • audio_understanding_sft: Understand and analyze audio content with text queries.
  • audio_understanding_sft_with_thinking: Audio understanding with reasoning chain.
  • spoken_dialogue_sft_multiturn: Multi-turn spoken dialogue with audio input/output.
  • speech2text_dialogue_sft_multiturn: Multi-turn dialogue converting speech to text.
  • text_dialogue_sft_multiturn: Multi-turn text-only dialogue.

Setup

Please refer to the stage configuration documentation to configure memory allocation appropriately for your hardware setup.

Environment Variables

The MIMO_AUDIO_TOKENIZER_PATH environment variable is required because of the model's specialized architecture. Set it before running the script:

export MIMO_AUDIO_TOKENIZER_PATH="XiaomiMiMo/MiMo-Audio-Tokenizer"
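
If you launch the demo from your own Python wrapper, it can help to fail early when the variable is missing. The following is a minimal sketch, not part of end2end.py; the helper name require_tokenizer_path is hypothetical, and it only checks that the variable is non-empty, not that the path resolves.

```python
import os


def require_tokenizer_path(env=None):
    """Return MIMO_AUDIO_TOKENIZER_PATH, or raise if it is missing or empty.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    This is a hypothetical helper, not something end2end.py provides.
    """
    env = os.environ if env is None else env
    value = env.get("MIMO_AUDIO_TOKENIZER_PATH", "").strip()
    if not value:
        raise RuntimeError(
            "MIMO_AUDIO_TOKENIZER_PATH is not set; export it before running end2end.py"
        )
    return value
```

Calling require_tokenizer_path() at the top of a wrapper script surfaces the problem before any model loading starts.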

Flash Attention (audio generation)

For audio generation (e.g. TTS variants, multi-turn spoken dialogue with audio output), install the flash-attn package with a build that matches your CUDA and PyTorch versions. On GPU, omitting flash-attn can cause generated audio to be noise-only or otherwise unusable. See the FlashAttention project for installation options and prebuilt wheels.
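
A quick way to see whether flash-attn is visible to the current interpreter is to look up its import name without importing it. This is a rough sketch: the helper names are hypothetical, and finding the module does not prove that the build matches your CUDA and PyTorch versions.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def flash_attn_status():
    """Report whether the flash-attn package (import name: flash_attn) is present."""
    if is_installed("flash_attn"):
        return "flash_attn found (build compatibility is not verified by this check)"
    return "flash_attn missing: audio generation on GPU may produce unusable output"
```

Printing flash_attn_status() before an audio generation run gives an early hint when the package is absent.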

Quick Start

Run a single sample for basic TTS:

python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type tts_sft

Run batch samples for basic TTS:

python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type tts_sft \
  --num-prompts {batch_size}

When you batch multiple prompts, the total number of tokens passed to the next stage can exceed the max_model_len value in the mimo_audio.yaml configuration file. If you raise max_model_len to accommodate this, also set max_position_embeddings in MiMo-Audio-7B-Instruct/config.json to the same value.
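
The config.json side of that change can be scripted. The sketch below assumes you already know the max_model_len value you set in mimo_audio.yaml and pass it in explicitly (the layout of the YAML file is not assumed here); the function name is hypothetical.

```python
import json
from pathlib import Path


def sync_max_position_embeddings(config_path, max_model_len):
    """Set max_position_embeddings in a config.json to max_model_len.

    Returns the previous value, or None if the key was absent.
    All other keys in the file are preserved.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("max_position_embeddings")
    if previous != max_model_len:
        config["max_position_embeddings"] = int(max_model_len)
        path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous
```

For example, sync_max_position_embeddings("MiMo-Audio-7B-Instruct/config.json", 16384) would rewrite the file only if the stored value differs; 16384 is an illustrative number, not a recommended setting.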

Generated audio files are saved to output_audio/ by default. --num-prompts can also be used with all of the tasks below.

Task Usage

tts_sft (Basic Text-to-Speech)

Generate speech from text input:

python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type tts_sft \
  --text "The weather is so nice today."

tts_sft_with_instruct (TTS with Voice Instructions)

Generate speech with explicit voice style instructions:

python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type tts_sft_with_instruct \
  --text "The weather is so nice today." \
  --instruct "Speak happily in a child's voice"

tts_sft_with_audio (TTS with Audio Reference)

Generate speech using an audio reference for voice cloning:

python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type tts_sft_with_audio \
  --text "The weather is so nice today." \
  --audio-path "./spoken_dialogue_assistant_turn_1.wav"

tts_sft_with_natural_instruction (Natural Language TTS)

Generate speech from text containing natural voice descriptions:

python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type tts_sft_with_natural_instruction \
  --text "In a panting young male voice, he said: I can't run anymore, wait for me!"

audio_trancribing_sft (Speech-to-Text)

Transcribe audio to text:

python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type audio_trancribing_sft \
  --audio-path "./spoken_dialogue_assistant_turn_1.wav"

audio_understanding_sft (Audio Understanding)

Understand and analyze audio content with text queries:

python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type audio_understanding_sft \
  --text "Summarize the audio." \
  --audio-path "./spoken_dialogue_assistant_turn_1.wav"

audio_understanding_sft_with_thinking (Audio Understanding with Reasoning)

Audio understanding with reasoning chain:

python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type audio_understanding_sft_with_thinking \
  --text "Summarize the audio." \
  --audio-path "./spoken_dialogue_assistant_turn_1.wav"

spoken_dialogue_sft_multiturn (Multi-turn Spoken Dialogue)

Multi-turn dialogue with audio input and output:

python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type spoken_dialogue_sft_multiturn \
  --audio-path "./prompt_speech_zh_m.wav"

Note: This task uses hardcoded audio files in the script. The audio files used in examples are available at: https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples

speech2text_dialogue_sft_multiturn (Speech-to-Text Dialogue)

Multi-turn dialogue converting speech to text:

python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type speech2text_dialogue_sft_multiturn

Note: This task uses hardcoded audio files and message lists in the script.

text_dialogue_sft_multiturn (Text Dialogue)

Multi-turn text-only dialogue:

python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type text_dialogue_sft_multiturn

Note: This task uses hardcoded message lists in the script.
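
The task commands above differ only in a handful of flags, so they are easy to assemble programmatically. The following sketch builds an argument list using only the flags that appear in this document; build_command is a hypothetical helper, and it does not validate which flags a given task actually requires.

```python
MODEL_NAME = "XiaomiMiMo/MiMo-Audio-7B-Instruct"


def build_command(query_type, text=None, instruct=None, audio_path=None,
                  num_prompts=None, output_dir=None):
    """Assemble an end2end.py argument list from the flags shown in this README.

    Optional flags are appended only when a value is given.
    """
    cmd = [
        "python3", "-u", "end2end.py",
        "--model-name", MODEL_NAME,
        "--query-type", query_type,
    ]
    optional = [
        ("--text", text),
        ("--instruct", instruct),
        ("--audio-path", audio_path),
        ("--num-prompts", num_prompts),
        ("--output-dir", output_dir),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the example directory. Passing a list rather than a single shell string avoids quoting problems with text such as "I can't run anymore, wait for me!".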

Troubleshooting

Tokenizer path

  • MIMO_AUDIO_TOKENIZER_PATH not set or model fails to find tokenizer
    Export the tokenizer path before running:
    export MIMO_AUDIO_TOKENIZER_PATH="XiaomiMiMo/MiMo-Audio-Tokenizer"
    
    See Environment Variables in Setup.

Other

  • If the model or stage config fails to load, check the stage configuration documentation for memory and GPU settings.
  • For errors when reading or writing WAV files (e.g. unsupported format), ensure input files are standard WAV/MP3 and that soundfile is linked to a working libsndfile.
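
To narrow down a WAV problem, you can inspect the file header with Python's standard wave module, independently of soundfile. This is only a rough diagnostic sketch: the wave module handles PCM WAV files, so a failure here suggests a non-PCM or non-WAV input but does not reproduce what the script itself does. The function name describe_wav is hypothetical.

```python
import wave


def describe_wav(path):
    """Return basic header fields of a PCM WAV file using the standard library.

    Raises wave.Error if the file is not a WAV file the wave module can parse.
    """
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate": rate,
            "duration_seconds": frames / rate if rate else 0.0,
        }
```

Running describe_wav on an input file or on a file in the output directory shows the channel count, sample width, sample rate, and duration at a glance.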

Notes

  • The script uses default model paths and audio files embedded in end2end.py. Update them if your local cache path differs.
  • Use --output-dir to change the output folder (default: ./output_audio).
  • Use --num-prompts to generate multiple prompts in one run (default: 1).
  • Audio files used in multi-turn dialogue examples are available at: https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples
  • The script supports various configuration options for initialization timeouts, batch timeouts, and shared memory thresholds. See --help for details.

Example materials

end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/mimo_audio/end2end.py.

message_base64_wav.json

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/mimo_audio/message_base64_wav.json.

message_convert.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/mimo_audio/message_convert.py.

process_speechdata.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/mimo_audio/process_speechdata.py.