MiMo-Audio Offline Inference¶
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/mimo_audio.
This directory contains an offline demo for running MiMo-Audio models with vLLM Omni. It builds task-specific inputs and generates WAV files or text outputs locally.
Model Overview¶
MiMo-Audio provides multiple task variants for audio understanding and generation:
- tts_sft: Basic text-to-speech generation from text input.
- tts_sft_with_instruct: TTS generation with explicit voice style instructions.
- tts_sft_with_audio: TTS generation with audio reference for voice cloning.
- tts_sft_with_natural_instruction: TTS generation from natural language descriptions embedded in text.
- audio_trancribing_sft: Transcribe audio to text (speech-to-text). The task name keeps the upstream spelling "trancribing"; pass it exactly as written.
- audio_understanding_sft: Understand and analyze audio content with text queries.
- audio_understanding_sft_with_thinking: Audio understanding with reasoning chain.
- spoken_dialogue_sft_multiturn: Multi-turn spoken dialogue with audio input/output.
- speech2text_dialogue_sft_multiturn: Multi-turn dialogue converting speech to text.
- text_dialogue_sft_multiturn: Multi-turn text-only dialogue.
Setup¶
Please refer to the stage configuration documentation to configure memory allocation appropriately for your hardware setup.
Environment Variables¶
The MIMO_AUDIO_TOKENIZER_PATH environment variable is mandatory because of the model's specialized architecture. Export it before running the script.
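Because a missing variable only surfaces once the model starts loading, a small preflight check can fail earlier. The sketch below is a hypothetical helper, not part of end2end.py; it only confirms that the variable is set and non-empty, and does not verify that the value points at a valid tokenizer.

```python
import os


def require_tokenizer_path(env=None):
    """Return MIMO_AUDIO_TOKENIZER_PATH, or raise if it is unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    This is a hypothetical preflight helper, not an API of vLLM Omni.
    """
    env = os.environ if env is None else env
    value = env.get("MIMO_AUDIO_TOKENIZER_PATH", "").strip()
    if not value:
        raise RuntimeError(
            "MIMO_AUDIO_TOKENIZER_PATH is not set; export it before running end2end.py"
        )
    return value
```

Calling `require_tokenizer_path()` at the top of a wrapper script stops the run before any model weights are loaded.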
Flash Attention (audio generation)¶
For audio generation (e.g. TTS variants, multi-turn spoken dialogue with audio output), install the flash-attn package with a build that matches your CUDA and PyTorch versions. On GPU, omitting flash-attn can cause generated audio to be noise-only or otherwise unusable. See the FlashAttention project for installation options and prebuilt wheels.
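A quick way to see whether flash-attn is installed in the active environment is to look up its import name, `flash_attn`, without importing it. This is a minimal sketch: it only shows that the module can be found, not that the build matches your CUDA and PyTorch versions.

```python
import importlib.util


def flash_attn_available():
    """Return True if a top-level module named flash_attn can be located.

    The module is not imported, so a wheel built for the wrong CUDA or
    PyTorch version would still be reported as available.
    """
    return importlib.util.find_spec("flash_attn") is not None


if __name__ == "__main__":
    if not flash_attn_available():
        print("flash_attn not found; generated audio may be unusable on GPU")
```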
Quick Start¶
Run a single sample for basic TTS by using the batch command below without --num-prompts, which defaults to 1.
Run batch samples for basic TTS:
python3 -u end2end.py \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type tts_sft \
--num-prompts {batch_size}
When batching, if the total number of tokens passed to the next stage exceeds max_model_len in the mimo_audio.yaml configuration file, raise max_model_len and set max_position_embeddings in MiMo-Audio-7B-Instruct/config.json to the same value.
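The config.json half of that change can be scripted. The sketch below is a hypothetical helper that assumes max_position_embeddings is a top-level key in config.json; it leaves mimo_audio.yaml alone, since the layout of that file is not shown here, so max_model_len still has to be edited there by hand.

```python
import json
from pathlib import Path


def sync_max_position_embeddings(config_path, max_model_len):
    """Set max_position_embeddings in a config.json and return the old value.

    Assumes the key lives at the top level of the JSON object (an assumption,
    not confirmed by this document). All other keys are written back unchanged.
    """
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    previous = config.get("max_position_embeddings")
    config["max_position_embeddings"] = int(max_model_len)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous
```

Pass the same number you set for max_model_len, and keep a backup of config.json, because the file is rewritten in place.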
Generated audio files are saved to output_audio/ by default. --num-prompts can also be used with all tasks below.
Task Usage¶
tts_sft (Basic Text-to-Speech)¶
Generate speech from text input:
python3 -u end2end.py \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type tts_sft \
--text "The weather is so nice today."
tts_sft_with_instruct (TTS with Voice Instructions)¶
Generate speech with explicit voice style instructions:
python3 -u end2end.py \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type tts_sft_with_instruct \
--text "The weather is so nice today." \
--instruct "Speak happily in a child's voice"
tts_sft_with_audio (TTS with Audio Reference)¶
Generate speech using an audio reference for voice cloning:
python3 -u end2end.py \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type tts_sft_with_audio \
--text "The weather is so nice today." \
--audio-path "./spoken_dialogue_assistant_turn_1.wav"
tts_sft_with_natural_instruction (Natural Language TTS)¶
Generate speech from text containing natural voice descriptions:
python3 -u end2end.py \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type tts_sft_with_natural_instruction \
--text "In a panting young male voice, he said: I can't run anymore, wait for me!"
audio_trancribing_sft (Speech-to-Text)¶
Transcribe audio to text:
python3 -u end2end.py \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type audio_trancribing_sft \
--audio-path "./spoken_dialogue_assistant_turn_1.wav"
audio_understanding_sft (Audio Understanding)¶
Understand and analyze audio content with text queries:
python3 -u end2end.py \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type audio_understanding_sft \
--text "Summarize the audio." \
--audio-path "./spoken_dialogue_assistant_turn_1.wav"
audio_understanding_sft_with_thinking (Audio Understanding with Reasoning)¶
Run audio understanding with a reasoning chain:
python3 -u end2end.py \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type audio_understanding_sft_with_thinking \
--text "Summarize the audio." \
--audio-path "./spoken_dialogue_assistant_turn_1.wav"
spoken_dialogue_sft_multiturn (Multi-turn Spoken Dialogue)¶
Multi-turn dialogue with audio input and output:
python3 -u end2end.py \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type spoken_dialogue_sft_multiturn \
--audio-path "./prompt_speech_zh_m.wav"
Note: This task uses hardcoded audio files in the script. The audio files used in examples are available at: https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples
speech2text_dialogue_sft_multiturn (Speech-to-Text Dialogue)¶
Multi-turn dialogue converting speech to text:
python3 -u end2end.py \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type speech2text_dialogue_sft_multiturn
Note: This task uses hardcoded audio files and message lists in the script.
text_dialogue_sft_multiturn (Text Dialogue)¶
Multi-turn text-only dialogue:
python3 -u end2end.py \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type text_dialogue_sft_multiturn
Note: This task uses hardcoded message lists in the script.
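The commands in this section differ only in --query-type and in which input flags accompany it. The sketch below collects the flags shown in the examples above into a table and builds the matching argument list. `TASK_FLAGS` and `build_command` are hypothetical helpers, not part of end2end.py, and the table reflects only the examples in this document; whether the script requires or ignores a given flag is not verified here.

```python
import shlex

MODEL = "XiaomiMiMo/MiMo-Audio-7B-Instruct"

# Input flags used by each task's example in this document
# (besides --model-name and --query-type).
TASK_FLAGS = {
    "tts_sft": ("--text",),
    "tts_sft_with_instruct": ("--text", "--instruct"),
    "tts_sft_with_audio": ("--text", "--audio-path"),
    "tts_sft_with_natural_instruction": ("--text",),
    "audio_trancribing_sft": ("--audio-path",),  # upstream spelling, kept as-is
    "audio_understanding_sft": ("--text", "--audio-path"),
    "audio_understanding_sft_with_thinking": ("--text", "--audio-path"),
    "spoken_dialogue_sft_multiturn": ("--audio-path",),
    "speech2text_dialogue_sft_multiturn": (),
    "text_dialogue_sft_multiturn": (),
}


def build_command(query_type, **values):
    """Build the argv list for one task, e.g. build_command("tts_sft", text="Hi")."""
    if query_type not in TASK_FLAGS:
        raise ValueError(f"unknown query type: {query_type}")
    allowed = {flag.lstrip("-").replace("-", "_"): flag for flag in TASK_FLAGS[query_type]}
    unexpected = set(values) - set(allowed)
    if unexpected:
        raise ValueError(f"{query_type} examples do not use: {sorted(unexpected)}")
    cmd = ["python3", "-u", "end2end.py", "--model-name", MODEL, "--query-type", query_type]
    # Emit flags in the order they appear in the table; omitted values are skipped.
    for key, flag in allowed.items():
        if key in values:
            cmd += [flag, str(values[key])]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_command("tts_sft", text="The weather is so nice today.")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with prompts that contain apostrophes.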
Troubleshooting¶
Tokenizer path¶
MIMO_AUDIO_TOKENIZER_PATH not set or model fails to find tokenizer
Export the tokenizer path before running. See Environment Variables in Setup.
Other¶
- If the model or stage config fails to load, check stage configuration documentation for memory and GPU settings.
- For errors when reading or writing WAV files (e.g. unsupported format), ensure input files are standard WAV/MP3 and that soundfile is linked to a working libsndfile.
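To rule out a malformed input file before suspecting libsndfile, you can inspect it with Python's standard `wave` module. This is a rough sketch with a hypothetical helper name. Note that `wave` reads only uncompressed PCM WAV and is stricter than libsndfile, so a failure here does not prove that soundfile would also fail, and MP3 inputs cannot be checked this way.

```python
import wave


def describe_wav(path):
    """Open a PCM WAV file and return its basic properties.

    Raises wave.Error when the file is not a WAV that the standard
    library can parse (for example, a compressed or non-WAV file).
    """
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate if rate else 0.0,
        }
```

If this succeeds on the input but end2end.py still fails to read it, the soundfile installation is the more likely cause.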
Notes¶
- The script uses default model paths and audio files embedded in end2end.py. Update them if your local cache path differs.
- Use --output-dir to change the output folder (default: ./output_audio).
- Use --num-prompts to generate multiple prompts in one run (default: 1).
- Audio files used in multi-turn dialogue examples are available at: https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples
- The script supports various configuration options for initialization timeouts, batch timeouts, and shared memory thresholds. See --help for details.
Example materials¶
end2end.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/mimo_audio/end2end.py.
message_base64_wav.json
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/mimo_audio/message_base64_wav.json.
message_convert.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/mimo_audio/message_convert.py.
process_speechdata.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/mimo_audio/process_speechdata.py.