MiMo-Audio Offline Inference¶

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/mimo_audio.

This directory contains an offline demo for running MiMo-Audio models with vLLM Omni. It builds task-specific inputs and generates WAV files or text outputs locally.

Model Overview¶

MiMo-Audio provides multiple task variants for audio understanding and generation:

tts_sft: Basic text-to-speech generation from text input.
tts_sft_with_instruct: TTS generation with explicit voice style instructions.
tts_sft_with_audio: TTS generation with audio reference for voice cloning.
tts_sft_with_natural_instruction: TTS generation from natural language descriptions embedded in text.
audio_trancribing_sft: Transcribe audio to text (speech-to-text). Note: the upstream task name is spelled "trancribing"; pass it exactly as written.
audio_understanding_sft: Understand and analyze audio content with text queries.
audio_understanding_sft_with_thinking: Audio understanding with reasoning chain.
spoken_dialogue_sft_multiturn: Multi-turn spoken dialogue with audio input/output.
speech2text_dialogue_sft_multiturn: Multi-turn dialogue converting speech to text.
text_dialogue_sft_multiturn: Multi-turn text-only dialogue.

Setup¶

Please refer to the stage configuration documentation to configure memory allocation appropriately for your hardware setup.

Environment Variables¶

Set the MIMO_AUDIO_TOKENIZER_PATH environment variable before running any command. It is mandatory because the audio tokenizer is loaded separately from the main model:

```
export MIMO_AUDIO_TOKENIZER_PATH="XiaomiMiMo/MiMo-Audio-Tokenizer"
```
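If you launch the demo from your own Python wrapper, you can fail fast when the variable is missing. The following is a minimal sketch; `require_tokenizer_path` is a hypothetical helper written for illustration and is not part of vLLM Omni or end2end.py.

```python
import os


def require_tokenizer_path(env=None):
    """Return MIMO_AUDIO_TOKENIZER_PATH or raise with a helpful message.

    Hypothetical helper for illustration; end2end.py may handle a missing
    variable differently.
    """
    env = os.environ if env is None else env
    value = env.get("MIMO_AUDIO_TOKENIZER_PATH", "").strip()
    if not value:
        raise RuntimeError(
            "MIMO_AUDIO_TOKENIZER_PATH is not set; export it before running end2end.py"
        )
    return value
```

Calling `require_tokenizer_path()` with no arguments checks the current process environment.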

Flash Attention (audio generation)¶

For audio generation (e.g. TTS variants, multi-turn spoken dialogue with audio output), install the flash-attn package with a build that matches your CUDA and PyTorch versions. On GPU, omitting flash-attn can cause generated audio to be noise-only or otherwise unusable. See the FlashAttention project for installation options and prebuilt wheels.
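A quick way to confirm that the package is installed is to look up its Python module before starting a run. This is a sketch under the assumption that flash-attn exposes the top-level module `flash_attn`; `module_available` is a hypothetical helper. It only checks that the module can be found, not that the build matches your CUDA and PyTorch versions.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be located (it is not imported)."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Assumption: the flash-attn package installs a module named `flash_attn`.
    if module_available("flash_attn"):
        print("flash_attn found")
    else:
        print("flash_attn missing: generated audio may be unusable on GPU")
```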

Quick Start¶

Run a single sample for basic TTS:

```
python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type tts_sft
```

Run batch samples for basic TTS:

```
python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type tts_sft \
  --num-prompts {batch_size}
```

When running multiple prompts in one batch, the total number of tokens passed to the next stage can exceed the max_model_len value in the mimo_audio.yaml configuration file. If you raise max_model_len to accommodate this, you must also update max_position_embeddings in MiMo-Audio-7B-Instruct/config.json to the same value.
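The consistency check between the two files can be sketched as follows. This is illustrative only: the internal layout of mimo_audio.yaml is an assumption, so the sketch scans the text for `max_model_len: <number>` entries with a regular expression instead of parsing the YAML structure. Both function names are hypothetical.

```python
import json
import re


def find_max_model_lens(yaml_text):
    """Collect every `max_model_len: <int>` value found in the YAML text.

    A regex is used so that no YAML library is needed; the real file layout
    is not assumed beyond the key name.
    """
    return [int(n) for n in re.findall(r"\bmax_model_len\s*:\s*(\d+)", yaml_text)]


def lens_exceeding_limit(yaml_text, config_json_text):
    """Return the max_model_len values larger than max_position_embeddings."""
    limit = json.loads(config_json_text)["max_position_embeddings"]
    return [n for n in find_max_model_lens(yaml_text) if n > limit]
```

Pass the contents of mimo_audio.yaml and MiMo-Audio-7B-Instruct/config.json as strings. A non-empty result lists the max_model_len values for which config.json still needs updating.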

Generated audio files are saved to output_audio/ by default. The --num-prompts option can also be used with all of the tasks below.
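If you want to drive several runs from Python, the command line can be assembled programmatically. The sketch below uses only flags that appear in this document; `build_command` and `run` are hypothetical helpers, and `sys.executable` stands in for `python3`.

```python
import shlex
import subprocess
import sys


def build_command(query_type, model_name="XiaomiMiMo/MiMo-Audio-7B-Instruct", **options):
    """Build the argv list for end2end.py.

    Keyword options become flags, e.g. num_prompts=4 -> --num-prompts 4.
    Options set to None are skipped.
    """
    cmd = [sys.executable, "-u", "end2end.py",
           "--model-name", model_name,
           "--query-type", query_type]
    for key, value in options.items():
        if value is None:
            continue
        cmd += ["--" + key.replace("_", "-"), str(value)]
    return cmd


def run(query_type, **options):
    """Launch end2end.py and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(query_type, **options), check=True)


if __name__ == "__main__":
    # Dry run: print the command instead of loading the model.
    print(shlex.join(build_command("tts_sft", num_prompts=4)))
```

Call `run("tts_sft", num_prompts=4)` from the example directory to execute the command for real.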

Task Usage¶

tts_sft (Basic Text-to-Speech)¶

Generate speech from text input:

```
python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type tts_sft \
  --text "The weather is so nice today."
```

tts_sft_with_instruct (TTS with Voice Instructions)¶

Generate speech with explicit voice style instructions:

```
python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type tts_sft_with_instruct \
  --text "The weather is so nice today." \
  --instruct "Speak happily in a child's voice"
```

tts_sft_with_audio (TTS with Audio Reference)¶

Generate speech using an audio reference for voice cloning:

```
python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type tts_sft_with_audio \
  --text "The weather is so nice today." \
  --audio-path "./spoken_dialogue_assistant_turn_1.wav"
```
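Before passing a reference recording, it can help to confirm that the file is a readable WAV and to check its basic properties. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV only and raises `wave.Error` for other encodings; `wav_info` is a hypothetical helper, and the demo itself may accept formats this check rejects.

```python
import wave


def wav_info(path):
    """Return basic properties of a PCM WAV file using only the standard library."""
    with wave.open(str(path), "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": frames / rate,
        }
```

For example, `wav_info("./spoken_dialogue_assistant_turn_1.wav")` returns the channel count, sample rate, sample width, and duration of the reference file.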

tts_sft_with_natural_instruction (Natural Language TTS)¶

Generate speech from text containing natural voice descriptions:

```
python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type tts_sft_with_natural_instruction \
  --text "In a panting young male voice, he said: I can't run anymore, wait for me!"
```

audio_trancribing_sft (Speech-to-Text)¶

Transcribe audio to text:

```
python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type audio_trancribing_sft \
  --audio-path "./spoken_dialogue_assistant_turn_1.wav"
```

audio_understanding_sft (Audio Understanding)¶

Understand and analyze audio content with text queries:

```
python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type audio_understanding_sft \
  --text "Summarize the audio." \
  --audio-path "./spoken_dialogue_assistant_turn_1.wav"
```

audio_understanding_sft_with_thinking (Audio Understanding with Reasoning)¶

Audio understanding with reasoning chain:

```
python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type audio_understanding_sft_with_thinking \
  --text "Summarize the audio." \
  --audio-path "./spoken_dialogue_assistant_turn_1.wav"
```

spoken_dialogue_sft_multiturn (Multi-turn Spoken Dialogue)¶

Multi-turn dialogue with audio input and output:

```
python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type spoken_dialogue_sft_multiturn \
  --audio-path "./prompt_speech_zh_m.wav"
```

Note: This task uses hardcoded audio files in the script. The audio files used in examples are available at: https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples

speech2text_dialogue_sft_multiturn (Speech-to-Text Dialogue)¶

Multi-turn dialogue converting speech to text:

```
python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type speech2text_dialogue_sft_multiturn
```

Note: This task uses hardcoded audio files and message lists in the script.

text_dialogue_sft_multiturn (Text Dialogue)¶

Multi-turn text-only dialogue:

```
python3 -u end2end.py \
  --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
  --query-type text_dialogue_sft_multiturn
```

Note: This task uses hardcoded message lists in the script.

Troubleshooting¶

Tokenizer path¶

MIMO_AUDIO_TOKENIZER_PATH not set or model fails to find tokenizer
Export the tokenizer path before running:
```
export MIMO_AUDIO_TOKENIZER_PATH="XiaomiMiMo/MiMo-Audio-Tokenizer"
```
See Environment Variables in Setup.

Other¶

If the model or stage config fails to load, check the stage configuration documentation for memory and GPU settings.
For errors when reading or writing audio files (e.g. unsupported format), ensure that input files are standard WAV or MP3 and that soundfile is linked to a working libsndfile.
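To check the soundfile and libsndfile pairing, you can ask soundfile which library version it loaded and which formats it supports. This is a minimal diagnostic sketch; `libsndfile_report` is a hypothetical helper, and whether MP3 appears in the list depends on the libsndfile build.

```python
def libsndfile_report():
    """Return the libsndfile version and supported formats, or None if soundfile is unusable."""
    try:
        import soundfile as sf
    except (ImportError, OSError):
        # ImportError: soundfile is not installed.
        # OSError: soundfile is installed but libsndfile could not be loaded.
        return None
    return {
        "libsndfile_version": sf.__libsndfile_version__,
        "formats": sf.available_formats(),
    }


if __name__ == "__main__":
    report = libsndfile_report()
    if report is None:
        print("soundfile or libsndfile is not available")
    else:
        print(report["libsndfile_version"])
        print("MP3 supported:", "MP3" in report["formats"])
```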

Notes¶

The script uses default model paths and audio files embedded in end2end.py. Update them if your local cache path differs.
Use --output-dir to change the output folder (default: ./output_audio).
Use --num-prompts to generate multiple prompts in one run (default: 1).
Audio files used in multi-turn dialogue examples are available at: https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples
The script supports various configuration options for initialization timeouts, batch timeouts, and shared memory thresholds. See --help for details.
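After a run, a short script can confirm which files were written. The sketch below assumes that generated audio is saved with a .wav extension directly inside the output folder; the file naming scheme used by end2end.py is not assumed, and `list_wavs` is a hypothetical helper.

```python
from pathlib import Path


def list_wavs(output_dir="output_audio"):
    """Return (file name, size in bytes) for each .wav file directly inside output_dir."""
    folder = Path(output_dir)
    if not folder.is_dir():
        return []
    return sorted((p.name, p.stat().st_size) for p in folder.glob("*.wav"))


if __name__ == "__main__":
    for name, size in list_wavs():
        print(f"{name}\t{size} bytes")
```

Pass the same folder you gave to --output-dir if you changed the default.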

Example materials¶

end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/mimo_audio/end2end.py.

message_base64_wav.json

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/mimo_audio/message_base64_wav.json.

message_convert.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/mimo_audio/message_convert.py.

process_speechdata.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/mimo_audio/process_speechdata.py.