Online Serving Example of vLLM-Omni for MiMo-Audio

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/mimo_audio

🛠️ Installation

Please refer to README.md

⚠️ Important (audio generation)
For audio generation (TTS, responses that include synthesized audio, etc.), install flash-attn for your CUDA and PyTorch stack. Without it on GPU, output audio may be noise-only or unusable. See the FlashAttention repository for compatible builds.
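
As a quick sanity check before launching the server, the snippet below tests whether the flash_attn package is importable in the current environment. It is a minimal sketch: it only confirms that the package can be located, not that the installed build matches your CUDA and PyTorch versions. The helper name flash_attn_available is illustrative and not part of vLLM-Omni.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the import system can locate the flash_attn package."""
    # find_spec looks the package up without importing it, so this is cheap
    # and does not initialize CUDA.
    return importlib.util.find_spec("flash_attn") is not None


if __name__ == "__main__":
    if flash_attn_available():
        print("flash_attn found")
    else:
        print("flash_attn not found; generated audio may be unusable on GPU")
```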

Run examples (MiMo-Audio)

Launch the Server

export MIMO_AUDIO_TOKENIZER_PATH="XiaomiMiMo/MiMo-Audio-Tokenizer"

vllm serve XiaomiMiMo/MiMo-Audio-7B-Instruct --omni \
    --served-model-name "MiMo-Audio-7B-Instruct" \
    --port 18091 \
    --chat-template ./examples/online_serving/mimo_audio/chat_template.jinja

⚠️ Important
MiMo-Audio is not compatible with the default chat template.
The provided chat_template.jinja implements MiMo-specific role, audio token, and instruction formatting and must be used for all inference.
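
Once the server is running, it can help to confirm that it is reachable before sending audio requests. The sketch below assumes the server exposes the OpenAI-compatible /v1/models endpoint on the port used above; that endpoint is standard for vLLM's OpenAI-compatible server but is not shown in this example, so treat it as an assumption. The helper names are illustrative.

```python
import json
import urllib.error
import urllib.request


def models_url(host: str = "localhost", port: int = 18091) -> str:
    """Build the model-listing URL (assumed OpenAI-compatible route)."""
    return f"http://{host}:{port}/v1/models"


def list_served_models(url: str, timeout: float = 5.0) -> list[str]:
    """Return the model ids reported by the server, or [] if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        return []
    return [entry["id"] for entry in body.get("data", [])]


if __name__ == "__main__":
    ids = list_served_models(models_url())
    print(ids if ids else "server not reachable yet")
```

If the server is up, the returned list would be expected to contain the name passed to --served-model-name.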

Send Multi-modal Request

Change into the example folder

cd examples/online_serving/mimo_audio

Send a request via Python

# Audio dialogue task
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type multi_audios \
--message-json ../../offline_inference/mimo_audio/message_base64_wav.json

The Python client supports the following command-line arguments:

  • --query-type (or -q): Query type (default: multi_audios)
      • Options: multi_audios, text
  • --message-json (or -m): Path to a JSON file of base64-encoded multi-round audio messages
      • Do not pass this argument for the "text" query type
      • Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs, only for the "Are these two audio clips the same?" task
      • Example: --message-json ./examples/offline_inference/mimo_audio/message_base64_wav.json
  • --prompt (or -p): Custom text prompt/question, only for the "text" query type (TTS task)
      • Do not pass this argument for the "multi_audios" query type
      • Example: --prompt "What are the main activities shown in this video?"
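
The argument rules above can be sketched with argparse. This is not the client's actual source (which is omitted from the rendered docs); it is a hypothetical reconstruction of the documented interface, and the check_args helper that enforces the "do not pass" rules is an assumption about how such validation might look.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="MiMo-Audio client arguments")
    parser.add_argument("--query-type", "-q", choices=["multi_audios", "text"],
                        default="multi_audios")
    parser.add_argument("--message-json", "-m", default=None,
                        help="JSON file of multi-round audio messages")
    parser.add_argument("--prompt", "-p", default=None,
                        help="Text prompt, only for the text (TTS) query type")
    return parser


def check_args(args: argparse.Namespace) -> Optional[str]:
    """Return an error message if the flags conflict, else None."""
    if args.query_type == "text" and args.message_json is not None:
        return "--message-json must not be used with --query-type text"
    if args.query_type == "multi_audios" and args.prompt is not None:
        return "--prompt must not be used with --query-type multi_audios"
    return None


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    problem = check_args(parsed)
    print(problem or f"ok: {parsed.query_type}")
```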

For example, to send multi-round audio messages from a local file:

python openai_chat_completion_client_for_multimodal_generation.py \
--query-type multi_audios \
--message-json ../../offline_inference/mimo_audio/message_base64_wav.json
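
For readers who want to see what a request looks like without the client script, the sketch below builds a plain text chat-completion request against the server launched earlier. It assumes the standard OpenAI-compatible /v1/chat/completions route and uses the served model name and port from the launch command. The exact payload the bundled client sends, and the fields in which synthesized audio is returned, are not shown in this document, so this only illustrates the generic request shape.

```python
import json
import urllib.request

# Values taken from the launch command; the route is an assumption.
BASE_URL = "http://localhost:18091/v1/chat/completions"
MODEL = "MiMo-Audio-7B-Instruct"


def build_text_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completion request with one user turn."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_text_request("Hello")
    # Sending requires a running server; the response layout is not assumed here.
    with urllib.request.urlopen(request, timeout=60) as resp:
        print(resp.status)
```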

Example materials

chat_template.jinja
{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {%- set _m = '<|sosp|><|empty|><|eosp|>' -%}
        {%- set _raw0 = messages[0]['content'] if messages[0]['content'] is string else '' -%}
        {%- if _m in _raw0 %}
            {%- set _t0 = (_raw0 | replace(_m ~ '\n', '') | replace(_m, '') | trim) -%}
            {{- '<|im_start|>system\n' + (_t0 ~ _m if _t0 else _m) + '<|im_end|>\n' }}
        {%- else %}
            {{- '<|im_start|>system\n' + _raw0 + '<|im_end|>\n' }}
        {%- endif %}
    {%- else %}
        {{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}

{%- for message in messages %}
    {%- if message['role'] == 'assistant' %}
        {{- '<|im_start|>assistant' }}
        {%- set _sosp = '<|sosp|><|empty|><|eosp|>' -%}
        {%- set _text = message['content'] if message['content'] is string else '' -%}
        {%- if _sosp in _text %}
            {%- set _clean = _text | replace(_sosp, '') -%}
            {%- set _body = _clean[1:] if (_clean and _clean[0] == '\n') else _clean -%}
            {{- '\n<|sostm|>' + _body + '<|eot|><|empty|><|eostm|>' }}
        {%- else %}
            {%- set _body = _text[1:] if (_text and _text[0] == '\n') else _text -%}
            {{- '\n<|sostm|>' + _body + '<|eot|><|eostm|>' }}
        {%- endif %}
        {{- '<|im_end|>\n' }}

    {%- elif message['role'] == 'user' %}
        {%- set _m = '<|sosp|><|empty|><|eosp|>' -%}
        {%- set _raw = message['content'] if message['content'] is string else '' -%}
        {%- if _m in _raw %}
            {%- set _t = (_raw | replace(_m ~ '\n', '') | replace(_m, '') | trim) -%}
            {{- '<|im_start|>user\n' + (_t ~ _m if _t else _m) + '<|im_end|>\n' }}
        {%- else %}
            {{- '<|im_start|>user\n' + _raw + '<|im_end|>\n' }}
        {%- endif %}

    {%- elif message['role'] == 'system' %}
        {%- if not loop.first %}
            {%- set _m = '<|sosp|><|empty|><|eosp|>' -%}
            {%- set _raw = message['content'] if message['content'] is string else '' -%}
            {%- if _m in _raw %}
                {%- set _t = (_raw | replace(_m ~ '\n', '') | replace(_m, '') | trim) -%}
                {{- '<|im_start|>system\n' + (_t ~ _m if _t else _m) + '<|im_end|>\n' }}
            {%- else %}
                {{- '<|im_start|>system\n' + _raw + '<|im_end|>\n' }}
            {%- endif %}
        {%- endif %}

    {%- elif message['role'] == 'tool' %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1]['role'] != 'tool') %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {%- set _m = '<|sosp|><|empty|><|eosp|>' -%}
        {%- set _raw = message['content'] if message['content'] is string else '' -%}
        {%- if _m in _raw %}
            {%- set _t = (_raw | replace(_m ~ '\n', '') | replace(_m, '') | trim) -%}
            {{- '\n<tool_response>\n' + (_t ~ _m if _t else _m) + '\n</tool_response>' }}
        {%- else %}
            {{- '\n<tool_response>\n' + _raw + '\n</tool_response>' }}
        {%- endif %}
        {%- if loop.last or (messages[loop.index0 + 1]['role'] != 'tool') %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}

{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n<|sostm|>' }}
{%- endif %}
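
The user and assistant branches of the template above can be mirrored in plain Python to make the marker handling easier to follow. This is an illustrative re-implementation of two branches only (string content, no tools), not code from the repository; the function names are hypothetical. It shows that a `<|sosp|><|empty|><|eosp|>` marker in a user turn is moved to the end of the text, and that an assistant turn is wrapped in `<|sostm|>` ... `<|eostm|>` with an extra `<|empty|>` when the marker was present.

```python
MARKER = "<|sosp|><|empty|><|eosp|>"


def render_user(raw: str) -> str:
    """Mirror the template's user branch for string content."""
    if MARKER in raw:
        # Drop the marker (and a newline directly after it), trim, then
        # re-append a single marker after the remaining text.
        text = raw.replace(MARKER + "\n", "").replace(MARKER, "").strip()
        body = text + MARKER if text else MARKER
    else:
        body = raw
    return "<|im_start|>user\n" + body + "<|im_end|>\n"


def render_assistant(text: str) -> str:
    """Mirror the template's assistant branch for string content."""
    if MARKER in text:
        clean = text.replace(MARKER, "")
        tail = "<|eot|><|empty|><|eostm|>"
    else:
        clean = text
        tail = "<|eot|><|eostm|>"
    # A single leading newline is removed, as in the template.
    body = clean[1:] if clean.startswith("\n") else clean
    return "<|im_start|>assistant\n<|sostm|>" + body + tail + "<|im_end|>\n"


if __name__ == "__main__":
    print(render_user(MARKER + "\nDescribe this clip."))
    print(render_assistant("\nSure."))
```

Note that the template treats non-string message content as an empty string; the sketch accepts strings only.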
openai_chat_completion_client_for_multimodal_generation.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/mimo_audio/openai_chat_completion_client_for_multimodal_generation.py.