Ming-flash-omni 2.0¶
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/ming_flash_omni.
Ming-flash-omni-2.0 is an omni-modal model supporting text, image, video, and audio understanding, with text and speech outputs.
vLLM-Omni supports three deployment modes:
| Mode | Deploy config | Output |
|---|---|---|
| Thinker + Talker (omni-speech, default) | vllm_omni/deploy/ming_flash_omni.yaml | Text + Audio |
| Thinker only (multimodal understanding) | vllm_omni/deploy/ming_flash_omni_thinker_only.yaml | Text |
| Thinker + Imagegen (text-to-image / img2img) | vllm_omni/deploy/ming_flash_omni_image.yaml | Image (online-serving only at the moment) |
For standalone TTS (talker only), see the Ming-flash-omni-TTS section in the Text-To-Speech hub.
Setup¶
Please refer to the stage configuration documentation to configure memory allocation appropriately for your hardware setup.
When no --deploy-config is passed, the model registry auto-loads the full thinker+talker vllm_omni/deploy/ming_flash_omni.yaml (See Omni-Speech).
For text-only output without spinning up the talker, pass:
Run examples¶
The end-to-end script defaults to built-in assets; pass --image-path, --audio-path, or --video-path to override.
Multi-Modality Understanding (Standalone Thinker)¶
Here we pass thinker-only deploy yaml:
python examples/offline_inference/ming_flash_omni/end2end.py --deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml --query-type text
python examples/offline_inference/ming_flash_omni/end2end.py --deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml --query-type use_image
python examples/offline_inference/ming_flash_omni/end2end.py --deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml --query-type use_audio
python examples/offline_inference/ming_flash_omni/end2end.py --deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml --query-type use_video --num-frames 16
Reasoning (Thinking Mode)¶
Reasoning ("detailed thinking on") is applied by the script when --query-type reasoning is set. The default prompt matches Ming's cookbook and expects the reference figure from the upstream repo — see get_reasoning_query in end2end.py.
python examples/offline_inference/ming_flash_omni/end2end.py \
--deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml \
--query-type reasoning \
--image-path ./3_0.png
Omni-Speech (Thinker + Talker)¶
The default deploy YAML already runs thinker+talker, so spoken output only requires requesting audio (or text,audio) modalities. The thinker processes your multimodal input, generates text, then the talker synthesises the response as speech.
Audio-only output (speech response, no text):
python examples/offline_inference/ming_flash_omni/end2end.py \
--query-type text \
--modalities audio \
--output-dir output_ming_omni_speech
Both text and audio output:
python examples/offline_inference/ming_flash_omni/end2end.py \
--query-type use_audio \
--modalities text,audio \
--output-dir output_ming_omni_speech
Generated .wav files are saved to --output-dir (default output_ming), one per request.
The default deploy YAML allocates thinker on GPUs 0–3 and talker on GPU 3 for a common device topology (4 rather than 5 devices that talker on its own device). Adjust devices in a copied YAML and pass it via --deploy-config to match your hardware or requirements.
Modality control¶
--modalities | Thinker output | Talker | Saved files |
|---|---|---|---|
text (default) | Text | Not run | <id>.txt |
audio | Text (internal) | Runs | <id>.wav |
text,audio | Text | Runs | <id>.txt + <id>.wav |
Pass --deploy-config /path/to/your_deploy.yaml to any of the commands above to override the bundled deploy config.
Image generation (text-to-image / img2img)¶
The diffusion-stage image-generation path is currently only wired through the online OpenAI-compatible chat endpoint (/v1/chat/completions with "modalities": ["image"]). For payloads, optional knobs (steps/cfg/seed/byte5_text/negative_prompt), and the img2img reference-image flow, see the image-gen section in the online-serving README. end2end.py does not yet exercise the imagegen stage; that is tracked as follow-up work.
Online serving¶
For online serving via the OpenAI-compatible API, see examples/online_serving/ming_flash_omni/README.md.
Example materials¶
end2end.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/ming_flash_omni/end2end.py.