Ming-flash-omni 2.0¶

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/ming_flash_omni.

Ming-flash-omni-2.0 is an omni-modal model supporting text, image, video, and audio understanding, with text and speech outputs.

vLLM-Omni supports three deployment modes:

Mode	Deploy config	Output
Thinker + Talker (omni-speech, default)	`vllm_omni/deploy/ming_flash_omni.yaml`	Text + Audio
Thinker only (multimodal understanding)	`vllm_omni/deploy/ming_flash_omni_thinker_only.yaml`	Text
Thinker + Imagegen (text-to-image / img2img)	`vllm_omni/deploy/ming_flash_omni_image.yaml`	Image (online-serving only at the moment)

For standalone TTS (talker only), see the Ming-flash-omni-TTS section in the Text-To-Speech hub.

Setup¶

Please refer to the stage configuration documentation to configure memory allocation appropriately for your hardware setup.

When no --deploy-config is passed, the model registry auto-loads the full thinker+talker vllm_omni/deploy/ming_flash_omni.yaml (See Omni-Speech).

For text-only output without spinning up the talker, pass:

--deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml

Run examples¶

The end-to-end script defaults to built-in assets; pass --image-path, --audio-path, or --video-path to override.

Multi-Modality Understanding (Standalone Thinker)¶

Here we pass thinker-only deploy yaml:

python examples/offline_inference/ming_flash_omni/end2end.py --deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml --query-type text
python examples/offline_inference/ming_flash_omni/end2end.py --deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml --query-type use_image
python examples/offline_inference/ming_flash_omni/end2end.py --deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml --query-type use_audio
python examples/offline_inference/ming_flash_omni/end2end.py --deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml --query-type use_video --num-frames 16

Reasoning (Thinking Mode)¶

Reasoning ("detailed thinking on") is applied by the script when --query-type reasoning is set. The default prompt matches Ming's cookbook and expects the reference figure from the upstream repo — see get_reasoning_query in end2end.py.

python examples/offline_inference/ming_flash_omni/end2end.py \
    --deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml \
    --query-type reasoning \
    --image-path ./3_0.png

Omni-Speech (Thinker + Talker)¶

The default deploy YAML already runs thinker+talker, so spoken output only requires requesting audio (or text,audio) modalities. The thinker processes your multimodal input, generates text, then the talker synthesises the response as speech.

Audio-only output (speech response, no text):

python examples/offline_inference/ming_flash_omni/end2end.py \
    --query-type text \
    --modalities audio \
    --output-dir output_ming_omni_speech

Both text and audio output:

python examples/offline_inference/ming_flash_omni/end2end.py \
    --query-type use_audio \
    --modalities text,audio \
    --output-dir output_ming_omni_speech

Generated .wav files are saved to --output-dir (default output_ming), one per request.

The default deploy YAML allocates thinker on GPUs 0–3 and talker on GPU 3 for a common device topology (4 rather than 5 devices that talker on its own device). Adjust devices in a copied YAML and pass it via --deploy-config to match your hardware or requirements.

Modality control¶

`--modalities`	Thinker output	Talker	Saved files
`text` (default)	Text	Not run	`<id>.txt`
`audio`	Text (internal)	Runs	`<id>.wav`
`text,audio`	Text	Runs	`<id>.txt` + `<id>.wav`

Pass --deploy-config /path/to/your_deploy.yaml to any of the commands above to override the bundled deploy config.

Image generation (text-to-image / img2img)¶

The diffusion-stage image-generation path is currently only wired through the online OpenAI-compatible chat endpoint (/v1/chat/completions with "modalities": ["image"]). For payloads, optional knobs (steps/cfg/seed/byte5_text/negative_prompt), and the img2img reference-image flow, see the image-gen section in the online-serving README. end2end.py does not yet exercise the imagegen stage; that is tracked as follow-up work.

Online serving¶

For online serving via the OpenAI-compatible API, see examples/online_serving/ming_flash_omni/README.md.

Example materials¶

end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/ming_flash_omni/end2end.py.