Ming-flash-omni 2.0¶
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/ming_flash_omni.
Installation¶
Please refer to README.md
Deployment modes¶
| Mode | Launch command | Output |
|---|---|---|
| Thinker + Talker (omni-speech, default) | vllm serve ... --omni | Text + Audio |
| Thinker only (multimodal understanding) | vllm serve ... --omni --deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml | Text |
| Thinker + Imagegen (text-to-image / img2img) | vllm serve ... --omni --deploy-config vllm_omni/deploy/ming_flash_omni_image.yaml | Image |
For standalone TTS (talker only), see the Ming-flash-omni-TTS section in the Text-To-Speech hub.
Run examples (Ming-flash-omni 2.0)¶
Launch the Server¶
Thinker + Talker (omni-speech, text + audio output):
The model registry auto-loads corresponding deploy yaml.
Thinker-only (text output):
vllm serve Jonathan1909/Ming-flash-omni-2.0 --omni --port 8091 \
--deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml
Pass --deploy-config /path/to/your_deploy.yaml to use a custom deploy config.
Send Multi-modal Request¶
Shared Python client (supports text | use_image | use_audio | use_video | use_mixed_modalities; pass --image-path / --audio-path / --video-path for local files or URLs, --modalities text for output, --help for the full flag list):
python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
--model Jonathan1909/Ming-flash-omni-2.0 \
--query-type use_mixed_modalities \
--port 8091 --host localhost \
--modalities text
Parameterized curl wrapper in this directory:
bash run_curl_multimodal_generation.sh text
bash run_curl_multimodal_generation.sh use_image
bash run_curl_multimodal_generation.sh use_audio
bash run_curl_multimodal_generation.sh use_video
bash run_curl_multimodal_generation.sh use_mixed_modalities
bash run_curl_multimodal_generation.sh use_image_gen
Image generation (text-to-image)¶
Ming-flash-omni-2.0 also exposes an image-generation (diffusion) stage. Launch with the image deploy YAML, which adds an image-gen stage behind the thinker:
vllm serve Jonathan1909/Ming-flash-omni-2.0 --omni \
--deploy-config vllm_omni/deploy/ming_flash_omni_image.yaml \
--stage-init-timeout 1800 \
--init-timeout 1800 \
--port 8091
Then request image output by passing "modalities": ["image"]:
curl -s http://127.0.0.1:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Jonathan1909/Ming-flash-omni-2.0",
"messages": [{"role": "user", "content": "Please draw a cute cat."}],
"modalities": ["image"]
}' | jq -r '.choices[0].message.content[0].image_url.url | split(",")[1]' | base64 -d > ming_imagegen.png
Optional knobs¶
Pass image-gen overrides via the diffusion-stage sampling_params_list[1].extra_args.image_gen:
| Key | Default | Description |
|---|---|---|
height / width | from config (1024) | Output resolution (multiples of vae_scale_factor * 2, currently 16). |
steps | 30 | Number of FlowMatchEuler denoise steps. |
cfg | 2.0 | Classifier-free guidance scale. |
seed | 42 | Per-request RNG seed (deterministic when ByT5 is also seed-stable). |
byte5_text | (auto) | Override the glyph text for ByT5 enhancement; raw strings are auto-wrapped to Ming's Text "...". format. |
negative_prompt | empty | Real CFG negative conditioning (must be set on stage-0 thinker sampling_params so expand_cfg_prompts spawns the companion). |
Example with all the knobs:
curl http://127.0.0.1:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Jonathan1909/Ming-flash-omni-2.0",
"modalities": ["image"],
"sampling_params_list": [
{"temperature":0.4,"top_p":0.9,"top_k":1,"max_tokens":1,"seed":42,
"extra_args":{"image_gen":{"negative_prompt":"ugly, blurry, distorted"}}},
{"seed":42,"extra_args":{"image_gen":{
"steps":6,"cfg":1.5,"height":512,"width":512,"seed":123,
"byte5_text":["理解与生成统一"]}}}
],
"messages": [{"role": "user", "content": "Draw a poster."}]
}' | jq -r '.choices[0].message.content[0].image_url.url | split(",")[1]' | base64 -d > ming_imagegen_knobs.png
img2img (reference image)¶
Add an image_url content part to the user message; the parser routes it into the diffusion stage as extra[reference_image]:
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Change the background to a sandy beach at sunset."},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<base64>"}}
]
}]
GPU layout¶
The shipped ming_flash_omni_image.yaml allocates the thinker on GPUs 0–3 (TP=4) and the diffusion stage on GPU 4 (TP=1). Copy the YAML and edit devices per stage to relocate; with fewer GPUs available, drop the thinker TP to 2 and run the diffusion stage on a free card. Image-gen warmup takes roughly an extra 30–60 s on top of the thinker — set --stage-init-timeout 1800 if the default 300 s is too tight.
Modality control¶
modalities | Server config | Output |
|---|---|---|
["text"] or omitted | Thinker only | Text |
["audio"] | Thinker + Talker | Audio (speech) |
["text", "audio"] | Thinker + Talker | Text + Audio |
["image"] | Thinker + Imagegen (image deploy YAML) | Image (PNG, base64 in choices[0].message.content) |
For ready-to-copy curl examples (text / audio / multimodal input, SSE streaming, reasoning mode), see the recipe at recipes/inclusionAI/Ming-flash-omni-2.0.md.
OpenAI Python SDK — streaming¶
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Jonathan1909/Ming-flash-omni-2.0",
messages=[
{"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
{"role": "user", "content": "请详细介绍鹦鹉的生活习性。"},
],
modalities=["text"],
stream=True,
)
for chunk in response:
for choice in chunk.choices:
if hasattr(choice, "delta") and choice.delta.content:
print(choice.delta.content, end="", flush=True)
print()
The --stream flag on the Python client script above shows the same pattern driven by the shared multimodal client.
Example materials¶
run_curl_multimodal_generation.sh
#!/usr/bin/env bash
set -euo pipefail
# Server port
PORT="${PORT:-8091}"
# Default query type
QUERY_TYPE="${1:-text}"
# Validate query type
if [[ ! "$QUERY_TYPE" =~ ^(text|use_audio|use_image|use_video|use_mixed_modalities|use_image_gen)$ ]]; then
echo "Error: Invalid query type '$QUERY_TYPE'"
echo "Usage: $0 [text|use_audio|use_image|use_video|use_mixed_modalities|use_image_gen]"
echo " text: Text-only query"
echo " use_audio: Audio + Text query"
echo " use_image: Image + Text query"
echo " use_video: Video + Text query"
echo " use_mixed_modalities: Audio + Image + Video + Text query"
echo " use_image_gen: Text-to-image (diffusion stage) — saves the generated PNG locally"
exit 1
fi
# Define URLs for assets
MARY_HAD_LAMB_AUDIO_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg"
CHERRY_BLOSSOM_IMAGE_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg"
SAMPLE_VIDEO_URL="https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4"
# Build user content based on query type
case "$QUERY_TYPE" in
text)
user_content='[
{
"type": "text",
"text": "请详细介绍鹦鹉的生活习性。"
}
]'
;;
use_image)
user_content='[
{
"type": "image_url",
"image_url": {
"url": "'"$CHERRY_BLOSSOM_IMAGE_URL"'"
}
},
{
"type": "text",
"text": "Describe this image in detail."
}
]'
;;
use_audio)
user_content='[
{
"type": "audio_url",
"audio_url": {
"url": "'"$MARY_HAD_LAMB_AUDIO_URL"'"
}
},
{
"type": "text",
"text": "Please recognize the language of this speech and transcribe it. Format: oral."
}
]'
;;
use_video)
user_content='[
{
"type": "video_url",
"video_url": {
"url": "'"$SAMPLE_VIDEO_URL"'"
}
},
{
"type": "text",
"text": "Describe what is happening in this video."
}
]'
;;
use_mixed_modalities)
user_content='[
{
"type": "image_url",
"image_url": {
"url": "'"$CHERRY_BLOSSOM_IMAGE_URL"'"
}
},
{
"type": "audio_url",
"audio_url": {
"url": "'"$MARY_HAD_LAMB_AUDIO_URL"'"
}
},
{
"type": "text",
"text": "Describe the image, and recognize the language of this speech and transcribe it. Format: oral"
}
]'
;;
esac
echo "Running query type: $QUERY_TYPE"
echo ""
if [[ "$QUERY_TYPE" == "use_image_gen" ]]; then
# Image-gen branch: dual-stage diffusion endpoint, modalities=["image"].
# The shipped ming_flash_omni_image.yaml runs the thinker on cards 0-3
# (TP=4) and the diffusion stage on card 4 (TP=1). See the image-gen
# section in README.md for knobs and img2img.
out_path="${OUT_PATH:-/tmp/ming_imagegen.png}"
request_body=$(cat <<'EOF'
{
"model": "Jonathan1909/Ming-flash-omni-2.0",
"modalities": ["image"],
"messages": [
{"role": "user", "content": "Draw a beautiful girl with short black hair and red dress."}
]
}
EOF
)
output=$(curl -sS --retry 3 --retry-delay 3 --retry-connrefused \
-X POST "http://localhost:${PORT}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d "$request_body")
# Server returns base64-encoded PNG either as a data URL in
# ``choices[0].message.content[0].image_url.url`` or as a raw string.
b64=$(echo "$output" | jq -r '.choices[0].message.content
| if type == "array"
then (map(select(.image_url? != null)) | .[0].image_url.url // "")
elif type == "string" then . else "" end')
if [[ -z "$b64" ]]; then
echo "Error: no image returned. Raw response:"
echo "$output" | jq '.'
exit 1
fi
if [[ "$b64" == data:image* ]]; then b64="${b64#*,}"; fi
printf '%s' "$b64" | base64 --decode > "$out_path"
echo "Saved generated image: $out_path ($(stat -c%s "$out_path" 2>/dev/null || stat -f%z "$out_path") bytes)"
exit 0
fi
request_body=$(cat <<EOF
{
"model": "Jonathan1909/Ming-flash-omni-2.0",
"modalities": ["text"],
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "你是一个友好的AI助手。\n\ndetailed thinking off"
}
]
},
{
"role": "user",
"content": $user_content
}
]
}
EOF
)
output=$(curl -sS --retry 3 --retry-delay 3 --retry-connrefused \
-X POST http://localhost:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d "$request_body")
echo "Output of request: $(echo "$output" | jq '.choices[0].message.content')"