Qwen2.5-Omni¶
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/qwen2_5_omni.
🛠️ Installation¶
Please refer to README.md
Run examples (Qwen2.5-Omni)¶
Launch the Server¶
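Start the server with vllm serve and the --omni flag, as run_gradio_demo.sh in this example does:
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091
If you have a custom stage configs file, pass it with --stage-configs-path:
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --stage-configs-path /path/to/stage_configs_file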
Send Multi-modal Request¶
Get into the example folder
Send request via python¶
python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py --model Qwen/Qwen2.5-Omni-7B --query-type use_mixed_modalities --port 8091 --host "localhost"
The Python client supports the following command-line arguments:
- --query-type (or -q): Query type (default: mixed_modalities). Options: mixed_modalities, use_audio_in_video, multi_audios, text
- --video-path (or -v): Path to local video file or URL. If not provided and query-type uses video, uses default video URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs. Example: --video-path /path/to/video.mp4 or --video-path https://example.com/video.mp4
- --image-path (or -i): Path to local image file or URL. If not provided and query-type uses image, uses default image URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common image formats: JPEG, PNG, GIF, WebP. Example: --image-path /path/to/image.jpg or --image-path https://example.com/image.png
- --audio-path (or -a): Path to local audio file or URL. If not provided and query-type uses audio, uses default audio URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: --audio-path /path/to/audio.wav or --audio-path https://example.com/audio.mp3
- --prompt (or -p): Custom text prompt/question. If not provided, uses default prompt for the selected query type. Example: --prompt "What are the main activities shown in this video?"
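The path handling described above (HTTP/HTTPS URLs passed through, local files encoded to base64) can be sketched as follows. This is a minimal illustration, not the client's actual code: the helper names `to_media_url` and `audio_part` are hypothetical, and the use of a `data:` URL to carry the base64 payload is an assumption.

```python
import base64
import mimetypes
from pathlib import Path


def to_media_url(path_or_url: str) -> str:
    """Return an HTTP(S) URL unchanged, or encode a local file as a base64 data URL.

    Hypothetical helper; the data-URL form is an assumption about how the
    base64 payload is carried in the request.
    """
    if path_or_url.startswith(("http://", "https://")):
        return path_or_url
    path = Path(path_or_url)
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime_type, _ = mimetypes.guess_type(path.name)
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type or 'application/octet-stream'};base64,{encoded}"


def audio_part(path_or_url: str) -> dict:
    """Build an audio content part in the same shape as the curl script's audio_url entries."""
    return {"type": "audio_url", "audio_url": {"url": to_media_url(path_or_url)}}
```

The same pattern would apply to image_url and video_url parts; only the "type" key and the nested field name change.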
For example, to use mixed modalities with all local files:
python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_mixed_modalities \
--video-path /path/to/your/video.mp4 \
--image-path /path/to/your/image.jpg \
--audio-path /path/to/your/audio.wav \
--model Qwen/Qwen2.5-Omni-7B \
--prompt "Analyze all the media content and provide a comprehensive summary."
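Send request via curl¶
Run the curl script with a query type (mixed_modalities, use_audio_in_video, multi_audios, or text):
bash run_curl_multimodal_generation.sh mixed_modalities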
Modality control¶
You can control output modalities to specify which types of output the model should generate. This is useful when you only need text output and want to skip audio generation stages for better performance.
Supported modalities¶
| Modalities | Output |
|---|---|
| ["text"] | Text only |
| ["audio"] | Text + Audio |
| ["text", "audio"] | Text + Audio |
| Not specified | Text + Audio (default) |
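As a rough sketch of how the table maps onto a request body, the snippet below builds the JSON payload and leaves out the modalities field when none is given, which corresponds to the "Not specified" row. The function name `build_chat_request` is hypothetical; the field names follow the curl examples in this section.

```python
import json

MODEL = "Qwen/Qwen2.5-Omni-7B"


def build_chat_request(prompt, modalities=None):
    """Build a chat completion payload; omit "modalities" to get the server default."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    if modalities is not None:
        # e.g. ["text"] for text only, ["audio"] or ["text", "audio"] for text + audio
        payload["modalities"] = list(modalities)
    return payload


# Print the text-only request body for inspection.
print(json.dumps(build_chat_request("Describe vLLM in brief.", ["text"]), indent=2))
```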
Using curl¶
Text only¶
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Omni-7B",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text"]
}'
Text + Audio¶
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Omni-7B",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["audio"]
}'
Using Python client¶
python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_mixed_modalities \
--model Qwen/Qwen2.5-Omni-7B \
--modalities text
Using OpenAI Python SDK¶
Text only¶
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-Omni-7B",
messages=[{"role": "user", "content": "Describe vLLM in brief."}],
modalities=["text"]
)
print(response.choices[0].message.content)
Text + Audio¶
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-Omni-7B",
messages=[{"role": "user", "content": "Describe vLLM in brief."}],
modalities=["audio"]
)
# Response contains two choices: one with text, one with audio
print(response.choices[0].message.content) # Text response
print(response.choices[1].message.audio) # Audio response
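To keep the audio rather than print it, the payload can be decoded and written to disk. The sketch below assumes the audio object follows the OpenAI chat completion schema, where `audio.data` is a base64-encoded string; the helper name `save_audio`, the output filename, and the WAV container are assumptions, not something this page confirms.

```python
import base64
from pathlib import Path


def save_audio(audio_b64: str, out_path: str = "response.wav") -> int:
    """Decode base64 audio and write it to out_path; return the number of bytes written."""
    audio_bytes = base64.b64decode(audio_b64)
    Path(out_path).write_bytes(audio_bytes)
    return len(audio_bytes)


# Usage with the response above (assumes message.audio.data holds base64 audio):
# save_audio(response.choices[1].message.audio.data)
```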
Streaming Output¶
To enable streaming output, pass the --stream flag as shown below. Each stage's final output is returned as soon as that stage finishes generating it. Currently only text output is streamed; other modalities are returned as complete outputs.
python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_mixed_modalities \
--model Qwen/Qwen2.5-Omni-7B \
--stream
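With the OpenAI Python SDK, text streaming might look like the sketch below. It assumes the server emits standard chat completion chunks whose `choices[0].delta.content` carries the text pieces; the helper names `iter_text_deltas` and `stream_text` are hypothetical. The fake stream at the end lets the helper run without a server.

```python
from types import SimpleNamespace


def iter_text_deltas(chunks):
    """Yield the non-empty text pieces from an iterable of chat completion chunks."""
    for chunk in chunks:
        if not chunk.choices:
            continue
        text = getattr(chunk.choices[0].delta, "content", None)
        if text:
            yield text


def stream_text(client, prompt, model="Qwen/Qwen2.5-Omni-7B"):
    """Request a text-only streamed completion and print pieces as they arrive.

    `client` is an openai.OpenAI instance pointed at the server,
    e.g. OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY").
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        modalities=["text"],
        stream=True,
    )
    for piece in iter_text_deltas(stream):
        print(piece, end="", flush=True)
    print()


# Offline demonstration with stand-in chunk objects (no server required).
fake_stream = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=t))])
    for t in ("vLLM ", None, "is fast.")
]
print("".join(iter_text_deltas(fake_stream)))
```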
Run Local Web UI Demo¶
This Web UI demo allows users to interact with the model through a web browser.
Running Gradio Demo¶
The Gradio demo connects to a vLLM API server. You have two options:
Option 1: One-step Launch Script (Recommended)¶
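The convenience script run_gradio_demo.sh launches both the vLLM server and the Gradio demo together, for example:
./run_gradio_demo.sh --model Qwen/Qwen2.5-Omni-7B --server-port 8091 --gradio-port 7861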
This script will:
1. Start the vLLM server in the background
2. Wait for the server to be ready
3. Launch the Gradio demo
4. Handle cleanup when you press Ctrl+C
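The "wait for the server to be ready" step can also be done from Python. Note that run_gradio_demo.sh itself waits by watching the server log for "Application startup complete"; the sketch below swaps in a different technique, polling an HTTP health endpoint. It assumes the server answers GET /health with status 200 once ready (the script defines such a HEALTH_URL), and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to url succeeds with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(url, max_wait=300.0, interval=1.0, probe=http_ok, sleep=time.sleep):
    """Poll url until probe succeeds or max_wait seconds elapse; return whether it became ready."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Example (assumed endpoint, matching the script's default port):
# ready = wait_for_server("http://localhost:8091/health")
```

The 300-second default mirrors the MAX_WAIT timeout used in the shell script. The probe and sleep parameters are injectable so the loop can be exercised without a running server.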
The script supports the following arguments:
- --model: Model name/path (default: Qwen/Qwen2.5-Omni-7B)
- --server-port: Port for vLLM server (default: 8091)
- --gradio-port: Port for Gradio demo (default: 7861)
- --stage-configs-path: Path to custom stage configs YAML file (optional)
- --server-host: Host for vLLM server (default: 0.0.0.0)
- --gradio-ip: IP for Gradio demo (default: 127.0.0.1)
- --share: Share Gradio demo publicly (creates a public link)
Option 2: Manual Launch (Two-Step Process)¶
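Step 1: Launch the vLLM API server
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091
If you have a custom stage configs file:
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
Step 2: Run the Gradio demo
In a separate terminal, from the example folder:
python gradio_demo.py --model Qwen/Qwen2.5-Omni-7B --api-base http://localhost:8091/v1 --port 7861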
Then open http://localhost:7861/ in your local browser to interact with the web UI.
The Gradio script supports the following arguments:
- --model: Model name/path (should match the server model)
- --api-base: Base URL for the vLLM API server (default: http://localhost:8091/v1)
- --ip: Host/IP for Gradio server (default: 127.0.0.1)
- --port: Port for Gradio server (default: 7861)
- --share: Share the Gradio demo publicly (creates a public link)
Example materials¶
gradio_demo.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/qwen2_5_omni/gradio_demo.py.
run_curl_multimodal_generation.sh
#!/usr/bin/env bash
set -euo pipefail
# Default query type
QUERY_TYPE="${1:-mixed_modalities}"
# Default modalities argument
MODALITIES="${2:-null}"
# Validate query type
if [[ ! "$QUERY_TYPE" =~ ^(mixed_modalities|use_audio_in_video|multi_audios|text)$ ]]; then
echo "Error: Invalid query type '$QUERY_TYPE'"
echo "Usage: $0 [mixed_modalities|use_audio_in_video|multi_audios|text] [modalities]"
echo " mixed_modalities: Audio + Image + Video + Text query"
echo " use_audio_in_video: Video + Text query (with audio extraction from video)"
echo " multi_audios: Two audio clips + Text query"
echo " text: Text query"
echo " modalities: JSON value for the modalities field, e.g. '[\"text\"]' (default: null)"
exit 1
fi
SEED=42
thinker_sampling_params='{
"temperature": 0.0,
"top_p": 1.0,
"top_k": -1,
"max_tokens": 2048,
"seed": 42,
"detokenize": true,
"repetition_penalty": 1.1
}'
talker_sampling_params='{
"temperature": 0.9,
"top_p": 0.8,
"top_k": 40,
"max_tokens": 2048,
"seed": 42,
"detokenize": true,
"repetition_penalty": 1.05,
"stop_token_ids": [8294]
}'
code2wav_sampling_params='{
"temperature": 0.0,
"top_p": 1.0,
"top_k": -1,
"max_tokens": 2048,
"seed": 42,
"detokenize": true,
"repetition_penalty": 1.1
}'
# The sampling parameters above are optional; defaults are defined in the stage_configs of the corresponding model.
# Define URLs for assets
MARY_HAD_LAMB_AUDIO_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg"
WINNING_CALL_AUDIO_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/winning_call.ogg"
CHERRY_BLOSSOM_IMAGE_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg"
SAMPLE_VIDEO_URL="https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4"
# Build user content and extra fields based on query type
case "$QUERY_TYPE" in
text)
user_content='[
{
"type": "text",
"text": "Explain the system architecture for a scalable audio generation pipeline. Answer in 15 words."
}
]'
sampling_params_list='[
'"$thinker_sampling_params"',
'"$talker_sampling_params"',
'"$code2wav_sampling_params"'
]'
mm_processor_kwargs="{}"
;;
mixed_modalities)
user_content='[
{
"type": "audio_url",
"audio_url": {
"url": "'"$MARY_HAD_LAMB_AUDIO_URL"'"
}
},
{
"type": "image_url",
"image_url": {
"url": "'"$CHERRY_BLOSSOM_IMAGE_URL"'"
}
},
{
"type": "video_url",
"video_url": {
"url": "'"$SAMPLE_VIDEO_URL"'"
}
},
{
"type": "text",
"text": "What is recited in the audio? What is the content of this image? Why is this video funny?"
}
]'
sampling_params_list='[
'"$thinker_sampling_params"',
'"$talker_sampling_params"',
'"$code2wav_sampling_params"'
]'
mm_processor_kwargs="{}"
;;
use_audio_in_video)
user_content='[
{
"type": "video_url",
"video_url": {
"url": "'"$SAMPLE_VIDEO_URL"'"
}
},
{
"type": "text",
"text": "Describe the content of the video, then convert what the baby says into text."
}
]'
sampling_params_list='[
'"$thinker_sampling_params"',
'"$talker_sampling_params"',
'"$code2wav_sampling_params"'
]'
mm_processor_kwargs='{
"use_audio_in_video": true
}'
;;
multi_audios)
user_content='[
{
"type": "audio_url",
"audio_url": {
"url": "'"$MARY_HAD_LAMB_AUDIO_URL"'"
}
},
{
"type": "audio_url",
"audio_url": {
"url": "'"$WINNING_CALL_AUDIO_URL"'"
}
},
{
"type": "text",
"text": "Are these two audio clips the same?"
}
]'
sampling_params_list='[
'"$thinker_sampling_params"',
'"$talker_sampling_params"',
'"$code2wav_sampling_params"'
]'
mm_processor_kwargs="{}"
;;
esac
echo "Running query type: $QUERY_TYPE"
echo ""
request_body=$(cat <<EOF
{
"model": "Qwen/Qwen2.5-Omni-7B",
"sampling_params_list": $sampling_params_list,
"mm_processor_kwargs": $mm_processor_kwargs,
"modalities": $MODALITIES,
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
}
]
},
{
"role": "user",
"content": $user_content
}
]
}
EOF
)
output=$(curl -sS --retry 3 --retry-delay 3 --retry-connrefused \
-X POST http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d "$request_body")
# Only the text content of the first choice is shown. Audio content is binary data, so it is not printed here.
echo "Output of request: $(echo "$output" | jq '.choices[0].message.content')"
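The request body assembled by the script above can be sketched in Python as well. The per-stage values are copied from the script; passing the non-standard fields through the OpenAI SDK's `extra_body` argument is an assumption about how one would send them from Python, and the names `build_extra_body` and `send` are hypothetical.

```python
# Per-stage sampling parameters, copied from the curl script (thinker, talker, code2wav).
THINKER = {
    "temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 2048,
    "seed": 42, "detokenize": True, "repetition_penalty": 1.1,
}
TALKER = {
    "temperature": 0.9, "top_p": 0.8, "top_k": 40, "max_tokens": 2048,
    "seed": 42, "detokenize": True, "repetition_penalty": 1.05,
    "stop_token_ids": [8294],
}
CODE2WAV = dict(THINKER)  # the script uses the same values for code2wav as for thinker


def build_extra_body(use_audio_in_video=False):
    """Return the extra request fields, with one sampling-params entry per stage, in stage order."""
    return {
        "sampling_params_list": [THINKER, TALKER, CODE2WAV],
        "mm_processor_kwargs": {"use_audio_in_video": True} if use_audio_in_video else {},
    }


def send(prompt, use_audio_in_video=False):
    """Send a text prompt with the extra fields attached (requires a running server)."""
    from openai import OpenAI  # imported here so the helpers above work without the SDK

    client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
    return client.chat.completions.create(
        model="Qwen/Qwen2.5-Omni-7B",
        messages=[{"role": "user", "content": prompt}],
        extra_body=build_extra_body(use_audio_in_video),
    )
```

As the script's own comment notes, the sampling parameters are optional, so `sampling_params_list` can be dropped to fall back to the model's stage config defaults.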
run_gradio_demo.sh
#!/bin/bash
# Convenience script to launch both vLLM server and Gradio demo for Qwen2.5-Omni
#
# Usage:
# ./run_gradio_demo.sh [OPTIONS]
#
# Example:
# ./run_gradio_demo.sh --model Qwen/Qwen2.5-Omni-7B --server-port 8091 --gradio-port 7861
set -e
# Default values
MODEL="Qwen/Qwen2.5-Omni-7B"
SERVER_PORT=8091
GRADIO_PORT=7861
STAGE_CONFIGS_PATH=""
SERVER_HOST="0.0.0.0"
GRADIO_IP="127.0.0.1"
GRADIO_SHARE=false
# Parse command line arguments
while [[ $# -gt 0 ]]; do
case $1 in
--model)
MODEL="$2"
shift 2
;;
--server-port)
SERVER_PORT="$2"
shift 2
;;
--gradio-port)
GRADIO_PORT="$2"
shift 2
;;
--stage-configs-path)
STAGE_CONFIGS_PATH="$2"
shift 2
;;
--server-host)
SERVER_HOST="$2"
shift 2
;;
--gradio-ip)
GRADIO_IP="$2"
shift 2
;;
--share)
GRADIO_SHARE=true
shift
;;
--help)
echo "Usage: $0 [OPTIONS]"
echo ""
echo "Options:"
echo " --model MODEL Model name/path (default: Qwen/Qwen2.5-Omni-7B)"
echo " --server-port PORT Port for vLLM server (default: 8091)"
echo " --gradio-port PORT Port for Gradio demo (default: 7861)"
echo " --stage-configs-path PATH Path to custom stage configs YAML file (optional)"
echo " --server-host HOST Host for vLLM server (default: 0.0.0.0)"
echo " --gradio-ip IP IP for Gradio demo (default: 127.0.0.1)"
echo " --share Share Gradio demo publicly"
echo " --help Show this help message"
echo ""
exit 0
;;
*)
echo "Unknown option: $1"
echo "Use --help for usage information"
exit 1
;;
esac
done
# Get the directory where this script is located
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
API_BASE="http://localhost:${SERVER_PORT}/v1"
HEALTH_URL="http://localhost:${SERVER_PORT}/health"
echo "=========================================="
echo "Starting vLLM-Omni Gradio Demo"
echo "=========================================="
echo "Model: $MODEL"
echo "Server: http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio: http://${GRADIO_IP}:${GRADIO_PORT}"
echo "=========================================="
# Build vLLM server command
SERVER_CMD=("vllm" "serve" "$MODEL" "--omni" "--port" "$SERVER_PORT" "--host" "$SERVER_HOST")
if [ -n "$STAGE_CONFIGS_PATH" ]; then
SERVER_CMD+=("--stage-configs-path" "$STAGE_CONFIGS_PATH")
fi
# Function to cleanup on exit
cleanup() {
echo ""
echo "Shutting down..."
if [ -n "$SERVER_PID" ]; then
echo "Stopping vLLM server (PID: $SERVER_PID)..."
kill "$SERVER_PID" 2>/dev/null || true
wait "$SERVER_PID" 2>/dev/null || true
fi
if [ -n "$GRADIO_PID" ]; then
echo "Stopping Gradio demo (PID: $GRADIO_PID)..."
kill "$GRADIO_PID" 2>/dev/null || true
wait "$GRADIO_PID" 2>/dev/null || true
fi
echo "Cleanup complete"
exit 0
}
# Set up signal handlers
trap cleanup SIGINT SIGTERM
# Start vLLM server with output shown in real-time and saved to log
echo ""
echo "Starting vLLM server..."
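LOG_FILE="/tmp/vllm_server_${SERVER_PORT}.log"
# Use process substitution rather than a pipeline so that $! is the PID of the
# vLLM server itself, not of tee.
"${SERVER_CMD[@]}" > >(tee "$LOG_FILE") 2>&1 &
SERVER_PID=$!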
# Start a background process to monitor the log for startup completion
STARTUP_COMPLETE=false
TAIL_PID=""
# Function to cleanup tail process
cleanup_tail() {
if [ -n "$TAIL_PID" ]; then
kill "$TAIL_PID" 2>/dev/null || true
wait "$TAIL_PID" 2>/dev/null || true
fi
}
# Wait for server to be ready by checking log output
echo ""
echo "Waiting for vLLM server to be ready (checking for 'Application startup complete' message)..."
echo ""
# Monitor log file for startup completion message
MAX_WAIT=300 # 5 minutes timeout as fallback
ELAPSED=0
# Use a temporary file to track startup completion
STARTUP_FLAG="/tmp/vllm_startup_flag_${SERVER_PORT}.tmp"
rm -f "$STARTUP_FLAG"
# Start monitoring in background
(
tail -f "$LOG_FILE" 2>/dev/null | grep -m 1 "Application startup complete" > /dev/null && touch "$STARTUP_FLAG"
) &
TAIL_PID=$!
while [ $ELAPSED -lt $MAX_WAIT ]; do
# Check if startup flag file exists (startup complete)
if [ -f "$STARTUP_FLAG" ]; then
cleanup_tail
echo ""
echo "✓ vLLM server is ready!"
STARTUP_COMPLETE=true
break
fi
# Check if server process is still running
if ! kill -0 "$SERVER_PID" 2>/dev/null; then
cleanup_tail
echo ""
echo "Error: vLLM server failed to start (process terminated)"
wait "$SERVER_PID" 2>/dev/null || true
exit 1
fi
sleep 1
ELAPSED=$((ELAPSED + 1))
done
cleanup_tail
rm -f "$STARTUP_FLAG"
if [ "$STARTUP_COMPLETE" != "true" ]; then
echo ""
echo "Error: vLLM server did not complete startup within ${MAX_WAIT} seconds"
kill "$SERVER_PID" 2>/dev/null || true
exit 1
fi
# Start Gradio demo
echo ""
echo "Starting Gradio demo..."
cd "$SCRIPT_DIR"
GRADIO_CMD=("python" "gradio_demo.py" "--model" "$MODEL" "--api-base" "$API_BASE" "--ip" "$GRADIO_IP" "--port" "$GRADIO_PORT")
if [ "$GRADIO_SHARE" = true ]; then
GRADIO_CMD+=("--share")
fi
"${GRADIO_CMD[@]}" > /tmp/gradio_demo.log 2>&1 &
GRADIO_PID=$!
echo ""
echo "=========================================="
echo "Both services are running!"
echo "=========================================="
echo "vLLM Server: http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio Demo: http://${GRADIO_IP}:${GRADIO_PORT}"
echo ""
echo "Press Ctrl+C to stop both services"
echo "=========================================="
echo ""
# Wait for both processes to exit (or for Ctrl+C, which triggers cleanup)
wait $SERVER_PID $GRADIO_PID || true
cleanup