Qwen2.5-Omni¶

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/qwen2_5_omni.

🛠️ Installation¶

Please refer to README.md

Run examples (Qwen2.5-Omni)¶

Launch the Server¶

vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091

If you have a custom stage configs file, launch the server with the command below:

vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --stage-configs-path /path/to/stage_configs_file

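Send Multi-modal Request¶

Change into the example folder. The curl and Gradio scripts below are run from this folder, while the Python client path shown in the commands is relative to the repository root: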

cd examples/online_serving/qwen2_5_omni

Send request via python¶

python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py --model Qwen/Qwen2.5-Omni-7B --query-type use_mixed_modalities --port 8091 --host "localhost"

The Python client supports the following command-line arguments:

--query-type (or -q): Query type (default: mixed_modalities). Options: mixed_modalities, use_audio_in_video, multi_audios, text
--video-path (or -v): Path to local video file or URL. If not provided and query-type uses video, uses default video URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs. Example: --video-path /path/to/video.mp4 or --video-path https://example.com/video.mp4
--image-path (or -i): Path to local image file or URL. If not provided and query-type uses image, uses default image URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common image formats: JPEG, PNG, GIF, WebP. Example: --image-path /path/to/image.jpg or --image-path https://example.com/image.png
--audio-path (or -a): Path to local audio file or URL. If not provided and query-type uses audio, uses default audio URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: --audio-path /path/to/audio.wav or --audio-path https://example.com/audio.mp3
--prompt (or -p): Custom text prompt/question. If not provided, uses default prompt for the selected query type. Example: --prompt "What are the main activities shown in this video?"
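The argument descriptions above say that local files are encoded to base64 while HTTP/HTTPS URLs are passed through. The following is a minimal sketch of that idea, not the client's actual code: the helper names `to_media_url` and `image_part` are hypothetical, and the `data:<mime>;base64,...` data-URL form is an assumption about how the encoded bytes are sent.

```python
import base64
import mimetypes
from pathlib import Path


def to_media_url(path_or_url: str, default_mime: str = "application/octet-stream") -> str:
    """Return an HTTP(S) URL unchanged, or encode a local file as a base64 data URL.

    Hypothetical helper: the data-URL format is an assumption, not taken from the client.
    """
    if path_or_url.startswith(("http://", "https://")):
        return path_or_url  # remote media is passed through as-is
    path = Path(path_or_url)
    mime, _ = mimetypes.guess_type(path.name)  # guess the MIME type from the extension
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or default_mime};base64,{encoded}"


def image_part(path_or_url: str) -> dict:
    """Build an `image_url` content part in the shape used by the curl script in this example."""
    return {"type": "image_url", "image_url": {"url": to_media_url(path_or_url)}}
```

The same pattern would apply to `audio_url` and `video_url` parts; only the `type` key and the nested field name change.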

For example, to use mixed modalities with all local files:

python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
    --query-type use_mixed_modalities \
    --video-path /path/to/your/video.mp4 \
    --image-path /path/to/your/image.jpg \
    --audio-path /path/to/your/audio.wav \
    --model Qwen/Qwen2.5-Omni-7B \
    --prompt "Analyze all the media content and provide a comprehensive summary."

Send request via curl¶

bash run_curl_multimodal_generation.sh mixed_modalities

Modality control¶

You can control output modalities to specify which types of output the model should generate. This is useful when you only need text output and want to skip audio generation stages for better performance.

Supported modalities¶

Modalities	Output
`["text"]`	Text only
`["audio"]`	Text + Audio
`["text", "audio"]`	Text + Audio
Not specified	Text + Audio (default)

Using curl¶

Text only¶

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Omni-7B",
    "messages": [{"role": "user", "content": "Describe vLLM in brief."}],
    "modalities": ["text"]
  }'

Text + Audio¶

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Omni-7B",
    "messages": [{"role": "user", "content": "Describe vLLM in brief."}],
    "modalities": ["audio"]
  }'

Using Python client¶

python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
    --query-type use_mixed_modalities \
    --model Qwen/Qwen2.5-Omni-7B \
    --modalities text

Using OpenAI Python SDK¶

Text only¶

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{"role": "user", "content": "Describe vLLM in brief."}],
    modalities=["text"]
)
print(response.choices[0].message.content)

Text + Audio¶

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{"role": "user", "content": "Describe vLLM in brief."}],
    modalities=["audio"]
)
# Response contains two choices: one with text, one with audio
print(response.choices[0].message.content)  # Text response
print(response.choices[1].message.audio)    # Audio response
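To keep the generated speech, the audio payload can be decoded and written to disk. The sketch below assumes that `message.audio.data` holds base64-encoded audio bytes (as in the OpenAI SDK's audio response type) and that the bytes form a WAV file; both are assumptions to verify against your server's output. The helper names `extract_outputs` and `save_audio` are hypothetical.

```python
import base64
from pathlib import Path


def extract_outputs(response):
    """Collect the text and the base64 audio payload from all choices of a response.

    Scans every choice instead of relying on a fixed index, so it also works
    when only one of the two outputs is present.
    """
    text, audio_b64 = None, None
    for choice in response.choices:
        msg = choice.message
        if getattr(msg, "content", None):
            text = msg.content
        audio = getattr(msg, "audio", None)
        if audio is not None and getattr(audio, "data", None):
            audio_b64 = audio.data  # assumption: base64-encoded audio bytes
    return text, audio_b64


def save_audio(audio_b64: str, out_path: str = "output.wav") -> int:
    """Decode a base64 audio payload, write it to `out_path`, and return the byte count."""
    raw = base64.b64decode(audio_b64)
    Path(out_path).write_bytes(raw)
    return len(raw)
```

With the `response` object from the example above, `text, audio_b64 = extract_outputs(response)` followed by `save_audio(audio_b64)` would write the speech to `output.wav` when audio is present.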

Streaming Output¶

To enable streaming output, pass the --stream argument as shown below. Each stage's final output is returned as soon as that stage finishes generating it. Currently, only text output is streamed; other modalities are returned as usual, without streaming.

python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
    --query-type use_mixed_modalities \
    --model Qwen/Qwen2.5-Omni-7B \
    --stream
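Streaming can also be consumed from the OpenAI Python SDK. The sketch below assumes the server follows the standard chat-completions streaming format, where each chunk carries text in `choices[i].delta.content`; the helper names `collect_stream_text` and `stream_chat` are hypothetical and are not part of the example client.

```python
def collect_stream_text(chunks, on_delta=lambda s: print(s, end="", flush=True)):
    """Call `on_delta` for each streamed text piece and return the full text."""
    parts = []
    for chunk in chunks:
        # Some chunks may carry no choices or no text (for example role-only deltas).
        for choice in chunk.choices:
            delta = getattr(choice, "delta", None)
            text = getattr(delta, "content", None) if delta is not None else None
            if text:
                on_delta(text)
                parts.append(text)
    return "".join(parts)


def stream_chat(prompt, base_url="http://localhost:8091/v1", model="Qwen/Qwen2.5-Omni-7B"):
    """Send a text-only streaming request and return the assembled text."""
    from openai import OpenAI  # imported lazily; requires the `openai` package

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        modalities=["text"],
        stream=True,
    )
    return collect_stream_text(stream)
```

Calling `stream_chat("Describe vLLM in brief.")` against a running server would print text as it arrives and return the full answer at the end.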

Run Local Web UI Demo¶

This Web UI demo allows users to interact with the model through a web browser.

Running Gradio Demo¶

The Gradio demo connects to a vLLM API server. You have two options:

Option 1: One-step Launch Script (Recommended)¶

The convenience script launches both the vLLM server and Gradio demo together:

./run_gradio_demo.sh --model Qwen/Qwen2.5-Omni-7B --server-port 8091 --gradio-port 7861

This script will:

1. Start the vLLM server in the background
2. Wait for the server to be ready
3. Launch the Gradio demo
4. Handle cleanup when you press Ctrl+C

The script supports the following arguments:

--model: Model name/path (default: Qwen/Qwen2.5-Omni-7B)
--server-port: Port for vLLM server (default: 8091)
--gradio-port: Port for Gradio demo (default: 7861)
--stage-configs-path: Path to custom stage configs YAML file (optional)
--server-host: Host for vLLM server (default: 0.0.0.0)
--gradio-ip: IP for Gradio demo (default: 127.0.0.1)
--share: Share Gradio demo publicly (creates a public link)
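The "wait for the server to be ready" step can also be sketched in Python. The launch script itself watches the server log for the "Application startup complete" message; the version below instead polls the `/health` URL that the script defines, on the assumption that the server answers it with HTTP 200 once it is ready. The function names `http_ok` and `wait_for_server` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or non-2xx status


def wait_for_server(url="http://localhost:8091/health", max_wait=300.0, interval=1.0,
                    probe=http_ok, sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll `probe(url)` until it succeeds or `max_wait` seconds have elapsed."""
    deadline = clock() + max_wait
    while True:
        if probe(url):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The 300-second default mirrors the script's `MAX_WAIT`. The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a live server.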

Option 2: Manual Launch (Two-Step Process)¶

Step 1: Launch the vLLM API server

vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091

If you have a custom stage configs file:

vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --stage-configs-path /path/to/stage_configs_file

Step 2: Run the Gradio demo

In a separate terminal:

python gradio_demo.py --model Qwen/Qwen2.5-Omni-7B --api-base http://localhost:8091/v1 --port 7861

Then open http://localhost:7861/ in your local browser to interact with the web UI.

The Gradio script supports the following arguments:

--model: Model name/path (should match the server model)
--api-base: Base URL for the vLLM API server (default: http://localhost:8091/v1)
--ip: Host/IP for Gradio server (default: 127.0.0.1)
--port: Port for Gradio server (default: 7861)
--share: Share the Gradio demo publicly (creates a public link)

Example materials¶

gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/qwen2_5_omni/gradio_demo.py.

run_curl_multimodal_generation.sh

#!/usr/bin/env bash
set -euo pipefail

# Default query type
QUERY_TYPE="${1:-mixed_modalities}"

# Default modalities argument
MODALITIES="${2:-null}"

# Validate query type
if [[ ! "$QUERY_TYPE" =~ ^(mixed_modalities|use_audio_in_video|multi_audios|text)$ ]]; then
    echo "Error: Invalid query type '$QUERY_TYPE'"
    echo "Usage: $0 [mixed_modalities|use_audio_in_video|multi_audios|text] [modalities]"
    echo "  mixed_modalities: Audio + Image + Video + Text query"
    echo "  use_audio_in_video: Video + Text query (with audio extraction from video)"
    echo "  multi_audios: Two audio clips + Text query"
    echo "  text: Text query"
    echo "  modalities: Modalities parameter (default: null)"
    exit 1
fi

SEED=42

thinker_sampling_params='{
  "temperature": 0.0,
  "top_p": 1.0,
  "top_k": -1,
  "max_tokens": 2048,
  "seed": 42,
  "detokenize": true,
  "repetition_penalty": 1.1
}'

talker_sampling_params='{
  "temperature": 0.9,
  "top_p": 0.8,
  "top_k": 40,
  "max_tokens": 2048,
  "seed": 42,
  "detokenize": true,
  "repetition_penalty": 1.05,
  "stop_token_ids": [8294]
}'

code2wav_sampling_params='{
  "temperature": 0.0,
  "top_p": 1.0,
  "top_k": -1,
  "max_tokens": 2048,
  "seed": 42,
  "detokenize": true,
  "repetition_penalty": 1.1
}'
# The sampling parameters above are optional; defaults are defined in the stage_configs of the corresponding model.

# Define URLs for assets
MARY_HAD_LAMB_AUDIO_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg"
WINNING_CALL_AUDIO_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/winning_call.ogg"
CHERRY_BLOSSOM_IMAGE_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg"
SAMPLE_VIDEO_URL="https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4"

# Build user content and extra fields based on query type
case "$QUERY_TYPE" in
  text)
    user_content='[
      {
        "type": "text",
        "text": "Explain the system architecture for a scalable audio generation pipeline. Answer in 15 words."
      }
    ]'
    sampling_params_list='[
      '"$thinker_sampling_params"',
      '"$talker_sampling_params"',
      '"$code2wav_sampling_params"'
    ]'
    mm_processor_kwargs="{}"
    ;;
  mixed_modalities)
    user_content='[
        {
          "type": "audio_url",
          "audio_url": {
            "url": "'"$MARY_HAD_LAMB_AUDIO_URL"'"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "'"$CHERRY_BLOSSOM_IMAGE_URL"'"
          }
        },
        {
          "type": "video_url",
          "video_url": {
            "url": "'"$SAMPLE_VIDEO_URL"'"
          }
        },
        {
          "type": "text",
          "text": "What is recited in the audio? What is the content of this image? Why is this video funny?"
        }
      ]'
    sampling_params_list='[
      '"$thinker_sampling_params"',
      '"$talker_sampling_params"',
      '"$code2wav_sampling_params"'
    ]'
    mm_processor_kwargs="{}"
    ;;
  use_audio_in_video)
    user_content='[
        {
          "type": "video_url",
          "video_url": {
            "url": "'"$SAMPLE_VIDEO_URL"'"
          }
        },
        {
          "type": "text",
          "text": "Describe the content of the video, then convert what the baby says into text."
        }
      ]'
    sampling_params_list='[
      '"$thinker_sampling_params"',
      '"$talker_sampling_params"',
      '"$code2wav_sampling_params"'
    ]'
    mm_processor_kwargs='{
      "use_audio_in_video": true
    }'
    ;;
  multi_audios)
    user_content='[
        {
          "type": "audio_url",
          "audio_url": {
            "url": "'"$MARY_HAD_LAMB_AUDIO_URL"'"
          }
        },
        {
          "type": "audio_url",
          "audio_url": {
            "url": "'"$WINNING_CALL_AUDIO_URL"'"
          }
        },
        {
          "type": "text",
          "text": "Are these two audio clips the same?"
        }
      ]'
    sampling_params_list='[
      '"$thinker_sampling_params"',
      '"$talker_sampling_params"',
      '"$code2wav_sampling_params"'
    ]'
    mm_processor_kwargs="{}"
    ;;
esac

echo "Running query type: $QUERY_TYPE"
echo ""

request_body=$(cat <<EOF
{
  "model": "Qwen/Qwen2.5-Omni-7B",
  "sampling_params_list": $sampling_params_list,
  "mm_processor_kwargs": $mm_processor_kwargs,
  "modalities": $MODALITIES,
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
        }
      ]
    },
    {
      "role": "user",
      "content": $user_content
    }
  ]
}
EOF
)

output=$(curl -sS --retry 3 --retry-delay 3 --retry-connrefused \
    -X POST http://localhost:8091/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$request_body")

# Only the text content of the first choice is printed. The audio content is a large encoded payload, so it is not displayed here.
echo "Output of request: $(echo "$output" | jq '.choices[0].message.content')"
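For readers who prefer Python to shell quoting, the request body assembled by the script above can be sketched as a plain dictionary. The field names (`sampling_params_list`, `mm_processor_kwargs`, `modalities`) and the system prompt are taken from the script; the function name `build_request_body` is hypothetical, and omitting `sampling_params_list` relies on the script's note that these parameters are optional.

```python
import json

MODEL = "Qwen/Qwen2.5-Omni-7B"
SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating text and speech."
)


def build_request_body(user_content, use_audio_in_video=False, modalities=None,
                       sampling_params_list=None):
    """Assemble a chat-completions body with the same fields as the curl script."""
    body = {
        "model": MODEL,
        # Empty for most query types; the use_audio_in_video query sets the flag.
        "mm_processor_kwargs": {"use_audio_in_video": True} if use_audio_in_video else {},
        # None serializes to JSON null, matching the script's default for modalities.
        "modalities": modalities,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user", "content": user_content},
        ],
    }
    if sampling_params_list is not None:
        # One entry per stage, in the script's order: thinker, talker, code2wav.
        body["sampling_params_list"] = sampling_params_list
    return body


def to_json(body) -> str:
    """Serialize the body for use as an HTTP POST payload."""
    return json.dumps(body)
```

For instance, `build_request_body([{"type": "text", "text": "Hello"}], modalities=["text"])` yields a text-only request that can be posted to `/v1/chat/completions`.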

run_gradio_demo.sh

#!/bin/bash
# Convenience script to launch both vLLM server and Gradio demo for Qwen2.5-Omni
#
# Usage:
#   ./run_gradio_demo.sh [OPTIONS]
#
# Example:
#   ./run_gradio_demo.sh --model Qwen/Qwen2.5-Omni-7B --server-port 8091 --gradio-port 7861

set -e

# Default values
MODEL="Qwen/Qwen2.5-Omni-7B"
SERVER_PORT=8091
GRADIO_PORT=7861
STAGE_CONFIGS_PATH=""
SERVER_HOST="0.0.0.0"
GRADIO_IP="127.0.0.1"
GRADIO_SHARE=false

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --model)
            MODEL="$2"
            shift 2
            ;;
        --server-port)
            SERVER_PORT="$2"
            shift 2
            ;;
        --gradio-port)
            GRADIO_PORT="$2"
            shift 2
            ;;
        --stage-configs-path)
            STAGE_CONFIGS_PATH="$2"
            shift 2
            ;;
        --server-host)
            SERVER_HOST="$2"
            shift 2
            ;;
        --gradio-ip)
            GRADIO_IP="$2"
            shift 2
            ;;
        --share)
            GRADIO_SHARE=true
            shift
            ;;
        --help)
            echo "Usage: $0 [OPTIONS]"
            echo ""
            echo "Options:"
            echo "  --model MODEL                 Model name/path (default: Qwen/Qwen2.5-Omni-7B)"
            echo "  --server-port PORT            Port for vLLM server (default: 8091)"
            echo "  --gradio-port PORT            Port for Gradio demo (default: 7861)"
            echo "  --stage-configs-path PATH     Path to custom stage configs YAML file (optional)"
            echo "  --server-host HOST            Host for vLLM server (default: 0.0.0.0)"
            echo "  --gradio-ip IP                IP for Gradio demo (default: 127.0.0.1)"
            echo "  --share                       Share Gradio demo publicly"
            echo "  --help                        Show this help message"
            echo ""
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            echo "Use --help for usage information"
            exit 1
            ;;
    esac
done

# Get the directory where this script is located
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
API_BASE="http://localhost:${SERVER_PORT}/v1"
HEALTH_URL="http://localhost:${SERVER_PORT}/health"

echo "=========================================="
echo "Starting vLLM-Omni Gradio Demo"
echo "=========================================="
echo "Model: $MODEL"
echo "Server: http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio: http://${GRADIO_IP}:${GRADIO_PORT}"
echo "=========================================="

# Build vLLM server command
SERVER_CMD=("vllm" "serve" "$MODEL" "--omni" "--port" "$SERVER_PORT" "--host" "$SERVER_HOST")
if [ -n "$STAGE_CONFIGS_PATH" ]; then
    SERVER_CMD+=("--stage-configs-path" "$STAGE_CONFIGS_PATH")
fi

# Function to cleanup on exit
cleanup() {
    echo ""
    echo "Shutting down..."
    if [ -n "$SERVER_PID" ]; then
        echo "Stopping vLLM server (PID: $SERVER_PID)..."
        kill "$SERVER_PID" 2>/dev/null || true
        wait "$SERVER_PID" 2>/dev/null || true
    fi
    if [ -n "$GRADIO_PID" ]; then
        echo "Stopping Gradio demo (PID: $GRADIO_PID)..."
        kill "$GRADIO_PID" 2>/dev/null || true
        wait "$GRADIO_PID" 2>/dev/null || true
    fi
    echo "Cleanup complete"
    exit 0
}

# Set up signal handlers
trap cleanup SIGINT SIGTERM

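# Start vLLM server with output shown in real-time and saved to log.
# Process substitution (instead of a pipe into tee) keeps $! pointing at the server
# process itself, so the liveness check and cleanup act on vLLM rather than on tee.
echo ""
echo "Starting vLLM server..."
LOG_FILE="/tmp/vllm_server_${SERVER_PORT}.log"
"${SERVER_CMD[@]}" > >(tee "$LOG_FILE") 2>&1 &
SERVER_PID=$!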

# Start a background process to monitor the log for startup completion
STARTUP_COMPLETE=false
TAIL_PID=""

# Function to cleanup tail process
cleanup_tail() {
    if [ -n "$TAIL_PID" ]; then
        kill "$TAIL_PID" 2>/dev/null || true
        wait "$TAIL_PID" 2>/dev/null || true
    fi
}

# Wait for server to be ready by checking log output
echo ""
echo "Waiting for vLLM server to be ready (checking for 'Application startup complete' message)..."
echo ""

# Monitor log file for startup completion message
MAX_WAIT=300  # 5 minutes timeout as fallback
ELAPSED=0

# Use a temporary file to track startup completion
STARTUP_FLAG="/tmp/vllm_startup_flag_${SERVER_PORT}.tmp"
rm -f "$STARTUP_FLAG"

# Start monitoring in background
(
    tail -f "$LOG_FILE" 2>/dev/null | grep -m 1 "Application startup complete" > /dev/null && touch "$STARTUP_FLAG"
) &
TAIL_PID=$!

while [ $ELAPSED -lt $MAX_WAIT ]; do
    # Check if startup flag file exists (startup complete)
    if [ -f "$STARTUP_FLAG" ]; then
        cleanup_tail
        echo ""
        echo "✓ vLLM server is ready!"
        STARTUP_COMPLETE=true
        break
    fi

    # Check if server process is still running
    if ! kill -0 "$SERVER_PID" 2>/dev/null; then
        cleanup_tail
        echo ""
        echo "Error: vLLM server failed to start (process terminated)"
        wait "$SERVER_PID" 2>/dev/null || true
        exit 1
    fi

    sleep 1
    ELAPSED=$((ELAPSED + 1))
done

cleanup_tail
rm -f "$STARTUP_FLAG"

if [ "$STARTUP_COMPLETE" != "true" ]; then
    echo ""
    echo "Error: vLLM server did not complete startup within ${MAX_WAIT} seconds"
    kill "$SERVER_PID" 2>/dev/null || true
    exit 1
fi

# Start Gradio demo
echo ""
echo "Starting Gradio demo..."
cd "$SCRIPT_DIR"
GRADIO_CMD=("python" "gradio_demo.py" "--model" "$MODEL" "--api-base" "$API_BASE" "--ip" "$GRADIO_IP" "--port" "$GRADIO_PORT")
if [ "$GRADIO_SHARE" = true ]; then
    GRADIO_CMD+=("--share")
fi

"${GRADIO_CMD[@]}" > /tmp/gradio_demo.log 2>&1 &
GRADIO_PID=$!

echo ""
echo "=========================================="
echo "Both services are running!"
echo "=========================================="
echo "vLLM Server: http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio Demo: http://${GRADIO_IP}:${GRADIO_PORT}"
echo ""
echo "Press Ctrl+C to stop both services"
echo "=========================================="
echo ""

# Wait for both processes to exit (Ctrl+C triggers cleanup through the trap above)
wait $SERVER_PID $GRADIO_PID || true

cleanup
