Qwen3-Omni¶
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/qwen3_omni.
🛠️ Installation¶
Please refer to README.md
Run examples (Qwen3-Omni)¶
Launch the Server¶
The default deploy config, vllm_omni/deploy/qwen3_omni_moe.yaml, is loaded automatically through the model registry, so a standard launch needs no --deploy-config flag. Async chunk streaming is enabled by default in this bundled config. Per-platform deltas for NPU, ROCm, and XPU are merged from the platforms: section of the same YAML.
To use a custom deploy YAML, pass its path explicitly:
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--deploy-config /path/to/your_deploy_config.yaml
For the bundled 3x-GPU multi-replica layout (talker/code2wav scale-out), use:
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--deploy-config vllm_omni/deploy/qwen3_omni_moe_multi_replicas.yaml
Launch individual stages (stage-based CLI)¶
Use the stage-based CLI when you want to run one stage per process. The example below pins Stage 0 to GPU 0 and Stage 1/2 to GPU 1 via CUDA_VISIBLE_DEVICES.
1. Stage 0 (Thinker + API server)
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--port 8091 \
--stage-id 0 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
2. Stage 1 (Talker)
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 1 \
--headless \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
3. Stage 2 (Code2Wav)
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 2 \
--headless \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
Append --deploy-config /path/to/your_deploy_config.yaml to each stage's command if you need to override the bundled deploy YAML.
With the standard single-process launcher, per-stage tuning from the CLI is usually done with --stage-overrides, for example:
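The three per-stage commands above share most of their arguments. As a minimal sketch, assuming the same two-GPU layout as the example, they can be generated from one helper. The names build_stage_command and STAGE_GPUS are illustrative and not part of vLLM-Omni; only flags shown above are used.

```python
import shlex

MODEL = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# Stage id -> GPU index, mirroring the example above (hypothetical mapping name).
STAGE_GPUS = {0: "0", 1: "1", 2: "1"}


def build_stage_command(stage_id: int, master_addr: str = "127.0.0.1",
                        master_port: int = 26000, api_port: int = 8091) -> str:
    """Return the shell command for one stage, using only flags shown above."""
    args = ["vllm", "serve", MODEL, "--omni"]
    if stage_id == 0:
        # Stage 0 is the "Thinker + API server" stage, so it gets the port.
        args += ["--port", str(api_port)]
    args += ["--stage-id", str(stage_id)]
    if stage_id != 0:
        # Stages 1 and 2 are launched with --headless in the example.
        args.append("--headless")
    args += ["--omni-master-address", master_addr,
             "--omni-master-port", str(master_port)]
    env = f"CUDA_VISIBLE_DEVICES={STAGE_GPUS[stage_id]}"
    return env + " " + " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    for sid in sorted(STAGE_GPUS):
        print(build_stage_command(sid))
```

Each printed line can be pasted into its own terminal, one process per stage.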
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'
With the stage-based CLI, --stage-overrides is usually unnecessary for this kind of tuning. Each process launches exactly one stage, so you can put the flags directly on that stage's command:
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 1 \
--headless \
--gpu-memory-utilization 0.5 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
Tuning deployment parameters¶
Most engine knobs (max_num_batched_tokens, max_model_len, enforce_eager, gpu_memory_utilization, tensor_parallel_size, …) can be tuned without editing the YAML. There are three layers, in increasing specificity:
1. Global CLI flags (apply to every stage)¶
# Tighter memory budget on a smaller GPU
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--gpu-memory-utilization 0.85
# Disable cudagraphs (e.g. for debugging)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--enforce-eager
# Reduce context length
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--max-model-len 32768
# Toggle prefix caching on every stage (yaml default: off)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--enable-prefix-caching
# ...or force it off if the yaml turned it on
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--no-enable-prefix-caching
# Toggle pipeline-wide async chunked streaming between stages
# (yaml default for qwen3_omni_moe: on)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--no-async-chunk
For the TTS counterpart (synchronous codec variant), see the Qwen3-TTS section of the online TTS hub: examples/online_serving/text_to_speech/README.md#qwen3-tts.
Explicit CLI flags override the deploy YAML (which itself overrides the parser defaults). If you don't pass a flag, the YAML value wins.
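The precedence rule can be pictured as a layered lookup. The sketch below is a conceptual illustration only, not vLLM-Omni's actual loader; resolve_settings and the sample values are hypothetical.

```python
from collections import ChainMap


def resolve_settings(parser_defaults: dict, yaml_values: dict, cli_flags: dict) -> dict:
    """Merge three layers; earlier maps in the ChainMap win on lookup."""
    # Only flags the user actually passed belong in cli_flags.
    return dict(ChainMap(cli_flags, yaml_values, parser_defaults))


if __name__ == "__main__":
    defaults = {"gpu_memory_utilization": 0.9, "enforce_eager": False}  # made-up values
    from_yaml = {"gpu_memory_utilization": 0.6}
    from_cli = {"enforce_eager": True}
    print(resolve_settings(defaults, from_yaml, from_cli))
```

A key missing from cli_flags falls through to the YAML value, and a key missing from both falls through to the parser default.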
Note on --no-async-chunk: Flips the deploy YAML's async_chunk: bool. Pipelines that implement alternate processor functions for chunked vs end-to-end modes (e.g. qwen3_tts code2wav) dispatch automatically based on that bool; no extra flag or variant YAML is needed.
⚠️ For multi-stage models that share GPUs (qwen3_omni_moe by default shares cuda:1 between stages 1 and 2), avoid using global memory flags. A global --gpu-memory-utilization 0.85 would apply to every stage and oversubscribe the shared device. Use per-stage overrides instead; see below.
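To see why a global memory flag oversubscribes a shared GPU, add up the requested fractions per device. This is a rough sketch: stages 1 and 2 sharing cuda:1 follows the default layout described above, stage 0 sitting alone on cuda:0 is an assumption, and device_budget is a hypothetical helper that treats gpu_memory_utilization as each stage's fraction of device memory.

```python
from collections import defaultdict


def device_budget(stage_devices: dict, utilization: dict) -> dict:
    """Sum the requested memory fraction of every stage on each device."""
    totals = defaultdict(float)
    for stage_id, device in stage_devices.items():
        totals[device] += utilization[stage_id]
    return dict(totals)


if __name__ == "__main__":
    layout = {0: "cuda:0", 1: "cuda:1", 2: "cuda:1"}
    # Equivalent of passing --gpu-memory-utilization 0.85 globally.
    global_flag = {sid: 0.85 for sid in layout}
    for device, total in device_budget(layout, global_flag).items():
        status = "oversubscribed" if total > 1.0 else "ok"
        print(f"{device}: {total:.2f} ({status})")
```

With the global flag, the two stages on cuda:1 together request more than the whole device.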
2. Per-stage overrides via --stage-overrides (recommended for memory)¶
# Lower stage 1's memory budget; leave others at the YAML default
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--stage-overrides '{
"1": {"gpu_memory_utilization": 0.5},
"2": {"max_num_batched_tokens": 65536}
}'
Per-stage values are always treated as explicit and beat YAML defaults for the named stage. Other stages keep their YAML values.
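Hand-writing JSON inside shell quotes is error-prone once several stages are involved. A small sketch of generating the argument instead; stage_overrides_arg is a hypothetical helper, and the JSON shape simply mirrors the example above.

```python
import json
import shlex


def stage_overrides_arg(overrides: dict) -> str:
    """Serialize per-stage overrides into a shell-safe --stage-overrides argument."""
    # JSON object keys must be strings, matching the "1": {...} form shown above.
    payload = {str(stage_id): values for stage_id, values in overrides.items()}
    return "--stage-overrides " + shlex.quote(json.dumps(payload))


if __name__ == "__main__":
    print(stage_overrides_arg({
        1: {"gpu_memory_utilization": 0.5},
        2: {"max_num_batched_tokens": 65536},
    }))
```

The printed string can be appended to the vllm serve command line as-is.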
If you switch to the stage-based CLI, the same per-stage tuning can usually be passed directly on that stage's command instead of using --stage-overrides.
3. Custom deploy YAML¶
When per-stage overrides get long, write a small overlay YAML that inherits from the bundled default:
# my_qwen3_omni_overrides.yaml
base_config: /path/to/vllm_omni/deploy/qwen3_omni_moe.yaml
stages:
  - stage_id: 0
    max_num_batched_tokens: 65536
    enforce_eager: true
  - stage_id: 1
    gpu_memory_utilization: 0.5
  - stage_id: 2
    max_model_len: 8192
Then start the server with --deploy-config my_qwen3_omni_overrides.yaml. The base_config: line tells the loader to inherit everything else (stages, connectors, edges, platforms section) from the bundled production YAML, so you only need to spell out the deltas.
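Conceptually, the overlay replaces only the keys it names for each stage_id and leaves everything else inherited. The sketch below illustrates that idea on plain dictionaries. It is an assumption about the merge semantics, not the loader's actual code, and apply_overlay plus the base values are made up for illustration.

```python
import copy


def apply_overlay(base_stages: list, overlay_stages: list) -> list:
    """Return base stages with overlay keys applied to the matching stage_id."""
    merged = {s["stage_id"]: copy.deepcopy(s) for s in base_stages}
    for delta in overlay_stages:
        # In this sketch, an unknown stage id is simply added as a new entry.
        merged.setdefault(delta["stage_id"], {}).update(delta)
    return [merged[sid] for sid in sorted(merged)]


if __name__ == "__main__":
    base = [
        {"stage_id": 0, "gpu_memory_utilization": 0.9, "enforce_eager": False},
        {"stage_id": 1, "gpu_memory_utilization": 0.6},
    ]
    overlay = [{"stage_id": 1, "gpu_memory_utilization": 0.5}]
    print(apply_overlay(base, overlay))
```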
4. Multi-node deployment (cross-host transfer connector)¶
The bundled qwen3_omni_moe.yaml uses SharedMemoryConnector between stages, which only works when all stages run on the same physical host. For cross-node deployments, write a small overlay YAML that swaps in a network-capable connector (e.g. MooncakeStoreConnector) and re-points each stage's connector wiring at it. The connector spec carries your own server addresses — there is no checked-in default because every cluster is different.
# my_qwen3_omni_multinode.yaml
base_config: /path/to/vllm_omni/deploy/qwen3_omni_moe.yaml
connectors:
  mooncake_connector:
    name: MooncakeStoreConnector
    extra:
      host: "127.0.0.1"
      metadata_server: "http://YOUR_METADATA_HOST:8080/metadata"
      master: "YOUR_MASTER_HOST:50051"
      segment: 512000000 # 512 MB transfer segment
      localbuf: 64000000 # 64 MB local buffer
      proto: "tcp"
stages:
  - stage_id: 0
    output_connectors:
      to_stage_1: mooncake_connector
  - stage_id: 1
    input_connectors:
      from_stage_0: mooncake_connector
    output_connectors:
      to_stage_2: mooncake_connector
  - stage_id: 2
    input_connectors:
      from_stage_1: mooncake_connector
Then launch with --deploy-config my_qwen3_omni_multinode.yaml. Same pattern works for Qwen2.5-Omni — replace base_config: with the path to vllm_omni/deploy/qwen2_5_omni.yaml.
⚠️ Replace YOUR_METADATA_HOST / YOUR_MASTER_HOST with the actual mooncake server addresses for your cluster. The base_config: overlay inherits all stage budgets, devices, and edges from the bundled prod YAML, so you only need to spell out the connector swap.
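Because every edge is named twice in the overlay (once as an output, once as an input), a typo on either side is easy to miss. The following sketch checks an already-parsed overlay, for example the result of loading the YAML into a dict. check_wiring is a hypothetical helper, and it assumes the to_stage_N / from_stage_M key naming used in the example above.

```python
def check_wiring(config: dict) -> list:
    """Return a list of problems found in an overlay's connector wiring."""
    problems = []
    defined = set(config.get("connectors", {}))
    stages = {s["stage_id"]: s for s in config.get("stages", [])}
    for sid, stage in stages.items():
        for key, name in stage.get("output_connectors", {}).items():
            if name not in defined:
                problems.append(f"stage {sid}: unknown connector {name!r}")
            # "to_stage_2" -> 2
            target = int(key.rsplit("_", 1)[1])
            peer_inputs = stages.get(target, {}).get("input_connectors", {})
            if peer_inputs.get(f"from_stage_{sid}") != name:
                problems.append(f"stage {sid} -> {target}: input side does not match")
    return problems


if __name__ == "__main__":
    overlay = {
        "connectors": {"mooncake_connector": {"name": "MooncakeStoreConnector"}},
        "stages": [
            {"stage_id": 0, "output_connectors": {"to_stage_1": "mooncake_connector"}},
            {"stage_id": 1, "input_connectors": {"from_stage_0": "mooncake_connector"}},
        ],
    }
    print(check_wiring(overlay) or "wiring looks consistent")
```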
Send Multi-modal Request¶
Get into the example folder
Send request via python¶
python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py --model Qwen/Qwen3-Omni-30B-A3B-Instruct --query-type use_image --port 8091 --host "localhost"
Realtime WebSocket client (openai_realtime_client.py)¶
openai_realtime_client.py connects to ws://<host>:<port>/v1/realtime, streams a local WAV as PCM16 mono @ 16 kHz in fixed-size chunks (OpenAI-style input_audio_buffer.append / commit), and receives response.audio.delta (incremental PCM for the reply) plus transcription.* events. By default it concatenates audio deltas and writes --output-wav (model output is typically 24 kHz). Optional --delta-dump-dir saves each delta as delta_000001.wav, … for debugging.
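The chunk size follows directly from the input format: 16,000 samples per second at 2 bytes per sample. The sketch below shows that arithmetic and how one chunk could be wrapped in an append event. It is not the client's actual code; the "audio" field name follows the OpenAI Realtime convention and is an assumption about what this server expects.

```python
import base64
import json

SAMPLE_RATE = 16_000   # Hz, per the input format described above
BYTES_PER_SAMPLE = 2   # PCM16 mono


def chunk_pcm16(pcm: bytes, chunk_ms: int = 200) -> list:
    """Split raw PCM16 mono audio into chunks of chunk_ms milliseconds."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]


def append_event(chunk: bytes) -> str:
    """Build an OpenAI-style input_audio_buffer.append event (field name assumed)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)
    chunks = chunk_pcm16(one_second_of_silence)
    print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

At the default --chunk-ms of 200, each chunk carries 6,400 bytes, so one second of audio is sent as five messages.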
Streaming input works well for translation-style use cases; if the Thinker runs while input is still incomplete, consider limiting max_tokens in your session / server defaults to avoid over-generation.
Dependencies:
From this directory (examples/online_serving/qwen3_omni):
python openai_realtime_client.py \
--url ws://localhost:8091/v1/realtime \
--model Qwen/Qwen3-Omni-30B-A3B-Instruct \
--input-wav /path/to/input_16k_mono.wav \
--output-wav realtime_output.wav \
--delta-dump-dir ./rt_delta_wavs
Arguments:
| Flag | Default | Description |
|---|---|---|
| --url | ws://localhost:8091/v1/realtime | Full WebSocket URL including path |
| --model | Qwen/Qwen3-Omni-30B-A3B-Instruct | Must match the served model (sent in session.update) |
| --input-wav | (required) | Input WAV: mono, 16-bit PCM, 16 kHz |
| --output-wav | realtime_output.wav | Output path for concatenated reply audio |
| --output-text | (optional) | If set, write final transcription text to this path |
| --chunk-ms | 200 | Size of each uploaded audio chunk (milliseconds of audio) |
| --send-delay-ms | 0 | Delay between chunk sends (simulate realtime upload) |
| --delta-dump-dir | (optional) | Directory to write per-response.audio.delta WAV files |
| --num-requests | 1 | Number of sequential sessions (see --concurrency) |
| --concurrency | 1 | Max concurrent WebSocket sessions when --num-requests > 1 |
Ensure the server is running, for example:
The Python client supports the following command-line arguments:
- --query-type (or -q): Query type (default: use_video). Options: text, use_audio, use_image, use_video
- --model (or -m): Model name/path (default: Qwen/Qwen3-Omni-30B-A3B-Instruct)
- --video-path (or -v): Path to local video file or URL. If not provided and query-type is use_video, uses default video URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs. Example: --video-path /path/to/video.mp4 or --video-path https://example.com/video.mp4
- --image-path (or -i): Path to local image file or URL. If not provided and query-type is use_image, uses default image URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common image formats: JPEG, PNG, GIF, WebP. Example: --image-path /path/to/image.jpg or --image-path https://example.com/image.png
- --audio-path (or -a): Path to local audio file or URL. If not provided and query-type is use_audio, uses default audio URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: --audio-path /path/to/audio.wav or --audio-path https://example.com/audio.mp3
- --prompt (or -p): Custom text prompt/question. If not provided, uses default prompt for the selected query type. Example: --prompt "What are the main activities shown in this video?"
- --speaker: TTS speaker/voice for audio output when requesting audio (e.g. ethan, chelsie, aiden). Omit to use the model default. Example: --speaker "chelsie"
For example, to use a local video file with custom prompt:
python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_video \
--video-path /path/to/your/video.mp4 \
--model Qwen/Qwen3-Omni-30B-A3B-Instruct \
--prompt "What are the main activities shown in this video?"
Send request via curl¶
FAQ¶
Modality control¶
You can control output modalities to specify which types of output the model should generate. This is useful when you only need text output and want to skip audio generation stages for better performance.
Supported modalities¶
| Modalities | Output |
|---|---|
| ["text"] | Text only |
| ["audio"] | Audio only |
| ["text", "audio"] | Text + Audio |
| Not specified | Text + Audio (default) |
Using curl¶
Text only¶
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text"]
}'
Audio only¶
curl -s http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["audio"]
}' | jq -r '.choices[0].message.audio.data' | base64 -d > output.wav
Using Python client¶
python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_image \
--model Qwen/Qwen3-Omni-30B-A3B-Instruct \
--modalities text
Using OpenAI Python SDK¶
Text only¶
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
messages=[{"role": "user", "content": "Describe vLLM in brief."}],
modalities=["text"]
)
print(response.choices[0].message.content)
Text + Audio¶
import base64
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
messages=[{"role": "user", "content": "Describe vLLM in brief."}],
modalities=["text", "audio"]
)
# Response contains two choices: one with text, one with audio
print(response.choices[0].message.content) # Text response
# Save audio to file
audio_data = base64.b64decode(response.choices[1].message.audio.data)
with open("output.wav", "wb") as f:
    f.write(audio_data)
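Which choice carries the audio depends on the requested modalities, so client code can avoid hard-coding an index by scanning all choices. A minimal sketch, assuming the response has been converted to a plain dict (for example with the SDK object's model_dump()); split_choices is a hypothetical helper.

```python
from __future__ import annotations

import base64


def split_choices(response: dict) -> tuple[str | None, bytes | None]:
    """Return (text, audio_bytes) from a chat completion parsed as a dict."""
    text, audio = None, None
    for choice in response.get("choices", []):
        message = choice.get("message", {})
        # Take the first non-empty text content.
        if text is None and message.get("content"):
            text = message["content"]
        # Take the first base64-encoded audio payload.
        data = (message.get("audio") or {}).get("data")
        if audio is None and data:
            audio = base64.b64decode(data)
    return text, audio


if __name__ == "__main__":
    fake = {"choices": [
        {"message": {"content": "hello"}},
        {"message": {"content": None,
                     "audio": {"data": base64.b64encode(b"RIFF").decode()}}},
    ]}
    text, audio = split_choices(fake)
    print(text, len(audio or b""))
```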
Speaker selection¶
When requesting audio output, you can choose the TTS speaker (voice) used for synthesis. If not specified, the model uses its default speaker.
Using curl¶
Pass a speaker field in the request body:
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
"messages": [{"role": "user", "content": "Say hello in one sentence."}],
"modalities": ["audio"],
"speaker": "chelsie"
}'
Using Python client¶
Use the --speaker argument when generating audio:
python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_image \
--modalities audio \
--model Qwen/Qwen3-Omni-30B-A3B-Instruct \
--speaker "chelsie"
Using OpenAI Python SDK¶
Pass speaker in extra_body:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
messages=[{"role": "user", "content": "Say hello in one sentence."}],
modalities=["audio"],
extra_body={"speaker": "chelsie"}
)
# Audio uses the specified speaker
print(response.choices[1].message.audio)
Supported speaker names depend on the model (e.g. Ethan, Chelsie, Aiden). Omit speaker to use the default.
Streaming Output¶
To enable streaming output, pass --stream as shown below. Each stage's output is returned as soon as that stage produces it. Both text and audio streaming are supported; other modalities are output as usual.
python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_image \
--model Qwen/Qwen3-Omni-30B-A3B-Instruct \
--stream
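On the client side, streamed text arrives as a sequence of chunks whose deltas are concatenated. The sketch below covers the text part only and uses stand-in objects shaped like the OpenAI SDK's stream chunks (choices[].delta.content). How audio deltas are carried is not covered here, and collect_text is a hypothetical helper.

```python
from types import SimpleNamespace


def collect_text(chunks) -> str:
    """Concatenate text deltas from an iterable of streamed chat chunks."""
    parts = []
    for chunk in chunks:
        for choice in chunk.choices:
            content = getattr(choice.delta, "content", None)
            if content:
                parts.append(content)
    return "".join(parts)


if __name__ == "__main__":
    # Stand-in chunks with the same attribute layout as the SDK's stream objects.
    fake = [
        SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=t))])
        for t in ("vLLM ", "is ", "fast.")
    ]
    print(collect_text(fake))
```

With a live server, the iterable would be the object returned by client.chat.completions.create(..., stream=True).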
Run Local Web UI Demo¶
This Web UI demo allows users to interact with the model through a web browser.
Running Gradio Demo¶
The Gradio demo connects to a vLLM API server. You have two options:
Option 1: One-step Launch Script (Recommended)¶
The convenience script launches both the vLLM server and Gradio demo together:
This script will:
1. Start the vLLM server in the background
2. Wait for the server to be ready
3. Launch the Gradio demo
4. Handle cleanup when you press Ctrl+C
The script supports the following arguments:
- --model: Model name/path (default: Qwen/Qwen3-Omni-30B-A3B-Instruct)
- --server-port: Port for vLLM server (default: 8091)
- --gradio-port: Port for Gradio demo (default: 7861)
- --deploy-config: Path to custom deploy config YAML file (optional)
- --server-host: Host for vLLM server (default: 0.0.0.0)
- --gradio-ip: IP for Gradio demo (default: 127.0.0.1)
- --share: Share Gradio demo publicly (creates a public link)
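The "wait for the server to be ready" step can also be done by polling an HTTP endpoint rather than watching the log. The sketch below is an alternative to what the script does, not a copy of it. It assumes the server answers GET /health with HTTP 200 once it is up, and wait_until_ready and http_ok are hypothetical helpers.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], max_wait_s: float = 300.0,
                     interval_s: float = 1.0) -> bool:
    """Call probe repeatedly until it succeeds or max_wait_s elapses."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    ready = wait_until_ready(lambda: http_ok("http://localhost:8091/health"), max_wait_s=3)
    print("ready" if ready else "not ready")
```

Passing the probe as a callable keeps the retry loop independent of how readiness is detected.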
Option 2: Manual Launch (Two-Step Process)¶
Step 1: Launch the vLLM API server
If you have a custom deploy config file:
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --deploy-config /path/to/deploy_config_file
Step 2: Run the Gradio demo
In a separate terminal:
python gradio_demo.py --model Qwen/Qwen3-Omni-30B-A3B-Instruct --api-base http://localhost:8091/v1 --port 7861
Then open http://localhost:7861/ in your local browser to interact with the web UI.
The gradio script supports the following arguments:
- --model: Model name/path (should match the server model)
- --api-base: Base URL for the vLLM API server (default: http://localhost:8091/v1)
- --ip: Host/IP for Gradio server (default: 127.0.0.1)
- --port: Port for Gradio server (default: 7861)
- --share: Share the Gradio demo publicly (creates a public link)
Example materials¶
gradio_demo.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/qwen3_omni/gradio_demo.py.
openai_realtime_client.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/qwen3_omni/openai_realtime_client.py.
qwen3_omni_moe_thinking.yaml
# Stage config for running Qwen3-Omni-MoE-Thinking (text-only output)
# This config is for models like Qwen3-Omni-30B-A3B-Thinking that only have the
# thinker component and do not support audio output.
#
# Single stage: Thinker (multimodal understanding + text generation)
# The following config has been verified on 2x H100-80G GPUs.
stage_args:
  - stage_id: 0
    runtime:
      devices: "0,1"
    engine_args:
      model_stage: thinker
      max_num_seqs: 1
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.9
      enforce_eager: true
      trust_remote_code: true
      engine_output_type: text
      distributed_executor_backend: "mp"
      enable_prefix_caching: false
      hf_config_name: thinker_config
      tensor_parallel_size: 2
    final_output: true
    final_output_type: text
    is_comprehension: true
    default_sampling_params:
      temperature: 0.4
      top_p: 0.9
      top_k: 1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.05
run_curl_multimodal_generation.sh
#!/usr/bin/env bash
set -euo pipefail
# Default query type
QUERY_TYPE="${1:-use_video}"
# Default modalities argument
MODALITIES="${2:-null}"
# Validate query type
if [[ ! "$QUERY_TYPE" =~ ^(text|use_audio|use_image|use_video)$ ]]; then
echo "Error: Invalid query type '$QUERY_TYPE'"
echo "Usage: $0 [text|use_audio|use_image|use_video] [modalities]"
echo " text: Text query"
echo " use_audio: Audio + Text query"
echo " use_image: Image + Text query"
echo " use_video: Video + Text query"
echo " modalities: Modalities parameter (default: null)"
exit 1
fi
SEED=42
thinker_sampling_params='{
"temperature": 0.4,
"top_p": 0.9,
"top_k": 1,
"max_tokens": 16384,
"seed": 42,
"repetition_penalty": 1.05,
"stop_token_ids": [151645]
}'
talker_sampling_params='{
"temperature": 0.9,
"top_k": 50,
"max_tokens": 4096,
"seed": 42,
"detokenize": false,
"repetition_penalty": 1.05,
"stop_token_ids": [2150]
}'
code2wav_sampling_params='{
"temperature": 0.0,
"top_p": 1.0,
"top_k": -1,
"max_tokens": 65536,
"seed": 42,
"detokenize": true,
"repetition_penalty": 1.1
}'
# The sampling params above are optional; defaults come from the stage_configs of the corresponding model.
# Define URLs for assets
MARY_HAD_LAMB_AUDIO_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg"
CHERRY_BLOSSOM_IMAGE_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg"
SAMPLE_VIDEO_URL="https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4"
# Build user content and extra fields based on query type
case "$QUERY_TYPE" in
text)
user_content='[
{
"type": "text",
"text": "Explain the system architecture for a scalable audio generation pipeline. Answer in 15 words."
}
]'
sampling_params_list='[
'"$thinker_sampling_params"',
'"$talker_sampling_params"',
'"$code2wav_sampling_params"'
]'
mm_processor_kwargs="{}"
;;
use_audio)
user_content='[
{
"type": "audio_url",
"audio_url": {
"url": "'"$MARY_HAD_LAMB_AUDIO_URL"'"
}
},
{
"type": "text",
"text": "What is the content of this audio?"
}
]'
sampling_params_list='[
'"$thinker_sampling_params"',
'"$talker_sampling_params"',
'"$code2wav_sampling_params"'
]'
mm_processor_kwargs="{}"
;;
use_image)
user_content='[
{
"type": "image_url",
"image_url": {
"url": "'"$CHERRY_BLOSSOM_IMAGE_URL"'"
}
},
{
"type": "text",
"text": "What is the content of this image?"
}
]'
sampling_params_list='[
'"$thinker_sampling_params"',
'"$talker_sampling_params"',
'"$code2wav_sampling_params"'
]'
mm_processor_kwargs="{}"
;;
use_video)
user_content='[
{
"type": "video_url",
"video_url": {
"url": "'"$SAMPLE_VIDEO_URL"'"
}
},
{
"type": "text",
"text": "Why is this video funny?"
}
]'
sampling_params_list='[
'"$thinker_sampling_params"',
'"$talker_sampling_params"',
'"$code2wav_sampling_params"'
]'
mm_processor_kwargs="{}"
;;
esac
echo "Running query type: $QUERY_TYPE"
echo ""
request_body=$(cat <<EOF
{
"model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
"sampling_params_list": $sampling_params_list,
"mm_processor_kwargs": $mm_processor_kwargs,
"modalities": $MODALITIES,
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
}
]
},
{
"role": "user",
"content": $user_content
}
]
}
EOF
)
output=$(curl -sS --retry 3 --retry-delay 3 --retry-connrefused \
-X POST http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d "$request_body")
# Only the text content of the first choice is shown. Audio content is base64-encoded binary data, so it is not printed here.
echo "Output of request: $(echo "$output" | jq '.choices[0].message.content')"
run_gradio_demo.sh
#!/bin/bash
# Convenience script to launch both vLLM server and Gradio demo for Qwen3-Omni
#
# Usage:
# ./run_gradio_demo.sh [OPTIONS]
#
# Example:
# ./run_gradio_demo.sh --model Qwen/Qwen3-Omni-30B-A3B-Instruct --server-port 8091 --gradio-port 7861
set -e
# Default values
MODEL="Qwen/Qwen3-Omni-30B-A3B-Instruct"
SERVER_PORT=8091
GRADIO_PORT=7861
STAGE_CONFIGS_PATH=""
SERVER_HOST="0.0.0.0"
GRADIO_IP="127.0.0.1"
GRADIO_SHARE=false
# Parse command line arguments
while [[ $# -gt 0 ]]; do
case $1 in
--model)
MODEL="$2"
shift 2
;;
--server-port)
SERVER_PORT="$2"
shift 2
;;
--gradio-port)
GRADIO_PORT="$2"
shift 2
;;
--stage-configs-path)
STAGE_CONFIGS_PATH="$2"
shift 2
;;
--server-host)
SERVER_HOST="$2"
shift 2
;;
--gradio-ip)
GRADIO_IP="$2"
shift 2
;;
--share)
GRADIO_SHARE=true
shift
;;
--help)
echo "Usage: $0 [OPTIONS]"
echo ""
echo "Options:"
echo " --model MODEL Model name/path (default: Qwen/Qwen3-Omni-30B-A3B-Instruct)"
echo " --server-port PORT Port for vLLM server (default: 8091)"
echo " --gradio-port PORT Port for Gradio demo (default: 7861)"
echo " --stage-configs-path PATH Path to custom stage configs YAML file (optional)"
echo " --server-host HOST Host for vLLM server (default: 0.0.0.0)"
echo " --gradio-ip IP IP for Gradio demo (default: 127.0.0.1)"
echo " --share Share Gradio demo publicly"
echo " --help Show this help message"
echo ""
exit 0
;;
*)
echo "Unknown option: $1"
echo "Use --help for usage information"
exit 1
;;
esac
done
# Get the directory where this script is located
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
API_BASE="http://localhost:${SERVER_PORT}/v1"
HEALTH_URL="http://localhost:${SERVER_PORT}/health"
echo "=========================================="
echo "Starting vLLM-Omni Gradio Demo"
echo "=========================================="
echo "Model: $MODEL"
echo "Server: http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio: http://${GRADIO_IP}:${GRADIO_PORT}"
echo "=========================================="
# Build vLLM server command
SERVER_CMD=("vllm" "serve" "$MODEL" "--omni" "--port" "$SERVER_PORT" "--host" "$SERVER_HOST")
if [ -n "$STAGE_CONFIGS_PATH" ]; then
SERVER_CMD+=("--stage-configs-path" "$STAGE_CONFIGS_PATH")
fi
# Function to cleanup on exit
cleanup() {
echo ""
echo "Shutting down..."
if [ -n "$SERVER_PID" ]; then
echo "Stopping vLLM server (PID: $SERVER_PID)..."
kill "$SERVER_PID" 2>/dev/null || true
wait "$SERVER_PID" 2>/dev/null || true
fi
if [ -n "$GRADIO_PID" ]; then
echo "Stopping Gradio demo (PID: $GRADIO_PID)..."
kill "$GRADIO_PID" 2>/dev/null || true
wait "$GRADIO_PID" 2>/dev/null || true
fi
echo "Cleanup complete"
exit 0
}
# Set up signal handlers
trap cleanup SIGINT SIGTERM
# Start vLLM server with output shown in real-time and saved to log
echo ""
echo "Starting vLLM server..."
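LOG_FILE="/tmp/vllm_server_${SERVER_PORT}.log"
# Use process substitution instead of a pipe so that $! is the PID of the
# vLLM server itself rather than the PID of tee.
"${SERVER_CMD[@]}" > >(tee "$LOG_FILE") 2>&1 &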
SERVER_PID=$!
# Start a background process to monitor the log for startup completion
STARTUP_COMPLETE=false
TAIL_PID=""
# Function to cleanup tail process
cleanup_tail() {
if [ -n "$TAIL_PID" ]; then
kill "$TAIL_PID" 2>/dev/null || true
wait "$TAIL_PID" 2>/dev/null || true
fi
}
# Wait for server to be ready by checking log output
echo ""
echo "Waiting for vLLM server to be ready (checking for 'Application startup complete' message)..."
echo ""
# Monitor log file for startup completion message
MAX_WAIT=300 # 5 minutes timeout as fallback
ELAPSED=0
# Use a temporary file to track startup completion
STARTUP_FLAG="/tmp/vllm_startup_flag_${SERVER_PORT}.tmp"
rm -f "$STARTUP_FLAG"
# Start monitoring in background
(
tail -f "$LOG_FILE" 2>/dev/null | grep -m 1 "Application startup complete" > /dev/null && touch "$STARTUP_FLAG"
) &
TAIL_PID=$!
while [ $ELAPSED -lt $MAX_WAIT ]; do
# Check if startup flag file exists (startup complete)
if [ -f "$STARTUP_FLAG" ]; then
cleanup_tail
echo ""
echo "✓ vLLM server is ready!"
STARTUP_COMPLETE=true
break
fi
# Check if server process is still running
if ! kill -0 "$SERVER_PID" 2>/dev/null; then
cleanup_tail
echo ""
echo "Error: vLLM server failed to start (process terminated)"
wait "$SERVER_PID" 2>/dev/null || true
exit 1
fi
sleep 1
ELAPSED=$((ELAPSED + 1))
done
cleanup_tail
rm -f "$STARTUP_FLAG"
if [ "$STARTUP_COMPLETE" != "true" ]; then
echo ""
echo "Error: vLLM server did not complete startup within ${MAX_WAIT} seconds"
kill "$SERVER_PID" 2>/dev/null || true
exit 1
fi
# Start Gradio demo
echo ""
echo "Starting Gradio demo..."
cd "$SCRIPT_DIR"
GRADIO_CMD=("python" "gradio_demo.py" "--model" "$MODEL" "--api-base" "$API_BASE" "--ip" "$GRADIO_IP" "--port" "$GRADIO_PORT")
if [ "$GRADIO_SHARE" = true ]; then
GRADIO_CMD+=("--share")
fi
"${GRADIO_CMD[@]}" > /tmp/gradio_demo.log 2>&1 &
GRADIO_PID=$!
echo ""
echo "=========================================="
echo "Both services are running!"
echo "=========================================="
echo "vLLM Server: http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio Demo: http://${GRADIO_IP}:${GRADIO_PORT}"
echo ""
echo "Press Ctrl+C to stop both services"
echo "=========================================="
echo ""
# Wait for either process to exit
wait $SERVER_PID $GRADIO_PID || true
cleanup
streaming_video_client.py
#!/usr/bin/env python3
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Example WebSocket client for the /v1/video/chat/stream endpoint.
Sends video frames from a local file (or generates synthetic ones), submits a
query, and prints the streamed text response.
Requirements:
pip install websockets pillow
Usage:
# With a video file (requires opencv-python):
python streaming_video_client.py --video my_clip.mp4 \\
--query "What is happening in this video?"
# Synthetic frames (no extra deps):
python streaming_video_client.py \\
--query "Describe what you see." \\
--synthetic-frames 10
# With audio (Phase 3):
python streaming_video_client.py --video my_clip.mp4 \\
--audio my_audio.pcm \\
--query "What is the person saying and doing?"
"""
from __future__ import annotations
import argparse
import asyncio
import base64
import io
import json
import sys
try:
    import websockets
except ImportError:
    print("Please install websockets: pip install websockets")
    sys.exit(1)
from PIL import Image
def _generate_synthetic_frame(index: int, width: int = 320, height: int = 240) -> bytes:
    """Generate a simple synthetic JPEG frame filled with an index-dependent colour."""
    r = (index * 37) % 256
    g = (index * 73) % 256
    b = (index * 113) % 256
    img = Image.new("RGB", (width, height), (r, g, b))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=80)
    return buf.getvalue()
def _load_video_frames(path: str, max_frames: int = 64, fps: int = 2) -> list[bytes]:
    """Extract frames from a video file using OpenCV."""
    try:
        import cv2
    except ImportError:
        print("opencv-python is required to read video files: pip install opencv-python")
        sys.exit(1)
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        print(f"Cannot open video: {path}")
        sys.exit(1)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frame_interval = max(1, int(video_fps / fps))
    frames: list[bytes] = []
    idx = 0
    while len(frames) < max_frames:
        ret, frame = cap.read()
        if not ret:
            break
        if idx % frame_interval == 0:
            _, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
            frames.append(buf.tobytes())
        idx += 1
    cap.release()
    print(f"Loaded {len(frames)} frames from {path} (interval={frame_interval})")
    return frames
async def run(args: argparse.Namespace) -> None:
    uri = f"ws://{args.host}:{args.port}/v1/video/chat/stream"
    # Prepare frames
    if args.video:
        frames = _load_video_frames(args.video, max_frames=args.max_frames, fps=args.fps)
    else:
        frames = [_generate_synthetic_frame(i) for i in range(args.synthetic_frames)]
        print(f"Generated {len(frames)} synthetic frames")
    # Prepare audio (optional, Phase 3)
    audio_data: bytes | None = None
    if args.audio:
        with open(args.audio, "rb") as f:
            audio_data = f.read()
        print(f"Loaded audio: {len(audio_data)} bytes")
    async with websockets.connect(uri, max_size=16 * 1024 * 1024) as ws:
        # 1. Send session.config
        config = {
            "type": "session.config",
            "model": args.model,
            "modalities": ["text", "audio"] if audio_data else ["text"],
            "max_frames": args.max_frames,
            "num_frames": args.num_sample_frames,
            "enable_frame_filter": args.evs,
            "frame_filter_threshold": args.evs_threshold,
            "use_audio_in_video": bool(audio_data),
        }
        await ws.send(json.dumps(config))
        print(f"Sent session.config: model={args.model} evs={args.evs}")
        # 2. Send frames
        for i, frame in enumerate(frames):
            msg = {
                "type": "video.frame",
                "data": base64.b64encode(frame).decode(),
            }
            await ws.send(json.dumps(msg))
            if (i + 1) % 10 == 0:
                print(f" Sent {i + 1}/{len(frames)} frames")
        print(f"Sent all {len(frames)} frames")
        # 3. Send audio chunks (Phase 3)
        if audio_data:
            chunk_size = 16000 * 2  # 1 second of 16 kHz 16-bit PCM
            for offset in range(0, len(audio_data), chunk_size):
                chunk = audio_data[offset : offset + chunk_size]
                msg = {
                    "type": "audio.chunk",
                    "data": base64.b64encode(chunk).decode(),
                }
                await ws.send(json.dumps(msg))
            print(f"Sent audio in {(len(audio_data) + chunk_size - 1) // chunk_size} chunks")
        # 4. Send query, then immediately send video.done so the server
        # knows the session is complete (avoids deadlock where client
        # waits for session.done while server waits for video.done).
        await ws.send(json.dumps({"type": "video.query", "text": args.query}))
        print(f"\nQuery: {args.query}")
        print("Response: ", end="", flush=True)
        # Signal end of session right after the query. The server will
        # process the query first (it's already queued), then handle
        # video.done and reply with session.done.
        await ws.send(json.dumps({"type": "video.done"}))
        # 5. Receive response until session.done
        recv_timeout = 120  # seconds; avoids an infinite hang if the server stalls
        while True:
            raw = await asyncio.wait_for(ws.recv(), timeout=recv_timeout)
            data = json.loads(raw)
            msg_type = data.get("type")
            if msg_type == "response.text.delta":
                print(data.get("delta", ""), end="", flush=True)
            elif msg_type == "response.text.done":
                print()  # newline
            elif msg_type == "response.evs_stats":
                retained = data.get("retained_count", 0)
                dropped = data.get("dropped_count", 0)
                rate = data.get("drop_rate", 0)
                print(f"\nEVS stats: retained={retained} dropped={dropped} drop_rate={rate:.1%}")
            elif msg_type == "session.done":
                print("Session complete.")
                break
            elif msg_type == "error":
                print(f"\nError: {data.get('message')}")
                break
            elif msg_type == "response.start":
                pass  # expected
            else:
                print(f"\n[unknown message] {data}")
def main() -> None:
    parser = argparse.ArgumentParser(description="Streaming video chat client")
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--model", default="Qwen/Qwen3-Omni-MoE")
    parser.add_argument("--video", help="Path to video file (requires opencv-python)")
    parser.add_argument("--audio", help="Path to raw PCM 16kHz audio file (Phase 3)")
    parser.add_argument("--query", default="What do you see in this video?")
    parser.add_argument(
        "--synthetic-frames", type=int, default=10, help="Number of synthetic frames if --video is not set"
    )
    parser.add_argument("--max-frames", type=int, default=64)
    parser.add_argument("--num-sample-frames", type=int, default=16)
    parser.add_argument("--fps", type=int, default=2, help="Frame extraction rate from video")
    parser.add_argument(
        "--no-evs", dest="evs", action="store_false", help="Disable EVS frame filtering (enabled by default)"
    )
    parser.set_defaults(evs=True)
    parser.add_argument("--evs-threshold", type=float, default=0.95)
    args = parser.parse_args()
    asyncio.run(run(args))


if __name__ == "__main__":
    main()
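The message order used by streaming_video_client.py can be summarised without opening a socket. The sketch below rebuilds that sequence for the text-only case. build_messages is a hypothetical helper, and it leaves out the remaining session.config fields (frame limits, EVS settings, audio flag) for brevity.

```python
import base64
import json


def build_messages(frames: list, query: str, model: str) -> list:
    """Return the JSON messages in the order the client sends them (text-only case)."""
    messages = [json.dumps({"type": "session.config", "model": model, "modalities": ["text"]})]
    for frame in frames:
        # Each JPEG frame is base64-encoded into its own video.frame message.
        messages.append(json.dumps({"type": "video.frame",
                                    "data": base64.b64encode(frame).decode()}))
    messages.append(json.dumps({"type": "video.query", "text": query}))
    # video.done follows the query immediately, as in the client.
    messages.append(json.dumps({"type": "video.done"}))
    return messages


if __name__ == "__main__":
    for raw in build_messages([b"\xff\xd8fake-jpeg"], "What do you see?", "Qwen/Qwen3-Omni-MoE"):
        print(json.loads(raw)["type"])
```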