Skip to content

Qwen3-Omni

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/qwen3_omni.

🛠️ Installation

Please refer to README.md

Run examples (Qwen3-Omni)

Launch the Server

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091

The default deployment configuration, situated at vllm_omni/deploy/qwen3_omni_moe.yaml, is resolved and loaded automatically via the model registry, obviating the --deploy-config flag in standard deployment topologies. Asynchronous chunk streaming operates as enabled by default within this bundled configuration. Additionally, NPU, ROCm, and XPU per-platform configuration deltas are deterministically merged from the platforms: section of the corresponding YAML.

To explicitly utilize a custom deployment YAML, mandate the configuration path accordingly:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --deploy-config /path/to/your_deploy_config.yaml

For the bundled 3x-GPU multi-replica layout (talker/code2wav scale-out), use:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --deploy-config vllm_omni/deploy/qwen3_omni_moe_multi_replicas.yaml

Launch individual stages (stage-based CLI)

Use the stage-based CLI when you want to run one stage per process. The example below pins Stage 0 to GPU 0 and Stage 1/2 to GPU 1 via CUDA_VISIBLE_DEVICES.

1. Stage 0 (Thinker + API server)

CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --port 8091 \
    --stage-id 0 \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000

2. Stage 1 (Talker)

CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --stage-id 1 \
    --headless \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000

3. Stage 2 (Code2Wav)

CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --stage-id 2 \
    --headless \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000

Append --deploy-config /path/to/your_deploy_config.yaml to each node invocation if it is necessary to explicitly override the bundled deployment YAML schema.

For standard unified-process launcher, stage-specific CLI configuration tuning is conventionally implemented via the --stage-overrides directive, as demonstrated below:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'

Conversely, within the stage-based CLI paradigm, --stage-overrides modifiers are typically unnecessary for this category of optimization. Given that each instantiation strictly initiates a single functional stage, parameter flags can be systematically assigned directly onto that specific stage's command sequence:

CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --stage-id 1 \
    --headless \
    --gpu-memory-utilization 0.5 \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000

Tuning deployment parameters

Most engine knobs (max_num_batched_tokens, max_model_len, enforce_eager, gpu_memory_utilization, tensor_parallel_size, …) can be tuned without editing the YAML. There are three layers, in increasing specificity:

1. Global CLI flags (apply to every stage)

# Tighter memory budget on a smaller GPU
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --gpu-memory-utilization 0.85

# Disable cudagraphs (e.g. for debugging)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --enforce-eager

# Reduce context length
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --max-model-len 32768

# Toggle prefix caching on every stage (yaml default: off)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --enable-prefix-caching
# ...or force it off if the yaml turned it on
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --no-enable-prefix-caching

# Toggle pipeline-wide async chunked streaming between stages
# (yaml default for qwen3_omni_moe: on)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --no-async-chunk

For the TTS counterpart (synchronous codec variant), see the Qwen3-TTS section of the online TTS hub: examples/online_serving/text_to_speech/README.md#qwen3-tts.

Explicit CLI flags override the deploy YAML (which itself overrides the parser defaults). If you don't pass a flag, the YAML value wins.

Note on --no-async-chunk: Flips the deploy yaml's async_chunk: bool. Pipelines that implement alternate processor functions for chunked vs end-to-end modes (e.g. qwen3_tts code2wav) dispatch automatically based on that bool — no extra flag or variant yaml is needed.

⚠️ For multi-stage models that share GPUs (qwen3_omni_moe by default shares cuda:1 between stages 1 and 2), avoid using global memory flags. A global --gpu-memory-utilization 0.85 would apply to every stage and oversubscribe the shared device. Use per-stage overrides instead — see below.

# Lower stage 1's memory budget; leave others at the YAML default
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --stage-overrides '{
        "1": {"gpu_memory_utilization": 0.5},
        "2": {"max_num_batched_tokens": 65536}
    }'

Per-stage values are always treated as explicit and beat YAML defaults for the named stage. Other stages keep their YAML values.

If you switch to the stage-based CLI, the same per-stage tuning can usually be passed directly on that stage's command instead of using --stage-overrides.

3. Custom deploy YAML

When per-stage overrides get long, write a small overlay YAML that inherits from the bundled default:

# my_qwen3_omni_overrides.yaml
base_config: /path/to/vllm_omni/deploy/qwen3_omni_moe.yaml

stages:
  - stage_id: 0
    max_num_batched_tokens: 65536
    enforce_eager: true
  - stage_id: 1
    gpu_memory_utilization: 0.5
  - stage_id: 2
    max_model_len: 8192

Then start the server with --deploy-config my_qwen3_omni_overrides.yaml. The base_config: line tells the loader to inherit everything else (stages, connectors, edges, platforms section) from the bundled production YAML, so you only need to spell out the deltas.

4. Multi-node deployment (cross-host transfer connector)

The bundled qwen3_omni_moe.yaml uses SharedMemoryConnector between stages, which only works when all stages run on the same physical host. For cross-node deployments, write a small overlay YAML that swaps in a network-capable connector (e.g. MooncakeStoreConnector) and re-points each stage's connector wiring at it. The connector spec carries your own server addresses — there is no checked-in default because every cluster is different.

# my_qwen3_omni_multinode.yaml
base_config: /path/to/vllm_omni/deploy/qwen3_omni_moe.yaml

connectors:
  mooncake_connector:
    name: MooncakeStoreConnector
    extra:
      host: "127.0.0.1"
      metadata_server: "http://YOUR_METADATA_HOST:8080/metadata"
      master: "YOUR_MASTER_HOST:50051"
      segment: 512000000    # 512 MB transfer segment
      localbuf: 64000000    # 64 MB local buffer
      proto: "tcp"

stages:
  - stage_id: 0
    output_connectors:
      to_stage_1: mooncake_connector
  - stage_id: 1
    input_connectors:
      from_stage_0: mooncake_connector
    output_connectors:
      to_stage_2: mooncake_connector
  - stage_id: 2
    input_connectors:
      from_stage_1: mooncake_connector

Then launch with --deploy-config my_qwen3_omni_multinode.yaml. Same pattern works for Qwen2.5-Omni — replace base_config: with the path to vllm_omni/deploy/qwen2_5_omni.yaml.

⚠️ Replace YOUR_METADATA_HOST / YOUR_MASTER_HOST with the actual mooncake server addresses for your cluster. The base_config: overlay inherits all stage budgets, devices, and edges from the bundled prod YAML — you only need to spell out the connector swap.

Send Multi-modal Request

Get into the example folder

cd examples/online_serving/qwen3_omni

Send request via python

python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py --model Qwen/Qwen3-Omni-30B-A3B-Instruct --query-type use_image --port 8091 --host "localhost"

Realtime WebSocket client (openai_realtime_client.py)

openai_realtime_client.py connects to ws://<host>:<port>/v1/realtime, streams a local WAV as PCM16 mono @ 16 kHz in fixed-size chunks (OpenAI-style input_audio_buffer.append / commit), and receives response.audio.delta (incremental PCM for the reply) plus transcription.* events. By default it concatenates audio deltas and writes --output-wav (model output is typically 24 kHz). Optional --delta-dump-dir saves each delta as delta_000001.wav, … for debugging.

Streaming input works well for translation-style use cases; if the Thinker runs while input is still incomplete, consider limiting max_tokens in your session / server defaults to avoid over-generation.

Dependencies:

pip install websockets

From this directory (examples/online_serving/qwen3_omni):

python openai_realtime_client.py \
  --url ws://localhost:8091/v1/realtime \
  --model Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --input-wav /path/to/input_16k_mono.wav \
  --output-wav realtime_output.wav \
  --delta-dump-dir ./rt_delta_wavs

Arguments:

Flag Default Description
--url ws://localhost:8091/v1/realtime Full WebSocket URL including path
--model Qwen/Qwen3-Omni-30B-A3B-Instruct Must match the served model (sent in session.update)
--input-wav (required) Input WAV: mono, 16-bit PCM, 16 kHz
--output-wav realtime_output.wav Output path for concatenated reply audio
--output-text (optional) If set, write final transcription text to this path
--chunk-ms 200 Size of each uploaded audio chunk (milliseconds of audio)
--send-delay-ms 0 Delay between chunk sends (simulate realtime upload)
--delta-dump-dir (optional) Directory to write per-response.audio.delta WAV files
--num-requests 1 Number of sequential sessions (see --concurrency)
--concurrency 1 Max concurrent WebSocket sessions when --num-requests > 1

Ensure the server is running, for example:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091

The Python client supports the following command-line arguments:

  • --query-type (or -q): Query type (default: use_video). Options: text, use_audio, use_image, use_video
  • --model (or -m): Model name/path (default: Qwen/Qwen3-Omni-30B-A3B-Instruct)
  • --video-path (or -v): Path to local video file or URL. If not provided and query-type is use_video, uses default video URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs. Example: --video-path /path/to/video.mp4 or --video-path https://example.com/video.mp4
  • --image-path (or -i): Path to local image file or URL. If not provided and query-type is use_image, uses default image URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common image formats: JPEG, PNG, GIF, WebP. Example: --image-path /path/to/image.jpg or --image-path https://example.com/image.png
  • --audio-path (or -a): Path to local audio file or URL. If not provided and query-type is use_audio, uses default audio URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: --audio-path /path/to/audio.wav or --audio-path https://example.com/audio.mp3
  • --prompt (or -p): Custom text prompt/question. If not provided, uses default prompt for the selected query type. Example: --prompt "What are the main activities shown in this video?"
  • --speaker: TTS speaker/voice for audio output when requesting audio (e.g. ethan, chelsie, aiden). Omit to use the model default. Example: --speaker "chelsie"

For example, to use a local video file with custom prompt:

python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
    --query-type use_video \
    --video-path /path/to/your/video.mp4 \
    --model Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --prompt "What are the main activities shown in this video?"

Send request via curl

bash run_curl_multimodal_generation.sh use_image

FAQ

Modality control

You can control output modalities to specify which types of output the model should generate. This is useful when you only need text output and want to skip audio generation stages for better performance.

Supported modalities

Modalities Output
["text"] Text only
["audio"] Audio only
["text", "audio"] Text + Audio
Not specified Text + Audio (default)

Using curl

Text only

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": "Describe vLLM in brief."}],
    "modalities": ["text"]
  }'

Text + Audio

curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": "Describe vLLM in brief."}],
    "modalities": ["audio"]
  }' | jq -r '.choices[0].message.audio.data' | base64 -d > output.wav

Using Python client

python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
    --query-type use_image \
    --model Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --modalities text

Using OpenAI Python SDK

Text only

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Describe vLLM in brief."}],
    modalities=["text"]
)
print(response.choices[0].message.content)

Text + Audio

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Describe vLLM in brief."}],
    modalities=["text", "audio"]
)
# Response contains two choices: one with text, one with audio
print(response.choices[0].message.content)  # Text response

# Save audio to file
audio_data = base64.b64decode(response.choices[1].message.audio.data)
with open("output.wav", "wb") as f:
    f.write(audio_data)

Speaker selection

When requesting audio output, you can choose the TTS speaker (voice) used for synthesis. If not specified, the model uses its default speaker.

Using curl

Pass a speaker field in the request body:

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "modalities": ["audio"],
    "speaker": "chelsie"
  }'

Using Python client

Use the --speaker argument when generating audio:

python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
    --query-type use_image \
    --modalities audio \
    --model Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --speaker "chelsie"

Using OpenAI Python SDK

Pass speaker in extra_body:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    modalities=["audio"],
    extra_body={"speaker": "chelsie"}
)
# Audio uses the specified speaker
print(response.choices[1].message.audio)

Supported speaker names depend on the model (e.g. Ethan, Chelsie, Aiden). Omit speaker to use the default.

Streaming Output

If you want to enable streaming output, please set the argument as below. The final output will be obtained just after generated by corresponding stage. We support both text streaming output and audio streaming output. Other modalities can output normally.

python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py \
    --query-type use_image \
    --model Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --stream

Run Local Web UI Demo

This Web UI demo allows users to interact with the model through a web browser.

Running Gradio Demo

The Gradio demo connects to a vLLM API server. You have two options:

The convenience script launches both the vLLM server and Gradio demo together:

./run_gradio_demo.sh --model Qwen/Qwen3-Omni-30B-A3B-Instruct --server-port 8091 --gradio-port 7861

This script will: 1. Start the vLLM server in the background 2. Wait for the server to be ready 3. Launch the Gradio demo 4. Handle cleanup when you press Ctrl+C

The script supports the following arguments: - --model: Model name/path (default: Qwen/Qwen3-Omni-30B-A3B-Instruct) - --server-port: Port for vLLM server (default: 8091) - --gradio-port: Port for Gradio demo (default: 7861) - --deploy-config: Path to custom deploy config YAML file (optional) - --server-host: Host for vLLM server (default: 0.0.0.0) - --gradio-ip: IP for Gradio demo (default: 127.0.0.1) - --share: Share Gradio demo publicly (creates a public link)

Option 2: Manual Launch (Two-Step Process)

Step 1: Launch the vLLM API server

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091

If you have custom stage configs file:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --deploy-config /path/to/deploy_config_file

Step 2: Run the Gradio demo

In a separate terminal:

python gradio_demo.py --model Qwen/Qwen3-Omni-30B-A3B-Instruct --api-base http://localhost:8091/v1 --port 7861

Then open http://localhost:7861/ on your local browser to interact with the web UI.

The gradio script supports the following arguments:

  • --model: Model name/path (should match the server model)
  • --api-base: Base URL for the vLLM API server (default: http://localhost:8091/v1)
  • --ip: Host/IP for Gradio server (default: 127.0.0.1)
  • --port: Port for Gradio server (default: 7861)
  • --share: Share the Gradio demo publicly (creates a public link)

Example materials

gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/qwen3_omni/gradio_demo.py.

openai_realtime_client.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/qwen3_omni/openai_realtime_client.py.

qwen3_omni_moe_thinking.yaml
# Stage config for running Qwen3-Omni-MoE-Thinking (text-only output)
# This config is for models like Qwen3-Omni-30B-A3B-Thinking that only have the
# thinker component and do not support audio output.
#
# Single stage: Thinker (multimodal understanding + text generation)

# The following config has been verified on 2x H100-80G GPUs.
stage_args:
  - stage_id: 0
    runtime:
      devices: "0,1"
    engine_args:
      model_stage: thinker
      max_num_seqs: 1
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.9
      enforce_eager: true
      trust_remote_code: true
      engine_output_type: text
      distributed_executor_backend: "mp"
      enable_prefix_caching: false
      hf_config_name: thinker_config
      tensor_parallel_size: 2
    final_output: true
    final_output_type: text
    is_comprehension: true
    default_sampling_params:
      temperature: 0.4
      top_p: 0.9
      top_k: 1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.05
run_curl_multimodal_generation.sh
#!/usr/bin/env bash
set -euo pipefail

# Default query type
QUERY_TYPE="${1:-use_video}"

# Default modalities argument
MODALITIES="${2:-null}"

# Validate query type
if [[ ! "$QUERY_TYPE" =~ ^(text|use_audio|use_image|use_video)$ ]]; then
    echo "Error: Invalid query type '$QUERY_TYPE'"
    echo "Usage: $0 [text|use_audio|use_image|use_video] [modalities]"
    echo "  text: Text query"
    echo "  use_audio: Audio + Text query"
    echo "  use_image: Image + Text query"
    echo "  use_video: Video + Text query"
    echo "  modalities: Modalities parameter (default: null)"
    exit 1
fi

SEED=42

thinker_sampling_params='{
  "temperature": 0.4,
  "top_p": 0.9,
  "top_k": 1,
  "max_tokens": 16384,
  "seed": 42,
  "repetition_penalty": 1.05,
  "stop_token_ids": [151645]
}'

talker_sampling_params='{
  "temperature": 0.9,
  "top_k": 50,
  "max_tokens": 4096,
  "seed": 42,
  "detokenize": false,
  "repetition_penalty": 1.05,
  "stop_token_ids": [2150]
}'

code2wav_sampling_params='{
  "temperature": 0.0,
  "top_p": 1.0,
  "top_k": -1,
  "max_tokens": 65536,
  "seed": 42,
  "detokenize": true,
  "repetition_penalty": 1.1
}'
# Above is optional, it has a default setting in stage_configs of the corresponding model.

# Define URLs for assets
MARY_HAD_LAMB_AUDIO_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg"
CHERRY_BLOSSOM_IMAGE_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg"
SAMPLE_VIDEO_URL="https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4"

# Build user content and extra fields based on query type
case "$QUERY_TYPE" in
  text)
    user_content='[
      {
        "type": "text",
        "text": "Explain the system architecture for a scalable audio generation pipeline. Answer in 15 words."
      }
    ]'
    sampling_params_list='[
      '"$thinker_sampling_params"',
      '"$talker_sampling_params"',
      '"$code2wav_sampling_params"'
    ]'
    mm_processor_kwargs="{}"
    ;;
  use_audio)
    user_content='[
        {
          "type": "audio_url",
          "audio_url": {
            "url": "'"$MARY_HAD_LAMB_AUDIO_URL"'"
          }
        },
        {
          "type": "text",
          "text": "What is the content of this audio?"
        }
      ]'
    sampling_params_list='[
      '"$thinker_sampling_params"',
      '"$talker_sampling_params"',
      '"$code2wav_sampling_params"'
    ]'
    mm_processor_kwargs="{}"
    ;;
  use_image)
    user_content='[
        {
          "type": "image_url",
          "image_url": {
            "url": "'"$CHERRY_BLOSSOM_IMAGE_URL"'"
          }
        },
        {
          "type": "text",
          "text": "What is the content of this image?"
        }
      ]'
    sampling_params_list='[
      '"$thinker_sampling_params"',
      '"$talker_sampling_params"',
      '"$code2wav_sampling_params"'
    ]'
    mm_processor_kwargs="{}"
    ;;
  use_video)
    user_content='[
        {
          "type": "video_url",
          "video_url": {
            "url": "'"$SAMPLE_VIDEO_URL"'"
          }
        },
        {
          "type": "text",
          "text": "Why is this video funny?"
        }
      ]'
    sampling_params_list='[
      '"$thinker_sampling_params"',
      '"$talker_sampling_params"',
      '"$code2wav_sampling_params"'
    ]'
    mm_processor_kwargs="{}"
    ;;
esac

echo "Running query type: $QUERY_TYPE"
echo ""

request_body=$(cat <<EOF
{
  "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
  "sampling_params_list": $sampling_params_list,
  "mm_processor_kwargs": $mm_processor_kwargs,
  "modalities": $MODALITIES,
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
        }
      ]
    },
    {
      "role": "user",
      "content": $user_content
    }
  ]
}
EOF
)

output=$(curl -sS --retry 3 --retry-delay 3 --retry-connrefused \
    -X POST http://localhost:8091/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$request_body")

# Here it only shows the text content of the first choice. Audio content has many binaries, so it's not displayed here.
echo "Output of request: $(echo "$output" | jq '.choices[0].message.content')"
run_gradio_demo.sh
#!/bin/bash
# Convenience script to launch both vLLM server and Gradio demo for Qwen3-Omni
#
# Usage:
#   ./run_gradio_demo.sh [OPTIONS]
#
# Example:
#   ./run_gradio_demo.sh --model Qwen/Qwen3-Omni-30B-A3B-Instruct --server-port 8091 --gradio-port 7861

set -e

# Default values
MODEL="Qwen/Qwen3-Omni-30B-A3B-Instruct"
SERVER_PORT=8091
GRADIO_PORT=7861
STAGE_CONFIGS_PATH=""
SERVER_HOST="0.0.0.0"
GRADIO_IP="127.0.0.1"
GRADIO_SHARE=false

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --model)
            MODEL="$2"
            shift 2
            ;;
        --server-port)
            SERVER_PORT="$2"
            shift 2
            ;;
        --gradio-port)
            GRADIO_PORT="$2"
            shift 2
            ;;
        --stage-configs-path)
            STAGE_CONFIGS_PATH="$2"
            shift 2
            ;;
        --server-host)
            SERVER_HOST="$2"
            shift 2
            ;;
        --gradio-ip)
            GRADIO_IP="$2"
            shift 2
            ;;
        --share)
            GRADIO_SHARE=true
            shift
            ;;
        --help)
            echo "Usage: $0 [OPTIONS]"
            echo ""
            echo "Options:"
            echo "  --model MODEL                 Model name/path (default: Qwen/Qwen3-Omni-30B-A3B-Instruct)"
            echo "  --server-port PORT            Port for vLLM server (default: 8091)"
            echo "  --gradio-port PORT            Port for Gradio demo (default: 7861)"
            echo "  --stage-configs-path PATH     Path to custom stage configs YAML file (optional)"
            echo "  --server-host HOST            Host for vLLM server (default: 0.0.0.0)"
            echo "  --gradio-ip IP                IP for Gradio demo (default: 127.0.0.1)"
            echo "  --share                       Share Gradio demo publicly"
            echo "  --help                        Show this help message"
            echo ""
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            echo "Use --help for usage information"
            exit 1
            ;;
    esac
done

# Get the directory where this script is located
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
API_BASE="http://localhost:${SERVER_PORT}/v1"
HEALTH_URL="http://localhost:${SERVER_PORT}/health"

echo "=========================================="
echo "Starting vLLM-Omni Gradio Demo"
echo "=========================================="
echo "Model: $MODEL"
echo "Server: http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio: http://${GRADIO_IP}:${GRADIO_PORT}"
echo "=========================================="

# Build vLLM server command
SERVER_CMD=("vllm" "serve" "$MODEL" "--omni" "--port" "$SERVER_PORT" "--host" "$SERVER_HOST")
if [ -n "$STAGE_CONFIGS_PATH" ]; then
    SERVER_CMD+=("--stage-configs-path" "$STAGE_CONFIGS_PATH")
fi

# Function to cleanup on exit
cleanup() {
    echo ""
    echo "Shutting down..."
    if [ -n "$SERVER_PID" ]; then
        echo "Stopping vLLM server (PID: $SERVER_PID)..."
        kill "$SERVER_PID" 2>/dev/null || true
        wait "$SERVER_PID" 2>/dev/null || true
    fi
    if [ -n "$GRADIO_PID" ]; then
        echo "Stopping Gradio demo (PID: $GRADIO_PID)..."
        kill "$GRADIO_PID" 2>/dev/null || true
        wait "$GRADIO_PID" 2>/dev/null || true
    fi
    echo "Cleanup complete"
    exit 0
}

# Set up signal handlers
trap cleanup SIGINT SIGTERM

# Start vLLM server with output shown in real-time and saved to log
echo ""
echo "Starting vLLM server..."
LOG_FILE="/tmp/vllm_server_${SERVER_PORT}.log"
"${SERVER_CMD[@]}" 2>&1 | tee "$LOG_FILE" &
SERVER_PID=$!

# Start a background process to monitor the log for startup completion
STARTUP_COMPLETE=false
TAIL_PID=""

# Function to cleanup tail process
cleanup_tail() {
    if [ -n "$TAIL_PID" ]; then
        kill "$TAIL_PID" 2>/dev/null || true
        wait "$TAIL_PID" 2>/dev/null || true
    fi
}

# Wait for server to be ready by checking log output
echo ""
echo "Waiting for vLLM server to be ready (checking for 'Application startup complete' message)..."
echo ""

# Monitor log file for startup completion message
MAX_WAIT=300  # 5 minutes timeout as fallback
ELAPSED=0

# Use a temporary file to track startup completion
STARTUP_FLAG="/tmp/vllm_startup_flag_${SERVER_PORT}.tmp"
rm -f "$STARTUP_FLAG"

# Start monitoring in background
(
    tail -f "$LOG_FILE" 2>/dev/null | grep -m 1 "Application startup complete" > /dev/null && touch "$STARTUP_FLAG"
) &
TAIL_PID=$!

while [ $ELAPSED -lt $MAX_WAIT ]; do
    # Check if startup flag file exists (startup complete)
    if [ -f "$STARTUP_FLAG" ]; then
        cleanup_tail
        echo ""
        echo "✓ vLLM server is ready!"
        STARTUP_COMPLETE=true
        break
    fi

    # Check if server process is still running
    if ! kill -0 "$SERVER_PID" 2>/dev/null; then
        cleanup_tail
        echo ""
        echo "Error: vLLM server failed to start (process terminated)"
        wait "$SERVER_PID" 2>/dev/null || true
        exit 1
    fi

    sleep 1
    ELAPSED=$((ELAPSED + 1))
done

cleanup_tail
rm -f "$STARTUP_FLAG"

if [ "$STARTUP_COMPLETE" != "true" ]; then
    echo ""
    echo "Error: vLLM server did not complete startup within ${MAX_WAIT} seconds"
    kill "$SERVER_PID" 2>/dev/null || true
    exit 1
fi

# Start Gradio demo
echo ""
echo "Starting Gradio demo..."
cd "$SCRIPT_DIR"
GRADIO_CMD=("python" "gradio_demo.py" "--model" "$MODEL" "--api-base" "$API_BASE" "--ip" "$GRADIO_IP" "--port" "$GRADIO_PORT")
if [ "$GRADIO_SHARE" = true ]; then
    GRADIO_CMD+=("--share")
fi

"${GRADIO_CMD[@]}" > /tmp/gradio_demo.log 2>&1 &
GRADIO_PID=$!

echo ""
echo "=========================================="
echo "Both services are running!"
echo "=========================================="
echo "vLLM Server: http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio Demo: http://${GRADIO_IP}:${GRADIO_PORT}"
echo ""
echo "Press Ctrl+C to stop both services"
echo "=========================================="
echo ""

# Wait for either process to exit
wait $SERVER_PID $GRADIO_PID || true

cleanup
streaming_video_client.py
#!/usr/bin/env python3
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Example WebSocket client for the /v1/video/chat/stream endpoint.

Sends video frames from a local file (or generates synthetic ones), submits a
query, and prints the streamed text response.

Requirements:
    pip install websockets pillow

Usage:
    # With a video file (requires opencv-python):
    python streaming_video_client.py --video my_clip.mp4 \\
        --query "What is happening in this video?"

    # Synthetic frames (no extra deps):
    python streaming_video_client.py \\
        --query "Describe what you see." \\
        --synthetic-frames 10

    # With audio (Phase 3):
    python streaming_video_client.py --video my_clip.mp4 \\
        --audio my_audio.pcm \\
        --query "What is the person saying and doing?"
"""

from __future__ import annotations

import argparse
import asyncio
import base64
import io
import json
import sys

try:
    import websockets
except ImportError:
    print("Please install websockets:  pip install websockets")
    sys.exit(1)

from PIL import Image


def _generate_synthetic_frame(index: int, width: int = 320, height: int = 240) -> bytes:
    """Generate a simple synthetic JPEG frame with a colour gradient."""
    r = (index * 37) % 256
    g = (index * 73) % 256
    b = (index * 113) % 256
    img = Image.new("RGB", (width, height), (r, g, b))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=80)
    return buf.getvalue()


def _load_video_frames(path: str, max_frames: int = 64, fps: int = 2) -> list[bytes]:
    """Extract frames from a video file using OpenCV."""
    try:
        import cv2
    except ImportError:
        print("opencv-python is required to read video files: pip install opencv-python")
        sys.exit(1)

    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        print(f"Cannot open video: {path}")
        sys.exit(1)

    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frame_interval = max(1, int(video_fps / fps))

    frames: list[bytes] = []
    idx = 0
    while len(frames) < max_frames:
        ret, frame = cap.read()
        if not ret:
            break
        if idx % frame_interval == 0:
            _, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
            frames.append(buf.tobytes())
        idx += 1

    cap.release()
    print(f"Loaded {len(frames)} frames from {path} (interval={frame_interval})")
    return frames


async def run(args: argparse.Namespace) -> None:
    uri = f"ws://{args.host}:{args.port}/v1/video/chat/stream"

    # Prepare frames
    if args.video:
        frames = _load_video_frames(args.video, max_frames=args.max_frames, fps=args.fps)
    else:
        frames = [_generate_synthetic_frame(i) for i in range(args.synthetic_frames)]
        print(f"Generated {len(frames)} synthetic frames")

    # Prepare audio (optional, Phase 3)
    audio_data: bytes | None = None
    if args.audio:
        with open(args.audio, "rb") as f:
            audio_data = f.read()
        print(f"Loaded audio: {len(audio_data)} bytes")

    async with websockets.connect(uri, max_size=16 * 1024 * 1024) as ws:
        # 1. Send session.config
        config = {
            "type": "session.config",
            "model": args.model,
            "modalities": ["text", "audio"] if audio_data else ["text"],
            "max_frames": args.max_frames,
            "num_frames": args.num_sample_frames,
            "enable_frame_filter": args.evs,
            "frame_filter_threshold": args.evs_threshold,
            "use_audio_in_video": bool(audio_data),
        }
        await ws.send(json.dumps(config))
        print(f"Sent session.config: model={args.model} evs={args.evs}")

        # 2. Send frames
        for i, frame in enumerate(frames):
            msg = {
                "type": "video.frame",
                "data": base64.b64encode(frame).decode(),
            }
            await ws.send(json.dumps(msg))
            if (i + 1) % 10 == 0:
                print(f"  Sent {i + 1}/{len(frames)} frames")
        print(f"Sent all {len(frames)} frames")

        # 3. Send audio chunks (Phase 3)
        if audio_data:
            chunk_size = 16000 * 2  # 1 second of 16 kHz 16-bit PCM
            for offset in range(0, len(audio_data), chunk_size):
                chunk = audio_data[offset : offset + chunk_size]
                msg = {
                    "type": "audio.chunk",
                    "data": base64.b64encode(chunk).decode(),
                }
                await ws.send(json.dumps(msg))
            print(f"Sent audio in {(len(audio_data) + chunk_size - 1) // chunk_size} chunks")

        # 4. Send query, then immediately send video.done so the server
        #    knows the session is complete (avoids deadlock where client
        #    waits for session.done while server waits for video.done).
        await ws.send(json.dumps({"type": "video.query", "text": args.query}))
        print(f"\nQuery: {args.query}")
        print("Response: ", end="", flush=True)

        # Signal end of session right after the query.  The server will
        # process the query first (it's already queued), then handle
        # video.done and reply with session.done.
        await ws.send(json.dumps({"type": "video.done"}))

        # 5. Receive response until session.done
        recv_timeout = 120  # seconds — avoid infinite hang if server stalls
        while True:
            raw = await asyncio.wait_for(ws.recv(), timeout=recv_timeout)
            data = json.loads(raw)
            msg_type = data.get("type")

            if msg_type == "response.text.delta":
                print(data.get("delta", ""), end="", flush=True)
            elif msg_type == "response.text.done":
                print()  # newline
            elif msg_type == "response.evs_stats":
                retained = data.get("retained_count", 0)
                dropped = data.get("dropped_count", 0)
                rate = data.get("drop_rate", 0)
                print(f"\nEVS stats: retained={retained} dropped={dropped} drop_rate={rate:.1%}")
            elif msg_type == "session.done":
                print("Session complete.")
                break
            elif msg_type == "error":
                print(f"\nError: {data.get('message')}")
                break
            elif msg_type == "response.start":
                pass  # expected
            else:
                print(f"\n[unknown message] {data}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Streaming video chat client")
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--model", default="Qwen/Qwen3-Omni-MoE")
    parser.add_argument("--video", help="Path to video file (requires opencv-python)")
    parser.add_argument("--audio", help="Path to raw PCM 16kHz audio file (Phase 3)")
    parser.add_argument("--query", default="What do you see in this video?")
    parser.add_argument(
        "--synthetic-frames", type=int, default=10, help="Number of synthetic frames if --video is not set"
    )
    parser.add_argument("--max-frames", type=int, default=64)
    parser.add_argument("--num-sample-frames", type=int, default=16)
    parser.add_argument("--fps", type=int, default=2, help="Frame extraction rate from video")
    parser.add_argument(
        "--no-evs", dest="evs", action="store_false", help="Disable EVS frame filtering (enabled by default)"
    )
    parser.set_defaults(evs=True)
    parser.add_argument("--evs-threshold", type=float, default=0.95)
    args = parser.parse_args()
    asyncio.run(run(args))


if __name__ == "__main__":
    main()