Skip to content

Streaming Video Input API

vLLM-Omni provides a WebSocket API for streaming video frames and optional audio chunks into Qwen3-Omni, then asking questions over the buffered session context.

Each server instance runs a single model specified at startup with vllm serve <model> --omni.

Quick Start

Start the Server

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --deploy-config vllm_omni/deploy/qwen3_omni.yaml \
    --omni \
    --port 8000 \
    --trust-remote-code

Run the Example Client

python examples/online_serving/qwen3_omni/streaming_video_client.py \
    --url ws://localhost:8000/v1/video/chat/stream \
    --video /path/to/video.mp4 \
    --query "Describe what is happening in the video."

API Reference

Endpoint

WebSocket /v1/video/chat/stream

Protocol

Direction Type Required fields Description
Client -> Server session.config none First message. Configures output modalities, frame sampling, EVS, and prompts.
Client -> Server video.frame data Base64 JPEG/PNG frame.
Client -> Server audio.chunk data Base64 PCM16 16 kHz mono audio bytes.
Client -> Server video.query text Ask a question over the buffered frames and audio.
Client -> Server video.done none End the WebSocket session.
Server -> Client response.start none Query generation started.
Server -> Client response.text.delta delta Incremental text output.
Server -> Client response.text.done text Final text output for the query.
Server -> Client response.audio.delta data, format Incremental generated audio, base64 WAV.
Server -> Client response.audio.done none Audio output finished.
Server -> Client session.done none Session closed.
Server -> Client error message Recoverable protocol or generation error.

session.config Fields

Field Type Default Description
model string or null null Optional model name. Usually omitted because the server hosts one model.
modalities list[string] ["text", "audio"] Output modalities. Use ["text"], ["audio"], or both.
num_frames integer, 1-128 4 Number of buffered frames sampled for each query.
max_frames integer, 1-256 50 Maximum retained frame buffer size. Oldest frames are evicted first.
system_prompt string or null null Optional custom system prompt.
use_audio_in_video bool true Include streamed audio chunks in multimodal video understanding when audio is present.
sampling_params_list list or null null Optional per-stage sampling parameter overrides.
enable_frame_filter bool true Enable EVS near-duplicate frame filtering.
frame_filter_threshold float, 0.0-1.0 0.95 EVS similarity threshold. Higher keeps more frames; lower drops more near-duplicates.

Legacy Aliases

The server accepts these legacy field names and rewrites them before validation. New clients should send the canonical names above.

Legacy field Canonical field
num_sample_frames num_frames
evs_enabled enable_frame_filter
evs_threshold frame_filter_threshold

Environment Variables

Variable Values Default Description
VLLM_VIDEO_ASYNC_CHUNK on, off on Wire-level streaming switch. off buffers server-side deltas and emits coalesced outputs at the end of a query.
VLLM_VIDEO_AUDIO_DELTA_MODE fast, slow fast Audio delta extraction strategy. fast emits only newly produced chunks; slow recomputes from accumulated audio and exists for A/B verification.

EVS Semantics

EVS compares downsampled frames and drops near-duplicate frames before they enter the session frame buffer. frame_filter_threshold controls retention: higher values are more permissive and keep more frames; lower values are more aggressive and drop more similar frames.

Known Limitations

  • Session KV reuse and incremental prefill are not implemented in this PR. Each video.query rebuilds the model prompt from the retained frame and audio buffers.
  • Back-to-back short replies can still expose an engine-layer scheduler race. The PR notes an observed workaround of at least 200 ms idle between turns when clients repeatedly see idle timeouts.
  • If the audio buffer exceeds the server limit, the server emits Audio buffer overflow and clears the currently buffered audio for the session.
  • The API is intended for Qwen3-Omni streaming video understanding; other models may not support the same multimodal processor arguments.