Streaming Video Input API¶
vLLM-Omni provides a WebSocket API for streaming video frames and optional audio chunks into Qwen3-Omni, then asking questions over the buffered session context.
Each server instance runs a single model specified at startup with vllm serve <model> --omni.
Quick Start¶
Start the Server¶
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \
--deploy-config vllm_omni/deploy/qwen3_omni.yaml \
--omni \
--port 8000 \
--trust-remote-code
Run the Example Client¶
python examples/online_serving/qwen3_omni/streaming_video_client.py \
--url ws://localhost:8000/v1/video/chat/stream \
--video /path/to/video.mp4 \
--query "Describe what is happening in the video."
API Reference¶
Endpoint¶
Protocol¶
| Direction | Type | Required fields | Description |
|---|---|---|---|
| Client -> Server | session.config | none | First message. Configures output modalities, frame sampling, EVS, and prompts. |
| Client -> Server | video.frame | data | Base64 JPEG/PNG frame. |
| Client -> Server | audio.chunk | data | Base64 PCM16 16 kHz mono audio bytes. |
| Client -> Server | video.query | text | Ask a question over the buffered frames and audio. |
| Client -> Server | video.done | none | End the WebSocket session. |
| Server -> Client | response.start | none | Query generation started. |
| Server -> Client | response.text.delta | delta | Incremental text output. |
| Server -> Client | response.text.done | text | Final text output for the query. |
| Server -> Client | response.audio.delta | data, format | Incremental generated audio, base64 WAV. |
| Server -> Client | response.audio.done | none | Audio output finished. |
| Server -> Client | session.done | none | Session closed. |
| Server -> Client | error | message | Recoverable protocol or generation error. |
session.config Fields¶
| Field | Type | Default | Description |
|---|---|---|---|
model | string or null | null | Optional model name. Usually omitted because the server hosts one model. |
modalities | list[string] | ["text", "audio"] | Output modalities. Use ["text"], ["audio"], or both. |
num_frames | integer, 1-128 | 4 | Number of buffered frames sampled for each query. |
max_frames | integer, 1-256 | 50 | Maximum retained frame buffer size. Oldest frames are evicted first. |
system_prompt | string or null | null | Optional custom system prompt. |
use_audio_in_video | bool | true | Include streamed audio chunks in multimodal video understanding when audio is present. |
sampling_params_list | list or null | null | Optional per-stage sampling parameter overrides. |
enable_frame_filter | bool | true | Enable EVS near-duplicate frame filtering. |
frame_filter_threshold | float, 0.0-1.0 | 0.95 | EVS similarity threshold. Higher keeps more frames; lower drops more near-duplicates. |
Legacy Aliases¶
The server accepts these legacy field names and rewrites them before validation. New clients should send the canonical names above.
| Legacy field | Canonical field |
|---|---|
num_sample_frames | num_frames |
evs_enabled | enable_frame_filter |
evs_threshold | frame_filter_threshold |
Environment Variables¶
| Variable | Values | Default | Description |
|---|---|---|---|
VLLM_VIDEO_ASYNC_CHUNK | on, off | on | Wire-level streaming switch. off buffers server-side deltas and emits coalesced outputs at the end of a query. |
VLLM_VIDEO_AUDIO_DELTA_MODE | fast, slow | fast | Audio delta extraction strategy. fast emits only newly produced chunks; slow recomputes from accumulated audio and exists for A/B verification. |
EVS Semantics¶
EVS compares downsampled frames and drops near-duplicate frames before they enter the session frame buffer. frame_filter_threshold controls retention: higher values are more permissive and keep more frames; lower values are more aggressive and drop more similar frames.
Known Limitations¶
- Session KV reuse and incremental prefill are not implemented in this PR. Each
video.queryrebuilds the model prompt from the retained frame and audio buffers. - Back-to-back short replies can still expose an engine-layer scheduler race. The PR notes an observed workaround of at least 200 ms idle between turns when clients repeatedly see idle timeouts.
- If the audio buffer exceeds the server limit, the server emits
Audio buffer overflowand clears the currently buffered audio for the session. - The API is intended for Qwen3-Omni streaming video understanding; other models may not support the same multimodal processor arguments.