Qwen3-Omni¶
Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/qwen3_omni.
Setup¶
Please refer to the stage configuration documentation to configure memory allocation appropriately for your hardware setup.
Run examples¶
Multiple Prompts¶
Change into the example folder (examples/offline_inference/qwen3_omni), then run the command below.
Note: to handle large volumes of data, this example uses py_generator mode, in which the Omni class returns a Python generator.
Single Prompt¶
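Change into the example folder (examples/offline_inference/qwen3_omni), then run the command below. If you do not have enough memory, you can enable tensor parallelism for the Thinker by running the command below instead (see run_single_prompt_tp.sh under Example materials).
Modality control¶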
If you want to control output modalities, e.g. only output text, you can run the command below:
Using Local Media Files¶
The end2end.py script supports local media files (audio, video, image) via command-line arguments:
# Use local video file
python end2end.py --query-type use_video --video-path /path/to/video.mp4
# Use local image file
python end2end.py --query-type use_image --image-path /path/to/image.jpg
# Use local audio file
python end2end.py --query-type use_audio --audio-path /path/to/audio.wav
# Combine multiple local media files
python end2end.py --query-type mixed_modalities \
    --video-path /path/to/video.mp4 \
    --image-path /path/to/image.jpg \
    --audio-path /path/to/audio.wav
If media file paths are not provided, the script will use default assets.
Supported query types:
- use_video: Video input
- use_image: Image input
- use_audio: Audio input
- text: Text-only query
- multi_audios: Multiple audio inputs
- mixed_modalities: Combination of video, image, and audio inputs
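The flag handling described above can be pictured with a small argparse sketch. This is a hypothetical reconstruction for illustration, not the code in end2end.py: the flag names and query types come from the text above, while `build_parser`, `resolve_media`, the `NEEDED_MEDIA` mapping, and the `DEFAULT_ASSETS` placeholder values are assumptions.

```python
import argparse

# Query types listed in this document.
QUERY_TYPES = [
    "use_video", "use_image", "use_audio",
    "text", "multi_audios", "mixed_modalities",
]

# Hypothetical placeholders standing in for the script's bundled default assets.
DEFAULT_ASSETS = {
    "video": "<default video asset>",
    "image": "<default image asset>",
    "audio": "<default audio asset>",
}

# Assumed mapping from query type to the media kinds it consumes.
# multi_audios is left out because this page does not say how its paths are passed.
NEEDED_MEDIA = {
    "use_video": ["video"],
    "use_image": ["image"],
    "use_audio": ["audio"],
    "text": [],
    "mixed_modalities": ["video", "image", "audio"],
}


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the media-related flags")
    # Required here so the sketch does not guess the real script's default.
    parser.add_argument("--query-type", choices=QUERY_TYPES, required=True)
    parser.add_argument("--video-path", default=None)
    parser.add_argument("--image-path", default=None)
    parser.add_argument("--audio-path", default=None)
    return parser


def resolve_media(args):
    """Return {kind: path}, falling back to a default asset when no path is given."""
    resolved = {}
    for kind in NEEDED_MEDIA.get(args.query_type, []):
        user_path = getattr(args, f"{kind}_path")
        resolved[kind] = user_path if user_path is not None else DEFAULT_ASSETS[kind]
    return resolved
```

Calling `resolve_media(build_parser().parse_args(["--query-type", "use_video"]))` in this sketch returns the placeholder video asset, which mirrors the fallback behavior described above.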
Async-chunk (offline)¶
For true stage-level concurrency -- where downstream stages (Talker, Code2Wav) start before the upstream stage (Thinker) finishes -- use the async_chunk example. This requires:
- A deploy config YAML with async_chunk: true (e.g. qwen3_omni_moe.yaml).
- Hardware that matches the config (e.g. 2x H100 for the default 3-stage config).
The async_chunk example uses AsyncOmni instead of the synchronous Omni class, which enables the async orchestrator to receive stage-0 intermediate outputs and trigger downstream stages early. Chunk data flows directly between stage workers via the in-worker OmniChunkTransferAdapter / connector, not through the orchestrator.
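The stage-level concurrency described above can be illustrated with a conceptual asyncio sketch. It does not use vllm-omni at all: `thinker`, `downstream`, and `run_pipeline` are hypothetical stand-ins, and the queues merely play the role of the direct stage-to-stage chunk transfer. The bounded queue size is a choice made for this sketch so that the ordering is easy to observe.

```python
import asyncio

END = None  # sentinel marking the end of a chunk stream


async def thinker(out_q, events, n_chunks=3):
    """Stage 0 stand-in: emit chunks one at a time instead of one final result."""
    for i in range(n_chunks):
        events.append(f"thinker:emit:{i}")
        await out_q.put(f"chunk-{i}")  # waits while the small queue is full
    await out_q.put(END)
    events.append("thinker:done")


async def downstream(name, in_q, out_q, events):
    """Talker / Code2Wav stand-in: handle each chunk as soon as it arrives."""
    while True:
        chunk = await in_q.get()
        if chunk is END:
            if out_q is not None:
                await out_q.put(END)  # propagate end-of-stream
            return
        events.append(f"{name}:got:{chunk}")
        if out_q is not None:
            await out_q.put(chunk)


async def run_pipeline():
    events = []
    # Chunks move stage to stage through these queues, not through a coordinator.
    thinker_to_talker = asyncio.Queue(maxsize=1)
    talker_to_code2wav = asyncio.Queue(maxsize=1)
    await asyncio.gather(
        thinker(thinker_to_talker, events),
        downstream("talker", thinker_to_talker, talker_to_code2wav, events),
        downstream("code2wav", talker_to_code2wav, None, events),
    )
    return events
```

Printing the list returned by `asyncio.run(run_pipeline())` shows the talker and code2wav entries interleaved with the thinker entries, which is the property the async_chunk mode is meant to provide: downstream work begins before the upstream stage has finished.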
Single prompt¶
bash run_single_prompt_async_chunk.sh
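Multiple prompts with concurrency control¶
Use --max-in-flight to control request-level concurrency:
bash run_multiple_prompts_async_chunk.sh --max-in-flight 4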
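The request-level cap that --max-in-flight applies in run_multiple_prompts_async_chunk.sh can be pictured with an asyncio.Semaphore. This is a minimal sketch under assumptions: whether end2end_async_chunk.py implements the flag this way is not stated here, and `run_with_limit`, `PeakTracker`, and `fake_request` are hypothetical names that stand in for real requests.

```python
import asyncio


async def run_with_limit(prompts, handle_request, max_in_flight=2):
    """Run handle_request for every prompt, with at most max_in_flight active at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(prompt):
        async with semaphore:  # wait here until a slot is free
            return await handle_request(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(guarded(p) for p in prompts))


class PeakTracker:
    """Fake request handler that records the highest number of concurrent calls."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def fake_request(self, prompt):
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # pretend to do inference work
        self.active -= 1
        return f"answer to: {prompt}"
```

With ten prompts and `max_in_flight=2`, the tracker's `peak` never exceeds 2 in this sketch, while each individual request could still use stage-level concurrency internally.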
Text-only output (skip audio generation)¶
bash run_single_prompt_async_chunk.sh --query-type text --modalities text
Custom stage config¶
python end2end_async_chunk.py \
    --query-type use_audio \
    --deploy-config /path/to/your_deploy_config.yaml
Note: The synchronous end2end.py (using Omni) is still the recommended entry point for non-async-chunk workflows. Only use the async_chunk example when you need the stage-level concurrency semantics described in PR #962 / #1151.
Example materials¶
end2end.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py.
end2end_async_chunk.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end_async_chunk.py.
run_multiple_prompts.sh
run_multiple_prompts_async_chunk.sh
#!/bin/bash
# Run multiple Qwen3-Omni requests with async_chunk enabled.
#
# Uses AsyncOmni with --max-in-flight to control request-level
# concurrency (each request still gets true stage-level concurrency
# via async_chunk).
#
# Usage:
# bash run_multiple_prompts_async_chunk.sh
# bash run_multiple_prompts_async_chunk.sh --max-in-flight 4
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"
python "${SCRIPT_DIR}/end2end_async_chunk.py" \
    --query-type text \
    --txt-prompts "${SCRIPT_DIR}/text_prompts_10.txt" \
    --deploy-config "${REPO_ROOT}/vllm_omni/deploy/qwen3_omni_moe.yaml" \
    --output-dir output_audio_async_chunk \
    --max-in-flight 2 \
    "$@"
run_single_prompt_async_chunk.sh
#!/bin/bash
# Run a single Qwen3-Omni request with async_chunk enabled.
#
# This uses AsyncOmni (async orchestrator) so that downstream stages
# (Talker, Code2Wav) start *before* stage-0 (Thinker) finishes,
# achieving true stage-level concurrency via chunk-level streaming.
#
# Prerequisites:
# - A deploy config YAML (e.g. qwen3_omni_moe.yaml)
# - Hardware matching the config (e.g. 2x H100 for the default 3-stage config)
#
# Usage:
# bash run_single_prompt_async_chunk.sh
# bash run_single_prompt_async_chunk.sh --query-type text --modalities text
# bash run_single_prompt_async_chunk.sh --deploy-config /path/to/custom.yaml
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"
python "${SCRIPT_DIR}/end2end_async_chunk.py" \
    --query-type use_audio \
    --deploy-config "${REPO_ROOT}/vllm_omni/deploy/qwen3_omni_moe.yaml" \
    --output-dir output_audio_async_chunk \
    "$@"
run_single_prompt_tp.sh
text_prompts_10.txt
What is the capital of France?
How many planets are in our solar system?
What is the largest ocean on Earth?
Who wrote the novel "1984"?
What is the chemical symbol for water?
What year did World War II end?
What is the tallest mountain in the world?
What is the speed of light in vacuum?
Who painted the Mona Lisa?
What is the smallest prime number?