Qwen3-Omni

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/qwen3_omni

Setup

Please refer to the stage configuration documentation to configure memory allocation appropriately for your hardware setup.

Run examples

Multiple Prompts

Change into the example directory:

cd examples/offline_inference/qwen3_omni

Then run the command below. Note: to handle large volumes of data, the script runs in py_generator mode (the --py-generator flag), so results are returned as a Python generator from the Omni class.

bash run_multiple_prompts.sh
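The practical benefit of a generator is that results are produced one at a time as the caller iterates, so a large prompt file does not require holding every output in memory at once. The sketch below illustrates that consumption pattern only. `fake_generate` and `count_results` are hypothetical stand-ins, not the Omni API, whose actual call signature and result type are not shown on this page.

```python
from typing import Iterator


def fake_generate(prompts: list[str]) -> Iterator[dict]:
    """Hypothetical stand-in for a call that returns a generator of results."""
    for request_id, prompt in enumerate(prompts):
        # The body runs lazily: each result is built only when requested.
        yield {"request_id": request_id, "prompt": prompt}


def count_results(results: Iterator[dict]) -> int:
    """Consume results one at a time, keeping only a running count."""
    handled = 0
    for _result in results:
        # Process or save the result here, then let it go out of scope.
        handled += 1
    return handled


if __name__ == "__main__":
    prompts = ["What is the capital of France?", "Who painted the Mona Lisa?"]
    print(count_results(fake_generate(prompts)))
```

In a real run the loop body would write each output to disk before moving on to the next one.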

Single Prompt

Change into the example directory:

cd examples/offline_inference/qwen3_omni

Then run the command below.

bash run_single_prompt.sh

If you do not have enough memory, you can run the Thinker stage with tensor parallelism by running the command below.

bash run_single_prompt_tp.sh

Modality control

To control the output modalities, for example to output text only, run the command below:

python end2end.py --output-wav output_audio \
                  --query-type use_audio \
                  --modalities text

Using Local Media Files

The end2end.py script supports local media files (audio, video, image) via command-line arguments:

# Use local video file
python end2end.py --query-type use_video --video-path /path/to/video.mp4

# Use local image file
python end2end.py --query-type use_image --image-path /path/to/image.jpg

# Use local audio file
python end2end.py --query-type use_audio --audio-path /path/to/audio.wav

# Combine multiple local media files
python end2end.py --query-type mixed_modalities \
    --video-path /path/to/video.mp4 \
    --image-path /path/to/image.jpg \
    --audio-path /path/to/audio.wav

If media file paths are not provided, the script uses default assets.

Supported query types:

- use_video: Video input
- use_image: Image input
- use_audio: Audio input
- text: Text-only query
- multi_audios: Multiple audio inputs
- mixed_modalities: Combination of video, image, and audio inputs
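The query types and path flags above could be declared along the following lines. This is a minimal sketch of the pattern, not the actual parser in end2end.py, which is not reproduced on this page. The `DEFAULT_ASSETS` placeholders and the `resolve_media` helper are hypothetical, and making --query-type required is an assumption made for the sketch.

```python
import argparse

QUERY_TYPES = [
    "use_video", "use_image", "use_audio",
    "text", "multi_audios", "mixed_modalities",
]

# Placeholder names; the real default assets are not listed on this page.
DEFAULT_ASSETS = {
    "video": "<default video>",
    "image": "<default image>",
    "audio": "<default audio>",
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Illustrative sketch only")
    # argparse rejects any value that is not in `choices`.
    parser.add_argument("--query-type", choices=QUERY_TYPES, required=True)
    parser.add_argument("--video-path", default=None)
    parser.add_argument("--image-path", default=None)
    parser.add_argument("--audio-path", default=None)
    return parser


def resolve_media(args: argparse.Namespace) -> dict[str, str]:
    """Use the supplied path when given, else the placeholder default."""
    supplied = {
        "video": args.video_path,
        "image": args.image_path,
        "audio": args.audio_path,
    }
    return {kind: path or DEFAULT_ASSETS[kind] for kind, path in supplied.items()}


if __name__ == "__main__":
    cli = ["--query-type", "use_video", "--video-path", "/path/to/video.mp4"]
    print(resolve_media(build_parser().parse_args(cli)))
```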

Async-chunk (offline)

For true stage-level concurrency, in which downstream stages (Talker, Code2Wav) start before the upstream stage (Thinker) finishes, use the async_chunk example. This requires:

- A deploy config YAML with async_chunk: true (e.g. qwen3_omni_moe.yaml).
- Hardware that matches the config (e.g. 2x H100 for the default 3-stage config).

The async_chunk example uses AsyncOmni instead of the synchronous Omni class, which enables the async orchestrator to receive stage-0 intermediate outputs and trigger downstream stages early. Chunk data flows directly between stage workers via the in-worker OmniChunkTransferAdapter / connector, not through the orchestrator.
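To make the idea of stage-level concurrency concrete, here is a toy model in plain asyncio: an upstream task emits chunks into a queue and a downstream task starts handling them before the upstream task has finished. It is only an illustration of the scheduling pattern. The `thinker`, `talker`, and `run_pipeline` functions are hypothetical and do not use AsyncOmni, OmniChunkTransferAdapter, or any vllm-omni code.

```python
import asyncio


async def thinker(out_queue: asyncio.Queue, n_chunks: int, log: list[str]) -> None:
    """Upstream stage: emits chunks one by one, then a None sentinel."""
    for i in range(n_chunks):
        await asyncio.sleep(0)  # simulate work and yield to other tasks
        log.append(f"thinker:chunk{i}")
        await out_queue.put(i)
    log.append("thinker:done")
    await out_queue.put(None)


async def talker(in_queue: asyncio.Queue, log: list[str]) -> list[int]:
    """Downstream stage: handles each chunk as soon as it arrives."""
    received: list[int] = []
    while True:
        chunk = await in_queue.get()
        if chunk is None:  # sentinel: upstream has no more chunks
            return received
        log.append(f"talker:chunk{chunk}")
        received.append(chunk)


async def run_pipeline(n_chunks: int = 3) -> tuple[list[str], list[int]]:
    log: list[str] = []
    queue: asyncio.Queue = asyncio.Queue()
    # Both stages run concurrently on the same event loop.
    _, received = await asyncio.gather(
        thinker(queue, n_chunks, log), talker(queue, log)
    )
    return log, received


if __name__ == "__main__":
    events, _ = asyncio.run(run_pipeline())
    print(events)
```

In the printed event log, the first talker entry appears before `thinker:done`, which is the property that distinguishes chunk-level streaming from running the stages strictly one after another.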

Single prompt

cd examples/offline_inference/qwen3_omni
bash run_single_prompt_async_chunk.sh

Multiple prompts with concurrency control

bash run_multiple_prompts_async_chunk.sh --max-in-flight 4
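A common way to cap the number of requests in flight is an asyncio.Semaphore, sketched below. How end2end_async_chunk.py actually implements --max-in-flight is not shown on this page, so treat this as one plausible approach. `fake_request` and `run_bounded` are hypothetical names, and the sleep stands in for real inference work.

```python
import asyncio


async def fake_request(prompt: str) -> str:
    """Hypothetical stand-in for submitting one request and awaiting it."""
    await asyncio.sleep(0.01)
    return f"done: {prompt}"


async def run_bounded(prompts: list[str], max_in_flight: int) -> tuple[list[str], int]:
    """Run all prompts, never allowing more than max_in_flight at once."""
    semaphore = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0  # highest concurrency observed, for demonstration only

    async def worker(prompt: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # waits here while the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_request(prompt)
            finally:
                in_flight -= 1

    # gather preserves the input order of the prompts in its results.
    results = await asyncio.gather(*(worker(p) for p in prompts))
    return list(results), peak


if __name__ == "__main__":
    outputs, observed_peak = asyncio.run(
        run_bounded([f"prompt {i}" for i in range(10)], max_in_flight=4)
    )
    print(len(outputs), observed_peak)
```

The limit here applies per request. It is independent of the stage-level concurrency that each individual request gets from async_chunk.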

Text-only output (skip audio generation)

python end2end_async_chunk.py --query-type text --modalities text

Custom stage config

python end2end_async_chunk.py \
    --query-type use_audio \
    --deploy-config /path/to/your_deploy_config.yaml

Note: The synchronous end2end.py (using Omni) is still the recommended entry point for non-async-chunk workflows. Only use the async_chunk example when you need the stage-level concurrency semantics described in PR #962 / #1151.

Example materials

end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py.

end2end_async_chunk.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end_async_chunk.py.

run_multiple_prompts.sh

python end2end.py --output-wav output_audio \
                  --query-type text \
                  --txt-prompts text_prompts_10.txt \
                  --py-generator

run_multiple_prompts_async_chunk.sh

#!/bin/bash
# Run multiple Qwen3-Omni requests with async_chunk enabled.
#
# Uses AsyncOmni with --max-in-flight to control request-level
# concurrency (each request still gets true stage-level concurrency
# via async_chunk).
#
# Usage:
#   bash run_multiple_prompts_async_chunk.sh
#   bash run_multiple_prompts_async_chunk.sh --max-in-flight 4

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"

python "${SCRIPT_DIR}/end2end_async_chunk.py" \
    --query-type text \
    --txt-prompts "${SCRIPT_DIR}/text_prompts_10.txt" \
    --deploy-config "${REPO_ROOT}/vllm_omni/deploy/qwen3_omni_moe.yaml" \
    --output-dir output_audio_async_chunk \
    --max-in-flight 2 \
    "$@"

run_single_prompt.sh

python end2end.py --output-wav output_audio \
                  --query-type use_audio

run_single_prompt_async_chunk.sh

#!/bin/bash
# Run a single Qwen3-Omni request with async_chunk enabled.
#
# This uses AsyncOmni (async orchestrator) so that downstream stages
# (Talker, Code2Wav) start *before* stage-0 (Thinker) finishes,
# achieving true stage-level concurrency via chunk-level streaming.
#
# Prerequisites:
#   - A deploy config YAML (e.g. qwen3_omni_moe.yaml)
#   - Hardware matching the config (e.g. 2x H100 for the default 3-stage config)
#
# Usage:
#   bash run_single_prompt_async_chunk.sh
#   bash run_single_prompt_async_chunk.sh --query-type text --modalities text
#   bash run_single_prompt_async_chunk.sh --deploy-config /path/to/custom.yaml

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"

python "${SCRIPT_DIR}/end2end_async_chunk.py" \
    --query-type use_audio \
    --deploy-config "${REPO_ROOT}/vllm_omni/deploy/qwen3_omni_moe.yaml" \
    --output-dir output_audio_async_chunk \
    "$@"

run_single_prompt_tp.sh

python end2end.py --output-wav output_audio \
                  --query-type use_audio \
                  --stage-init-timeout 300

# stage-init-timeout sets the maximum wait to avoid two vLLM stages initializing at the same time on the same card.

text_prompts_10.txt

What is the capital of France?
How many planets are in our solar system?
What is the largest ocean on Earth?
Who wrote the novel "1984"?
What is the chemical symbol for water?
What year did World War II end?
What is the tallest mountain in the world?
What is the speed of light in vacuum?
Who painted the Mona Lisa?
What is the smallest prime number?