Qwen2.5-Omni

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/qwen2_5_omni

Setup

Please refer to the stage configuration documentation to configure memory allocation appropriately for your hardware setup.

Run examples

Multiple Prompts

Change into the example folder:

cd examples/offline_inference/qwen2_5_omni
Then run the script below. To handle large volumes of data, it uses py_generator mode (the --py-generator flag), in which the Omni class returns a Python generator.
bash run_multiple_prompts.sh
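
To illustrate why a generator suits large inputs, here is a minimal sketch of lazy, batched prompt reading in plain Python. It does not use the Omni API. The names `iter_prompts` and `batched` are hypothetical, and the one-prompt-per-line file layout is an assumption made for the example.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def iter_prompts(path: str) -> Iterator[str]:
    """Yield non-empty prompts one line at a time (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            prompt = line.strip()
            if prompt:
                yield prompt


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Group a lazy stream into lists of at most `size` items."""
    it = iter(items)
    # islice pulls only `size` items per step, so the full input is never held in memory.
    while batch := list(islice(it, size)):
        yield batch
```

Because both functions are generators, a loop such as `for batch in batched(iter_prompts("prompts.txt"), 4): ...` only reads as much of the file as it has consumed so far.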

Single Prompt

Change into the example folder:

cd examples/offline_inference/qwen2_5_omni
Then run the command below.
bash run_single_prompt.sh

Modality control

To control the output modalities, for example to produce text only, run the command below:

python end2end.py --output-wav output_audio \
                  --query-type mixed_modalities \
                  --modalities text

Using Local Media Files

The end2end.py script supports local media files (audio, video, image) via CLI arguments:

# Use single local media files
python end2end.py --query-type use_image --image-path /path/to/image.jpg
python end2end.py --query-type use_video --video-path /path/to/video.mp4
python end2end.py --query-type use_audio --audio-path /path/to/audio.wav

# Combine multiple local media files
python end2end.py --query-type mixed_modalities \
    --video-path /path/to/video.mp4 \
    --image-path /path/to/image.jpg \
    --audio-path /path/to/audio.wav

# Use audio from video file
python end2end.py --query-type use_audio_in_video --video-path /path/to/video.mp4

If media file paths are not provided, the script will use default assets.

Supported query types:
- use_image: Image input only
- use_video: Video input only
- use_audio: Audio input only
- mixed_modalities: Audio + image + video
- use_audio_in_video: Extract audio from video
- text: Text-only query
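
The flags above can also be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_command` and `SUPPORTED_QUERY_TYPES` are hypothetical names, and the function only uses the query types and CLI flags listed in this document. It validates the query type before building the argument list.

```python
# Query types as listed in this document (assumed to be the complete set).
SUPPORTED_QUERY_TYPES = {
    "use_image",
    "use_video",
    "use_audio",
    "mixed_modalities",
    "use_audio_in_video",
    "text",
}


def build_command(query_type, image_path=None, video_path=None,
                  audio_path=None, modalities=None):
    """Return an argv list for end2end.py (hypothetical helper)."""
    if query_type not in SUPPORTED_QUERY_TYPES:
        raise ValueError(f"unsupported query type: {query_type!r}")
    cmd = ["python", "end2end.py", "--query-type", query_type]
    # Only pass media flags that were given; the script falls back to default assets.
    for flag, value in (
        ("--image-path", image_path),
        ("--video-path", video_path),
        ("--audio-path", audio_path),
    ):
        if value is not None:
            cmd += [flag, value]
    if modalities is not None:
        cmd += ["--modalities", modalities]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the example folder. Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.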

Example materials

end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py.

extract_prompts.py

#!/usr/bin/env python3
import argparse


def extract_prompt(line: str) -> str | None:
    # Extract the content between the first '|' and the second '|'
    i = line.find("|")
    if i == -1:
        return None
    j = line.find("|", i + 1)
    if j == -1:
        return None
    return line[i + 1 : j].strip()


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", "-i", required=True, help="Input .lst file path")
    parser.add_argument("--output", "-o", required=True, help="Output file path")
    parser.add_argument(
        "--topk",
        "-k",
        type=int,
        default=100,
        help="Extract the top K prompts (default: 100)",
    )
    args = parser.parse_args()

    prompts = []
    with open(args.input, encoding="utf-8", errors="ignore") as f:
        for line in f:
            if len(prompts) >= args.topk:
                break
            p = extract_prompt(line.rstrip("\n"))
            if p:
                prompts.append(p)

    with open(args.output, "w", encoding="utf-8") as f:
        for p in prompts:
            f.write(p + "\n")


if __name__ == "__main__":
    main()

run_multiple_prompts.sh

python end2end.py --output-wav output_audio \
                  --query-type text \
                  --txt-prompts ../qwen3_omni/text_prompts_10.txt \
                  --py-generator

run_single_prompt.sh

python end2end.py --output-wav output_audio \
                  --query-type mixed_modalities