GPU

vLLM-Omni is a Python library that supports the following GPU variants. The library itself mainly contains Python implementations of the framework and models.

Requirements

  • OS: Linux
  • Python: 3.12

Note

vLLM-Omni is currently not natively supported on Windows.

  • GPU (NVIDIA CUDA): compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100)
  • GPU (AMD ROCm): validated on gfx942; other AMD GPUs that are supported by vLLM should also work
  • GPU (Intel XPU): validated on Intel® Arc™ B-Series
  • GPU (Moore Threads MUSA): requires the MUSA SDK to be installed; validated on MTT S5000
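
The requirements above can be checked from a shell before installing anything. The following is a minimal sketch, not part of vLLM-Omni: the helper names `is_supported_python` and `meets_compute_capability` are hypothetical, and only the Python version and the NVIDIA compute capability threshold from the list are covered.

```shell
# Hypothetical helpers (not part of vLLM-Omni) that mirror the requirements list.

# Succeeds when a version string such as "3.12.4" is a Python 3.12 release.
is_supported_python() {
    case "$1" in
        3.12|3.12.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds when an NVIDIA compute capability such as "7.5" is 7.0 or higher.
meets_compute_capability() {
    case "${1%%.*}" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "${1%%.*}" -ge 7 ]
}

# Report on the current machine; the Python check is skipped if python3 is missing.
[ "$(uname -s)" = "Linux" ] || echo "warning: only Linux is supported"
if command -v python3 >/dev/null 2>&1; then
    py_version=$(python3 -c 'import platform; print(platform.python_version())')
    is_supported_python "$py_version" || echo "warning: Python $py_version found, 3.12 required"
fi
```

On NVIDIA systems with a recent driver, the capability can usually be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` and passed to `meets_compute_capability`; treat the availability of that query field as an assumption about your driver version.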

Set up using Python

Create a new Python environment

It's recommended to use uv, a very fast Python environment manager, to create and manage Python environments. Please follow the documentation to install uv. After installing uv, you can create a new Python environment using the following commands:

uv venv --python 3.12 --seed
source .venv/bin/activate
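
Before running the install commands, it can help to confirm that the environment is actually active in the current shell. This is a small sketch under the assumption that the activate script sets the `VIRTUAL_ENV` variable, as standard virtual environment activation scripts do; `venv_is_active` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds when a virtual environment is active in this shell.
venv_is_active() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if venv_is_active; then
    echo "using environment at $VIRTUAL_ENV"
else
    echo "no active environment; run: source .venv/bin/activate"
fi
```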

Pre-built wheels

Note: Pre-built wheels are currently available for vLLM-Omni 0.11.0rc1, 0.12.0rc1, 0.14.0rc1, 0.14.0, 0.16.0, 0.18.0, 0.20.0, 0.21.0, and 0.22.0. If you need a newer unreleased revision, please build from source.
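
The choice between a pre-built wheel and a source build can be expressed as a simple lookup. The sketch below copies the version list from the note above; the helper names `has_prebuilt_wheel` and `install_route` are hypothetical, and the list has to be updated by hand as new wheels are published.

```shell
# Versions copied from the note above; update this list as new wheels are published.
PREBUILT_VERSIONS="0.11.0rc1 0.12.0rc1 0.14.0rc1 0.14.0 0.16.0 0.18.0 0.20.0 0.21.0 0.22.0"

# Hypothetical helper: succeeds when the given version has a pre-built wheel.
has_prebuilt_wheel() {
    case " $PREBUILT_VERSIONS " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the suggested install route for a version.
install_route() {
    if has_prebuilt_wheel "$1"; then
        echo "uv pip install vllm-omni==$1"
    else
        echo "build from source"
    fi
}

install_route 0.22.0
install_route 0.19.0
```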

Installation of vLLM

vLLM-Omni is built on top of vLLM. Install vLLM with the command below.

uv pip install vllm==0.22.0 --torch-backend=auto

Installation of vLLM-Omni

uv pip install vllm-omni

To run Gradio demos, also install the optional extras:

uv pip install 'vllm-omni[demo]'

Installation of vLLM (AMD ROCm)

vLLM-Omni is built on top of vLLM. On AMD ROCm, install vLLM with the command below.

uv pip install vllm==0.22.0+rocm722 --extra-index-url https://wheels.vllm.ai/rocm/0.22.0/rocm722
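
In the command above, the vLLM version and the ROCm tag each appear twice: once in the package pin and once in the index URL. A small helper can keep the two in sync. This is a sketch only: `rocm_vllm_install_cmd` is a hypothetical name, and it assumes the URL layout of the 0.22.0 command holds for other releases, which is not confirmed here.

```shell
# Hypothetical helper: prints the install command for a vLLM release and ROCm tag,
# assuming the URL layout of the 0.22.0 command holds for other releases.
rocm_vllm_install_cmd() {
    vllm_version=$1   # e.g. 0.22.0
    rocm_tag=$2       # e.g. rocm722
    printf 'uv pip install vllm==%s+%s --extra-index-url https://wheels.vllm.ai/rocm/%s/%s\n' \
        "$vllm_version" "$rocm_tag" "$vllm_version" "$rocm_tag"
}

# Prints the command only; it does not install anything.
rocm_vllm_install_cmd 0.22.0 rocm722
```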

Installation of vLLM-Omni (AMD ROCm)

# --no-build-isolation is required because torch does not come from PyPI;
# the build has to use the torch already installed in the environment
uv pip install vllm-omni --no-build-isolation

# Optional: only needed to run Qwen3 TTS
uv pip uninstall onnxruntime  # must be removed before onnxruntime-rocm can be installed
uv pip install onnxruntime-rocm

Build wheel from source

Installation of vLLM

If you do not need to modify the vLLM source code, you can install the stable 0.22.0 release directly:

uv pip install vllm==0.22.0 --torch-backend=auto

The 0.22.0 release of vLLM ships CUDA 13.0-compatible binaries by default. If you need a different CUDA variant or want to reuse an existing PyTorch installation, build vLLM from source instead.

Installation of vLLM-Omni

Since vllm-omni is evolving rapidly, it is recommended to install it from source:

git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e .

To run Gradio demos, install with optional extras:

uv pip install -e '.[demo]'

(Optional) Installation of vLLM from source

If you want to inspect, modify, or debug the vLLM source code, install the library from source with the following instructions:

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.22.0

Set up environment variables to get pre-built wheels. If you have network problems, download the .whl file manually and set VLLM_PRECOMPILED_WHEEL_LOCATION to the local absolute path of that file.

# For CUDA 13.0 (the default for v0.22.0; the wheel filename has no `+cu130` suffix)
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://github.com/vllm-project/vllm/releases/download/v0.22.0/vllm-0.22.0-cp38-abi3-manylinux_2_35_x86_64.whl

Install vLLM with the command below if you have no existing PyTorch installation:

uv pip install --editable .

Install vLLM with the commands below if you already have PyTorch installed:

python use_existing_torch.py
uv pip install -r requirements/build/cuda.txt
uv pip install --no-build-isolation --editable .
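
The fallback described above (use a manually downloaded wheel if one exists, otherwise the release URL) can be sketched as follows. `resolve_wheel_location` is a hypothetical helper, and the assumption that the downloaded file sits in the current directory under its original filename is for illustration only.

```shell
# Hypothetical helper: prints the value to use for VLLM_PRECOMPILED_WHEEL_LOCATION.
# $1 is a path to a manually downloaded wheel (may not exist), $2 is the release URL.
resolve_wheel_location() {
    if [ -f "$1" ]; then
        # Convert to an absolute path, as required for a local wheel file.
        wheel_dir=$(cd "$(dirname "$1")" && pwd)
        printf '%s/%s\n' "$wheel_dir" "$(basename "$1")"
    else
        printf '%s\n' "$2"
    fi
}

WHEEL_URL=https://github.com/vllm-project/vllm/releases/download/v0.22.0/vllm-0.22.0-cp38-abi3-manylinux_2_35_x86_64.whl
# Assumes a manual download was saved in the current directory with its original name.
export VLLM_PRECOMPILED_WHEEL_LOCATION="$(resolve_wheel_location "./$(basename "$WHEEL_URL")" "$WHEEL_URL")"
echo "$VLLM_PRECOMPILED_WHEEL_LOCATION"
```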

Installation of vLLM (AMD ROCm)

If you do not need to modify the vLLM source code, you can install the stable 0.22.0 release directly:

uv pip install vllm==0.22.0+rocm722 --extra-index-url https://wheels.vllm.ai/rocm/0.22.0/rocm722

The pre-built 0.22.0 vLLM wheel targets ROCm 7.2.2. If you need a different ROCm stack or want to reuse an existing PyTorch installation, build vLLM from source instead.

Installation of vLLM-Omni

Since vllm-omni is evolving rapidly, it is recommended to install it from source:

git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
VLLM_OMNI_TARGET_DEVICE=rocm uv pip install -e .
# OR
uv pip install -e . --no-build-isolation

(Optional) Installation of vLLM from source

If you want to inspect, modify, or debug the vLLM source code, install the library from source with the following instructions:

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.22.0
python3 -m pip install -r requirements/rocm.txt
python3 setup.py develop

Prerequisites (Moore Threads MUSA)

  • MUSA SDK: download it from the MUSA SDK Download page
  • torchada: CUDA→MUSA compatibility layer for PyTorch (pip install torchada)
  • mthreads-ml-py: MTML Python bindings (pip install mthreads-ml-py)
  • MATE: MUSA AI Tensor Engine (available on GitHub)

Installation of vLLM-MUSA

git clone https://github.com/MooreThreads/vllm-musa.git
cd vllm-musa
git checkout v0.18.0-dev
pip install . --no-build-isolation -v

Installation of vLLM-Omni

git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
VLLM_OMNI_TARGET_DEVICE=musa pip install -e . --no-build-isolation

For Gradio demos:

pip install -e '.[demo]' --no-build-isolation

Environment Variables

export MUSA_VISIBLE_DEVICES=0,1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_MUSA_CUSTOM_OP_USE_NATIVE=false
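
When the number of devices varies between machines, the device list can be generated instead of typed by hand. The sketch below assumes devices are numbered contiguously from 0; `musa_device_list` is a hypothetical helper, and the other two variables are copied unchanged from the settings above.

```shell
# Hypothetical helper: prints "0,1,...,N-1" for N devices,
# assuming device indices are contiguous and start at 0.
musa_device_list() {
    count=$1
    list=""
    i=0
    while [ "$i" -lt "$count" ]; do
        list="${list:+$list,}$i"   # add a comma only between entries
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

export MUSA_VISIBLE_DEVICES="$(musa_device_list 2)"
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_MUSA_CUSTOM_OP_USE_NATIVE=false
echo "$MUSA_VISIBLE_DEVICES"
```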

Set up using Docker

Pre-built images

vLLM-Omni offers official Docker images for deployment. These images are built on top of the vLLM Docker images and are available on Docker Hub as vllm/vllm-omni. The version of vLLM-Omni indicates which release of vLLM it is based on.

Here's an example deployment command that has been verified on 2 x H100 GPUs:

docker run --runtime nvidia --gpus 2 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    -p 8091:8091 \
    --ipc=host \
    vllm/vllm-omni:v0.22.0 \
    vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091

Tip

The CUDA image does not define a default entrypoint, so include vllm serve ... --omni after the image name.
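
In the CUDA command above, the port appears three times (host side, container side, and the `--port` argument), and the `vllm serve ... --omni` part has to follow the image name. A helper that only prints the command makes it easier to keep these in sync when the GPU count, port, or model changes. This is a sketch: `omni_docker_cmd` is a hypothetical name, and the assumption that image tags for other releases follow the same `v<version>` pattern is not confirmed here.

```shell
# Hypothetical helper: prints (does not run) the docker command for the CUDA image.
# Arguments: GPU count, port, model name, optional image tag (default v0.22.0).
omni_docker_cmd() {
    gpus=$1
    port=$2
    model=$3
    tag=${4:-v0.22.0}
    printf 'docker run --runtime nvidia --gpus %s' "$gpus"
    printf ' %s' '-v ~/.cache/huggingface:/root/.cache/huggingface'
    # HF_TOKEN is left unexpanded so the token is not printed.
    printf ' %s' '--env "HF_TOKEN=$HF_TOKEN"' "-p $port:$port" '--ipc=host'
    # No default entrypoint: the serve command follows the image name.
    printf ' vllm/vllm-omni:%s vllm serve %s --omni --port %s\n' "$tag" "$model" "$port"
}

omni_docker_cmd 2 8091 Qwen/Qwen3-Omni-30B-A3B-Instruct
```

The printed line can be reviewed and then pasted into a shell, which keeps the helper free of side effects.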

For AMD ROCm, vLLM-Omni offers official Docker images for deployment. These images are built on top of the vLLM Docker images and are available on Docker Hub as vllm/vllm-omni-rocm. The version of vLLM-Omni indicates which release of vLLM it is based on.

Launch vLLM-Omni Server

Here's an example deployment command that has been verified on 2 x MI300 GPUs:

docker run --rm \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v <path/to/model>:/app/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8091:8091 \
  vllm/vllm-omni-rocm:v0.22.0 \
  --model Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091

Launch an interactive terminal with the prebuilt Docker image

To work in a development environment, launch the Docker image as follows:

docker run --rm -it \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v <path/to/model>:/app/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --entrypoint bash \
  vllm/vllm-omni-rocm:v0.22.0

Build your own docker image

Build docker image

DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-omni-rocm .

Launch the docker image

Launch with OpenAI API Server

docker run --rm \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8091:8091 \
  vllm-omni-rocm \
  --model Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8091

Launch with interactive session for development

docker run --rm -it \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v <path/to/model>:/app/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint bash \
  vllm-omni-rocm

Build docker image (Intel XPU)

DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.xpu -t vllm-omni-xpu --shm-size=4g .

Launch the docker image

Launch with OpenAI API Server

docker run -it -d --shm-size 10g \
  --name {container_name} \
  --net=host \
  --ipc=host \
  --privileged \
  -v /dev/dri/by-path:/dev/dri/by-path \
  --device /dev/dri:/dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  vllm-omni-xpu \
  --model Qwen/Qwen2.5-Omni-3B --port 8091