GPU¶
vLLM-Omni is a Python library that supports the GPU variants listed below. The library itself mainly contains Python implementations of the framework and models.
Requirements¶
- OS: Linux
- Python: 3.12
Note
vLLM-Omni is currently not natively supported on Windows.
- GPU (NVIDIA CUDA): compute capability 7.0 or higher (e.g., V100, T4, RTX 20xx, A100, L4, H100)
- GPU (AMD ROCm): validated on gfx942; other AMD GPUs supported by vLLM should also work
- GPU (Intel XPU): validated on Intel® Arc™ B-Series
- GPU (Moore Threads MUSA): Moore Threads GPU with the MUSA SDK installed (validated on MTT S5000)
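The compute capability 7.0 minimum for NVIDIA GPUs can be checked with a small script before installing anything. The following is a minimal sketch and not part of vLLM-Omni: cc_supported is a hypothetical helper, and the sample values are hard-coded for illustration.

```shell
# Hypothetical helper (not part of vLLM-Omni): succeeds when a
# "major.minor" compute capability string is 7.0 or higher.
cc_supported() {
    # ${1%%.*} keeps only the part before the first dot (the major version)
    [ "${1%%.*}" -ge 7 ]
}

# Sample values only; replace them with the value reported for your GPU.
for cc in 6.1 7.0 7.5 9.0; do
    if cc_supported "$cc"; then
        echo "$cc: meets the 7.0 minimum"
    else
        echo "$cc: below the 7.0 minimum"
    fi
done
```

On recent NVIDIA drivers, nvidia-smi --query-gpu=compute_cap --format=csv,noheader should print the value to pass in; treat that query field as an assumption and confirm it with nvidia-smi --help-query-gpu on your system.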
Set up using Python¶
Create a new Python environment¶
We recommend using uv, a very fast Python environment manager, to create and manage Python environments. Follow the uv documentation to install it. After installing uv, create and activate a new Python 3.12 environment, for example with uv venv --python 3.12 --seed followed by source .venv/bin/activate.
Pre-built wheels¶
Note: Pre-built wheels are currently available for vLLM-Omni 0.11.0rc1, 0.12.0rc1, 0.14.0rc1, 0.14.0, 0.16.0, 0.18.0, 0.20.0, 0.21.0, and 0.22.0. If you need a newer unreleased revision, please build from source.
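The note above implies a simple decision: use a pre-built wheel when the target version is in the list, otherwise build from source. A minimal sketch of that check follows. has_prebuilt_wheel is a hypothetical helper, and the version list is copied from the note, so it goes stale as new releases appear.

```shell
# Versions with pre-built wheels, copied from the note above.
PREBUILT_VERSIONS="0.11.0rc1 0.12.0rc1 0.14.0rc1 0.14.0 0.16.0 0.18.0 0.20.0 0.21.0 0.22.0"

# Hypothetical helper: succeeds when the given version has a pre-built wheel.
# Padding with spaces makes the match exact (0.14 does not match 0.14.0).
has_prebuilt_wheel() {
    case " $PREBUILT_VERSIONS " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

target="0.22.0"
if has_prebuilt_wheel "$target"; then
    echo "vLLM-Omni $target: install the pre-built wheel"
else
    echo "vLLM-Omni $target: build from source"
fi
```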
Installation of vLLM¶
vLLM-Omni is built on top of vLLM. Install vLLM with the command below.
Installation of vLLM-Omni¶
To run Gradio demos, also install the optional extras:
Installation of vLLM¶
vLLM-Omni is built on top of vLLM. Install vLLM with the command below.
Installation of vLLM-Omni¶
# we need to add --no-build-isolation as the torch
# is not obtained from pypi, we have to install using the
# torch installed in our environment
uv pip install vllm-omni
# Optional: only needed to run Qwen3 TTS
uv pip uninstall onnxruntime # must be removed before onnxruntime-rocm can be installed
uv pip install onnxruntime-rocm
Build wheel from source¶
Installation of vLLM¶
If you do not need to modify the vLLM source code, you can install the stable 0.22.0 release directly.
The 0.22.0 release of vLLM ships CUDA 13.0-compatible binaries by default. If you need a different CUDA variant or want to reuse an existing PyTorch installation, build vLLM from source instead.
Installation of vLLM-Omni¶
Because vLLM-Omni is evolving rapidly, installing it from source is recommended.
To run Gradio demos, install with optional extras:
(Optional) Installation of vLLM from source
If you want to inspect, modify, or debug the vLLM source code, install the library from source as follows:
Set up environment variables to get pre-built wheels. If the download fails because of network problems, download the .whl file manually and set VLLM_PRECOMPILED_WHEEL_LOCATION to its absolute local path.
If you have no existing PyTorch installation, install vLLM with the command below.
If you already have PyTorch installed, install vLLM with the command below.
Installation of vLLM¶
If you do not need to modify the vLLM source code, you can install the stable 0.22.0 release directly.
The pre-built 0.22.0 vLLM wheel targets ROCm 7.2.2. If you need a different ROCm stack or want to reuse an existing PyTorch installation, build vLLM from source instead.
Installation of vLLM-Omni¶
Because vLLM-Omni is evolving rapidly, installing it from source is recommended:
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
VLLM_OMNI_TARGET_DEVICE=rocm uv pip install -e .
# OR
uv pip install -e . --no-build-isolation
(Optional) Installation of vLLM from source
If you want to inspect, modify, or debug the vLLM source code, install the library from source with the following instructions:
Prerequisites¶
- MUSA SDK: download from the MUSA SDK Download page
- torchada: CUDA→MUSA compatibility layer for PyTorch (pip install torchada)
- mthreads-ml-py: MTML Python bindings (pip install mthreads-ml-py)
- MATE: MUSA AI Tensor Engine (GitHub)
Installation of vLLM-MUSA¶
git clone https://github.com/MooreThreads/vllm-musa.git
cd vllm-musa
git checkout v0.18.0-dev
pip install . --no-build-isolation -v
Installation of vLLM-Omni¶
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
VLLM_OMNI_TARGET_DEVICE=musa pip install -e . --no-build-isolation
For Gradio demos:
Environment Variables¶
Set up using Docker¶
Pre-built images¶
vLLM-Omni offers official Docker images for deployment. These images are built on top of the vLLM Docker images and are available on Docker Hub as vllm/vllm-omni. The vLLM-Omni version indicates which vLLM release it is based on.
Here's an example deployment command that has been verified on 2 x H100 GPUs:
docker run --runtime nvidia --gpus 2 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8091:8091 \
--ipc=host \
vllm/vllm-omni:v0.22.0 \
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091
Tip
The CUDA image does not define a default entrypoint, so include vllm serve ... --omni after the image name.
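After the container starts, the model can take a while to load before the server accepts requests. The sketch below polls the server until it answers. It assumes the server exposes the OpenAI-compatible /v1/models route on the published port 8091, as the upstream vLLM server does, and that curl is installed on the host; wait_for_server is a hypothetical helper.

```shell
# Hypothetical helper: poll a URL until it returns a successful HTTP status,
# trying at most max_tries times with a 5-second pause between attempts.
wait_for_server() {
    local url="$1" max_tries="${2:-60}" i=1
    while [ "$i" -le "$max_tries" ]; do
        # -s: quiet, -f: treat HTTP errors as failure, -o: discard the body
        if curl -sf -o /dev/null "$url"; then
            echo "server is ready after $i attempt(s)"
            return 0
        fi
        sleep 5
        i=$((i + 1))
    done
    echo "server did not respond after $max_tries attempts" >&2
    return 1
}
```

For example, wait_for_server http://localhost:8091/v1/models 120 would keep trying for roughly ten minutes before giving up.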
vLLM-Omni offers official Docker images for deployment. These images are built on top of the vLLM Docker images and are available on Docker Hub as vllm/vllm-omni-rocm. The vLLM-Omni version indicates which vLLM release it is based on.
Launch vLLM-Omni Server¶
Here's an example deployment command that has been verified on 2 x MI300 GPUs:
docker run --rm \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/model>:/app/model \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8091:8091 \
vllm/vllm-omni-rocm:v0.22.0 \
--model Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091
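The docker run commands above pass --env "HF_TOKEN=$HF_TOKEN", which forwards an empty value without any warning when the variable is unset in the calling shell. A minimal preflight sketch follows; check_hf_token is a hypothetical helper and not part of vLLM-Omni.

```shell
# Hypothetical helper: fails with a message when HF_TOKEN is unset or empty.
check_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; downloads of gated models may fail" >&2
        return 1
    fi
}

# Warn before launching the container. Public or already-cached models
# may still work without a token, so this only prints a notice.
check_hf_token || echo "continuing without a Hugging Face token"
```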
Launch an interactive terminal with the pre-built Docker image¶
To work in a development environment, launch the Docker image as follows:
docker run --rm -it \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/model>:/app/model \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
--entrypoint bash \
vllm/vllm-omni-rocm:v0.22.0
Build your own docker image¶
Build docker image¶
Launch the docker image¶
Launch with OpenAI API Server¶
docker run --rm \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8091:8091 \
vllm-omni-rocm \
--model Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8091
Launch with interactive session for development¶
Build docker image¶
Launch the docker image¶
Launch with OpenAI API Server¶
docker run -it -d --shm-size 10g \
--name {container_name} \
--net=host \
--ipc=host \
--privileged \
-v /dev/dri/by-path:/dev/dri/by-path \
--device /dev/dri:/dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
vllm-omni-xpu \
--model Qwen/Qwen2.5-Omni-3B --port 8091