GPU

vLLM-Omni is a Python library that supports the GPU platforms listed below. The library itself consists mainly of Python implementations of the framework and models.

Requirements

OS: Linux
Python: 3.12

Note

vLLM-Omni is currently not natively supported on Windows.
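
The requirements above can be checked before installing anything. The following is a minimal sketch using only the standard library; the function name `requirement_problems` is hypothetical, and treating "Python: 3.12" as an exact match is an assumption rather than something this page states.

```python
import platform
import sys

REQUIRED_OS = "Linux"
REQUIRED_PYTHON = (3, 12)  # assumption: treated as an exact major.minor match


def requirement_problems(system=None, version=None):
    """Return a list of human-readable problems; an empty list means the host matches."""
    system = platform.system() if system is None else system
    version = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])
    problems = []
    if system != REQUIRED_OS:
        problems.append(f"OS is {system}, expected {REQUIRED_OS}")
    if version != REQUIRED_PYTHON:
        found = ".".join(str(part) for part in version)
        problems.append(f"Python is {found}, expected 3.12")
    return problems


# Report anything on the current host that differs from the stated requirements.
for problem in requirement_problems():
    print("requirement not met:", problem)
```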

NVIDIA CUDA: GPU with compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100)

AMD ROCm: Validated on gfx942. Other AMD GPUs supported by vLLM should also work.

Intel XPU: Validated on Intel® Arc™ B-Series.

MThreads MUSA: Moore Threads GPU with the MUSA SDK installed (validated on MTT S5000)
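
The compute capability requirement above (7.0 or higher, which applies to NVIDIA GPUs) comes down to a tuple comparison. The sketch below is illustrative only; `meets_min_capability` is a hypothetical helper, not part of vLLM-Omni.

```python
MIN_CUDA_CAPABILITY = (7, 0)  # from the requirement stated above


def meets_min_capability(major, minor, minimum=MIN_CUDA_CAPABILITY):
    """Compare a (major, minor) compute capability against the minimum."""
    # Tuples compare element by element, so (7, 5) >= (7, 0) and (6, 1) < (7, 0).
    return (major, minor) >= minimum


print(meets_min_capability(7, 0))  # V100 reports compute capability 7.0
print(meets_min_capability(6, 1))  # a 6.1 device is below the minimum
```

If PyTorch with CUDA support is already installed, `torch.cuda.get_device_capability(0)` returns the `(major, minor)` pair for device 0, which can be passed straight into this helper.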

Set up using Python

Create a new Python environment

We recommend uv, a very fast Python environment manager, for creating and managing Python environments. Follow the uv documentation to install it, then create a new Python environment with the following commands:

uv venv --python 3.12 --seed
source .venv/bin/activate

Pre-built wheels

Note: Pre-built wheels are currently available for vLLM-Omni 0.11.0rc1, 0.12.0rc1, 0.14.0rc1, 0.14.0, 0.16.0, 0.18.0, 0.20.0, 0.21.0, 0.22.0, 0.23.0, and 0.24.0. If you need a newer unreleased revision, please build from source.

NVIDIA CUDA

Installation of vLLM

vLLM-Omni is built on top of vLLM. Install vLLM with the command below.

uv pip install vllm==0.24.0 --torch-backend=auto

Installation of vLLM-Omni

uv pip install vllm-omni

To run Gradio demos, also install the optional extras:

uv pip install 'vllm-omni[demo]'
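
After installing, it can be useful to confirm which versions ended up in the environment. This is a small sketch using the standard library's `importlib.metadata`; the distribution names `vllm` and `vllm-omni` are taken from the install commands above, and `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Names as used in the `uv pip install` commands above.
for name in ("vllm", "vllm-omni"):
    print(name, installed_version(name) or "not installed")
```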

AMD ROCm

Installation of vLLM

vLLM-Omni is built on top of vLLM. Install vLLM with the command below.

uv pip install vllm==0.24.0+rocm722 --extra-index-url https://wheels.vllm.ai/rocm/0.24.0/rocm722

Installation of vLLM-Omni

# torch in this environment does not come from PyPI. If the install needs to
# build against the torch already installed here, append --no-build-isolation
# to the command below.
uv pip install vllm-omni

# Optional: only needed to run Qwen3 TTS
uv pip uninstall onnxruntime  # must be removed before onnxruntime-rocm can be installed
uv pip install onnxruntime-rocm

Build wheel from source

NVIDIA CUDA

Installation of vLLM

If you do not need to modify the vLLM source code, you can install the stable 0.24.0 release directly:

uv pip install vllm==0.24.0 --torch-backend=auto

The 0.24.0 release of vLLM ships CUDA 13.0-compatible binaries by default. If you need a different CUDA variant or want to reuse an existing PyTorch installation, build vLLM from source instead.

Installation of vLLM-Omni

Because vllm-omni is evolving rapidly, we recommend installing it from source:

git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e .

To run Gradio demos, install with optional extras:

uv pip install -e '.[demo]'

(Optional) Installation of vLLM from source

If you want to inspect, modify, or debug the vLLM source code, install vLLM from source as follows:

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.24.0

Set the VLLM_PRECOMPILED_WHEEL_LOCATION environment variable to point at a pre-built wheel. If the download fails because of network problems, download the .whl file manually and set VLLM_PRECOMPILED_WHEEL_LOCATION to the absolute local path of that file.

# For CUDA 13.0 (the default for v0.24.0; the wheel filename has no `+cu130` suffix)
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://github.com/vllm-project/vllm/releases/download/v0.24.0/vllm-0.24.0-cp38-abi3-manylinux_2_28_x86_64.whl
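
For the manual-download case, the value must be an absolute local path. The sketch below shows one way to normalize a path before exporting it; `local_wheel_location` is a hypothetical helper and the filename in the example is a placeholder, not a real release artifact.

```python
from pathlib import Path


def local_wheel_location(path):
    """Turn a possibly relative wheel path into an absolute path string."""
    wheel = Path(path).expanduser().resolve()
    if wheel.suffix != ".whl":
        raise ValueError(f"not a wheel file: {wheel}")
    return str(wheel)


# Placeholder filename for illustration; use the file you actually downloaded.
print(local_wheel_location("downloads/example-vllm.whl"))
```

The returned string can be assigned to `os.environ["VLLM_PRECOMPILED_WHEEL_LOCATION"]` before invoking the install command from Python, or exported in the shell as shown above.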

If you have no existing PyTorch installation, install vLLM with the command below.

uv pip install --editable .

If you already have PyTorch installed, install vLLM with the commands below.

python use_existing_torch.py
uv pip install -r requirements/build/cuda.txt
uv pip install --no-build-isolation --editable .

AMD ROCm

Installation of vLLM

If you do not need to modify the vLLM source code, you can install the stable 0.24.0 release directly:

uv pip install vllm==0.24.0+rocm722 --extra-index-url https://wheels.vllm.ai/rocm/0.24.0/rocm722

The pre-built 0.24.0 vLLM wheel targets ROCm 7.2.2. If you need a different ROCm stack or want to reuse an existing PyTorch installation, build vLLM from source instead.

Installation of vLLM-Omni

Because vllm-omni is evolving rapidly, we recommend installing it from source:

git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
VLLM_OMNI_TARGET_DEVICE=rocm uv pip install -e .
# OR
uv pip install -e . --no-build-isolation

(Optional) Installation of vLLM from source

If you want to inspect, modify, or debug the vLLM source code, install vLLM from source as follows:

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.24.0
python3 -m pip install -r requirements/rocm.txt
python3 setup.py develop

MThreads MUSA

Prerequisites

MUSA SDK: download from the MUSA SDK Download page
torchada: CUDA→MUSA compatibility layer for PyTorch (pip install torchada)
mthreads-ml-py: MTML Python bindings (pip install mthreads-ml-py)
MATE: MUSA AI Tensor Engine (GitHub)

Installation of vLLM-MUSA

git clone https://github.com/MooreThreads/vllm-musa.git
cd vllm-musa
git checkout v0.18.0-dev
pip install . --no-build-isolation -v

Installation of vLLM-Omni

git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
VLLM_OMNI_TARGET_DEVICE=musa pip install -e . --no-build-isolation

For Gradio demos:

pip install -e '.[demo]' --no-build-isolation

Environment Variables

export MUSA_VISIBLE_DEVICES=0,1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_MUSA_CUSTOM_OP_USE_NATIVE=false
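
When the server is launched from a Python script rather than a shell, the same variables can be passed to the child process. This is a sketch under that assumption; `musa_env` is a hypothetical helper, and the variable names and values are copied from the exports above.

```python
import os


def musa_env(device_ids, base=None):
    """Build an environment mapping with the MUSA settings listed above."""
    env = dict(os.environ if base is None else base)
    env["MUSA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    env["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    env["VLLM_MUSA_CUSTOM_OP_USE_NATIVE"] = "false"
    return env


env = musa_env([0, 1])
print(env["MUSA_VISIBLE_DEVICES"])
```

The resulting mapping can be handed to `subprocess.run(..., env=env)` so that the settings apply only to the launched process.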

Set up using Docker

Pre-built images

NVIDIA CUDA

vLLM-Omni provides official Docker images for deployment. The images are built on top of the vLLM Docker images and are available on Docker Hub as vllm/vllm-omni. The vLLM-Omni version tag indicates which vLLM release the image is based on.

Here is an example deployment command that has been verified on 2 x H100 GPUs:

docker run --runtime nvidia --gpus 2 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    -p 8091:8091 \
    --ipc=host \
    vllm/vllm-omni:v0.24.0 \
    vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091

Tip

The CUDA image does not define a default entrypoint, so include vllm serve ... --omni after the image name.
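
Because the serve command has to follow the image name, it can help to assemble the argument list programmatically when scripting deployments. The following sketch mirrors the CUDA command above; `cuda_docker_command` and its parameters are hypothetical, and the token value is a placeholder.

```python
import os
import shlex


def cuda_docker_command(model, gpus=2, port=8091, tag="v0.24.0", hf_token=None):
    """Build the argv list for the CUDA `docker run` example above."""
    cache = os.path.expanduser("~/.cache/huggingface")
    token = os.environ.get("HF_TOKEN", "") if hf_token is None else hf_token
    return [
        "docker", "run", "--runtime", "nvidia", "--gpus", str(gpus),
        "-v", f"{cache}:/root/.cache/huggingface",
        "--env", f"HF_TOKEN={token}",
        "-p", f"{port}:{port}",
        "--ipc=host",
        f"vllm/vllm-omni:{tag}",
        # No default entrypoint: the serve command comes after the image name.
        "vllm", "serve", model, "--omni", "--port", str(port),
    ]


cmd = cuda_docker_command("Qwen/Qwen3-Omni-30B-A3B-Instruct", hf_token="<your-token>")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting issues; `shlex.join` is used here only to print a copy-pasteable form.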

AMD ROCm

vLLM-Omni provides official Docker images for deployment. The images are built on top of the vLLM Docker images and are available on Docker Hub as vllm/vllm-omni-rocm. The vLLM-Omni version tag indicates which vLLM release the image is based on.

Launch vLLM-Omni Server

Here is an example deployment command that has been verified on 2 x MI300 GPUs:

docker run --rm \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v <path/to/model>:/app/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8091:8091 \
  vllm/vllm-omni-rocm:v0.24.0 \
  --model Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091
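
Large models can take a while to load, so a client may want to wait until the server responds before sending requests. The sketch below polls a `/health` endpoint on the published port; vLLM's OpenAI-compatible server exposes such an endpoint, but whether the `--omni` server behaves identically is an assumption here, and both function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def health_url(host="localhost", port=8091):
    """URL of the health endpoint (assumed to be served at /health)."""
    return f"http://{host}:{port}/health"


def wait_until_ready(url, timeout_s=600.0, interval_s=5.0):
    """Poll `url` until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; retry after a pause
        time.sleep(interval_s)
    return False


print(health_url())
```

A typical call would be `wait_until_ready(health_url())` right after starting the container.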

Launch an interactive terminal with the pre-built Docker image

To work in a development environment, launch the Docker image as follows:

docker run --rm -it \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v <path/to/model>:/app/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --entrypoint bash \
  vllm/vllm-omni-rocm:v0.24.0

Build your own Docker image

NVIDIA CUDA

Build Docker image

DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.cuda -t vllm-omni-cuda .

If you want to specify the base vLLM version:

DOCKER_BUILDKIT=1 docker build \
  -f docker/Dockerfile.cuda \
  --build-arg BASE_IMAGE=vllm/vllm-openai:v0.22.1 \
  -t vllm-omni-cuda .

Launch the Docker image

Launch with OpenAI API Server

Note

The model Qwen/Qwen3-Omni-30B-A3B-Instruct requires significant GPU memory. The example below has been verified on 2 x H100 GPUs.

docker run --runtime nvidia --gpus 2 \
  -v ${HF_HOME:-$HOME/.cache/huggingface}:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8091:8091 \
  --ipc=host \
  vllm-omni-cuda \
  vllm serve --omni --model Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8091

By default, this mounts $HOME/.cache/huggingface as the model cache directory. To use a custom location, set the HF_HOME environment variable before running the command (e.g., export HF_HOME=/data/models).
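
The `${HF_HOME:-$HOME/.cache/huggingface}` expansion used in the command falls back to the default when `HF_HOME` is unset or empty. A Python equivalent, useful when the mount path is computed in a script, might look like this; `hf_cache_dir` is a hypothetical helper.

```python
import os


def hf_cache_dir(env=None):
    """Mirror ${HF_HOME:-$HOME/.cache/huggingface} from the command above."""
    env = os.environ if env is None else env
    hf_home = env.get("HF_HOME")
    if hf_home:  # like ':-', both unset and empty fall back to the default
        return hf_home
    # Unlike the shell, fall back to the user's home directory if HOME is missing.
    home = env.get("HOME") or os.path.expanduser("~")
    return f"{home}/.cache/huggingface"


print(hf_cache_dir())
```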

Launch with interactive session for development

docker run --runtime nvidia --gpus all -it --rm \
  -v ${HF_HOME:-$HOME/.cache/huggingface}:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8091:8091 \
  --ipc=host \
  --entrypoint bash \
  vllm-omni-cuda

AMD ROCm

Build Docker image

DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-omni-rocm .

Launch the Docker image

Launch with OpenAI API Server

docker run --rm \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8091:8091 \
  vllm-omni-rocm \
  --model Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091

Launch with interactive session for development

docker run --rm -it \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v <path/to/model>:/app/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint bash \
  vllm-omni-rocm

Intel XPU

Build Docker image

DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.xpu -t vllm-omni-xpu --shm-size=4g .

Launch the Docker image

Launch with OpenAI API Server

docker run -it -d --shm-size 10g \
  --name {container_name} \
  --net=host \
  --ipc=host \
  --privileged \
  -v /dev/dri/by-path:/dev/dri/by-path \
  --device /dev/dri:/dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  vllm-omni-xpu \
  --model Qwen/Qwen2.5-Omni-3B --port 8091