Basic Quick Start Guide

vLLM Hardware Plugin for Intel® Gaudi® offers ready-to-run container images that package vLLM together with the Intel® Gaudi® software. This enables a fast and simple launch of vLLM Hardware Plugin for Intel® Gaudi® using prebuilt Docker images and Docker Compose, with support for custom runtime parameters and benchmarking.

This guide explains the easiest way of running vLLM Hardware Plugin for Intel® Gaudi® on Ubuntu. It covers model benchmarking, runtime customization, and the selection of validated models such as Llama, Mistral, and Qwen. Advanced configuration can be performed through environment variables or YAML configuration files.

If you prefer to build vLLM Hardware Plugin for Intel® Gaudi® from source or with a custom Dockerfile, refer to the Installation guide.

Requirements

Before you start, ensure that your environment meets the following requirements:

Ubuntu 22.04 or 24.04, Red Hat Enterprise Linux (RHEL) 9.4 or 9.6
Python 3.10
Intel® Gaudi® 2 or 3 AI accelerator
Intel® Gaudi® software version 1.24.1 or later

Additionally, ensure that the Intel® Gaudi® execution environment is properly set up. If it is not, complete the setup by following the Installation Guide instructions.
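The operating system and version requirements above can be checked with a small script before you continue. The following is a minimal sketch and not part of the vllm-gaudi repository: the helper names `version_ge` and `check_os` are hypothetical, `version_ge` relies on GNU `sort -V`, and you are assumed to supply the installed Intel® Gaudi® software version yourself.

```shell
# Hypothetical helpers for checking the requirements listed above.

# Returns 0 when dotted version $1 is greater than or equal to $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Returns 0 when the os-release file (default: /etc/os-release)
# describes one of the operating systems listed in the requirements.
check_os() {
    os_file="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak.
    os_key="$(. "$os_file" && printf '%s:%s' "$ID" "$VERSION_ID")"
    case "$os_key" in
        ubuntu:22.04|ubuntu:24.04|rhel:9.4|rhel:9.6) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `check_os && echo "OS supported"` checks the host, and `version_ge "<installed version>" 1.24.1` succeeds when the version you pass in is 1.24.1 or later.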

Running vLLM Hardware Plugin for Intel® Gaudi® with Docker Compose

Follow these steps to run the vLLM server or launch benchmarks on Gaudi using Docker Compose.

Clone the vLLM plugin repository to get the required files and Docker Compose configurations.
```
git clone https://github.com/vllm-project/vllm-gaudi.git
```

Check out the version of vllm-gaudi that matches your target release.

```
cd vllm-gaudi
# Replace <TAG_OR_BRANCH> with the tag or branch for your desired release.
# You can find available versions at: https://github.com/vllm-project/vllm-gaudi/releases
git checkout <TAG_OR_BRANCH>
```

Navigate to the .cd/ directory in the cloned vllm-gaudi repository. This directory contains the Docker Compose files and helper scripts used in this quickstart.
```
cd .cd/
```

Select your preferred values of the following variables.

| Variable | Description |
| --- | --- |
| `MODEL` | Preferred large language model. For a list of the available models, see the next table. |
| `HF_TOKEN` | Hugging Face token generated from https://huggingface.co. |
| `DOCKER_IMAGE` | Docker image name or URL for the vLLM Gaudi container. When using the Gaudi repository, make sure to select Docker images with the vllm-plugin prefix in the file name. |

The following table lists the supported models and the tensor parallelism (TP) size that each model was validated with:

| Model Name | Validated TP Size |
| --- | --- |
| speakleash/Bielik-4.5B-v3.0-Instruct | 1 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 8 |
| meta-llama/Llama-3.1-70B-Instruct | 4 |
| meta-llama/Llama-3.1-405B-Instruct | 8 |
| meta-llama/Llama-3.1-8B-Instruct | 1 |
| meta-llama/Llama-3.3-70B-Instruct | 4 |
| mistralai/Mistral-7B-Instruct-v0.2 | 1 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 4 |
| Qwen/Qwen2.5-7B-Instruct | 1 |
| Qwen/Qwen2.5-VL-7B-Instruct | 1 |
| Qwen/Qwen2.5-14B-Instruct | 1 |
| Qwen/Qwen2.5-32B-Instruct | 1 |
| Qwen/Qwen2.5-72B-Instruct | 4 |
| Qwen/Qwen3-0.6B | 1 |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | 1 |
| ibm-granite/granite-8b-code-instruct-4k | 1 |
| ibm-granite/granite-20b-code-instruct-8k | 1 |
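If you script your launches, it can be convenient to look up the validated TP size for a model before starting the server. The sketch below simply transcribes the table above into a shell `case` statement. The function name `validated_tp_size` is hypothetical, and the Docker Compose setup in the repository does not require this step.

```shell
# Hypothetical lookup of the validated TP size from the table above.
# Prints the TP size and returns 0 for a listed model; returns 1 otherwise.
validated_tp_size() {
    case "$1" in
        speakleash/Bielik-4.5B-v3.0-Instruct|\
        meta-llama/Llama-3.1-8B-Instruct|\
        mistralai/Mistral-7B-Instruct-v0.2|\
        Qwen/Qwen2.5-7B-Instruct|\
        Qwen/Qwen2.5-VL-7B-Instruct|\
        Qwen/Qwen2.5-14B-Instruct|\
        Qwen/Qwen2.5-32B-Instruct|\
        Qwen/Qwen3-0.6B|\
        Qwen/Qwen3-30B-A3B-Instruct-2507|\
        ibm-granite/granite-8b-code-instruct-4k|\
        ibm-granite/granite-20b-code-instruct-8k)
            echo 1 ;;
        mistralai/Mixtral-8x7B-Instruct-v0.1)
            echo 2 ;;
        meta-llama/Llama-3.1-70B-Instruct|\
        meta-llama/Llama-3.3-70B-Instruct|\
        mistralai/Mixtral-8x22B-Instruct-v0.1|\
        Qwen/Qwen2.5-72B-Instruct)
            echo 4 ;;
        deepseek-ai/DeepSeek-R1-Distill-Llama-70B|\
        meta-llama/Llama-3.1-405B-Instruct)
            echo 8 ;;
        *)
            echo "Model not in the validated list: $1" >&2
            return 1 ;;
    esac
}
```

For example, `validated_tp_size "Qwen/Qwen2.5-72B-Instruct"` prints `4`, which is the number of accelerators the model was validated on.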

Set the selected environment variables using the following example as a reference.

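```
# Export the variables so that the docker compose command in the next step can read them.
export MODEL="Qwen/Qwen2.5-14B-Instruct"
export HF_TOKEN="<your huggingface token>"
export DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.24.1/ubuntu24.04/habanalabs/vllm-0.24.0-ptupstream-2.11.0:latest"
```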
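Docker Compose can only read variables that are present in the environment of the shell that launches it, so it is worth confirming that all three are exported before starting the server. The following is a minimal sketch. The helper name `require_vars` is hypothetical, and it uses `printenv`, which reports only exported variables.

```shell
# Hypothetical pre-flight check: fail when any named variable is
# unset, empty, or not exported to child processes.
require_vars() {
    missing=0
    for name in "$@"; do
        # printenv prints nothing for variables that are not exported.
        if [ -z "$(printenv "$name")" ]; then
            echo "Missing or unexported variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running `require_vars MODEL HF_TOKEN DOCKER_IMAGE` before the next step prints the name of each variable that still needs to be set and returns a non-zero status if any is missing.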

Run the vLLM server using Docker Compose.
```
docker compose up
```
To automatically run benchmarking for a selected model using default settings, add the --profile benchmark option.
```
docker compose --profile benchmark up
```

After completing this step, the vLLM server is running. If you started it with the benchmark profile, the associated benchmark suite also starts automatically. The basic setup is sufficient for most users. To align the setup with your specific needs, you can use the advanced configuration options.
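Loading a model can take a while, so scripts that depend on the server usually wait until it responds before sending requests. The sketch below polls a URL with `curl` until it succeeds or a timeout expires. The function name `wait_for_url` is hypothetical. The address `http://localhost:8000/health` mentioned afterwards assumes that the Compose file publishes the default vLLM port 8000 on the host, so adjust it to match your setup.

```shell
# Hypothetical helper: poll a URL until it returns a successful
# HTTP status, or give up after the given number of seconds.
wait_for_url() {
    url="$1"
    timeout="${2:-300}"   # total seconds to wait (default: 300)
    elapsed=0
    until curl --silent --fail --output /dev/null --max-time 5 "$url"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "Timed out waiting for $url" >&2
            return 1
        fi
        sleep 2
        elapsed=$((elapsed + 2))
    done
    return 0
}
```

With the assumed port mapping, `wait_for_url http://localhost:8000/health 600` returns once the vLLM health endpoint responds, or fails after roughly ten minutes.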

After setting up and running vLLM Hardware Plugin for Intel® Gaudi®, you can begin performing inference to generate model outputs. For detailed instructions, see the Executing Inference guide.
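As a quick smoke test, you can send a request to the OpenAI-compatible completions endpoint that the vLLM server exposes. The following is a minimal sketch. The function names are hypothetical, the base URL `http://localhost:8000` assumes that the default vLLM port is published on the host, and the payload builder does not escape quotes or backslashes, so it suits only simple prompts.

```shell
# Hypothetical helper: build a JSON body for the /v1/completions endpoint.
# Arguments: model name, prompt text, maximum number of tokens.
# Note: no JSON escaping is performed on the arguments.
build_completion_payload() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}' "$1" "$2" "$3"
}

# Hypothetical helper: send the request to a running server.
# Arguments: base URL, model name, prompt text, maximum number of tokens.
send_completion() {
    curl --silent "$1/v1/completions" \
        --header "Content-Type: application/json" \
        --data "$(build_completion_payload "$2" "$3" "$4")"
}

# Example invocation (assumes the port mapping described above):
# send_completion http://localhost:8000 "$MODEL" "Hello, my name is" 32
```

The model name in the request must match the model that the server was started with, which is the value of `MODEL` in this guide.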

To achieve the best performance on HPU, follow the methods outlined in the Optimizing Training Platform Guide.