Basic Quick Start Guide
vLLM Hardware Plugin for Intel® Gaudi® offers ready-to-run container images that package vLLM together with the Intel® Gaudi® software. With these prebuilt Docker images and Docker Compose, you can launch the plugin quickly, pass custom runtime parameters, and run benchmarks.
This guide explains the easiest way of running vLLM Hardware Plugin for Intel® Gaudi® on Ubuntu. It includes features for model benchmarking, runtime customization, and selecting validated models such as LLaMA, Mistral, and Qwen. Advanced configuration can be performed through environment variables or YAML configuration files.
If you prefer to build vLLM Hardware Plugin for Intel® Gaudi® from source or with a custom Dockerfile, refer to the Installation guide.
Requirements
Before you start, ensure that your environment meets the following requirements:
- Ubuntu 22.04 or 24.04, Red Hat Enterprise Linux (RHEL) 9.4 or 9.6
- Python 3.10
- Intel® Gaudi® 2 or 3 AI accelerator
- Intel® Gaudi® software version 1.24.0 or later
Additionally, ensure that the Intel® Gaudi® execution environment is properly set up. If it is not, complete the setup by following the Installation Guide instructions.
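The requirements above can be checked with a small script before starting. The following is a minimal sketch, not part of the plugin: the function names `version_ge` and `os_supported` are hypothetical, the installed software version is a placeholder that you would replace with the value reported on your system, and `sort -V` is assumed to be available (it is part of GNU coreutils).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and not shipped with the plugin.

# version_ge A B succeeds when dotted version A is greater than or equal to B.
# `sort -V` orders version strings; A >= B exactly when B sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# os_supported ID VERSION_ID succeeds for the operating systems listed above.
# ID and VERSION_ID are the field names used in /etc/os-release.
os_supported() {
    case "$1:$2" in
        ubuntu:22.04|ubuntu:24.04|rhel:9.4|rhel:9.6) return 0 ;;
        *) return 1 ;;
    esac
}

# Placeholder value: replace with the version installed on your system.
installed="1.24.0"
if version_ge "$installed" "1.24.0"; then
    echo "Intel Gaudi software $installed meets the 1.24.0 minimum"
else
    echo "Intel Gaudi software $installed is older than 1.24.0" >&2
fi
```

To check the running system, you could source `/etc/os-release` and call `os_supported "$ID" "$VERSION_ID"`.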
Running vLLM Hardware Plugin for Intel® Gaudi® with Docker Compose
Follow these steps to run the vLLM server or launch benchmarks on Gaudi using Docker Compose.
- Clone the vLLM plugin repository to get the required files and Docker Compose configurations.
- Check out the version of `vllm-gaudi` that matches your target release.
- Navigate to the `.cd/` directory in the cloned `vllm-gaudi` repository. This directory contains the Docker Compose files and helper scripts used in this quickstart.
- Select your preferred values of the following variables.
  | Variable | Description |
  | --- | --- |
  | `MODEL` | Preferred large language model. For a list of the available models, see the next table. |
  | `HF_TOKEN` | Hugging Face token generated from https://huggingface.co. |
  | `DOCKER_IMAGE` | Docker image name or URL for the vLLM Gaudi container. When using the Intel® Gaudi® vault (`vault.habana.ai/gaudi-docker`), select an image with the `vllm-` prefix. The naming convention is `vllm-<VLLM_VERSION>-pt<TYPE>-<PT_VERSION>`, where `<TYPE>` is either `upstream` (upstream PyTorch build, default) or `fork` (Intel® Gaudi® PyTorch fork build). For example: `vllm-0.19.1-ptupstream-2.10.0:latest`. |

  The following table lists the supported vLLM models:
  | Model Name | Validated TP Size |
  | --- | --- |
  | speakleash/Bielik-4.5B-v3.0-Instruct | 1 |
  | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 8 |
  | meta-llama/Llama-3.1-70B-Instruct | 4 |
  | meta-llama/Llama-3.1-405B-Instruct | 8 |
  | meta-llama/Llama-3.1-8B-Instruct | 1 |
  | meta-llama/Llama-3.3-70B-Instruct | 4 |
  | mistralai/Mistral-7B-Instruct-v0.2 | 1 |
  | mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 |
  | mistralai/Mixtral-8x22B-Instruct-v0.1 | 4 |
  | Qwen/Qwen2.5-7B-Instruct | 1 |
  | Qwen/Qwen2.5-VL-7B-Instruct | 1 |
  | Qwen/Qwen2.5-14B-Instruct | 1 |
  | Qwen/Qwen2.5-32B-Instruct | 1 |
  | Qwen/Qwen2.5-72B-Instruct | 4 |
  | Qwen/Qwen3-0.6B | 1 |
  | Qwen/Qwen3-30B-A3B-Instruct-2507 | 1 |
  | ibm-granite/granite-8b-code-instruct-4k | 1 |
  | ibm-granite/granite-20b-code-instruct-8k | 1 |

- Set the selected environment variables using the following example as a reference.
- Run the vLLM server using Docker Compose.
  To automatically run benchmarking for a selected model using default settings, add the `--profile benchmark up` option.
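The variable and launch steps above can be sketched with a few shell helpers. This is an illustrative sketch under stated assumptions, not the commands shipped with the quickstart: the function names `build_image_name`, `missing_vars`, and `compose_command` are hypothetical, and the sketch assumes the Compose files in `.cd/` read `MODEL`, `HF_TOKEN`, and `DOCKER_IMAGE` from the environment.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical helpers.

# Build an image name following vllm-<VLLM_VERSION>-pt<TYPE>-<PT_VERSION>.
# Arguments: VLLM_VERSION PT_VERSION [TYPE]; TYPE defaults to "upstream".
build_image_name() {
    vllm_version=$1
    pt_version=$2
    pt_type=${3:-upstream}
    case "$pt_type" in
        upstream|fork) ;;
        *) echo "TYPE must be 'upstream' or 'fork', got '$pt_type'" >&2
           return 1 ;;
    esac
    echo "vllm-${vllm_version}-pt${pt_type}-${pt_version}"
}

# Print the names of the quickstart variables that are unset or empty,
# separated by spaces. Prints an empty line when all three are set.
missing_vars() {
    missing=""
    for name in MODEL HF_TOKEN DOCKER_IMAGE; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    echo "${missing# }"
}

# Print the Compose command for a plain server run, or for a run with the
# benchmark profile when the first argument is "benchmark".
compose_command() {
    if [ "${1:-}" = "benchmark" ]; then
        echo "docker compose --profile benchmark up"
    else
        echo "docker compose up"
    fi
}
```

For example, `build_image_name 0.19.1 2.10.0` prints `vllm-0.19.1-ptupstream-2.10.0`, which matches the example image name in the variables table once the `:latest` tag is appended. Calling `missing_vars` before launching gives an early warning when one of the three variables has not been exported.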
After completing this step, the vLLM server is running. If you added the benchmark profile, the associated benchmark suite also starts automatically. To align the setup with your specific needs, you can use the advanced configuration options. For most users, the basic setup is sufficient, but advanced users may benefit from additional customizations.
After setting up and running vLLM Hardware Plugin for Intel® Gaudi®, you can begin performing inference to generate model outputs. For detailed instructions, see the Executing Inference guide.
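As a quick smoke test before turning to the Executing Inference guide, a request can be sent to the server's OpenAI-compatible completions endpoint. The sketch below is an assumption-laden example: the function names and the `VLLM_HOST` and `VLLM_PORT` variables are hypothetical, and it assumes the Compose setup exposes the server on vLLM's default port 8000 on the local host, which may differ in your configuration.

```shell
#!/bin/sh
# Sketch only: helper names, VLLM_HOST, and VLLM_PORT are hypothetical.

# Build a JSON body for the /v1/completions endpoint.
# Arguments: MODEL PROMPT [MAX_TOKENS]; MAX_TOKENS defaults to 32.
# The model name and prompt are inserted verbatim, so they must not
# contain double quotes or backslashes.
completion_payload() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' \
        "$1" "$2" "${3:-32}"
}

# Send the request with curl. Host and port default to localhost:8000,
# which is an assumption about how the server is exposed.
send_completion() {
    curl -s "http://${VLLM_HOST:-localhost}:${VLLM_PORT:-8000}/v1/completions" \
        -H "Content-Type: application/json" \
        -d "$(completion_payload "$1" "$2" "${3:-32}")"
}
```

With the server running, `send_completion "$MODEL" "Hello, my name is"` would request a short completion from the model selected earlier.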
To achieve the best performance on HPU, follow the methods outlined in the Optimizing Training Platform Guide.