Basic Quick Start Guide

vLLM Hardware Plugin for Intel® Gaudi® offers ready-to-run container images that package vLLM together with the Intel® Gaudi® software. This enables a fast and simple launch of vLLM Hardware Plugin for Intel® Gaudi® using prebuilt Docker images and Docker Compose, with support for custom runtime parameters and benchmarking.

This guide explains the easiest way to run vLLM Hardware Plugin for Intel® Gaudi® on Ubuntu. It covers model benchmarking, runtime customization, and selecting validated models such as LLaMA, Mistral, and Qwen. Advanced configuration can be performed through environment variables or YAML configuration files.

If you prefer to build vLLM Hardware Plugin for Intel® Gaudi® from source or with a custom Dockerfile, refer to the Installation guide.

Requirements

Before you start, ensure that your environment meets the following requirements:

  • Ubuntu 22.04 or 24.04, Red Hat Enterprise Linux (RHEL) 9.4 or 9.6
  • Python 3.10
  • Intel® Gaudi® 2 or 3 AI accelerator
  • Intel® Gaudi® software version 1.24.0 or later

Additionally, ensure that the Intel® Gaudi® execution environment is properly set up. If it is not, complete the setup by following the instructions in the Installation guide.
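The minimum software version listed above can be checked with a small comparison helper. The following is a minimal sketch, not part of the plugin: `version_ge` is a hypothetical helper name, it relies on `sort -V` from GNU coreutils, and `installed_version` is a placeholder that you set to the Intel® Gaudi® software version reported on your system.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Relies on GNU sort's version ordering (-V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Placeholder: replace with the Gaudi software version installed on your system.
installed_version="1.24.0"

if version_ge "$installed_version" "1.24.0"; then
    echo "Gaudi software $installed_version meets the 1.24.0 minimum"
else
    echo "Gaudi software $installed_version is older than 1.24.0" >&2
fi
```

The comparison expects both versions to use the same number of components, for example `1.24.0` rather than `1.24`.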

Running vLLM Hardware Plugin for Intel® Gaudi® with Docker Compose

Follow these steps to run the vLLM server or launch benchmarks on Gaudi using Docker Compose.

  1. Clone the vLLM plugin repository to get the required files and Docker Compose configurations.

    git clone https://github.com/vllm-project/vllm-gaudi.git
    
  2. Check out the version of vllm-gaudi that matches your target release.

    cd vllm-gaudi
    # Replace <TAG_OR_BRANCH> with the tag or branch for your desired release.
    # You can find available versions at: https://github.com/vllm-project/vllm-gaudi/releases
    git checkout <TAG_OR_BRANCH>
    
  3. Navigate to the .cd/ directory in the cloned vllm-gaudi repository. This directory contains the Docker Compose files and helper scripts used in this quickstart.

    cd .cd/
    
  4. Select your preferred values of the following variables.

    Variable      Description
    MODEL         Preferred large language model. For a list of the available models, see the next table.
    HF_TOKEN      Hugging Face token generated from https://huggingface.co.
    DOCKER_IMAGE  Docker image name or URL for the vLLM Gaudi container. When using the Intel® Gaudi® vault (vault.habana.ai/gaudi-docker), select an image with the vllm- prefix. The naming convention is vllm-<VLLM_VERSION>-pt<TYPE>-<PT_VERSION>, where <TYPE> is either upstream (upstream PyTorch build, default) or fork (Intel® Gaudi® PyTorch fork build). For example: vllm-0.19.1-ptupstream-2.10.0:latest.

    The following table lists the supported vLLM models:

    Model Name                                 Validated TP Size
    speakleash/Bielik-4.5B-v3.0-Instruct       1
    deepseek-ai/DeepSeek-R1-Distill-Llama-70B  8
    meta-llama/Llama-3.1-70B-Instruct          4
    meta-llama/Llama-3.1-405B-Instruct         8
    meta-llama/Llama-3.1-8B-Instruct           1
    meta-llama/Llama-3.3-70B-Instruct          4
    mistralai/Mistral-7B-Instruct-v0.2         1
    mistralai/Mixtral-8x7B-Instruct-v0.1       2
    mistralai/Mixtral-8x22B-Instruct-v0.1      4
    Qwen/Qwen2.5-7B-Instruct                   1
    Qwen/Qwen2.5-VL-7B-Instruct                1
    Qwen/Qwen2.5-14B-Instruct                  1
    Qwen/Qwen2.5-32B-Instruct                  1
    Qwen/Qwen2.5-72B-Instruct                  4
    Qwen/Qwen3-0.6B                            1
    Qwen/Qwen3-30B-A3B-Instruct-2507           1
    ibm-granite/granite-8b-code-instruct-4k    1
    ibm-granite/granite-20b-code-instruct-8k   1

  5. Set the selected environment variables using the following example as a reference.

    # Export the variables so that docker compose in the next step can read them.
    export MODEL="Qwen/Qwen2.5-14B-Instruct"
    export HF_TOKEN="<your huggingface token>"
    export DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.24.0/ubuntu24.04/habanalabs/vllm-0.19.1-ptupstream-2.10.0:latest"
    
  6. Run the vLLM server using Docker Compose.

    docker compose up
    

    To automatically run benchmarking for the selected model using default settings, add the --profile benchmark option before up.

    docker compose --profile benchmark up
    

After completing this step, the vLLM server is running. If you started it with the benchmark profile, the benchmark suite for the selected model also starts automatically. The basic setup is sufficient for most users; to tailor the setup to your specific needs, use the advanced configuration options.
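Steps 4 to 6 above can be wrapped in a small script that refuses to start Docker Compose until the three variables are set. The following is a minimal sketch, not part of the vllm-gaudi repository: the function names `require_vars`, `image_matches_convention`, and `launch_vllm` are hypothetical, and the image check only loosely mirrors the vllm-<VLLM_VERSION>-pt<TYPE>-<PT_VERSION> naming convention described in step 4.

```shell
# Hypothetical helpers; not part of vllm-gaudi.

# Succeeds only if every named variable is set and non-empty.
require_vars() {
    local name value missing
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Succeeds if the image name loosely follows
# vllm-<VLLM_VERSION>-pt<TYPE>-<PT_VERSION>, with <TYPE> upstream or fork.
image_matches_convention() {
    local image_name
    image_name="${1##*/}"            # drop the registry and path prefix
    image_name="${image_name%%:*}"   # drop the tag
    case "$image_name" in
        vllm-*-ptupstream-*|vllm-*-ptfork-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Validate the variables, then start the server from the .cd/ directory.
launch_vllm() {
    require_vars MODEL HF_TOKEN DOCKER_IMAGE || return 1
    if ! image_matches_convention "$DOCKER_IMAGE"; then
        echo "note: DOCKER_IMAGE does not follow the vault naming convention" >&2
    fi
    # Pass the variables explicitly so they reach docker compose even if
    # they were set in the shell without export.
    MODEL="$MODEL" HF_TOKEN="$HF_TOKEN" DOCKER_IMAGE="$DOCKER_IMAGE" \
        docker compose up
}
```

After sourcing the script and setting the variables, call `launch_vllm` from the .cd/ directory. The naming check only prints a note, because images from outside the Intel® Gaudi® vault may use other names.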

After setting up and running vLLM Hardware Plugin for Intel® Gaudi®, you can begin performing inference to generate model outputs. For detailed instructions, see the Executing Inference guide.
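As a quick first request, the sketch below sends a prompt to the OpenAI-compatible completions endpoint that vLLM serves. Several details are assumptions rather than facts from this guide: the address http://localhost:8000 (adjust it to the host and port your Docker Compose setup publishes), the hypothetical helper names `completion_payload` and `send_completion`, and a prompt that contains no quotes or backslashes, since the helper does not escape JSON.

```shell
# Assumed address; adjust host and port to match your Docker Compose setup.
VLLM_URL="${VLLM_URL:-http://localhost:8000}"

# Build a minimal JSON request body.
# $1 = model name, $2 = prompt (no quotes or backslashes), $3 = max tokens
completion_payload() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' "$1" "$2" "$3"
}

# Send the request to the completions endpoint; max tokens defaults to 32.
send_completion() {
    curl -sS "$VLLM_URL/v1/completions" \
        -H "Content-Type: application/json" \
        -d "$(completion_payload "$1" "$2" "${3:-32}")"
}
```

For example, `send_completion "Qwen/Qwen2.5-14B-Instruct" "Hello" 16` sends a short prompt to the model selected earlier. The model name must match the value of MODEL used to start the server.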

To achieve the best performance on HPU, follow the methods outlined in the Optimizing Training Platform Guide.