Frequently Asked Questions
Prerequisites and System Requirements
What are the system requirements for running vLLM on Intel® Gaudi®?
- Ubuntu 22.04 LTS OS.
- Python 3.10.
- Intel Gaudi 2 or Intel Gaudi 3 AI accelerator.
- Intel Gaudi software version 1.24.0 or later.
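The OS and Python requirements above can be checked programmatically. The following is a minimal sketch, not part of the plugin: the helper names `parse_os_release`, `check_host`, and `check_this_machine` are hypothetical, and the sketch does not detect the accelerator or the Intel Gaudi software version.

```python
import sys
from pathlib import Path


def parse_os_release(text: str) -> dict[str, str]:
    """Parse the KEY=value lines of an os-release file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def check_host(os_info: dict[str, str], python_version: tuple[int, int]) -> dict[str, bool]:
    """Compare a host description against the OS and Python requirements listed above."""
    return {
        "ubuntu_22_04": os_info.get("ID") == "ubuntu"
        and os_info.get("VERSION_ID") == "22.04",
        "python_3_10": tuple(python_version) == (3, 10),
    }


def check_this_machine() -> dict[str, bool]:
    """Run the checks on the current machine (a missing os-release file fails the OS check)."""
    path = Path("/etc/os-release")
    text = path.read_text() if path.exists() else ""
    return check_host(parse_os_release(text), sys.version_info[:2])
```

Calling `check_this_machine()` returns one boolean per requirement, so a failing entry points directly at the item to fix.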
What is the vLLM plugin and where can I find its GitHub repository?
Intel develops and maintains its own vLLM plugin project, the vLLM Hardware Plugin for Intel® Gaudi®, which is hosted in the vLLM-gaudi repository on GitHub.
How do I verify that the Intel® Gaudi® software is installed correctly?
- Run `hl-smi` to check if Intel® Gaudi® accelerators are visible. Refer to System Verifications and Final Tests for more details.
- Run `apt list --installed | grep habana` to verify installed packages. The output should look similar to the following example:
- Check the installed Python packages by running `pip list | grep habana` and `pip list | grep neural`. The output should look similar to this example:
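The first and last checks above can also be scripted. This is a minimal sketch under stated assumptions: the function names are hypothetical, the `apt` check is not covered, and the package filter matches case-insensitively, which is slightly broader than the `grep` commands shown.

```python
import shutil
import subprocess
from importlib import metadata


def accelerators_visible() -> bool:
    """Return True if hl-smi is on PATH and exits with status 0."""
    if shutil.which("hl-smi") is None:
        return False
    result = subprocess.run(["hl-smi"], capture_output=True, text=True)
    return result.returncode == 0


def filter_packages(names, keywords=("habana", "neural")):
    """Keep names containing any keyword, similar to `pip list | grep <keyword>`."""
    return sorted(n for n in names if any(k in n.lower() for k in keywords))


def installed_gaudi_python_packages():
    """List installed Python distributions whose names match the keywords."""
    names = [dist.metadata["Name"] for dist in metadata.distributions()]
    return filter_packages(n for n in names if n)
```

An empty list from `installed_gaudi_python_packages()` or `False` from `accelerators_visible()` suggests the corresponding installation step needs to be revisited.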
How can I quickly set up the environment for vLLM using Docker?
Use the Dockerfile.ubuntu.pytorch.vllm file provided in the .cd directory on GitHub to build and run a container with the latest Intel® Gaudi® software release.
For more details, see Quick Start Using Dockerfile.
Building and Installing vLLM
How can I install vLLM on Intel Gaudi?
There are two installation methods:
- Running vLLM Hardware Plugin for Intel® Gaudi® using a Dockerfile: We recommend this method as it is the most suitable option for production deployments.
- Building vLLM Hardware Plugin for Intel® Gaudi® from source: This method is intended for developers working with experimental code or new features that are still under testing.
Examples and Model Support
Which models and configurations have been validated on Intel® Gaudi® 2 and Intel® Gaudi® 3 devices?
The list of validated models is available in the Validated Models document. The list includes models such as:
- Llama 2, Llama 3, and Llama 3.1 (7B, 8B, and 70B versions). Refer to the Llama-3.1 Jupyter notebook example.
- Mistral and Mixtral models.
- Different tensor parallelism configurations, such as single HPU, 2x, and 8x HPU.
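As an illustration of the tensor parallelism configurations above, the sketch below runs offline batched inference through vLLM's `LLM` class and sets `tensor_parallel_size` to the number of HPUs. The helper names `engine_kwargs` and `run_offline` are hypothetical, and the sampling settings are arbitrary; whether a given model and HPU count is validated should be checked in the Validated Models document.

```python
def engine_kwargs(model: str, num_hpus: int) -> dict:
    """Build keyword arguments for vllm.LLM for a given number of HPUs."""
    if num_hpus < 1:
        raise ValueError("num_hpus must be at least 1")
    return {"model": model, "tensor_parallel_size": num_hpus}


def run_offline(model: str, num_hpus: int, prompts: list[str]) -> list[str]:
    """Generate one completion per prompt (requires vLLM and the plugin)."""
    # Imported lazily so engine_kwargs can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs(model, num_hpus))
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

For example, `run_offline("meta-llama/Llama-3.1-8B-Instruct", 2, ["Hello, my name is"])` would request a 2x HPU configuration; the model identifier here is only an example.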
Features Support
Which key features does vLLM support on Intel® Gaudi®?
The list of supported features is available in the Supported Features document. It includes features such as:
- Offline Batched Inference
- OpenAI-Compatible Server
- Paged KV cache optimized for Intel® Gaudi® devices
- Speculative decoding (experimental)
- Tensor parallel inference
- FP8 models and KV Cache quantization and calibration with Intel® Neural Compressor (INC). For more details, see the Intel® Neural Compressor quantization and inference guide.
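To illustrate the OpenAI-Compatible Server feature, the sketch below sends a completion request using only the Python standard library. It assumes a vLLM server is already running and reachable at a base URL such as `http://localhost:8000` (vLLM's default port); the helper names are hypothetical, and the model name must match the one the server was started with.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 64
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.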
Performance Tuning
Which execution modes does the plugin support?
- PyTorch Eager mode (default)
- torch.compile (default)
- HPU Graphs (recommended for best performance)
- PyTorch Lazy mode
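The sketch below shows one way the four modes above might be told apart. It is an assumption, not something stated in this FAQ: it presumes the mode is selected by the `PT_HPU_LAZY_MODE` environment variable together with vLLM's `enforce_eager` option, and that an unset `PT_HPU_LAZY_MODE` behaves like `0`. Verify this mapping against the plugin documentation before relying on it.

```python
import os


def execution_mode(lazy_mode: bool, enforce_eager: bool) -> str:
    """Map the two switches to a mode name (assumed mapping, see the note above)."""
    if lazy_mode:
        return "PyTorch Lazy mode" if enforce_eager else "HPU Graphs"
    return "PyTorch Eager mode" if enforce_eager else "torch.compile"


def mode_from_environment(enforce_eager: bool, env=os.environ) -> str:
    """Read PT_HPU_LAZY_MODE from env; an unset variable is treated as "0" (assumption)."""
    lazy_mode = env.get("PT_HPU_LAZY_MODE", "0") == "1"
    return execution_mode(lazy_mode, enforce_eager)
```

For example, `mode_from_environment(enforce_eager=False)` reports which mode the current shell environment would select under this assumed mapping.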
How does the bucketing mechanism work in vLLM Hardware Plugin for Intel® Gaudi®?
The bucketing mechanism optimizes performance by grouping tensor shapes. This reduces the number of required graphs and minimizes compilations during server runtime. Buckets are determined by parameters for batch size and sequence length. For more information, see Bucketing Mechanism.
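The idea of grouping shapes can be sketched as follows. This is a simplified illustration, not the plugin's actual algorithm: it assumes each dimension has buckets defined by hypothetical `bucket_min`, `bucket_step`, and `bucket_max` values, and that an incoming batch size or sequence length is rounded up to the nearest bucket.

```python
import bisect


def make_buckets(bucket_min: int, bucket_step: int, bucket_max: int) -> list[int]:
    """Return the minimum, every multiple of the step in range, and the maximum."""
    buckets = {bucket_min, bucket_max}
    buckets.update(range(bucket_step, bucket_max + 1, bucket_step))
    return sorted(b for b in buckets if bucket_min <= b <= bucket_max)


def find_bucket(value: int, buckets: list[int]):
    """Return the smallest bucket >= value, or None if value exceeds every bucket."""
    index = bisect.bisect_left(buckets, value)
    return buckets[index] if index < len(buckets) else None


def padded_shape(batch_size: int, seq_len: int, bs_buckets: list[int], seq_buckets: list[int]):
    """Round a (batch size, sequence length) pair up to its bucket in each dimension."""
    return find_bucket(batch_size, bs_buckets), find_bucket(seq_len, seq_buckets)
```

With batch-size buckets from `make_buckets(1, 32, 64)` and sequence buckets from `make_buckets(128, 128, 512)`, every request maps to one of only twelve shapes, which is why far fewer graphs need to be compiled than with arbitrary shapes.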
What should I do if a request exceeds the maximum bucket size?
Consider increasing the upper bucket boundaries using environment variables to avoid potential latency increases due to graph compilation.
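A minimal sketch of raising upper bucket boundaries from Python, before the engine is created, is shown below. The helper `raise_bucket_max` is hypothetical, and the variable names in the usage note are assumptions; this FAQ does not name the variables, so confirm them in the Bucketing Mechanism document.

```python
import os


def raise_bucket_max(overrides: dict[str, int], env=os.environ) -> dict[str, str]:
    """Raise each variable to the requested value; never lower an existing setting."""
    applied = {}
    for name, new_max in overrides.items():
        current = env.get(name)
        if current is None or int(current) < new_max:
            env[name] = str(new_max)
        applied[name] = env[name]
    return applied
```

For example, `raise_bucket_max({"VLLM_PROMPT_SEQ_BUCKET_MAX": 4096, "VLLM_DECODE_BS_BUCKET_MAX": 128})` would raise the assumed prompt sequence-length and decode batch-size limits. Larger boundaries generally mean more buckets to warm up, so raise them only as far as the expected requests require.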