Overview


The vLLM Hardware Plugin for Intel® Gaudi® is a community-driven integration layer that enables efficient, high-performance large language model (LLM) inference on Intel® Gaudi® AI accelerators.

The plugin connects the vLLM serving engine to Intel® Gaudi® hardware and targets enterprise-scale LLM workloads. It is developed and maintained by the Intel® Gaudi® team and follows the hardware pluggable RFC and the vLLM plugin architecture RFC for modular integration.

Advantages

The vLLM Hardware Plugin for Intel® Gaudi® offers the following key benefits:

Optimization for Intel® Gaudi®: Supports advanced features, such as the bucketing mechanism, 8-bit floating point (FP8) quantization, and custom graph caching for fast warm-up and efficient memory use.
Scalability and efficiency: Designed to maximize throughput and minimize latency for large-scale deployments, making it ideal for production-grade LLM inference.
Community support: Actively maintained on GitHub, with contributions from the Intel® Gaudi® team and the broader vLLM ecosystem.
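
To make the bucketing idea concrete, the sketch below rounds a request's sequence length up to the nearest preconfigured boundary, so that only a small, fixed set of shapes ever needs a compiled graph. It is a minimal illustration and not the plugin's implementation: the helper names `make_buckets` and `round_up_to_bucket`, and the boundaries (128 to 1024 in steps of 128), are assumptions chosen for the example.

```python
from bisect import bisect_left


def make_buckets(start: int, step: int, limit: int) -> list[int]:
    """Return hypothetical bucket boundaries: start, start + step, ... up to limit."""
    return list(range(start, limit + 1, step))


def round_up_to_bucket(length: int, buckets: list[int]) -> int:
    """Return the smallest bucket boundary that is >= length."""
    idx = bisect_left(buckets, length)
    if idx == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[idx]


# Illustrative boundaries only; real values depend on the deployment.
buckets = make_buckets(128, 128, 1024)

for seq_len in (5, 128, 129, 1000):
    padded = round_up_to_bucket(seq_len, buckets)
    print(f"sequence length {seq_len} -> padded to {padded}")
```

With these boundaries, every input between 129 and 256 tokens shares one padded shape, which trades a small amount of padding for far fewer distinct shapes to warm up.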

Getting Started

To get started with the vLLM Hardware Plugin for Intel® Gaudi®:

Set up your environment using the quickstart guide, then run the plugin locally or in a container.
Run inference using supported models, such as Llama 3.1, Mixtral, or DeepSeek.
Explore advanced features, such as FP8 quantization, recipe caching, and expert parallelism.
Join the community by contributing to the vLLM-Gaudi GitHub repository.
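
As a rough illustration of the inference step, the sketch below uses vLLM's standard offline API (`LLM`, `SamplingParams`, and `generate`). It assumes the plugin is already installed on a Gaudi host and that vLLM picks it up without plugin-specific code. The helpers `build_prompts` and `generate_texts`, the prompt template, and the sampling values are hypothetical choices for this example.

```python
def build_prompts(questions: list[str]) -> list[str]:
    """Wrap each question in a simple template (hypothetical, for illustration)."""
    return [f"Question: {q}\nAnswer:" for q in questions]


def generate_texts(model_name: str, prompts: list[str], max_tokens: int = 64) -> list[str]:
    """Run offline inference with vLLM and return one completion per prompt.

    vLLM is imported here so that the module can be loaded on machines
    that do not have vLLM installed.
    """
    from vllm import LLM, SamplingParams

    # Sampling values are illustrative, not tuned recommendations.
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=max_tokens)
    llm = LLM(model=model_name)
    outputs = llm.generate(prompts, params)
    # Each result carries one or more candidates; keep the first one.
    return [result.outputs[0].text for result in outputs]
```

On a configured host, a call such as `generate_texts("meta-llama/Llama-3.1-8B-Instruct", build_prompts(["What is an AI accelerator?"]))` would load the model and return the generated text. The model identifier is an assumption; substitute any model the plugin supports.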

Reference

For more information, see: