Overview

vLLM x Intel-Gaudi

The vLLM Hardware Plugin for Intel® Gaudi® is a community-driven integration layer that enables efficient, high-performance large language model (LLM) inference on Intel® Gaudi® AI accelerators.

The plugin connects the vLLM serving engine to Intel® Gaudi® hardware and provides optimized inference for enterprise-scale LLM workloads. It is developed and maintained by the Intel® Gaudi® team and follows the Hardware Pluggable RFC and the vLLM Plugin Architecture RFC for modular integration.

Advantages

The vLLM Hardware Plugin for Intel® Gaudi® offers the following key benefits:

  • Optimization for Intel® Gaudi®: Supports advanced features such as the bucketing mechanism, 8-bit floating-point (FP8) quantization, and custom graph caching for fast warm-up and efficient memory use.
  • Scalability and efficiency: Designed to maximize throughput and minimize latency for large-scale deployments, making it ideal for production-grade LLM inference.
  • Community support: Actively maintained on GitHub through contributions from the Intel® Gaudi® team and the broader vLLM ecosystem.
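
The bucketing mechanism is named above but not described on this page. As a generic illustration only, and assuming that bucketing means rounding input lengths up to a small fixed set of sizes so that compiled graphs can be reused, the idea can be sketched as follows. The bucket boundaries and the function names `pick_bucket` and `pad_to_bucket` are hypothetical and are not part of the plugin's API.

```python
from bisect import bisect_left

# Hypothetical bucket boundaries, chosen only for illustration.
SEQ_LEN_BUCKETS = [128, 256, 512, 1024, 2048]


def pick_bucket(length, buckets=SEQ_LEN_BUCKETS):
    """Return the smallest bucket size that is >= length."""
    if length <= 0:
        raise ValueError("length must be positive")
    idx = bisect_left(buckets, length)
    if idx == len(buckets):
        raise ValueError(f"length {length} exceeds the largest bucket")
    return buckets[idx]


def pad_to_bucket(token_ids, pad_id=0, buckets=SEQ_LEN_BUCKETS):
    """Pad a token list up to its bucket size so shapes repeat across requests."""
    target = pick_bucket(len(token_ids), buckets)
    return token_ids + [pad_id] * (target - len(token_ids))
```

With a scheme like this, many distinct input lengths map to a handful of shapes, which trades some padding overhead for fewer distinct graphs to prepare during warm-up.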

Getting Started

To get started with the vLLM Hardware Plugin for Intel® Gaudi®:

  • Set up your environment using the quickstart guide and use the plugin locally or in your containerized environment.
  • Run inference using supported models, such as Llama 3.1, Mixtral, or DeepSeek.
  • Explore advanced features, such as FP8 quantization, recipe caching, and expert parallelism.
  • Join the community by contributing to the vLLM-Gaudi GitHub repository.
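
The "run inference" step can be sketched with vLLM's offline Python API. This is a minimal sketch, assuming the plugin is already installed and picked up by vLLM in your environment; the model ID and the helper names `build_request` and `run_inference` are assumptions made for illustration, not taken from the plugin's documentation.

```python
# Assumed model ID for illustration; substitute any model the plugin supports.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"


def build_request():
    """Return example prompts and sampling settings (hypothetical helper)."""
    prompts = [
        "Explain what an AI accelerator is in one sentence.",
        "List two benefits of FP8 quantization.",
    ]
    sampling = {"temperature": 0.0, "max_tokens": 32}
    return prompts, sampling


def run_inference(model_id=MODEL_ID):
    """Generate completions; requires vLLM and a supported device."""
    # Imported lazily so build_request() works without vLLM installed.
    from vllm import LLM, SamplingParams

    prompts, sampling = build_request()
    llm = LLM(model=model_id)
    outputs = llm.generate(prompts, SamplingParams(**sampling))
    return [(out.prompt, out.outputs[0].text) for out in outputs]
```

On a machine with the plugin set up, calling `run_inference()` returns a list of prompt and completion pairs. Refer to the quickstart guide for the environment setup this sketch takes for granted.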

Reference

For more information, see: