Welcome to vLLM!

Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput

  • Efficient management of attention key and value memory with PagedAttention

  • Continuous batching of incoming requests

  • Optimized CUDA kernels
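As a quick illustration of this throughput-oriented design, here is a minimal sketch of offline batched inference with vLLM's Python API; the model name (facebook/opt-125m) and the sampling settings are illustrative choices, not requirements.

```python
# Minimal sketch: offline batched inference with vLLM's Python API.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# vLLM batches these prompts together and manages the attention KV cache
# with PagedAttention under the hood.
llm = LLM(model="facebook/opt-125m")  # example model; any supported model works
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")
```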

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models

  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more

  • Tensor parallelism support for distributed inference

  • Streaming outputs

  • OpenAI-compatible API server
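Because the server speaks the OpenAI API, it can be queried with any OpenAI-compatible client. The sketch below assumes a vLLM server is already running locally on the default port 8000; the launch command shown in the comment, the model name, and the placeholder API key are assumptions that may vary between vLLM versions and deployments.

```python
# Query a running vLLM OpenAI-compatible server with the official OpenAI client.
# Assumes the server was started with something like (exact flags vary by version):
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default port is 8000
    api_key="EMPTY",                      # no real key needed unless the server requires one
)

# Stream tokens as they are generated (demonstrates streaming outputs).
stream = client.completions.create(
    model="facebook/opt-125m",
    prompt="San Francisco is a",
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()
```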

For more information, check out the documentation sections below.

Documentation

  • Getting Started

  • Quantization