Welcome to vLLM!#

Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput

  • Efficient management of attention key and value memory with PagedAttention

  • Continuous batching of incoming requests

  • Fast model execution with CUDA/HIP graph

  • Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache

  • Optimized CUDA kernels
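
The PagedAttention idea above can be sketched in a few lines: instead of reserving one contiguous KV-cache region per sequence, logical token positions map through a per-sequence block table to small fixed-size physical blocks allocated on demand. This is an illustrative toy, not vLLM's actual implementation (which manages GPU memory in custom CUDA kernels); the class and block size here are invented for the example.

```python
# Illustrative sketch (not vLLM's code): paged KV-cache bookkeeping.
# A block table maps each sequence's logical token positions to
# fixed-size physical blocks, so KV memory need not be contiguous
# and finished sequences return their blocks to a shared pool.

BLOCK_SIZE = 4  # tokens per physical block (hypothetical value)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> [physical block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one more token of a sequence."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:                     # last block full: allocate a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Physical (block, offset) location of logical token position `pos`."""
        return self.block_tables[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(6):                 # store KV entries for 6 tokens of sequence 0
    cache.append_token(0)
print(cache.block_tables[0])       # → [7, 6]: two blocks, allocated on demand
print(cache.slot(0, 5))            # → (6, 1): token 5 is in the second block
```

Because blocks are small and allocated lazily, memory waste is bounded by one partial block per sequence, which is what makes high-occupancy batching possible.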

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models

  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more

  • Tensor parallelism support for distributed inference

  • Streaming outputs

  • OpenAI-compatible API server

  • Support for NVIDIA GPUs and AMD GPUs

  • (Experimental) Prefix caching support

  • (Experimental) Multi-LoRA support
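
The tensor-parallelism bullet above can be illustrated with a minimal sketch: a linear layer's weight matrix is split column-wise across workers, each worker computes its slice of the output locally, and concatenating the slices reproduces the full result. This toy uses plain Python lists in place of GPU tensors; the helper names are invented for the example and are not vLLM APIs.

```python
# Illustrative sketch (not vLLM's implementation): column-parallel
# tensor parallelism for a linear layer y = x @ W. Each "worker" holds
# a vertical slice of W; concatenating per-worker outputs recovers the
# full result, which is how one layer is distributed across GPUs.

def matmul(x, W):
    """Plain Python matrix multiply: x is (m x k), W is (k x n)."""
    return [[sum(x[i][t] * W[t][j] for t in range(len(W)))
             for j in range(len(W[0]))] for i in range(len(x))]

def split_columns(W, parts):
    """Slice W column-wise into `parts` shards, one per worker."""
    n = len(W[0]) // parts
    return [[row[p * n:(p + 1) * n] for row in W] for p in range(parts)]

x = [[1.0, 2.0]]                          # one input row, hidden size 2
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]                # full weight, output size 4

shards = split_columns(W, parts=2)        # worker 0 and worker 1 weights
partials = [matmul(x, w) for w in shards]  # each worker computes locally
y = [r0 + r1 for r0, r1 in zip(*partials)]  # concatenate output slices

print(y)              # → [[11.0, 14.0, 17.0, 20.0]]
print(matmul(x, W))   # → same result as the unsplit layer
```

In a real deployment the shards live on different GPUs and the concatenation is a collective communication step, but the arithmetic decomposition is the same.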

For more information, check out the following:

Documentation#

Offline Inference
