Welcome to vLLM!

Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput

  • Efficient management of attention key and value memory with PagedAttention

  • Continuous batching of incoming requests

  • Fast model execution with CUDA/HIP graph

  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8 (a configuration sketch follows this list)

  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer

  • Speculative decoding

  • Chunked prefill
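As an illustration of how such features are switched on, the minimal sketch below constructs an offline engine with quantization and chunked prefill enabled. It is a sketch only, not a recommended configuration: the model name is illustrative, and the `quantization` and `enable_chunked_prefill` engine arguments are assumed to be available in your vLLM release.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: load an AWQ-quantized Hugging Face checkpoint and enable
# chunked prefill. The model name and argument values are illustrative
# assumptions; check the engine arguments documentation for your version.
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",                    # GPTQ, INT4/INT8, and FP8 use other values
    enable_chunked_prefill=True,           # split long prefills across scheduler steps
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.8, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```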

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models

  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more

  • Tensor parallelism and pipeline parallelism support for distributed inference

  • Streaming outputs

  • OpenAI-compatible API server (a client sketch follows this list)

  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators

  • Prefix caching support

  • Multi-LoRA support
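To show the ease-of-use side in practice, the sketch below queries the OpenAI-compatible API server with the standard `openai` client. It assumes a server has already been started with `vllm serve` on the default port 8000; the model name is illustrative.

```python
# Start the OpenAI-compatible server in a shell first (model name is illustrative):
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
#
# Then point the standard OpenAI Python client at the local endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default serving address
    api_key="EMPTY",                      # no key is required by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello from vLLM."}],
)
print(response.choices[0].message.content)
```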

For more information, check out the following:

Documentation

  • Automatic Prefix Caching

  • Performance
