Welcome to vLLM!


Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graphs
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV cache
- Optimized CUDA kernels
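
As a rough illustration of the PagedAttention item above, the key and value cache can be thought of as a pool of fixed-size blocks, with each sequence holding a table of the blocks it owns. The sketch below is a toy model of that bookkeeping only, not vLLM's implementation: `ToyBlockAllocator`, `BLOCK_SIZE`, and the request names are hypothetical, and no tensors are stored.

```python
from typing import Dict, List, Tuple

# Toy model of block-based KV-cache bookkeeping; not vLLM's implementation.
BLOCK_SIZE = 4  # tokens per block (hypothetical value chosen for the example)


class ToyBlockAllocator:
    """Hands out fixed-size physical blocks from a shared free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))
        # Maps a sequence id to the physical block ids it owns, in order.
        self.block_tables: Dict[str, List[int]] = {}
        # Number of tokens stored so far for each sequence.
        self.lengths: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> Tuple[int, int]:
        """Reserve a slot for one more token; return (physical block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:
            # The last block is full (or the sequence has no block yet).
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = ToyBlockAllocator(num_blocks=8)
for _ in range(5):
    allocator.append_token("request-a")  # 5 tokens need 2 blocks
for _ in range(3):
    allocator.append_token("request-b")  # 3 tokens need 1 block
print(allocator.block_tables)

allocator.free("request-a")  # blocks become reusable by other requests
print(len(allocator.free_blocks))
```

Because memory is reserved one block at a time, a sequence wastes at most one partially filled block, and blocks released by a finished request can be reused immediately by any other request.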

vLLM is flexible and easy to use with:

- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA and AMD GPUs
- (Experimental) Prefix caching support
- (Experimental) Multi-LoRA support
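
To make the OpenAI-compatible API server item more concrete, the sketch below builds (but does not send) a completion request in the OpenAI completions format using only the Python standard library. The base URL `http://localhost:8000`, the model name `facebook/opt-125m`, and the helper `build_completion_request` are assumptions for illustration; substitute the address and model of the server you actually run.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build a POST request in the OpenAI completions format (hypothetical helper)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed server address and model name; adjust to your deployment.
request = build_completion_request(
    "http://localhost:8000", "facebook/opt-125m", "San Francisco is a"
)
print(request.full_url)

# To send it against a running server, something like:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```

Since the server speaks the OpenAI wire format, existing OpenAI client code can usually be pointed at it by changing only the base URL.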

For more information, check out the following:

- vLLM announcement blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- "How continuous batching enables 23x throughput in LLM inference while reducing p50 latency" by Cade Daniel et al.

Documentation

Getting Started

Serving

Models

Quantization

Developer Documentation

Sampling Params
- SamplingParams
vLLM Engine
- LLMEngine
- AsyncLLMEngine
vLLM Paged Attention
- Inputs
- Concepts
- Query
- Key
- QK
- Softmax
- Value
- LV
- Output
