Welcome to vLLM!

Easy, fast, and cheap LLM serving for everyone
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graphs
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill (see the configuration sketch after this list)
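Several of the features above can be toggled when constructing an engine. The following is a minimal sketch, not a recommended configuration: the model name and option values are placeholders, and each option's availability depends on the vLLM version and hardware backend.

```python
from vllm import LLM

# Illustrative engine configuration; the model name and option values are
# placeholders, and availability depends on the vLLM version and backend.
llm = LLM(
    model="facebook/opt-125m",     # any supported HuggingFace model
    enable_chunked_prefill=True,   # chunked prefill
    enable_prefix_caching=True,    # automatic prefix caching
    enforce_eager=False,           # allow CUDA/HIP graph capture (the default)
)
# Tensor parallelism and speculative decoding are configured through similar
# engine arguments, e.g. tensor_parallel_size=... (names may vary by version).
```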
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators
- Prefix caching support
- Multi-LoRA support (see the usage sketch after this list)
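As a quick illustration of the Python API, here is a minimal offline-inference sketch; the model name is a placeholder and any supported HuggingFace model can be used.

```python
from vllm import LLM, SamplingParams

# Offline batched inference with a HuggingFace model (placeholder model name).
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    # Each request output carries one or more completions; print the first.
    print(out.outputs[0].text)
```

The OpenAI-compatible server is started from the command line, for example with `vllm serve <model>` (or `python -m vllm.entrypoints.openai.api_server --model <model>` on older releases); see the OpenAI Compatible Server page below.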
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency, by Cade Daniel et al.
Documentation
Getting Started
- Installation
- Installation with ROCm
- Installation with OpenVINO
- Installation with CPU
- Installation with Intel® Gaudi® AI Accelerators
  - Requirements and Installation
  - Supported Features
  - Unsupported Features
  - Supported Configurations
  - Performance Tuning
  - Troubleshooting: Tweaking HPU Graphs
- Installation with Neuron
- Installation with TPU
- Installation with XPU
- Quickstart
- Debugging Tips
- Examples
Serving
- OpenAI Compatible Server
- Deploying with Docker
- Deploying with Kubernetes
- Deploying with Nginx Loadbalancer
- Distributed Inference and Serving
- Production Metrics
- Environment Variables
- Usage Stats Collection
- Integrations
- Loading Models with CoreWeave’s Tensorizer
- Compatibility Matrix
- Frequently Asked Questions
Models
Quantization
Automatic Prefix Caching
Performance
Community
API Documentation
Design