Welcome to vLLM!
Easy, fast, and cheap LLM serving for everyone
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graphs
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
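Several of these performance features are exposed directly as constructor options on the high-level entry point. A minimal sketch, assuming an AWQ-quantized checkpoint is available (the model id below is only an example):

```python
from vllm import LLM

# Quantized weights (AWQ here) plus CUDA/HIP graph execution, which is the
# default and is only turned off by passing enforce_eager=True.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example id, assumed available
    quantization="awq",
    enforce_eager=False,          # keep CUDA/HIP graph capture enabled
    gpu_memory_utilization=0.90,  # fraction of GPU memory for weights + KV cache blocks
)
```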
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs and AMD GPUs
- (Experimental) Prefix caching support
- (Experimental) Multi-LoRA support
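To make the HuggingFace integration and decoding options above concrete, the usual minimal offline-inference sketch looks like this (the model id is only an example):

```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Any supported HuggingFace causal LM id can be passed here.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The OpenAI-compatible server runs as a separate process, launched with e.g. `python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m`, and then serves the standard Completions and Chat Completions endpoints.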
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency, by Cade Daniel et al.
Documentation
- Sampling Parameters
- Offline Inference
- vLLM Engine
- LLMEngine (a usage sketch appears at the end of this page)
  - LLMEngine
    - LLMEngine.DO_VALIDATE_OUTPUT
    - LLMEngine.abort_request()
    - LLMEngine.add_request()
    - LLMEngine.do_log_stats()
    - LLMEngine.from_engine_args()
    - LLMEngine.get_decoding_config()
    - LLMEngine.get_model_config()
    - LLMEngine.get_num_unfinished_requests()
    - LLMEngine.has_unfinished_requests()
    - LLMEngine.has_unfinished_requests_for_virtual_engine()
    - LLMEngine.step()
- AsyncLLMEngine
- LLMEngine
- vLLM Paged Attention
- Input Processing
- Multi-Modality
- Guides
- Module Contents
- Registry
  - MULTIMODAL_REGISTRY
  - MultiModalRegistry
    - MultiModalRegistry.create_input_mapper()
    - MultiModalRegistry.get_max_multimodal_tokens()
    - MultiModalRegistry.map_input()
    - MultiModalRegistry.register_image_input_mapper()
    - MultiModalRegistry.register_input_mapper()
    - MultiModalRegistry.register_max_image_tokens()
    - MultiModalRegistry.register_max_multimodal_tokens()
- Base Classes
- Image Classes
- Registry
- Dockerfile
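The LLMEngine entries listed above cover the synchronous engine's request lifecycle (add_request(), step(), abort_request(), and so on). As orientation, a minimal sketch of driving the engine directly, with an example model id and arbitrary request ids:

```python
from vllm import EngineArgs, LLMEngine, SamplingParams

# Build the engine from EngineArgs (the same knobs as the LLM class / CLI).
engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))

# Queue a few requests; each needs a unique request id.
for i, prompt in enumerate(["Hello, my name is", "The future of AI is"]):
    engine.add_request(str(i), prompt, SamplingParams(max_tokens=16))

# step() performs one scheduling + execution iteration (continuous batching),
# returning outputs for the requests processed in that iteration.
while engine.has_unfinished_requests():
    for request_output in engine.step():
        if request_output.finished:
            print(request_output.outputs[0].text)
```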