vLLM Engine

Engines

  • LLMEngine
  • AsyncLLMEngine

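As context for the two classes listed above, here is a minimal offline sketch of the synchronous engine (the model name and request values are illustrative): an LLMEngine is built from EngineArgs, requests are queued with add_request(), and the scheduler is driven by calling step() in a loop until all requests finish.

    from vllm import EngineArgs, LLMEngine, SamplingParams

    # Build the engine from standard engine arguments (model choice is illustrative).
    engine_args = EngineArgs(model="facebook/opt-125m")
    engine = LLMEngine.from_engine_args(engine_args)

    # Queue one request, then drive the scheduler until it completes.
    sampling_params = SamplingParams(temperature=0.8, max_tokens=32)
    engine.add_request("request-0", "Hello, my name is", sampling_params)

    while engine.has_unfinished_requests():
        for request_output in engine.step():
            if request_output.finished:
                print(request_output.outputs[0].text)

AsyncLLMEngine wraps the same machinery behind an asynchronous generate() interface that streams RequestOutput objects as they are produced; it is the engine used by the OpenAI-compatible server for online serving.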
