vLLM Engine

Engines

- LLMEngine
- AsyncLLMEngine

By the vLLM Team

© Copyright 2024, vLLM Team.