Examples

Scripts

API Client
AQLM Example
CPU Offload
GGUF Inference
Gradio OpenAI Chatbot Webserver
Gradio Webserver
LLM Engine Example
LoRA with Quantization Inference
MultiLoRA Inference
Offline Inference
Offline Inference Arctic
Offline Inference Audio Language
Offline Inference Chat
Offline Inference Distributed
Offline Inference Embedding
Offline Inference Encoder Decoder
Offline Inference MLPSpeculator
Offline Inference Neuron
Offline Inference TPU
Offline Inference Vision Language
Offline Inference With Prefix
OpenAI Audio API Client
OpenAI Chat Completion Client
OpenAI Completion Client
OpenAI Embedding Client
OpenAI Vision API Client
Save Sharded State
Tensorize vLLM Model

By the vLLM Team

© Copyright 2024, vLLM Team.