vLLM Engine

Engines

LLMEngine
- LLMEngine
AsyncLLMEngine
- AsyncLLMEngine

By the vLLM Team

© Copyright 2024, vLLM Team.