vLLM Engine

Engines

  • LLMEngine
  • AsyncLLMEngine

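As context for the two classes listed above, here is a minimal offline sketch of the synchronous engine (the model name and request values are illustrative): an LLMEngine is built from EngineArgs, requests are queued with add_request(), and the scheduler is driven by calling step() in a loop until all requests finish.

    from vllm import EngineArgs, LLMEngine, SamplingParams

    # Build the engine from standard engine arguments (model choice is illustrative).
    engine_args = EngineArgs(model="facebook/opt-125m")
    engine = LLMEngine.from_engine_args(engine_args)

    # Queue one request, then drive the scheduler until it completes.
    sampling_params = SamplingParams(temperature=0.8, max_tokens=32)
    engine.add_request("request-0", "Hello, my name is", sampling_params)

    while engine.has_unfinished_requests():
        for request_output in engine.step():
            if request_output.finished:
                print(request_output.outputs[0].text)

AsyncLLMEngine wraps the same machinery behind an asynchronous generate() interface that streams RequestOutput objects as they are produced; it is the engine used by the OpenAI-compatible server for online serving.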
