Integrations

  • Deploying and scaling up with SkyPilot
  • Deploying with KServe
  • Deploying with KubeAI
  • Deploying with NVIDIA Triton
  • Deploying with BentoML
  • Deploying with Cerebrium
  • Deploying with LWS
  • Deploying with dstack
  • Serving with Langchain
  • Serving with llama_index
  • Serving with Llama Stack

By the vLLM Team

© Copyright 2024, vLLM Team.