Examples

Scripts

API Client
AQLM Example
Gradio OpenAI Chatbot Webserver
Gradio Webserver
LLaVA Example
LLM Engine Example
MultiLoRA Inference
Offline Inference
Offline Inference Distributed
Offline Inference Neuron
Offline Inference With Prefix
OpenAI Chat Completion Client
OpenAI Completion Client
Tensorize vLLM Model

By the vLLM Team

© Copyright 2024, vLLM Team.