vLLM Engine

Engines

  • LLMEngine
  • AsyncLLMEngine

By the vLLM Team

© Copyright 2024, vLLM Team.