vLLM

Quantization

Quantization trades model precision for a smaller memory footprint, allowing large models to run on a wider range of devices.
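To make the trade-off concrete, the sketch below estimates the memory needed for model weights alone at several bit widths. It is a back-of-the-envelope illustration, not part of vLLM: the helper `weight_memory_gib` and the 7-billion-parameter model size are hypothetical, and the estimate ignores quantization metadata (scales, zero points), activations, and the KV cache.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GiB.

    Hypothetical helper for illustration; ignores per-group scales,
    zero points, activations, and KV cache.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Assumed model size for the example: 7 billion parameters.
NUM_PARAMS = 7e9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions, the weights take roughly 13 GiB at 16 bits, about 6.5 GiB at 8 bits, and about 3.3 GiB at 4 bits. Real quantized checkpoints are somewhat larger because of the metadata this estimate leaves out.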

Contents:

- Supported Hardware
- AutoAWQ
- BitsAndBytes
- BitBLAS
- GGUF
- GPTQModel
- INT4 W4A16
- INT8 W8A8
- FP8 W8A8
- NVIDIA TensorRT Model Optimizer
- AMD QUARK
- Quantized KV Cache
- TorchAO
