    Quantization

    Quantization trades model precision for a smaller memory footprint, allowing large models to run on a wider range of devices.
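To make the memory trade-off concrete, the sketch below estimates how much memory the weights alone would need at several bit widths. The 7-billion-parameter count and the helper name `weight_memory_gib` are hypothetical and chosen only for illustration. The estimate ignores activations, the KV cache, and the per-group scales and zero-points that real quantized formats store, so actual usage will be somewhat higher.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate memory needed for model weights only, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical model size, used only for illustration.
NUM_PARAMS = 7_000_000_000

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    size = weight_memory_gib(NUM_PARAMS, bits)
    print(f"{label:>6} weights: {size:.2f} GiB")
```

Halving the bits per weight halves the weight footprint, which is why 8-bit and 4-bit schemes can fit a model on devices that cannot hold its 16-bit weights.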

    Contents:

    • Supported Hardware
    • AutoAWQ
    • BitsAndBytes
    • BitBLAS
    • GGUF
    • GPTQModel
    • INT4 W4A16
    • INT8 W8A8
    • FP8 W8A8
    • NVIDIA TensorRT Model Optimizer
    • AMD QUARK
    • Quantized KV Cache
    • TorchAO