    Quantization

    Quantization trades off model precision for a smaller memory footprint, allowing large models to run on a wider range of devices.
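
    For example, here is a minimal offline-inference sketch that loads a pre-quantized checkpoint with vLLM's LLM class. The checkpoint name is illustrative; any AWQ-quantized model from the Hugging Face Hub works, and other values of the quantization argument select the other methods listed below.

    ```python
    from vllm import LLM, SamplingParams

    # quantization="awq" tells vLLM to load the checkpoint's AWQ-quantized
    # weights; the model name here is illustrative.
    llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

    params = SamplingParams(temperature=0.8, max_tokens=32)
    outputs = llm.generate(["Hello, my name is"], params)
    print(outputs[0].outputs[0].text)
    ```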

    Contents:

    • Supported Hardware
    • AutoAWQ
    • BitsAndBytes
    • BitBLAS
    • GGUF
    • GPTQModel
    • INT4 W4A16
    • INT8 W8A8
    • FP8 W8A8
    • NVIDIA TensorRT Model Optimizer
    • AMD Quark
    • Quantized KV Cache
    • TorchAO