Frequently Asked Questions

Below are the most frequently asked questions about using LLM Compressor. If you do not see your question here, please file an issue: LLM Compressor Issues.

1. Why doesn't my model run any faster after I compress it?

This usually happens when the model is loaded through transformers rather than through an inference server that supports the compressed-tensors format. Loading the model through transformers provides no inference speedup, because forward passes run with the model decompressed; that path has no support for optimized compressed inference at runtime. Instead, run the model in vLLM or another inference server that supports optimized inference for quantized models.
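As a minimal sketch of the recommended path, assuming vLLM is installed and `model_path` points to a checkpoint saved in the compressed-tensors format, offline generation might look like the following. The helper name `generate_with_vllm` and the example path are hypothetical; `LLM`, `SamplingParams`, and `generate` are vLLM's offline inference entry points.

```python
def generate_with_vllm(model_path: str, prompts: list[str], max_tokens: int = 64) -> list[str]:
    """Run prompts through a compressed checkpoint with vLLM (hypothetical helper)."""
    # Imported lazily so this file can be imported without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    params = SamplingParams(max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    # Each request output holds one or more completions; keep the first.
    return [out.outputs[0].text for out in outputs]


# Example call (the path is a placeholder, not a real checkpoint):
# texts = generate_with_vllm("./my-compressed-model", ["Hello, my name is"])
```

Comparing latency from this call against the same prompts run through transformers is one way to check whether the compressed weights are actually being used by optimized kernels.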

2. Are models compressed using LLM Compressor supported with SGLang?

There is minimal support for compressed-tensors models in SGLang, but it is neither maintained nor tested by our team. Much of the integration relies on vLLM. For the most up-to-date and tested integration, vLLM is recommended.

3. How do I choose the right quantization scheme?

The right scheme depends on your available hardware and your inference requirements. Refer to the Compression Schemes Guide. For a step-by-step guide, see our Compression Guide.

4. What are the memory requirements for compression?

Refer to Memory Requirements for LLM Compressor.

5. Which model layers should be quantized?

Typically, all linear layers are quantized except the lm_head layer. The lm_head layer is the last layer of the model and is sensitive to quantization, so quantizing it can hurt the model's accuracy. For example, this code snippet shows how to ignore the lm_head layer.

Mixture of Experts (MoE) models are also sensitive to quantization because of their architecture, in particular components such as gate and routing layers. For example, this code snippet shows how to ignore the gates.
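The two exclusions above can be illustrated with a small, library-free sketch. It assumes an ignore list in which plain entries match a layer name exactly and entries prefixed with `re:` are treated as regular expressions, which mirrors the style of ignore lists such as `["lm_head", "re:.*mlp.gate$"]`. The helper `is_ignored` and the layer names are hypothetical, and this is not LLM Compressor's own matching code.

```python
import re


def is_ignored(name: str, ignore: list[str]) -> bool:
    """Hypothetical helper: return True if a layer name matches any ignore entry."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Entries prefixed with "re:" are matched as regular expressions.
            if re.match(entry[3:], name):
                return True
        elif name == entry:
            # Plain entries must match the layer name exactly.
            return True
    return False


# Hypothetical linear-layer names from a small MoE decoder.
layer_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.0.gate_proj",
    "lm_head",
]

# Skip the output head and the MoE router gates.
ignore = ["lm_head", "re:.*mlp.gate$"]

to_quantize = [n for n in layer_names if not is_ignored(n, ignore)]
print(to_quantize)
```

With these names, only the attention projection and the expert projection remain in `to_quantize`. The `$` anchor matters: without it, the gate pattern would also match the expert's `gate_proj` layer, which is an ordinary linear layer and is normally quantized.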

Multimodal models (e.g., vision-language models) pair a language model with another component that handles image, audio, or video input alongside text. In these cases, the non-textual component is typically excluded from quantization, as it generally has fewer parameters and is more sensitive.

For more information, see Quantizing Multimodal Audio Models and Quantizing Multimodal Vision-Language Models.

6. What environment should be used for installing LLM Compressor?

vLLM and LLM Compressor should be installed in separate environments, because their dependencies may conflict.
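One way to keep the two apart is to create a virtual environment for each tool. The sketch below uses only Python's standard `venv` module; the helper name `create_separate_envs` and the directory names are hypothetical, and any environment manager would serve the same purpose.

```python
import venv
from pathlib import Path


def create_separate_envs(root: Path, with_pip: bool = True) -> dict[str, Path]:
    """Create one virtual environment per tool under `root` (hypothetical layout)."""
    envs = {
        "llmcompressor": root / "llmcompressor-env",
        "vllm": root / "vllm-env",
    }
    for env_dir in envs.values():
        # Each environment gets its own site-packages, so dependency
        # versions pinned by one tool cannot affect the other.
        venv.create(env_dir, with_pip=with_pip)
    return envs


# Example call (the root directory is a placeholder):
# envs = create_separate_envs(Path.home() / "envs")
```

After the environments exist, activate each one in turn and install the matching package with pip (`llmcompressor` in one, `vllm` in the other). Compression then runs in the first environment and inference in the second.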

7. Does LLM Compressor have multi-GPU support?

Yes. LLM Compressor supports multi-GPU compression via Distributed Data Parallel (DDP), available since v0.10.0.

By default, LLM Compressor compresses large models through sequential onloading, whereby layers of the model are onloaded to a single GPU, optimized, then offloaded back to the CPU or disk. DDP parallelizes this process across multiple GPUs, significantly reducing compression time. The following modifiers support DDP: GPTQModifier, QuantizationModifier, and AutoRoundModifier. The following transforms support DDP: SmoothQuantModifier and AWQModifier. See the Big Models and Distributed Guide for usage details and benchmark results.

8. Where can I learn more about LLM Compressor?

There are multiple videos on YouTube:

- Optimizing vLLM Performance through Quantization | Ray Summit 2024
- vLLM Office Hours

Alternatively, join the vLLM Slack and ask any questions in #llm-compressor or #sig-quantization.