LLM Compressor Docs
Disk Offloading
For more information on disk offloading, see Model Loading.