model_free_ptq
model_free_ptq is a PTQ entrypoint for data-free quantization schemes that operates directly on safetensors checkpoint files without requiring a Hugging Face model definition or loading the model through transformers.
When to Use
Use model_free_ptq when:
- Your quantization scheme is data-free (e.g. FP8 dynamic, FP8 block, NVFP4A16, MXFP4/MXFP8)
- The model does not have a Hugging Face transformers definition (e.g. a newly released model not yet in transformers)
oneshotfails for your model
For schemes that require calibration data (GPTQ, AWQ, SmoothQuant, static activation quantization), use oneshot instead.
Basic Usage
from llmcompressor import model_free_ptq
model_free_ptq(
model_stub="meta-llama/Meta-Llama-3-8B-Instruct",
save_directory="Meta-Llama-3-8B-Instruct-FP8-BLOCK",
scheme="FP8_BLOCK",
ignore=["lm_head"],
device="cuda:0",
)
How It Works
model_free_ptq processes each .safetensors file in the checkpoint independently, without ever loading the full model into memory as a torch.nn.Module. For each file:
- Validate — check that all quantizable tensors can be quantized with the given scheme
- Initialize — create a minimal
torch.nn.Linearmodule for each weight tensor - Calibrate — compute scale and zero point directly from the weight tensor (data-free)
- Compress — call
compress_modulefromcompressed-tensorsto pack/quantize the weights - Save — write the compressed tensors back to disk
After all files are processed, the safetensors index and model config are updated with the quantization metadata.
Multiple files can be processed in parallel using the max_workers argument.
Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
model_stub |
str \| PathLike |
— | HuggingFace model ID or path to a local directory containing safetensors files |
save_directory |
str \| PathLike |
— | Directory to save the quantized checkpoint |
scheme |
QuantizationScheme \| str |
— | Quantization scheme to apply. Can be a preset string (e.g. "FP8_BLOCK", "NVFP4A16") or a QuantizationScheme object |
ignore |
Iterable[str] |
() |
Module names or regex patterns to skip. Modules ending in "norm" are always ignored automatically |
max_workers |
int |
1 |
Number of parallel worker threads for processing safetensors files |
device |
str \| torch.device \| None |
None |
Device to use for quantization. Defaults to GPU if available, otherwise CPU |
converter |
Converter \| None |
None |
Optional compressed-tensors converter to apply before quantization, e.g. to convert modelopt-format checkpoints to compressed-tensors format |
Standard Flow (Non-Microscale Schemes)
For schemes without a global scale (e.g. FP8_BLOCK, FP8_DYNAMIC), call model_free_ptq directly:
from llmcompressor import model_free_ptq
model_free_ptq(
model_stub="unsloth/Kimi-K2-Thinking-BF16",
save_directory="Kimi-K2-Thinking-FP8-BLOCK",
scheme="FP8_BLOCK",
ignore=[
"re:.*gate$",
"lm_head",
"re:.*kv_a_proj_with_mqa$",
"re:.*q_a_proj$",
"model.embed_tokens",
],
max_workers=15,
device="cuda:0",
)
Microscale Flow (NVFP4)
NVFP4 requires a global scale that is fused across related weight groups (e.g. qkv projections, gate/up projections). For this fusion to work correctly, the weights of each fused group must reside in the same safetensors shard.
Standard model checkpoints often split these weights across different shards. To fix this, run the reindex_fused_weights CLI tool first to reorganize the checkpoint:
llmcompressor.reindex_fused_weights \
unsloth/Kimi-K2-Thinking-BF16 \
Kimi-K2-Thinking-BF16-reindexed \
--num_workers=10
Then run model_free_ptq on the reindexed checkpoint:
from llmcompressor import model_free_ptq
model_free_ptq(
model_stub="Kimi-K2-Thinking-BF16-reindexed",
save_directory="Kimi-K2-Thinking-NVFP4A16",
scheme="NVFP4A16",
ignore=[
"re:.*gate$",
"lm_head",
"re:.*kv_a_proj_with_mqa$",
"re:.*q_a_proj$",
"model.embed_tokens",
],
max_workers=15,
device="cuda:0",
)
Note
Reindexing is only required for NVFP4, which uses a global scale. MXFP4 does not use a global scale and does not require reindexing.
Ignoring Layers
The ignore argument accepts module name strings or regex patterns prefixed with re:. Modules whose names end in "norm" are automatically ignored regardless of the ignore list.
ignore=[
"lm_head", # exact name match
"re:.*gate$", # regex: any module ending in "gate"
"model.embed_tokens", # exact name match
]
Supported Schemes
model_free_ptq supports any data-free weight quantization scheme. Common presets:
| Scheme | Description |
|---|---|
FP8_DYNAMIC |
FP8 weights with dynamic per-token activation quantization |
FP8_BLOCK |
FP8 weights with block-wise scaling (Blackwell-optimized) |
NVFP4A16 |
NVFP4 weight-only quantization with FP8 group scales and a global scale |
MXFP4/MXFP8 |
MXFP4 or MXFP8 quantization with MX-format microscales |
Note: Many of these schemes, such as NVFP4 and MXFP4 may potentially lead to improved recovery when applied with a calibration algorithm that requires data, such as GPTQ. Consider comparing performance using oneshot. For the full list of supported schemes and formats, see Compression Schemes.