
llmcompressor.entrypoints.model_free

Functions:

  • model_free_ptq

    Quantize a model without the need for a model definition. This function operates on a model stub or folder containing weights saved in safetensors files.

model_free_ptq

model_free_ptq(
    model_stub: str | PathLike,
    save_directory: str | PathLike,
    scheme: QuantizationScheme | str,
    ignore: Iterable[str] = tuple(),
    max_workers: int = 1,
    device: Optional[device | str] = None,
    converter: Converter | None = None,
)

Quantize a model without requiring a model definition. This function operates directly on a Hugging Face model stub or a local folder containing weights saved in safetensors files.

For microscale schemes (NVFP4, MXFP4), fused weight sets (q/k/v, gate/up) are handled correctly even when their tensors are split across shards. Each shard job receives a precomputed inverse_weight_map specifying exactly which tensors to load from which files. This enables true partial reads, with no runtime discovery and no redundant tensor reads.
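To make the idea of a precomputed per-shard read plan concrete, here is a minimal sketch. It is not the library's implementation: the fused group names in `FUSED_GROUPS`, the helper names, and the shape of the resulting plan are all assumptions chosen for illustration, and the real code may assign partner tensors differently to avoid repeated reads.

```python
from collections import defaultdict

# Hypothetical fused groups: projections assumed to share scale information.
FUSED_GROUPS = [("q_proj", "k_proj", "v_proj"), ("gate_proj", "up_proj")]


def invert_weight_map(weight_map: dict[str, str]) -> dict[str, list[str]]:
    """Turn {tensor name: shard file} into {shard file: [tensor names]}."""
    inverse = defaultdict(list)
    for tensor_name, shard_file in weight_map.items():
        inverse[shard_file].append(tensor_name)
    return {shard: sorted(names) for shard, names in inverse.items()}


def partner_names(tensor_name: str) -> list[str]:
    """Return the names of the other tensors in the same fused group, if any."""
    for group in FUSED_GROUPS:
        for member in group:
            token = f".{member}."
            if token in tensor_name:
                return [
                    tensor_name.replace(token, f".{other}.")
                    for other in group
                    if other != member
                ]
    return []


def build_load_plan(weight_map: dict[str, str]) -> dict[str, dict[str, list[str]]]:
    """For each shard job, list which tensors to read from which file."""
    plan = {}
    for shard, names in invert_weight_map(weight_map).items():
        reads = defaultdict(set)
        for name in names:
            reads[shard].add(name)
            # Fused partners may live in a different shard file.
            for partner in partner_names(name):
                if partner in weight_map:
                    reads[weight_map[partner]].add(partner)
        plan[shard] = {src: sorted(tensors) for src, tensors in reads.items()}
    return plan


# Toy index in which v_proj sits in a different shard than q_proj and k_proj.
weight_map = {
    "model.layers.0.self_attn.q_proj.weight": "model-00001.safetensors",
    "model.layers.0.self_attn.k_proj.weight": "model-00001.safetensors",
    "model.layers.0.self_attn.v_proj.weight": "model-00002.safetensors",
    "model.layers.0.mlp.gate_proj.weight": "model-00002.safetensors",
    "model.layers.0.mlp.up_proj.weight": "model-00002.safetensors",
}
plan = build_load_plan(weight_map)
```

Because the plan is computed up front from the index alone, a worker in this sketch never has to open a shard just to find out what it contains.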

Parameters:

  • model_stub (str | PathLike) –

    Hugging Face Hub model stub or path to local weights files

  • save_directory (str | PathLike) –

    directory to save quantized weights to

  • scheme (QuantizationScheme | str) –

    weight quantization scheme or preset scheme name

  • ignore (Iterable[str], default: tuple() ) –

    modules to exclude from quantization. Modules whose names end with "norm" are ignored automatically

  • max_workers (int, default: 1 ) –

    number of worker threads used to process files

  • device (Optional[device | str], default: None ) –

    GPU device used to accelerate quantization

  • converter (Converter | None, default: None ) –

    optional converter applied to the checkpoint to convert it to compressed-tensors format before model-free PTQ runs
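The parameters above might be combined as in the following sketch. The import path follows the module name shown on this page; the model stub, output folder, the `"FP8_BLOCK"` preset name, and the `"lm_head"` ignore entry are placeholders assumed for illustration and should be replaced with values valid for your checkpoint and installed version. The helper functions `build_ptq_kwargs` and `run_model_free_ptq` are hypothetical and not part of the library.

```python
def build_ptq_kwargs(
    model_stub,
    save_directory,
    scheme,
    extra_ignore=(),
    max_workers=1,
    device=None,
):
    """Collect arguments under the parameter names documented above."""
    return {
        "model_stub": model_stub,
        "save_directory": save_directory,
        "scheme": scheme,
        "ignore": list(extra_ignore),
        "max_workers": max_workers,
        "device": device,
    }


def run_model_free_ptq(**kwargs):
    """Call model_free_ptq; imported lazily so this sketch loads without the library."""
    from llmcompressor.entrypoints.model_free import model_free_ptq

    return model_free_ptq(**kwargs)


kwargs = build_ptq_kwargs(
    model_stub="org/model-name",            # hypothetical Hub stub or local path
    save_directory="model-name-quantized",  # hypothetical output folder
    scheme="FP8_BLOCK",                     # assumed preset scheme name
    extra_ignore=["lm_head"],               # hypothetical module to skip
    max_workers=4,
    device="cuda:0",
)

# Uncomment to run; requires llmcompressor and access to the checkpoint.
# run_model_free_ptq(**kwargs)
```

Modules ending with "norm" do not need to be listed in `ignore`, since the documentation states they are skipped automatically.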

Source code in src/llmcompressor/entrypoints/model_free/__init__.py
def model_free_ptq(
    model_stub: str | os.PathLike,
    save_directory: str | os.PathLike,
    scheme: QuantizationScheme | str,
    ignore: Iterable[str] = tuple(),
    max_workers: int = 1,
    device: Optional[torch.device | str] = None,
    converter: Converter | None = None,
):
    """
    Quantize a model without the need for a model definition. This function
    operates on a model stub or folder containing weights saved in safetensors
    files.

    For microscale schemes (NVFP4, MXFP4), fused weight sets (q/k/v, gate/up)
    are handled correctly even when split across shards. Each shard job receives
    a precomputed inverse_weight_map specifying exactly which tensors to load
    from which files — enabling true partial reads with no runtime discovery
    and no redundant tensor reads.

    :param model_stub: huggingface model hub or path to local weights files
    :param save_directory: directory to save quantized weights to
    :param scheme: weight quantization scheme or preset scheme name
    :param ignore: modules to ignore. Modules ending with "norm" are
        automatically ignored
    :param max_workers: number of worker threads to process files with
    :param device: gpu device to accelerate quantization with
    :param converter: optional converter to apply to the checkpoint to convert
        it to compressed-tensors format before running model-free PTQ
    """
    # validate arguments
    model_files = get_checkpoint_files(model_stub)

    scheme_name, scheme = validate_scheme(scheme)
    device = gpu_if_available(device)
    validate_safetensors_index(model_files, scheme)

    # copy non-safetensors files (configs, tokenizers, etc.)
    for file_path, resolved_path in model_files.items():
        if not file_path.endswith("safetensors"):
            save_path = Path(save_directory) / file_path
            if is_weights_file(file_path):
                logger.warning(f"Skip processing for weights file {file_path}")
            save_path.parent.mkdir(parents=True, exist_ok=True)
            logger.info(f"Copying {file_path} -> {save_path}")
            shutil.copyfile(resolved_path, save_path)

    # build quantization jobs
    jobs = _build_jobs(model_files, save_directory, scheme, ignore, device, converter)

    # 1. validate quantizable tensors — fail fast before long-running quantization
    validate_jobs = [(validate_file, *job[1:]) for job in jobs]
    exec_jobs(validate_jobs, max_workers, desc="Validating")

    # 2-5. quantize and compress weights
    total_size = 0
    weight_map = dict()
    quantize_results = exec_jobs(jobs, max_workers, desc="Quantizing")
    for _total_size, _weight_map in quantize_results:
        total_size += _total_size
        weight_map.update(_weight_map)

    # 6. update config and safetensors index
    # weight_map may contain tensors re-located to new shards (partner tensors
    # re-saved alongside the shard that needed them for fused scale computation)
    update_config(save_directory, scheme_name, scheme, ignore, converter)
    update_safetensors_index(save_directory, total_size, weight_map)
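The control flow in the listing above (validate every job first, then quantize, then merge per-shard results into one index) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `run_jobs`, `validate_shard`, `quantize_shard`, and `write_index` are hypothetical stand-ins for `exec_jobs`, `validate_file`, the real quantization job, and `update_safetensors_index`, and the toy "tensors" are just byte counts. The index layout written here is the common `model.safetensors.index.json` shape with `metadata.total_size` and `weight_map`.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_jobs(jobs, max_workers=1):
    """Run (function, *args) tuples on a thread pool, preserving job order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn, *args) for fn, *args in jobs]
        # result() re-raises any exception raised inside a job.
        return [future.result() for future in futures]


def validate_shard(shard_name, tensors):
    """Cheap check that fails before any expensive work starts."""
    if not tensors:
        raise ValueError(f"{shard_name} contains no tensors")


def quantize_shard(shard_name, tensors):
    """Toy job: values are byte sizes; returns (size, weight map) for the shard."""
    size = sum(tensors.values())
    return size, {name: shard_name for name in tensors}


def write_index(save_directory, total_size, weight_map):
    """Write a sharded-checkpoint index file and return its path."""
    index = {
        "metadata": {"total_size": total_size},
        "weight_map": dict(sorted(weight_map.items())),
    }
    path = Path(save_directory) / "model.safetensors.index.json"
    path.write_text(json.dumps(index, indent=2))
    return path


jobs = [
    (quantize_shard, "a.safetensors", {"w1": 10, "w2": 20}),
    (quantize_shard, "b.safetensors", {"w3": 5}),
]

# Pass 1: same arguments, different function, so bad inputs fail fast.
validate_jobs = [(validate_shard, *job[1:]) for job in jobs]
run_jobs(validate_jobs, max_workers=2)

# Pass 2: do the work and merge the per-shard results.
total_size = 0
weight_map = {}
for shard_size, shard_map in run_jobs(jobs, max_workers=2):
    total_size += shard_size
    weight_map.update(shard_map)

with tempfile.TemporaryDirectory() as tmp:
    index_path = write_index(tmp, total_size, weight_map)
    written = json.loads(index_path.read_text())
```

Reusing each job's arguments for the validation pass keeps the two passes in sync: any shard that will be quantized is guaranteed to have been checked with exactly the same inputs.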