llmcompressor.modifiers.quantization.calibration

Functions:

calibrate_input_hook –

Hook to calibrate input activations by accumulating statistics in the observer.
calibrate_output_hook –

Hook to calibrate output activations by accumulating statistics in the observer.
freeze_module_quantization –

Delete observers when calibration is complete.
get_modules –

Extract all modules from parent modules and return a deduplicated list preserving iteration order.
initialize_observer –

Initialize observer module and attach as submodule.
observe –

Run observers to accumulate statistics on modules.
update_qparams –

Compute quantization parameters from observer statistics and store on module.

calibrate_input_hook

calibrate_input_hook(module: Module, args: Any)

Hook to calibrate input activations by accumulating statistics in the observer.

Source code in src/llmcompressor/modifiers/quantization/calibration.py

def calibrate_input_hook(module: Module, args: Any):
    """
    Hook to calibrate input activations by accumulating statistics in the observer.
    """
    args = args[0] if isinstance(args, tuple) else args
    module.input_observer(args)
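
The unwrapping step matters because PyTorch forward pre-hooks receive the positional arguments as a tuple, so only the first positional argument reaches the observer. The sketch below reproduces that logic with plain-Python stand-ins; `RecordingObserver` and `FakeModule` are hypothetical names used for illustration and are not part of llmcompressor.

```python
from typing import Any


class RecordingObserver:
    """Hypothetical stand-in for an observer: it only records what it is given."""

    def __init__(self):
        self.seen = []

    def __call__(self, value):
        self.seen.append(value)


class FakeModule:
    """Hypothetical stand-in for a module with an attached input observer."""

    def __init__(self):
        self.input_observer = RecordingObserver()


def calibrate_input_hook(module: Any, args: Any):
    # Same logic as the listing above: unwrap the positional-args tuple
    args = args[0] if isinstance(args, tuple) else args
    module.input_observer(args)


module = FakeModule()
calibrate_input_hook(module, ("x", "attention_mask"))  # tuple: first element observed
calibrate_input_hook(module, "y")  # non-tuple: observed as-is
print(module.input_observer.seen)  # ['x', 'y']
```

With a real model, a hook of this shape would typically be attached through `torch.nn.Module.register_forward_pre_hook`.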

calibrate_output_hook

calibrate_output_hook(
    module: Module, _args: Any, output: Tensor
)

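Hook to calibrate output activations by accumulating statistics in the observer. The output is then passed through forward_quantize using the scheme's output_activations args, and the hook returns the result.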

Source code in src/llmcompressor/modifiers/quantization/calibration.py

def calibrate_output_hook(module: Module, _args: Any, output: torch.Tensor):
    """
    Hook to calibrate output activations by accumulating statistics in the observer.
    """
    module.output_observer(output)
    output = forward_quantize(
        module=module,
        value=output,
        base_name="output",
        args=module.quantization_scheme.output_activations,
    )
    return output

freeze_module_quantization

freeze_module_quantization(module: Module)

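Delete observers when calibration is complete and mark the module as frozen. Modules without a quantization scheme, and modules that are already frozen, are left unchanged.

Apply to a full model with model.apply(freeze_module_quantization).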

Parameters:

module (Module) –

module to freeze quantization for

Source code in src/llmcompressor/modifiers/quantization/calibration.py

def freeze_module_quantization(module: Module):
    """
    deletes observers when calibration is complete.

    apply to full model with `model.apply(freeze_module_quantization)`

    :param module: module to freeze quantization for
    """
    scheme = getattr(module, "quantization_scheme", None)
    if not scheme:
        # no quantization scheme nothing to do
        return

    if module.quantization_status == QuantizationStatus.FROZEN:
        # nothing to do, already frozen
        return

    # remove observers
    for name in ("input", "weight", "output", "q", "k", "v"):
        obs_name = f"{name}_observer"
        if hasattr(module, obs_name):
            getattr(module, obs_name).detach(module)
            delattr(module, obs_name)

    module.quantization_status = QuantizationStatus.FROZEN
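
To illustrate the early returns and the observer cleanup, here is a minimal sketch that runs the same control flow on stand-in objects. `QuantizationStatus`, `FakeObserver`, `FakeModule`, and `freeze` are hypothetical local definitions that mimic the real ones; they are assumptions for illustration only.

```python
from enum import Enum


class QuantizationStatus(Enum):
    """Hypothetical stand-in for the real status enum."""

    CALIBRATION = "calibration"
    FROZEN = "frozen"


class FakeObserver:
    """Hypothetical observer that records how often detach() is called."""

    def __init__(self):
        self.detach_calls = 0

    def detach(self, module):
        self.detach_calls += 1


class FakeModule:
    def __init__(self):
        self.quantization_scheme = object()  # any truthy scheme
        self.quantization_status = QuantizationStatus.CALIBRATION
        self.input_observer = FakeObserver()
        self.weight_observer = FakeObserver()


def freeze(module):
    # Mirrors the listing above: skip unquantized or already-frozen modules
    if not getattr(module, "quantization_scheme", None):
        return
    if module.quantization_status == QuantizationStatus.FROZEN:
        return
    for name in ("input", "weight", "output", "q", "k", "v"):
        obs_name = f"{name}_observer"
        if hasattr(module, obs_name):
            getattr(module, obs_name).detach(module)
            delattr(module, obs_name)
    module.quantization_status = QuantizationStatus.FROZEN


module = FakeModule()
weight_obs = module.weight_observer
freeze(module)
freeze(module)  # second call returns early because the module is already frozen
```

After the first call the observers are gone and the status is `FROZEN`, so repeated calls (for example from a second `model.apply`) do nothing.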

get_modules

get_modules(parents: Iterable[Module]) -> list[Module]

Extract all modules from parent modules and return a deduplicated list preserving iteration order.

This is critical for DDP: all ranks must process modules in the same order to avoid NCCL deadlocks when collective operations (e.g., all_reduce) are called during observer synchronization.

Parameters:

parents (Iterable[Module]) –

iterable of parent modules

Returns:

list[Module] –

deduplicated list of all modules in iteration order

Source code in src/llmcompressor/modifiers/quantization/calibration.py

def get_modules(parents: Iterable[Module]) -> list[Module]:
    """
    Extract all modules from parent modules and return a deduplicated list
    preserving iteration order.

    This is critical for DDP: all ranks must process modules in the same order
    to avoid NCCL deadlocks when collective operations (e.g., all_reduce) are
    called during observer synchronization.

    :param parents: iterable of parent modules
    :return: deduplicated list of all modules in iteration order
    """
    seen = set()
    result = []
    for parent in parents:
        for module in parent.modules():
            if module not in seen:
                seen.add(module)
                result.append(module)
    return result
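
The deduplication can be seen on a small example where two parents share a child. The sketch below uses a hypothetical `Node` class that only imitates the `.modules()` iterator of `torch.nn.Module`; the `get_modules` body is copied from the listing above.

```python
class Node:
    """Hypothetical stand-in exposing .modules() like torch.nn.Module."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def modules(self):
        # Yield self first, then descendants, depth-first
        yield self
        for child in self.children:
            yield from child.modules()


def get_modules(parents):
    seen = set()
    result = []
    for parent in parents:
        for module in parent.modules():
            if module not in seen:
                seen.add(module)
                result.append(module)
    return result


shared = Node("shared")
a = Node("a", [shared])
b = Node("b", [shared])
names = [m.name for m in get_modules([a, b])]
print(names)  # ['a', 'shared', 'b']
```

Because the result is a list built in first-seen order rather than a set, every process that passes the same parents in the same order gets the same sequence.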

initialize_observer

initialize_observer(module: Module, base_name: str)

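Initialize an observer and attach it to the module as a submodule. The observer name is read from the module's quantization args (weights, input_activations, or output_activations, selected by base_name) and used to load the observer from the registry. The observer is then registered on the module as {base_name}_observer.

For weights, this function always uses memoryless observers: minmax and static_minmax are replaced with memoryless_minmax, and mse is replaced with memoryless_mse. No observer is attached when args.dynamic is True.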

Parameters:

module (Module) –

torch.nn.Module that the observer is being attached to
base_name (str) –

str used to name the observer attribute

Source code in src/llmcompressor/modifiers/quantization/calibration.py

def initialize_observer(
    module: Module,
    base_name: str,
):
    """
    Initialize observer module and attach as submodule.
    The name of the observer is fetched from the quantization_args.
    The name is then used to load the observer from the registry and attached
    to the module. The name of the observer uses the base_name provided.

    This function always initializes memoryless observers for weights

    :param module: torch.nn.Module that the observer is being attached to
    :param base_name: str used to name the observer attribute

    """
    if base_name == "weight":
        arg_name = "weights"
    elif base_name == "output":
        arg_name = "output_activations"
    else:  # input, q, k, v
        arg_name = "input_activations"

    args: QuantizationArgs = getattr_chain(
        module, f"quantization_scheme.{arg_name}", None
    )
    observer = args.observer

    # training is no longer supported: always use memoryless for weights
    if base_name == "weight" and args.observer in ("static_minmax", "minmax"):
        observer = "memoryless_minmax"
        logger.warning(
            "Overriding weight observer for lower memory usage "
            f"({args.observer} -> {observer})",
            log_once=True,
        )
    if base_name == "weight" and args.observer in ("mse",):
        observer = "memoryless_mse"
        logger.warning(
            "Overriding weight observer for lower memory usage "
            f"({args.observer} -> {observer})",
            log_once=True,
        )

    if args is not None and args.dynamic is not True:
        observer = Observer.load_from_registry(observer, base_name=base_name, args=args)
        module.register_module(f"{base_name}_observer", observer)
        observer.attach(module)
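
The name resolution above can be summarized as a pure function. This is a sketch only: `resolve_observer` is a hypothetical helper that does not exist in llmcompressor, and it covers just the mapping from `base_name` to the scheme attribute and the weight-observer override, not the registry lookup.

```python
def resolve_observer(base_name: str, requested: str) -> tuple[str, str]:
    """Hypothetical helper: return (scheme attribute, observer registry name)."""
    if base_name == "weight":
        arg_name = "weights"
    elif base_name == "output":
        arg_name = "output_activations"
    else:  # input, q, k, v
        arg_name = "input_activations"

    observer = requested
    if base_name == "weight":
        # Weights are always observed with memoryless variants
        if requested in ("static_minmax", "minmax"):
            observer = "memoryless_minmax"
        elif requested == "mse":
            observer = "memoryless_mse"
    return arg_name, observer


print(resolve_observer("weight", "minmax"))  # ('weights', 'memoryless_minmax')
print(resolve_observer("q", "minmax"))       # ('input_activations', 'minmax')
```

Only the weight path is rewritten; activation observers keep the requested name.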

observe

observe(module: Module | Iterable[Module], base_name: str)

Run observers to accumulate statistics on modules. Must be called before update_qparams.

Parameters:

module (Module | Iterable[Module]) –

module or iterable of modules with observer attributes
base_name (str) –

substring used to fetch the observer and value to observe

Source code in src/llmcompressor/modifiers/quantization/calibration.py

def observe(
    module: Module | Iterable[Module],
    base_name: str,
):
    """
    Run observers to accumulate statistics on modules.
    Must be called before update_qparams.

    :param module: module or iterable of modules with observer attributes
    :param base_name: substring used to fetch the observer and value to observe
    """
    if isinstance(module, Iterable):
        for m in module:
            observe(m, base_name)
        return

    observer = getattr(module, f"{base_name}_observer", None)
    if observer is None:
        return

    observer(getattr(module, base_name))
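
One consequence of the `isinstance(module, Iterable)` check is that any object defining `__iter__` takes the recursive branch, so its own observer is never called. The sketch below shows this with hypothetical stand-ins (`Leaf`, `Container`); it is reasonable to expect container modules that define `__iter__`, such as `torch.nn.Sequential`, to behave like `Container` here, but that is an inference from the listing rather than documented behavior.

```python
from collections.abc import Iterable


class Leaf:
    """Hypothetical stand-in for a leaf module with a weight and an observer."""

    def __init__(self, weight, log):
        self.weight = weight
        self.weight_observer = log.append  # "observing" just records the value


class Container:
    """Hypothetical stand-in for a container module that defines __iter__."""

    def __init__(self, children, log):
        self.children = children
        self.weight = "container-weight"
        self.weight_observer = log.append

    def __iter__(self):
        return iter(self.children)


def observe(module, base_name):
    # Same control flow as the listing above
    if isinstance(module, Iterable):
        for m in module:
            observe(m, base_name)
        return
    observer = getattr(module, f"{base_name}_observer", None)
    if observer is None:
        return
    observer(getattr(module, base_name))


log = []
leaves = [Leaf("w0", log), Leaf("w1", log)]
observe(Container(leaves, log), "weight")
print(log)  # ['w0', 'w1']: the container's own observer is never called

observe(leaves, "input")  # no input_observer attached: silently does nothing
```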

update_qparams

update_qparams(
    module: Module | Iterable[Module],
    base_name: str | Iterable[str],
    only_update_onload: bool = False,
)

Compute quantization parameters from observer statistics and store on module.

For dynamic quantization, scale/zp updates are skipped (scale/zp are computed at inference time). For non-TENSOR_GROUP strategies, global_scale is None and naturally skipped.

Parameters:

module (Module | Iterable[Module]) –

torch.nn.Module with attached observer (or iterable of modules)
base_name (str | Iterable[str]) –

substring used to fetch the observer, scales, and zp. Can be a string or iterable of strings.
only_update_onload (bool, default: False) –

if True, update only the onloaded value. Useful for temporary updates, or in DDP situations where only one rank should update the offload+onload and the remaining ranks update just the onload, avoiding multiple writes to the offload.

Source code in src/llmcompressor/modifiers/quantization/calibration.py

def update_qparams(
    module: Module | Iterable[Module],
    base_name: str | Iterable[str],
    only_update_onload: bool = False,
):
    """
    Compute quantization parameters from observer statistics and store on module.

    For dynamic quantization, scale/zp updates are skipped (scale/zp are
    computed at inference time). For non-TENSOR_GROUP strategies, global_scale
    is None and naturally skipped.

    :param module: torch.nn.Module with attached observer (or iterable of modules)
    :param base_name: substring used to fetch the observer, scales, and zp.
        Can be a string or iterable of strings.
    :param only_update_onload: option to only update the onloaded value, useful
        when we want to do a temporary update or in DDP situations where
        we want only one rank to update the offload+onload to avoid
        multiple writes to the offload (rest just update onload)
    """
    if isinstance(module, Iterable) or not isinstance(base_name, str):
        modules = [module] if not isinstance(module, Iterable) else module
        base_names = [base_name] if isinstance(base_name, str) else base_name
        for m, b in product(modules, base_names):
            update_qparams(m, b, only_update_onload=only_update_onload)
        return

    observer = getattr(module, f"{base_name}_observer", None)
    if observer is None:
        return
    if not observer.has_statistics:
        return

    # Dynamic (activation) quantization: only store global_scale, not scale/zp
    args = observer.args
    is_dynamic = getattr(args, "dynamic", False) in (True, DynamicType.LOCAL)

    qparams = observer.get_qparams()
    for param_name, param_val in qparams.items():
        if param_val is None:
            continue
        if is_dynamic and param_name in ("scale", "zero_point"):
            continue
        full_param_name = f"{base_name}_{param_name}"
        if hasattr(module, full_param_name):
            if not only_update_onload:  # update offload + onload
                update_offload_parameter(module, full_param_name, param_val)
            else:  # only update onload
                getattr(module, full_param_name).data = param_val
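
Two pieces of the listing above are easy to check in isolation: the fan-out over every (module, base_name) pair, and the filter that decides which qparams get stored. The sketch below is a simplified illustration; `select_qparams` is a hypothetical helper, the module names are placeholder strings, and the real function additionally requires that the target attribute already exists on the module.

```python
from itertools import product


def select_qparams(qparams: dict, is_dynamic: bool) -> dict:
    """Hypothetical helper mirroring the filtering loop in update_qparams."""
    kept = {}
    for name, value in qparams.items():
        if value is None:
            continue  # e.g. global_scale for non-TENSOR_GROUP strategies
        if is_dynamic and name in ("scale", "zero_point"):
            continue  # computed at inference time for dynamic quantization
        kept[name] = value
    return kept


# Fan-out: every module is paired with every base name, modules varying slowest
pairs = list(product(["layer0", "layer1"], ["weight", "input"]))
print(pairs)
# [('layer0', 'weight'), ('layer0', 'input'), ('layer1', 'weight'), ('layer1', 'input')]

static = select_qparams({"scale": 0.1, "zero_point": 0, "global_scale": None}, False)
dynamic = select_qparams({"scale": 0.1, "zero_point": 0, "global_scale": 2.0}, True)
print(static)   # {'scale': 0.1, 'zero_point': 0}
print(dynamic)  # {'global_scale': 2.0}
```

A zero_point of 0 is kept because the check is `is None`, not truthiness.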