
llmcompressor.observers.helpers

Helper functions for observer token counting and analysis.

Provides utility functions for analyzing observer statistics and token counts across model modules. Used for monitoring compression effects and understanding model behavior during quantization and pruning operations.

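Functions:

  • flatten_for_calibration –

    Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration.

  • fuse_weight_observers –

    Link weight observers across fused layer groups for shared global_scale.

  • lerp –

    Linear interpolation; torch.lerp is not implemented for all dtypes.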

flatten_for_calibration

flatten_for_calibration(
    value: Tensor, base_name: str, args: QuantizationArgs
) -> torch.Tensor

Reshapes the value according to the quantization strategy for scale and zero-point (zp) calibration. After flattening, the value has the following shape:

(num_observations, *qparam_shape, group_size)

For block quantization, the value is zero-padded if its dimensions are not evenly divisible by block_size. Zero-padding avoids distorting the calculated qparams and keeps the result compatible with vLLM block-wise kernels, which do not require even divisibility.

The first dim is the number of observations (usually the batch size times the number of tokens), the middle dims are the dimensions of the scales, and the last dim is the number of elements quantized per group.

Parameters:

  • value (Tensor) –

    value being flattened

  • base_name (str) –

    One of weight, input, output, q, k, or v. Characterizes the value as a weight, an activation, or an attention state. Any other name raises ValueError.

  • args (QuantizationArgs) –

    quantization args for determining how the value is flattened

Returns:

  • Tensor

    value which has been reshaped for calibration
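
The documented shape `(num_observations, *qparam_shape, group_size)` can be illustrated with a minimal sketch in plain PyTorch. The helper `flatten_weight_by_group` is hypothetical and is not part of the library; it assumes a 2D weight, a group size that evenly divides the input dimension, and that a weight counts as a single observation.

```python
import torch


def flatten_weight_by_group(weight: torch.Tensor, group_size: int) -> torch.Tensor:
    """Hypothetical stand-in for a group-wise weight flattening step."""
    rows, cols = weight.shape
    assert cols % group_size == 0, "this sketch assumes even divisibility"
    # Split every row into contiguous groups: (rows, num_groups, group_size)
    grouped = weight.reshape(rows, cols // group_size, group_size)
    # Assumption: a weight is observed once, so num_observations == 1
    return grouped.unsqueeze(0)


weight = torch.randn(8, 64)
flat = flatten_weight_by_group(weight, group_size=16)
print(flat.shape)  # torch.Size([1, 8, 4, 16])

# Reducing over the observation dim and the last dim leaves one statistic
# per scale entry, i.e. a tensor of shape qparam_shape.
max_per_group = flat.abs().amax(dim=(0, -1))
print(max_per_group.shape)  # torch.Size([8, 4])
```

With this layout an observer can compute min/max statistics by reducing over the first and last dimensions, regardless of which quantization strategy produced the middle dimensions.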

Source code in src/llmcompressor/observers/helpers.py
def flatten_for_calibration(
    value: torch.Tensor,
    base_name: str,
    args: QuantizationArgs,
) -> torch.Tensor:
    """
    Reshapes the value according to the quantization strategy for the purposes of
    scale/zp calibration. The value after flattening has the following shape:

    `(num_observations, *qparam_shape, group_size)`

    For block quantization, value will be zero-padded if it is not evenly
    divisible by block_size, so as not to distort the calculated qparams and to be
    compatible with vllm block-wise kernels that do not require even divisibility.

    The first dim is the number of observations (usually the batch size times number of
    tokens), the middle dims are the dimension of the scales, and the last dim is the
    number of elements being quantized per group.

    :param value: value being flattened
    :param base_name: weight, input, output, q/k/v. Used to characterize the value as
        being a weight, activation, or attention state
    :param args: quantization args for determining how the value is flattened
    :return: value which has been reshaped for calibration
    """
    if base_name == "weight":
        return _flatten_weight(value, args)
    elif base_name in ("input", "output"):
        return _flatten_activation(value, args)
    elif base_name in ("q", "k", "v"):
        return _flatten_attention(value, args)
    else:
        raise ValueError(f"Unknown quantization base name: {base_name}")
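
The zero-padding described above for block quantization can be sketched as follows. This is an illustration under stated assumptions, not the library's `_flatten_weight` code: the helper names `pad_to_block_multiple` and `flatten_blocks` are hypothetical, and the block shape is taken as a `(height, width)` pair.

```python
import torch
import torch.nn.functional as F


def pad_to_block_multiple(weight: torch.Tensor, block_shape) -> torch.Tensor:
    """Zero-pad a 2D weight so both dims are multiples of the block shape."""
    block_h, block_w = block_shape
    rows, cols = weight.shape
    pad_rows = (-rows) % block_h  # rows missing to reach the next multiple
    pad_cols = (-cols) % block_w
    # F.pad pads the last dim first: (left, right, top, bottom)
    return F.pad(weight, (0, pad_cols, 0, pad_rows), value=0.0)


def flatten_blocks(padded: torch.Tensor, block_shape) -> torch.Tensor:
    """Rearrange to (1, num_block_rows, num_block_cols, block_h * block_w)."""
    block_h, block_w = block_shape
    rows, cols = padded.shape
    blocks = padded.reshape(rows // block_h, block_h, cols // block_w, block_w)
    blocks = blocks.permute(0, 2, 1, 3)  # bring the two block indices to the front
    return blocks.reshape(1, rows // block_h, cols // block_w, block_h * block_w)


weight = torch.ones(5, 7)
padded = pad_to_block_multiple(weight, (4, 4))
print(padded.shape)  # torch.Size([8, 8])

blocks = flatten_blocks(padded, (4, 4))
print(blocks.shape)  # torch.Size([1, 2, 2, 16])
```

Because the padding values are zero, they cannot increase the absolute maximum of any block, which is one way padded entries stay neutral for absmax-based scale calculation.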

fuse_weight_observers

fuse_weight_observers(model: Module)

Link weight observers across fused layer groups for shared global_scale.

For TENSOR_GROUP quantization (e.g. NVFP4), vLLM requires that fused layers (Q/K/V attention, gate/up MLP) share the same global_scale. This function links their observers so that get_qparams() computes global_scale from the combined statistics of all observers in the group.

Parameters:

  • model (Module) –

    model whose weight observers should be linked
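
To see why combined statistics matter, the sketch below compares per-layer global scales with a single shared one. It does not call the library. The constants and the formula in `global_scale_from_absmax` are assumptions about an NVFP4-style global scale and should be checked against the library before use; the tensors stand in for Q/K/V weights.

```python
import torch

# Assumed constants: largest magnitudes representable in FP8 E4M3 and FP4 E2M1
FP8_E4M3_MAX = 448.0
FP4_E2M1_MAX = 6.0


def global_scale_from_absmax(absmax: torch.Tensor) -> torch.Tensor:
    """Assumed NVFP4-style global scale; illustrative only."""
    return (FP8_E4M3_MAX * FP4_E2M1_MAX) / absmax


# Toy stand-ins for the weights of three layers that are fused at inference
q = torch.tensor([[0.5, -2.0]])
k = torch.tensor([[4.0, 1.0]])
v = torch.tensor([[-1.0, 0.25]])

# Unlinked observers: each layer derives its own global scale
separate = [global_scale_from_absmax(w.abs().max()) for w in (q, k, v)]

# Linked observers: one global scale from the combined statistics
combined_absmax = torch.stack([w.abs().max() for w in (q, k, v)]).max()
shared = global_scale_from_absmax(combined_absmax)

print([s.item() for s in separate])  # [1344.0, 672.0, 2688.0]
print(shared.item())  # 672.0
```

The shared value is driven by the layer with the largest magnitude, so every layer in the group can be quantized against the same global scale.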

Source code in src/llmcompressor/observers/helpers.py
def fuse_weight_observers(model: Module):
    """
    Link weight observers across fused layer groups for shared global_scale.

    For TENSOR_GROUP quantization (e.g. NVFP4), vLLM requires that fused
    layers (Q/K/V attention, gate/up MLP) share the same global_scale.
    This function links their observers so that get_qparams() computes
    global_scale from the combined statistics of all observers in the group.

    :param model: model whose weight observers should be linked
    """
    from llmcompressor.observers import Observer

    for submodule in model.modules():
        for layers_to_fuse in FUSED_LAYER_NAMES:
            if not all(hasattr(submodule, name) for name in layers_to_fuse):
                continue

            layers = [getattr(submodule, name) for name in layers_to_fuse]
            observers = []
            for layer in layers:
                obs = getattr(layer, "weight_observer", None)
                if obs is None:
                    break
                if obs.args.strategy != QuantizationStrategy.TENSOR_GROUP:
                    break
                observers.append(obs)
            else:
                Observer.fuse(observers)
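
The group-detection logic above relies on Python's `for ... else`: the `else` branch runs only when the inner loop finishes without `break`, so a single ineligible layer disqualifies the whole group. A minimal, dependency-free sketch of that control flow follows. The contents of `FUSED_LAYER_NAMES`, the string strategy values, and the flattened `obs.strategy` attribute (the source reads `obs.args.strategy`) are simplifications for illustration, not the library's definitions.

```python
from types import SimpleNamespace

# Hypothetical group names; the real FUSED_LAYER_NAMES is defined by the library
FUSED_LAYER_NAMES = [("q_proj", "k_proj", "v_proj"), ("gate_proj", "up_proj")]


def collect_fusable(submodule, required_strategy="tensor_group"):
    """Return the observer groups that would be passed on for fusing."""
    groups = []
    for names in FUSED_LAYER_NAMES:
        if not all(hasattr(submodule, name) for name in names):
            continue  # this submodule does not contain the whole group
        observers = []
        for name in names:
            obs = getattr(getattr(submodule, name), "weight_observer", None)
            if obs is None or obs.strategy != required_strategy:
                break  # one ineligible layer disqualifies the group
            observers.append(obs)
        else:
            # Reached only when the inner loop did not break
            groups.append(observers)
    return groups


def layer(strategy):
    """Build a toy layer with an optional toy observer."""
    obs = SimpleNamespace(strategy=strategy) if strategy else None
    return SimpleNamespace(weight_observer=obs)


attn = SimpleNamespace(
    q_proj=layer("tensor_group"),
    k_proj=layer("tensor_group"),
    v_proj=layer("tensor_group"),
)
mlp = SimpleNamespace(gate_proj=layer("tensor_group"), up_proj=layer("channel"))

print(len(collect_fusable(attn)))  # 1
print(len(collect_fusable(mlp)))  # 0
```

The mixed-strategy MLP yields no group, mirroring how the source skips `Observer.fuse` unless every layer in the group has a TENSOR_GROUP weight observer.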

lerp

lerp(
    start: Tensor, end: Tensor, weight: float
) -> torch.Tensor

Linear interpolation between start and end: a weight of 0 returns start and a weight of 1 returns end. Provided because torch.lerp is not implemented for all dtypes.

Source code in src/llmcompressor/observers/helpers.py
def lerp(start: torch.Tensor, end: torch.Tensor, weight: float) -> torch.Tensor:
    """Linear interpolation — torch.lerp is not implemented for all dtypes."""
    return (start * (1.0 - weight)) + (end * weight)
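
A short usage sketch of the formula above. The function body is copied from the source so the example is self-contained; the moving-average scenario and the variable names `running_max` and `batch_max` are hypothetical.

```python
import torch


def lerp(start: torch.Tensor, end: torch.Tensor, weight: float) -> torch.Tensor:
    """Same arithmetic as the helper above, repeated to keep this runnable."""
    return (start * (1.0 - weight)) + (end * weight)


# Hypothetical moving-average update of a running statistic
running_max = torch.tensor([1.0, 2.0])
batch_max = torch.tensor([3.0, 2.0])

updated = lerp(running_max, batch_max, weight=0.25)
print(updated)  # tensor([1.5000, 2.0000])

# The endpoints behave as expected for finite inputs
assert torch.equal(lerp(running_max, batch_max, 0.0), running_max)
assert torch.equal(lerp(running_max, batch_max, 1.0), batch_max)
```

Because the helper uses only multiplication and addition, it works wherever those operators are defined for the input dtype.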