llmcompressor.observers
Framework for monitoring and analyzing model behavior during compression.
Provides observers for tracking tensor statistics, activation ranges, and model behavior during compression workflows. Includes min-max observers, MSE observers, and helper utilities for quantization and other compression techniques.
Modules:
-
base–
-
helpers–Helper functions for observer token counting and analysis.
-
imatrix–
-
min_max–
-
mse–
Classes:
-
IMatrixMSEObserver–MSE observer weighted by per-input-channel importance (E[x²]).
-
MemorylessMinMaxObserver–Compute quantization parameters by taking the min/max of the observed value.
-
MinMaxObserver–Compute quantization parameters by taking the moving average of min/max values.
-
MovingAverageMSEObserver–Compute quantization parameters by finding the optimal min/max values which minimize the mean of quantization error squared, with moving average smoothing.
-
Observer–Base class for observers which compute quantization parameters given observations of weights, activations, or attention states.
-
QParamsDict–Dictionary containing quantization parameters.
-
StaticMinMaxObserver–Compute quantization parameters by taking the min/max of all observed values.
Functions:
-
flatten_for_calibration–Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration.
-
fuse_weight_observers–Link weight observers across fused layer groups for shared global_scale.
IMatrixMSEObserver
Bases: Observer
MSE observer weighted by per-input-channel importance (E[x²]).
Supports the CHANNEL, GROUP, and TENSOR_GROUP strategies for weight-only Linear modules. Falls back to uniform MSE when importance data is unavailable.
Importance is accumulated as raw _imatrix_sum / _imatrix_count
and synced across DDP ranks via _act_sync_dict before observation.
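The weighting described above can be pictured with a minimal plain-Python sketch. It assumes the error of a weight row is the importance-weighted mean of the squared per-channel quantization errors, with importance taken as sum / count. The names fake_quantize and weighted_mse, the 4-bit symmetric grid, and the normalization by the total importance are illustrative assumptions, not the library's implementation.

```python
def fake_quantize(w, scale, qmin=-8, qmax=7):
    """Round-trip one value through a symmetric integer grid (illustrative)."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def weighted_mse(row, scale, importance=None):
    """Quantization error of one weight row, weighted per input channel.

    importance[j] stands in for E[x^2] of input channel j. When it is None,
    every channel gets weight 1, which is the uniform-MSE fallback.
    """
    if importance is None:
        importance = [1.0] * len(row)
    errors = [(w - fake_quantize(w, scale)) ** 2 for w in row]
    return sum(i * e for i, e in zip(importance, errors)) / sum(importance)


row = [0.9, -0.1, 0.4, 0.05]
imatrix_sum, imatrix_count = [8.0, 0.2, 2.0, 0.1], 4   # raw accumulators
importance = [s / imatrix_count for s in imatrix_sum]  # E[x^2] per channel

uniform = weighted_mse(row, scale=0.25)
weighted = weighted_mse(row, scale=0.25, importance=importance)
```

In this toy data the first channel carries most of the importance, so its rounding error dominates the weighted score while the low-importance channels barely contribute.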
Methods:
-
attach–Attach a forward-pre hook to accumulate E[x²] per input channel.
-
detach–Remove hooks and leave raw sum/count on module for second-pass pickup.
Source code in src/llmcompressor/observers/imatrix.py
attach
Attach a forward-pre hook to accumulate E[x²] per input channel.
If raw accumulators (_imatrix_sum / _imatrix_count) already
exist on the module (second pass after IMatrixGatherer), copy them
to the observer and skip hook registration.
Source code in src/llmcompressor/observers/imatrix.py
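As a rough illustration of what such a hook accumulates, the sketch below keeps a running sum of x² per input channel together with a token count. ImportanceAccumulator is a hypothetical stand-in written in plain Python; the real observer registers a forward-pre hook on the module and works on tensors.

```python
class ImportanceAccumulator:
    """Toy stand-in for the state a forward-pre hook would update."""

    def __init__(self, num_channels):
        self.sum = [0.0] * num_channels  # running sum of x^2 per input channel
        self.count = 0                   # number of token vectors seen so far

    def __call__(self, batch):
        """Accumulate one batch, given as a list of token vectors."""
        for token in batch:
            for j, x in enumerate(token):
                self.sum[j] += x * x
        self.count += len(batch)

    def importance(self):
        """E[x^2] per input channel, derived from the raw accumulators."""
        return [s / self.count for s in self.sum]


acc = ImportanceAccumulator(num_channels=2)
acc([[1.0, 0.0], [3.0, 2.0]])  # first calibration batch (2 tokens)
acc([[2.0, 2.0]])              # second calibration batch (1 token)
```

Keeping the raw sum and count, rather than a running mean, is what allows a later pass to pick the accumulators up and continue from them.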
detach
Remove hooks and leave raw sum/count on module for second-pass pickup.
Case 1 – accumulators present on module: leave them for the next observer's attach() to pick up.
Case 2 – no accumulators (second-pass cleanup): nothing to do.
Source code in src/llmcompressor/observers/imatrix.py
MemorylessMinMaxObserver
Bases: Observer
Compute quantization parameters by taking the min/max of the observed value.
Source code in src/llmcompressor/observers/base.py
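For orientation, a common way to turn an observed min/max into a scale and zero point is sketched below. This is the generic asymmetric integer formula, not necessarily the exact computation the library performs; qparams_from_min_max and its num_bits parameter are hypothetical names.

```python
def qparams_from_min_max(min_val, max_val, num_bits=8):
    """Asymmetric integer scale and zero point from an observed range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The range must contain zero so that 0.0 is exactly representable.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # an all-zero tensor would otherwise give a zero scale
    zero_point = max(qmin, min(qmax, qmin - round(min_val / scale)))
    return scale, zero_point


observed = [-1.0, 0.25, 3.0, 1.5]
scale, zero_point = qparams_from_min_max(min(observed), max(observed))
```

"Memoryless" here means that each call uses only the value passed in; nothing from earlier observations influences the result.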
MinMaxObserver
Bases: Observer
Compute quantization parameters by taking the moving average of min/max values.
Source code in src/llmcompressor/observers/min_max.py
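The moving average can be sketched as an exponential update of the running min and max. MovingAverageRange and the averaging_constant parameter name are assumptions made for this illustration, and the update rule shown is the standard exponential moving average rather than a copy of the library code.

```python
class MovingAverageRange:
    """Track min/max with an exponential moving average (illustrative)."""

    def __init__(self, averaging_constant=0.01):
        self.averaging_constant = averaging_constant
        self.min_val = None
        self.max_val = None

    def update(self, observed):
        cur_min, cur_max = min(observed), max(observed)
        if self.min_val is None:
            # The first observation initializes the running statistics.
            self.min_val, self.max_val = cur_min, cur_max
        else:
            c = self.averaging_constant
            self.min_val += c * (cur_min - self.min_val)
            self.max_val += c * (cur_max - self.max_val)
        return self.min_val, self.max_val


tracker = MovingAverageRange(averaging_constant=0.5)
tracker.update([-1.0, 2.0])
smoothed = tracker.update([-3.0, 4.0])
```

A small constant makes the range react slowly to outlier batches; a constant of 1.0 would reduce this to the memoryless behavior.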
MovingAverageMSEObserver
Bases: Observer
Compute quantization parameters by finding the optimal min/max values which minimize the mean of quantization error squared, with moving average smoothing.
Source code in src/llmcompressor/observers/mse.py
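One way to search for an error-minimizing range is a grid search over shrunken versions of the observed min/max, sketched below in plain Python. The helper names, the 4-bit grid, and the steps and max_shrink parameters are assumptions for illustration; the library's search procedure may differ, and the moving average smoothing step is omitted.

```python
def quantize_error(values, lo, hi, num_bits=4):
    """Mean squared error of round-tripping values through the range [lo, hi]."""
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    if scale == 0.0:
        scale = 1.0  # degenerate range: avoid dividing by zero
    total = 0.0
    for v in values:
        q = max(0, min(levels, round((v - lo) / scale)))
        total += (v - (lo + q * scale)) ** 2
    return total / len(values)


def mse_min_max(values, steps=20, max_shrink=0.8):
    """Try progressively shrunken ranges and keep the one with the lowest MSE."""
    lo0, hi0 = min(values), max(values)
    best = (quantize_error(values, lo0, hi0), lo0, hi0)
    for i in range(1, steps + 1):
        p = 1.0 - max_shrink * i / steps  # shrink factor, from near 1.0 down to 0.2
        lo, hi = p * lo0, p * hi0
        best = min(best, (quantize_error(values, lo, hi), lo, hi))
    return best[1], best[2]


values = [0.01 * i for i in range(-50, 51)] + [8.0]  # dense core plus one outlier
lo, hi = mse_min_max(values)
```

Because the full observed range is always one of the candidates, the selected range can never have a higher error than plain min/max on the same data.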
Observer
Bases: InternalModule, RegistryMixin
Base class for observers which compute quantization parameters given observations of weights, activations, or attention states.
Parameters:
-
base_name(str) –str used to name the observer attribute
-
args(QuantizationArgs) –quantization args used to calibrate and quantize the observed value
-
**observer_kwargs–keyword arguments for observer initialization
Methods:
-
attach–Called when the observer is attached to a module.
-
detach–Called before the observer is deleted from a module.
-
forward–Update observer statistics from observed value.
-
fuse–Link all observers in the list with each other for shared global_scale.
-
get_qparams–Compute quantization parameters from accumulated statistics.
-
sync_activation_stats–All-reduce accumulated activation statistics across DDP ranks.
-
update_statistics_from_observed–Update internal observer statistics (min_vals, max_vals) from observed tensor.
Source code in src/llmcompressor/observers/base.py
attach
Called when the observer is attached to a module. Subclasses can override to register hooks or initialize state.
Parameters:
-
module(Module) –the module this observer is being attached to
Source code in src/llmcompressor/observers/base.py
detach
Called before the observer is deleted from a module. Subclasses can override to remove hooks and clean up module attributes.
Parameters:
-
module(Module) –the module this observer is being removed from
Source code in src/llmcompressor/observers/base.py
forward
Update observer statistics from observed value.
Parameters:
-
observed(Tensor) –value being observed
Returns:
-
Observer–self for method chaining
Source code in src/llmcompressor/observers/base.py
fuse
staticmethod
Link all observers in the list with each other for shared global_scale.
Parameters:
-
observers(Iterable[Observer]) –list of observers to fuse together
Source code in src/llmcompressor/observers/base.py
get_qparams
Compute quantization parameters from accumulated statistics.
For TENSOR_GROUP, global_scale is computed from the absmax of this observer and all fused observers. Fused observers must already have statistics, so call observe_weight on all modules before calling get_qparams on any of them.
Returns:
-
QParamsDict–dict with keys "scale", "zero_point", and "global_scale"
Source code in src/llmcompressor/observers/base.py
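The ordering requirement for fused observers can be illustrated with a toy class. ToyObserver, observe, and group_absmax are hypothetical names; the sketch only shows how a shared absmax might be derived from linked observers, and leaves out the format-specific mapping from absmax to global_scale.

```python
class ToyObserver:
    """Minimal stand-in that tracks min/max and a list of fused peers."""

    def __init__(self):
        self.min_val = None
        self.max_val = None
        self.fused = [self]

    @staticmethod
    def fuse(observers):
        """Point every observer at the same list of peers (itself included)."""
        observers = list(observers)
        for obs in observers:
            obs.fused = observers

    def observe(self, values):
        self.min_val, self.max_val = min(values), max(values)

    def group_absmax(self):
        """Largest absolute value seen by any observer in the fused group."""
        if any(o.max_val is None for o in self.fused):
            raise RuntimeError("observe every fused weight before computing qparams")
        return max(max(abs(o.min_val), abs(o.max_val)) for o in self.fused)


q, k, v = ToyObserver(), ToyObserver(), ToyObserver()
ToyObserver.fuse([q, k, v])
q.observe([-0.5, 0.25])
k.observe([-2.0, 1.0])
v.observe([-0.1, 0.75])
shared = q.group_absmax()
```

Every observer in the group reports the same value, which is why all of them need statistics before any one of them computes its parameters.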
sync_activation_stats
All-reduce accumulated activation statistics across DDP ranks.
Note: weight statistics do not need to be synced, because weights are already identical across ranks; only the data (activations) differs by rank.
Returns:
-
List[Work]–list of async communication handles
Source code in src/llmcompressor/observers/base.py
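The effect of the all-reduce can be simulated without a distributed backend. The sketch below assumes each rank holds per-channel sums plus a token count and that the reduction is an element-wise sum; all_reduce_sum is a hypothetical, synchronous stand-in for the real asynchronous collective.

```python
def all_reduce_sum(per_rank_values):
    """Simulate an element-wise SUM all-reduce: every rank receives the total."""
    total = [sum(column) for column in zip(*per_rank_values)]
    return [list(total) for _ in per_rank_values]


# Each rank saw different calibration data: [sum_ch0, sum_ch1, token_count]
rank_stats = [
    [14.0, 8.0, 3.0],  # rank 0
    [2.0, 4.0, 1.0],   # rank 1
]
synced = all_reduce_sum(rank_stats)

# After syncing, every rank derives the same per-channel mean.
means = [[s / stats[-1] for s in stats[:-1]] for stats in synced]
```

Reducing raw sums and counts, rather than averaging per-rank means, keeps the result exact when ranks process different numbers of tokens.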
update_statistics_from_observed
abstractmethod
Update internal observer statistics (min_vals, max_vals) from observed tensor.
Parameters:
-
observed(Tensor) –flattened observed value of shape (num_observations, *qparam_shape, group_size)
Source code in src/llmcompressor/observers/base.py
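The division of labor between forward and update_statistics_from_observed can be sketched with a toy abstract base class. ToyObserverBase, ToyMemoryless, and get_range are hypothetical names; the real Observer additionally flattens the observed tensor and derives quantization parameters.

```python
from abc import ABC, abstractmethod


class ToyObserverBase(ABC):
    """Skeleton of the pattern: forward delegates, subclasses keep the stats."""

    def __init__(self):
        self.min_val = None
        self.max_val = None

    def forward(self, observed):
        self.update_statistics_from_observed(observed)
        return self  # returning self allows method chaining

    @abstractmethod
    def update_statistics_from_observed(self, observed):
        """Subclasses decide how new observations change the statistics."""

    def get_range(self):
        return self.min_val, self.max_val


class ToyMemoryless(ToyObserverBase):
    def update_statistics_from_observed(self, observed):
        # Overwrite the statistics with the current observation only.
        self.min_val, self.max_val = min(observed), max(observed)


result = ToyMemoryless().forward([0.5, -1.5, 2.0]).get_range()
```

Because the update method is abstract, the base class itself cannot be instantiated; each concrete observer supplies its own update rule.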
QParamsDict
Bases: TypedDict
Dictionary containing quantization parameters.
StaticMinMaxObserver
Bases: MemorylessMinMaxObserver
Compute quantization parameters by taking the min/max of all observed values.
Source code in src/llmcompressor/observers/base.py
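In contrast to the memoryless variant, taking the min/max of all observed values amounts to keeping a running extremum. A minimal sketch, with RunningRange as a hypothetical name:

```python
class RunningRange:
    """Keep the global min/max over every observation seen so far."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def update(self, observed):
        self.min_val = min(self.min_val, min(observed))
        self.max_val = max(self.max_val, max(observed))
        return self.min_val, self.max_val


tracker = RunningRange()
tracker.update([-1.0, 2.0])
final = tracker.update([-0.5, 4.0])
```

The range can only widen over time, so a single outlier batch permanently affects the resulting parameters.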
flatten_for_calibration
Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration. The value after flattening has the following shape:
(num_observations, *qparam_shape, group_size)
For block quantization, the value is zero-padded if it is not evenly divisible by block_size, so as not to distort the calculated qparams and to be compatible with vLLM block-wise kernels that do not require even divisibility.
The first dim is the number of observations (usually the batch size times number of tokens), the middle dims are the dimension of the scales, and the last dim is the number of elements being quantized per group.
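To make the target shape concrete, the sketch below flattens a small 2-D weight for group quantization using nested lists. It assumes a weight counts as a single observation, giving the shape (1, out_features, num_groups, group_size); flatten_weight_by_group and shape are hypothetical helpers, and the real function operates on tensors and covers more strategies.

```python
def flatten_weight_by_group(weight, group_size):
    """Reshape a 2-D weight (list of rows) to (1, out, num_groups, group_size)."""
    flattened = []
    for row in weight:
        if len(row) % group_size != 0:
            raise ValueError("in_features must be divisible by group_size")
        flattened.append(
            [row[i:i + group_size] for i in range(0, len(row), group_size)]
        )
    return [flattened]  # leading dim: the weight is treated as one observation


def shape(nested):
    """Shape of a regular nested list, as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


weight = [[1, 2, 3, 4], [5, 6, 7, 8]]  # out_features=2, in_features=4
flat = flatten_weight_by_group(weight, group_size=2)

# Reducing over the last dim yields one statistic per scale entry.
group_min = [[min(group) for group in row] for row in flat[0]]
```

With the data laid out this way, a min or max over the last dimension has exactly the shape of the scales, here 2 rows by 2 groups.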
Parameters:
-
value(Tensor) –value being flattened
-
base_name(str) –one of weight, input, output, or q/k/v; used to characterize the value as a weight, activation, or attention state
-
args(QuantizationArgs) –quantization args for determining how the value is flattened
Returns:
-
Tensor–value which has been reshaped for calibration
Source code in src/llmcompressor/observers/helpers.py
fuse_weight_observers
Link weight observers across fused layer groups for shared global_scale.
For TENSOR_GROUP quantization (e.g. NVFP4), vLLM requires that fused layers (Q/K/V attention, gate/up MLP) share the same global_scale. This function links their observers so that get_qparams() computes global_scale from the combined statistics of all observers in the group.
Parameters:
-
model(Module) –model whose weight observers should be linked
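How fused groups might be identified can be sketched by grouping module names under their parent. The suffixes q_proj, k_proj, v_proj, gate_proj, and up_proj are common naming conventions assumed here for illustration, and find_fused_groups is a hypothetical helper; the function documented above may locate fused layers differently.

```python
from collections import defaultdict

# Assumed layer-name suffixes for fused groups (not taken from the library).
FUSED_SUFFIXES = [("q_proj", "k_proj", "v_proj"), ("gate_proj", "up_proj")]


def find_fused_groups(module_names):
    """Group sibling modules whose leaf names form a complete fused set."""
    by_parent = defaultdict(dict)
    for name in module_names:
        parent, _, leaf = name.rpartition(".")
        by_parent[parent][leaf] = name
    groups = []
    for leaves in by_parent.values():
        for suffixes in FUSED_SUFFIXES:
            if all(s in leaves for s in suffixes):
                groups.append([leaves[s] for s in suffixes])
    return groups


names = [
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.k_proj",
    "layers.0.self_attn.v_proj",
    "layers.0.self_attn.o_proj",
    "layers.0.mlp.gate_proj",
    "layers.0.mlp.up_proj",
    "layers.0.mlp.down_proj",
]
groups = find_fused_groups(names)
```

The weight observers of each resulting group would then be linked together, for example by passing them to the fuse static method, so that they share one global_scale.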