Adding a New Observer
Observers analyze weight and activation tensors during calibration to compute the statistics needed for quantization. This guide explains how observers fit into the quantization pipeline and how to implement a custom one.
What is an Observer?
When a quantized layer runs a calibration forward pass, it passes the weight or activation tensor to an observer. The observer accumulates statistics that characterize the tensor's distribution. These statistics are later used to compute quantization parameters (scale, zero_point, and optionally global_scale) via compressed-tensors.
Observers work in two phases:
- Observe: forward() reshapes the tensor and calls update_statistics_from_observed() to accumulate statistics
- Compute: get_qparams() converts accumulated statistics into a QParamsDict
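The two phases can be illustrated with a dependency-free toy class. This is a minimal sketch in plain Python, not the llmcompressor implementation: ToyObserver, its list-based "tensor", and the symmetric int8 scale formula are assumptions made for illustration only.

```python
class ToyObserver:
    """Hypothetical stand-in for an observer; not the llmcompressor API."""

    def __init__(self):
        self.min_val = None
        self.max_val = None

    def forward(self, observed):
        # Phase 1 (observe): record statistics for the observed values.
        self.update_statistics_from_observed(observed)

    def update_statistics_from_observed(self, observed):
        self.min_val = min(observed)
        self.max_val = max(observed)

    def get_qparams(self):
        # Phase 2 (compute): turn statistics into quantization parameters.
        # Textbook symmetric int8 formula, assumed here for illustration.
        absmax = max(abs(self.min_val), abs(self.max_val))
        return {"scale": absmax / 127, "zero_point": 0}


observer = ToyObserver()
observer.forward([0.5, -1.0, 2.54, 0.0])
qparams = observer.get_qparams()
```

The point of the split is that forward() can run many times during calibration, while get_qparams() only needs to run once the statistics are final.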
When a scheme requires a global scale (e.g., NVFP4) and the weights require that global_scale to be fused (QKV, MoE), the observers are fused together so that they can jointly calculate a single fused global_scale. This requires that all fused observers have already accumulated statistics.
The base Observer class handles all slicing and reshaping for group-wise, channel-wise, and token-wise strategies before calling your subclass. Your subclass needs to define how statistics are accumulated. The base class get_qparams() handles converting min_vals/max_vals into quantization parameters.
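To make the idea of pre-sliced input concrete, the sketch below mimics group-wise slicing on a tiny weight matrix using nested lists. It is an illustration in plain Python rather than the base class code: split_into_groups is a hypothetical helper, and a real observer receives a torch.Tensor instead of lists.

```python
def split_into_groups(weight, group_size):
    """Hypothetical helper: split each row into chunks of group_size."""
    return [
        [row[i:i + group_size] for i in range(0, len(row), group_size)]
        for row in weight
    ]


# A 2 x 4 weight with group_size=2 yields 2 x 2 groups of 2 values each.
weight = [
    [1.0, -2.0, 0.5, 0.25],
    [3.0, 0.0, -4.0, 1.5],
]
grouped = split_into_groups(weight, group_size=2)

# One (min, max) pair per group: the statistic a min/max observer
# would compute by reducing over the trailing group dimension.
min_vals = [[min(group) for group in row] for row in grouped]
max_vals = [[max(group) for group in row] for row in grouped]
```

Because the slicing happens before the subclass is called, the same reduction over the trailing dimension works unchanged for group-wise, channel-wise, and token-wise strategies.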
The Observer Contract
All observers subclass llmcompressor.observers.Observer. At minimum, you must implement update_statistics_from_observed. If your observer uses min_vals/max_vals as its statistics, the base class get_qparams handles the rest. If your observer uses different statistics, you must also override get_qparams.
Simple case: min/max statistics
Most observers accumulate min_vals and max_vals. In this case, you only need update_statistics_from_observed — the base class get_qparams will pass your min_vals/max_vals to calculate_qparams and handle global_scale for TENSOR_GROUP automatically:
import torch
from llmcompressor.observers import Observer


@Observer.register("my_observer")
class MyObserver(Observer):
    _act_sync_dict = {}

    def update_statistics_from_observed(self, observed: torch.Tensor) -> None:
        """
        Update internal statistics from the observed tensor.

        The base class has already reshaped the tensor into
        shape (num_observations, *qparam_shape, group_size).

        :param observed: pre-processed tensor ready for statistics computation
        """
        self.min_vals = torch.amin(observed, dim=(0, -1))
        self.max_vals = torch.amax(observed, dim=(0, -1))
Custom statistics
If your observer uses statistics other than min_vals/max_vals (e.g., mean/std, histograms), you must also override get_qparams to convert your statistics into a QParamsDict. You may also need to override has_statistics since the base class checks for min_vals by default.
The following example tracks mean and standard deviation, then derives min/max as mean ± k·std to clip outliers before computing quantization parameters:
import torch
from compressed_tensors.quantization import QuantizationStrategy
from compressed_tensors.quantization.utils import calculate_qparams, generate_gparam
from llmcompressor.observers.base import Observer, QParamsDict


@Observer.register("normal_observer")
class NormalObserver(Observer):
    """
    Derives quantization range from mean ± k·std of the observed tensor.
    Configure k via observer_kwargs (default: 2.0).
    """

    _act_sync_dict = {}

    @property
    def has_statistics(self) -> bool:
        return hasattr(self, "_mean") and hasattr(self, "_std")

    def update_statistics_from_observed(self, observed: torch.Tensor) -> None:
        self._mean = observed.mean(dim=(0, -1))
        self._std = observed.std(dim=(0, -1))

    def get_qparams(self) -> QParamsDict:
        k = self.args.observer_kwargs.get("k", 2.0)
        min_vals = self._mean - k * self._std
        max_vals = self._mean + k * self._std

        global_scale = None
        if self.args.strategy == QuantizationStrategy.TENSOR_GROUP:
            global_absmax = torch.max(-min_vals.min(), max_vals.max())
            for obs in self._fused_observers:
                assert obs.has_statistics
                # Apply the same mean ± k·std range to each fused observer.
                # For k >= 0 the largest magnitude of that range is |mean| + k·std.
                obs_absmax = (obs._mean.abs() + k * obs._std).max()
                global_absmax = torch.max(global_absmax, obs_absmax)
            global_scale = generate_gparam(
                -global_absmax.reshape(1), global_absmax.reshape(1)
            )

        scale, zero_point = calculate_qparams(
            min_vals=min_vals,
            max_vals=max_vals,
            quantization_args=self.args,
            global_scale=global_scale,
        )
        return {"scale": scale, "zero_point": zero_point, "global_scale": global_scale}
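To see what a mean ± k·std range looks like numerically, here is a small worked example in plain Python. The sample values are made up for illustration, and statistics.stdev uses the sample (n - 1) definition, which matches the default behavior of torch.std.

```python
import statistics

# Seven small values plus one large outlier (illustrative data).
values = [0.1, -0.2, 0.15, -0.05, 0.0, 0.2, -0.1, 8.0]
k = 2.0

mean = statistics.fmean(values)
std = statistics.stdev(values)  # sample standard deviation

clipped_min = mean - k * std
clipped_max = mean + k * std

# The upper bound now sits below the outlier, so the outlier is clipped.
outlier_is_clipped = clipped_max < max(values)
# The lower bound is wider than the observed minimum, because the
# range is symmetric around the mean rather than tied to the data.
lower_bound_is_wider = clipped_min < min(values)
```

Note that the derived range is not guaranteed to lie inside the observed extremes: a single large outlier shifts the mean and inflates the standard deviation, so choosing k is a trade-off between clipping outliers and wasting range.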
DDP synchronization
Each observer must also declare _act_sync_dict — a class-level dict mapping statistic attribute names to DDP reduce operations. The base class sync_activation_stats() uses this to all-reduce statistics across ranks between calibration batches.
- If your observer accumulates state across batches, declare which attributes need syncing and with what operation
- Memoryless observers that overwrite statistics each batch should set _act_sync_dict = {}
- Weight observers typically don't need sync (weights are identical across ranks), but activation observers do
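from torch import distributed as dist
from llmcompressor.observers import Observer


@Observer.register("my_observer")
class MyObserver(Observer):
    # Map each statistic attribute to the reduce op applied across ranks
    _act_sync_dict = {
        "min_vals": dist.ReduceOp.MIN,
        "max_vals": dist.ReduceOp.MAX,
    }
    ...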
The @Observer.register("my_observer") decorator registers your observer under the given name so it can be referenced in recipes by string.
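The effect of an _act_sync_dict that maps min_vals to MIN and max_vals to MAX can be simulated without a process group. The sketch below is plain Python under stated assumptions: the per-rank numbers are invented, and the built-in min and max stand in for dist.ReduceOp.MIN and dist.ReduceOp.MAX. It does not call torch.distributed or sync_activation_stats().

```python
from functools import reduce

# Hypothetical per-rank statistics after one calibration batch.
rank_stats = [
    {"min_vals": -1.5, "max_vals": 2.0},   # rank 0
    {"min_vals": -0.5, "max_vals": 3.5},   # rank 1
    {"min_vals": -2.25, "max_vals": 1.0},  # rank 2
]

# Plain-Python stand-ins for dist.ReduceOp.MIN / dist.ReduceOp.MAX.
sync_dict = {"min_vals": min, "max_vals": max}

# An all-reduce leaves every rank holding the same reduced value.
synced = {
    name: reduce(op, (stats[name] for stats in rank_stats))
    for name, op in sync_dict.items()
}
```

After the reduction every rank would hold the same statistics, equal to what a single process would have computed had it seen all of the data.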
How the Base Class Uses Your Statistics
The default get_qparams reads min_vals/max_vals and passes them to calculate_qparams from compressed-tensors:
# Inside Observer.get_qparams (simplified):

# For TENSOR_GROUP: compute global_scale from this observer
# and all fused observers' min_vals/max_vals
global_scale = None
if self.args.strategy == QuantizationStrategy.TENSOR_GROUP:
    global_absmax = torch.max(-self.min_vals.min(), self.max_vals.max())
    for obs in self._fused_observers:
        global_absmax = torch.max(global_absmax, -obs.min_vals.min())
        global_absmax = torch.max(global_absmax, obs.max_vals.max())
    global_scale = generate_gparam(-global_absmax, global_absmax)

scale, zero_point = calculate_qparams(
    min_vals=self.min_vals,
    max_vals=self.max_vals,
    quantization_args=self.args,
    global_scale=global_scale,
)
return {"scale": scale, "zero_point": zero_point, "global_scale": global_scale}
calculate_qparams handles the actual scale and zero point computation — symmetric vs asymmetric quantization, dtype clamping, MX scale generation, and so on. Your observer controls what statistics are accumulated and how they map to quantization parameters.
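For intuition about what happens to min/max values downstream, the sketch below implements the textbook affine quantization formulas for the symmetric and asymmetric cases. It is an assumption-laden illustration, not the compressed-tensors implementation: textbook_qparams is a hypothetical function, and calculate_qparams may differ in rounding, range conventions, and dtype handling.

```python
def textbook_qparams(min_val, max_val, num_bits=8, symmetric=True):
    """Textbook affine quantization parameters (illustrative only)."""
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    if symmetric:
        # Range is centered on zero, so the zero point is always 0.
        scale = max(abs(min_val), abs(max_val)) / q_max
        return scale, 0
    # Extend the range to include zero so 0.0 stays exactly representable.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (q_max - q_min)
    zero_point = round(q_min - min_val / scale)
    return scale, zero_point


sym_scale, sym_zp = textbook_qparams(-1.0, 3.0, symmetric=True)
asym_scale, asym_zp = textbook_qparams(-1.0, 3.0, symmetric=False)
```

For a skewed range such as [-1.0, 3.0], the asymmetric variant produces a smaller scale (finer resolution) at the cost of a nonzero zero point, which is why the statistics an observer reports have a direct effect on quantization error.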
If you override get_qparams and your observer supports TENSOR_GROUP, you are responsible for computing global_scale from the fused observers yourself (see self._fused_observers).
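The fused computation reduces to taking the largest absolute value across every participating observer. The following plain-Python sketch shows that reduction with invented statistics. The layer names and numbers are hypothetical, and the final conversion to a global_scale (done by generate_gparam in the real code) is deliberately left out.

```python
# Hypothetical (min, max) statistics for three fused projections.
fused_stats = {
    "q_proj": (-0.8, 1.2),
    "k_proj": (-2.5, 0.9),
    "v_proj": (-0.3, 0.4),
}


def absmax(min_val, max_val):
    # Mirrors torch.max(-min_vals.min(), max_vals.max()) for scalars.
    return max(-min_val, max_val)


per_layer_absmax = {name: absmax(lo, hi) for name, (lo, hi) in fused_stats.items()}

# Every fused layer shares the single largest magnitude.
global_absmax = max(per_layer_absmax.values())
```

Here k_proj determines the shared value even though q_proj and v_proj have much smaller ranges, which is why every fused observer must have statistics before any of them computes qparams.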
Stateful Observers
Some observers accumulate statistics across multiple calibration batches. To do this, check for existing state in update_statistics_from_observed:
import torch
from torch import distributed as dist
from llmcompressor.observers import Observer


@Observer.register("my_observer")
class MyObserver(Observer):
    _act_sync_dict = {
        "min_vals": dist.ReduceOp.MIN,
        "max_vals": dist.ReduceOp.MAX,
    }

    def update_statistics_from_observed(self, observed: torch.Tensor) -> None:
        min_vals = torch.amin(observed, dim=(0, -1))
        max_vals = torch.amax(observed, dim=(0, -1))

        # Merge with statistics from earlier batches, if any
        if self.has_statistics:
            min_vals = torch.min(min_vals, self.min_vals)
            max_vals = torch.max(max_vals, self.max_vals)

        self.min_vals = min_vals
        self.max_vals = max_vals
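The practical difference between a memoryless and a stateful observer shows up as soon as there is more than one batch. The comparison below is a plain-Python sketch with invented batch values. It uses tuples in place of tensors and is not tied to any llmcompressor class.

```python
batches = [[0.5, -3.0, 1.0], [0.2, -0.1, 0.4]]

# Memoryless: each batch overwrites the previous statistics.
memoryless = None
for batch in batches:
    memoryless = (min(batch), max(batch))

# Stateful: each batch is merged with the running statistics.
running = None
for batch in batches:
    lo, hi = min(batch), max(batch)
    if running is not None:
        lo, hi = min(lo, running[0]), max(hi, running[1])
    running = (lo, hi)
```

The memoryless result reflects only the last batch, while the running result covers the full range seen during calibration.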
Example: A Percentile-Clipping Observer
The following observer clips outliers by returning min/max values from a configurable percentile range rather than the absolute extremes. This can improve accuracy when tensors have extreme outlier values that would otherwise inflate the quantization range.
import torch
from llmcompressor.observers import Observer


@Observer.register("percentile")
class PercentileObserver(Observer):
    """
    Clips outliers by setting min_vals/max_vals to a configurable percentile range.

    Configure via observer_kwargs:
        percentile (float): the upper percentile to retain, e.g. 99.9
    """

    _act_sync_dict = {}

    def update_statistics_from_observed(self, observed: torch.Tensor) -> None:
        percentile = self.args.observer_kwargs.get("percentile", 99.9)
        lower = (100.0 - percentile) / 100.0
        upper = percentile / 100.0

        # torch.quantile reduces over a single dim and requires float32/float64,
        # so fold the observation dim into the group dim before reducing:
        # (num_observations, *qparam_shape, group_size) -> (*qparam_shape, N)
        flat = observed.movedim(0, -2).flatten(-2).float()
        self.min_vals = torch.quantile(flat, lower, dim=-1).to(observed.dtype)
        self.max_vals = torch.quantile(flat, upper, dim=-1).to(observed.dtype)
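The effect of percentile clipping is easy to check on a small synthetic sample. The sketch below is plain Python with made-up data. The quantile function is a hypothetical helper that uses linear interpolation between neighboring sorted values, which is also the default interpolation mode of torch.quantile.

```python
def quantile(values, q):
    """Hypothetical helper: linear-interpolation quantile for 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# 99 well-behaved values in [-0.98, 0.98] plus two extreme outliers.
values = [i / 50 for i in range(-49, 50)] + [-40.0, 60.0]

percentile = 95.0
clipped_min = quantile(values, (100.0 - percentile) / 100.0)
clipped_max = quantile(values, percentile / 100.0)

absolute_range = max(values) - min(values)
clipped_range = clipped_max - clipped_min
```

With only about a hundred samples, a percentile close to 100 still interpolates toward the outliers, so the percentile that clips effectively depends on how many values each quantization group contains.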
Using the Observer in a Recipe
Reference the registered name ("percentile") via the observer field in QuantizationArgs:
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import QuantizationArgs

recipe = QuantizationModifier(
    targets="Linear",
    scheme={
        "weights": QuantizationArgs(
            num_bits=8,
            type="int",
            symmetric=True,
            strategy="channel",
            observer="percentile",
            observer_kwargs={"percentile": 99.5},
        )
    },
    ignore=["lm_head"],
)
Or from a YAML recipe:
quantization_stage:
  quantization_modifiers:
    QuantizationModifier:
      targets:
        - Linear
      ignore:
        - lm_head
      scheme:
        weights:
          num_bits: 8
          type: int
          symmetric: true
          strategy: channel
          observer: percentile
          observer_kwargs:
            percentile: 99.5
Tips
- Implement update_statistics_from_observed and optionally get_qparams. If your statistics are min_vals/max_vals, the base class get_qparams handles the conversion to scale, zero_point, and global_scale. If you use different statistics, override get_qparams (and has_statistics) as well.
- update_statistics_from_observed receives a pre-shaped tensor. The base class has already sliced the input according to QuantizationArgs.strategy (group, channel, token, etc.). You do not need to handle reshaping yourself.
- global_scale is handled automatically for min/max observers. For TENSOR_GROUP strategies, the default get_qparams() computes global_scale from the combined statistics of all fused observers. If you override get_qparams, you must handle fused global_scale yourself.
- Set _act_sync_dict correctly. Every observer must declare this. If your observer accumulates state across batches, map each statistic attribute to its reduce operation. Memoryless observers should set _act_sync_dict = {}.
- observer_kwargs is the right place for hyperparameters. Access them via self.args.observer_kwargs.get(...).
- Match the shape contract for min/max observers. If using the default get_qparams, set self.min_vals and self.max_vals to tensors of shape (*qparam_shape,), that is, one scalar per quantization group/channel/token.
- Existing observers are good references. See min_max.py for a simple stateless min/max example, mse.py for a stateful one, and imatrix.py for an observer that uses custom statistics beyond min/max.