llmcompressor.observers.helpers
Helper functions for observer token counting and analysis.
Provides utility functions for analyzing observer statistics and token counts across model modules. Used for monitoring compression effects and understanding model behavior during quantization and pruning operations.
Functions:
- flatten_for_calibration – Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration.
- fuse_weight_observers – Link weight observers across fused layer groups for shared global_scale.
flatten_for_calibration
Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration. The value after flattening has the following shape:
(num_observations, *qparam_shape, group_size)
For block quantization, the value is zero-padded if it is not evenly divisible by block_size. This avoids distorting the calculated qparams and keeps the result compatible with vLLM block-wise kernels, which do not require even divisibility.
The first dim is the number of observations (usually the batch size times number of tokens), the middle dims are the dimension of the scales, and the last dim is the number of elements being quantized per group.
Parameters:
- value (Tensor) – value being flattened
- base_name (str) – weight, input, output, q/k/v. Used to characterize the value as being a weight, activation, or attention state
- args (QuantizationArgs) – quantization args for determining how the value is flattened
Returns:
- Tensor – value which has been reshaped for calibration
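The shape rule above can be checked with a small shape calculator. This is a sketch in plain Python and does not call the library function. The helper names (`ceil_div`, `group_flattened_shape`, `block_flattened_shape`) are hypothetical. It also assumes that a 2-D weight counts as a single observation and that group quantization splits the last dimension evenly, neither of which this page states.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def group_flattened_shape(rows, cols, group_size):
    """Shape (num_observations, *qparam_shape, group_size) for a grouped weight.

    Assumption: a weight is one observation, and cols divides evenly.
    """
    if cols % group_size != 0:
        raise ValueError("cols must be divisible by group_size in this sketch")
    return (1, rows, cols // group_size, group_size)


def block_flattened_shape(rows, cols, block_height, block_width):
    """Padded 2-D shape and flattened shape for a block-quantized weight.

    Rows and cols are rounded up to whole blocks, which models the zero
    padding described above. The last dim is the element count per block.
    """
    block_rows = ceil_div(rows, block_height)
    block_cols = ceil_div(cols, block_width)
    padded_shape = (block_rows * block_height, block_cols * block_width)
    flattened = (1, block_rows, block_cols, block_height * block_width)
    return padded_shape, flattened


# A 4096 x 4096 weight with group_size 128 has one scale per row per group.
print(group_flattened_shape(4096, 4096, 128))
# A 300 x 512 weight with 128 x 128 blocks needs padding along the rows.
print(block_flattened_shape(300, 512, 128, 128))
```

In both cases the middle dimensions match the shape of the scales, so a reduction over the last dimension yields one statistic per quantization parameter.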
Source code in src/llmcompressor/observers/helpers.py
fuse_weight_observers
Link weight observers across fused layer groups for shared global_scale.
For TENSOR_GROUP quantization (e.g. NVFP4), vLLM requires that fused layers (Q/K/V attention, gate/up MLP) share the same global_scale. This function links their observers so that get_qparams() computes global_scale from the combined statistics of all observers in the group.
Parameters:
- model (Module) – model whose weight observers should be linked
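The linking idea can be illustrated with a toy model. This is a sketch in plain Python and not the library's implementation. `ToyObserver`, `link_observers`, and `combined_absmax` are hypothetical names, and the absolute maximum stands in for whatever statistic the real observers track.

```python
class ToyObserver:
    """Hypothetical stand-in for a weight observer; tracks a running absmax."""

    def __init__(self, name):
        self.name = name
        self.absmax = 0.0
        self.fused_group = [self]  # by default an observer is fused only with itself

    def observe(self, values):
        self.absmax = max(self.absmax, max(abs(v) for v in values))

    def combined_absmax(self):
        # The shared statistic is taken over every observer in the fused group.
        return max(obs.absmax for obs in self.fused_group)


def link_observers(observers):
    """Make each observer aware of all the others in its fused group."""
    group = list(observers)
    for obs in group:
        obs.fused_group = group


q, k, v = ToyObserver("q_proj"), ToyObserver("k_proj"), ToyObserver("v_proj")
q.observe([0.5, -1.5])
k.observe([3.0, 0.25])
v.observe([-0.75, 0.1])

link_observers([q, k, v])
# After linking, every observer reports the same group-wide statistic,
# so any scale derived from it is identical across q, k, and v.
print([obs.combined_absmax() for obs in (q, k, v)])
```

Each observer still keeps its own statistic. Only the value used for the shared scale is pooled across the group.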
Source code in src/llmcompressor/observers/helpers.py
lerp
Linear interpolation. Provided because torch.lerp is not implemented for all dtypes.
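Linear interpolation itself is simple arithmetic. The sketch below uses plain Python floats, and the name `lerp_sketch` and its argument order are assumptions rather than the library's signature. It uses the weighted-sum form, which needs only multiplication and addition and therefore works on any type that supports them.

```python
def lerp_sketch(start, end, weight):
    """Return the point a fraction `weight` of the way from start to end."""
    # Algebraically equal to start + weight * (end - start).
    return start * (1.0 - weight) + end * weight


print(lerp_sketch(0.0, 10.0, 0.25))
print(lerp_sketch(2.0, 4.0, 0.5))
```

A weight of 0 returns `start`, and a weight of 1 returns `end`.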