llmcompressor.modifiers.autoround.base
Classes:
- AutoRoundModifier – Implements the AutoRound algorithm from https://aclanthology.org/2024.findings-emnlp.662.pdf.
AutoRoundModifier
Bases: Modifier, QuantizationMixin
Implements the AutoRound algorithm from https://aclanthology.org/2024.findings-emnlp.662.pdf. This modifier uses the signed gradient descent (SignSGD) optimizer and a block-wise loss to optimize rounding values and weight-clipping ranges in a few steps.
Sample yaml:
test_stage:
  modifiers:
    AutoRoundModifier:
      iters: 200
      config_groups:
        group_0:
          targets:
            - "Linear"
          input_activations: null
          output_activations: null
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: group
            group_size: 128
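To make the `weights` settings in the sample recipe concrete, here is a minimal pure-Python sketch of what 4-bit, symmetric, group-wise integer quantization with `group_size: 128` means numerically. The helper `quantize_group_symmetric` is hypothetical and is not part of llmcompressor; the scale formula (largest magnitude in the group divided by the largest positive integer) is an assumption chosen for simplicity and may differ from the one the library uses.

```python
def quantize_group_symmetric(weights, num_bits=4, group_size=128):
    """Quantize-dequantize a flat list of weights, one scale per group.

    Hypothetical helper for illustration only. Symmetric signed-integer
    quantization maps each group onto [-2**(num_bits-1), 2**(num_bits-1) - 1]
    with a zero-point of 0.
    """
    qmin = -(2 ** (num_bits - 1))      # -8 for 4 bits
    qmax = 2 ** (num_bits - 1) - 1     # 7 for 4 bits
    dequantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Assumed scale rule: largest magnitude maps to qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = max(qmin, min(qmax, round(w / scale)))  # round, then clip
            dequantized.append(q * scale)
    return dequantized, scales


# 256 weights split into two groups of 128, each with its own scale.
weights = [0.01 * (i - 128) for i in range(256)]
deq, scales = quantize_group_symmetric(weights)
```

Each group of 128 weights shares one scale, so the rounding error of any weight is bounded by half of its group's scale. AutoRound then tunes the rounding decision itself rather than always rounding to nearest.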
Lifecycle:
- on_initialize
    - apply config to model
- on_start
    - add input capture hooks to decoding layers
- on_sequential_epoch_end
    - apply_autoround
    - post_autoround_cleanup
- on_finalize
    - remove_hooks()
    - model.apply(freeze_module_quantization)
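The ordering in the lifecycle above can be sketched with a toy stand-in that only records which step runs when. `ToyLifecycleModifier` and `run_toy_pipeline` are hypothetical names, not llmcompressor classes, and the sketch assumes that `on_sequential_epoch_end` fires once per decoding layer.

```python
class ToyLifecycleModifier:
    """Hypothetical stand-in that records the order of lifecycle steps."""

    def __init__(self):
        self.events = []

    def on_initialize(self):
        self.events.append("on_initialize: apply config to model")

    def on_start(self):
        self.events.append("on_start: add input capture hooks")

    def on_sequential_epoch_end(self, layer_name):
        # Tune the layer whose inputs were just captured, then clean up.
        self.events.append(f"apply_autoround({layer_name})")
        self.events.append(f"post_autoround_cleanup({layer_name})")

    def on_finalize(self):
        self.events.append("on_finalize: remove hooks, freeze quantization")


def run_toy_pipeline(modifier, layer_names):
    """Drive the toy modifier through the steps in the documented order."""
    modifier.on_initialize()
    modifier.on_start()
    for name in layer_names:
        # Assumption: one sequential epoch ends per decoding layer.
        modifier.on_sequential_epoch_end(name)
    modifier.on_finalize()
    return modifier.events


events = run_toy_pipeline(ToyLifecycleModifier(), ["layer0", "layer1"])
for event in events:
    print(event)
```

The point of the per-layer step is that each decoding layer is tuned and cleaned up before the next one is processed, so only one layer's cached inputs need to be held at a time.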
Parameters:
- config_groups – dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will NOT be quantized.
- targets – list of layer names to quantize if a scheme is provided. Defaults to Linear layers.
- ignore – optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to empty list.
- scheme – a single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level.
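The interaction between `targets` and `ignore` described above can be illustrated with a small sketch. `select_modules` is a hypothetical helper, not the library's matching code, and it assumes exact matches on class names or module names only.

```python
def select_modules(modules, targets=("Linear",), ignore=()):
    """Return the names of modules that would be quantized.

    Hypothetical helper for illustration. `modules` maps a module name to
    its class name. A module is selected when its class name or its own
    name matches a target, unless its class name or name is ignored.
    """
    selected = []
    for name, class_name in modules.items():
        matches_target = class_name in targets or name in targets
        is_ignored = class_name in ignore or name in ignore
        if matches_target and not is_ignored:
            selected.append(name)
    return selected


# Toy model description: module name -> class name (made-up names).
toy_model = {
    "model.layers.0.self_attn.q_proj": "Linear",
    "model.layers.0.mlp.down_proj": "Linear",
    "model.norm": "RMSNorm",
    "lm_head": "Linear",
}
selected = select_modules(toy_model, targets=["Linear"], ignore=["lm_head"])
print(selected)
```

Here `lm_head` matches the `Linear` target but is excluded because it appears in `ignore`, and `model.norm` is left alone because it matches no target.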
Methods:
- apply_autoround – Applies AutoRound quantization tuning on the current decoding layer.
- on_end – Finish calibrating by removing observers and calibration hooks.
- on_finalize – Disable the quantization observers used by the AutoRound algorithm.
- on_initialize – Initialize the model state for quantization and calibration.
- start_calibration – Register activation calibration hooks and enable quantization as we calibrate.
apply_autoround
Applies AutoRound quantization tuning on the current decoding layer.
The tuning logic is as follows:
    for iter in range(iters):
        quant_output = forward(layer, cached_inputs)
        loss = mse_loss(quant_output, original_output)
        loss.backward()
        optimizer.step()
        if loss < best_loss:
            best_params = update_params(layer)
For more details, please refer to the AutoRound repository: https://github.com/intel/auto-round/
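The tuning loop can be sketched in pure Python for a single output neuron. This is a simplified illustration under stated assumptions, not the AutoRound or llmcompressor implementation: it tunes only a per-weight rounding offset `v` in [-0.5, 0.5] (weight clipping is not tuned), keeps the scale fixed, uses a constant step size of `1 / iters`, and approximates the gradient through `round` with a straight-through estimator. All function names are hypothetical.

```python
import random


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fake_quant(w, v, scale, qmin=-8, qmax=7):
    """Quantize-dequantize weights using learnable rounding offsets v."""
    return [max(qmin, min(qmax, round(wi / scale + vi))) * scale
            for wi, vi in zip(w, v)]


def block_loss(w_q, w, inputs):
    """Mean squared error between quantized and original outputs."""
    total = 0.0
    for x in inputs:
        y = sum(wi * xi for wi, xi in zip(w, x))
        y_q = sum(qi * xi for qi, xi in zip(w_q, x))
        total += (y_q - y) ** 2
    return total / len(inputs)


def tune_rounding(w, inputs, scale, iters=200):
    """SignSGD on the rounding offsets, keeping the best offsets seen."""
    lr = 1.0 / iters  # assumed constant step size
    v = [0.0] * len(w)  # v = 0 is plain round-to-nearest
    best_v = list(v)
    best_loss = block_loss(fake_quant(w, v, scale), w, inputs)
    for _ in range(iters):
        w_q = fake_quant(w, v, scale)
        # Straight-through estimator: treat d(w_q)/d(v) as `scale`.
        grad = [0.0] * len(w)
        for x in inputs:
            err = sum((qi - wi) * xi for qi, wi, xi in zip(w_q, w, x))
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj * scale / len(inputs)
        # SignSGD step: only the sign of the gradient is used.
        v = [max(-0.5, min(0.5, vj - lr * sign(gj)))
             for vj, gj in zip(v, grad)]
        loss = block_loss(fake_quant(w, v, scale), w, inputs)
        if loss < best_loss:
            best_loss, best_v = loss, list(v)
    return best_v, best_loss


random.seed(0)
w = [random.uniform(-1, 1) for _ in range(8)]
inputs = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
scale = max(abs(wi) for wi in w) / 7
rtn_loss = block_loss(fake_quant(w, [0.0] * 8, scale), w, inputs)
best_v, best_loss = tune_rounding(w, inputs, scale)
```

Because the loop starts from round-to-nearest and only keeps offsets that lower the loss, `best_loss` can never be worse than `rtn_loss`. Using only the sign of the gradient gives every offset the same step size, so each offset can cross a rounding boundary within the fixed iteration budget regardless of gradient magnitude.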
Source code in llmcompressor/modifiers/autoround/base.py
on_end
Finish calibrating by removing observers and calibration hooks
on_finalize
Disable the quantization observers used by the AutoRound algorithm.
Parameters:
- state (State) – session state storing input model and calibration data
on_initialize
Initialize the model state for quantization and calibration.
Parameters:
- state (State) – session state storing input model and calibration data
start_calibration
Register activation calibration hooks and enable quantization as we calibrate
Parameters:
- model (Module) – model to prepare for calibration