llmcompressor.modifiers.autoround.base

Classes:

AutoRoundModifier –

Implements the AutoRound algorithm from https://aclanthology.org/2024.findings-emnlp.662.pdf.

AutoRoundModifier

Bases: Modifier, QuantizationMixin

Implements the AutoRound algorithm from https://aclanthology.org/2024.findings-emnlp.662.pdf. This modifier uses a signed gradient descent (SignSGD) optimizer and a block-wise loss to optimize weight rounding values and clipping ranges in a small number of tuning steps.

Sample yaml:

test_stage:
  modifiers:
    AutoRoundModifier:
      iters: 200
      config_groups:
        group_0:
          targets:
            - "Linear"
          input_activations: null
          output_activations: null
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: group
            group_size: 128
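
To make the `weights` settings in the sample concrete, here is a minimal pure-Python sketch of symmetric integer group quantization. It is an illustration, not the library's implementation: the function name `quantize_row_symmetric` is hypothetical, the scale convention (largest magnitude in a group maps to the largest positive code) is an assumption, and a tiny `group_size` stands in for the sample's 128 so the result is easy to inspect.

```python
import math


def quantize_row_symmetric(weights, num_bits=4, group_size=4):
    """Quantize-dequantize one weight row using one symmetric scale per group.

    Hypothetical helper for illustration only.
    """
    qmax = 2 ** (num_bits - 1) - 1   # 7 for 4-bit signed integers
    qmin = -(2 ** (num_bits - 1))    # -8 for 4-bit signed integers
    codes, scales, dequantized = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Assumed convention: the largest magnitude maps to qmax.
        # Fall back to 1.0 so an all-zero group does not divide by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = math.floor(w / scale + 0.5)   # round to nearest
            q = max(qmin, min(qmax, q))       # clamp to the integer range
            codes.append(q)
            dequantized.append(q * scale)
    return codes, scales, dequantized


row = [0.12, -0.70, 0.33, 0.05, 1.40, -0.20, 0.90, -1.10]
codes, scales, dequantized = quantize_row_symmetric(row, num_bits=4, group_size=4)
```

Each group of `group_size` weights shares one scale, so a smaller group size tracks local weight magnitudes more closely at the cost of storing more scales.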

Lifecycle:

on_initialize
- apply config to model
on_start
- add input capture hooks to decoding layers
on_sequential_epoch_end
- apply_autoround
- post_autoround_cleanup
on_finalize
- remove_hooks()
- model.apply(freeze_module_quantization)
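
The ordering of the lifecycle hooks above can be traced with a small stand-in class. This is a hypothetical sketch: `LifecycleTrace` and `run_lifecycle` are not part of llmcompressor, and it assumes `on_sequential_epoch_end` fires once per decoding layer.

```python
class LifecycleTrace:
    """Hypothetical stand-in that records the order of lifecycle hooks."""

    def __init__(self):
        self.calls = []

    def on_initialize(self):
        self.calls.append("on_initialize: apply config to model")

    def on_start(self):
        self.calls.append("on_start: add input capture hooks")

    def on_sequential_epoch_end(self, layer_name):
        # Both steps run for the layer that has just been calibrated.
        self.calls.append(f"on_sequential_epoch_end: apply_autoround({layer_name})")
        self.calls.append(f"on_sequential_epoch_end: post_autoround_cleanup({layer_name})")

    def on_finalize(self):
        self.calls.append("on_finalize: remove hooks, freeze quantization")


def run_lifecycle(layer_names):
    """Drive the hooks in the documented order (assumed: one epoch end per layer)."""
    trace = LifecycleTrace()
    trace.on_initialize()
    trace.on_start()
    for name in layer_names:
        trace.on_sequential_epoch_end(name)
    trace.on_finalize()
    return trace.calls


calls = run_lifecycle(["layers.0", "layers.1"])
```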

Parameters:

config_groups
–

dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will NOT be quantized.
targets
–

list of layer names to quantize when `scheme` is provided. Defaults to Linear layers.
ignore
–

optional list of module class names or submodule names to exclude from quantization, even if they match a target in config_groups. Defaults to an empty list.
scheme
–

a single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level.
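
The interaction between `targets` and `ignore` can be sketched as a simple filter. This is an assumption-laden illustration, not the library's matching code: `select_modules` is a hypothetical helper, modules are represented as `(name, class_name)` pairs, and only exact name matches are handled (the real matching logic may accept more pattern types).

```python
def select_modules(named_modules, targets, ignore=()):
    """Return names of modules that match a target and are not ignored.

    Hypothetical helper: `named_modules` is an iterable of
    (module_name, class_name) pairs.
    """
    selected = []
    for name, class_name in named_modules:
        matches_target = name in targets or class_name in targets
        is_ignored = name in ignore or class_name in ignore
        # `ignore` wins over `targets`, as described for the `ignore` parameter.
        if matches_target and not is_ignored:
            selected.append(name)
    return selected


modules = [
    ("model.layers.0.self_attn.q_proj", "Linear"),
    ("model.layers.0.input_layernorm", "RMSNorm"),
    ("lm_head", "Linear"),
]
selected = select_modules(modules, targets=["Linear"], ignore=["lm_head"])
```

Here `lm_head` is a `Linear` module and would match the target, but it is excluded because it appears in `ignore`; the norm layer is skipped because it matches no target.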

Methods:

apply_autoround –

Applies AutoRound quantization tuning on the current decoding layer.
on_end –

Finish calibrating by removing observers and calibration hooks.
on_finalize –

Disable the quantization observers used by the AutoRound algorithm. If calibration has not ended yet, on_end is called first.
on_initialize –

Initialize the model state for quantization and calibration.
start_calibration –

Register activation calibration hooks and enable quantization during calibration.

apply_autoround

apply_autoround(state, subgraph)

Applies AutoRound quantization tuning on the current decoding layer.

The tuning logic is as follows:

    for iter in range(iters):
        quant_output = forward(layer, cached_inputs)
        loss = mse_loss(quant_output, original_output)
        loss.backward()
        optimizer.step()
        if loss < best_loss:
            best_params = update_params(layer)

For more details, please refer to the AutoRound repository: https://github.com/intel/auto-round/
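
As a rough, self-contained illustration of this tuning loop, the sketch below tunes per-weight rounding offsets for a single toy linear unit in pure Python. It is not the AutoRound implementation: `quantize`, `tune_rounding`, and the toy data are hypothetical, the gradient uses a straight-through estimator (rounding treated as identity), the step size of `1 / iters` is an assumption, and the clipping-range tuning that AutoRound also performs is omitted.

```python
import math


def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def quantize(weights, offsets, scale, qmin=-8, qmax=7):
    """Quantize-dequantize weights, shifting each by a learnable rounding offset."""
    out = []
    for w, v in zip(weights, offsets):
        q = math.floor(w / scale + v + 0.5)          # round(w / scale + v)
        out.append(scale * max(qmin, min(qmax, q)))  # clamp, then dequantize
    return out


def tune_rounding(weights, inputs, scale, iters=200):
    """Toy SignSGD loop over rounding offsets; returns best offsets and losses."""
    lr = 1.0 / iters                      # assumed step size
    offsets = [0.0] * len(weights)        # zero offsets = round-to-nearest
    best_loss, best_offsets, first_loss = float("inf"), list(offsets), None
    for _ in range(iters):
        w_q = quantize(weights, offsets, scale)
        # Forward pass: error between quantized and original outputs.
        errors = [dot(x, w_q) - dot(x, weights) for x in inputs]
        loss = sum(e * e for e in errors) / len(inputs)
        if first_loss is None:
            first_loss = loss             # loss of plain round-to-nearest
        if loss < best_loss:
            best_loss, best_offsets = loss, list(offsets)
        # Backward pass with a straight-through estimator: d(w_q)/d(v) ~ scale.
        grads = [
            sum(2.0 * e * x[i] * scale for e, x in zip(errors, inputs)) / len(inputs)
            for i in range(len(weights))
        ]
        # SignSGD step: only the gradient sign is used; offsets stay in [-0.5, 0.5].
        offsets = [max(-0.5, min(0.5, v - lr * sign(g))) for v, g in zip(offsets, grads)]
    return best_offsets, best_loss, first_loss


weights = [0.26, -0.44, 0.71, 0.13]
inputs = [[1.0, 0.5, -0.3, 0.8], [0.2, -1.0, 0.4, 0.1], [0.7, 0.3, 0.9, -0.5]]
best_offsets, best_loss, rtn_loss = tune_rounding(weights, inputs, scale=0.1, iters=200)
```

Because the best parameters are tracked across iterations and the first iteration evaluates plain round-to-nearest, the returned loss can never be worse than the round-to-nearest baseline.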

Source code in llmcompressor/modifiers/autoround/base.py

def apply_autoround(self, state, subgraph):
    """
    Applies AutoRound quantization tuning on the current decoding layer.

    The tuning logic is as follows:
    for iter in range(iters):
        quant_output = forward(layer, cached_inputs)
        loss = mse_loss(quant_output, original_output)
        loss.backward()
        optimizer.step()
        if loss < best_loss:
            best_params = update_params(layer)

    For more details, please refer to the AutoRound repository:
    https://github.com/intel/auto-round/
    """
    modules = list(subgraph.submodules(model=state.model))

    decoding_layers = [m for m in modules if self._is_decoding_layer(m)]
    if len(decoding_layers) == 0:
        return
    assert len(decoding_layers) == 1, (
        "Only one decoding layer is expected in the subgraph, "
        f"found {len(decoding_layers)}."
    )
    decoding_layer = decoding_layers[0]

    logger.info("Applying AutoRound on layer {}", decoding_layer._tmp_name)

    wrapped_model = _wrap_decoding_layer(decoding_layer)
    wrapped_model.name_or_path = state.model.name_or_path

    with torch.enable_grad(), align_module_device(decoding_layer):
        ar_quant_scheme = self._mapping_config_to_autoround()
        ar = AutoRound(
            model=wrapped_model,
            tokenizer="",
            scheme=ar_quant_scheme,
            iters=self.iters,
            enable_torch_compile=self.enable_torch_compile,
            batch_size=self.batch_size,
        )
        # TODO: configure layer-wise config based on self.resolved_config
        ar.configure_layer_config(enable_gguf_official_mixed=False)
        ar.batch_dim = 0
        first_param = next(decoding_layer.parameters())
        device = first_param.device
        cur_inputs = self._all_module_input[decoding_layer._tmp_name]
        decoding_layer.tuning_device = device

        q_input, _ = ar.quantize_block(
            block=decoding_layer,
            inputs=cur_inputs,
            q_input=self._q_input,
            device=str(device),
            # Leave offload for LLMC
            auto_offload=False,
        )
        self._q_input = q_input
        # Update offload parameters and remove temporary attributes
        for _, module in decoding_layer.named_modules():
            if hasattr(module, "weight_scale") and hasattr(
                module, "weight_zero_point"
            ):
                # Note: The model's weight is already q-dq in-place by auto-round.
                weight_scale = module.scale
                del module.scale
                # TODO: update zero_point after supporting asymmetric quantization
                update_offload_parameter(module, "weight_scale", weight_scale)
    decoding_layer.eval()

on_end

on_end(state: State, event: Event, **kwargs)

Finish calibrating by removing observers and calibration hooks.

Source code in llmcompressor/modifiers/autoround/base.py

def on_end(self, state: State, event: Event, **kwargs):
    """
    Finish calibrating by removing observers and calibration hooks
    """
    self.ended_ = True
    QuantizationMixin.end_calibration(self, state.model)
    self._remove_temporary_names(state.model)
    self.remove_hooks()
    self._q_input = None

on_finalize

on_finalize(state: State, **kwargs) -> bool

Disable the quantization observers used by the AutoRound algorithm. If calibration has not ended yet, on_end is called first.

Parameters:

state
(State) –

session state storing input model and calibration data

Source code in llmcompressor/modifiers/autoround/base.py

def on_finalize(self, state: State, **kwargs) -> bool:
    """
    disable the quantization observers used by the AutoRound algorithm

    :param state: session state storing input model and calibration data
    """
    if not self.ended_:
        self.on_end(state, None)

    return True
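
The `ended_` guard in `on_finalize` ensures the end-of-calibration cleanup runs exactly once, whether or not an explicit end event was received. A minimal sketch of that pattern, using a hypothetical `EndOnceSketch` class in place of the modifier:

```python
class EndOnceSketch:
    """Hypothetical stand-in showing the run-cleanup-once guard."""

    def __init__(self):
        self.ended_ = False
        self.end_calls = 0

    def on_end(self):
        # Cleanup work would go here; this sketch only counts invocations.
        self.ended_ = True
        self.end_calls += 1

    def on_finalize(self):
        # Fall back to on_end only if it has not already run.
        if not self.ended_:
            self.on_end()
        return True


# Case 1: an explicit end event arrives before finalization.
explicit = EndOnceSketch()
explicit.on_end()
explicit.on_finalize()

# Case 2: finalization happens without a prior end event.
implicit = EndOnceSketch()
implicit.on_finalize()
```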

on_initialize

on_initialize(state: State, **kwargs) -> bool

Initialize the model state for quantization and calibration.

Parameters:

state
(State) –

session state storing input model and calibration data

Source code in llmcompressor/modifiers/autoround/base.py

def on_initialize(self, state: State, **kwargs) -> bool:
    """
    Initialize the model state for quantization and calibration.

    :param state: session state storing input model and calibration data
    """
    # apply config to model and prepare calibration hooks
    if QuantizationMixin.has_config(self):
        QuantizationMixin.initialize_quantization(self, state.model)

    # prepare module names
    self._add_temporary_names(state.model)
    # freeze all model parameters
    for _, param in state.model.named_parameters():
        param.requires_grad_(False)

    self.sequential_targets = self._infer_sequential_targets(state.model)
    return True

start_calibration

start_calibration(model: Module)

Register activation calibration hooks and enable quantization during calibration.

Parameters:

model
(Module) –

model to prepare for calibration

Source code in llmcompressor/modifiers/autoround/base.py

def start_calibration(self, model: torch.nn.Module):
    """
    Register activation calibration hooks and enable quantization as we calibrate

    :param model: model to prepare for calibration
    """
    targets = match_named_modules(model, self.targets, self.ignore)
    if targets_embeddings(model, targets):
        untie_word_embeddings(model)

    for _, module in match_named_modules(model, self.targets, self.ignore):
        # Note: No need to register observers for auto-round
        self._calibration_hooks |= self._initialize_hooks(module)
        apply_calibration_status(module)

    model.apply(enable_quantization)  # quantize at the same time as calibrate
