llmcompressor.modifiers.pruning.reap.base

REAP (Router-weighted Expert Activation Pruning) modifier for MoE models.

Classes:

REAPPruningModifier –

Prunes experts from MoE layers using the REAP saliency metric.

REAPPruningModifier

Bases: Modifier

Prunes experts from MoE layers using the REAP saliency metric. For each expert j the saliency is

``S_j = mean(g_j * ||f_j||_2)``

averaged over the tokens routed to expert j, where:

g_j is the router gate weight assigned to expert j (the coefficient that multiplies the expert's output when combining experts), and
f_j is expert j's output activation for that token, so ||f_j||_2 is its L2 norm.
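The metric can be sketched in plain Python as follows. This is a minimal illustration rather than the library's implementation: the function name `reap_saliency` and the input layout (one `(gate_weight, expert_output)` pair per token routed to the expert) are assumptions made for the example, as is the choice to score an expert with no routed tokens as zero.

```python
import math


def reap_saliency(routed_tokens):
    """Return mean(g_j * ||f_j||_2) over the tokens routed to one expert.

    routed_tokens: list of (gate_weight, expert_output) pairs, where
    expert_output is a sequence of floats. Hypothetical layout, used
    here for illustration only.
    """
    if not routed_tokens:
        # Assumption of this sketch: no routed tokens means zero saliency.
        return 0.0
    total = 0.0
    for gate_weight, expert_output in routed_tokens:
        l2_norm = math.sqrt(sum(x * x for x in expert_output))
        total += gate_weight * l2_norm
    return total / len(routed_tokens)


# Two tokens routed to the expert; the output norms are 5.0 and 10.0.
tokens = [(0.5, [3.0, 4.0]), (0.25, [6.0, 8.0])]
saliency = reap_saliency(tokens)  # (0.5 * 5.0 + 0.25 * 10.0) / 2 = 2.5
```

Weighting the norm by the gate value means an expert that produces large outputs but is only weakly selected by the router scores lower than one that is both strongly selected and produces large outputs.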

The lowest-saliency experts are removed from each layer. REAP runs during the sequential calibration pipeline: saliency is accumulated via hooks on the MoE experts, and the structural pruning for a layer is executed when that layer completes (SEQUENTIAL_EPOCH_END). The model config is updated in on_finalize to reflect the new number of experts.
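Choosing which experts to keep in a layer can be sketched as below. This is an illustration under stated assumptions, not the library's code: the helper name `select_retained_experts` is hypothetical, the drop count is taken as `int(num_experts * sparsity)` (the actual rounding rule is not documented here), and ties are broken by expert index.

```python
def select_retained_experts(saliency, sparsity):
    """Return the sorted indices of the experts to keep in one layer.

    saliency: list of per-expert saliency scores S_j.
    sparsity: fraction of experts to remove, strictly between 0 and 1.
    """
    if not 0.0 < sparsity < 1.0:
        raise ValueError("sparsity must be in (0, 1)")
    num_experts = len(saliency)
    # Assumption: truncate when converting the fraction to a count.
    n_drop = int(num_experts * sparsity)
    # Rank experts from lowest to highest saliency; the index breaks ties.
    ranked = sorted(range(num_experts), key=lambda j: (saliency[j], j))
    dropped = set(ranked[:n_drop])
    return [j for j in range(num_experts) if j not in dropped]


scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]
retained = select_retained_experts(scores, sparsity=0.25)
# Experts 1 and 5 have the two lowest scores, so they are removed.
```

With 8 experts and a sparsity of 0.25, two experts are dropped and six are retained.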

Parameters:

sparsity –

Fraction of experts to remove from each layer, a value in the range (0, 1).
ignore –

Module name patterns to skip during MoE layer detection.

Example recipe:
```
REAPPruningModifier:
  sparsity: 0.25
```

Methods:

on_finalize –

Finalize the model config to reflect the new number of experts.
on_sequential_epoch_end –

Prune any tracked layer whose saliency is complete, then release its activation norm buffers.

on_finalize

on_finalize(state: State, **kwargs) -> bool

Finalize the model config to reflect the new number of experts.

Source code in src/llmcompressor/modifiers/pruning/reap/base.py

```python
def on_finalize(self, state: State, **kwargs) -> bool:
    """Finalize the model config to reflect the new number of experts."""

    model = state.model

    new_num_experts = self._moe_attrs.num_experts - self._n_experts_to_drop
    update_model_config(model, self._moe_attrs, new_num_experts)

    self._saliency_trackers.clear()
    self._norm_buffers.clear()

    return True
```

on_sequential_epoch_end

on_sequential_epoch_end(
    state: State, event: Event, **kwargs
)

Prune any tracked layer whose saliency is complete, then release its activation norm buffers.

Source code in src/llmcompressor/modifiers/pruning/reap/base.py

```python
def on_sequential_epoch_end(self, state: State, event: Event, **kwargs):
    """Prune any tracked layer whose saliency is
    complete, then release its activation norm buffers."""

    model = state.model

    for layer_name, tracker in list(self._saliency_trackers.items()):
        if tracker.total_count <= 0:
            continue

        retained = tracker.compute_retained_experts(
            self._n_experts_to_drop,
            self._n_experts_to_drop_per_group,
            self._moe_attrs,
        )
        expected = self._moe_attrs.num_experts - self._n_experts_to_drop
        assert (
            len(retained) == expected
        ), f"Expected {expected} retained experts, got {len(retained)}"

        prune_moe_layer(model, layer_name, retained, self._moe_attrs)

        # free this layer's accumulators / buffers now
        del self._saliency_trackers[layer_name]
        self._norm_buffers.pop(layer_name, None)
```
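The loop iterates over `list(self._saliency_trackers.items())` rather than the live dict view because entries are deleted inside the loop body; changing a dict's size while iterating over it directly raises `RuntimeError` in Python. A minimal standalone sketch of the same pattern follows, using a hypothetical `trackers` dict that maps layer names to token counts (the names and values are invented for illustration):

```python
def prune_completed(trackers):
    """Remove entries with a positive count and return their names."""
    pruned = []
    # Snapshot the items so that deleting from `trackers` is safe.
    for name, count in list(trackers.items()):
        if count <= 0:
            continue  # nothing accumulated yet; leave the entry in place
        pruned.append(name)
        del trackers[name]
    return pruned


trackers = {"layers.0.mlp": 128, "layers.1.mlp": 0, "layers.2.mlp": 64}
pruned = prune_completed(trackers)
```

Entries with no accumulated tokens stay in the dict, mirroring the `total_count <= 0` check, so they can be handled in a later pass.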