llmcompressor.modifiers.pruning.reap.base
REAP (Router-weighted Expert Activation Pruning) modifier for MoE models.
See: https://arxiv.org/abs/2510.13999
Classes:
- REAPPruningModifier – Prunes experts from MoE layers using the REAP saliency metric.
REAPPruningModifier
Bases: Modifier
Prunes experts from MoE layers using the REAP saliency metric. For each
expert j the saliency is
``S_j = mean(g_j * ||f_j||_2)``
averaged over the tokens routed to expert j, where:
- ``g_j`` is the router gate weight assigned to expert ``j`` (the coefficient that multiplies the expert's output when combining experts), and
- ``f_j`` is expert ``j``'s output activation for that token, so ``||f_j||_2`` is its L2 norm.
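The saliency formula above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the modifier's actual code: the function name ``reap_saliency`` and its list-based inputs are hypothetical, and returning zero for an expert that received no tokens is an assumption.

```python
import math


def reap_saliency(gate_weights, expert_outputs):
    """Mean of g_j * ||f_j||_2 over the tokens routed to one expert.

    gate_weights:   one router gate weight per routed token.
    expert_outputs: one output vector (list of floats) per routed token.
    """
    if len(gate_weights) != len(expert_outputs):
        raise ValueError("one gate weight is needed per routed token")
    if not gate_weights:
        # Assumption: an expert that received no tokens gets zero saliency.
        return 0.0
    total = 0.0
    for g, f in zip(gate_weights, expert_outputs):
        l2_norm = math.sqrt(sum(x * x for x in f))
        total += g * l2_norm
    return total / len(gate_weights)


# Two tokens routed to the expert, each with a 2-dimensional output.
saliency = reap_saliency([0.5, 0.25], [[3.0, 4.0], [0.0, 8.0]])
print(saliency)  # (0.5 * 5.0 + 0.25 * 8.0) / 2 = 2.25
```

An expert scores low either because the router rarely gives it much weight or because its outputs are small in norm, which is why both factors appear in the product.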
The lowest-saliency experts are removed per layer. REAP runs during the
sequential calibration pipeline: saliency is accumulated via hooks on the MoE
experts, and the structural pruning for a layer is executed when that layer
completes (SEQUENTIAL_EPOCH_END). The model config is updated in on_finalize
to reflect the new number of experts.
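The per-layer selection step could look roughly like the sketch below. The helper ``experts_to_prune`` is hypothetical, and rounding the number of pruned experts down is an assumption; the modifier may round or break ties differently.

```python
def experts_to_prune(saliencies, sparsity):
    """Return the indices of the lowest-saliency experts in one layer.

    saliencies: one saliency score per expert in the layer.
    sparsity:   fraction of experts to remove, strictly between 0 and 1.
    """
    if not 0.0 < sparsity < 1.0:
        raise ValueError("sparsity must lie strictly between 0 and 1")
    # Assumption: the number of experts removed is rounded down.
    num_pruned = int(len(saliencies) * sparsity)
    # Rank expert indices from lowest to highest saliency.
    ranked = sorted(range(len(saliencies)), key=lambda j: saliencies[j])
    return sorted(ranked[:num_pruned])


# Eight experts at sparsity 0.25: the two lowest-scoring experts are chosen.
layer_saliencies = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]
pruned = experts_to_prune(layer_saliencies, 0.25)
print(pruned)  # [1, 5]
```

Because the ranking is done independently for each layer, every MoE layer ends up with the same number of remaining experts, which is what lets a single expert count in the config describe the pruned model.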
Parameters:
- sparsity – fraction of experts to remove per layer, in the range (0, 1).
- ignore – module name patterns to skip during MoE layer detection.
Example recipe::

    REAPPruningModifier:
      sparsity: 0.25

Methods:
- on_finalize – Finalize the model config to reflect the new number of experts.
- on_sequential_epoch_end – Prune any tracked layer whose saliency is complete, then release its activation norm buffers.
on_finalize
Finalize the model config to reflect the new number of experts.
Source code in src/llmcompressor/modifiers/pruning/reap/base.py
on_sequential_epoch_end
Prune any tracked layer whose saliency is complete, then release its activation norm buffers.