llmcompressor.modeling.moe_context
Simplified interface for MoE model calibration.
MoE (Mixture of Experts) models route tokens to different expert networks. During calibration for quantization/compression, we need to ensure ALL experts see data, not just the ones selected by the router. This module provides the infrastructure to temporarily modify MoE modules for proper calibration.
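The routing problem described above can be illustrated with a small, dependency-free sketch. It is not taken from llmcompressor: `route_top1`, `tokens_seen_per_expert`, and the router scores are hypothetical, and top-1 routing is assumed only for simplicity.

```python
from collections import Counter


def route_top1(scores):
    """Return the index of the highest-scoring expert for one token."""
    return max(range(len(scores)), key=lambda i: scores[i])


def tokens_seen_per_expert(router_scores, num_experts, calibrate_all_experts):
    """Count how many tokens each expert processes during calibration."""
    seen = Counter({expert: 0 for expert in range(num_experts)})
    for scores in router_scores:
        if calibrate_all_experts:
            # Calibration mode: every expert processes every token.
            for expert in range(num_experts):
                seen[expert] += 1
        else:
            # Normal routing: only the selected expert processes the token.
            seen[route_top1(scores)] += 1
    return dict(seen)


# Hypothetical router scores for four tokens and four experts.
router_scores = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.5, 0.4, 0.1, 0.0],
]

routed = tokens_seen_per_expert(router_scores, 4, calibrate_all_experts=False)
everyone = tokens_seen_per_expert(router_scores, 4, calibrate_all_experts=True)
```

With these scores, experts 2 and 3 receive no tokens under normal routing, so a quantizer would have no activation statistics for them. With all experts calibrated, each one sees all four tokens.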
Key components:
- MoECalibrationModule: Abstract base class for calibration modules
- moe_calibration_context: Context manager that applies calibration to a model
Classes:
- MoECalibrationModule: Abstract base class for MoE calibration modules.
Functions:
- moe_calibration_context: Context manager that applies MoE calibration to a model.
MoECalibrationModule
Bases: ABC, Module, RegistryMixin
Abstract base class for MoE calibration modules.
Calibration modules replace original MoE modules during the calibration phase to ensure all experts receive data for proper quantization statistics.
Subclasses must:
1. Implement __init__() with the signature (self, original, config, calibrate_all_experts=True)
2. Set is_permanent to indicate whether the module should stay in its calibration form after calibration
3. Optionally implement restore() if is_permanent=False
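The three requirements above can be sketched without torch or llmcompressor. Everything here is a stand-in: `CalibrationModuleSketch` only mimics the contract described for MoECalibrationModule, `ToyMoE` and `ToyCalibrationMoE` are hypothetical, and the argument list of the real restore() is not shown on this page, so the zero-argument form below is an assumption. The sketch also assumes that only the routed expert's output is returned, even though every expert is run.

```python
from abc import ABC


class CalibrationModuleSketch(ABC):
    """Hypothetical stand-in for MoECalibrationModule (no torch, no registry)."""

    is_permanent = False

    def restore(self):
        # Permanent modules stay in place; others must override this.
        if self.is_permanent:
            return self
        raise NotImplementedError("non-permanent modules must implement restore()")


class ToyMoE:
    """Toy MoE block: a router function plus a list of expert callables."""

    def __init__(self, router, experts):
        self.router = router
        self.experts = experts

    def forward(self, token):
        return self.experts[self.router(token)](token)


class ToyCalibrationMoE(CalibrationModuleSketch):
    is_permanent = False  # the original block should come back afterwards

    def __init__(self, original, config, calibrate_all_experts=True):
        self.original = original
        self.config = config  # unused in this toy
        self.calibrate_all_experts = calibrate_all_experts
        self.calls = [0] * len(original.experts)

    def forward(self, token):
        chosen = self.original.router(token)
        output = None
        for index, expert in enumerate(self.original.experts):
            # Run every expert when calibrating, otherwise only the routed one.
            if self.calibrate_all_experts or index == chosen:
                result = expert(token)
                self.calls[index] += 1
                if index == chosen:
                    output = result  # only the routed expert shapes the output
        return output

    def restore(self):
        return self.original


toy = ToyMoE(
    router=lambda t: t % 2,
    experts=[lambda t: t + 1, lambda t: t * 10, lambda t: -t],
)
calib = ToyCalibrationMoE(toy, config=None)
outputs = [calib.forward(t) for t in (1, 2, 3)]
```

In this toy, the third expert is never selected by the router, yet `calib.calls` shows it was run on every token, while `outputs` matches what the original block would return.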
Methods:
- restore: Restore the original module structure.
restore
Restore the original module structure.
Only needed if is_permanent=False. For permanent modules, this is a no-op.
Returns: The original module (or self if permanent)
Source code in src/llmcompressor/modeling/moe_context.py
moe_calibration_context
Context manager that applies MoE calibration to a model.
This scans all modules in the model and replaces any MoE modules with their calibration equivalents. After the context exits, non-permanent modules are restored to their original form.
The model is modified in-place, so the same model object should be used within the context.
Args:
    model: The model to apply MoE calibration to (modified in-place).
    calibrate_all_experts: If True, all experts see all tokens during calibration. If False, use normal routing (useful for some techniques).
Example:
    with moe_calibration_context(model):
        # Run calibration - all experts will see data
        for batch in dataloader:
            model(**batch)
    # Model is now restored (unless permanent)
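The replace-then-restore behaviour described for this context manager can be sketched with plain Python objects. This is not the library's implementation: the model is reduced to a dict of named modules, and `ToyMoE`, `ToyCalibrationMoE`, and `toy_calibration_context` are hypothetical names. Restoring inside `finally` is a design choice of this sketch, not something the page confirms.

```python
from contextlib import contextmanager


class ToyMoE:
    """Placeholder for an original MoE module."""


class ToyCalibrationMoE:
    """Placeholder calibration wrapper around a ToyMoE."""

    is_permanent = False

    def __init__(self, original, config, calibrate_all_experts=True):
        self.original = original
        self.config = config
        self.calibrate_all_experts = calibrate_all_experts

    def restore(self):
        # Non-permanent: hand back the module that was replaced.
        return self.original


@contextmanager
def toy_calibration_context(model, calibrate_all_experts=True):
    """Swap MoE modules for calibration wrappers, then undo the swap."""
    replaced = []
    for name, module in list(model.items()):
        if isinstance(module, ToyMoE):
            model[name] = ToyCalibrationMoE(module, None, calibrate_all_experts)
            replaced.append(name)
    try:
        yield model  # the same object, modified in-place
    finally:
        for name in replaced:
            if not model[name].is_permanent:
                model[name] = model[name].restore()


original_moe = ToyMoE()
model = {"attention": object(), "mlp": original_moe}

with toy_calibration_context(model) as calibrating:
    inside_is_wrapper = isinstance(calibrating["mlp"], ToyCalibrationMoE)
    same_object = calibrating is model

restored = model["mlp"] is original_moe
```

Because the dict is mutated in-place, code inside the `with` block sees the calibration wrappers through the same `model` object, which mirrors the note above about using the same model object within the context.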
Source code in src/llmcompressor/modeling/moe_context.py