llmcompressor.modeling.glm4_moe_lite
Classes:
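- CalibrationGlm4MoeLiteMoE – Calibration version of Glm4MoeLiteMoE that unfuses 3D expert parameters into individual MLP modules (nn.Linear) so they can be quantized.
- SequentialGlm4MoeLiteExperts – Unpacks 3D expert parameter tensors into individual Glm4MoeLiteMLP modules so each routed expert has standard nn.Linear projections visible to targets="Linear".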
CalibrationGlm4MoeLiteMoE
CalibrationGlm4MoeLiteMoE(
    original: GlmMoeDsaMoE,
    config: GlmMoeDsaConfig,
    calibrate_all_experts: bool = True,
)
Bases: CalibrationGlmMoeDsaMoE
Calibration version of Glm4MoeLiteMoE that unfuses 3D expert parameters into individual MLP modules (nn.Linear) so they can be quantized.
GLM-4.7-Flash Lite stores routed experts in a Glm4MoeLiteNaiveMoe module
using 3D parameters (gate_up_proj, down_proj) instead of nn.Linear
submodules. Since llm-compressor targets Linear modules, the original
routed experts are invisible to quantization and remain BF16 unless they are
unpacked.
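The following minimal sketch illustrates why fused 3D parameters are missed when quantization targets Linear modules. `ToyFusedExperts` and `linear_module_names` are hypothetical names invented for this illustration, the tensor shapes are assumed, and the `isinstance` check is a simplification rather than llm-compressor's actual target-matching code.

```python
import torch
from torch import nn


class ToyFusedExperts(nn.Module):
    """Hypothetical stand-in for a fused expert block (shapes are assumed)."""

    def __init__(self, num_experts: int = 4, hidden: int = 8, intermediate: int = 16):
        super().__init__()
        # All experts live in two 3D parameters; there are no nn.Linear children.
        self.gate_up_proj = nn.Parameter(torch.randn(num_experts, 2 * intermediate, hidden))
        self.down_proj = nn.Parameter(torch.randn(num_experts, hidden, intermediate))


def linear_module_names(model: nn.Module) -> list[str]:
    """Return the names of all nn.Linear submodules (simplified target matching)."""
    return [name for name, module in model.named_modules() if isinstance(module, nn.Linear)]


fused = ToyFusedExperts()
fused_linear = linear_module_names(fused)  # no Linear modules to target
fused_params = [name for name, _ in fused.named_parameters()]

# For comparison, a module built from nn.Linear is discoverable by the same check.
plain = nn.Sequential(nn.Linear(8, 16, bias=False))
plain_linear = linear_module_names(plain)
```

In this toy setup the fused block exposes its weights only as parameters, so a module-type filter finds nothing to quantize, whereas the `nn.Linear`-based module is found.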
Inherits the routing logic (route_tokens_to_experts) and the forward pass
from CalibrationGlmMoeDsaMoE, overriding only expert creation to
use Glm4MoeLiteMLP modules.
Source code in src/llmcompressor/modeling/glm_moe_dsa.py
SequentialGlm4MoeLiteExperts
Bases: ModuleList
Unpacks 3D expert parameter tensors into individual Glm4MoeLiteMLP modules so
each routed expert has standard nn.Linear projections visible to
targets="Linear".
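The unpacking step can be sketched roughly as below. This is an illustration under stated assumptions, not the library's implementation: `ToyMLP` and `unpack_experts` are hypothetical names, and the layout (gate_up_proj of shape `(num_experts, 2 * intermediate, hidden)` with the gate rows first, down_proj of shape `(num_experts, hidden, intermediate)`, SiLU gating) is assumed for the example.

```python
import torch
from torch import nn


class ToyMLP(nn.Module):
    """Hypothetical stand-in for Glm4MoeLiteMLP: a gated MLP built from nn.Linear."""

    def __init__(self, hidden: int, intermediate: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))


def unpack_experts(gate_up_proj: torch.Tensor, down_proj: torch.Tensor) -> nn.ModuleList:
    """Copy each expert's slice of the 3D tensors into its own ToyMLP."""
    num_experts, two_intermediate, hidden = gate_up_proj.shape
    intermediate = two_intermediate // 2
    experts = nn.ModuleList()
    for e in range(num_experts):
        mlp = ToyMLP(hidden, intermediate)
        # Assumed layout: first half of the rows is the gate, second half is up.
        gate_w, up_w = gate_up_proj[e].chunk(2, dim=0)
        with torch.no_grad():
            mlp.gate_proj.weight.copy_(gate_w)
            mlp.up_proj.weight.copy_(up_w)
            mlp.down_proj.weight.copy_(down_proj[e])
        experts.append(mlp)
    return experts


torch.manual_seed(0)
gate_up = torch.randn(4, 2 * 16, 8)  # (num_experts, 2 * intermediate, hidden)
down = torch.randn(4, 8, 16)         # (num_experts, hidden, intermediate)
experts = unpack_experts(gate_up, down)

# Check expert 1: the fused computation and the unpacked module should agree.
x = torch.randn(3, 8)
gate, up = nn.functional.linear(x, gate_up[1]).chunk(2, dim=-1)
fused_out = nn.functional.linear(nn.functional.silu(gate) * up, down[1])
unpacked_out = experts[1](x)
num_linear = sum(isinstance(m, nn.Linear) for m in experts.modules())
```

After unpacking, every expert contributes three `nn.Linear` modules, which is what makes the routed experts reachable by a `targets="Linear"` setting.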