llmcompressor.modeling.glm4_moe_lite

Classes:

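CalibrationGlm4MoeLiteMoE –

Calibration version of Glm4MoeLiteMoE that unfuses 3D expert parameters into individual MLP modules (nn.Linear) so they can be quantized.

SequentialGlm4MoeLiteExperts –

Unpacks 3D expert parameter tensors into individual Glm4MoeLiteMLP modules so each routed expert has standard nn.Linear projections visible to targets="Linear".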

CalibrationGlm4MoeLiteMoE

CalibrationGlm4MoeLiteMoE(
    original: GlmMoeDsaMoE,
    config: GlmMoeDsaConfig,
    calibrate_all_experts: bool = True,
)

Bases: CalibrationGlmMoeDsaMoE

Calibration version of Glm4MoeLiteMoE that unfuses 3D expert parameters into individual MLP modules (nn.Linear) so they can be quantized.

GLM-4.7-Flash Lite stores its routed experts in a Glm4MoeLiteNaiveMoe module as 3D parameters (gate_up_proj, down_proj) rather than as nn.Linear submodules. Because llm-compressor targets Linear modules, the original routed experts are invisible to quantization and stay in BF16 unless they are unpacked.
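To illustrate why fused experts are skipped, the following minimal sketch compares what a Linear-only module scan finds in a fused container and in an unpacked one. FusedExperts and linear_module_names are hypothetical stand-ins written for this example; they are not part of llm-compressor or transformers, and the tensor shapes are assumptions chosen to mirror the layout described above.

```python
import torch
from torch import nn


class FusedExperts(nn.Module):
    """Hypothetical stand-in for a fused expert container (not the real class)."""

    def __init__(self, num_experts: int, hidden: int, intermediate: int):
        super().__init__()
        # Assumed fused layout: one 3D tensor per projection, indexed by expert.
        self.gate_up_proj = nn.Parameter(
            torch.randn(num_experts, 2 * intermediate, hidden)
        )
        self.down_proj = nn.Parameter(torch.randn(num_experts, hidden, intermediate))


def linear_module_names(model: nn.Module) -> list[str]:
    """Names of nn.Linear submodules, i.e. what a Linear-only target can match."""
    return [name for name, m in model.named_modules() if isinstance(m, nn.Linear)]


fused = FusedExperts(num_experts=4, hidden=8, intermediate=16)
unpacked = nn.ModuleList([nn.Linear(8, 16, bias=False) for _ in range(4)])

# The fused container holds only raw parameters, so the scan finds nothing.
fused_linears = linear_module_names(fused)
# The unpacked container exposes one matchable module per expert.
unpacked_linears = linear_module_names(unpacked)

print(fused_linears)
print(unpacked_linears)
```

The fused module still owns all of its weights; they are simply not wrapped in modules that a type-based target can select.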

Inherits the routing logic (route_tokens_to_experts) and the forward pass from CalibrationGlmMoeDsaMoE, overriding only expert creation to use Glm4MoeLiteMLP modules.

Source code in src/llmcompressor/modeling/glm_moe_dsa.py

def __init__(
    self,
    original: GlmMoeDsaMoE,
    config: GlmMoeDsaConfig,
    calibrate_all_experts: bool = True,
):
    super().__init__()
    self.top_k = config.num_experts_per_tok
    self.num_experts = self._get_num_experts(config)
    self.n_routed_experts = config.n_routed_experts
    self.n_group = config.n_group
    self.topk_group = config.topk_group
    self.norm_topk_prob = config.norm_topk_prob
    self.routed_scaling_factor = config.routed_scaling_factor

    self.experts = self._make_experts(config, original.experts)
    self.gate = original.gate
    self.shared_experts = original.shared_experts
    self.calibrate_all_experts = calibrate_all_experts

SequentialGlm4MoeLiteExperts

SequentialGlm4MoeLiteExperts(
    config: Glm4MoeLiteConfig, original: Glm4MoeLiteNaiveMoe
)

Bases: ModuleList

Unpacks 3D expert parameter tensors into individual Glm4MoeLiteMLP modules so each routed expert has standard nn.Linear projections visible to targets="Linear".

Source code in src/llmcompressor/modeling/glm4_moe_lite.py

def __init__(self, config: Glm4MoeLiteConfig, original: Glm4MoeLiteNaiveMoe):
    from transformers.models.glm4_moe_lite.modeling_glm4_moe_lite import (
        Glm4MoeLiteMLP,
    )

    self.num_experts = config.n_routed_experts
    intermediate_size = config.moe_intermediate_size

    with skip_weights_initialize():
        super().__init__(
            [
                Glm4MoeLiteMLP(config, intermediate_size=intermediate_size)
                for _ in range(self.num_experts)
            ]
        )

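    for i in range(self.num_experts):
        # Slice out expert i's 2D weights from the fused 3D parameters.
        gate_up = original.gate_up_proj[i]
        down = original.down_proj[i]
        # gate and up are stacked along dim 0; split them into two halves.
        gate_proj, up_proj = gate_up.chunk(2, dim=0)

        # Clone so each expert owns its weights instead of viewing the fused tensor.
        self[i].gate_proj.weight.data = gate_proj.clone().contiguous()
        self[i].up_proj.weight.data = up_proj.clone().contiguous()
        self[i].down_proj.weight.data = down.clone().contiguous()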
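The unpacking above can be sanity-checked on toy tensors. The sketch below is a standalone illustration, not the library's code: TinyMLP is a hypothetical bias-free gated MLP, the SiLU activation is an assumption, and the shapes are assumed to be [num_experts, 2 * intermediate, hidden] for gate_up_proj and [num_experts, hidden, intermediate] for down_proj. It copies the fused slices into per-expert nn.Linear weights in the same way as the constructor, then compares one expert's output with the same computation done directly on the fused tensors.

```python
import torch
from torch import nn

num_experts, hidden, intermediate = 3, 4, 6

# Assumed fused layout: gate and up are stacked along the output dimension.
gate_up_proj = torch.randn(num_experts, 2 * intermediate, hidden)
down_proj = torch.randn(num_experts, hidden, intermediate)


class TinyMLP(nn.Module):
    """Hypothetical gated MLP with three bias-free projections."""

    def __init__(self):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x):
        # SiLU gating is an assumption made for this sketch.
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))


experts = nn.ModuleList([TinyMLP() for _ in range(num_experts)])
for i, expert in enumerate(experts):
    # Split the stacked rows into the gate half and the up half.
    gate, up = gate_up_proj[i].chunk(2, dim=0)
    expert.gate_proj.weight.data = gate.clone().contiguous()
    expert.up_proj.weight.data = up.clone().contiguous()
    expert.down_proj.weight.data = down_proj[i].clone().contiguous()

x = torch.randn(5, hidden)
idx = 1
with torch.no_grad():
    # Reference computation directly on the fused tensors for expert idx.
    g, u = (x @ gate_up_proj[idx].T).chunk(2, dim=-1)
    fused_out = (nn.functional.silu(g) * u) @ down_proj[idx].T
    unpacked_out = experts[idx](x)

print(torch.allclose(fused_out, unpacked_out, atol=1e-5))
```

Because nn.Linear stores its weight as [out_features, in_features], each fused slice can be assigned without a transpose under the layout assumed here.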