vllm.models.deepseek_v32

DeepSeek V3.2 (deepseek_v32) model — hardware-isolated entry point.

DeepSeek V3.2 introduced the DeepSeek Sparse Attention (DSA) architecture: Multi-head Latent Attention (MLA) combined with a "lightning indexer" that selects the top-k tokens, so that MLA attends only to that sparse subset. The same model code serves any DSA checkpoint, including GLM-5.2 (glm_moe_dsa), which reuses this architecture.
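
The selection step described above can be illustrated with a toy, pure-Python sketch. This is not vLLM's kernel or the actual indexer: the names `topk_indices` and `sparse_attend`, the precomputed `index_scores`, and the tiny dimensions are all assumptions made for illustration. It only shows the general idea of scoring every cached token cheaply, keeping the top-k, and running softmax attention over just those tokens.

```python
import math


def topk_indices(scores, k):
    """Return the positions of the k largest scores, in ascending position order."""
    k = min(k, len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attend(query, keys, values, index_scores, k):
    """Toy sparse attention over the top-k tokens.

    `index_scores` is a hypothetical stand-in for the indexer's per-token
    output; a real indexer would compute it from the query and the cache.
    """
    selected = topk_indices(index_scores, k)
    scale = 1.0 / math.sqrt(len(query))
    # Scaled dot-product logits, computed only for the selected tokens.
    logits = [
        scale * sum(q * x for q, x in zip(query, keys[i])) for i in selected
    ]
    # Numerically stable softmax over the selected logits.
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the selected value vectors.
    dim = len(values[0])
    out = [
        sum(w * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
    return selected, out


# Four cached tokens; the (made-up) indexer scores favour tokens 1 and 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
index_scores = [0.1, 0.9, 0.8, 0.2]
selected, out = sparse_attend([1.0, 0.0], keys, values, index_scores, k=2)
```

With `k=2`, tokens 0 and 3 never enter the softmax, so attention cost scales with `k` rather than with the full context length.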

Classes:

DeepseekV32ForCausalLM

Bases: DeepseekV2ForCausalLM

DSA causal LM — DeepSeek V2/V3 orchestration with the DSA backbone.

Serves DeepSeek V3.2 and any architecture reusing DSA (e.g. GLM-5.2).

Source code in vllm/models/deepseek_v32/nvidia/model.py
class DeepseekV32ForCausalLM(DeepseekV2ForCausalLM):
    """DSA causal LM — DeepSeek V2/V3 orchestration with the DSA backbone.

    Serves DeepSeek V3.2 and any architecture reusing DSA (e.g. GLM-5.2).
    """

    model_cls = DeepseekV32Model

    def set_moe_parameters(self):
        # Same as the base, but keyed on the MoE block type rather than the
        # decoder-layer type (DeepseekV32DecoderLayer is a plain nn.Module).
        self.expert_weights = []
        self.num_expert_groups = getattr(self.config, "n_group", 1)
        self.moe_layers = []
        self.moe_mlp_layers = []
        example_moe = None
        for layer in self.model.layers:
            if isinstance(layer, PPMissingLayer):
                continue
            if isinstance(layer.mlp, DeepseekV2MoE):
                # Keep the most recent MoE block as the representative
                # passed to extract_moe_parameters below.
                example_moe = layer.mlp
                self.moe_mlp_layers.append(layer.mlp)
                self.moe_layers.append(layer.mlp.experts)
        self.extract_moe_parameters(example_moe)
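
The collection loop above can be tried in isolation with stand-in classes. Everything in this sketch is hypothetical (`MissingLayer`, `DenseMLP`, `MoEBlock`, `DecoderLayer`, and `collect_moe` are not vLLM names); it only mirrors the pattern of skipping placeholder layers and keying on the type of `layer.mlp` rather than on the type of the decoder layer itself.

```python
class MissingLayer:
    """Hypothetical stand-in for a layer held by another pipeline rank."""


class DenseMLP:
    """Hypothetical dense feed-forward block (no experts)."""


class MoEBlock:
    """Hypothetical MoE block; `experts` stands in for the expert container."""

    def __init__(self, name):
        self.experts = f"{name}.experts"


class DecoderLayer:
    """Hypothetical decoder layer; only its `mlp` attribute matters here."""

    def __init__(self, mlp):
        self.mlp = mlp


def collect_moe(layers):
    """Gather MoE blocks and their expert containers from a layer list."""
    moe_mlp_layers, moe_layers, example_moe = [], [], None
    for layer in layers:
        if isinstance(layer, MissingLayer):
            continue  # Placeholder layers have no `mlp` to inspect.
        if isinstance(layer.mlp, MoEBlock):
            example_moe = layer.mlp  # The last MoE block seen wins.
            moe_mlp_layers.append(layer.mlp)
            moe_layers.append(layer.mlp.experts)
    return moe_mlp_layers, moe_layers, example_moe


layers = [
    DecoderLayer(DenseMLP()),
    MissingLayer(),
    DecoderLayer(MoEBlock("l2")),
    DecoderLayer(MoEBlock("l3")),
]
mlps, experts, example = collect_moe(layers)
```

In this sketch `example` ends up as the last MoE block in the list, and it stays `None` when no layer in the list carries an MoE block.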