vllm_omni.model_executor.models.voxcpm2.minicpm4_paged

MiniCPM4 with PagedAttention + fp32 RoPE/RMSNorm for VoxCPM2.

Uses vLLM's Attention for the KV cache and keeps the fp32-precision ops from minicpm4_hf_compat.py so that results match native VoxCPM2 numerics.

logger module-attribute

logger = init_logger(__name__)

MiniCPM4PagedForVoxCPM2

Bases: Module

PagedAttention base_lm (28 layers) for VoxCPM2 scaffold.

config instance-attribute

config = config

embed_tokens instance-attribute

embed_tokens = Embedding(vocab_size, hidden_size)

layers instance-attribute

layers = ModuleList(
    [
        _PagedMiniCPM4DecoderLayer(
            hidden_size=hidden_size,
            intermediate_size=intermediate_size,
            num_attention_heads=num_attention_heads,
            num_key_value_heads=num_key_value_heads,
            kv_channels=kv_channels,
            rms_norm_eps=rms_norm_eps,
            layer_idx=i,
            num_hidden_layers=num_hidden_layers,
            use_mup=getattr(lm_cfg, "use_mup", False),
            scale_depth=getattr(lm_cfg, "scale_depth", 1.0),
            cache_config=cache_config,
            prefix=f"{prefix}.layers.{i}",
        )
        for i in range(num_hidden_layers)
    ]
)

make_empty_intermediate_tensors instance-attribute

make_empty_intermediate_tensors = (
    make_empty_intermediate_tensors_factory(
        ["hidden_states", "residual"], hidden_size
    )
)

norm instance-attribute

norm = RMSNorm(hidden_size, eps=rms_norm_eps)

rope_emb instance-attribute

rope_emb = _MiniCPMLongRoPE(
    hidden_size=hidden_size,
    num_attention_heads=num_attention_heads,
    kv_channels=kv_channels,
    rope_theta=getattr(lm_cfg, "rope_theta", 10000.0),
    max_position_embeddings=getattr(
        lm_cfg, "max_position_embeddings", 32768
    ),
    rope_scaling=rope_scaling_dict,
)

vocab_size instance-attribute

vocab_size = vocab_size

compile_selective

compile_selective() -> list[str]

Compile the full model forward as one graph.

Earlier versions first compiled only layer.mlp and layer.self_attn.o_proj (PR #2690), then each whole layer (perf/voxcpm2-streaming-vae). Both approaches still paid one Dynamo dispatch per layer per decode step. V3 profiling showed 1,332 per-layer dispatches (~28 layers × ~47 decode steps) costing ~726 ms of CPU self-time for a long prompt.

Compiling forward at the model level lets Dynamo unroll the 28-layer Python loop inside the graph. Graph breaks at PagedAttention still split the trace into sub-graphs, but Dynamo memoises the whole trace once, so the number of dispatches per decode step drops from 28 to just a few.
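The profiling figures quoted above can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes the cost is spread evenly across dispatches, which the profile does not state, and the variable names are made up for this example.

```python
# Figures quoted in the compile_selective docstring.
total_dispatches = 1332    # per-layer Dynamo dispatches in the V3 profile
cpu_self_time_ms = 726.0   # approximate CPU self-time attributed to them
num_layers = 28            # base_lm depth

# Assumption: every dispatch costs roughly the same.
decode_steps = total_dispatches / num_layers
ms_per_dispatch = cpu_self_time_ms / total_dispatches
ms_per_step = ms_per_dispatch * num_layers

print(f"decode steps    ~ {decode_steps:.1f}")
print(f"ms per dispatch ~ {ms_per_dispatch:.3f}")
print(f"ms per step     ~ {ms_per_step:.2f}")
```

Under that assumption the numbers work out to about 47.6 decode steps, roughly 0.545 ms per dispatch, and about 15.26 ms of dispatch overhead per decode step with per-layer compilation. That per-step overhead is what shrinks when the dispatch count falls from 28 to a few.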

embed_input_ids

embed_input_ids(input_ids: Tensor, **_: Any) -> Tensor

forward

forward(
    input_ids: Tensor | None,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    **kwargs: Any,
) -> Tensor | IntermediateTensors

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights from the native checkpoint, with the base_lm. prefix already stripped from the weight names.
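The signature above (an iterable of name/tensor pairs in, a set of names out) suggests a name-matching loader. The following is a minimal sketch of that pattern in plain Python, not the actual implementation: load_named_weights and strip_prefix are hypothetical helpers, lists of floats stand in for tensors, and silently skipping unknown names is an assumption of this sketch.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Weight = List[float]  # stand-in for a tensor in this sketch


def strip_prefix(
    weights: Iterable[Tuple[str, Weight]], prefix: str = "base_lm."
) -> Iterator[Tuple[str, Weight]]:
    """Keep only entries under `prefix` and drop the prefix from their names."""
    for name, value in weights:
        if name.startswith(prefix):
            yield name[len(prefix):], value


def load_named_weights(
    params: Dict[str, Weight], weights: Iterable[Tuple[str, Weight]]
) -> Set[str]:
    """Copy each entry into the matching parameter; return the loaded names."""
    loaded: Set[str] = set()
    for name, value in weights:
        target = params.get(name)
        if target is None:
            continue  # unknown name: skipped (assumption of this sketch)
        if len(target) != len(value):
            raise ValueError(f"shape mismatch for {name}")
        target[:] = value  # in-place copy into the existing storage
        loaded.add(name)
    return loaded


params = {"embed_tokens.weight": [0.0, 0.0], "norm.weight": [1.0]}
checkpoint = [
    ("base_lm.embed_tokens.weight", [0.5, -0.5]),
    ("base_lm.norm.weight", [2.0]),
    ("other.head.weight", [9.0]),
]
loaded = load_named_weights(params, strip_prefix(checkpoint))
missing = set(params) - loaded
```

Returning the set of loaded names lets the caller compare it against the model's parameter names and report anything the checkpoint did not cover, as the last line of the sketch does.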

precompute_fused_qkv

precompute_fused_qkv() -> None

Materialize fused QKV weights before CUDA Graph capture.
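As a rough illustration of what fusing QKV weights means, the sketch below concatenates separate Q, K and V projection matrices along the output dimension once, ahead of time, so that a single matrix product yields all three outputs. It uses plain Python lists rather than tensors; fuse_qkv and matvec are hypothetical names, and how the real method builds or stores the fused weight is not stated here.

```python
from typing import List

Matrix = List[List[float]]  # one row per output feature


def matvec(weight: Matrix, x: List[float]) -> List[float]:
    """Multiply a [out, in] weight matrix by an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def fuse_qkv(wq: Matrix, wk: Matrix, wv: Matrix) -> Matrix:
    """Concatenate the three projections along the output (row) dimension."""
    return wq + wk + wv


# Toy sizes: hidden=2, Q has 2 output rows, K and V have 1 row each.
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[2.0, 1.0]]
wv = [[0.5, -1.0]]

# Built once up front, so nothing has to be allocated on the hot path.
w_qkv = fuse_qkv(wq, wk, wv)

x = [3.0, 4.0]
qkv = matvec(w_qkv, x)  # one product instead of three
q, k, v = qkv[:2], qkv[2:3], qkv[3:]
```

Materializing the fused weight eagerly fits the usual constraint that CUDA Graph capture should replay a fixed sequence of kernels over pre-allocated buffers. Whether that is the exact motivation here is an assumption.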

MiniCPM4PagedResidualLM

Bases: Module

PagedAttention residual LM (8 layers, no RoPE) for VoxCPM2.

config instance-attribute

config = config

layers instance-attribute

layers = ModuleList(
    [
        _PagedMiniCPM4DecoderLayer(
            hidden_size=hidden_size,
            intermediate_size=intermediate_size,
            num_attention_heads=num_attention_heads,
            num_key_value_heads=num_key_value_heads,
            kv_channels=kv_channels,
            rms_norm_eps=rms_norm_eps,
            layer_idx=i,
            num_hidden_layers=num_hidden_layers,
            use_mup=getattr(lm_cfg, "use_mup", False),
            scale_depth=getattr(lm_cfg, "scale_depth", 1.0),
            cache_config=cache_config,
            prefix=f"{prefix}.layers.{i}",
        )
        for i in range(num_hidden_layers)
    ]
)

norm instance-attribute

norm = RMSNorm(hidden_size, eps=rms_norm_eps)

rope_emb instance-attribute

rope_emb = None

compile_selective

compile_selective() -> list[str]

Compile the full residual model forward as one graph (same strategy as base_lm).

forward

forward(positions: Tensor, inputs_embeds: Tensor) -> Tensor

load_weights_from_native

load_weights_from_native(native_residual_lm: Module) -> int

Load weights from the native residual_lm module and return the parameter count.

precompute_fused_qkv

precompute_fused_qkv() -> None

Materialize fused QKV weights before CUDA Graph capture.