vllm_omni.model_executor.models.voxcpm2.minicpm4_paged ¶

MiniCPM4 with PagedAttention + fp32 RoPE/RMSNorm for VoxCPM2.

Uses vLLM's Attention layer for the paged KV cache and keeps the fp32-precision ops from minicpm4_hf_compat.py so that numerics match native VoxCPM2.

logger `module-attribute` ¶

logger = init_logger(__name__)

MiniCPM4PagedForVoxCPM2 ¶

Bases: Module

PagedAttention base_lm (28 layers) for VoxCPM2 scaffold.

config `instance-attribute` ¶

config = config

embed_tokens `instance-attribute` ¶

embed_tokens = nn.Embedding(self.vocab_size, hidden_size)

layers `instance-attribute` ¶

layers = nn.ModuleList(
    [
        _PagedMiniCPM4DecoderLayer(
            hidden_size=hidden_size,
            intermediate_size=lm_cfg.intermediate_size,
            num_attention_heads=lm_cfg.num_attention_heads,
            num_key_value_heads=lm_cfg.num_key_value_heads,
            kv_channels=kv_channels,
            rms_norm_eps=lm_cfg.rms_norm_eps,
            layer_idx=i,
            num_hidden_layers=num_hidden_layers,
            use_mup=getattr(lm_cfg, "use_mup", False),
            scale_depth=getattr(lm_cfg, "scale_depth", 1.0),
            cache_config=cache_config,
            prefix=f"{prefix}.layers.{i}",
        )
        for i in range(num_hidden_layers)
    ]
)
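
The `use_mup` and `scale_depth` arguments suggest depth-scaled residual connections of the kind used in the MiniCPM family. The sketch below assumes the residual branch is scaled by `scale_depth / sqrt(num_hidden_layers)` when `use_mup` is true and left unscaled otherwise. That formula is an assumption carried over from the public MiniCPM reference code and is not confirmed by this page; `residual_scale` and `add_residual` are hypothetical helper names, and plain lists stand in for tensors.

```python
import math


def residual_scale(use_mup: bool, scale_depth: float, num_hidden_layers: int) -> float:
    """Scale applied to a sub-layer output before the residual add (assumed convention)."""
    if use_mup:
        # Assumption: MiniCPM-style depth scaling.
        return scale_depth / math.sqrt(num_hidden_layers)
    return 1.0


def add_residual(residual: list[float], branch_out: list[float], scale: float) -> list[float]:
    """Compute residual + scale * branch_out element-wise."""
    return [r + scale * h for r, h in zip(residual, branch_out)]
```

With the defaults shown above (`use_mup=False`, `scale_depth=1.0`), this reduces to an ordinary residual add.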

make_empty_intermediate_tensors `instance-attribute` ¶

make_empty_intermediate_tensors = (
    make_empty_intermediate_tensors_factory(
        ["hidden_states", "residual"], hidden_size
    )
)

norm `instance-attribute` ¶

norm = RMSNorm(hidden_size, eps=lm_cfg.rms_norm_eps)
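
For orientation, the root-mean-square normalization that `RMSNorm(hidden_size, eps=...)` refers to can be sketched in plain Python. This is the textbook formula rather than the module's implementation: `rms_norm` and its arguments are illustrative names, and Python floats stand in for the fp32 computation mentioned in the module description.

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-5) -> list[float]:
    """Textbook RMSNorm over one hidden vector (illustrative helper, not the vllm_omni API)."""
    # Mean of squares over the hidden dimension.
    mean_sq = sum(v * v for v in x) / len(x)
    # Multiply by the reciprocal root-mean-square, then by the per-channel weight.
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

With unit weights and a negligible `eps`, the output vector has a mean square of 1, which is the property the normalization is meant to enforce.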

rope_emb `instance-attribute` ¶

rope_emb = _MiniCPMLongRoPE(
    hidden_size=hidden_size,
    num_attention_heads=lm_cfg.num_attention_heads,
    kv_channels=kv_channels,
    rope_theta=getattr(lm_cfg, "rope_theta", 10000.0),
    max_position_embeddings=getattr(
        lm_cfg, "max_position_embeddings", 32768
    ),
    rope_scaling=rope_scaling_dict,
)
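
As a rough illustration of what `rope_theta` controls, the sketch below applies plain rotary position embedding to a single head vector. It deliberately ignores `rope_scaling` (the LongRoPE factors that `_MiniCPMLongRoPE` presumably applies) and assumes the half-split pairing of channels `(i, i + d/2)`; both function names are hypothetical.

```python
import math


def rope_angles(position: int, head_dim: int, rope_theta: float = 10000.0) -> list[float]:
    """Angles position * theta^(-2i/d) for i in [0, d/2)."""
    half = head_dim // 2
    return [position * rope_theta ** (-2.0 * i / head_dim) for i in range(half)]


def apply_rope(vec: list[float], position: int, rope_theta: float = 10000.0) -> list[float]:
    """Rotate one head vector; channel i is paired with channel i + d/2 (assumed layout)."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i, angle in enumerate(rope_angles(position, len(vec), rope_theta)):
        cos, sin = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        # Standard 2-D rotation of the pair (a, b) by `angle`.
        out[i] = a * cos - b * sin
        out[i + half] = b * cos + a * sin
    return out
```

Because each pair is only rotated, the vector's norm is unchanged and position 0 leaves the input as it is.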

vocab_size `instance-attribute` ¶

vocab_size = lm_cfg.vocab_size

compile_selective ¶

compile_selective() -> list[str]

Compile the full model forward as one graph.

torch.compile is applied at the model level so that Dynamo unrolls the per-layer Python loop inside the graph. Graph breaks at PagedAttention still split the trace into sub-graphs, but Dynamo traces and caches the whole forward once, so the number of dispatches per decode step drops to a few instead of one per layer.

embed_input_ids ¶

embed_input_ids(input_ids: Tensor, **_: Any) -> Tensor

forward ¶

forward(
    input_ids: Tensor | None,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    **kwargs: Any,
) -> Tensor | IntermediateTensors

load_weights ¶

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights from the native checkpoint. The `base_lm.` prefix is expected to have been stripped from the weight names already.
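
Since `load_weights` expects names without the `base_lm.` prefix, a caller has to filter and rename the checkpoint entries first. The generator below is one possible way to do that; `strip_prefix` is a hypothetical helper, not part of vllm_omni, and the tensor type is left as `Any` so the sketch runs without PyTorch.

```python
from collections.abc import Iterable, Iterator
from typing import Any


def strip_prefix(
    weights: Iterable[tuple[str, Any]], prefix: str = "base_lm."
) -> Iterator[tuple[str, Any]]:
    """Yield only entries under `prefix`, with the prefix removed from the name."""
    for name, tensor in weights:
        if name.startswith(prefix):
            yield name[len(prefix):], tensor
```

The result can be passed straight to `load_weights`, which accepts any iterable of `(name, tensor)` pairs; entries that belong to other sub-modules are dropped.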

precompute_fused_qkv ¶

precompute_fused_qkv() -> None

Materialize fused QKV weights before CUDA Graph capture.
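
Fusing is assumed here to mean concatenating the query, key, and value projection weights along the output dimension so that a single matrix multiply produces all three results. Doing this once up front presumably avoids allocating new tensors while a CUDA Graph is being captured. The sketch uses nested lists in place of tensors, and `fuse_qkv` and `matvec` are hypothetical names.

```python
Matrix = list[list[float]]


def fuse_qkv(q_w: Matrix, k_w: Matrix, v_w: Matrix) -> Matrix:
    """Stack [out_features][in_features] weights row-wise into one fused matrix."""
    return q_w + k_w + v_w


def matvec(w: Matrix, x: list[float]) -> list[float]:
    """Plain matrix-vector product, one output per weight row."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]
```

The fused product equals the three separate projections laid end to end, so the caller can slice it back into q, k, and v by their output sizes.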

MiniCPM4PagedResidualLM ¶

Bases: Module

PagedAttention residual LM (8 layers, no RoPE) for VoxCPM2.

config `instance-attribute` ¶

config = config

layers `instance-attribute` ¶

layers = nn.ModuleList(
    [
        _PagedMiniCPM4DecoderLayer(
            hidden_size=hidden_size,
            intermediate_size=lm_cfg.intermediate_size,
            num_attention_heads=lm_cfg.num_attention_heads,
            num_key_value_heads=lm_cfg.num_key_value_heads,
            kv_channels=kv_channels,
            rms_norm_eps=lm_cfg.rms_norm_eps,
            layer_idx=i,
            num_hidden_layers=num_hidden_layers,
            use_mup=getattr(lm_cfg, "use_mup", False),
            scale_depth=getattr(lm_cfg, "scale_depth", 1.0),
            cache_config=cache_config,
            prefix=f"{prefix}.layers.{i}",
        )
        for i in range(num_hidden_layers)
    ]
)

norm `instance-attribute` ¶

norm = RMSNorm(hidden_size, eps=lm_cfg.rms_norm_eps)

rope_emb `instance-attribute` ¶

rope_emb = None

compile_selective ¶

compile_selective() -> list[str]

Compile the full residual model forward as one graph (same strategy as base_lm).

forward ¶

forward(positions: Tensor, inputs_embeds: Tensor) -> Tensor

load_weights_from_native ¶

load_weights_from_native(native_residual_lm: Module) -> int

Load weights from the native residual_lm module. Returns the parameter count.

precompute_fused_qkv ¶

precompute_fused_qkv() -> None

Materialize fused QKV weights before CUDA Graph capture.
