vllm_omni.model_executor.models.voxcpm2.minicpm4_paged ¶
MiniCPM4 with PagedAttention + fp32 RoPE/RMSNorm for VoxCPM2.
Uses vLLM's Attention for the KV cache and keeps the fp32-precision ops from minicpm4_hf_compat.py so that the numerics match native VoxCPM2.
MiniCPM4PagedForVoxCPM2 ¶
Bases: Module
PagedAttention base_lm (28 layers) for VoxCPM2 scaffold.
layers instance-attribute ¶
layers = ModuleList(
    [
        _PagedMiniCPM4DecoderLayer(
            hidden_size=hidden_size,
            intermediate_size=intermediate_size,
            num_attention_heads=num_attention_heads,
            num_key_value_heads=num_key_value_heads,
            kv_channels=kv_channels,
            rms_norm_eps=rms_norm_eps,
            layer_idx=i,
            num_hidden_layers=num_hidden_layers,
            use_mup=getattr(lm_cfg, "use_mup", False),
            scale_depth=getattr(lm_cfg, "scale_depth", 1.0),
            cache_config=cache_config,
            prefix=f"{prefix}.layers.{i}",
        )
        for i in range(num_hidden_layers)
    ]
)
make_empty_intermediate_tensors instance-attribute ¶
make_empty_intermediate_tensors = make_empty_intermediate_tensors_factory(
    ["hidden_states", "residual"], hidden_size
)
rope_emb instance-attribute ¶
rope_emb = _MiniCPMLongRoPE(
    hidden_size=hidden_size,
    num_attention_heads=num_attention_heads,
    kv_channels=kv_channels,
    rope_theta=getattr(lm_cfg, "rope_theta", 10000.0),
    max_position_embeddings=getattr(lm_cfg, "max_position_embeddings", 32768),
    rope_scaling=rope_scaling_dict,
)
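As background for the rope_theta parameter, the following is a minimal sketch of plain rotary position embedding frequencies, not the _MiniCPMLongRoPE implementation. The function names rope_inv_freq and rotate_pair are hypothetical, and the additional rescaling that rope_scaling applies in the LongRoPE variant is deliberately left out.

```python
import math


def rope_inv_freq(head_dim: int, rope_theta: float = 10000.0) -> list[float]:
    """Inverse frequencies for standard RoPE: theta ** (-2i / d) for each pair i."""
    return [rope_theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def rotate_pair(x0: float, x1: float, angle: float) -> tuple[float, float]:
    """Rotate one (x0, x1) feature pair by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


# Illustrative values only: a head dimension of 4 gives two feature pairs.
inv_freq = rope_inv_freq(head_dim=4)

# The rotation angle for a token is its position times the pair's frequency.
position = 3
angles = [position * f for f in inv_freq]
rotated = rotate_pair(1.0, 0.0, angles[0])
```

Computing the angles and the sine/cosine terms in full precision is what an fp32 RoPE path would preserve; at large positions the low-precision alternative loses angle resolution.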
compile_selective ¶
Compile the full model forward as one graph.
Earlier versions compiled layer.mlp + layer.self_attn.o_proj (PR #2690) and then the whole layer (perf/voxcpm2-streaming-vae). Both still paid one Dynamo dispatch per layer per decode step. V3 profiling showed 1,332 per-layer dispatches (~28 layers × ~47 decode steps) costing ~726 ms of CPU self-time for a long prompt.
Compiling forward at the model level lets Dynamo unroll the 28-layer Python loop inside the graph. Graph breaks at PagedAttention still split the trace into sub-graphs, but Dynamo memoises the whole trace once, so the number of dispatches per decode step drops from 28 to just a few.
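The figures quoted in the profiling note above can be turned into a rough per-dispatch cost. This is back-of-the-envelope arithmetic on the stated numbers only, and it assumes every decode step pays exactly one dispatch per layer.

```python
# Figures quoted in the profiling note.
total_dispatches = 1_332   # per-layer Dynamo dispatches observed
cpu_self_time_ms = 726.0   # approximate CPU self-time attributed to them
layers = 28                # base_lm depth

# Average cost of a single per-layer dispatch.
ms_per_dispatch = cpu_self_time_ms / total_dispatches

# Number of decode steps implied by one dispatch per layer per step.
implied_steps = total_dispatches / layers

# Dispatch overhead paid on every decode step before model-level compilation.
overhead_per_step_ms = layers * ms_per_dispatch

print(f"{ms_per_dispatch:.3f} ms per dispatch")
print(f"{implied_steps:.1f} decode steps implied")
print(f"{overhead_per_step_ms:.2f} ms of dispatch overhead per decode step")
```

Under that assumption each dispatch costs roughly half a millisecond, so cutting the count from 28 to a handful per step removes most of this overhead.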
forward ¶
forward(
    input_ids: Tensor | None,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    **kwargs: Any,
) -> Tensor | IntermediateTensors
load_weights ¶
Load weights from the native checkpoint. The caller is expected to have already stripped the base_lm. prefix from the weight names.
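A minimal sketch of what the caller-side prefix stripping might look like. The helper name strip_base_lm_prefix and the example weight names are hypothetical; the sketch assumes the checkpoint is exposed as an iterable of (name, tensor) pairs.

```python
from typing import Any, Iterable, Iterator, Tuple


def strip_base_lm_prefix(
    weights: Iterable[Tuple[str, Any]],
    prefix: str = "base_lm.",
) -> Iterator[Tuple[str, Any]]:
    """Yield only entries under `prefix`, with the prefix removed from the name."""
    for name, tensor in weights:
        if name.startswith(prefix):
            yield name[len(prefix):], tensor


# Placeholder objects stand in for tensors; names are illustrative only.
checkpoint = [
    ("base_lm.layers.0.mlp.weight", "tensor_a"),
    ("residual_lm.layers.0.mlp.weight", "tensor_b"),
]
stripped = list(strip_base_lm_prefix(checkpoint))
```

Entries that belong to other sub-modules are dropped, so the loader only sees names relative to the base language model.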
precompute_fused_qkv ¶
Materialize fused QKV weights before CUDA Graph capture.
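To illustrate the idea of a fused QKV weight, the sketch below concatenates separate query, key and value projection matrices along the output dimension and checks that one fused projection reproduces the three separate ones. It uses plain Python lists for clarity; the helper names and the row-concatenation layout are assumptions, not the layout this class necessarily uses.

```python
Matrix = list[list[float]]


def matvec(weight: Matrix, x: list[float]) -> list[float]:
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def fuse_qkv(w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """Stack the three projections along the output (row) dimension."""
    return w_q + w_k + w_v


# Tiny illustrative weights: 2 query rows, 1 key row, 1 value row.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[2.0, 0.0]]
w_v = [[0.0, 3.0]]

# Build the fused weight once, ahead of time.
w_qkv = fuse_qkv(w_q, w_k, w_v)

# One projection, then split the output back into q, k and v.
x = [0.5, -1.0]
fused_out = matvec(w_qkv, x)
q, k, v = fused_out[:2], fused_out[2:3], fused_out[3:]
```

Building the fused weight once up front keeps the concatenation, and any allocation it needs, out of the region that is later captured, which is presumably why this step runs before CUDA Graph capture.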
MiniCPM4PagedResidualLM ¶
Bases: Module
PagedAttention residual LM (8 layers, no RoPE) for VoxCPM2.
layers instance-attribute ¶
layers = ModuleList(
    [
        _PagedMiniCPM4DecoderLayer(
            hidden_size=hidden_size,
            intermediate_size=intermediate_size,
            num_attention_heads=num_attention_heads,
            num_key_value_heads=num_key_value_heads,
            kv_channels=kv_channels,
            rms_norm_eps=rms_norm_eps,
            layer_idx=i,
            num_hidden_layers=num_hidden_layers,
            use_mup=getattr(lm_cfg, "use_mup", False),
            scale_depth=getattr(lm_cfg, "scale_depth", 1.0),
            cache_config=cache_config,
            prefix=f"{prefix}.layers.{i}",
        )
        for i in range(num_hidden_layers)
    ]
)
compile_selective ¶
Compile the full residual model forward as one graph (same strategy as base_lm).
load_weights_from_native ¶
load_weights_from_native(native_residual_lm: Module) -> int
Load weights from the native residual_lm module. Returns the parameter count.
precompute_fused_qkv ¶
Materialize fused QKV weights before CUDA Graph capture.