vllm.models.deepseek_v32.nvidia.kernels
Functions:
- fused_eh_norm – Returns cat([enorm(masked embeds), hnorm(prev_hidden)]) -> [N, 2H].
_fp8_ue8m0_quantize(vals)
Quantize float32 values to FP8 E4M3 with a ue8m0 (power-of-2) scale.
Returns (fp8_vals, scale) so the caller can store them or reuse the scale.
Source code in vllm/models/deepseek_v32/nvidia/kernels.py
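The scale selection described for _fp8_ue8m0_quantize can be illustrated with a minimal pure-Python sketch. It is not the kernel's code: the helper names (ue8m0_scale, quantize_reference), the round-up choice for the exponent, and the handling of an all-zero input are assumptions made for illustration, and the final rounding of each value to an FP8 mantissa is left out.

```python
import math

# Largest finite magnitude representable in FP8 E4M3 (the "fn" variant).
FP8_E4M3_MAX = 448.0


def ue8m0_scale(vals):
    """Return a power-of-two scale so that max(|vals|) / scale <= 448.

    Hypothetical helper: rounding the exponent up (ceil) is an assumption,
    chosen so that the scaled values never exceed the FP8 range.
    """
    amax = max(abs(v) for v in vals)
    if amax == 0.0:
        return 1.0  # assumption: any scale works for an all-zero input
    exponent = math.ceil(math.log2(amax / FP8_E4M3_MAX))
    return 2.0 ** exponent


def quantize_reference(vals):
    """Return (scaled_vals, scale); scaled_vals are clamped to the FP8 range.

    The cast to an actual FP8 dtype (mantissa rounding) is omitted here.
    """
    scale = ue8m0_scale(vals)
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in vals]
    return scaled, scale
```

Because the scale is an exact power of two, dividing by it only shifts the floating-point exponent, so dequantizing with `q * scale` introduces no error beyond the FP8 rounding itself.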
_fused_eh_norm_kernel(pos_ptr, embeds_ptr, embeds_stride, prev_ptr, prev_stride, enorm_w_ptr, hnorm_w_ptr, eps, out_ptr, out_stride, H, BLOCK)
MTP (multi-token prediction) input fusion: zeroes the embeddings at position 0, applies RMSNorm to embeds with the enorm weight and to prev_hidden with the hnorm weight, and writes the two results side by side into out ([N, 2H]), ready for the eh_proj GEMM. Replaces the unfused where + 2x RMSNorm + cat sequence.
Source code in vllm/models/deepseek_v32/nvidia/kernels.py
fused_eh_norm(positions, inputs_embeds, previous_hidden, enorm_w, hnorm_w, eps)
Returns cat([enorm(masked embeds), hnorm(prev_hidden)]) -> [N, 2H].
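For comparison, the unfused computation that fused_eh_norm replaces (where + 2x RMSNorm + cat) can be sketched in plain Python over nested lists. This is only a reference for the arithmetic, not the kernel itself: the names rms_norm and eh_norm_reference are hypothetical, the RMSNorm form `x / sqrt(mean(x^2) + eps) * w` is assumed, and dtype handling is ignored.

```python
import math


def rms_norm(row, weight, eps):
    """RMSNorm over one row: x / sqrt(mean(x^2) + eps) * w (assumed form)."""
    mean_sq = sum(x * x for x in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * w for x, w in zip(row, weight)]


def eh_norm_reference(positions, inputs_embeds, previous_hidden,
                      enorm_w, hnorm_w, eps):
    """Hypothetical unfused reference: returns N rows of length 2H."""
    out = []
    for pos, embed, prev in zip(positions, inputs_embeds, previous_hidden):
        if pos == 0:
            # "where" step: mask the embedding of the token at position 0.
            embed = [0.0] * len(embed)
        # "cat" step: normalized embeds first, normalized hidden state second.
        out.append(rms_norm(embed, enorm_w, eps) + rms_norm(prev, hnorm_w, eps))
    return out
```

A reference like this is mainly useful for checking a fused kernel's output row by row within a tolerance; the fused version avoids materializing the masked embeddings and the two normalized intermediates before the concatenation.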