vllm.models.deepseek_v4.nvidia.ops.o_proj
compute_fp8_einsum_recipe
Selects the fp8_einsum recipe and the scale layout for the current GPU architecture.
On SM90, the FP32 block scales keep the shape [g, r/128, d/128], so sfb_gran_mn=128. On SM100, the INT32 packed scales take the shape [g, r, ...], so sfb_gran_mn=1.
Returns the pair (einsum_recipe, tma_aligned_scales), which is passed to deep_gemm_fp8_o_proj.
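The relationship between sfb_gran_mn and the scale shape along the r axis can be illustrated with a minimal sketch. The helper `scale_rows`, the lookup table, and the value r=1024 are hypothetical and chosen only for illustration; they are not part of vLLM or DeepGEMM, and the sketch does not model the remaining SM100 dimensions, which the docstring leaves elided.

```python
import math

# Hypothetical table mirroring the docstring: granularity of the B-side
# scale factors along the r axis, per architecture.
SFB_GRAN_MN_BY_ARCH = {"SM90": 128, "SM100": 1}


def scale_rows(r: int, arch: str) -> int:
    """Number of scale entries along the r axis for the given architecture."""
    gran = SFB_GRAN_MN_BY_ARCH[arch]
    # One scale entry covers `gran` consecutive rows.
    return math.ceil(r / gran)


# Illustrative size only; the real r comes from the model configuration.
sm90_rows = scale_rows(1024, "SM90")    # r/128 entries
sm100_rows = scale_rows(1024, "SM100")  # r entries
print(sm90_rows, sm100_rows)
```

With a granularity of 128 the scale tensor has one entry per 128 rows, while a granularity of 1 gives one entry per row, which matches the [g, r/128, ...] and [g, r, ...] leading dimensions described above.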
Source code in vllm/models/deepseek_v4/nvidia/ops/o_proj.py
deep_gemm_fp8_o_proj
deep_gemm_fp8_o_proj(
    o: Tensor,
    positions: Tensor,
    cos_sin_cache: Tensor,
    wo_a: Module,
    wo_b: Module,
    *,
    n_groups: int,
    heads_per_group: int,
    nope_dim: int,
    rope_dim: int,
    o_lora_rank: int,
    einsum_recipe: tuple[int, int, int],
    tma_aligned_scales: bool,
) -> Tensor
Output (O) projection: inverse RoPE, FP8 quantization, the FP8 einsum, and then wo_b.
Shared by the FlashMLA and FlashInfer CUDA backends. The einsum_recipe and tma_aligned_scales arguments come from compute_fp8_einsum_recipe.
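The inverse RoPE step can be sketched in plain Python as a rotation by the negated angle. This is an illustration of the general technique only: the helper `rotate_pairs`, the interleaved pair layout, and the sample angles are assumptions, and the actual kernel reads its cosines and sines from cos_sin_cache and may lay out the rotary dimensions differently.

```python
import math


def rotate_pairs(x, angles, inverse=False):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by angles[i].

    With inverse=True the angle is negated, which undoes the forward rotation.
    """
    out = []
    for i, theta in enumerate(angles):
        if inverse:
            theta = -theta
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


# Hypothetical rotary slice of one head and its per-pair angles.
x = [1.0, 2.0, 3.0, 4.0]
angles = [0.3, 1.1]

rotated = rotate_pairs(x, angles)
restored = rotate_pairs(rotated, angles, inverse=True)
print(restored)
```

Applying the forward rotation and then the inverse recovers the original values up to floating-point error, which is the property the O projection relies on when it removes the positional rotation before quantizing.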