vllm.models.minimax_m3.nvidia.indexer_msa
MSA (SM100/Blackwell) indexer impl for MiniMax M3.
Prefill computes scores with fmha_sm100's score-only (OnlyScore) path, then selects the top-k blocks with the Triton minimax_m3_index_topk kernel. fmha is much faster than Triton for the wide prefill score pass (benchmarked at roughly 3-5x).
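The two-stage prefill flow (a dense score pass followed by a separate top-k selection) can be sketched in plain Python. This is only an illustration of the data flow, not the fmha_sm100 or Triton kernels: the function names are hypothetical, the sketch assumes one index key vector per block, and it omits any masking the real path may apply.

```python
import heapq


def block_scores(query, block_keys):
    """Dot-product score of one query vector against one key vector per block."""
    # Assumption: each block is represented by a single key vector.
    return [sum(q * k for q, k in zip(query, key)) for key in block_keys]


def topk_blocks(scores, k):
    """Indices of the k highest-scoring blocks, best first (ties: lower index)."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: (scores[i], -i))


def prefill_select(queries, block_keys, k):
    # Stage 1: dense score pass over every (query, block) pair.
    all_scores = [block_scores(q, block_keys) for q in queries]
    # Stage 2: independent top-k selection for each query row.
    return [topk_blocks(row, k) for row in all_scores]
```

Keeping the stages separate mirrors the text: the score pass is the wide, compute-heavy part, so it can use a different backend from the selection step.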
Decode uses the fused Triton kernel minimax_m3_index_decode, the same kernel the Triton indexer impl uses. For q_len == 1 it is a purpose-built vector x matrix score (no wasted tensor-core tiles) with a 256-way split-K and a fused split-K top-k. This beats fmha's OnlyScore path, which wastes MMA work on a single query and is capped at 64 splits, by roughly 1.1-3.7x. The kernel is cudagraph-safe by construction (shape-constant split grids) and writes the shared topk_indices_buffer via out=.
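The split-K top-k idea can be illustrated on the CPU: the score vector is cut into a fixed number of splits, each split keeps its own k best candidates, and a merge step picks the global top-k from those candidates. The sketch below is pure Python with a hypothetical function name; it shows only the reduction pattern, not the Triton kernel, and the real split count (256) is replaced by a small parameter.

```python
import heapq


def split_k_topk(scores, k, num_splits):
    """Top-k via a fixed number of splits: local top-k per split, then a merge."""
    n = len(scores)
    chunk = -(-n // num_splits)  # ceil division; the split count itself is fixed
    candidates = []
    for s in range(num_splits):
        lo, hi = s * chunk, min((s + 1) * chunk, n)
        # Each split keeps only its k best (score, -index) pairs.
        # Negating the index makes ties resolve to the lower index.
        candidates.extend(
            heapq.nlargest(k, ((scores[i], -i) for i in range(lo, hi)))
        )
    # Merge: every global top-k element survives its split's local top-k,
    # so selecting from the candidates gives the exact global result.
    merged = heapq.nlargest(k, candidates)
    return [-neg_i for _, neg_i in merged]
```

Because the number of splits does not depend on the sequence length, the launch shape stays constant, which is the property the text relies on for cudagraph safety.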
fmha_sm100 imports are function-local, so this module is import-safe on AMD and other non-SM100 platforms.
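The function-local import pattern can be sketched as follows. The package name `_hypothetical_sm100_only_pkg` and both function names are placeholders invented for illustration; they are not the real fmha_sm100 API. The point is only that the import statement runs when the function is called, not when the module is loaded.

```python
def run_sm100_only_path(x):
    # Deferred import: evaluated only when this function is called, so
    # importing the enclosing module never touches the optional package.
    import _hypothetical_sm100_only_pkg as backend  # hypothetical name

    return backend.score(x)


def backend_available():
    """Probe the optional dependency without failing at module import time."""
    try:
        import _hypothetical_sm100_only_pkg  # noqa: F401  (hypothetical name)
    except ImportError:
        return False
    return True
```

On a machine without the optional package, defining these functions succeeds and only a call to `run_sm100_only_path` raises `ImportError`.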
Classes:
- MiniMaxM3IndexerMSABackend – Indexer side-cache backend selecting the MSA builder.
- MiniMaxM3IndexerMSAImpl – Decode: Triton fused score+top-k. Prefill: fmha_sm100 OnlyScore + top-k.
- MiniMaxM3IndexerMSAMetadata – Decode reuses the inherited base decode field (the Triton decode metadata); prefill_msa carries the fmha score plan for the prefill side.
- MiniMaxM3IndexerMSAMetadataBuilder – Decode metadata is the cudagraph-safe Triton decode metadata; the prefill fmha plan is built eagerly.
- MiniMaxM3IndexerMSAPrefillMetadata – fmha score plan + Triton top-k inputs for the prefill side (eager).
MiniMaxM3IndexerMSABackend
Bases: MiniMaxM3IndexerBackend
Indexer side-cache backend selecting the MSA builder.
Source code in vllm/models/minimax_m3/nvidia/indexer_msa.py
MiniMaxM3IndexerMSAImpl
Bases: MiniMaxM3IndexerImpl
Decode: Triton fused score+top-k. Prefill: fmha_sm100 OnlyScore + top-k.
Source code in vllm/models/minimax_m3/nvidia/indexer_msa.py
MiniMaxM3IndexerMSAMetadata dataclass
Bases: MiniMaxM3IndexerMetadata
Decode reuses the inherited base decode field (the Triton decode metadata); prefill_msa carries the fmha score plan for the prefill side (the base prefill field is unused on this path).
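The field layout described above can be sketched with plain dataclasses. Only the field names `decode`, `prefill`, and `prefill_msa` come from the text; the class names, field types, defaults, and the `pick_paths` helper are assumptions made for illustration and do not reproduce the actual vLLM definitions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BaseMetadataSketch:
    # Stand-ins for the base class's decode / prefill fields (types assumed).
    decode: Optional[Any] = None
    prefill: Optional[Any] = None


@dataclass
class MSAMetadataSketch(BaseMetadataSketch):
    # Extra field for the fmha score plan; the inherited `prefill` stays unused.
    prefill_msa: Optional[Any] = None


def pick_paths(meta):
    """Return which sides of a batch this metadata describes."""
    paths = []
    if meta.decode is not None:
        paths.append("decode")
    if meta.prefill_msa is not None:
        paths.append("prefill_msa")
    return paths
```

Adding a new field rather than overloading the inherited `prefill` keeps the decode side identical to the base path while letting the prefill side carry a differently shaped plan.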
Source code in vllm/models/minimax_m3/nvidia/indexer_msa.py
MiniMaxM3IndexerMSAMetadataBuilder
Bases: MiniMaxM3IndexerMetadataBuilder
Decode metadata is the cudagraph-safe Triton decode metadata; the prefill fmha plan is built eagerly (prefill batches are not captured).
Source code in vllm/models/minimax_m3/nvidia/indexer_msa.py
MiniMaxM3IndexerMSAPrefillMetadata dataclass
fmha score plan + Triton top-k inputs for the prefill side (eager).