vllm.models.minimax_m3.amd.ops.index_topk
Triton kernels for MiniMax M3 lightning-indexer block scoring + top-k.
Index queries score each 128-token block of index keys (max over the block), then the top-k blocks (plus forced init/local blocks) are selected per query token. Adapted to vLLM's paged KV cache: the KV page size is forced to equal the sparse block size (128), so one sparse block maps to exactly one page.
Index-K cache layout (vLLM): (num_blocks, 128, idx_head_dim) (single head).
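Because one sparse block maps to exactly one page, locating an index key reduces to an integer divide and a remainder. The following is a minimal sketch of that addressing, not code from the module: the helper name `locate_index_key` is hypothetical, and it assumes each block-table row lists the physical page id of every logical block of a sequence in order, as is usual for paged KV caches.

```python
SPARSE_BLOCK = 128  # sparse block size == KV page size, as stated above


def locate_index_key(block_table_row, token_pos, block_size=SPARSE_BLOCK):
    """Map a token position in one sequence to (physical_page, offset_in_page).

    Assumption: block_table_row[i] is the physical page id that holds
    logical block i of this sequence.
    """
    logical_block = token_pos // block_size  # which sparse block / page
    offset = token_pos % block_size          # row inside that page
    return block_table_row[logical_block], offset


# Hypothetical sequence whose three logical blocks live in pages 7, 2 and 9.
row = [7, 2, 9]
print(locate_index_key(row, 0))    # (7, 0)
print(locate_index_key(row, 130))  # (2, 2)
print(locate_index_key(row, 383))  # (9, 127)
```

Under this assumption, the key vector for a token would be read as index_kv_cache[physical_page, offset, :].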
Only the paths MiniMax M3 uses are implemented: score_type="max", index value disabled (score-only indexer), single shared index head. The selected block ids feed the block-sparse attention kernels in sparse_attn.
Functions:
- minimax_m3_index_decode: Decode index block-score + top-k, both split-K (cudagraph-safe).
- minimax_m3_index_score: Compute per-token index scores for each visible sparse block.
- minimax_m3_index_topk: Select index top-k from a precomputed score tensor.
minimax_m3_index_decode(idx_q, index_kv_cache, block_table, seq_lens, max_seq_len, topk, init_blocks, local_blocks, num_kv_heads, decode_query_len, max_decode_query_len, out=None)
Decode index block-score + top-k, both split-K (cudagraph-safe).
Returns topk_idx [num_kv_heads, total_q, topk] (0-indexed block ids, -1 pad). When out ([num_kv_heads, >=total_q, topk]) is given, writes into out[:, :total_q, :] (stable address for cudagraph) instead of allocating.
Source code in vllm/models/minimax_m3/amd/ops/index_topk.py
minimax_m3_index_score(idx_q, index_kv_cache, block_table, cu_seqlens_q, seq_lens, prefix_lens, max_query_len, max_seq_len, num_kv_heads)
Compute per-token index scores for each visible sparse block.
Returns score [num_kv_heads, total_q, max_block], where each score is the max over a 128-token index-K block. M3 has num_idx_heads == num_kv_heads.
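The "max over a 128-token block" rule can be illustrated with a plain-Python reference for a single query token and a single head. This is a sketch of the scoring rule only, not the Triton kernel: the function name `block_max_scores` is hypothetical, the score is assumed to be an unscaled dot product, and `keys` is assumed to hold only the keys visible to this query.

```python
def block_max_scores(q, keys, block_size=128):
    """Return one score per block: the max of dot(q, k) over the block's keys.

    q:    list of floats, length idx_head_dim (one query token, one head).
    keys: list of key vectors visible to this query, in token order.
    A final partial block is scored over the keys it actually contains.
    """
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        scores.append(max(sum(a * b for a, b in zip(q, k)) for k in block))
    return scores


# Tiny example with block_size=2 so the blocks are easy to read.
q = [1.0, 0.0]
keys = [[0.5, 9.0], [2.0, 0.0], [-1.0, 0.0], [0.25, 0.0], [3.0, 1.0]]
print(block_max_scores(q, keys, block_size=2))  # [2.0, 0.25, 3.0]
```

Repeating this for every query token and head would fill a tensor shaped like the documented [num_kv_heads, total_q, max_block] result.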
Source code in vllm/models/minimax_m3/amd/ops/index_topk.py
minimax_m3_index_topk(score, cu_seqlens_q, prefix_lens, max_query_len, topk, init_blocks, local_blocks, out=None)
Select index top-k from a precomputed score tensor.
When out is provided (a [num_idx_heads, >=total_q, topk] buffer), the result is written into out[:, :total_q, :] instead of a fresh tensor -- used to keep the top-k output at a stable address for cudagraph capture.
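The selection step for one query token can be sketched in plain Python as follows. The details here are assumptions rather than documented behavior: `select_blocks` is a hypothetical name, the forced init blocks are taken to be the first `init_blocks` visible blocks and the forced local blocks the last `local_blocks`, forced blocks are assumed to count toward the `topk` budget, and the output order is arbitrary. Only the 0-indexed block ids and the -1 padding come from the description above.

```python
def select_blocks(scores, topk, init_blocks, local_blocks):
    """Pick up to `topk` block ids for one query token, padding with -1.

    scores: per-block scores for the blocks visible to this token.
    Assumption: forced (init + local) blocks take slots first, then the
    remaining slots go to the highest-scoring other blocks.
    """
    n = len(scores)
    forced = set(range(min(init_blocks, n)))           # leading blocks
    forced |= set(range(max(n - local_blocks, 0), n))  # trailing blocks
    # Rank the non-forced blocks by score, highest first (ties: lower id).
    rest = sorted((b for b in range(n) if b not in forced),
                  key=lambda b: (-scores[b], b))
    chosen = (sorted(forced) + rest)[:topk]
    return chosen + [-1] * (topk - len(chosen))


print(select_blocks([0.1, 0.9, 0.3, 0.8, 0.2, 0.4], topk=4,
                    init_blocks=1, local_blocks=1))  # [0, 5, 1, 3]
print(select_blocks([0.5, 0.7], topk=4,
                    init_blocks=1, local_blocks=1))  # [0, 1, -1, -1]
```

The second call shows the padding case: with fewer visible blocks than topk, the unused slots are filled with -1.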