vllm.v1.attention.ops.triton_unified_attention_diffkv
Triton unified attention with different K/V head dimensions (DiffKV).
This is a slimmed fork of triton_unified_attention.py for models like MiMo-V2.5 where the V tensor's head dimension differs from K's. The KV cache is the same packed layout used by FlashAttentionDiffKVBackend:
kv_cache: [num_blocks, block_size, num_kv_heads, head_size_qk + head_size_v]
We slice key_cache = kv_cache[..., :head_size_qk] and value_cache = kv_cache[..., head_size_qk:] on the host, so the kernel still takes two cache pointers, but each has its own head size.
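The packed layout can be illustrated with plain index arithmetic. The sketch below is not taken from the module: it assumes a contiguous, row-major kv_cache, and the helper names packed_offsets and split_packed_row are hypothetical. It shows that the K and V slices of one (block, slot, kv-head) entry share the same strides and differ only in their starting offset within the packed row.

```python
def packed_offsets(block, slot, kv_head, block_size, num_kv_heads,
                   head_size_qk, head_size_v):
    """Return the flat offsets of the K and V slices for one cache entry.

    Assumes a contiguous row-major buffer shaped
    [num_blocks, block_size, num_kv_heads, head_size_qk + head_size_v].
    """
    row = head_size_qk + head_size_v  # packed width of one (slot, head) entry
    base = ((block * block_size + slot) * num_kv_heads + kv_head) * row
    key_offset = base                   # K occupies the first head_size_qk elements
    value_offset = base + head_size_qk  # V follows immediately after K
    return key_offset, value_offset


def split_packed_row(row, head_size_qk):
    """Split one packed row into its K part and V part (hypothetical helper)."""
    return row[:head_size_qk], row[head_size_qk:]


# Usage: head_size_qk=4 and head_size_v=2 give a packed row of 6 elements.
k, v = split_packed_row([1, 2, 3, 4, 9, 8], head_size_qk=4)
```

Because both slices are views into the same storage, no copy is needed on the host in this model; only the base offset and the logical head size differ.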
Both 2D and 3D launches are supported:
- 2D: one program per (q-block, kv-head); tile-loop walks the full KV sequence; final output written directly. Used for prefill and large decode batches.
- 3D: one program per (q-block, kv-head, segm); each program covers a KV slice and writes per-segment partials (max/expsum/output). A follow-up kernel_reduce_segments_diffkv combines them. Selected for decode-only batches whose 2D grid would under-fill the GPU.
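To make "each program covers a KV slice" concrete, here is one plausible way a sequence could be partitioned into tile-aligned segments. This is a sketch under assumptions, not the kernel's confirmed logic: the function segment_ranges is hypothetical, and only the parameter names tile_size and num_segments echo TILE_SIZE and NUM_SEGMENTS_PER_SEQ from the reduce kernel's signature.

```python
def segment_ranges(seq_len, tile_size, num_segments):
    """Split [0, seq_len) into at most num_segments tile-aligned KV slices.

    Assumption: every segment gets the same number of tiles (rounded up),
    and segments that would start past the end of the sequence are skipped.
    """
    # Ceil division: tiles each segment must cover so that all tokens fit.
    tiles_per_segment = -(-seq_len // (num_segments * tile_size))
    span = tiles_per_segment * tile_size
    ranges = []
    for segm in range(num_segments):
        start = segm * span
        if start >= seq_len:
            break  # this segment has no work
        ranges.append((start, min(start + span, seq_len)))
    return ranges
```

With short sequences some segments come out empty, which is why a reduction step has to tolerate segments that contributed nothing.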
Functions:
- kernel_reduce_segments_diffkv – Combine per-segment partials into the final softmax output.
kernel_reduce_segments_diffkv(output_ptr, segm_output_ptr, segm_max_ptr, segm_expsum_ptr, seq_lens_ptr, num_seqs, num_query_heads, output_stride_0, output_stride_1, TILE_SIZE, HEAD_SIZE_V, HEAD_SIZE_V_PADDED, query_start_len_ptr, BLOCK_Q, NUM_SEGMENTS_PER_SEQ)
Combine per-segment partials into the final softmax output.
Mirrors reduce_segments from triton_unified_attention.py but indexes V's head size (HEAD_SIZE_V) instead of the shared one.
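The combination step follows the usual split-softmax merge. The pure-Python sketch below is illustrative only and is not the Triton kernel. It assumes each segment stores its score maximum, its sum of exponentials relative to that maximum, and an unnormalized output vector of length HEAD_SIZE_V. The function names are hypothetical.

```python
import math


def segment_partial(scores, values):
    """Compute (max, expsum, unnormalized output) for one KV segment."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    head_size_v = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(head_size_v)]
    return m, sum(weights), out


def reduce_segments(partials):
    """Merge per-segment partials into the final softmax-weighted output."""
    overall_max = max(m for m, _, _ in partials)
    head_size_v = len(partials[0][2])
    total_expsum = 0.0
    acc = [0.0] * head_size_v
    for m, expsum, out in partials:
        # Rescale each segment from its local max to the global max.
        scale = math.exp(m - overall_max)
        total_expsum += expsum * scale
        for d in range(head_size_v):
            acc[d] += out[d] * scale
    return [a / total_expsum for a in acc]


def reference_attention(scores, values):
    """Single-pass softmax-weighted sum, used here only for comparison."""
    m, expsum, out = segment_partial(scores, values)
    return [o / expsum for o in out]


# Usage: two segments over a 5-token sequence with head_size_v = 2.
scores = [0.5, -1.0, 2.0, 0.1, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
partials = [segment_partial(scores[:2], values[:2]),
            segment_partial(scores[2:], values[2:])]
combined = reduce_segments(partials)
```

Only the V head size appears in the merge, since the partial outputs are weighted sums of value vectors; the QK head size matters only when the scores are computed.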