vllm.v1.attention.backends.mla.triton_mla
Classes:
TritonMLAMetadataBuilder
Bases: MLACommonMetadataBuilder[MLACommonMetadata]
Source code in vllm/v1/attention/backends/mla/triton_mla.py
_reserve_attn_logits_workspace()
Pre-size the shared workspace for the decode split-KV attn logits.
The reservation covers the worst case: max_model_len, which yields the maximum num_kv_splits, combined with max_num_seqs decode tokens. Reserving this much before warmup and CUDA graph capture means the per-call get_simultaneous in forward_mqa never has to grow the buffer at runtime. Growing it would raise once the workspace is locked.
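The reserve-then-lock pattern described above can be illustrated with a small standalone sketch. Everything here is hypothetical and is not vLLM's API: `ToyWorkspace`, `attn_logits_nbytes`, the assumed logits shape `(tokens, heads, kv_splits, head_dim + 1)`, and the float32 element size are stand-ins chosen only to show why sizing for the worst case up front keeps later per-call requests from growing a locked buffer.

```python
class ToyWorkspace:
    """Hypothetical grow-until-locked byte buffer (not vLLM's workspace manager)."""

    def __init__(self) -> None:
        self._buf = bytearray()
        self._locked = False

    def reserve(self, nbytes: int) -> None:
        # Grow only if the request exceeds the current capacity.
        if nbytes <= len(self._buf):
            return
        if self._locked:
            raise RuntimeError("workspace is locked and cannot grow")
        self._buf = bytearray(nbytes)

    def lock(self) -> None:
        # After locking (e.g. before graph capture), growth is an error.
        self._locked = True

    def get(self, nbytes: int) -> memoryview:
        # Per-call request: returns a view, growing the buffer if still allowed.
        self.reserve(nbytes)
        return memoryview(self._buf)[:nbytes]


def attn_logits_nbytes(num_tokens: int, num_heads: int, num_kv_splits: int,
                       head_dim: int, itemsize: int = 4) -> int:
    # Assumed layout: (tokens, heads, kv_splits, head_dim + 1), float32 elements.
    return num_tokens * num_heads * num_kv_splits * (head_dim + 1) * itemsize


ws = ToyWorkspace()

# Reserve for the worst case (illustrative numbers), then lock.
worst_case = attn_logits_nbytes(num_tokens=8, num_heads=4, num_kv_splits=4, head_dim=64)
ws.reserve(worst_case)
ws.lock()

# A smaller per-call request fits in the reserved buffer without growing it.
view = ws.get(attn_logits_nbytes(num_tokens=2, num_heads=4, num_kv_splits=2, head_dim=64))

# A request beyond the reserved size fails because the workspace is locked.
try:
    ws.get(worst_case + 1)
    grew_after_lock = True
except RuntimeError:
    grew_after_lock = False
```

In this sketch, any request at or below the reserved size succeeds after locking, so the failure path is reached only when the initial reservation underestimates the worst case.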