vllm_omni.attention.fish_kvcache_backend ¶

logger `module-attribute` ¶

logger = init_logger(__name__)

get_fish_kvcache_attn_stats ¶

get_fish_kvcache_attn_stats() -> dict[str, Any]

install_fish_kvcache_attn_backend ¶

install_fish_kvcache_attn_backend(model: Any) -> int

Install the Fish kvcache fast path on this Fish SlowAR model only.

is_fish_kvcache_attn_active_for_model ¶

is_fish_kvcache_attn_active_for_model(
    model_config: Any,
) -> bool

maybe_attach_fish_kvcache_seq_lens_upper_bound ¶

maybe_attach_fish_kvcache_seq_lens_upper_bound(
    *,
    model_config: Any,
    attn_metadata: Any,
    input_batch: Any,
    optimistic_seq_lens_cpu: Tensor,
    num_reqs: int,
    num_reqs_padded: int,
    max_query_len: int,
    pad_attn: bool,
    for_cudagraph_capture: bool = False,
    num_scheduled_tokens_np: Any = None,
) -> None

Attach CPU-side sequence-length upper bounds for the Fish-only Triton decode fast path.
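To make the idea of a per-request upper bound concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the helper name `pad_seq_len_upper_bounds`, the `pad_value` argument, and the choice to fill padded slots with zero are all assumptions made for illustration. It only shows how optimistic per-request lengths for `num_reqs` live requests might be extended to `num_reqs_padded` slots, together with a single batch-wide bound.

```python
from typing import Sequence


def pad_seq_len_upper_bounds(
    optimistic_seq_lens: Sequence[int],
    num_reqs: int,
    num_reqs_padded: int,
    pad_value: int = 0,  # assumption: padded slots carry no tokens
) -> tuple[list[int], int]:
    """Hypothetical helper: return (per-slot bounds, batch-wide max bound)."""
    if not 0 <= num_reqs <= num_reqs_padded:
        raise ValueError("num_reqs must be in [0, num_reqs_padded]")
    if len(optimistic_seq_lens) < num_reqs:
        raise ValueError("fewer sequence lengths than live requests")

    # Keep only the live requests; trailing entries may be stale.
    bounds = [int(n) for n in optimistic_seq_lens[:num_reqs]]
    # Fill the padded slots so the result has a fixed, graph-friendly size.
    bounds.extend([pad_value] * (num_reqs_padded - num_reqs))
    return bounds, max(bounds, default=0)


bounds, max_bound = pad_seq_len_upper_bounds([5, 9, 3, 100], num_reqs=3, num_reqs_padded=4)
```

Keeping the bounds on the CPU means a launcher can size its work from them without reading values back from the device. Whether the real function does this is not stated in this reference.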

prewarm_fish_kvcache_attn_capture_workspaces ¶

prewarm_fish_kvcache_attn_capture_workspaces(
    *,
    model_config: Any,
    device: device,
    dtype: dtype,
    capture_sizes: list[int] | tuple[int, ...],
) -> int

Preallocate Fish attention workspaces used during CUDA graph capture.

reset_fish_kvcache_attn_stats ¶

reset_fish_kvcache_attn_stats() -> None
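
The getter and reset functions pair naturally around a measured region. The sketch below assumes only the two signatures listed on this page (a getter returning `dict[str, Any]` and a reset returning `None`). The wrapper name `collect_stats_for` is hypothetical, and the keys of the returned dictionary are not documented here, so none are assumed.

```python
from typing import Any, Callable


def collect_stats_for(
    get_stats: Callable[[], dict[str, Any]],
    reset_stats: Callable[[], None],
    workload: Callable[[], Any],
) -> tuple[Any, dict[str, Any]]:
    """Hypothetical wrapper: reset counters, run the workload, snapshot stats."""
    reset_stats()  # start from a clean slate
    result = workload()
    # Copy so later resets or updates do not mutate the snapshot.
    return result, dict(get_stats())
```

With `vllm_omni` installed, the same wrapper could be called with `get_fish_kvcache_attn_stats` and `reset_fish_kvcache_attn_stats` as the first two arguments and any callable that runs inference as the third.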