vllm_omni.platforms.npu.quant.kv_quant_npu

FP8 quantization utilities for diffusion attention tensors.

Provides per-tensor dynamic quantization of Q/K/V tensors to float8_e4m3fn format. Designed for diffusion models where Q/K/V are computed fresh each forward pass (no persistent KV cache).
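
To make "per-tensor dynamic quantization" concrete, here is a minimal pure-Python sketch, not the module's implementation. It assumes the common recipe of one scale per tensor, computed at call time as `amax / 448` (448 is the largest finite `float8_e4m3fn` value), and it emulates FP8 rounding in software. The helper names `round_to_e4m3fn` and `quantize_per_tensor` are hypothetical.

```python
import math

FP8_E4M3FN_MAX = 448.0  # largest finite float8_e4m3fn value


def round_to_e4m3fn(x: float) -> float:
    """Round x to the nearest float8_e4m3fn value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    _, exp = math.frexp(abs(x))        # abs(x) = m * 2**exp with 0.5 <= m < 1
    binade = max(exp - 1, -6)          # -6 is the smallest normal exponent
    step = 2.0 ** (binade - 3)         # 3 mantissa bits: 8 steps per binade
    rounded = round(x / step) * step   # Python's round() is half-to-even
    return max(-FP8_E4M3FN_MAX, min(FP8_E4M3FN_MAX, rounded))


def quantize_per_tensor(values: list[float]) -> tuple[list[float], float]:
    """Return (FP8-representable values, scale) using a single tensor-wide scale."""
    amax = max(abs(v) for v in values)
    # Assumption: the scale maps the largest magnitude onto the FP8 maximum.
    scale = amax / FP8_E4M3FN_MAX if amax > 0 else 1.0
    return [round_to_e4m3fn(v / scale) for v in values], scale


# The scale is recomputed from the data on every call ("dynamic").
q, scale = quantize_per_tensor([0.5, -2.0, 4.0, 0.013])
dequantized = [v * scale for v in q]
```

Because the scale is derived from the tensor being quantized, no calibration pass is needed, which suits Q/K/V tensors that are recomputed on each forward pass.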

fp8_rotate_quant_fa

fp8_rotate_quant_fa(
    query: Tensor,
    key: Tensor,
    value: Tensor,
    *,
    layout: str = "BNSD",
    softmax_scale: float | None = None,
) -> Tensor

Run NPU fused attention with dynamically quantized FP8 Q/K/V tensors and optional QuaRot preprocessing.
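
The QuaRot step is not specified further here. As an illustration only, the sketch below assumes it is a Hadamard-style orthogonal rotation applied to both query and key, as in the QuaRot method generally; the actual preprocessing in this module may differ. It shows why such a rotation is safe: the dot product `q · k` is unchanged, while a single outlier channel is spread across all channels, which makes a per-tensor FP8 scale less wasteful. All helper names are hypothetical.

```python
import math


def hadamard(n: int) -> list[list[float]]:
    """Orthonormal Hadamard matrix (Sylvester construction, n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    norm = 1.0 / math.sqrt(n)
    return [[x * norm for x in row] for row in h]


def rotate(vec: list[float], h: list[list[float]]) -> list[float]:
    """Multiply a row vector by the rotation matrix h."""
    return [sum(v * h[i][j] for i, v in enumerate(vec)) for j in range(len(h))]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


h = hadamard(4)
q_vec = [8.0, 0.1, -0.2, 0.05]  # one outlier channel dominates the range
k_vec = [0.3, -0.4, 0.2, 0.1]

# Rotating both operands with the same orthogonal matrix preserves q . k.
q_rot, k_rot = rotate(q_vec, h), rotate(k_vec, h)
score_before = dot(q_vec, k_vec)
score_after = dot(q_rot, k_rot)
```

After rotation the largest magnitude in `q_rot` is roughly half that of `q_vec`, so fewer FP8 levels are spent covering a single outlier.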

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `query` | `Tensor` | Query tensor in `layout` order (default BNSD: batch, heads, seq, dim). | required |
| `key` | `Tensor` | Key tensor in `layout` order (default BNSD: batch, heads, seq, dim). | required |
| `value` | `Tensor` | Value tensor in `layout` order (default BNSD: batch, heads, seq, dim). | required |
| `layout` | `str` | Tensor layout passed to `npu_fused_infer_attention_score_v2`: `BNSD` or `BSND`. | `'BNSD'` |
| `softmax_scale` | `float \| None` | Softmax scaling factor. If `None`, uses `1 / sqrt(head_dim)`. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Attention output in the same layout as inputs. |
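
The following sketch shows how the `layout` string maps onto a 4-D shape and how the default `softmax_scale` follows from the head dimension. It is a standalone illustration in plain Python; `describe_attention_input` is a hypothetical helper, not part of this module.

```python
import math


def describe_attention_input(shape, layout="BNSD", softmax_scale=None):
    """Name each axis of a 4-D shape by layout and resolve the softmax scale."""
    if layout not in ("BNSD", "BSND"):
        raise ValueError(f"unsupported layout: {layout!r}")
    if len(shape) != 4:
        raise ValueError("expected a 4-D shape")
    # B = batch, N = heads, S = sequence length, D = head dimension.
    dims = dict(zip(layout, shape))
    if softmax_scale is None:
        softmax_scale = 1.0 / math.sqrt(dims["D"])
    return dims, softmax_scale


# Same logical tensor described in both layouts; D is last in each.
dims_bnsd, scale = describe_attention_input((2, 24, 4096, 64), layout="BNSD")
dims_bsnd, _ = describe_attention_input((2, 4096, 24, 64), layout="BSND")
```

In both supported layouts the head dimension is the last axis, so the default scale depends only on `shape[-1]`.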

is_quantized_kv_cache

is_quantized_kv_cache(kv_cache_dtype: str | None) -> bool

Returns `True` if the config requests FP8-style KV / QKV quantization for the NPU FA path.
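
The exact strings this check accepts are not documented here. The sketch below is one plausible reading, assuming that FP8-style dtype strings start with `fp8` (for example `fp8` or `fp8_e4m3`) and that `None` or `auto` means no quantization. The function `looks_like_fp8_kv_dtype` is hypothetical and may not match the real predicate.

```python
from typing import Optional


def looks_like_fp8_kv_dtype(kv_cache_dtype: Optional[str]) -> bool:
    """Hypothetical stand-in: True for dtype strings that begin with 'fp8'."""
    if kv_cache_dtype is None:
        return False
    # Assumption: FP8 variants share the 'fp8' prefix; anything else is unquantized.
    return kv_cache_dtype.lower().startswith("fp8")


use_fp8_path = looks_like_fp8_kv_dtype("fp8_e4m3")
use_default_path = not looks_like_fp8_kv_dtype("auto")
```

A caller would typically use such a predicate to choose between the FP8 attention path and the regular one.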