
vllm_omni.platforms.npu.quant.kv_quant_npu

FP8 quantization utilities for diffusion attention tensors.

Provides per-tensor dynamic quantization of Q/K/V tensors to float8_e4m3fn format. Designed for diffusion models where Q/K/V are computed fresh each forward pass (no persistent KV cache).
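To make "per-tensor dynamic quantization" concrete, here is a minimal pure-Python sketch that works on a flat list of floats instead of NPU tensors. It assumes the common convention of scaling by amax / 448 (448 is the largest finite float8_e4m3fn value); whether this module computes its scale exactly this way is not stated above. The helper names `round_to_e4m3fn`, `quantize_per_tensor`, and `dequantize` are hypothetical and are not part of this module.

```python
import math

FP8_E4M3FN_MAX = 448.0  # largest finite float8_e4m3fn value


def round_to_e4m3fn(x: float) -> float:
    """Round x to the nearest float8_e4m3fn-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3FN_MAX)
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    # -6 is the smallest normal exponent; below it the spacing stays fixed (subnormals).
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits -> 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3FN_MAX)


def quantize_per_tensor(values):
    """Assumed scheme: one scale for the whole tensor, recomputed on every call."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3FN_MAX if amax > 0 else 1.0
    return [round_to_e4m3fn(v / scale) for v in values], scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


values = [0.5, -2.0, 1.25, 0.0]
quantized, scale = quantize_per_tensor(values)
restored = dequantize(quantized, scale)
```

Because the scale is derived from the current tensor's maximum magnitude, nothing needs to be calibrated or stored between forward passes, which fits the no-persistent-cache setting described above.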

fp8_rotate_quant_fa

fp8_rotate_quant_fa(
    query: Tensor,
    key: Tensor,
    value: Tensor,
    *,
    layout: str = "BNSD",
    softmax_scale: float | None = None,
) -> Tensor

Run NPU fused attention with dynamically quantized FP8 Q/K/V and an optional QuaRot preprocessing step.
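As background on the QuaRot-style preprocessing mentioned above: the general idea is to rotate Q and K by the same orthogonal matrix (typically a Hadamard matrix) before quantizing. The rotation leaves the attention scores Q·K unchanged while spreading outlier channels across all dimensions, which makes a single per-tensor scale less wasteful. The sketch below illustrates that property in pure Python on single vectors. It is an illustration of the technique, not this function's implementation, and the helper names are hypothetical.

```python
import math


def hadamard(n: int):
    """Orthonormal Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[x * s for x in row] for row in h]


def rotate(vec, h):
    """Compute vec @ h for a single head_dim-sized vector."""
    return [sum(v * h[i][j] for i, v in enumerate(vec)) for j in range(len(vec))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [8.0, 0.1, -0.2, 0.05]  # one outlier channel dominates the range
k = [0.5, -1.0, 0.25, 2.0]
h = hadamard(4)
q_rot, k_rot = rotate(q, h), rotate(k, h)

score_before = dot(q, k)
score_after = dot(q_rot, k_rot)          # equal up to float rounding
peak_before = max(abs(x) for x in q)     # 8.0
peak_after = max(abs(x) for x in q_rot)  # 4.125
```

After rotation the largest magnitude in `q` drops from 8.0 to 4.125, so the remaining channels occupy more of the FP8 range under a shared scale.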

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| query | Tensor | Query tensor in layout order (default BNSD: batch, heads, seq, dim). | required |
| key | Tensor | Key tensor in layout order (default BNSD: batch, heads, seq, dim). | required |
| value | Tensor | Value tensor in layout order (default BNSD: batch, heads, seq, dim). | required |
| layout | str | BNSD or BSND for npu_fused_infer_attention_score_v2. | 'BNSD' |
| softmax_scale | float \| None | If None, uses 1 / sqrt(head_dim). | None |
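The layout strings and the default scale can be illustrated with a small sketch. It assumes BSND means (batch, seq, heads, dim), by analogy with the BNSD order given above. The helpers `describe_shape` and `resolve_softmax_scale` are hypothetical names used only for illustration.

```python
import math

# Axis order for each layout string; the BSND order is assumed by analogy with BNSD.
LAYOUT_AXES = {
    "BNSD": ("batch", "heads", "seq", "dim"),
    "BSND": ("batch", "seq", "heads", "dim"),
}


def describe_shape(shape, layout="BNSD"):
    """Map a 4-D tensor shape to named axes for the given layout."""
    if layout not in LAYOUT_AXES:
        raise ValueError(f"unsupported layout: {layout!r}")
    if len(shape) != 4:
        raise ValueError("expected a 4-D shape")
    return dict(zip(LAYOUT_AXES[layout], shape))


def resolve_softmax_scale(head_dim, softmax_scale=None):
    """Fall back to 1 / sqrt(head_dim) when no scale is given."""
    return 1.0 / math.sqrt(head_dim) if softmax_scale is None else softmax_scale


dims = describe_shape((2, 1024, 16, 64), layout="BSND")
scale = resolve_softmax_scale(dims["dim"])  # 1 / sqrt(64)
```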

Returns:

| Type | Description |
|---|---|
| Tensor | Attention output in the same layout as inputs. |

is_quantized_kv_cache

is_quantized_kv_cache(kv_cache_dtype: str | None) -> bool

Return True if the configured kv_cache_dtype requests FP8-style KV / QKV quantization for the NPU FA path.
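The accepted strings are not listed here. As a rough sketch only, a predicate of this shape could treat any dtype string beginning with "fp8" (such as "fp8", "fp8_e4m3", or "fp8_e5m2") as quantized and everything else, including None and "auto", as unquantized. That matching rule is an assumption, and `looks_like_fp8_kv_dtype` is a hypothetical stand-in, not the actual implementation of is_quantized_kv_cache.

```python
from typing import Optional


def looks_like_fp8_kv_dtype(kv_cache_dtype: Optional[str]) -> bool:
    """Hypothetical check: True for dtype strings that start with 'fp8'."""
    if kv_cache_dtype is None:
        return False
    return kv_cache_dtype.lower().startswith("fp8")


use_fp8_path = looks_like_fp8_kv_dtype("fp8_e4m3")
use_default_path = not looks_like_fp8_kv_dtype("auto")
```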