vllm_omni.platforms.npu.quant.kv_quant_npu
FP8 quantization utilities for diffusion attention tensors.
Provides per-tensor dynamic quantization of Q/K/V tensors to float8_e4m3fn format. Designed for diffusion models where Q/K/V are computed fresh each forward pass (no persistent KV cache).
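The per-tensor dynamic scaling described above can be sketched in plain Python. This is an illustrative sketch, not the module's implementation: the helper names (`per_tensor_scale`, `scale_to_fp8_range`, `dequantize`) and the `eps` guard are hypothetical, and the final cast to float8_e4m3fn (which rounds each value) is omitted. The only fixed fact used is that 448 is the largest finite value in float8_e4m3fn.

```python
FP8_E4M3FN_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def per_tensor_scale(values, eps=1e-12):
    """Return one scale for the whole tensor: amax / FP8 max (hypothetical helper)."""
    amax = max(abs(v) for v in values)
    # eps guards against division by zero for an all-zero tensor (assumption).
    return max(amax, eps) / FP8_E4M3FN_MAX


def scale_to_fp8_range(values):
    """Scale values into [-448, 448]; the real FP8 cast (rounding) is omitted."""
    scale = per_tensor_scale(values)
    scaled = [max(-FP8_E4M3FN_MAX, min(FP8_E4M3FN_MAX, v / scale)) for v in values]
    return scaled, scale


def dequantize(scaled, scale):
    """Undo the scaling by multiplying with the stored per-tensor scale."""
    return [s * scale for s in scaled]


# "Dynamic" means the scale is recomputed from the current tensor on every call.
q = [0.5, -2.0, 1.25, 0.0]
scaled, scale = scale_to_fp8_range(q)
restored = dequantize(scaled, scale)
```

Because the scale is derived from the tensor being quantized, no calibration data is needed, which fits the case where Q/K/V are recomputed on every forward pass.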
fp8_rotate_quant_fa
fp8_rotate_quant_fa(
query: Tensor,
key: Tensor,
value: Tensor,
*,
layout: str = "BNSD",
softmax_scale: float | None = None,
) -> Tensor
Run NPU fused attention with dynamic FP8 Q/K/V and optional QuaRot preprocess.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | Tensor | Query tensor in | required |
| key | Tensor | Key tensor in | required |
| value | Tensor | Value tensor in | required |
| layout | str | | 'BNSD' |
| softmax_scale | float \| None | If None, uses | None |
Returns:
| Type | Description |
|---|---|
| Tensor | Attention output in the same layout as inputs. |
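The QuaRot preprocess mentioned in the function summary is generally based on rotating activations with an orthogonal Hadamard matrix before quantization. The sketch below illustrates that idea in plain Python under stated assumptions: the helpers `hadamard`, `rotate`, and `dot` are hypothetical, and whether `fp8_rotate_quant_fa` applies exactly this rotation is not confirmed by this page. It shows two properties that make such a rotation useful: the query-key dot product is unchanged when both vectors are rotated, and a single outlier channel is spread across all channels.

```python
import math


def hadamard(n):
    """Build a normalized n x n Hadamard matrix via Sylvester's construction."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Block form [[H, H], [H, -H]] doubles the size at each step.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    norm = 1.0 / math.sqrt(n)  # normalization makes the matrix orthogonal
    return [[x * norm for x in row] for row in h]


def rotate(vec, h):
    """Compute the row-vector product vec @ h."""
    n = len(vec)
    return [sum(vec[i] * h[i][j] for i in range(n)) for j in range(n)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


h = hadamard(4)
q = [8.0, 0.1, -0.2, 0.05]  # one large outlier channel
k = [1.0, -0.5, 0.25, 2.0]

q_rot, k_rot = rotate(q, h), rotate(k, h)

score_before = dot(q, k)
score_after = dot(q_rot, k_rot)  # equal up to floating-point error
```

After rotation the largest magnitude in `q_rot` is smaller than the original outlier, so a per-tensor scale wastes less of the FP8 range on a single channel.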