Quantized KV Cache
Overview
In DiT-based image and video generation, Flash Attention (FA) can account for a large share of denoising time, especially for high-resolution or long-frame workloads. vLLM-Omni supports online FP8 quantization of eligible diffusion FA calls, which reduces FA latency while keeping model weights in their original dtype.
This feature is configured through diffusion_kv_cache_dtype on OmniDiffusionConfig (CLI: --diffusion-kv-cache-dtype). It is intentionally not the same as vLLM's --kv-cache-dtype, which controls autoregressive language-model KV cache storage and defaults to "auto". Diffusion FA quantization uses the dedicated diffusion flags so omni serve does not inherit that default.
In vLLM-Omni diffusion pipelines, this is a runtime FA path: Q/K/V tensors are dynamically quantized before the attention operator. It does not quantize model weights and is separate from FP8 W8A8, Int8 W8A8, or pre-quantized checkpoint formats.
If diffusion_kv_cache_dtype is not set, behavior is unchanged and attention runs in the native dtype.
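To make "dynamically quantized before the attention operator" concrete, here is a minimal sketch in plain Python. It is not the NPU kernel. The source does not state which FP8 format or scaling granularity the backend uses, so this sketch assumes E4M3 (maximum finite value 448) and one scale per tensor, and it simulates FP8 rounding in software. All function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (format choice is an assumption)


def round_to_e4m3(x: float) -> float:
    """Simulate rounding x to the nearest FP8 E4M3 value (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate instead of overflowing
    _, e = math.frexp(x)        # |x| = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)        # -6 is the smallest normal exponent; below it, subnormal spacing
    quantum = 2.0 ** (exp - 3)  # spacing between representable values in this binade
    return round(x / quantum) * quantum


def dynamic_quantize(values):
    """Pick a scale from the current tensor contents, then round to FP8.

    "Dynamic" means the scale is computed at runtime from the data itself
    (assumption: one scale per tensor), not calibrated offline.
    """
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map FP8 values back to the original range."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    q_values = [0.3, -1.7, 2.9]          # stand-in for a slice of a Q/K/V tensor
    fp8, scale = dynamic_quantize(q_values)
    print(fp8, scale, dequantize(fp8, scale))
```

With only three mantissa bits, each value carries a relative rounding error of up to a few percent, which is why the skip-step and skip-layer selectors exist for quality-sensitive parts of the model.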
Hardware Support
| Device | FP8 FA |
|---|---|
| Ascend NPU | ✅ |
| NVIDIA GPU | ❌ |
| AMD ROCm | ❌ |
| Intel XPU | ❌ |
Legend: ✅ supported, ❌ unsupported.
FP8 FA is currently implemented only for the NPU Flash Attention backend. Other backends do not support diffusion_kv_cache_dtype="fp8" for diffusion attention and fall back to native dtype execution.
Model Type Support
Diffusion Model
| Model | Scope | Status | Notes |
|---|---|---|---|
| Wan2.2 | Eligible DiT full-attention FA on Ascend NPU | Tested | Compare quality and latency against a BF16 baseline before production use |
| Other diffusion models | Eligible DiT full-attention FA on Ascend NPU | Not tested | You can try diffusion_kv_cache_dtype="fp8"; tune diffusion_kv_cache_skip_steps and diffusion_kv_cache_skip_layers when higher precision is needed |
Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)
Not tested for FP8 FA. Treat any use as experimental unless a model-specific guide documents support.
Multi-Stage Diffusion Model (BAGEL, GLM-Image)
Not tested. If the diffusion stage uses the same NPU Flash Attention backend, diffusion_kv_cache_dtype may apply in theory; validate quality and latency for each stage and model.
Configuration
Offline diffusion example:
```bash
python examples/offline_inference/image_to_video/image_to_video.py \
  --model <your-wan2.2-model> \
  --prompt "A cat sitting on a surfboard at the beach" \
  --height 1280 \
  --width 720 \
  --num-frames 61 \
  --num-inference-steps 4 \
  --ulysses-degree 4 \
  --vae-patch-parallel-size 4 \
  --diffusion-kv-cache-dtype fp8 \
  --diffusion-kv-cache-skip-steps "0,1" \
  --diffusion-kv-cache-skip-layers "0-2"
```
Online serving:
Stage config:
```yaml
stage_args:
  - stage_id: 0
    stage_type: diffusion
    engine_args:
      model_stage: dit
      diffusion_kv_cache_dtype: "fp8"
      diffusion_kv_cache_skip_steps: "0,1"
      diffusion_kv_cache_skip_layers: "0-2"
```
Legacy YAML keys kv_cache_dtype, kv_cache_skip_steps, and kv_cache_skip_layers are still accepted when constructing OmniDiffusionConfig (for example via from_kwargs); prefer the diffusion_* names for new configs.
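When updating older configs, the rename can be done mechanically. The helper below is a hypothetical migration sketch, not part of vLLM-Omni: it rewrites the three legacy keys in an `engine_args` mapping to their `diffusion_*` names. It assumes that an explicitly set `diffusion_*` key should win if both spellings are present; the source does not document the actual precedence.

```python
# Hypothetical migration helper; vLLM-Omni does not ship this function.
LEGACY_TO_DIFFUSION_KEYS = {
    "kv_cache_dtype": "diffusion_kv_cache_dtype",
    "kv_cache_skip_steps": "diffusion_kv_cache_skip_steps",
    "kv_cache_skip_layers": "diffusion_kv_cache_skip_layers",
}


def migrate_legacy_keys(engine_args: dict) -> dict:
    """Return a copy of engine_args with legacy keys renamed to diffusion_* names."""
    migrated = dict(engine_args)
    for old_key, new_key in LEGACY_TO_DIFFUSION_KEYS.items():
        if old_key in migrated:
            value = migrated.pop(old_key)
            # Assumption: a diffusion_* key that is already set takes precedence.
            migrated.setdefault(new_key, value)
    return migrated


if __name__ == "__main__":
    old_args = {"model_stage": "dit", "kv_cache_dtype": "fp8", "kv_cache_skip_steps": "0,1"}
    print(migrate_legacy_keys(old_args))
```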
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| diffusion_kv_cache_dtype | str \| None | None | Set to "fp8" to enable dynamic FP8 FA on supported attention backends |
| diffusion_kv_cache_skip_steps | str \| None | None | Denoising step selector to keep in native dtype, for example "0,1,4-6" |
| diffusion_kv_cache_skip_layers | str \| None | None | Transformer layer selector to keep in native dtype, for example "0-2,10" |
Selectors use comma-separated integers and inclusive ranges. Listed steps or layers skip FP8 FA; all other eligible full-attention forwards use the FP8 path.
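The selector grammar can be illustrated with a short parser. This is a sketch of the rules described above, not vLLM-Omni's actual parsing code; the function names `parse_selector` and `uses_fp8` are hypothetical, and rejecting reversed ranges such as "6-4" is an assumption.

```python
from typing import Optional, Set


def parse_selector(selector: Optional[str]) -> Set[int]:
    """Expand a selector such as "0,1,4-6" into a set of indices."""
    if not selector:
        return set()  # unset selector: nothing is skipped
    indices: Set[int] = set()
    for part in selector.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                # Assumption: reversed ranges are treated as an error.
                raise ValueError(f"invalid range in selector: {part!r}")
            indices.update(range(start, end + 1))  # ranges are inclusive
        else:
            indices.add(int(part))
    return indices


def uses_fp8(step: int, layer: int, skip_steps: Set[int], skip_layers: Set[int]) -> bool:
    """A forward pass takes the FP8 path only if neither its step nor its layer is listed."""
    return step not in skip_steps and layer not in skip_layers


if __name__ == "__main__":
    skip_steps = parse_selector("0,1")
    skip_layers = parse_selector("0-2")
    print(uses_fp8(step=2, layer=5, skip_steps=skip_steps, skip_layers=skip_layers))
```

Under these rules, "0,1,4-6" expands to steps 0, 1, 4, 5, and 6. With the settings from the configuration examples, step 2 at layer 5 would use FP8, while any forward at step 0 or step 1, or at layers 0 through 2, would stay in native dtype.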
Validation and Notes
- Compare generated images or videos against a BF16 baseline with the same seed, prompt, resolution, frame count, and denoising steps.
- Use diffusion_kv_cache_skip_steps for denoising steps where quality is more sensitive.
- Use diffusion_kv_cache_skip_layers for transformer layers that show visible quality regressions.
- Report both latency and quality results when enabling this option for a new model. For image or video models, include visual comparison and quantitative metrics when available, such as PSNR or SSIM.
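For the quantitative comparison, PSNR is simple enough to compute without extra dependencies. The sketch below assumes both outputs are available as flat sequences of pixel values in the 0 to 255 range, generated with identical seed and settings; the function name and input layout are illustrative, and in practice an image library's own metric implementation may be more convenient.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], candidate: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a baseline and an FP8 output."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error over all pixel values.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical outputs
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    bf16_pixels = [100, 100, 100, 100]  # stand-in for the BF16 baseline frame
    fp8_pixels = [101, 99, 100, 102]    # stand-in for the FP8 FA frame
    print(f"PSNR: {psnr(bf16_pixels, fp8_pixels):.2f} dB")
```

Higher values mean the FP8 output is closer to the baseline. For video, one option is to compute the metric per frame and report the minimum alongside the mean, so that a single degraded frame is not averaged away.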