vllm_omni.diffusion.cache.magcache.hook
Hook-based MagCache implementation for vLLM-Omni.
This module implements a diffusers-style hook system for MagCache (Magnitude-based Cache), providing adaptive caching for diffusion model inference.
MagCache speeds up inference by skipping transformer block computations when the accumulated magnitude error is below a threshold, reusing cached residuals instead.
Based on: https://github.com/Zehong-Ma/MagCache
Reference: diffusers/src/diffusers/hooks/mag_cache.py
Architecture:
- MagCacheStrategy: Model-specific strategy for preprocessing/postprocessing
- MagCacheState: Per-step state tracking residuals and accumulated error
- MagCacheHeadHook: Decides whether to skip based on accumulated error
- MagCacheBlockHook: Computes and stores residuals at tail block
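As a rough illustration of the skip rule described above, the following minimal sketch accumulates an error term from per-step magnitude ratios and skips while that error stays under a threshold. It follows the general idea of the upstream MagCache project rather than this module's code: `SkipState`, `should_skip`, `plan_skips`, the running ratio product, and the `max_skip_steps` cap are all assumptions made for illustration, not the vLLM-Omni API.

```python
from dataclasses import dataclass


@dataclass
class SkipState:
    """Hypothetical accumulator; NOT the library's MagCacheState."""

    accumulated_ratio: float = 1.0  # product of magnitude ratios since last full compute
    accumulated_err: float = 0.0    # sum of |1 - accumulated_ratio| since last full compute
    consecutive_skips: int = 0      # steps since last full compute


def should_skip(
    state: SkipState,
    mag_ratio: float,
    threshold: float,
    max_skip_steps: int,
) -> bool:
    """Return True if this step may reuse the cached residual (assumed rule)."""
    state.accumulated_ratio *= mag_ratio
    state.accumulated_err += abs(1.0 - state.accumulated_ratio)
    state.consecutive_skips += 1

    if state.accumulated_err < threshold and state.consecutive_skips <= max_skip_steps:
        return True

    # Error budget exhausted (or too many skips in a row): compute fully and reset.
    state.accumulated_ratio = 1.0
    state.accumulated_err = 0.0
    state.consecutive_skips = 0
    return False


def plan_skips(mag_ratios: list[float], threshold: float, max_skip_steps: int) -> list[bool]:
    """Apply should_skip over a sequence of per-step magnitude ratios."""
    state = SkipState()
    return [should_skip(state, r, threshold, max_skip_steps) for r in mag_ratios]
```

With ratios close to 1.0, several consecutive steps can be skipped before the accumulated error forces a full computation; a threshold of zero disables skipping entirely in this sketch.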
MagCacheBlockHook
Bases: ModelHook
Block hook for MagCache - computes residuals at tail block.
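The residual idea can be sketched without any tensor library. The example below is a simplified, hypothetical model of the behavior: it treats the residual as the difference between the output of the last block and the input to the first, stores it after a full pass, and adds it back to the input on a skipped pass. `ResidualCache` and its `run` method are illustrative names, and plain Python lists stand in for tensors; the actual hook classes may differ.

```python
from typing import Callable, Optional, Sequence

Vector = list[float]
Block = Callable[[Vector], Vector]


class ResidualCache:
    """Hypothetical stand-in for the head/tail hook pair; not the library's classes."""

    def __init__(self) -> None:
        self.residual: Optional[Vector] = None

    def run(self, blocks: Sequence[Block], hidden: Vector, skip: bool) -> Vector:
        # Skipped step: approximate the block stack with input + cached residual.
        if skip and self.residual is not None:
            return [h + r for h, r in zip(hidden, self.residual)]

        # Full step: run every block, then store the residual at the tail.
        out = hidden
        for block in blocks:
            out = block(out)
        self.residual = [o - h for o, h in zip(out, hidden)]
        return out
```

A skip request with an empty cache falls back to a full computation here, which is one reasonable design choice for the first step of a run.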
log_residual_computed
log_residual_computed(
state: MagCacheState,
residual: Tensor | tuple[Tensor, Tensor],
) -> None
perform_calibration
perform_calibration(
state: MagCacheState,
current_residual: Tensor | tuple[Tensor, Tensor],
) -> None
MagCacheHeadHook
apply_mag_cache_hook
apply_mag_cache_hook(
module: Module,
config: MagCacheConfig,
strategy: MagCacheStrategy | None = None,
) -> None
Apply MagCache optimization to a transformer module.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| module | Module | Transformer model to optimize (e.g., FluxTransformer2DModel) | required |
| config | MagCacheConfig | MagCacheConfig specifying caching parameters | required |
| strategy | MagCacheStrategy \| None | Optional strategy to use. If None, will be looked up from registry. | None |
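The fallback for `strategy` can be pictured as a lookup keyed on the module's class. The sketch below is only an assumption about how such a registry might behave: `Strategy`, `register_strategy`, `resolve_strategy`, and keying by class name are hypothetical and are not taken from the vLLM-Omni source.

```python
from typing import Optional


class Strategy:
    """Hypothetical stand-in for a model-specific strategy object."""

    def __init__(self, name: str) -> None:
        self.name = name


# Hypothetical registry mapping a transformer class name to its strategy.
_REGISTRY: dict[str, Strategy] = {}


def register_strategy(model_cls_name: str, strategy: Strategy) -> None:
    _REGISTRY[model_cls_name] = strategy


def resolve_strategy(module: object, strategy: Optional[Strategy] = None) -> Strategy:
    """Prefer an explicit strategy; otherwise look one up by the module's class name."""
    if strategy is not None:
        return strategy
    key = type(module).__name__
    try:
        return _REGISTRY[key]
    except KeyError:
        raise ValueError(f"No strategy registered for {key}") from None
```

Passing a strategy explicitly bypasses the lookup, which is useful for models that have no registered entry.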