vllm_omni.diffusion.offloader.layerwise_backend ¶
LayerWiseOffloadBackend ¶
Bases: OffloadBackend
Layer-wise (block-level) offloading backend.
Implements sliding-window offloading, in which only a small number of transformer blocks reside on the GPU at any time. Each block is prefetched asynchronously while the preceding blocks compute, and is freed after use.
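The schedule described above can be illustrated with a small, device-free simulation. This is only a sketch: the function name `simulate_sliding_window` and the `window` parameter are hypothetical and not part of the backend's API, and real prefetching would run on a separate stream rather than in a Python loop.

```python
def simulate_sliding_window(num_blocks: int, window: int = 2) -> list[tuple[int, list[int]]]:
    """Return, for each step, the block being computed and the blocks resident on device.

    Hypothetical illustration only: block indices stand in for transformer blocks.
    """
    if num_blocks < 1 or window < 1:
        raise ValueError("num_blocks and window must be positive")

    resident: list[int] = [0]  # assume block 0 is loaded before the first step
    trace: list[tuple[int, list[int]]] = []
    for current in range(num_blocks):
        # Prefetch upcoming blocks until the window is full (asynchronous in practice).
        for upcoming in range(current + 1, min(current + window, num_blocks)):
            if upcoming not in resident:
                resident.append(upcoming)
        trace.append((current, list(resident)))
        # Free the block that has just finished computing.
        resident.remove(current)
    return trace


if __name__ == "__main__":
    for block, on_device in simulate_sliding_window(4, window=2):
        print(f"compute block {block}, resident: {on_device}")
```

With a window of two, at most two blocks are resident at any step, regardless of how many blocks the model has.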
get_blocks_attr_names staticmethod ¶
Get the block attribute names from the model class.
LayerwiseOffloadHook ¶
Bases: ModelHook
Hook for layerwise (transformer-block-wise) CPU offloading.
The hook instance holds the parameters of its registered block module and of the next block, together with flattened CPU tensors that record the current block's parameters, so that the parameters can be re-materialized on the device while the transfer overlaps with computation. Register this hook on each transformer block in the DiT module(s) of the target pipeline.
Based on implementations from: https://github.com/sgl-project/sglang/blob/v0.5.8/python/sglang/multimodal_gen/runtime/utils/layerwise_offload.py
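The idea of keeping one flattened CPU buffer per dtype can be sketched without any GPU dependency. The example below uses the standard-library `array` module as a stand-in for tensors, with the array typecode playing the role of a dtype; the helper names `flatten_by_dtype` and `materialize` are hypothetical and do not appear in the hook itself.

```python
from array import array


def flatten_by_dtype(
    params: dict[str, array],
) -> tuple[dict[str, array], dict[str, tuple[str, int, int]]]:
    """Pack parameters into one flat buffer per dtype and record each slice.

    Hypothetical stand-in: array typecodes ("f", "d") play the role of dtypes.
    """
    flat: dict[str, array] = {}
    index: dict[str, tuple[str, int, int]] = {}
    for name, values in params.items():
        buffer = flat.setdefault(values.typecode, array(values.typecode))
        start = len(buffer)
        buffer.extend(values)  # append this parameter to its dtype's buffer
        index[name] = (values.typecode, start, len(values))
    return flat, index


def materialize(
    flat: dict[str, array], index: dict[str, tuple[str, int, int]], name: str
) -> array:
    """Recover one parameter from the flat buffer using its recorded slice."""
    typecode, start, length = index[name]
    return flat[typecode][start:start + length]


if __name__ == "__main__":
    block_params = {
        "attn.weight": array("f", [1.0, 2.0]),
        "attn.bias": array("f", [3.0]),
        "norm.scale": array("d", [4.0]),
    }
    flat, index = flatten_by_dtype(block_params)
    print({code: list(buf) for code, buf in flat.items()})
    print(list(materialize(flat, index, "attn.bias")))
```

Grouping by dtype means each block needs only one contiguous host buffer per dtype, so a block can be moved with a few large copies instead of one copy per parameter.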
dtype_cpu_flattened_weights instance-attribute ¶
dtype_cpu_flattened_weights: dict[dtype, Tensor] = {}
is_materialized property ¶
is_materialized: bool
Check whether this block's parameters hold real data on device.
offload_layer ¶
Free the GPU memory used by this layer by replacing its tensors with empty placeholders. This function does not copy the weights from the GPU back to the CPU.
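A minimal sketch of this placeholder behaviour, using plain Python lists in place of device tensors. The class `BlockSketch` and its `materialize` method are hypothetical; only the names `offload_layer` and `is_materialized` mirror the documented API. The point illustrated is that freeing discards the device copy, and the next materialization reads from the retained CPU copy.

```python
class BlockSketch:
    """Hypothetical stand-in for a hooked block; lists stand in for tensors."""

    def __init__(self, weights: dict[str, list[float]]) -> None:
        # The CPU copy is kept for the lifetime of the block.
        self.cpu_copy = {name: list(values) for name, values in weights.items()}
        # Device parameters start out as empty placeholders.
        self.device_params: dict[str, list[float]] = {name: [] for name in weights}

    @property
    def is_materialized(self) -> bool:
        # Real data is present only if no parameter is an empty placeholder.
        return all(len(values) > 0 for values in self.device_params.values())

    def materialize(self) -> None:
        # Copy from the CPU copy to the "device" (an asynchronous transfer in practice).
        self.device_params = {name: list(values) for name, values in self.cpu_copy.items()}

    def offload_layer(self) -> None:
        # Swap in empty placeholders; nothing is written back to the CPU copy.
        self.device_params = {name: [] for name in self.device_params}


if __name__ == "__main__":
    block = BlockSketch({"weight": [1.0, 2.0]})
    block.materialize()
    block.device_params["weight"][0] = 99.0  # device-side change
    block.offload_layer()
    block.materialize()
    print(block.device_params["weight"])  # the change was not written back
```

Because nothing is written back, this scheme suits weights that are read-only during inference.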
apply_block_hook ¶
apply_block_hook(
    module: Module,
    next_block: Module,
    device: device,
    stream: Stream | None = None,
    pin_memory: bool = True,
) -> LayerwiseOffloadHook
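Since the signature takes both `module` and `next_block`, a caller has to pair each block with its successor. The sketch below shows one way to build those pairs; the helper `pair_with_next` is hypothetical, and wrapping the last block around to the first (so the first block can be prefetched for the next forward pass) is an assumption rather than documented behaviour.

```python
from typing import TypeVar

T = TypeVar("T")


def pair_with_next(blocks: list[T]) -> list[tuple[T, T]]:
    """Pair each block with its successor; the last block wraps to the first (assumed)."""
    return [(block, blocks[(i + 1) % len(blocks)]) for i, block in enumerate(blocks)]


if __name__ == "__main__":
    # Strings stand in for transformer block modules.
    for module, next_block in pair_with_next(["block0", "block1", "block2"]):
        # In real code, each pair would be passed as the `module` and `next_block` arguments.
        print(f"hook on {module}, prefetches {next_block}")
```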