vllm_omni.diffusion.offloader.layerwise_backend

logger module-attribute

logger = init_logger(__name__)

LayerWiseOffloadBackend

Bases: OffloadBackend

Layer-wise (block-level) offloading backend.

Implements sliding-window offloading, in which only a small number of transformer blocks reside on the GPU at a time. Each block is prefetched asynchronously while the preceding blocks compute, and its GPU memory is freed after use.
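The scheduling idea can be illustrated with a toy simulation. This is a minimal sketch, not the backend's implementation: SlidingWindowSim and its methods are hypothetical names, no tensors or CUDA streams are involved, and the window of two blocks with no wrap-around to the first block is an assumption made for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SlidingWindowSim:
    """Toy model of layer-wise offloading; no real tensors are moved."""

    num_blocks: int
    resident: set[int] = field(default_factory=set)  # blocks currently "on GPU"
    peak_resident: int = 0
    trace: list[str] = field(default_factory=list)

    def prefetch(self, idx: int) -> None:
        # Stands in for an asynchronous CPU -> GPU copy on a side stream.
        self.resident.add(idx)
        self.peak_resident = max(self.peak_resident, len(self.resident))
        self.trace.append(f"prefetch {idx}")

    def free(self, idx: int) -> None:
        # Stands in for releasing the block's device memory after use.
        self.resident.discard(idx)
        self.trace.append(f"free {idx}")

    def run_forward(self) -> None:
        self.prefetch(0)  # the first block must be resident before compute starts
        for idx in range(self.num_blocks):
            if idx + 1 < self.num_blocks:
                self.prefetch(idx + 1)  # issued before block idx computes
            assert idx in self.resident, "block must be resident to compute"
            self.trace.append(f"compute {idx}")
            self.free(idx)


sim = SlidingWindowSim(num_blocks=4)
sim.run_forward()
print(sim.peak_resident)  # at most two blocks are resident at once
print(sim.trace)
```

In this sketch the number of resident blocks never exceeds two, regardless of how many blocks the model has, which is the memory saving the sliding window is after.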

copy_stream instance-attribute

copy_stream = Stream()

disable

disable() -> None

enable

enable(pipeline: Module) -> None

get_blocks_attr_names staticmethod

get_blocks_attr_names(model: Module) -> list[str]

Get the block attribute names declared on the model class.

get_blocks_from_dit staticmethod

get_blocks_from_dit(
    model: Module,
) -> tuple[list[str], list[Module]]

Retrieve the blocks and their attribute names from the provided DiT model. The attribute names are read from the _layerwise_offload_blocks_attrs class attribute set on the DiT model. For example:

class WanTransformer3DModel(nn.Module):
    _layerwise_offload_blocks_attrs = ["blocks"]

Returns:

Type: tuple[list[str], list[Module]]
Description: Tuple of (blocks_attr_names, blocks)
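The lookup described above can be sketched with plain Python classes. This is an illustration under stated assumptions, not the library's code: ToyBlock, ToyDiT, toy_get_blocks_attr_names and toy_get_blocks_from_dit are hypothetical names, plain objects stand in for Module instances, and the behavior for a model without the class attribute (an empty result) is assumed.

```python
class ToyBlock:
    """Stand-in for a transformer block (a Module in the real code)."""


class ToyDiT:
    """Stand-in for a DiT model that declares where its blocks live."""

    _layerwise_offload_blocks_attrs = ["blocks"]

    def __init__(self, depth: int) -> None:
        self.blocks = [ToyBlock() for _ in range(depth)]


def toy_get_blocks_attr_names(model: object) -> list[str]:
    # Read the declaration from the class; an undeclared model yields no names
    # (assumed behavior for this sketch).
    return list(getattr(type(model), "_layerwise_offload_blocks_attrs", []))


def toy_get_blocks_from_dit(model: object) -> tuple[list[str], list[object]]:
    names = toy_get_blocks_attr_names(model)
    blocks: list[object] = []
    for name in names:
        # Each named attribute is expected to hold a sequence of blocks.
        blocks.extend(getattr(model, name, []))
    return names, blocks


names, blocks = toy_get_blocks_from_dit(ToyDiT(depth=3))
print(names, len(blocks))
```

Declaring the attribute names on the class keeps the offloading code independent of any particular model's layout: a model with several block lists would simply list more than one name.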

set_blocks_attr_names staticmethod

set_blocks_attr_names(
    model: Module, names: list[str]
) -> None

LayerwiseOffloadHook

Bases: ModelHook

Hook for layerwise (transformer-block-wise) CPU offloading.

The hook instance retains the parameters of both the block it is registered on and the next block, along with flattened CPU tensors that store the current block's parameters, so that the parameters can be re-materialized on the device while the copy overlaps with computation. Register this hook on each transformer block in the DiT module(s) of the target pipeline.

Based on implementations from: https://github.com/sgl-project/sglang/blob/v0.5.8/python/sglang/multimodal_gen/runtime/utils/layerwise_offload.py
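The per-dtype flattened storage mentioned in the description can be sketched as follows. This is a hedged illustration only: flatten_by_dtype and materialize are hypothetical helpers, plain lists and dtype strings stand in for tensors and torch dtypes, and the metadata keys (name, offset, numel) are assumptions, since this page does not show the actual layout of dtype_metadata.

```python
from typing import Any


def flatten_by_dtype(params: dict[str, tuple[str, list[float]]]):
    """Group (dtype, values) entries into one flat buffer per dtype."""
    flat: dict[str, list[float]] = {}
    meta: dict[str, list[dict[str, Any]]] = {}
    for name, (dtype, values) in params.items():
        buf = flat.setdefault(dtype, [])
        # Record where this parameter starts and how many elements it has.
        meta.setdefault(dtype, []).append(
            {"name": name, "offset": len(buf), "numel": len(values)}
        )
        buf.extend(values)
    return flat, meta


def materialize(flat, meta) -> dict[str, list[float]]:
    """Rebuild each named parameter by slicing its dtype's flat buffer."""
    out: dict[str, list[float]] = {}
    for dtype, entries in meta.items():
        for entry in entries:
            start = entry["offset"]
            out[entry["name"]] = flat[dtype][start : start + entry["numel"]]
    return out


params = {
    "attn.weight": ("bf16", [1.0, 2.0, 3.0, 4.0]),
    "attn.bias": ("bf16", [0.5, 0.5]),
    "norm.weight": ("fp32", [1.0, 1.0]),
}
flat, meta = flatten_by_dtype(params)
restored = materialize(flat, meta)
print(sorted(flat), len(flat["bf16"]))
```

Keeping one contiguous buffer per dtype means a block can be transferred with a few large copies rather than one small copy per parameter, which is the usual motivation for this kind of layout.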

copy_stream instance-attribute

copy_stream = stream or current_stream()

device instance-attribute

device = device

dtype_cpu_flattened_weights instance-attribute

dtype_cpu_flattened_weights: dict[dtype, Tensor] = {}

dtype_metadata instance-attribute

dtype_metadata: dict[dtype, list[dict[str, Any]]] = {}

is_materialized property

is_materialized: bool

Check whether this block's parameters hold real data on device.

next_block instance-attribute

next_block = next_block

next_block_buffers instance-attribute

next_block_buffers: dict[str, Tensor] = {}

next_block_parameters instance-attribute

next_block_parameters: dict[str, Parameter] = {}

pin_memory instance-attribute

pin_memory = pin_memory

initialize_hook

initialize_hook(module: Module) -> Module

offload_layer

offload_layer() -> None

Free the layer's GPU memory by replacing its tensors with empty placeholders. This function does not actually copy weights from GPU back to CPU.
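Why no copy back to the CPU is needed can be shown with a toy model. This is a sketch under assumptions rather than the hook's code: ToyParam is a hypothetical class, lists stand in for tensors, and it assumes the weights are not modified on the device (as in inference), so the retained CPU copy stays authoritative.

```python
class ToyParam:
    """Toy parameter with a persistent host copy and a droppable device copy."""

    def __init__(self, cpu_copy: list[float]) -> None:
        self.cpu_copy = list(cpu_copy)  # weights kept on the host at all times
        self.device_data: list[float] = []  # empty placeholder: nothing on device

    @property
    def is_materialized(self) -> bool:
        # Real data is present only when the placeholder has been filled.
        return len(self.device_data) > 0

    def prefetch(self) -> None:
        # Stands in for the CPU -> GPU copy.
        self.device_data = list(self.cpu_copy)

    def offload(self) -> None:
        # Drop the device data; the host copy is untouched, so nothing is copied back.
        self.device_data = []


p = ToyParam([1.0, 2.0])
states = [p.is_materialized]
p.prefetch()
states.append(p.is_materialized)
p.offload()
states.append(p.is_materialized)
print(states)
```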

post_forward

post_forward(module: Module, output: Any) -> Any

pre_forward

pre_forward(
    module: Module, *args: Any, **kwargs: Any
) -> tuple[tuple, dict]

prefetch_layer

prefetch_layer(non_blocking: bool = True) -> None

Copy layer weights from CPU to GPU.

With non_blocking set to True (the default), the target block is prefetched asynchronously so that the memory copy overlaps with compute.

apply_block_hook

apply_block_hook(
    module: Module,
    next_block: Module,
    device: device,
    stream: Stream | None = None,
    pin_memory: bool = True,
) -> LayerwiseOffloadHook

remove_block_hook

remove_block_hook(module: Module) -> None