vllm_omni.diffusion.offloader.layerwise_backend ¶

logger `module-attribute` ¶

logger = init_logger(__name__)

LayerWiseOffloadBackend ¶

Bases: OffloadBackend

Layer-wise (block-level) offloading backend.

Implements sliding-window offloading, in which only a small number of transformer blocks reside on the GPU at a time. Upcoming blocks are prefetched asynchronously while earlier blocks compute, and each block is freed after use.
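The residency pattern described above can be sketched as a small, torch-free simulation. This is a toy model, not the backend's code: the function name `simulate_sliding_window`, the window size of two blocks, and the wrap-around prefetch of block 0 at the end of a pass are all assumptions made for illustration.

```python
from __future__ import annotations


def simulate_sliding_window(
    num_blocks: int, num_passes: int = 1
) -> list[tuple[int, frozenset[int]]]:
    """Return (computing_block, resident_blocks) for every block forward call.

    Hypothetical model: one block computes while the next one is prefetched.
    """
    # Assumption: block 0 is already on the device before the first pass.
    resident: set[int] = {0}
    trace: list[tuple[int, frozenset[int]]] = []
    for _ in range(num_passes):
        for i in range(num_blocks):
            nxt = (i + 1) % num_blocks  # assumed wrap-around to block 0
            # Before block i computes, the copy of the next block is started.
            resident.add(nxt)
            trace.append((i, frozenset(resident)))
            # After block i computes, its device memory is released.
            if nxt != i:
                resident.discard(i)
    return trace


if __name__ == "__main__":
    for block, on_device in simulate_sliding_window(4):
        print(f"computing block {block}, resident: {sorted(on_device)}")
```

In this model at most two blocks are resident at any time, regardless of how many blocks the model has, which is where the memory saving comes from.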

copy_stream `instance-attribute` ¶

copy_stream = current_omni_platform.Stream()

disable ¶

disable() -> None

enable ¶

enable(pipeline: Module) -> None

get_blocks_attr_names `staticmethod` ¶

get_blocks_attr_names(model: Module) -> list[str]

Get block attribute names from model class.

get_blocks_from_dit `staticmethod` ¶

get_blocks_from_dit(
    model: Module,
) -> tuple[list[str], list[Module]]

Retrieve blocks and their attribute names from the provided DiT model. The attribute names are read from the _layerwise_offload_blocks_attrs class attribute defined on DiT models. For example:

class WanTransformer3DModel(nn.Module):
    _layerwise_offload_blocks_attrs = ["blocks"]

Returns:

Type	Description
`tuple[list[str], list[Module]]`	Tuple of (blocks_attr_names, blocks)
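A minimal sketch of how such a lookup might work, using plain Python stand-ins instead of real `nn.Module` classes. `FakeBlock`, `FakeDiT`, and `collect_blocks` are hypothetical names; the actual method may validate its input or handle missing attributes differently.

```python
from __future__ import annotations


class FakeBlock:
    """Hypothetical stand-in for a transformer block."""

    def __init__(self, name: str) -> None:
        self.name = name


class FakeDiT:
    """Hypothetical stand-in for a DiT model that declares its block attributes."""

    _layerwise_offload_blocks_attrs = ["blocks"]

    def __init__(self) -> None:
        self.blocks = [FakeBlock(f"block{i}") for i in range(3)]


def collect_blocks(model: object) -> tuple[list[str], list[object]]:
    # Read the attribute names declared on the model class (empty if absent).
    attr_names = list(getattr(type(model), "_layerwise_offload_blocks_attrs", []))
    blocks: list[object] = []
    for attr in attr_names:
        # Assumption: each named attribute is an iterable of blocks.
        blocks.extend(getattr(model, attr))
    return attr_names, blocks


if __name__ == "__main__":
    names, blocks = collect_blocks(FakeDiT())
    print(names, [b.name for b in blocks])
```

Declaring the names on the class keeps the offloader free of per-model special cases: a new model opts in by listing the attributes that hold its blocks.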

set_blocks_attr_names `staticmethod` ¶

set_blocks_attr_names(
    model: Module, names: list[str]
) -> None

LayerwiseOffloadHook ¶

Bases: ModelHook

Hook for layerwise (transformer-block-wise) CPU offloading.

The hook instance keeps references to the parameters of both the block it is registered on and the next block. It also holds flattened CPU tensors that store the current block's parameters, so that they can be re-materialized on the device while other work is in progress. Register this hook on each transformer block in the DiT module(s) of the target pipeline.

Based on implementations from: https://github.com/sgl-project/sglang/blob/v0.5.8/python/sglang/multimodal_gen/runtime/utils/layerwise_offload.py
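The per-dtype bookkeeping described above can be illustrated with a torch-free toy, where Python lists stand in for tensors and strings stand in for dtypes. The function names and the metadata keys (`name`, `shape`, `offset`, `numel`) are assumptions; the reference only states that there is one flattened tensor and one list of metadata dicts per dtype.

```python
from __future__ import annotations

from math import prod
from typing import Any

# Hypothetical parameter description: name -> (dtype, shape, flat values).
Params = dict[str, tuple[str, tuple[int, ...], list[float]]]


def flatten_by_dtype(
    params: Params,
) -> tuple[dict[str, list[float]], dict[str, list[dict[str, Any]]]]:
    """Pack all parameters of the same dtype into one contiguous buffer."""
    flat: dict[str, list[float]] = {}
    metadata: dict[str, list[dict[str, Any]]] = {}
    for name, (dtype, shape, values) in params.items():
        numel = prod(shape)
        if len(values) != numel:
            raise ValueError(f"{name}: expected {numel} values, got {len(values)}")
        buf = flat.setdefault(dtype, [])
        # Record where this parameter starts inside the shared buffer.
        metadata.setdefault(dtype, []).append(
            {"name": name, "shape": shape, "offset": len(buf), "numel": numel}
        )
        buf.extend(values)
    return flat, metadata


def materialize(
    flat: dict[str, list[float]], metadata: dict[str, list[dict[str, Any]]]
) -> dict[str, tuple[tuple[int, ...], list[float]]]:
    """Rebuild each parameter by slicing its dtype's buffer."""
    out: dict[str, tuple[tuple[int, ...], list[float]]] = {}
    for dtype, entries in metadata.items():
        for m in entries:
            start = m["offset"]
            out[m["name"]] = (m["shape"], flat[dtype][start : start + m["numel"]])
    return out
```

One buffer per dtype means one large host-to-device copy per dtype instead of one small copy per parameter, which is generally friendlier to asynchronous transfers.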

copy_stream `instance-attribute` ¶

copy_stream = (
    stream or current_omni_platform.current_stream()
)

device `instance-attribute` ¶

device = device

dtype_cpu_flattened_weights `instance-attribute` ¶

dtype_cpu_flattened_weights: dict[dtype, Tensor] = {}

dtype_metadata `instance-attribute` ¶

dtype_metadata: dict[dtype, list[dict[str, Any]]] = {}

is_materialized `property` ¶

is_materialized: bool

Check whether this block's parameters hold real data on device.

next_block `instance-attribute` ¶

next_block = next_block

next_block_buffers `instance-attribute` ¶

next_block_buffers: dict[str, Tensor] = {}

next_block_parameters `instance-attribute` ¶

next_block_parameters: dict[str, Parameter] = {}

pin_memory `instance-attribute` ¶

pin_memory = pin_memory

initialize_hook ¶

initialize_hook(module: Module) -> Module

offload_layer ¶

offload_layer() -> None

Free the layer's GPU memory by replacing its tensors with empty placeholders. This function does not copy weights from the GPU back to the CPU.
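Why no copy back is needed can be shown with a toy lifecycle, assuming the CPU copy is kept for the lifetime of the hook and the weights are not modified on the device during inference. `ToyLayer` and its attributes are hypothetical, with lists standing in for tensors; the materialization check here is also an assumption, not the property's actual logic.

```python
from __future__ import annotations


class ToyLayer:
    """Hypothetical layer with a persistent host copy of its weights."""

    def __init__(self, weights: list[float]) -> None:
        # Host copy: created once and treated as the source of truth.
        self.cpu_weights = list(weights)
        # Device-side storage: starts as an empty placeholder.
        self.device_weights: list[float] = []

    @property
    def is_materialized(self) -> bool:
        # Assumed check: the device storage holds the full set of weights.
        has_data = len(self.device_weights) > 0
        return has_data and len(self.device_weights) == len(self.cpu_weights)

    def prefetch(self) -> None:
        # Stand-in for the CPU -> GPU copy.
        self.device_weights = list(self.cpu_weights)

    def offload(self) -> None:
        # Drop the device data; the host copy is untouched, so nothing is lost.
        self.device_weights = []


if __name__ == "__main__":
    layer = ToyLayer([1.0, 2.0, 3.0])
    layer.prefetch()
    layer.offload()
    print(layer.is_materialized, layer.cpu_weights)
```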

post_forward ¶

post_forward(module: Module, output: Any) -> Any

pre_forward ¶

pre_forward(
    module: Module, *args: Any, **kwargs: Any
) -> tuple[tuple, dict]

prefetch_layer ¶

prefetch_layer(non_blocking: bool = True) -> None

Copy layer weights from CPU -> GPU.

When non_blocking is True, the target block is prefetched asynchronously so that the memory copy overlaps with compute.

apply_block_hook ¶

apply_block_hook(
    module: Module,
    next_block: Module,
    device: device,
    stream: Stream | None = None,
    pin_memory: bool = True,
) -> LayerwiseOffloadHook

remove_block_hook ¶

remove_block_hook(module: Module) -> None
