vllm_omni.diffusion.offloader

Modules:

Name Description
base
layerwise_backend
module_collector
sequential_backend

logger module-attribute

logger = init_logger(__name__)

LayerWiseOffloadBackend

Bases: OffloadBackend

Layer-wise (block-level) offloading backend.

Implements sliding window offloading where only a small number of transformer blocks reside on GPU at a time. Blocks are prefetched asynchronously while previous blocks compute, and freed after use.
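The residency pattern described above can be illustrated with a small pure-Python simulation. This is only a sketch, not the backend's implementation: the function name `sliding_window_schedule`, the `window` parameter, and the wrap-around prefetch of block 0 during the last block are all assumptions made for illustration. The real backend moves tensors on a separate CUDA copy stream.

```python
def sliding_window_schedule(num_blocks: int, window: int = 2) -> list[list[int]]:
    """Return, for each compute step, the block indices resident on the GPU.

    Hypothetical helper: step i computes block i while the following
    blocks in the window are being prefetched. Blocks that fall out of
    the window are considered freed.
    """
    if num_blocks <= 0 or window <= 0:
        raise ValueError("num_blocks and window must be positive")

    size = min(window, num_blocks)
    schedule = []
    for i in range(num_blocks):
        # Block i is computing; the next size - 1 blocks are prefetched.
        # Wrapping around (assumption) prepares block 0 for the next
        # denoising step while the last block is still running.
        resident = [(i + k) % num_blocks for k in range(size)]
        schedule.append(resident)
    return schedule


if __name__ == "__main__":
    for step, resident in enumerate(sliding_window_schedule(4, window=2)):
        print(f"step {step}: resident blocks {resident}")
```

With four blocks and a window of two, at most two blocks are resident at any step, which is the memory saving the backend aims for.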

copy_stream instance-attribute

copy_stream = Stream()

disable

disable() -> None

enable

enable(pipeline: Module) -> None

get_blocks_attr_names staticmethod

get_blocks_attr_names(model: Module) -> list[str]

Get block attribute names from model class.

get_blocks_from_dit staticmethod

get_blocks_from_dit(
    model: Module,
) -> tuple[list[str], list[Module]]

Retrieve the blocks and their attribute names from the provided DiT model. The attribute names are read from the _layerwise_offload_blocks_attrs class attribute defined on the DiT model class. For example,

class WanTransformer3DModel(nn.Module):
    _layerwise_offload_blocks_attrs = ["blocks"]

Returns:

Type Description
tuple[list[str], list[Module]]

Tuple of (blocks_attr_names, blocks)
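A minimal sketch of how such a lookup could work, using plain Python objects instead of torch modules. `ToyBlock`, `ToyDiT`, and `collect_blocks` are hypothetical names; only `_layerwise_offload_blocks_attrs` comes from the documentation above, and the real method may validate or handle missing attributes differently.

```python
class ToyBlock:
    """Stand-in for a transformer block (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name


class ToyDiT:
    """Stand-in for a DiT model that declares where its blocks live."""

    _layerwise_offload_blocks_attrs = ["blocks", "refiner_blocks"]

    def __init__(self) -> None:
        self.blocks = [ToyBlock("b0"), ToyBlock("b1")]
        self.refiner_blocks = [ToyBlock("r0")]


def collect_blocks(model: object) -> tuple[list[str], list[ToyBlock]]:
    """Return (attribute names, flattened blocks) declared by the model class."""
    # Read the declaration from the class; assume an empty list if absent.
    attr_names = list(getattr(type(model), "_layerwise_offload_blocks_attrs", []))
    blocks: list[ToyBlock] = []
    for attr in attr_names:
        # Each named attribute is expected to be an iterable of blocks.
        blocks.extend(getattr(model, attr))
    return attr_names, blocks


if __name__ == "__main__":
    names, blocks = collect_blocks(ToyDiT())
    print(names, [b.name for b in blocks])
```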

set_blocks_attr_names staticmethod

set_blocks_attr_names(
    model: Module, names: list[str]
) -> None

ModelLevelOffloadBackend

Bases: OffloadBackend

Model-level (sequential) offloading backend.

Uses SequentialOffloadHook registered via HookRegistry for automatic module swapping.

disable

disable() -> None

enable

enable(pipeline: Module) -> None

OffloadBackend

Bases: ABC

Base class for CPU offload backends.

config instance-attribute

config = config

device instance-attribute

device = device

enabled instance-attribute

enabled = False

disable abstractmethod

disable() -> None

Disable offloading and clean up resources.

Removes all registered hooks. Does NOT move modules back to their original devices (the caller is responsible for that).

enable abstractmethod

enable(pipeline: Module) -> None

Enable offloading on the pipeline.

Discovers modules, moves them to appropriate devices, and registers forward hooks for swapping/prefetching.

Parameters:

Name Type Description Default
pipeline Module

Diffusion pipeline model (e.g., Wan22Pipeline)

required

is_enabled

is_enabled() -> bool
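The enable/disable contract can be sketched with a toy subclass. Everything here is illustrative: `ToyModule`, `ToyOffloadBackend`, and `ToyCpuBackend` are hypothetical stand-ins, hooks are modelled as plain strings, and the constructor signature is inferred from the `config` and `device` attributes listed above. The point of the sketch is the documented behaviour that `disable()` removes hooks but leaves modules where they are.

```python
from abc import ABC, abstractmethod


class ToyModule:
    """Stand-in for a torch module: tracks a device string and hooks."""

    def __init__(self, name: str, device: str = "cuda:0") -> None:
        self.name = name
        self.device = device
        self.hooks: list[str] = []


class ToyOffloadBackend(ABC):
    """Hypothetical mirror of the base-class interface."""

    def __init__(self, config: object, device: str) -> None:
        self.config = config
        self.device = device
        self.enabled = False

    @abstractmethod
    def enable(self, pipeline: list[ToyModule]) -> None: ...

    @abstractmethod
    def disable(self) -> None: ...

    def is_enabled(self) -> bool:
        return self.enabled


class ToyCpuBackend(ToyOffloadBackend):
    def __init__(self, config: object, device: str) -> None:
        super().__init__(config, device)
        self._modules: list[ToyModule] = []

    def enable(self, pipeline: list[ToyModule]) -> None:
        # Discover modules, park them on the CPU, and register swap hooks.
        self._modules = list(pipeline)
        for module in self._modules:
            module.device = "cpu"
            module.hooks.append("swap")
        self.enabled = True

    def disable(self) -> None:
        # Remove hooks only; modules stay on whatever device they are on.
        for module in self._modules:
            module.hooks.clear()
        self._modules = []
        self.enabled = False
```

In this sketch a caller that wants the modules back on the GPU after `disable()` has to move them explicitly.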

OffloadConfig dataclass

pin_cpu_memory class-attribute instance-attribute

pin_cpu_memory: bool = True

strategy instance-attribute

strategy: OffloadStrategy

use_hsdp class-attribute instance-attribute

use_hsdp: bool = False

from_od_config classmethod

from_od_config(
    od_config: OmniDiffusionConfig,
) -> OffloadConfig

Extract and validate offload settings from OmniDiffusionConfig.

For now, enforces mutual exclusion between model-level and layer-wise offloading. Layer-wise takes priority if both are enabled.

Parameters:

Name Type Description Default
od_config OmniDiffusionConfig

OmniDiffusionConfig with offload settings

required

Returns:

Type Description
OffloadConfig

OffloadConfig with validated settings
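The priority rule described above (layer-wise wins when both modes are requested) might look roughly like the following. The enum values match the documented `OffloadStrategy`, but `ToyODConfig` and its two flag names are hypothetical: the actual field names on `OmniDiffusionConfig` are not shown in this page.

```python
from dataclasses import dataclass
from enum import Enum


class OffloadStrategy(Enum):
    NONE = "none"
    MODEL_LEVEL = "model_level"
    LAYER_WISE = "layer_wise"


@dataclass
class ToyODConfig:
    # Hypothetical flag names, used only for this sketch.
    model_level_offload: bool = False
    layer_wise_offload: bool = False


def resolve_strategy(cfg: ToyODConfig) -> OffloadStrategy:
    """Pick exactly one strategy; layer-wise takes priority."""
    if cfg.layer_wise_offload:
        return OffloadStrategy.LAYER_WISE
    if cfg.model_level_offload:
        return OffloadStrategy.MODEL_LEVEL
    return OffloadStrategy.NONE


if __name__ == "__main__":
    both = ToyODConfig(model_level_offload=True, layer_wise_offload=True)
    print(resolve_strategy(both))  # OffloadStrategy.LAYER_WISE
```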

OffloadStrategy

Bases: Enum

LAYER_WISE class-attribute instance-attribute

LAYER_WISE = 'layer_wise'

MODEL_LEVEL class-attribute instance-attribute

MODEL_LEVEL = 'model_level'

NONE class-attribute instance-attribute

NONE = 'none'

apply_sequential_offload

apply_sequential_offload(
    dit_modules: list[Module],
    encoder_modules: list[Module],
    device: device,
    pin_memory: bool = True,
    use_hsdp: bool = False,
) -> None

Apply sequential offloading hooks to DiT and encoder modules.

Registers hooks on modules to implement mutual-exclusion GPU allocation:

- Before DiT runs, encoders are offloaded to CPU.
- Before encoders run, DiT is offloaded to CPU.
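The mutual-exclusion idea can be sketched without torch by modelling each module's device as a string. `ToyModule` and `make_pre_forward_hook` are hypothetical names; the real hooks are `SequentialOffloadHook` instances registered through `HookRegistry`, and they also deal with pinned memory and non-blocking copies, which this sketch ignores.

```python
from typing import Callable


class ToyModule:
    """Stand-in for a torch module with a .to() method."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.device = "cpu"

    def to(self, device: str) -> "ToyModule":
        self.device = device
        return self


def make_pre_forward_hook(
    owner: ToyModule, others: list[ToyModule], device: str
) -> Callable[[], None]:
    """Build a callable to run right before owner's forward pass."""

    def hook() -> None:
        # Evict the competing group first so peak GPU memory stays low.
        for module in others:
            module.to("cpu")
        # Then bring the module that is about to run onto the GPU.
        owner.to(device)

    return hook


if __name__ == "__main__":
    dit, encoder = ToyModule("dit"), ToyModule("text_encoder")
    before_dit = make_pre_forward_hook(dit, [encoder], "cuda:0")
    before_encoder = make_pre_forward_hook(encoder, [dit], "cuda:0")

    before_encoder()
    print(dit.device, encoder.device)
    before_dit()
    print(dit.device, encoder.device)
```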

Parameters:

Name Type Description Default
dit_modules list[Module]

DiT/transformer modules to register hooks on

required
encoder_modules list[Module]

Encoder modules to register hooks on

required
device device

Target GPU device for loading

required
pin_memory bool

Whether to pin CPU memory for faster transfers

True
use_hsdp bool

Whether HSDP is enabled (affects non_blocking behavior)

False
Example

apply_sequential_offload(
    dit_modules=[pipeline.transformer],
    encoder_modules=[pipeline.text_encoder, pipeline.vae],
    device=torch.device("cuda:0"),
)
# Modules of pipeline now automatically swap between CPU and GPU

get_offload_backend

get_offload_backend(
    od_config: OmniDiffusionConfig,
    device: device | None = None,
) -> OffloadBackend | None

Create appropriate offload backend based on configuration.

Parameters:

Name Type Description Default
od_config OmniDiffusionConfig

OmniDiffusionConfig with offload settings

required
device device | None

Target device (auto-detected if None)

None

Returns:

Type Description
OffloadBackend | None

OffloadBackend instance or None if offloading disabled

Example

backend = get_offload_backend(od_config, device=torch.device("cuda:0"))
if backend:
    backend.enable(pipeline)

remove_sequential_offload

remove_sequential_offload(modules: list[Module]) -> None

Remove sequential offloading hooks from modules.

Parameters:

Name Type Description Default
modules list[Module]

Modules to remove hooks from

required
Example

all_modules = [*dit_modules, *encoder_modules]
remove_sequential_offload(all_modules)