llmcompressor.pipelines.cache
Classes:
-
IntermediateValue–Dataclass which recursively defines offloaded values and which device to onload to
-
IntermediatesCache–Cache which stores intermediate values (activations) produced by batched, sequential execution of models.
-
OverrideEqMode–When using a torch.Tensor as a key in a dictionary, the equality check must return a single value instead of a torch.Tensor of bool values.
IntermediateValue
dataclass
Dataclass which recursively defines offloaded values and which device to onload to
Parameters:
-
value(Tensor | 'IntermediateValue' | Any) –either an offloaded Tensor, a primitive value, or a recursable value
-
device(device | None) –if the value is a Tensor, then the device to onload the tensor to, otherwise None
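The recursive structure described above can be illustrated with a minimal sketch. It is not the library's implementation: `FakeTensor`, `Wrapped`, `offload`, and `onload` are hypothetical stand-ins so the example runs without torch, and only tuples are recursed into.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor: some data plus a device name."""
    data: list
    device: str

    def to(self, device: str) -> "FakeTensor":
        return FakeTensor(self.data, device)


@dataclass
class Wrapped:
    """Toy analogue of IntermediateValue (not the library class)."""
    value: Any
    device: Optional[str]  # device to onload to; None for non-tensor values


def offload(value: Any, offload_device: str) -> Wrapped:
    if isinstance(value, FakeTensor):
        # Remember the original device so the tensor can be restored on fetch.
        return Wrapped(value.to(offload_device), device=value.device)
    if isinstance(value, tuple):
        # Recursable value: wrap each element individually.
        return Wrapped(tuple(offload(v, offload_device) for v in value), device=None)
    # Primitive value: stored unchanged.
    return Wrapped(value, device=None)


def onload(wrapped: Wrapped) -> Any:
    value = wrapped.value
    if isinstance(value, FakeTensor):
        return value.to(wrapped.device)
    if isinstance(value, tuple):
        return tuple(onload(v) for v in value)
    return value


original = (FakeTensor([1, 2], "cuda:0"), 3, (FakeTensor([4], "cuda:1"),))
stored = offload(original, "cpu")
restored = onload(stored)
```

Each tensor keeps its own target device, so a nested value whose tensors live on different devices round-trips to the same placement.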
IntermediatesCache
IntermediatesCache(
batch_intermediates: list[IntermediateValues]
| None = None,
offload_device: device | None = "cpu",
)
Cache which stores intermediate values (activations) produced by batched, sequential
execution of models. Values are offloaded to the offload_device when stored in
the cache and onloaded to their original device when fetched from the cache. If
offload_device is None, values will not be offloaded at all.
Currently supports nested offloading of dataclass instances and tuples.
Construct using the empty and from_dataloader class methods.
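To make the batch-indexed semantics concrete, here is a toy sketch of how append, update, fetch, and delete interact. `ToyCache` is a hypothetical stand-in written for this illustration; it stores plain Python values and omits offloading and onloading entirely, so it should not be read as the behaviour of the real class beyond what the method descriptions state.

```python
from typing import Any, Optional


class ToyCache:
    """Illustrative stand-in for IntermediatesCache (no device handling)."""

    def __init__(self, num_batches: int = 0):
        # One dict of named values per batch index.
        self.batches = [{} for _ in range(num_batches)]

    def append(self, values: dict) -> None:
        # New values are assigned the next available batch index.
        self.batches.append(dict(values))

    def update(self, batch_index: int, values: dict) -> None:
        self.batches[batch_index].update(values)

    def fetch(self, batch_index: int, input_names: Optional[list] = None) -> dict:
        batch = self.batches[batch_index]
        names = list(batch.keys()) if input_names is None else input_names
        return {name: batch[name] for name in names}

    def delete(self, batch_index: int, consumed_names: Optional[list] = None) -> None:
        batch = self.batches[batch_index]
        # With no names given, every key of the batch is removed.
        names = list(batch.keys()) if consumed_names is None else consumed_names
        for name in names:
            del batch[name]


cache = ToyCache(num_batches=2)
cache.update(0, {"input_ids": [1, 2], "attention_mask": [1, 1]})
cache.update(1, {"input_ids": [3, 4], "attention_mask": [1, 1]})
cache.append({"input_ids": [5, 6]})  # stored at batch index 2

subset = cache.fetch(0, input_names=["input_ids"])
cache.delete(0, consumed_names=["attention_mask"])
cache.delete(1)  # drops every key of batch 1
```

Deleting keys as soon as they have been consumed is what keeps memory bounded when activations are passed from one stage of a sequential pipeline to the next.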
Methods:
-
append–Append new values to the cache. The new values will be assigned the next available batch index
-
delete–Delete values from the cache
-
empty–Construct an empty cache
-
fetch–Fetch values belonging to a batch
-
from_dataloader–Initialize a cache with data from the provided dataloader
-
iter_prefetch–Iterate over batches with the next batch prefetched in a background thread.
-
size–Returns the memory used by cached values, keyed by device, in bytes
-
update–Update/put values belonging to a batch
Source code in src/llmcompressor/pipelines/cache.py
append
Append new values to the cache. The new values will be assigned the next available batch index
Parameters:
-
values(dict[str, Any]) –dictionary mapping keys to values used for update
delete
Delete values from the cache
Parameters:
-
batch_index(int) –index of batch whose values will be deleted
-
consumed_names(list[str] | None, default:None) –list of keys whose values will be deleted, defaults to removing all keys
empty
classmethod
Construct an empty cache
Parameters:
-
num_batches(int) –the expected number of batches to be stored
-
offload_device(device) –device to offload values to
fetch
Fetch values belonging to a batch
Parameters:
-
batch_index(int) –index of batch whose values are being fetched
-
input_names(list[str] | None, default:None) –list of keys whose values are being fetched
Returns:
-
dict[str, Any]–dictionary mapping keys to onloaded values
from_dataloader
classmethod
from_dataloader(
dataloader: DataLoader,
model_device: device = torch.device("cpu"),
offload_device: device | None = torch.device("cpu"),
)
Initialize a cache with data from the provided dataloader
This method iterates through all batches in the dataloader and offloads them to the specified device. For faster cache preparation, consider:
- Increasing batch_size to reduce the number of iterations
- Using num_workers > 0 in the DataLoader for parallel loading (e.g. the calibration DataLoader from format_calibration_data uses dataloader_num_workers; when > 0, pin_memory and prefetch_factor are also set where applicable, which speeds both cache build and calibration)
- Ensuring data preprocessing is done before creating the dataloader
Parameters:
-
dataloader(DataLoader) –dataloader which generates values to be cached
-
model_device(device, default:device('cpu')) –device which values will be onloaded to when fetched
-
offload_device(device | None, default:device('cpu')) –device to offload values to
iter_prefetch
Iterate over batches with the next batch prefetched in a background thread. Overlaps onload from offload_device with consumption of the current batch, which can reduce wall-clock time when offloading to CPU.
When CUDA is available, uses non_blocking transfers (requires pinned CPU tensors, set up by _offload_value) and synchronises via CUDA events so the main stream waits for each H2D copy before running GPU kernels on the data.
Yields the same fetched batch dicts as iter; only the timing of onloads differs.
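The overlap of loading and consumption can be sketched with a single background worker. This is an assumption-laden illustration rather than the library's code: `iter_with_prefetch` and `slow_fetch` are hypothetical names, the "transfer" is simulated with a short sleep, and the CUDA-specific parts (non_blocking copies, pinned memory, event synchronisation) are not modelled.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator


def iter_with_prefetch(fetch: Callable[[int], dict], num_batches: int) -> Iterator[dict]:
    """Yield fetch(0), fetch(1), ... while the next batch loads in the background."""
    if num_batches == 0:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch, 0)
        for index in range(num_batches):
            current = future.result()  # wait until this batch has finished loading
            if index + 1 < num_batches:
                # Start loading the next batch before handing out the current one.
                future = pool.submit(fetch, index + 1)
            yield current


def slow_fetch(index: int) -> dict:
    time.sleep(0.01)  # pretend this is a transfer from the offload device
    return {"batch": index}


batches = list(iter_with_prefetch(slow_fetch, 3))
```

The consumer sees the same batches in the same order as a plain loop over `slow_fetch`; the only difference is that the wait for batch `i + 1` overlaps with the work done on batch `i`.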
size
Returns the memory used by cached values, keyed by device, in bytes
Returns:
-
dict[device, int]–dictionary mapping torch device to number of bytes in cache
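As a rough sketch of what a per-device byte count involves, the example below sums element count times element size for every tensor-like value, grouped by device. `FakeTensor` and `size_by_device` are hypothetical names used so the example runs without torch; with real tensors the equivalent quantities would come from `tensor.numel()` and `tensor.element_size()`. Nested values are not traversed here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor with only the fields needed here."""
    numel: int         # number of elements
    element_size: int  # bytes per element
    device: str


def size_by_device(batches: list) -> dict:
    sizes = defaultdict(int)
    for batch in batches:
        for value in batch.values():
            # Non-tensor values (flags, ints, None) are skipped.
            if isinstance(value, FakeTensor):
                sizes[value.device] += value.numel * value.element_size
    return dict(sizes)


example_batches = [
    {"hidden": FakeTensor(1024, 2, "cpu"), "mask": FakeTensor(16, 8, "cpu")},
    {"hidden": FakeTensor(1024, 2, "cuda:0"), "use_cache": False},
]
totals = size_by_device(example_batches)
```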
update
Update/put values belonging to a batch
Parameters:
-
batch_index(int) –index of batch whose values will be updated
-
values(dict[str, Any]) –dictionary mapping keys to values used for update
OverrideEqMode
Bases: TorchDispatchMode
When using a torch.Tensor as a key in a dictionary, the equality check must return a single value instead of a torch.Tensor of bool values. Use this override context for such cases, to swap out the torch.eq equality check for a check on id
>>> a = torch.tensor([1,2,3])
>>> b = torch.tensor([1,2,3])
>>> a == b
tensor([True, True, True])
>>> with OverrideEqMode():
...     a == b
tensor(True)