llmcompressor.utils
General utility functions used throughout LLM Compressor.
Modules:
- dev
- dist
- helpers – General utility helper functions.
- metric_logging – Utility functions for metrics logging and GPU memory monitoring.
- pytorch
- transformers
Functions:
- DisableQuantization – Disable quantization during forward passes after applying a quantization config.
- calibration_forward_context – Context in which all calibration forward passes should occur.
- disable_cache – Temporarily disable the key-value cache for transformer models.
- disable_hf_kernels – In transformers>=4.50.0, some module forward methods may be replaced by calls to hf hub kernels.
- disable_lm_head – Disable the lm_head of a model by moving it to the meta device.
- dispatch_for_generation – Dispatch a model for autoregressive generation.
- eval_context – Disable PyTorch training mode for the given module.
- get_embeddings – Returns input and output embeddings of a model.
- greedy_bin_packing – Distribute items across bins using a greedy bin-packing heuristic.
- import_from_path – Import a function or class from a path of the form "module:name".
- patch_transformers_logger_level – Context under which the transformers logger's level is modified.
- skip_weights_download – Context manager under which models are initialized without having to download the model weight files.
- targets_embeddings – Returns True if the given targets target the word embeddings of the model.
- untie_word_embeddings – Untie word embeddings, if possible.
- wait_for_comms – Block until all pending async distributed operations complete.
DisableQuantization
Disable quantization during forward passes after applying a quantization config
Source code in src/llmcompressor/utils/helpers.py
calibration_forward_context
Context in which all calibration forward passes should occur.
- Remove gradient calculations
- Disable the KV cache
- Disable train mode and enable eval mode
- Disable hf kernels which could bypass hooks
- Disable lm head (input and weights can still be calibrated, output will be meta)
Source code in src/llmcompressor/utils/helpers.py
disable_cache
Temporarily disable the key-value cache for transformer models. Used to prevent excess memory use in one-shot cases where the model only performs the prefill phase and not the generation phase.
Example:
    import torch
    from transformers import AutoModel
    from llmcompressor.utils import disable_cache

    model = AutoModel.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    input = torch.randint(0, 32, size=(1, 32))
    with disable_cache(model):
        output = model(input)
Source code in src/llmcompressor/utils/helpers.py
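The "temporarily disable, then restore" behavior described above can be illustrated with a small stand-alone sketch. It assumes the cache is controlled by a `use_cache` flag on the model config, as is common for Hugging Face models; the text above does not confirm that this is how disable_cache works internally. `disable_cache_sketch` and the `SimpleNamespace` stand-in model are hypothetical names used only for illustration.

```python
import contextlib
from types import SimpleNamespace


@contextlib.contextmanager
def disable_cache_sketch(model):
    """Illustrative only: assumes the KV cache is toggled via model.config.use_cache."""
    config = model.config
    previous = config.use_cache  # remember the caller's setting
    config.use_cache = False
    try:
        yield
    finally:
        # Restore the original value even if the forward pass raised.
        config.use_cache = previous


# Hypothetical stand-in for a model object; a real model would be a PreTrainedModel.
fake_model = SimpleNamespace(config=SimpleNamespace(use_cache=True))

with disable_cache_sketch(fake_model):
    inside = fake_model.config.use_cache
after = fake_model.config.use_cache
```

The try/finally is the important part of the pattern: the original setting comes back even when calibration fails partway through.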
disable_hf_kernels
In transformers>=4.50.0, some module forward methods may be replaced by calls to hf hub kernels. This has the potential to bypass hooks added by LLM Compressor
Source code in src/llmcompressor/utils/helpers.py
disable_lm_head
Disable the lm_head of a model by moving it to the meta device. This function does not untie parameters, and it restores the model to its properly loaded state upon exit.
Source code in src/llmcompressor/utils/helpers.py
dispatch_for_generation
Dispatch a model for autoregressive generation. This means that modules are dispatched evenly across available devices and kept onloaded if possible.
Parameters:
- model – model to dispatch
- hint_batch_size – reserve memory for the batch size of inputs
- hint_batch_seq_len – reserve memory for the sequence length of inputs
- hint_model_dtype – reserve memory for the model's dtype. Will be inferred from the model if none is provided
- hint_extra_memory – extra memory reserved for model serving
- no_split_modules – names of module classes which should not be split across multiple devices
Returns:
- PreTrainedModel – dispatched model
Source code in src/llmcompressor/utils/dev.py
eval_context
Disable PyTorch training mode for the given module
Source code in src/llmcompressor/utils/helpers.py
get_embeddings
Returns the input and output embeddings of a model. If get_input_embeddings or get_output_embeddings is not implemented on the model, None is returned in place of the corresponding embedding.
Parameters:
- model (PreTrainedModel) – model to get embeddings from
Returns:
- tuple[Module | None, Module | None] – tuple containing the input and output embedding modules, or None for each one that is unavailable
Source code in src/llmcompressor/utils/transformers.py
greedy_bin_packing
Distribute items across bins using a greedy bin-packing heuristic.
Items are sorted by weight in descending order, then each item is assigned to the bin with the smallest current total weight. This approximates an even distribution of weight across bins.
Parameters:
- items – items to distribute. Sorted in-place by descending weight.
- num_bins – number of bins to distribute items across.
- item_weight_fn – callable that returns the weight of an item. Defaults to uniform weight of 1.
Returns:
- tuple[list[T], list[list[T]], dict[T, int]] – a 3-tuple of:
  - items: the input list, now sorted by descending weight.
  - bin_to_items: list of length num_bins where each element is the list of items assigned to that bin.
  - item_to_bin: mapping from each item to its assigned bin index.
Source code in src/llmcompressor/utils/dist.py
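The heuristic described above (sort by descending weight, then always place the next item in the lightest bin) can be sketched in a few lines. This is a minimal re-implementation written from the description, not the library's source; `greedy_bin_packing_sketch` is a hypothetical name, and breaking ties by lowest bin index is an assumption.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def greedy_bin_packing_sketch(
    items: list[T],
    num_bins: int,
    item_weight_fn: Callable[[T], float] = lambda _item: 1,
) -> tuple[list[T], list[list[T]], dict[T, int]]:
    # Sort in place, heaviest first, as the description states.
    items.sort(key=item_weight_fn, reverse=True)

    bin_to_items: list[list[T]] = [[] for _ in range(num_bins)]
    bin_weights = [0.0] * num_bins
    item_to_bin: dict[T, int] = {}

    for item in items:
        # Choose the bin with the smallest running total.
        # Assumption: ties go to the lowest bin index.
        target = min(range(num_bins), key=bin_weights.__getitem__)
        bin_to_items[target].append(item)
        bin_weights[target] += item_weight_fn(item)
        item_to_bin[item] = target

    return items, bin_to_items, item_to_bin


# Usage: weight each string by its length and split across two bins.
words = ["dd", "aaaaa", "e", "ccc", "bbbb"]
sorted_words, bins, assignment = greedy_bin_packing_sketch(words, 2, len)
# bins == [["aaaaa", "dd", "e"], ["bbbb", "ccc"]]  (total weights 8 and 7)
```

Because the heaviest items are placed first, the later, lighter items are free to even out whatever imbalance remains.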
import_from_path
Import a function or class given a path that contains the module and the object name, separated by a colon (":").
Examples:
    path = "/path/to/file.py:func_or_class_name"
    path = "/path/to/file:focn"
    path = "path.to.file:focn"
Parameters:
- path (str) – path including the file path and object name
Source code in src/llmcompressor/utils/helpers.py
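One way to resolve the three path forms listed above is sketched here using only the standard library. This illustrates the general technique and is not the library's implementation; `import_from_path_sketch` is a hypothetical name, and the rule used to tell file paths from dotted module paths (presence of a path separator or a `.py` suffix) is an assumption.

```python
import importlib
import importlib.util
import os
from pathlib import Path
from typing import Any


def import_from_path_sketch(path: str) -> Any:
    # Split on the last colon so the object name is always the final part.
    module_part, sep, object_name = path.rpartition(":")
    if not sep or not module_part or not object_name:
        raise ValueError(f"expected '<module or file>:<name>', got {path!r}")

    # Assumption: anything with a path separator or a .py suffix is a file.
    looks_like_file = (
        "/" in module_part or os.sep in module_part or module_part.endswith(".py")
    )
    if looks_like_file:
        file_path = Path(module_part)
        if file_path.suffix != ".py":
            # Accept "/path/to/file" as shorthand for "/path/to/file.py".
            file_path = file_path.with_name(file_path.name + ".py")
        spec = importlib.util.spec_from_file_location(file_path.stem, file_path)
        if spec is None or spec.loader is None:
            raise ImportError(f"cannot load module from {file_path}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    else:
        # Dotted form, resolved through the normal import system.
        module = importlib.import_module(module_part)

    return getattr(module, object_name)


# Usage with the dotted form, pointing at a standard-library function.
join = import_from_path_sketch("os.path:join")
```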
patch_transformers_logger_level
Context under which the transformers logger's level is modified
This can be used with skip_weights_download to squelch warnings related to missing parameters in the checkpoint.
Parameters:
- level (int, default: ERROR) – new logging level for the transformers logger. Logs whose level is below this level will not be logged
Source code in src/llmcompressor/utils/dev.py
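The pattern behind a context manager like this one can be sketched with the standard `logging` module alone. The sketch assumes the relevant logger is the one named "transformers"; the exact logger the library patches is not stated above. `patch_logger_level_sketch` is a hypothetical name.

```python
import contextlib
import logging


@contextlib.contextmanager
def patch_logger_level_sketch(name: str = "transformers", level: int = logging.ERROR):
    """Illustrative only: temporarily raise a named logger's level, then restore it."""
    logger = logging.getLogger(name)  # assumption: the logger name is "transformers"
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Put the original level back, even if the body raised.
        logger.setLevel(previous)


# Usage: warnings are suppressed inside the block and allowed again afterwards.
demo_logger = logging.getLogger("transformers")
demo_logger.setLevel(logging.INFO)
with patch_logger_level_sketch(level=logging.ERROR):
    suppressed = not demo_logger.isEnabledFor(logging.WARNING)
restored_level = demo_logger.level
```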
skip_weights_download
Context manager under which models are initialized without having to download the model weight files. This differs from init_empty_weights in that weights are allocated on their assigned devices with random values, as opposed to being on the meta device.
Parameters:
- model_class (Type[PreTrainedModel], default: AutoModelForCausalLM) – class to patch, defaults to AutoModelForCausalLM
Source code in src/llmcompressor/utils/dev.py
targets_embeddings
targets_embeddings(
model: PreTrainedModel,
targets: NamedModules,
check_input: bool = True,
check_output: bool = True,
) -> bool
Returns True if the given targets target the word embeddings of the model
Parameters:
- model (PreTrainedModel) – model containing word embeddings
- targets (NamedModules) – named modules to check
- check_input (bool, default: True) – whether to check if input embeddings are targeted
- check_output (bool, default: True) – whether to check if output embeddings are targeted
Returns:
- bool – True if embeddings are targeted, False otherwise
Source code in src/llmcompressor/utils/transformers.py
untie_word_embeddings
Untie word embeddings, if possible. This function raises a warning if embeddings cannot be found in the model definition.
The model config will be updated to reflect that embeddings are now untied.
Parameters:
- model (PreTrainedModel) – transformers model containing word embeddings
Source code in src/llmcompressor/utils/transformers.py
wait_for_comms
Block until all pending async distributed operations complete.
Calls wait() on each work handle, then clears the list in-place so it can be reused for the next batch of operations.
Parameters:
- pending_comms – mutable list of async communication handles (returned by dist.reduce, dist.broadcast, etc. with async_op=True). The list is cleared after all operations have completed.
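The wait-then-clear behavior described above is small enough to sketch directly. To keep the example runnable without a distributed setup, it uses `FakeWork`, a hypothetical stand-in for the handles that calls such as dist.reduce return with async_op=True; `wait_for_comms_sketch` is likewise an illustrative name, not the library function.

```python
class FakeWork:
    """Hypothetical stand-in for an async work handle; not a torch class."""

    def __init__(self) -> None:
        self.done = False

    def wait(self) -> None:
        # A real handle would block here until the collective finishes.
        self.done = True


def wait_for_comms_sketch(pending_comms: list) -> None:
    # Block on every outstanding operation.
    for handle in pending_comms:
        handle.wait()
    # Clear in place so callers holding the same list see it emptied.
    pending_comms.clear()


# Usage: the same list object can be reused for the next batch of operations.
handles = [FakeWork(), FakeWork()]
pending = list(handles)
wait_for_comms_sketch(pending)
```

Clearing in place (rather than rebinding to a new list) matters when several call sites share one list of pending handles.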