llmcompressor.modifiers.transform.spinquant
Modules:
-
base– -
mappings– -
norm_mappings–
Classes:
-
Event–A class for defining an event that can be triggered during sparsification.
-
EventType–An Enum for defining the different types of events that can be triggered
-
Modifier–A base class for all modifiers to inherit from.
-
NormMapping–SpinQuant needs to know where every norm layer exists in the model,
-
SpinQuantMapping–SpinQuant needs to know the entire architecture of the model,
-
SpinQuantModifier–Implements the transforms according to "SpinQuant: LLM quantization
-
State–State class holds information about the current compression state.
Functions:
-
center_embeddings–Shift each embedding to have a mean of zero
-
fuse_norm_linears–Fuse the scaling operation of norm layer into subsequent linear layers.
-
untie_word_embeddings–Untie word embeddings, if possible. This function raises a warning if
Event
dataclass
Event(
type_: Optional[EventType] = None,
steps_per_epoch: Optional[int] = None,
batches_per_step: Optional[int] = None,
invocations_per_step: int = 1,
global_step: int = 0,
global_batch: int = 0,
)
A class for defining an event that can be triggered during sparsification.
Parameters:
-
type_(Optional[EventType], default:None) –The type of event.
-
steps_per_epoch(Optional[int], default:None) –The number of steps per epoch.
-
batches_per_step(Optional[int], default:None) –The number of batches per step where step is an optimizer step invocation. For most pathways, these are the same. See the invocations_per_step parameter for more details when they are not.
-
invocations_per_step(int, default:1) –The number of invocations of the step wrapper before optimizer.step was called. Generally can be left as 1 (default). For older amp pathways, this is the number of times the scaler wrapper was invoked before the wrapped optimizer step function was called to handle accumulation in fp16.
-
global_step(int, default:0) –The current global step.
-
global_batch(int, default:0) –The current global batch.
Methods:
-
new_instance–Creates a new instance of the event with the provided keyword arguments.
-
should_update–Determines if the event should trigger an update.
Attributes:
-
current_index(float) –Calculates the current index of the event.
-
epoch(int) –Calculates the current epoch.
-
epoch_based(bool) –Determines if the event is based on epochs.
-
epoch_batch(int) –Calculates the current batch within the current epoch.
-
epoch_full(float) –Calculates the current epoch with the fraction of the current step.
-
epoch_step(int) –Calculates the current step within the current epoch.
current_index
property
writable
Calculates the current index of the event.
Returns:
-
float–The current index of the event, which is either the global step or the epoch with the fraction of the current step.
Raises:
-
ValueError–if the event is not epoch based or if the steps per epoch are too many.
epoch
property
Calculates the current epoch.
Returns:
-
int–The current epoch.
Raises:
-
ValueError–if the event is not epoch based.
epoch_based
property
Determines if the event is based on epochs.
Returns:
-
bool–True if the event is based on epochs, False otherwise.
epoch_batch
property
Calculates the current batch within the current epoch.
Returns:
-
int–The current batch within the current epoch.
Raises:
-
ValueError–if the event is not epoch based.
epoch_full
property
Calculates the current epoch with the fraction of the current step.
Returns:
-
float–The current epoch with the fraction of the current step.
Raises:
-
ValueError–if the event is not epoch based.
epoch_step
property
Calculates the current step within the current epoch.
Returns:
-
int–The current step within the current epoch.
Raises:
-
ValueError–if the event is not epoch based.
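The epoch-related properties above can be read as simple arithmetic on global_step and steps_per_epoch. The following is a minimal sketch under that assumption, not the library's implementation: ToyEvent is a hypothetical stand-in, the floor-division and modulo formulas are assumed, and epoch_batch is left out because its formula is not given here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyEvent:
    """Hypothetical stand-in for Event; only the fields used below."""

    steps_per_epoch: Optional[int] = None
    global_step: int = 0

    @property
    def epoch_based(self) -> bool:
        # An event is epoch based once steps_per_epoch is known.
        return self.steps_per_epoch is not None

    def _require_epochs(self) -> None:
        if not self.epoch_based:
            raise ValueError("event is not epoch based")

    @property
    def epoch(self) -> int:
        self._require_epochs()
        return self.global_step // self.steps_per_epoch

    @property
    def epoch_step(self) -> int:
        self._require_epochs()
        return self.global_step % self.steps_per_epoch

    @property
    def epoch_full(self) -> float:
        self._require_epochs()
        return self.global_step / self.steps_per_epoch

    @property
    def current_index(self) -> float:
        # Fractional epoch when epoch based, otherwise the global step.
        return self.epoch_full if self.epoch_based else float(self.global_step)


event = ToyEvent(steps_per_epoch=100, global_step=250)
print(event.epoch, event.epoch_step, event.epoch_full)  # 2 50 2.5
```

With 100 steps per epoch, step 250 is step 50 of epoch 2, so the fractional index is 2.5. Without steps_per_epoch, the sketch falls back to the global step.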
new_instance
Creates a new instance of the event with the provided keyword arguments.
Parameters:
-
kwargs–Keyword arguments to set in the new instance.
Returns:
-
Event–A new instance of the event with the provided kwargs.
Source code in src/llmcompressor/core/events/event.py
should_update
Determines if the event should trigger an update.
Parameters:
-
start(Optional[float]) –The start index to check against, set to None to ignore start.
-
end(Optional[float]) –The end index to check against, set to None to ignore end.
-
update(Optional[float]) –The update interval, set to None or 0.0 to always update, otherwise must be greater than 0.0, defaults to None.
Returns:
-
bool–True if the event should trigger an update, False otherwise.
Source code in src/llmcompressor/core/events/event.py
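The start, end, and update checks described for should_update can be sketched as a standalone function. This is an illustration only: the function takes the current index as an explicit argument, treats both bounds as inclusive, and uses a small floating-point tolerance for the interval test. All three choices are assumptions rather than documented behavior.

```python
from typing import Optional


def should_update(
    current: float,
    start: Optional[float],
    end: Optional[float],
    update: Optional[float] = None,
) -> bool:
    """Hypothetical re-statement of the documented checks."""
    if start is not None and current < start:
        return False  # before the start index
    if end is not None and current > end:
        return False  # past the end index
    if update is None or update == 0.0:
        return True  # no interval: always update inside the window
    if update < 0.0:
        raise ValueError("update must be greater than 0.0")
    # Update only when the index lands on a multiple of the interval.
    return current % update < 1e-9


print(should_update(2.0, 0.0, 10.0, 0.5))   # True
print(should_update(2.25, 0.0, 10.0, 0.5))  # False
```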
EventType
Bases: Enum
An Enum for defining the different types of events that can be triggered during model compression lifecycles. The purpose of each EventType is to trigger the corresponding modifier callback during training or post training pipelines.
Parameters:
-
INITIALIZE–Event type for initialization.
-
FINALIZE–Event type for finalization.
-
BATCH_START–Event type for the start of a batch.
-
LOSS_CALCULATED–Event type for when loss is calculated.
-
BATCH_END–Event type for the end of a batch.
-
CALIBRATION_EPOCH_START–Event type for the start of a calibration epoch.
-
SEQUENTIAL_EPOCH_END–Event type for the end of a layer calibration epoch, specifically used by src/llmcompressor/pipelines/sequential/pipeline.py
-
CALIBRATION_EPOCH_END–Event type for the end of a calibration epoch.
-
OPTIM_PRE_STEP–Event type for pre-optimization step.
-
OPTIM_POST_STEP–Event type for post-optimization step.
Modifier
Bases: ModifierInterface, HooksMixin
A base class for all modifiers to inherit from. Modifiers are used to modify the training process for a model. Defines base attributes and methods available to all modifiers.
Lifecycle:
1. initialize
2. on_event ->
    * on_start if self.start <= event.current_index
    * on_end if event.current_index >= self.end
3. finalize
Parameters:
-
index–The index of the modifier in the list of modifiers for the model
-
group–The group name for the modifier
-
start–The start step for the modifier
-
end–The end step for the modifier
-
update–The update step for the modifier
Methods:
-
finalize–Finalize the modifier for the given model and state.
-
initialize–Initialize the modifier for the given model and state.
-
on_end–on_end is called when the modifier ends and must be implemented
-
on_event–on_event is called whenever an event is triggered
-
on_finalize–on_finalize is called on modifier finalization and
-
on_initialize–on_initialize is called on modifier initialization and
-
on_start–on_start is called when the modifier starts and
-
on_update–on_update is called when the model in question must be
-
should_end–Determines whether the modifier should end based on the given event.
-
should_start–Determines whether the modifier should start based on the given event.
-
update_event–Update modifier based on the given event. In turn calls
Attributes:
-
finalized(bool) –True if the modifier has been finalized
-
initialized(bool) –True if the modifier has been initialized
finalize
Finalize the modifier for the given model and state.
Parameters:
-
state(State) –The current state of the model
-
kwargs–Additional arguments for finalizing the modifier
Raises:
-
RuntimeError–if the modifier has not been initialized
Source code in src/llmcompressor/modifiers/modifier.py
initialize
Initialize the modifier for the given model and state.
Parameters:
-
state(State) –The current state of the model
-
kwargs–Additional arguments for initializing the modifier
Raises:
-
RuntimeError–if the modifier has already been finalized
Source code in src/llmcompressor/modifiers/modifier.py
on_end
on_end is called when the modifier ends and must be implemented by the inheriting modifier.
Parameters:
-
state(State) –The current state of the model
-
event(Event) –The event that triggered the end
-
kwargs–Additional arguments for ending the modifier
Source code in src/llmcompressor/modifiers/modifier.py
on_event
on_event is called whenever an event is triggered.
on_finalize
on_finalize is called on modifier finalization and must be implemented by the inheriting modifier.
Parameters:
-
state(State) –The current state of the model
-
kwargs–Additional arguments for finalizing the modifier
Returns:
-
bool–True if the modifier was finalized successfully, False otherwise
Source code in src/llmcompressor/modifiers/modifier.py
on_initialize
abstractmethod
on_initialize is called on modifier initialization and must be implemented by the inheriting modifier.
Parameters:
-
state(State) –The current state of the model
-
kwargs–Additional arguments for initializing the modifier
Returns:
-
bool–True if the modifier was initialized successfully, False otherwise
Source code in src/llmcompressor/modifiers/modifier.py
on_start
on_start is called when the modifier starts and must be implemented by the inheriting modifier.
Parameters:
-
state(State) –The current state of the model
-
event(Event) –The event that triggered the start
-
kwargs–Additional arguments for starting the modifier
Source code in src/llmcompressor/modifiers/modifier.py
on_update
on_update is called when the model in question must be updated based on passed in event. Must be implemented by the inheriting modifier.
Parameters:
-
state(State) –The current state of the model
-
event(Event) –The event that triggered the update
-
kwargs–Additional arguments for updating the model
Source code in src/llmcompressor/modifiers/modifier.py
should_end
Parameters:
-
event(Event) –The event to check if the modifier should end
Returns:
-
bool–True if the modifier should end based on the given event
Source code in src/llmcompressor/modifiers/modifier.py
should_start
Parameters:
-
event(Event) –The event to check if the modifier should start
Returns:
-
bool–True if the modifier should start based on the given event
Source code in src/llmcompressor/modifiers/modifier.py
update_event
Update the modifier based on the given event. In turn calls on_start, on_update, and on_end based on the event and modifier settings. Returns immediately if the modifier is not initialized.
Parameters:
-
state(State) –The current state of compression
-
event(Event) –The event to update the modifier with
-
kwargs–Additional arguments for updating the modifier
Raises:
-
RuntimeError–if the modifier has been finalized
Source code in src/llmcompressor/modifiers/modifier.py
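The way update_event dispatches to on_start, on_update, and on_end can be illustrated with a toy class. ToyModifier below is hypothetical and much simpler than Modifier: it takes a bare index instead of an Event, records callback names instead of changing a model, and assumes that a modifier ends once the index reaches self.end.

```python
from typing import List, Optional


class ToyModifier:
    """Hypothetical, simplified modifier that only records which callback ran."""

    def __init__(self, start: float, end: Optional[float] = None):
        self.start, self.end = start, end
        self.started = False
        self.ended = False
        self.calls: List[str] = []

    def should_start(self, index: float) -> bool:
        return self.start <= index and (self.end is None or index < self.end)

    def should_end(self, index: float) -> bool:
        return self.end is not None and index >= self.end

    def update_event(self, index: float) -> None:
        if self.ended:
            return  # nothing left to do after on_end
        if not self.started and self.should_start(index):
            self.started = True
            self.calls.append(f"on_start@{index}")
        elif self.started and self.should_end(index):
            self.ended = True
            self.calls.append(f"on_end@{index}")
        elif self.started:
            self.calls.append(f"on_update@{index}")


modifier = ToyModifier(start=1, end=3)
for step in range(5):
    modifier.update_event(step)
print(modifier.calls)  # ['on_start@1', 'on_update@2', 'on_end@3']
```

Steps 0 and 4 produce no callback: the first falls before start and the second arrives after the modifier has already ended.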
NormMapping
Bases: BaseModel
SpinQuant needs to know where every norm layer exists in the model, as well as all the subsequent Linear layers the norm passes into. This is because the norm layer weights need to be normalized before transforms can be fused into Linear layers.
Parameters:
-
norm–name or regex that matches norm layer in model
-
linears–list of names or regexes of Linear layers that receive input from norm.
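To make "name or regex" concrete, the sketch below resolves a norm mapping against a list of module names. ToyNormMapping, resolve, and the module names are all hypothetical. The matching rule (exact name, or a full regex match) is an assumption, and the library's own convention for marking regexes may differ.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ToyNormMapping:
    """Hypothetical stand-in with the same two fields as NormMapping."""

    norm: str
    linears: List[str]


def resolve(pattern: str, names: List[str]) -> List[str]:
    """Return module names equal to `pattern` or fully matched by it as a regex."""
    return [n for n in names if n == pattern or re.fullmatch(pattern, n)]


# Illustrative module names in the style of a decoder-only transformer.
module_names = [
    "model.layers.0.input_layernorm",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
]

mapping = ToyNormMapping(
    norm=r".*input_layernorm",
    linears=[r".*self_attn\.[qkv]_proj"],
)

norm_layers = resolve(mapping.norm, module_names)
linear_layers = [n for p in mapping.linears for n in resolve(p, module_names)]
print(norm_layers)
print(linear_layers)
```

Here the input norm maps to q_proj, k_proj, and v_proj, which receive its output, but not to o_proj, which does not.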
SpinQuantMapping
Bases: BaseModel
SpinQuant needs to know the entire architecture of the model, as R1, R2, R3, and R4 rotations need to be applied to specific layers (https://arxiv.org/pdf/2405.16406 Fig. 1).
Parameters:
-
embedding–name or regex of embedding layer
-
attn–name or regex of attention block in decoder layer
-
attn_q–name or regex of q_proj layer in attention block
-
attn_k–name or regex of k_proj layer in attention block
-
attn_v–name or regex of v_proj layer in attention block
-
attn_o–name or regex of o_proj layer in attention block
-
attn_head_dim–head_dim of the attention module, needed because R2 needs to be applied "head-wisely" to v_proj and o_proj
-
mlp_in–list of names or regexes for the mlp blocks that receive the input to the MLP block, usually up_proj and gate_proj
-
mlp_out–list of names or regexes for the mlp blocks that constitute the output of the MLP block, usually down_proj
SpinQuantModifier
Bases: Modifier
Implements the transforms according to "SpinQuant: LLM quantization with learned rotations" (https://arxiv.org/abs/2405.16406)
Transforms (rotations) are extra layers added to a model which reduce the accuracy loss induced by quantization. This is achieved through "rotating" weights and activations into a space with a smaller dynamic range of values, thus decreasing the range of scales required for quantization.
The SpinQuant authors describe four different rotations which can be applied to a model. R1 and R2 are "offline" rotations, meaning that they can be fused into existing weights and therefore do not induce runtime cost. R3 and R4 are "online" rotations, meaning that they require additional computation at runtime.
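A small numeric sketch can show why an offline rotation adds no runtime cost and how it shrinks the dynamic range. The code below is pure Python and does not use the library. It builds a normalized Hadamard matrix, rotates an activation vector with a single outlier, and applies the inverse rotation to the weight. The helper names and the example values are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def hadamard(n: int) -> Matrix:
    """Normalized n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h: Matrix = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = n ** -0.5
    return [[v * scale for v in row] for row in h]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


rotation = hadamard(4)
activations = [[8.0, 0.0, 0.0, 0.0]]  # one outlier channel
weight = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.0]]  # shape (in=4, out=2)

rotated_acts = matmul(activations, rotation)             # x R
rotated_weight = matmul(transpose(rotation), weight)     # R^T W, fused offline

original = matmul(activations, weight)
fused = matmul(rotated_acts, rotated_weight)

print(rotated_acts)     # [[4.0, 4.0, 4.0, 4.0]]
print(original, fused)  # [[8.0, 16.0]] [[8.0, 16.0]]
```

Because the rotation is orthogonal, (x R)(R^T W) equals x W, so the layer output is unchanged. The outlier of magnitude 8 is spread across four channels of magnitude 4, which is the smaller dynamic range that quantization benefits from.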
Lifecycle:
- on_initialize
    - infer SpinQuantMappings & NormMappings
    - as needed, create transform schemes for R1, R2, R3, & R4
- on_start
    - normalize embeddings
    - fuse norm layers into subsequent Linear layers
    - apply TransformConfig
        - fuse transforms into weights for mergeable transforms
        - add hooks for online transforms
- on sequential epoch end
- on_end
- on_finalize
Parameters:
-
rotations–A list containing the names of rotations to apply to the model. Possible rotations include R1, R2, R3, and R4
-
transform_type–The type of transform to apply to the model.
"hadamard" has the least performance cost but only supports sizes which are powers of two. "random-hadamard" has more performance cost, but supports a much larger set of sizes. "random-matrix" has the greatest performance cost, but supports any size.
-
randomize–if True, create distinct transforms for each application
-
learnable–if True, attach gradients to transform weights for training
-
precision–Precision at which all transforms should be applied. This applies to both weight fusing and online rotations
-
transform_block_size–Block size to use for rotation matrices. The model's hidden_size and head_dim must be evenly divisible by transform_block_size. Layers will be transformed by a block-diagonal matrix where each block is a matrix of this size. If None is provided, model's hidden_size will be used for R1, R3, and R4 and model's head_dim will be used for R2
-
mappings–Specifies layers within a model to target for transforms. A mapping will be inferred if None is provided
-
norm_mappings–Specifies layers within a model to target for norm fusing. A mapping will be inferred if None is provided
-
transform_config–Optional transform config for overriding provided arguments
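The effect of transform_block_size can be pictured as applying one small rotation to each consecutive chunk of a vector, which is equivalent to multiplying by a block-diagonal matrix. The sketch below is pure Python with a hypothetical helper name and example values. It only illustrates the divisibility requirement and the block-wise application, not how the library builds its transforms.

```python
from typing import List


def rotate_blockwise(vector: List[float], block: List[List[float]]) -> List[float]:
    """Apply the same square `block` to each consecutive chunk of `vector`."""
    size = len(block)
    if len(vector) % size != 0:
        raise ValueError("vector length must be divisible by the block size")
    out: List[float] = []
    for start in range(0, len(vector), size):
        chunk = vector[start:start + size]
        # Row vector times block: out[j] = sum_i chunk[i] * block[i][j]
        out.extend(
            sum(x * block[i][j] for i, x in enumerate(chunk)) for j in range(size)
        )
    return out


s = 2 ** -0.5
block = [[s, s], [s, -s]]      # 2 x 2 normalized Hadamard block
hidden = [2.0, 0.0, 0.0, 4.0]  # hidden_size 4, block size 2

rotated = rotate_blockwise(hidden, block)
print(rotated)
```

Each chunk of two values is rotated independently, so a hidden size of 4 works with a block size of 2 but a vector of length 3 would be rejected.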
State
dataclass
State(
model: Any = None,
teacher_model: Any = None,
optimizer: Any = None,
optim_wrapped: bool = None,
loss: Any = None,
batch_data: Any = None,
data: Data = Data(),
hardware: Hardware = Hardware(),
loss_masks: list[Tensor] | None = None,
current_batch_idx: int = -1,
sequential_prefetch: bool = False,
)
State class holds information about the current compression state.
Parameters:
-
model(Any, default:None) –The model being used for compression
-
teacher_model(Any, default:None) –The teacher model being used for compression
-
optimizer(Any, default:None) –The optimizer being used for training
-
optim_wrapped(bool, default:None) –Whether or not the optimizer has been wrapped
-
loss(Any, default:None) –The loss function being used for training
-
batch_data(Any, default:None) –The current batch of data being used for compression
-
data(Data, default:Data()) –The data sets being used for training, validation, testing, and/or calibration, wrapped in a Data instance
-
hardware(Hardware, default:Hardware()) –Hardware instance holding info about the target hardware being used
Methods:
-
update–Update the state with the given parameters.
Attributes:
-
compression_ready(bool) –Check if the model and optimizer are set for compression.
compression_ready
property
Check if the model and optimizer are set for compression.
Returns:
-
bool–True if model and optimizer are set, False otherwise
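The compression_ready check can be sketched with a toy dataclass. ToyState is hypothetical and keeps only the two fields the description mentions. Treating "set" as "not None" is an assumption.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyState:
    """Hypothetical stand-in holding only the fields the check depends on."""

    model: Any = None
    optimizer: Any = None

    @property
    def compression_ready(self) -> bool:
        # Ready only when both the model and the optimizer have been provided.
        return self.model is not None and self.optimizer is not None


state = ToyState(model="my-model")
print(state.compression_ready)  # False
state.optimizer = "my-optimizer"
print(state.compression_ready)  # True
```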
update
update(
model: Any = None,
teacher_model: Any = None,
optimizer: Any = None,
attach_optim_callbacks: bool = True,
train_data: Any = None,
val_data: Any = None,
test_data: Any = None,
calib_data: Any = None,
copy_data: bool = True,
start: float = None,
steps_per_epoch: int = None,
batches_per_step: int = None,
**kwargs,
) -> dict
Update the state with the given parameters.
Parameters:
-
model(Any, default:None) –The model to update the state with
-
teacher_model(Any, default:None) –The teacher model to update the state with
-
optimizer(Any, default:None) –The optimizer to update the state with
-
attach_optim_callbacks(bool, default:True) –Whether or not to attach optimizer callbacks
-
train_data(Any, default:None) –The training data to update the state with
-
val_data(Any, default:None) –The validation data to update the state with
-
test_data(Any, default:None) –The testing data to update the state with
-
calib_data(Any, default:None) –The calibration data to update the state with
-
copy_data(bool, default:True) –Whether or not to copy the data
-
start(float, default:None) –The start index to update the state with
-
steps_per_epoch(int, default:None) –The steps per epoch to update the state with
-
batches_per_step(int, default:None) –The batches per step to update the state with
-
kwargs–Additional keyword arguments to update the state with
Returns:
-
Dict–The updated state as a dictionary
Source code in src/llmcompressor/core/state.py
center_embeddings
Shift each embedding to have a mean of zero.
Parameters:
-
embedding(Module) –embedding module containing embeddings to center
Source code in src/llmcompressor/modeling/fuse.py
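Centering can be illustrated on a plain list-of-lists embedding table. The sketch below assumes "each embedding" means each row vector, so the mean is taken over the hidden dimension. center_rows is a hypothetical helper, not the library function, which operates on a module.

```python
from typing import List


def center_rows(embeddings: List[List[float]]) -> List[List[float]]:
    """Subtract each row's own mean so that every row averages to zero."""
    centered = []
    for row in embeddings:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    return centered


table = [[1.0, 2.0, 3.0], [10.0, 10.0, 40.0]]
centered = center_rows(table)
print(centered)  # [[-1.0, 0.0, 1.0], [-10.0, -10.0, 20.0]]
```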
fuse_norm_linears
Fuse the scaling operation of a norm layer into subsequent linear layers. This is useful for ensuring transform invariance between norm and linear layers.
Note that unitary transforms (rotations) commute with normalization, but not with scaling.
Parameters:
-
norm(Module) –norm layer whose weight will be fused into subsequent linears
-
linears(Iterable[Linear]) –linear layers which directly follow the norm layer
Source code in src/llmcompressor/modeling/fuse.py
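The fusion itself is a column-wise rescaling of the linear weight. The following sketch uses plain lists and hypothetical helper names, and it assumes a weight of shape (out_features, in_features) and a norm that multiplies its normalized input element-wise by its weight. It checks that moving the scale into the linear layer leaves the output unchanged.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def fuse_scale(norm_weight: List[float], linear_weight: Matrix) -> Tuple[List[float], Matrix]:
    """Fold the norm's per-channel scale into the linear weight's input columns."""
    fused = [[w * g for w, g in zip(row, norm_weight)] for row in linear_weight]
    return [1.0] * len(norm_weight), fused  # the norm scale becomes all ones


def linear(weight: Matrix, x: List[float]) -> List[float]:
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


norm_weight = [2.0, 0.5, 1.0]
weight = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]  # (out=2, in=3)
x_normalized = [0.5, -1.0, 2.0]               # output of the normalization step

before = linear(weight, [g * v for g, v in zip(norm_weight, x_normalized)])
new_norm_weight, fused_weight = fuse_scale(norm_weight, weight)
after = linear(fused_weight, [g * v for g, v in zip(new_norm_weight, x_normalized)])

print(before, after)  # [6.0, 8.5] [6.0, 8.5]
```

Once the scale is all ones, the norm reduces to pure normalization, which is the part that commutes with a rotation.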
untie_word_embeddings
Untie word embeddings, if possible. This function raises a warning if embeddings cannot be found in the model definition.
The model config will be updated to reflect that embeddings are now untied.
Parameters:
-
model(PreTrainedModel) –transformers model containing word embeddings
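What "untying" means can be shown with a toy object in which the input embedding and the output head share one weight. ToyModel and untie are hypothetical, and the weights are plain lists rather than tensors. The sketch only illustrates replacing a shared reference with an independent copy and recording that in the config.

```python
class ToyModel:
    """Hypothetical model whose embedding and output head share one weight."""

    def __init__(self) -> None:
        shared = [[0.1, 0.2], [0.3, 0.4]]
        self.embed_weight = shared
        self.lm_head_weight = shared  # tied: same object as embed_weight
        self.config = {"tie_word_embeddings": True}


def untie(model: ToyModel) -> None:
    """Give the output head its own copy of the weight and update the config."""
    if model.embed_weight is model.lm_head_weight:
        model.lm_head_weight = [row[:] for row in model.embed_weight]
    model.config["tie_word_embeddings"] = False


model = ToyModel()
untie(model)

# The two weights can now be modified independently.
model.embed_weight[0][0] = 9.0
print(model.lm_head_weight[0][0])  # 0.1
```

After untying, a transform fused into the embedding no longer alters the output head, and the reverse also holds.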