vllm.model_executor.model_loader.reload.layerwise ¶
Functions:
- finalize_layerwise_processing – Apply processing to any layers which were not layerwise processed during loading.
- get_layerwise_info – Get information related to restoring and layerwise processing.
- initialize_layerwise_reload – Set up layerwise weight loading with deferred processing.
- record_metadata_for_reloading – Record layer metadata needed for later reloading.
_copy_and_restore_kernel_tensors(layer, info) ¶
Copy processed values into original kernel tensor storage and restore kernel tensor references on the layer. Preserves cudagraph references.
Source code in vllm/model_executor/model_loader/reload/layerwise.py
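The reason for copying into the original storage, rather than swapping in a new tensor, can be illustrated with a minimal sketch. It uses a standard-library `array` as a stand-in for a kernel tensor; the helper name `copy_in_place` and all variables are hypothetical and are not part of vLLM. Anything that captured a reference to the original buffer (as a CUDA graph would) only sees new values if the buffer is overwritten in place.

```python
from array import array


def copy_in_place(dest: array, src: array) -> None:
    """Overwrite dest's contents with src without creating a new buffer object.

    Hypothetical helper: a plain-Python analogue of an in-place tensor copy.
    An in-place copy only makes sense when element type and size match.
    """
    if dest.typecode != src.typecode or len(dest) != len(src):
        raise ValueError("in-place copy requires matching element type and length")
    dest[:] = src  # slice assignment mutates the existing buffer


# Stand-in for a kernel tensor that was allocated at initial load time.
kernel_buffer = array("f", [1.0, 2.0, 3.0])

# Stand-in for a captured graph: it holds a reference to the buffer object itself.
captured_reference = kernel_buffer

# Freshly processed values produced by a reload.
processed = array("f", [10.0, 20.0, 30.0])

# Overwrite in place so the captured reference observes the reloaded values.
copy_in_place(kernel_buffer, processed)
```

After the copy, `captured_reference` and `kernel_buffer` are still the same object and both hold the reloaded values. Rebinding `kernel_buffer = processed` instead would have left `captured_reference` pointing at stale data.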
_get_original_loader(tensor) ¶
Return the weight loader with any layerwise wrappers removed.
_layerwise_process(layer, info) ¶
Finalize layer loading after all weights have been buffered.
This function:
1. Materializes the layer onto the target device
2. Loads all buffered weights
3. Runs quantization processing if applicable
4. Copies processed values back to original tensor storage
_reload_attention_scales(layer, info) ¶
Load and process attention scale weights (k_scale, v_scale, etc.) during reload.
Assumes dtype/shapes of attention tensors do not change during processing, since we use .data.copy_() to preserve kernel tensor references.
_wrap_parameters_weight_loader(layer) ¶
Wrap each parameter's weight loader.
finalize_layerwise_processing(model, model_config) ¶
Apply processing to any layers which were not layerwise processed during loading. This includes attention layers and layers which have weight elements which are not loaded (due to padding).
This function should be called after initialize_layerwise_reload in order to unwrap the layerwise weight loaders.
Parameters:
- model (Module) – Model to finalize processing for.
- model_config (ModelConfig) – Config needed for applying processing to attention layers.
get_layerwise_info(layer) ¶
Get information related to restoring and layerwise processing. If no previous information exists, a new entry is constructed.
initialize_layerwise_reload(model) ¶
Set up layerwise weight loading with deferred processing.
Must be called after record_metadata_for_reloading. This function:
1. Saves current kernel tensors for later copying
2. Restores layer parameters/buffers from metadata (on meta device)
3. Wraps weight loaders to defer processing until all weights are loaded
When all weights for a layer are loaded, the wrapped loaders will:
1. Materialize the layer onto the target device
2. Load all cached weights
3. Run quantization processing if applicable
4. Copy processed values back to original tensor storage
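The deferral described above can be sketched in plain Python. This is a toy model, not vLLM's implementation: `ToyLayer`, `make_deferred_loader`, and `scale_weights` are hypothetical names, lists of floats stand in for tensors, and "processing" is reduced to a single callback. Each wrapped loader buffers what it receives and triggers processing only once every parameter of the layer has received all of its expected elements.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Buffers = Dict[str, List[float]]


@dataclass
class ToyLayer:
    """Hypothetical stand-in for a layer whose parameters arrive in shards."""

    expected: Dict[str, int]  # parameter name -> number of elements expected
    buffered: Buffers = field(default_factory=dict)
    processed: Optional[Buffers] = None


def make_deferred_loader(
    layer: ToyLayer,
    param_name: str,
    process: Callable[[Buffers], Buffers],
) -> Callable[[List[float]], None]:
    """Return a loader that buffers values and defers processing."""

    def loader(values: List[float]) -> None:
        # Buffer the incoming shard instead of processing it immediately.
        layer.buffered.setdefault(param_name, []).extend(values)
        # Process only when every parameter of the layer is fully buffered.
        complete = all(
            len(layer.buffered.get(name, [])) >= count
            for name, count in layer.expected.items()
        )
        if complete and layer.processed is None:
            layer.processed = process(layer.buffered)

    return loader


def scale_weights(buffers: Buffers) -> Buffers:
    """Toy processing step: fold the scale into the weights."""
    scale = buffers["scale"][0]
    return {"weight": [w * scale for w in buffers["weight"]]}


layer = ToyLayer(expected={"weight": 4, "scale": 1})
load_weight = make_deferred_loader(layer, "weight", scale_weights)
load_scale = make_deferred_loader(layer, "scale", scale_weights)

load_weight([1.0, 2.0])  # first shard of the weight
load_scale([0.5])
still_pending = layer.processed is None  # weight is only half loaded
load_weight([3.0, 4.0])  # final shard triggers processing
```

In this toy, a layer whose element count never reaches the expected total stays unprocessed, which mirrors the case that finalize_layerwise_processing is documented to handle (weight elements that are not loaded due to padding).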
initialize_online_processing(layer) ¶
Wrap a layer's weight loaders with online processing loaders. Called by either initialize_layerwise_reload or an online quantization scheme. When online quantization and reloading are used together, this function prevents the loaders from being wrapped twice.
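One common way to make wrapping safe to call twice, and to recover the original loader later, is to tag the wrapper with a reference to what it wraps. The sketch below shows that general pattern only; the attribute name `_original_loader` and the helpers `wrap_once` and `unwrap` are assumptions made for illustration and may not match how vLLM tracks wrapped loaders.

```python
from typing import Any, Callable


def wrap_once(loader: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a loader unless it is already wrapped (hypothetical helper)."""
    if hasattr(loader, "_original_loader"):
        # Already wrapped: return it unchanged to avoid double wrapping.
        return loader

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        # A real wrapper would buffer and defer here; this one just delegates.
        return loader(*args, **kwargs)

    # Hypothetical marker attribute that records the wrapped loader.
    wrapped._original_loader = loader
    return wrapped


def unwrap(loader: Callable[..., Any]) -> Callable[..., Any]:
    """Follow the marker attribute back to the innermost, unwrapped loader."""
    while hasattr(loader, "_original_loader"):
        loader = loader._original_loader
    return loader


def base_loader(value: int) -> int:
    """Toy loader that returns its input."""
    return value


once = wrap_once(base_loader)
twice = wrap_once(once)  # second call is a no-op
```

Here `twice` is the same object as `once`, and `unwrap(twice)` returns `base_loader`, so two independent callers can both request wrapping without stacking wrappers.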
make_online_process_loader(layer, param_name) ¶
Create a wrapped weight loader that defers processing.
record_metadata_for_reloading(model) ¶
Record layer metadata needed for later reloading.
Stores parameter and buffer metadata as meta tensors for restoration. Must be called before initialize_layerwise_reload.