vllm.model_executor.model_loader.weight_utils
Utilities for downloading and initializing model weights.
Functions:
- atomic_writer – Context manager that provides an atomic file writing routine.
- composed_weight_loader – Create a weight loader that post-processes the weights after loading
- convert_pyslice_to_tensor – Convert PySafeSlice object from safetensors to torch.Tensor
- default_weight_loader – Default weight loader.
- download_safetensors_index_file_from_hf – Download hf safetensors index file from Hugging Face Hub.
- download_weights_from_hf – Download model weights from Hugging Face Hub.
- enable_xet_high_performance – Automatically activates xet high performance mode
- fastsafetensors_weights_iterator – Iterate over the weights in the model safetensor files
- filter_files_not_needed_for_inference – Exclude files that are not needed for inference.
- initialize_dummy_weights – Initialize model weights with random values.
- instanttensor_weights_iterator – Iterate over the weights in the model safetensor files
- maybe_download_from_modelscope – Download model from ModelScope hub if VLLM_USE_MODELSCOPE is True.
- maybe_remap_kv_scale_name – Remap the name of FP8 k/v_scale parameters.
- maybe_remap_moe_expert_param_name – Remap MoE expert parameter names to account for routed_experts hierarchy.
- multi_thread_pt_weights_iterator – Multi-Thread iterate over the weights in the model bin/pt files.
- multi_thread_safetensors_weights_iterator – Multi-Thread iterate over the weights in the model safetensor files.
- np_cache_weights_iterator – Iterate over the weights in the model np files.
- pt_weights_iterator – Iterate over the weights in the model bin/pt files.
- remap_moe_expert_weights – Wrapper generator that remaps MoE expert parameter names for backward compatibility.
- row_parallel_weight_loader – Load weights that are row-parallelized.
- runai_safetensors_weights_iterator – Iterate over the weights in the model safetensor files.
- safetensors_weights_iterator – Iterate over the weights in the model safetensor files.
- sharded_weight_loader – Create a weight loader that shards the weights along the given axis
_get_available_ram_bytes()
_get_checkpoints_size_bytes(files)
Return the total size of the checkpoint files in bytes.
_get_fs_type(files)
Get the filesystem type of the first file in files (Linux only).
Source code in vllm/model_executor/model_loader/weight_utils.py
_natural_sort_key(filepath)
Natural sort key for filenames with numeric components, such as model-00001-of-00005.safetensors -> ['model-', 1, '-of-', 5, '.safetensors']
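A minimal sketch of such a key, assuming the filename is split on runs of digits; the function name `natural_sort_key` is illustrative, and the module's private helper may differ in details such as path handling:

```python
import re


def natural_sort_key(filepath: str) -> list:
    # re.split with a capturing group keeps the digit runs, so the result
    # alternates text, digits, text, ... (odd indices are the digit runs).
    parts = re.split(r"(\d+)", filepath)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


shards = ["model-00010-of-00012.safetensors", "model-00002-of-00012.safetensors"]
ordered = sorted(shards, key=natural_sort_key)
```

Comparing the numeric components as integers orders shard 2 before shard 10 even when the numbers are not zero-padded.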
_prefetch_all_checkpoints(sorted_files, num_prefetch_threads=DEFAULT_SAFETENSORS_PREFETCH_NUM_THREADS, block_size=DEFAULT_SAFETENSORS_PREFETCH_BLOCK_SIZE)
Start prefetching checkpoint files into page cache in a background thread.
_prefetch_checkpoint(file_path, block_size=DEFAULT_SAFETENSORS_PREFETCH_BLOCK_SIZE)
Prefetch a checkpoint file into the OS page cache.
Reads the file in blocks so the kernel caches its pages before workers load the same file.
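The block-wise read can be sketched as follows. This is an illustration only: the name `prefetch_file` and the 16 MiB default block size are assumptions, not the module's actual values.

```python
def prefetch_file(file_path: str, block_size: int = 16 * 1024 * 1024) -> int:
    """Read a file sequentially and discard the data.

    The reads are done only for their side effect: the kernel keeps the
    pages it served in the page cache. Returns the number of bytes read.
    """
    total = 0
    # buffering=0 avoids a second, Python-level buffer on top of the reads.
    with open(file_path, "rb", buffering=0) as f:
        while chunk := f.read(block_size):
            total += len(chunk)
    return total
```

Running such a function in a background thread lets later reads of the same file be served from memory, provided the pages have not been evicted in the meantime.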
atomic_writer(filepath, mode='w', encoding=None)
Context manager that provides an atomic file writing routine.
The context manager writes to a temporary file and, if successful, atomically replaces the original file.
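The write-then-replace pattern described here might look like the sketch below. It is not the module's implementation; `atomic_write` is an illustrative name, and placing the temporary file in the target's directory is an assumption made so that `os.replace` stays on one filesystem.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write(filepath, mode="w", encoding=None):
    # Create the temporary file next to the target so the final rename
    # does not cross filesystems.
    dirname = os.path.dirname(os.path.abspath(filepath))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, mode, encoding=encoding) as f:
            yield f
        # Reached only if the body succeeded and the file was closed.
        os.replace(tmp_path, filepath)
    except BaseException:
        # Leave the original file untouched and drop the partial output.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers of `filepath` therefore see either the old contents or the complete new contents, never a partially written file.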
Parameters:
- filepath (str or Path) – The path to the file to write.
- mode (str, default: 'w') – The file mode for the temporary file (e.g., 'w', 'wb').
- encoding (str, default: None) – The encoding for text mode.
Yields:
composed_weight_loader(loader, fn)
Create a weight loader that post-processes the weights after loading
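The composition idea can be illustrated without torch. In this sketch `Param` is a hypothetical stand-in for a parameter object with a `data` attribute, and `compose_loader` is an illustrative name; the real function operates on torch parameters and may apply `fn` differently.

```python
class Param:
    """Hypothetical stand-in for a parameter holding its values in .data."""

    def __init__(self, data):
        self.data = data


def compose_loader(loader, fn):
    # Return a loader that first loads the weight, then post-processes it.
    def composed(param, loaded_weight):
        loader(param, loaded_weight)
        param.data = fn(param.data)

    return composed


def plain_loader(param, loaded_weight):
    # Simplest possible loader: copy the checkpoint values into the parameter.
    param.data = list(loaded_weight)


# Example: load, then negate every value.
negating_loader = compose_loader(plain_loader, lambda xs: [-x for x in xs])
p = Param([0.0, 0.0])
negating_loader(p, [1.0, 2.0])
```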
convert_pyslice_to_tensor(x)
Convert a PySafeSlice object from safetensors to a torch.Tensor.
A PySafeSlice object supports indexing, which is applied before the actual tensor is loaded and can reduce the amount of data read into memory. However, it does not support more advanced operations such as .view() or .t(). Therefore, to modify the loaded tensor with these operations, it must first be converted to a tensor.
default_weight_loader(param, loaded_weight)
Default weight loader.
download_safetensors_index_file_from_hf(model_name_or_path, index_file, cache_dir, subfolder=None, revision=None)
Download hf safetensors index file from Hugging Face Hub.
Parameters:
- model_name_or_path (str) – The model name or path.
- index_file (str) – The safetensors index file name.
- cache_dir (Optional[str]) – The cache directory to store the model weights. If None, will use HF defaults.
- subfolder (Optional[str], default: None) – The subfolder within the model repository to download weights from.
- revision (Optional[str], default: None) – The revision of the model.
download_weights_from_hf(model_name_or_path, cache_dir, allow_patterns, revision=None, subfolder=None, ignore_patterns=None)
Download model weights from Hugging Face Hub.
Parameters:
- model_name_or_path (str) – The model name or path.
- cache_dir (Optional[str]) – The cache directory to store the model weights. If None, will use HF defaults.
- allow_patterns (list[str]) – The allowed patterns for the weight files. Files matched by any of the patterns will be downloaded.
- revision (Optional[str], default: None) – The revision of the model.
- subfolder (Optional[str], default: None) – The subfolder within the model repository to download weights from.
- ignore_patterns (Optional[Union[str, list[str]]], default: None) – The patterns to filter out the weight files. Files matched by any of the patterns will be ignored.
Returns:
- str – The path to the downloaded model weights.
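To make the pattern semantics concrete, the sketch below applies allow and ignore patterns to a list of file names. It assumes glob-style matching and uses a hypothetical helper, `select_files`; it does not download anything, and the actual filtering is delegated to the Hub client.

```python
from fnmatch import fnmatchcase


def select_files(files, allow_patterns, ignore_patterns=None):
    # ignore_patterns may be a single string or a list of strings.
    if isinstance(ignore_patterns, str):
        ignore_patterns = [ignore_patterns]
    # Keep a file if ANY allow pattern matches it ...
    selected = [
        f for f in files if any(fnmatchcase(f, p) for p in allow_patterns)
    ]
    # ... then drop it again if ANY ignore pattern matches it.
    if ignore_patterns:
        selected = [
            f for f in selected
            if not any(fnmatchcase(f, p) for p in ignore_patterns)
        ]
    return selected


repo_files = [
    "model-00001-of-00002.safetensors",
    "pytorch_model.bin",
    "original/consolidated.safetensors",
]
kept = select_files(repo_files, ["*.safetensors"], ignore_patterns="original/*")
```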
enable_xet_high_performance()
Automatically activates xet high performance mode.
fastsafetensors_weights_iterator(hf_weights_files, use_tqdm_on_load)
Iterate over the weights in the model safetensor files using the fastsafetensors library.
filter_files_not_needed_for_inference(hf_weights_files)
Exclude files that are not needed for inference.
See https://github.com/huggingface/transformers/blob/v4.34.0/src/transformers/trainer.py#L227-L233
initialize_dummy_weights(model, model_config, low=-0.001, high=0.001, seed=1234)
Initialize model weights with random values.
The model weights must be randomly initialized for accurate performance measurements. Additionally, the model weights should not cause NaNs in the forward pass. We empirically found that initializing the weights with values between -1e-3 and 1e-3 works well for most models.
We use a per-parameter random seed, so that dummy weights are consistent even if the model is partitioned across multiple devices. When the seed is fixed, the random values generated by this function depend only on the parameter's number of elements and its data type.
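The per-parameter seeding idea can be illustrated with the standard library alone. This is a simplified sketch, not the torch-based implementation: `dummy_values` is a hypothetical helper that ignores data types and devices and only shows why a fresh, identically seeded generator per parameter gives reproducible values.

```python
import random


def dummy_values(numel, low=-1e-3, high=1e-3, seed=1234):
    # A fresh generator per parameter: the output depends only on the
    # seed and the number of elements, not on which other parameters
    # were initialized before this one.
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(numel)]


# Two parameters with the same element count get the same values,
# regardless of the order in which they are initialized.
a = dummy_values(4)
b = dummy_values(4)
```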
instanttensor_weights_iterator(hf_weights_files, use_tqdm_on_load)
Iterate over the weights in the model safetensor files using the instanttensor library.
maybe_download_from_modelscope(model, revision=None, download_dir=None, ignore_patterns=None, allow_patterns=None)
Download model from ModelScope hub if VLLM_USE_MODELSCOPE is True.
Returns the path to the downloaded model, or None if the model is not downloaded from ModelScope.
maybe_remap_kv_scale_name(name, params_dict)
Remap the name of FP8 k/v_scale parameters.
This function handles the remapping of FP8 k/v_scale parameter names. It detects if the given name ends with a suffix and attempts to remap it to the expected name format in the model. If the remapped name is not found in the params_dict, a warning is printed and None is returned.
Parameters:
- name (str) – The original loaded checkpoint parameter name.
- params_dict (dict) – Dictionary containing the model's named parameters.
Returns:
- str – The remapped parameter name if successful, or the original name if no remapping is needed.
- None – If the remapped name is not found in params_dict.
maybe_remap_moe_expert_param_name(name, params_dict)
Remap MoE expert parameter names to account for routed_experts hierarchy.
This handles the transition from the old FusedMoE structure where weights were directly in the experts module, to the new MoERunner → RoutedExperts structure.
Checkpoint weights have names like:
    layers.0.mlp.experts.w13_weight
    layers.0.feed_forward.experts.w2_input_scale
But actual parameters are now:
    layers.0.mlp.experts.routed_experts.w13_weight
    layers.0.feed_forward.experts.routed_experts.w2_input_scale
This function inserts 'routed_experts.' into the path when needed.
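A minimal sketch of the name insertion, assuming the remap applies only when the original name is absent from the model and the remapped name is present. The helper name `remap_expert_name` and its exact fallback behavior are illustrative, not the module's implementation.

```python
def remap_expert_name(name: str, params_dict: dict) -> str:
    # Nothing to do if the checkpoint name already matches a parameter.
    if name in params_dict:
        return name
    marker = ".experts."
    if marker in name and ".routed_experts." not in name:
        # Insert 'routed_experts.' right after the first '.experts.'.
        candidate = name.replace(marker, marker + "routed_experts.", 1)
        if candidate in params_dict:
            return candidate
    # Assumption: fall back to the original name when no remap applies.
    return name


params = {
    "layers.0.mlp.experts.routed_experts.w13_weight": object(),
    "layers.0.mlp.gate.weight": object(),
}
new_name = remap_expert_name("layers.0.mlp.experts.w13_weight", params)
```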
Parameters:
- name (str) – Parameter name from checkpoint.
- params_dict (dict[str, Parameter]) – Dictionary of model parameters (from named_parameters()).
Returns:
multi_thread_pt_weights_iterator(hf_weights_files, use_tqdm_on_load, pt_load_map_location='cpu', max_workers=4)
Multi-Thread iterate over the weights in the model bin/pt files.
multi_thread_safetensors_weights_iterator(hf_weights_files, use_tqdm_on_load, max_workers=4)
Multi-Thread iterate over the weights in the model safetensor files.
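The general shape of a multi-threaded weights iterator can be sketched with a thread pool. Here `load_file` is a hypothetical callable that returns a name-to-tensor mapping for one file; the real iterator reads safetensors files and may schedule work differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def multi_thread_iter(files, load_file, max_workers=4):
    # Load several files concurrently and yield (name, tensor) pairs
    # as each file finishes; the order across files is not guaranteed.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(load_file, f) for f in files]
        for future in as_completed(futures):
            yield from future.result().items()


def fake_load(path):
    # Stand-in loader used only for this example.
    return {f"{path}.weight": [1.0, 2.0]}


weights = dict(multi_thread_iter(["a", "b", "c"], fake_load, max_workers=2))
```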
np_cache_weights_iterator(model_name_or_path, cache_dir, hf_folder, hf_weights_files, use_tqdm_on_load)
Iterate over the weights in the model np files.
Will dump the model weights to numpy files if they are not already dumped.
pt_weights_iterator(hf_weights_files, use_tqdm_on_load, pt_load_map_location='cpu')
Iterate over the weights in the model bin/pt files.
remap_moe_expert_weights(weights, params_dict)
Wrapper generator that remaps MoE expert parameter names for backward compatibility.
This allows models with custom weight loading to automatically handle both old and new checkpoint formats without needing model-specific remapping code.
Usage:
    params_dict = dict(model.named_parameters())
    for name, weight in remap_moe_expert_weights(weights, params_dict):
        # name is automatically remapped if needed
        param = params_dict[name]
        ...
Parameters:
- weights (Iterable[tuple[str, Tensor]]) – Iterator of (name, tensor) tuples from checkpoint.
- params_dict (dict[str, Parameter]) – Dictionary of model parameters (from named_parameters()).
Yields:
row_parallel_weight_loader(param, loaded_weight)
Load weights that are row-parallelized.
runai_safetensors_weights_iterator(hf_weights_files, use_tqdm_on_load, is_distributed=False)
Iterate over the weights in the model safetensor files.
safetensors_weights_iterator(hf_weights_files, use_tqdm_on_load, safetensors_load_strategy=None, local_expert_ids=None, *, safetensors_prefetch_num_threads=DEFAULT_SAFETENSORS_PREFETCH_NUM_THREADS, safetensors_prefetch_block_size=DEFAULT_SAFETENSORS_PREFETCH_BLOCK_SIZE)
Iterate over the weights in the model safetensor files.
When local_expert_ids is provided, expert weights that do not belong to this rank are skipped before they are read from disk, which drastically reduces storage I/O for MoE models under expert parallelism (EP).
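The skip decision can be made from the tensor name alone, before any data is read. The sketch below assumes expert weights are named with an `.experts.<id>.` component; that naming pattern and the helper `should_load` are assumptions for illustration, and real checkpoints may use other layouts.

```python
import re

# Assumed naming convention: "...experts.<expert id>....".
_EXPERT_ID = re.compile(r"\.experts\.(\d+)\.")


def should_load(name, local_expert_ids):
    # Without a local expert set, every weight is loaded.
    if local_expert_ids is None:
        return True
    match = _EXPERT_ID.search(name)
    # Non-expert weights (attention, norms, ...) are always loaded.
    if match is None:
        return True
    return int(match.group(1)) in local_expert_ids


names = [
    "layers.0.mlp.experts.0.w1.weight",
    "layers.0.mlp.experts.5.w1.weight",
    "layers.0.self_attn.q_proj.weight",
]
local = [n for n in names if should_load(n, {0, 1})]
```

Filtering on names first means the tensors for non-local experts never need to be materialized.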
sharded_weight_loader(shard_axis)
Create a weight loader that shards the weights along the given axis