vllm.model_executor.layers.quantization.utils.humming_utils ¶
Functions:
- convert_linear_layer_to_humming_standard – Rename/reshape a linear layer's quantized params from the canonical MPLinear layout into the parameter names and layout humming's weight schema expects.
- convert_to_humming_moe_kernel_format – Convert MoE weights from checkpoint format to Humming kernel format.
- select_humming_moe_experts – Select the primary Humming MoE Experts class.
_convert_sublayer_to_humming(layer, sublayer_name, shape_n, shape_k, weight_schema, input_schema, num_experts, param_dtype) ¶
Convert a sublayer's weights from checkpoint format to Humming format.
Source code in vllm/model_executor/layers/quantization/utils/humming_utils.py
_extract_sublayer_tensors(layer, sublayer_name) ¶
Extract tensors for a specific sublayer from the layer's state dict.
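As an illustration only, the sketch below filters a state dict by sublayer prefix. It assumes that a sublayer's parameters are stored under names of the form `<sublayer_name>_<suffix>` (for example `w13_weight`); that naming rule and the helper name `extract_by_prefix` are assumptions for this example, not the module's confirmed logic. Plain Python lists stand in for tensors.

```python
def extract_by_prefix(state_dict, sublayer_name):
    """Return {suffix: value} for keys shaped like '<sublayer_name>_<suffix>'.

    Hypothetical helper: the prefix convention is an assumption.
    """
    prefix = f"{sublayer_name}_"
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Plain lists stand in for tensors in this sketch.
state = {"w13_weight": [1, 2], "w13_weight_scale": [0.5], "w2_weight": [3]}
w13 = extract_by_prefix(state, "w13")
w2 = extract_by_prefix(state, "w2")
```

Stripping the prefix keeps the returned names independent of the sublayer they came from, so the same downstream code could handle both "w13" and "w2".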
Source code in vllm/model_executor/layers/quantization/utils/humming_utils.py
_group_shape(group_size, group_size_n=0) ¶
Map humming group sizes to QuantKey GroupShape.
- group_size: elements per group along K (col); 0 means the full dimension.
- group_size_n: elements per group along N (row); 0 means 1 (per-row).
GroupShape convention: row = N dim, col = K dim.
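The mapping described above can be sketched as follows. The `GroupShape` class here is a local stand-in, not the real type, and using `-1` to mean "the full dimension" is an assumption made for this example; `group_shape_sketch` is likewise a hypothetical name.

```python
from typing import NamedTuple


class GroupShape(NamedTuple):
    """Stand-in for the GroupShape used by QuantKey (row = N dim, col = K dim)."""
    row: int
    col: int


def group_shape_sketch(group_size: int, group_size_n: int = 0) -> GroupShape:
    # Assumption: -1 encodes "spans the full dimension" along K.
    col = group_size if group_size > 0 else -1
    # group_size_n == 0 means one row per group (per-row scaling).
    row = group_size_n if group_size_n > 0 else 1
    return GroupShape(row=row, col=col)


per_group = group_shape_sketch(128)       # groups of 128 along K, per-row along N
per_channel = group_shape_sketch(0)       # full K dimension, per-row along N
block = group_shape_sketch(128, 128)      # 128 x 128 blocks
```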
Source code in vllm/model_executor/layers/quantization/utils/humming_utils.py
_humming_input_schema_to_quant_key(schema) ¶
Convert a HummingInputSchema to a QuantKey. Returns None if the schema represents unquantized (bf16/fp16) inputs.
Source code in vllm/model_executor/layers/quantization/utils/humming_utils.py
_prepare_and_transform_sublayer(layer, sublayer_name, shape_n, shape_k, weight_schema, input_schema, has_bias, num_experts, param_dtype) ¶
Prepare layer metadata and transform weights for a sublayer.
This calls Humming's prepare_layer_meta and transform_humming_layer.
Source code in vllm/model_executor/layers/quantization/utils/humming_utils.py
_process_single_sublayer(layer, sublayer_name, shape_n, shape_k, weight_schema, input_schema, has_bias, num_experts, param_dtype, force_weight_schema=None) ¶
Process a single sublayer: convert, optionally requant, prepare, and transform.
This combines the common logic from convert_to_humming_moe_kernel_format for processing a single sublayer.
Parameters:
- layer (RoutedExperts) – The RoutedExperts layer
- sublayer_name (str) – Name of the sublayer (e.g., "w13", "w2")
- shape_n (int) – Output dimension size
- shape_k (int) – Input dimension size
- weight_schema (Any) – Initial weight quantization schema
- input_schema (Any) – Initial input quantization schema
- has_bias (bool) – Whether the layer has bias terms
- num_experts (int) – Number of experts
- param_dtype (dtype) – Parameter data type
- force_weight_schema (Any | None, default: None) – Optional schema to force requantization to
Source code in vllm/model_executor/layers/quantization/utils/humming_utils.py
_replace_layer_parameters(layer, sublayer_name, tensors, preserve_bias=False) ¶
Replace layer parameters for a sublayer with new tensors.
Parameters:
- layer (RoutedExperts) – The RoutedExperts layer
- sublayer_name (str) – Name of the sublayer (e.g., "w13", "w2")
- tensors (dict[str, Tensor]) – Dict of parameter name to tensor
- preserve_bias (bool, default: False) – If True, don't delete bias parameters
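A minimal sketch of the replace-with-optional-bias-preservation pattern, under stated assumptions: parameters live as attributes named `<sublayer_name>_<suffix>`, the keys of `tensors` are suffixes, and bias parameters end in `bias`. None of this is confirmed by the text above, and `replace_params_sketch` is a hypothetical name. A `SimpleNamespace` stands in for the layer and strings stand in for tensors.

```python
from types import SimpleNamespace


def replace_params_sketch(layer, sublayer_name, tensors, preserve_bias=False):
    """Drop a sublayer's old attributes, then attach the new ones."""
    prefix = f"{sublayer_name}_"
    # Copy the attribute names first: we mutate the object while iterating.
    for attr in list(vars(layer)):
        if not attr.startswith(prefix):
            continue  # belongs to another sublayer
        if preserve_bias and attr.endswith("bias"):
            continue  # keep bias parameters when asked to
        delattr(layer, attr)
    for name, tensor in tensors.items():
        setattr(layer, prefix + name, tensor)


layer = SimpleNamespace(w13_weight_packed="old", w13_bias="b", w2_weight="keep")
replace_params_sketch(layer, "w13", {"weight": "new"}, preserve_bias=True)
```

After the call, the old `w13_weight_packed` attribute is gone, `w13_bias` survives because `preserve_bias=True`, and the `w2` attribute is untouched.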
Source code in vllm/model_executor/layers/quantization/utils/humming_utils.py
convert_linear_layer_to_humming_standard(layer, name_map) ¶
Rename/reshape a linear layer's quantized params (the canonical MPLinear layout: weight_packed int32 + weight_scale) into the parameter names and layout humming's weight schema expects (weight / weight_scale).
Source code in vllm/model_executor/layers/quantization/utils/humming_utils.py
convert_to_humming_moe_kernel_format(layer, quant_config=None, sublayer_configs=None, weight_schema=None, input_schema=None, force_weight_schema=None) ¶
Convert MoE weights from checkpoint format to Humming kernel format.
This function processes weights for each sublayer (w13, w2) by:
1. Converting from checkpoint format to humming format if needed
2. Forcing requantization if a different quantization schema is specified
3. Preparing layer metadata for the Humming kernel
4. Transforming weights for inference
Parameters:
- layer (RoutedExperts) – The RoutedExperts layer containing weights to process
- quant_config (dict | None, default: None) – Optional quantization config dict. Required if weight_schema or input_schema is None. Used to build schemas via BaseWeightSchema.from_config().
- sublayer_configs (dict[str, Any] | None, default: None) – Optional configuration dict for each sublayer (w13, w2). Each config must have "shape_n" and "shape_k" keys. If None, configs are built from layer.moe_config properties.
- weight_schema (Any | None, default: None) – Optional initial weight quantization schema. If None, built from quant_config.
- input_schema (Any | None, default: None) – Optional initial input quantization schema. If None, built from quant_config or env vars.
- force_weight_schema (Any | None, default: None) – Optional schema to force requantization to
Side effects:
- Modifies layer parameters in place
- Sets layer.weight_schemas and layer.input_schemas
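The per-sublayer control flow described for convert_to_humming_moe_kernel_format can be sketched roughly as below. This is not the module's implementation: `run_pipeline_sketch` and the `steps` dict of callables are hypothetical stand-ins for the real conversion, requantization, metadata, and transform routines, and the sketch always runs the convert step rather than deciding whether it is needed.

```python
def run_pipeline_sketch(sublayer_configs, steps, force_weight_schema=None):
    """Run the four stages per sublayer and return the order they ran in."""
    log = []
    for name, cfg in sublayer_configs.items():
        # Each config must provide "shape_n" and "shape_k".
        shape_n, shape_k = cfg["shape_n"], cfg["shape_k"]

        steps["convert"](name, shape_n, shape_k)
        log.append((name, "convert"))

        # Requantize only when a target schema is forced.
        if force_weight_schema is not None:
            steps["requant"](name, force_weight_schema)
            log.append((name, "requant"))

        steps["prepare"](name, shape_n, shape_k)
        log.append((name, "prepare"))

        steps["transform"](name)
        log.append((name, "transform"))
    return log


def noop(*args):
    """Placeholder for the real per-stage work."""
    return None


steps = {key: noop for key in ("convert", "requant", "prepare", "transform")}
configs = {
    "w13": {"shape_n": 512, "shape_k": 256},
    "w2": {"shape_n": 256, "shape_k": 256},
}
plain_log = run_pipeline_sketch(configs, steps)
forced_log = run_pipeline_sketch(configs, steps, force_weight_schema="some-schema")
```

Passing the stages in as callables keeps the sketch independent of Humming itself; only the ordering and the conditional requantization step are illustrated.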
Source code in vllm/model_executor/layers/quantization/utils/humming_utils.py
select_humming_moe_experts(config, weight_key, activation_key) ¶
Select the primary Humming MoE Experts class.
Note: Shape-specific fallbacks may still occur at runtime.