vllm.model_executor.layers.quantization.utils.nvfp4_emulation_utils
Functions:
- dequantize_to_dtype – Dequantize the fp4 tensor back to high precision.
_dequantize_nvfp4_kernel(fp4_ptr, scale_ptr, global_scale_ptr, output_ptr, rows_per_batch, num_blocks, BLOCK_SIZE, has_batch_global_scale, TILE_BLOCKS)
Triton kernel for NVFP4 dequantization (swizzle=False).
Optimized with 2D tile processing + interleave for coalesced stores.
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py
_e2m1_inline(nibble)
Decode an NVFP4 nibble (4 bits: 1 sign + 3 magnitude) to float32.
Uses direct IEEE 754 bit construction. For magnitudes 2-7 the FP32 bit pattern is 0x3F000000 + (mag << 22), which is a single shift + add + bitcast. Magnitudes 0 (zero) and 1 (E2M1 subnormal = 0.5) are patched with two tl.where ops.
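The bit construction described above can be checked with a small NumPy sketch. It is an illustration only, not the Triton code: the function name `e2m1_decode_bits` is hypothetical, `np.where` stands in for `tl.where`, and placing the sign in the top bit of the nibble is an assumption.

```python
import numpy as np

def e2m1_decode_bits(nibble: np.ndarray) -> np.ndarray:
    """Decode 4-bit E2M1 codes to float32 by building IEEE 754 bit patterns."""
    nibble = nibble.astype(np.uint32)
    sign = (nibble >> 3) & 1   # assumption: sign is the top bit of the nibble
    mag = nibble & 0x7         # low three bits hold the magnitude code

    # Correct as-is for magnitudes 2..7: one shift, one add, one bitcast.
    bits = np.uint32(0x3F000000) + (mag << np.uint32(22))

    # Patch the two special cases that the formula gets wrong.
    bits = np.where(mag == 0, np.uint32(0x00000000), bits)  # zero
    bits = np.where(mag == 1, np.uint32(0x3F000000), bits)  # subnormal 0.5

    bits = bits | (sign << np.uint32(31))                   # apply the sign bit
    return bits.astype(np.uint32).view(np.float32)
```

For example, magnitude 5 gives 0x3F000000 + (5 << 22) = 0x40400000, which is the float32 encoding of 3.0.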
_e2m1_lookup(magnitude)
Lookup E2M1 float value from 3-bit magnitude.
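As a point of reference, the mapping from 3-bit magnitude to value can be written as a plain table lookup. This is a minimal NumPy sketch; how the Triton helper actually selects the value is not stated here, and the names `E2M1_VALUES` and `e2m1_lookup_table` are hypothetical.

```python
import numpy as np

# The eight non-negative E2M1 values, indexed by the 3-bit magnitude code.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def e2m1_lookup_table(magnitude: np.ndarray) -> np.ndarray:
    """Map integer magnitude codes (0..7) to their float32 E2M1 values."""
    # Masking keeps only the three magnitude bits before indexing.
    return E2M1_VALUES[magnitude & 0x7]
```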
_nvfp4_quant_dequant_kernel(input_ptr, output_ptr, global_scale_ptr, k, num_blocks, BLOCK_SIZE, FP4_MAX_RECIPROCAL, TILE_BLOCKS)
Fused NVFP4 quantize-dequantize kernel.
Uses a 2D grid (rows x tiles) to parallelize across both rows and quantization groups within a row. Each program handles TILE_BLOCKS groups at once using vectorized 2D operations.
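The work split implied by a (rows x tiles) grid can be sketched in plain Python. The helper names and the default `tile_blocks=8` are hypothetical, and the sketch assumes `k` is a multiple of the block size; the values the real launcher uses are not given here.

```python
import math

def launch_grid(num_rows: int, k: int, block_size: int = 16, tile_blocks: int = 8):
    """Return a (rows, tiles) grid shape for a row of k elements."""
    num_blocks = k // block_size                     # quantization groups per row
    num_tiles = math.ceil(num_blocks / tile_blocks)  # programs needed per row
    return (num_rows, num_tiles)

def groups_for_program(tile_id: int, num_blocks: int, tile_blocks: int):
    """Group indices handled by one program; the last tile may be partial."""
    start = tile_id * tile_blocks
    return range(start, min(start + tile_blocks, num_blocks))
```

With 4 rows and k = 256, there are 16 groups per row, so two programs per row each cover 8 groups. A partial final tile is the case a kernel of this shape would typically mask.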
_round_to_fp4(x)
Round float values to the nearest E2M1 representable value.
Matches the thresholds in the Python cast_to_fp4 exactly.
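A NumPy sketch of threshold-based rounding onto the E2M1 grid follows. The exact thresholds of cast_to_fp4 are not listed here, so this sketch assumes the midpoints between neighboring values, with ties going to the neighbor whose mantissa bit is zero (0, 1, 2, 4). The name `round_to_e2m1` is hypothetical.

```python
import numpy as np

def round_to_e2m1(x) -> np.ndarray:
    """Round values to the nearest E2M1 magnitude, preserving sign."""
    x = np.asarray(x, dtype=np.float32)
    sign = np.sign(x)
    a = np.abs(x)

    # Assumed thresholds: midpoints between adjacent representable values.
    out = np.zeros_like(a)                                # a <= 0.25 -> 0.0
    out = np.where((a > 0.25) & (a < 0.75), 0.5, out)
    out = np.where((a >= 0.75) & (a <= 1.25), 1.0, out)
    out = np.where((a > 1.25) & (a < 1.75), 1.5, out)
    out = np.where((a >= 1.75) & (a <= 2.5), 2.0, out)
    out = np.where((a > 2.5) & (a < 3.5), 3.0, out)
    out = np.where((a >= 3.5) & (a <= 5.0), 4.0, out)
    out = np.where(a > 5.0, 6.0, out)                     # saturate at the FP4 max
    return (sign * out).astype(np.float32)
```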
_triton_dequantize_nvfp4(tensor_fp4, tensor_sf, global_scale, dtype, block_size=16)
Dequantize NVFP4 using Triton (swizzle=False only).
Supports both 2D and 3D inputs:
- 2D: [m, packed_k] -> [m, k]
- 3D: [dim0, m, packed_k] -> [dim0, m, k]
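The shape change from packed_k to k can be illustrated by unpacking nibbles in NumPy. This sketch assumes two 4-bit codes per uint8 byte (so k = 2 * packed_k) with the low nibble first; the nibble order and the name `unpack_fp4_codes` are assumptions, not taken from the implementation.

```python
import numpy as np

def unpack_fp4_codes(packed: np.ndarray) -> np.ndarray:
    """[..., packed_k] uint8 -> [..., 2 * packed_k] 4-bit codes."""
    low = packed & 0x0F           # assumption: first element is the low nibble
    high = (packed >> 4) & 0x0F   # second element is the high nibble
    # Interleave so each byte yields two adjacent output elements.
    pairs = np.stack([low, high], axis=-1)
    return pairs.reshape(*packed.shape[:-1], -1)
```

Because only the last axis is touched, the same function covers the 2D and 3D layouts listed above.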
_triton_nvfp4_quant_dequant(x, global_scale, block_size)
Triton-accelerated NVFP4 quantize-dequantize.
dequantize_to_dtype(tensor_fp4, tensor_sf, global_scale, dtype, block_size=16, swizzle=True)
Dequantize the fp4 tensor back to high precision.
Supports both 2D and 3D inputs:
- 2D: [m, packed_k] -> [m, k]
- 3D: [dim0, m, packed_k] -> [dim0, m, k]
ref_nvfp4_quant_dequant(x, global_scale, block_size)
NVFP4 quantize-dequantize operation.
global_scale is expected to have a single element.
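To show the general shape of a block-wise quantize-dequantize round trip, here is a simplified NumPy sketch. It is not the reference implementation: it keeps block scales in float32 and therefore drops global_scale entirely (NVFP4 stores block scales in FP8, which is where a global scale matters), it breaks rounding ties toward zero, and the name `quant_dequant_sketch` is hypothetical.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_MAX = 6.0  # largest E2M1 magnitude

def quant_dequant_sketch(x, block_size: int = 16) -> np.ndarray:
    """Round each block of x onto a scaled E2M1 grid and scale back."""
    x = np.asarray(x, dtype=np.float32)
    m, k = x.shape  # assumes 2D input with k divisible by block_size
    blocks = x.reshape(m, k // block_size, block_size)

    # One scale per block, chosen so the block's largest magnitude maps to 6.0.
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    scale = amax / FP4_MAX
    safe_scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero

    scaled = np.clip(blocks / safe_scale, -FP4_MAX, FP4_MAX)

    # Nearest grid value; argmin picks the smaller magnitude on exact ties.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * E2M1_GRID[idx]

    return (quantized * safe_scale).reshape(m, k).astype(np.float32)
```

Inputs whose blocks already lie on the scaled grid pass through unchanged, which makes a convenient sanity check for any emulation of this kind.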