vllm.model_executor.layers.quantization.utils.mxfp8_utils ¶
Functions:
- dequant_mxfp8_to_bf16 – Dequantize MXFP8 tensor to BF16.
- mxfp8_e4m3_quantize_fake – Fake implementation for torch.compile tracing.
- swizzle_mxfp8_scale – Swizzle MXFP8 scales from row-major 2D to F8_128x4 layout.
_mxfp8_e4m3_quantize_torch(x, is_sf_swizzled_layout=False) ¶
Naive MXFP8 quantization. For each block of 32 elements along the last dimension, compute a shared e8m0 scale (the biased exponent of the block-wise amax) and quantize each element to float8_e4m3fn.
Returns (quantized_values, scales). quantized_values has the same shape as x, in fp8; scales is uint8. The scale shape depends on is_sf_swizzled_layout:
- False -> [..., K//32] (row-major 2D)
- True -> [flat swizzled 1D]
Source code in vllm/model_executor/layers/quantization/utils/mxfp8_utils.py
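The block-wise scheme described above can be sketched in plain Python (no torch, so the arithmetic stays visible). This is a minimal sketch, not the vLLM implementation: the helper names `round_to_e4m3`, `quantize_block`, and `quantize_row` are hypothetical, and the scale rule used here (`floor(log2(amax)) - 8`, biased by 127, saturating at 448) is an assumption taken from the OCP MX convention. The real function may round or clamp the scale differently.

```python
import math

BLOCK = 32        # MX block size along the last dimension
E8M0_BIAS = 127   # e8m0 stores a biased power-of-two exponent in one byte
E4M3_MAX = 448.0  # largest finite float8_e4m3fn value
E4M3_EMAX = 8     # exponent of the largest e4m3 binade (448 = 1.75 * 2**8)


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest float8_e4m3fn value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    mag = min(abs(v), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exp = max(math.frexp(mag)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = math.ldexp(1.0, exp - 3)        # 3 mantissa bits: 8 steps per binade
    q = min(round(mag / step) * step, E4M3_MAX)  # round() is ties-to-even
    return math.copysign(q, v)


def quantize_block(block):
    """Quantize one 32-element block. Returns (fp8_values, scale_byte)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        shared_exp = -E8M0_BIAS  # assumption: all-zero block gets the smallest scale
    else:
        shared_exp = (math.frexp(amax)[1] - 1) - E4M3_EMAX
    shared_exp = max(-E8M0_BIAS, min(shared_exp, 127))
    scale_byte = shared_exp + E8M0_BIAS  # what the uint8 scale tensor would hold
    values = [round_to_e4m3(math.ldexp(v, -shared_exp)) for v in block]
    return values, scale_byte


def quantize_row(row):
    """Quantize a row whose length is a multiple of 32, block by block."""
    assert len(row) % BLOCK == 0
    values, scales = [], []
    for start in range(0, len(row), BLOCK):
        q, s = quantize_block(row[start:start + BLOCK])
        values.extend(q)
        scales.append(s)
    return values, scales


values, scales = quantize_row([1.0] * 32 + [3.0] * 32)
print(scales)                  # [119, 120]
print(values[0], values[32])   # 256.0 384.0
```

A row of K elements yields K//32 scale bytes, which matches the non-swizzled `[..., K//32]` scale shape.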
_mxfp8_e4m3_quantize_triton(x) ¶
Fused 2D MXFP8 quant (non-swizzled, row-major [M, K//32] scales).
_mxfp8_quant_triton_kernel() ¶
Lazily-built Triton kernel: per-32-block E8M0 scale + FP8-E4M3 quant.
Fuses the several elementwise passes of _mxfp8_e4m3_quantize_torch into a single kernel launch. Each program handles a [BLOCK_M, 32] tile: BLOCK_M rows, one MX block wide.
dequant_mxfp8_to_bf16(x, scales) ¶
Dequantize MXFP8 tensor to BF16.
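Dequantization reverses the block scaling: each element is multiplied by the power of two encoded in its block's e8m0 scale byte. The sketch below shows that arithmetic in plain Python. The name `dequant_row` is hypothetical, and the block size of 32 and exponent bias of 127 are assumptions carried over from the MXFP8 format rather than read from the vLLM source.

```python
import math

BLOCK = 32       # elements that share one scale
E8M0_BIAS = 127  # bias of the e8m0 exponent byte


def dequant_row(values, scale_bytes):
    """Multiply each 32-element block by 2**(scale_byte - 127)."""
    assert len(values) == BLOCK * len(scale_bytes)
    out = []
    for i, v in enumerate(values):
        exponent = scale_bytes[i // BLOCK] - E8M0_BIAS
        out.append(math.ldexp(v, exponent))  # v * 2**exponent, exact for floats in range
    return out


row = dequant_row([256.0] * 32 + [384.0] * 32, [119, 120])
print(row[0], row[32])   # 1.0 3.0
```

An e4m3 value carries at most 4 significant bits, so scaling it by a power of two fits BF16's 8-bit significand without rounding, provided the result stays within BF16's exponent range.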
mxfp8_e4m3_quantize_fake(x, is_sf_swizzled_layout=False, alignment=0) ¶
Fake implementation for torch.compile tracing.
swizzle_mxfp8_scale(sf, M, K) ¶
Swizzle MXFP8 scales from row-major 2D to F8_128x4 layout.
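As an illustration of what such a swizzle can look like, the sketch below rearranges a row-major scale matrix into 128-row by 4-column tiles, each stored as 32 rows of 16 bytes. It assumes that F8_128x4 refers to the tiled scale layout documented for block-scaled GEMMs in cuBLAS. The function name `swizzle_128x4_sketch`, the zero padding, and the tile ordering are assumptions, and the actual `swizzle_mxfp8_scale` may differ in any of them.

```python
def swizzle_128x4_sketch(sf, pad=0):
    """Rearrange row-major scales into 128x4 tiles; returns a flat list.

    sf is a list of rows, each a list of scale bytes (shape [M, K // 32]).
    Rows are padded up to a multiple of 128 and columns to a multiple of 4.
    """
    rows, cols = len(sf), len(sf[0])
    row_tiles = -(-rows // 128)  # ceiling division
    col_tiles = -(-cols // 4)
    out = [pad] * (row_tiles * col_tiles * 512)  # each tile holds 128 * 4 entries
    for r in range(rows):
        for c in range(cols):
            tile = (r // 128) * col_tiles + (c // 4)  # tiles stored row-major
            rr, cc = r % 128, c % 4
            # Inside a tile, rows 0, 32, 64, 96 share one 16-byte line.
            offset = (rr % 32) * 16 + (rr // 32) * 4 + cc
            out[tile * 512 + offset] = sf[r][c]
    return out


# Label each entry with its row-major index to make the permutation visible.
sf = [[r * 4 + c for c in range(4)] for r in range(128)]
flat = swizzle_128x4_sketch(sf)
print(flat[0:8])   # [0, 1, 2, 3, 128, 129, 130, 131]
```

In this layout the first four entries come from row 0 and the next four from row 32, so the scales for rows 0, 32, 64, and 96 sit next to each other in memory.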