vllm.model_executor.layers.fusion.quant_activation
A QuantizedActivation is a pre-quantized activation produced by a fused kernel and consumed directly by a linear layer, letting the layer skip its own input quantization. A linear layer advertises the key its kernel can consume via expose_input_quant_key; the kernel validates and reads the activation via as_quantized_activation.
Classes:
- QuantizedActivation – A quantized activation paired with its scale and original metadata.
Functions:
- as_quantized_activation – Validate and narrow a pre-quantized activation for a consumer kernel.
- expose_input_quant_key – Advertise the kernel's pre-quantized input key on the layer, if any.
QuantizedActivation dataclass
A quantized activation paired with its scale and original metadata.
The quant_key describes how data and scale are to be interpreted (dtype, scale granularity, value packing). Details the key does not capture, such as blockscale layout or activation padding, must follow the consumer kernel's convention.
TODO(mgoin): Encode layout and padding requirements in the contract so producers can match consumer kernels without relying on convention.
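As a rough illustration of the contract described above, the container can be pictured as a small dataclass. This is a sketch only, not the vLLM definition: the names `data`, `scale`, and `quant_key` come from the description, while `QuantKeySketch`, `QuantizedActivationSketch`, `orig_dtype`, and `orig_shape` are hypothetical stand-ins chosen for the example, and plain Python values replace tensors.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class QuantKeySketch:
    """Hypothetical stand-in for a quant key (not the vLLM type)."""

    dtype: str               # how to interpret `data`, e.g. "fp8_e4m3"
    scale_granularity: str   # e.g. "per_tensor", "per_token", "per_block"
    packed: bool = False     # True if several values share one storage element


@dataclass
class QuantizedActivationSketch:
    """Illustrative container: quantized data, its scale, and metadata."""

    data: Any                     # quantized values (a tensor in practice)
    scale: Any                    # scale(s) needed to interpret `data`
    quant_key: QuantKeySketch     # describes dtype, granularity, packing
    orig_dtype: str               # assumed metadata field (hypothetical)
    orig_shape: tuple[int, ...]   # assumed metadata field (hypothetical)


key = QuantKeySketch(dtype="fp8_e4m3", scale_granularity="per_tensor")
act = QuantizedActivationSketch(
    data=[12, -7, 3, 0],
    scale=0.05,
    quant_key=key,
    orig_dtype="bfloat16",
    orig_shape=(1, 4),
)
```

Note that nothing in this sketch records block-scale layout or padding; as the description says, those details currently rest on the consumer kernel's convention.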
Source code in vllm/model_executor/layers/fusion/quant_activation.py
as_quantized_activation(x, expected_key)
Validate and narrow a pre-quantized activation for a consumer kernel.
Returns the QuantizedActivation when x is one whose key matches the kernel's declared expected_key, and None when x is a plain tensor (the caller quantizes in-kernel). Raises on a key mismatch so a wrongly routed activation fails loudly instead of being silently re-quantized.
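The three outcomes described above (match, plain tensor, mismatch) can be sketched as follows. This is a minimal stand-in, not the vLLM implementation: `QuantizedActivationSketch` and `as_quantized_activation_sketch` are hypothetical names, the key is simplified to a string, and `ValueError` is an assumption, since the description only says that a mismatch raises.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class QuantizedActivationSketch:
    """Hypothetical stand-in for the pre-quantized activation container."""

    data: Any
    scale: Any
    quant_key: str  # simplified to a string for this sketch


def as_quantized_activation_sketch(
    x: Any, expected_key: str
) -> Optional[QuantizedActivationSketch]:
    # Plain input: signal the caller to quantize in-kernel.
    if not isinstance(x, QuantizedActivationSketch):
        return None
    # Pre-quantized input with the wrong key: fail loudly rather than
    # silently re-quantizing. (Exception type is an assumption.)
    if x.quant_key != expected_key:
        raise ValueError(
            f"expected quant key {expected_key!r}, got {x.quant_key!r}"
        )
    # Pre-quantized input with the expected key: hand it back narrowed.
    return x


plain = [0.1, -0.4, 0.25]
matching = QuantizedActivationSketch([2, -8, 5], 0.05, "fp8_per_tensor")
wrong = QuantizedActivationSketch([2, -8, 5], 0.05, "int8_per_token")

no_prequant = as_quantized_activation_sketch(plain, "fp8_per_tensor")
prequant = as_quantized_activation_sketch(matching, "fp8_per_tensor")
```

Passing `wrong` with the expected key `"fp8_per_tensor"` raises in this sketch, which mirrors the stated intent that a wrongly routed activation should not be quietly accepted.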
Source code in vllm/model_executor/layers/fusion/quant_activation.py
expose_input_quant_key(layer, kernel)
Advertise the kernel's pre-quantized input key on the layer, if any.
This is the bridge from a kernel's input_quant_key() to the layer.input_quant_key attribute that fusion call sites read. The attribute is left unset when the kernel quantizes its own input, so non-supporting backends never receive a QuantizedActivation.
TODO(mgoin): Producers also need the consumer's quantization scales (e.g. static input scale, global scale). Expose those here as well so producers do not reach into kernel-specific layer attributes.
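The bridging behaviour can be sketched with plain Python objects. Everything here is illustrative: the kernel and layer classes are hypothetical, the key is simplified to a string, and the sketch assumes that a kernel which quantizes its own input reports `None` from `input_quant_key()`; the description only guarantees that the layer attribute is left unset in that case.

```python
from typing import Optional


class SelfQuantizingKernel:
    """Hypothetical kernel that quantizes its own input."""

    def input_quant_key(self) -> Optional[str]:
        return None  # assumption: None means "no pre-quantized input"


class PreQuantizedInputKernel:
    """Hypothetical kernel that can consume a pre-quantized activation."""

    def input_quant_key(self) -> Optional[str]:
        return "fp8_per_tensor"


class LayerSketch:
    """Bare stand-in for a linear layer."""


def expose_input_quant_key_sketch(layer: LayerSketch, kernel) -> None:
    key = kernel.input_quant_key()
    # Only set the attribute when the kernel accepts pre-quantized input,
    # so layers on non-supporting backends never advertise a key.
    if key is not None:
        layer.input_quant_key = key


def producer_may_fuse(layer: LayerSketch) -> bool:
    # A fusion call site would read the attribute defensively.
    return getattr(layer, "input_quant_key", None) is not None


plain_layer, fused_layer = LayerSketch(), LayerSketch()
expose_input_quant_key_sketch(plain_layer, SelfQuantizingKernel())
expose_input_quant_key_sketch(fused_layer, PreQuantizedInputKernel())
```

Leaving the attribute unset, rather than setting it to `None`, means a producer that checks with `getattr(..., None)` treats unsupported backends and self-quantizing kernels identically.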