Introduction

vLLM Hardware Plugin for Intel® Gaudi® supports running inference on HPU with 8-bit floating point (FP8) precision using the Intel® Neural Compressor (INC) package. Inference requires prior calibration, which generates the measurements, quantization files, and configuration data needed to run quantized models.

This document explains how to perform calibration. It provides separate procedures for a single Intel® Gaudi® node and for multiple nodes. Before proceeding, review the notes and recommendations and the troubleshooting information to ensure proper execution.

Notes and Recommendations

Device Recommendation

For calibration, use the same device type that you plan to use for inference. The generated measurements are device-dependent, so scales collected on Intel® Gaudi® 3 cannot be reused on Intel® Gaudi® 2, and vice versa. Using measurements generated on a different device type may cause accuracy issues.

Mandatory Parameters

To simplify the calibration process, use the calibrate_model.sh script, which generates the maxabs_quant_g3.json file for FP8 inference. The script requires the following arguments:

-m <path/ID>: Path to a locally stored model or the model ID from the Hugging Face hub.
-d <path>: Path to the source dataset in the pickle (.pkl) format.
-o <path>: Path to the directory where you want to save the generated measurements. We recommend storing unification results in the source directory. This allows you to run the vLLM server with FP8 precision and different tensor parallelism values without modifying the directory specified in the QUANT_CONFIG environment variable.

The script also offers optional arguments that you can explore by executing the script with the -h flag. The more common optional parameters are:

-b <size>: Sets the batch size used for running the measurements (default: 32).
-l <samples>: Limits the number of samples taken from the calibration dataset.
-t <size>: Sets the tensor parallel size (default: 1).
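
As a rough illustration of how the arguments above combine, the following Python sketch assembles a calibrate_model.sh command line. It uses only the flags listed above. The helper name, the model path, the dataset path, and the output directory are hypothetical placeholders, and the sketch assumes that calibrate_model.sh is in the current directory.

```python
import shlex


def build_calibration_command(model, dataset, output_dir,
                              batch_size=None, sample_limit=None, tp_size=None):
    """Assemble a calibrate_model.sh command from the documented flags."""
    cmd = ["./calibrate_model.sh", "-m", model, "-d", dataset, "-o", output_dir]
    # Optional flags are added only when set, so the script defaults apply otherwise.
    if batch_size is not None:
        cmd += ["-b", str(batch_size)]
    if sample_limit is not None:
        cmd += ["-l", str(sample_limit)]
    if tp_size is not None:
        cmd += ["-t", str(tp_size)]
    return cmd


# Hypothetical placeholder values, for illustration only.
command = build_calibration_command(
    model="/models/my-model",
    dataset="/data/calibration.pkl",
    output_dir="/data/inc-measurements",
    tp_size=2,
)

# Print the command for review; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so paths that contain spaces are passed to the script unchanged.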

Dataset

The calibration procedure works with any dataset that contains the system_prompt and question fields. These fields are used to prepare a calibration dataset with prompts formatted specifically for your model. We recommend using a public dataset from MLCommons, as used in the Llama2-70b inference submission.
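
Before starting a long calibration run, it can help to confirm that the dataset has both required fields. The sketch below is one possible pre-flight check, not part of the calibration scripts. It assumes that the pickle file loads into a list of dictionaries, which is a simplification for this example; the layout that the scripts actually expect may differ (for example, a tabular object), and the helper name is hypothetical.

```python
import os
import pickle
import tempfile

REQUIRED_FIELDS = ("system_prompt", "question")


def missing_fields(records):
    """Return the sorted required fields that are absent from at least one record."""
    missing = set()
    for record in records:
        missing.update(field for field in REQUIRED_FIELDS if field not in record)
    return sorted(missing)


# Hypothetical two-row dataset, written to and read back from a .pkl file.
rows = [
    {"system_prompt": "You are a helpful assistant.", "question": "What is FP8?"},
    {"system_prompt": "Answer briefly.", "question": "What is calibration?"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "calibration.pkl")
    with open(path, "wb") as f:
        pickle.dump(rows, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# An empty list means that every record has both required fields.
print(missing_fields(loaded))
```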

DeepSeek Models

For the DeepSeek-R1 series models, which contain 256 experts, provide a diverse and sufficiently large sample set to ensure that all experts are properly activated during calibration. Through testing, we observed that using NeelNanda/pile-10k and selecting 512 samples, each with at least 1,024 tokens, provides effective calibration coverage.
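
The selection rule described above (512 samples, each with at least 1,024 tokens) can be sketched as a simple filter. This is an illustration only: the function names are hypothetical, the corpus is synthetic, and whitespace splitting stands in for the model tokenizer, which would give different token counts in practice.

```python
def select_calibration_samples(texts, count_tokens, min_tokens=1024, max_samples=512):
    """Keep the first `max_samples` texts that have at least `min_tokens` tokens."""
    selected = []
    for text in texts:
        if count_tokens(text) >= min_tokens:
            selected.append(text)
            if len(selected) == max_samples:
                break
    return selected


def whitespace_token_count(text):
    # Stand-in for the model tokenizer (an assumption made for this sketch).
    return len(text.split())


# Synthetic corpus: 600 short documents (10 words) and 600 long ones (2,000 words).
corpus = [("word " * n).strip() for n in (10, 2000) * 600]
samples = select_calibration_samples(corpus, whitespace_token_count)
print(len(samples))  # 512
```

Passing the token counter as a parameter makes it straightforward to replace the stand-in with the tokenizer of the model being calibrated.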

Important (MoE models calibrated with expert parallelism). When you calibrate a Mixture-of-Experts model with -u (expert parallelism) on more than one card, each rank measures only its EP-local experts (for example, 16 of 128 experts per rank for Llama-4 Maverick on TP8/EP8). These per-rank measurements must be unified so that every expert has its own scale. Otherwise, the non-local experts fall back to coarse quantization at serve time and FP8 accuracy drops. The unify step runs only when a target rank is given, so pass both -u and -r to calibrate_model.sh (for example, -r 1 to unify onto a single card), and then use step-6-expand-measurements.py to expand to the deployment card count. Passing -u with TP>1 but without -r skips unification and leaves only the EP-local experts measured. Small-expert models such as Mixtral are not affected. See the MoE recommendations for the full unify-and-expand procedure.
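
The arithmetic and the flag rule above can be captured in a short sketch. The function names are hypothetical, the expert split is assumed to be even across ranks, and the second function models only the rule as stated in this note, not the internals of calibrate_model.sh.

```python
def experts_per_rank(num_experts, ep_size):
    """Number of experts each rank measures when experts are split evenly."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must divide evenly across ranks")
    return num_experts // ep_size


def leaves_experts_unmeasured(use_expert_parallelism, tp_size, unify_rank=None):
    """True when -u is used with TP>1 but no -r is given, so unification is skipped."""
    return use_expert_parallelism and tp_size > 1 and unify_rank is None


print(experts_per_rank(128, 8))                         # 16
print(leaves_experts_unmeasured(True, 8))               # True
print(leaves_experts_unmeasured(True, 8, unify_rank=1)) # False
```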

Calibration Procedures

Follow the calibration procedure that matches your setup: a single Intel® Gaudi® node or multiple nodes.

Configuration

The calibrate_model.sh script automatically generates appropriate configuration files for the calibration process. However, if you require advanced customization, you can use JSON configuration files. The quantization_config directory contains JSON templates that you can use directly or modify to suit your specific requirements.

To apply a custom configuration for calibration, set the QUANT_CONFIG environment variable to the path of your configuration JSON file when running the step-2-measure-scales.py and step-4-quantize-scales.py calibration steps. To apply the configuration when running FP8 inference, set QUANT_CONFIG to point to the quantization configuration file, either the one generated by calibration or your custom configuration.

Supported Configuration Options

The following table summarizes the options that you can set in a configuration file:

Attribute	Description	Values
`mode`	The mode to run INC with.	- `MEASURE`: Measures statistics of all modules and emits the results to `dump_stats_path`. - `QUANTIZE` (default): Quantizes and runs the model according to the provided measurements.
`observer`	The method used to observe and track tensor statistics.	- `maxabs` (default): Tracks the maximum absolute values of tensors. - `save`: Saves all tensors to files.
`allowlist`	The list of `nn.Module` names or types to quantize. Empty list means all supported modules are quantized by default. See Custom Patched Modules.	Default: empty list
`blocklist`	List of `nn.Module` names or types not to quantize.	Default: empty list
`dump_stats_path`	The path to save and load measurements. Directory structure is created up to the last `/`; the string after the last `/` is used as a prefix for measurement files.	Default: `stats`
`scale_method`	The method for calculating the scale from measurements.	- `unit_scale`: Always uses the scale of 1. - `maxabs_arbitrary`: Stretches or compresses maxabs to the full scale of FP8. - `maxabs_hw` (default): Stretches or compresses maxabs to the full scale of FP8, then replaces it with a hardware-accelerated scale based on `device_for_scales`. - `maxabs_pow2`: Stretches or compresses maxabs to the full scale of FP8, then replaces it with a hardware-accelerated scale based on `device_for_scales`, rounded to a power of 2. - `maxabs_hw_opt_weight`: The weight scale is chosen for the minimal MSE among hardware-accelerated scales; activations use `maxabs_hw`. - `act_maxabs_pow2_weights_pcs_opt_pow2`: Per-channel weights use `maxabs_hw_opt_weight`; activations use `maxabs_pow2`. - `act_maxabs_hw_weights_pcs_maxabs_pow2`: Per-channel weights use `maxabs_pow2`; activations use `maxabs_hw`. - `act_maxabs_pcs_pow2_weight_maxabs_pts_pow2_hw`: Only for dynamic quantization. Per-tensor weights use `maxabs_hw`; activations use per-token `maxabs_pow2`.
`measure_exclude`	Tensor types to exclude from measurement.	- `NONE`: Measures all tensors. - `OUTPUT` (default): Skips output tensors.
`scale_format`	The format of scales passed to custom PyTorch operations.	- `const`: Scales passed as tensors. - `scalar` (default): Scales passed as scalar values for compilation time and throughput optimizations.
`device_for_scales`	Exponent-bias values for converting FP32/BF16 to FP8-143.	- `GAUDI3`: The expanded exponent-bias range (0–63). - `GAUDI2`: Four possible exponent biases (3, 7, 11, 15), default is 7.
`dynamic_quantization`	Enables dynamic FP8 quantization with per-token scales. Only supported with `act_maxabs_pcs_pow2_weight_maxabs_pts_pow2_hw`.	- `true`: Enable. - `false` (default): Disable.
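
To show how these options fit together, the following sketch writes a small custom configuration file and prepares an environment in which QUANT_CONFIG points to it. The attribute names and values are taken from the table above, while the file name and the dump_stats_path value are hypothetical. The templates in the quantization_config directory may contain additional keys that are not shown here, so treat this as a starting point and not as a complete configuration.

```python
import json
import os
import tempfile

# Attribute names and values come from the options table; the paths are hypothetical.
quant_config = {
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    # Directories are created up to the last "/"; "stats" becomes the file prefix.
    "dump_stats_path": "./inc-measurements/stats",
}

config_dir = tempfile.mkdtemp()
quant_path = os.path.join(config_dir, "custom_quant.json")
with open(quant_path, "w") as f:
    json.dump(quant_config, f, indent=2)

# Copy of the current environment with QUANT_CONFIG set, for example to pass
# as the `env` argument of subprocess.run when launching FP8 inference.
env = dict(os.environ, QUANT_CONFIG=quant_path)
print(env["QUANT_CONFIG"])
```

For the measurement step, the same pattern applies with "mode" set to "MEASURE".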

Configuring Backoff Factors

When using any of the maxabs-based scale_method options, you can fine-tune the quantization behavior by configuring backoff factors. The input_backoff and weight_backoff factors provide a safety margin when converting inputs and weights to FP8. Because an activation at inference time can have a larger absolute value than the one observed during calibration, the measured maxabs value of an input is scaled to the following value instead of the FP8 full scale:

input_backoff * FP8_143_FULLSCALE

Similarly, for weights:

weight_backoff * FP8_143_FULLSCALE

By default, the backoff factors are set to:

input_backoff: 0.25
weight_backoff: 0.5

To change these values, add the following to the quantization configuration JSON file:

"scale_params": {"input_backoff": <INPUT_BACKOFF>, "weight_backoff": <WEIGHT_BACKOFF>}
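
The effect of the backoff factors can be worked through numerically. The sketch below is illustrative only: the full-scale value of 240 is an assumption chosen for the example (the actual FP8_143_FULLSCALE depends on the device), the function names are hypothetical, and the scale is assumed to follow the convention quantized = tensor / scale.

```python
def backoff_target(backoff, fullscale):
    """Value that the measured maxabs is mapped to: backoff * fullscale."""
    return backoff * fullscale


def maxabs_scale(maxabs, backoff, fullscale):
    """Scale that maps `maxabs` onto the backed-off FP8 range (assumed convention)."""
    return maxabs / backoff_target(backoff, fullscale)


FP8_143_FULLSCALE = 240.0  # Assumed value for illustration only.

# Default factors from the text: inputs keep more headroom than weights.
print(backoff_target(0.25, FP8_143_FULLSCALE))     # 60.0
print(backoff_target(0.5, FP8_143_FULLSCALE))      # 120.0

# An activation with a measured maxabs of 6.0 is mapped to 60.0, not 240.0.
print(maxabs_scale(6.0, 0.25, FP8_143_FULLSCALE))  # 0.1
```

With input_backoff set to 0.25, an activation can be up to four times larger than the calibrated maxabs before it exceeds the assumed full scale. Weights are known at calibration time, which is consistent with their larger default factor of 0.5.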

Compilation Time and Throughput Optimization

The scale_format configuration option provides performance optimizations for FP8 inference. When set to scalar (default), it improves both compilation speed and runtime throughput by reducing the number of compiled recipes and minimizing host-side overhead when launching FP8 operations. Note that the compilation time improvement varies depending on your model's properties, such as the recipe count and scale distribution.

This optimization is not applicable to Per-Channel Quantization (PCQ).

Troubleshooting

If the script fails with the following device memory allocation error, ensure that you set a valid tensor parallelism value that is large enough for the model to fit, for example -t 8:

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::939524096 (896)MB