Introduction

vLLM Hardware Plugin for Intel® Gaudi® supports running inference on HPU with 8-bit floating point (FP8) precision using the Intel® Neural Compressor (INC) package. FP8 inference requires a prior calibration run, which generates the measurements, quantization files, and configuration data needed to run quantized models.

This document explains how to perform calibration. It provides separate procedures for a single Intel® Gaudi® node and multiple nodes. Before proceeding, review the notes and recommendations and troubleshooting information to ensure proper execution.

Notes and Recommendations

Device Recommendation

For calibration, use the same device type that you plan to use for inference. The generated measurements are device-dependent, so scales collected on Intel® Gaudi® 3 cannot be reused on Intel® Gaudi® 2, and vice versa. Using measurements generated on a different device type may cause accuracy issues.

Mandatory Parameters

To simplify the calibration process, we provide the calibrate_model.sh script, which generates the maxabs_quant_g3.json file for FP8 inference. The script requires the following arguments:

  • -m <path/ID>: Path to a locally stored model or the model ID from the Hugging Face hub.
  • -d <path>: Path to the source dataset in the pickle (.pkl) format.
  • -o <path>: Path to the directory where you want to save the generated measurements. We recommend storing unification results in the source directory. This allows you to run the vLLM server with FP8 precision and different tensor parallelism values without modifying the directory specified in the QUANT_CONFIG environment variable.

The script also offers optional arguments that you can explore by executing the script with the -h flag. The more common optional parameters are:

  • -b <size>: Sets the batch size used for running the measurements (default: 32).
  • -l <samples>: Limits the number of samples taken from the calibration dataset.
  • -t <size>: Sets the tensor parallel size (default: 1).
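As a rough illustration of how the mandatory and optional arguments above combine, the following sketch assembles a command line for calibrate_model.sh. The helper name build_calibration_command, the model ID, and all paths are hypothetical placeholders; only the flags themselves come from the lists above.

```python
def build_calibration_command(model, dataset, output_dir,
                              batch_size=None, sample_limit=None, tp_size=None):
    """Return the argument list for a calibrate_model.sh invocation."""
    # Mandatory arguments: model, dataset, output directory.
    cmd = ["./calibrate_model.sh", "-m", model, "-d", dataset, "-o", output_dir]
    # Optional arguments are appended only when explicitly requested,
    # so the script's own defaults apply otherwise.
    if batch_size is not None:
        cmd += ["-b", str(batch_size)]
    if sample_limit is not None:
        cmd += ["-l", str(sample_limit)]
    if tp_size is not None:
        cmd += ["-t", str(tp_size)]
    return cmd


cmd = build_calibration_command(
    model="my-org/my-model",            # hypothetical model ID
    dataset="/data/calibration.pkl",    # hypothetical dataset path
    output_dir="/data/measurements",    # hypothetical output directory
    sample_limit=512,
    tp_size=8,
)
print(" ".join(cmd))
```

On a machine where the script is available in the current directory, the resulting list could be passed to subprocess.run(cmd, check=True).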

Dataset

The calibration procedure works with any dataset that contains the system_prompt and question fields. These fields are used to build a calibration dataset with prompts formatted specifically for your model. We recommend using a public dataset from MLCommons, as used in the Llama2-70b inference submission.
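Before starting a long calibration run, it can help to confirm that a pickled dataset actually exposes the two required fields. The sketch below is one possible check; it assumes the pickle holds either a DataFrame-like object with a columns attribute or a list of dict records. That layout is an assumption for illustration, not something the script is known to require, and the sample records are made up.

```python
import io
import pickle

REQUIRED_FIELDS = ("system_prompt", "question")


def missing_fields(dataset):
    """Return the required fields that the loaded dataset does not provide."""
    if hasattr(dataset, "columns"):
        # DataFrame-like object (assumed layout): fields are column names.
        available = set(dataset.columns)
    elif isinstance(dataset, list) and dataset:
        # List of dict records (assumed layout): keep keys present in every record.
        available = set.intersection(*(set(record) for record in dataset))
    else:
        available = set()
    return [field for field in REQUIRED_FIELDS if field not in available]


# Hypothetical records, round-tripped through pickle in memory.
records = [
    {"system_prompt": "You are a helpful assistant.", "question": "What is FP8?"},
    {"system_prompt": "Answer briefly.", "question": "What is calibration?"},
]
buffer = io.BytesIO()
pickle.dump(records, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)

print(missing_fields(loaded))
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.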

DeepSeek Models

For the DeepSeek-R1 series models, which contain 256 experts, provide a diverse and sufficiently large sample set to ensure that all experts are properly activated during calibration. Through testing, we observed that using NeelNanda/pile-10k and selecting 512 samples, each with at least 1,024 tokens, provides effective calibration coverage.
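The selection rule described above (512 samples, each with at least 1,024 tokens) can be sketched as a simple filter. The function names are hypothetical, and the whitespace-based token counter is only a stand-in: in practice you would count tokens with the tokenizer of the model being calibrated.

```python
def select_calibration_samples(texts, count_tokens, min_tokens=1024, num_samples=512):
    """Pick up to num_samples texts that contain at least min_tokens tokens."""
    selected = []
    for text in texts:
        if count_tokens(text) >= min_tokens:
            selected.append(text)
            if len(selected) == num_samples:
                break
    return selected


def whitespace_token_count(text):
    """Stand-in token counter; replace with the model's real tokenizer."""
    return len(text.split())


# Tiny synthetic corpus: two texts are long enough, two are not.
corpus = ["word " * 2000, "short text", "token " * 1024, "x " * 1023]
picked = select_calibration_samples(corpus, whitespace_token_count)
print(len(picked))
```

If fewer qualifying samples exist than requested, the function returns what it found, so it is worth checking the length of the result before calibrating.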

Calibration Procedures

Follow the calibration procedure that matches your setup, either a single Intel® Gaudi® node or multiple nodes. Each procedure is described in its own chapter.

Configuration

The calibrate_model.sh script automatically generates appropriate configuration files for the calibration process. However, if you require advanced customization, you can use JSON configuration files. The quantization_config directory contains JSON templates that you can use directly or modify to suit your specific requirements.

To apply a custom configuration during calibration, set the QUANT_CONFIG environment variable to the path of your configuration JSON file when running the step-2-measure-scales.py and step-4-quantize-scales.py calibration steps. To apply a configuration when running FP8 inference, set QUANT_CONFIG to the path of the quantization configuration file, either the one generated by calibration or your custom configuration.
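One way to supply QUANT_CONFIG to a single step without changing the parent shell is to build a dedicated environment for the child process. This is a minimal sketch; the helper name env_with_quant_config and the configuration path are hypothetical.

```python
import os


def env_with_quant_config(config_path, base_env=None):
    """Return a copy of the environment with QUANT_CONFIG set to config_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["QUANT_CONFIG"] = config_path
    return env


# Hypothetical path to a custom configuration file.
env = env_with_quant_config(
    "/path/to/custom_quant_config.json",
    base_env={"PATH": "/usr/bin"},
)
print(env["QUANT_CONFIG"])
```

The returned dictionary can be passed as the env argument of subprocess.run when launching a calibration step or the inference server, together with whatever arguments that command requires.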

Supported Configuration Options

The following table summarizes the options that you can set in a configuration file:

| Attribute | Description | Values |
|---|---|---|
| mode | The mode to run INC with. | MEASURE: Measures statistics of all modules and emits the results to dump_stats_path. |
| | | QUANTIZE (default): Quantizes and runs the model according to the provided measurements. |
| observer | The method used to observe and track tensor statistics. | maxabs (default): Tracks the maximum absolute values of tensors. |
| | | save: Saves all tensors to files. |
| allowlist | The list of nn.Module names or types to quantize. Empty list means all supported modules are quantized by default. See Custom Patched Modules. | Default: empty list |
| blocklist | List of nn.Module names or types not to quantize. | Default: empty list |
| dump_stats_path | The path to save and load measurements. Directory structure is created up to the last /; the string after the last / is used as a prefix for measurement files. | Default: stats |
| scale_method | The method for calculating the scale from measurements. | unit_scale (default): Always uses the scale of 1. |
| | | maxabs_arbitrary: Stretches or compresses maxabs to the full-scale of FP8. |
| | | maxabs_hw: Stretches or compresses maxabs to full-scale of FP8, then replaces it with hardware-accelerated scale based on device_for_scales. |
| | | maxabs_pow2: Stretches or compresses maxabs to full-scale of FP8, then replaces it with hardware-accelerated scale based on device_for_scales, rounded to the power of 2. |
| | | maxabs_hw_opt_weight: The weight scale chosen for the minimal MSE among hardware-accelerated scales; activations use maxabs_hw. |
| | | act_maxabs_pow2_weights_pcs_opt_pow2: Per-channel weights use maxabs_hw_opt_weight; activations use maxabs_pow2. |
| | | act_maxabs_hw_weights_pcs_maxabs_pow2: Per-channel weights use maxabs_pow2; activations use maxabs_hw. |
| | | act_maxabs_pcs_pow2_weight_maxabs_pts_pow2_hw: Only for dynamic quantization. Per-tensor weights use maxabs_hw; activations use per-token maxabs_pow2. |
| measure_exclude | Tensor types to exclude from measurement. | NONE: Measures all tensors. |
| | | OUTPUT (default): Skips output tensors. |
| scale_format | The format of scales passed to custom PyTorch operations. | const: Scales passed as tensors. |
| | | scalar (default): Scales passed as scalar values for compilation time and throughput optimizations. |
| device_for_scales | Exponent-bias values for converting FP32/BF16 to FP8-143. | GAUDI3: The expanded exponent-bias range (0–63). |
| | | GAUDI2: Four possible exponent biases (3, 7, 11, 15), default is 7. |
| dynamic_quantization | Enables dynamic FP8 quantization with per-token scales. Only supported with act_maxabs_pcs_pow2_weight_maxabs_pts_pow2_hw. | true: Enable. |
| | | false (default): Disable. |
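To make the options above more concrete, the following sketch builds a small configuration as a Python dictionary, checks a few attributes against the values listed in the table, and shows how a dump_stats_path value splits into a directory and a file prefix. The helper names and the example path are hypothetical, the check covers only four attributes, and the exact schema accepted by INC (for example, the structure of allowlist and blocklist) is not shown here, so treat the JSON templates in the quantization_config directory as the reference.

```python
import json

# Values taken from the options table; other attributes are not checked here.
ALLOWED_VALUES = {
    "mode": {"MEASURE", "QUANTIZE"},
    "observer": {"maxabs", "save"},
    "measure_exclude": {"NONE", "OUTPUT"},
    "scale_format": {"const", "scalar"},
}


def check_config(config):
    """Return (attribute, value) pairs whose value is not listed in the table."""
    return [
        (key, config[key])
        for key, allowed in ALLOWED_VALUES.items()
        if key in config and config[key] not in allowed
    ]


def split_dump_stats_path(path):
    """Split dump_stats_path into (directory, file prefix) at the last '/'."""
    directory, _, prefix = path.rpartition("/")
    return directory, prefix


config = {
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "measure_exclude": "OUTPUT",
    "scale_format": "scalar",
    "dump_stats_path": "./measurements/my_model/stats",  # hypothetical path
}

print(json.dumps(config, indent=2))
print(check_config(config))
print(split_dump_stats_path(config["dump_stats_path"]))
```

With the example path, measurement files would be written under ./measurements/my_model with names that start with stats.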

Configuring Backoff Factors

When using any of the maxabs-based scale_method options, you can fine-tune the quantization behavior by configuring backoff factors. The input_backoff and weight_backoff factors provide a safety margin when converting inputs and weights to FP8, so that an activation with a larger absolute value than any observed during calibration can still be represented. To leave this headroom, the measured maxabs value of an input is not mapped to the full FP8 range but is scaled to:

input_backoff * FP8_143_FULLSCALE

Similarly, for weights:

weight_backoff * FP8_143_FULLSCALE

By default, the backoff factors are set to:

  • input_backoff: 0.25
  • weight_backoff: 0.5
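A short numeric sketch shows what the default factors mean in practice. The full-scale value of 240.0 is an assumed stand-in used only for illustration, since the actual FP8_143_FULLSCALE depends on the device's FP8 format. The function names and the convention that a tensor is divided by its scale are likewise assumptions, not a description of the INC implementation.

```python
FP8_143_FULLSCALE = 240.0  # assumed stand-in value, for illustration only


def backoff_target(backoff, fullscale=FP8_143_FULLSCALE):
    """Value that the measured maxabs is mapped to in the FP8 range."""
    return backoff * fullscale


def scale_from_maxabs(maxabs, backoff, fullscale=FP8_143_FULLSCALE):
    """Scale s such that maxabs / s equals the backoff target (assumed convention)."""
    return maxabs / backoff_target(backoff, fullscale)


input_target = backoff_target(0.25)   # default input_backoff
weight_target = backoff_target(0.5)   # default weight_backoff

# Hypothetical activation whose measured maxabs during calibration was 12.0.
scale = scale_from_maxabs(maxabs=12.0, backoff=0.25)

print(input_target, weight_target, scale)
```

Under these assumptions, an input_backoff of 0.25 leaves room for activations up to four times larger than the calibrated maxabs before they exceed the full scale, while a weight_backoff of 0.5 leaves a factor of two for weights.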

To change these values, add the following to the quantization configuration JSON file:

"scale_params": {"input_backoff": <INPUT_BACKOFF>, "weight_backoff": <WEIGHT_BACKOFF>}

Compilation Time and Throughput Optimization

The scale_format configuration option provides performance optimizations for FP8 inference. When set to scalar (default), it improves both compilation speed and runtime throughput by reducing the number of compiled recipes and minimizing host-side overhead when launching FP8 operations. Note that the compilation time improvement varies depending on your model's properties, such as the recipe count and scale distribution.

This optimization is not applicable to Per-Channel Quantization (PCQ).

Troubleshooting

The following error indicates that a device memory allocation failed. If you encounter it when running the script, make sure that you set a valid tensor parallelism value, for example -t 8.

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::939524096 (896)MB