Fp8 block example

Gemma 4 FP8 Block Example

This example quantizes the google/gemma-4-31B-it model to FP8 block format using the model_free_ptq entrypoint. Because FP8 block quantization does not require a calibration dataset, no calibration data is needed.

The full example script can be found here.

Code Walkthrough

1. Quantize Model to FP8 Block Format

from llmcompressor import model_free_ptq

MODEL_ID = "google/gemma-4-31B-it"
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8_BLOCK"

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    ignore=["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"],
    max_workers=8,
    device="cuda:0",
)

The ignore list skips the vision tower, lm_head, and embedding layers, which are kept in their original precision.