Fp8 block example

Kimi-K2.6 FP8 Block Example

Code Walkthrough

The original Kimi K2.6 checkpoint ships in a quantized format with 4-bit integer weights. In order to create an FP8 block checkpoint, we must first dequantize to full-precision (bfloat16), then quantize to the desired FP8 Block format. Fortunately, this can be done in a single call to the model_free_ptq entrypoint because FP8 block quantization does not require a calibration dataset. The original 4-bit weights will be loaded from the safetensors files, upconverted to bfloat16, and quantized to FP8 block in a single pipeline.

The full example script can be found in the examples here.

The snippet below was run successfully on a single H100x80GB GPU.

1. Convert model from 4-bit integer weights to fp8 block format.

from compressed_tensors.entrypoints.convert import CompressedTensorsDequantizer

from llmcompressor import model_free_ptq

MODEL_ID = "moonshotai/Kimi-K2.6"
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-BLOCK"

ignore = [
    "re:.*mlp.gate$",
    "re:.*lm_head",
    "re:.*kv_a_proj_with_mqa$",
    "re:.*q_a_proj$",
    "re:.*vision_tower.*",
    "re:.*embed_tokens$",
    # ignore anything not in language_model
    "re:.*mm_projector.*",
    "re:.*vision.*",
]

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    ignore=ignore,
    converter=CompressedTensorsDequantizer(
        MODEL_ID,
        ignore=ignore,
    ),
    max_workers=2,
    device="cuda:0",
)