Kimi-K2.6 NVFP4 Example

Code Walkthrough

The original Kimi K2.6 checkpoint ships in a quantized format with 4-bit integer weights. To create an NVFP4 checkpoint that can leverage NVIDIA's 4-bit floating-point kernels, we must first dequantize the weights to bfloat16 and then quantize them to NVFP4. This requires saving the dequantized model to an intermediate directory. The quantization process has two main steps:

1. Dequantize the model
2. Apply quantization to the dequantized checkpoint

The full example script can be found in the examples here.
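
To get a feel for how much disk space the intermediate directory needs, here is a rough back-of-the-envelope estimate. It is only a sketch: it assumes a round one trillion parameters, 2 bytes per bfloat16 weight, and treats NVFP4 as 4 bits per weight plus one 8-bit scale per group of 16 weights. It ignores the layers that stay unquantized and any file metadata, so the numbers are indicative rather than measured.

```python
# Rough size estimate; every input here is an assumption, not a measurement.
NUM_PARAMS = 1e12   # "one trillion parameters", rounded
BF16_BITS = 16      # bits per bfloat16 weight
FP4_BITS = 4        # bits per NVFP4 element
GROUP_SIZE = 16     # elements sharing one scale (assumed NVFP4 group size)
SCALE_BITS = 8      # bits per group scale (assumed FP8)


def checkpoint_terabytes(num_params: float, bits_per_param: float) -> float:
    """Convert a parameter count and bit width to decimal terabytes."""
    return num_params * bits_per_param / 8 / 1e12


# Effective bits per weight once the shared scale is amortized over a group.
nvfp4_bits = FP4_BITS + SCALE_BITS / GROUP_SIZE
bf16_tb = checkpoint_terabytes(NUM_PARAMS, BF16_BITS)
nvfp4_tb = checkpoint_terabytes(NUM_PARAMS, nvfp4_bits)

print(f"bfloat16 intermediate: ~{bf16_tb:.2f} TB")
print(f"NVFP4 output:          ~{nvfp4_tb:.2f} TB")
```

Under these assumptions the intermediate bfloat16 checkpoint is several times larger than the final NVFP4 checkpoint, so it is worth checking free disk space before starting step 1.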

1. Dequantize Model

from compressed_tensors.entrypoints.convert import (
    CompressedTensorsDequantizer,
    convert_checkpoint,
)

MODEL_ID = "moonshotai/Kimi-K2.6"
DEQUANTIZED_SAVE_DIR = "Kimi-K2.6-bf16"

ignore = [
    "re:.*mlp.gate$",
    "re:.*lm_head",
    "re:.*self_attn.*",
    "re:.*embed_tokens$",
    # ignore anything not in language_model
    "re:.*mm_projector.*",
    "re:.*vision.*",
]

# Convert to dense bfloat16 format
convert_checkpoint(
    model_stub=MODEL_ID,
    save_directory=DEQUANTIZED_SAVE_DIR,
    converter=CompressedTensorsDequantizer(
        MODEL_ID,
        ignore=ignore,
    ),
    max_workers=4,
)
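
The `re:` entries in `ignore` are regular expressions over module names. As a minimal sketch of how such patterns select modules, the snippet below strips the `re:` prefix and applies `re.match` to each name. This is an assumption about the matching convention, not the library's actual implementation, and both the helper `is_ignored` and the module names are hypothetical.

```python
import re

ignore = [
    "re:.*mlp.gate$",
    "re:.*lm_head",
    "re:.*self_attn.*",
    "re:.*embed_tokens$",
    "re:.*mm_projector.*",
    "re:.*vision.*",
]


def is_ignored(name: str, patterns: list[str]) -> bool:
    """Hypothetical helper: True if any pattern matches the module name."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Assumed convention: the text after "re:" is a regex.
            if re.match(pattern[3:], name):
                return True
        elif pattern == name:
            # Patterns without the prefix are treated as exact names.
            return True
    return False


# Hypothetical module names, chosen only to exercise the patterns.
names = [
    "language_model.model.layers.3.mlp.gate",
    "language_model.model.layers.3.mlp.experts.0.gate_proj",
    "language_model.model.layers.3.self_attn.q_proj",
    "language_model.lm_head",
    "vision_tower.encoder.blocks.0.mlp.fc1",
]
for name in names:
    status = "ignored" if is_ignored(name, ignore) else "quantized"
    print(f"{name}: {status}")
```

Note the effect of the `$` anchor in this sketch: a name ending in `mlp.gate` is ignored, while a name ending in `gate_proj` is not, so it would still be converted.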

2. Apply Quantization

Once dequantized, the model can be quantized to NVFP4 via oneshot. NVFP4 uses static activation quantization, so a calibration dataset is required for oneshot. Because the model has one trillion parameters, we leverage the compressed_tensors.offload module with disk offloading to run the calibration dataset through the model. The snippet below reuses DEQUANTIZED_SAVE_DIR and ignore from step 1, and was run successfully on a single 80GB H100 GPU with 500GB of CPU RAM.
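
To build intuition for what a 4-bit floating-point format does to a tensor, here is a simplified, pure-Python sketch of group-wise rounding to an FP4 grid. It assumes the E2M1 value grid and a group size of 16, and keeps each group scale as an ordinary Python float. The real format, as we understand it, additionally stores group scales in FP8 and applies a per-tensor global scale; both are omitted here, so this illustrates the idea and is not the library's implementation. The function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0
GROUP_SIZE = 16  # assumed number of elements sharing one scale


def fake_quantize_group(group: list[float]) -> list[float]:
    """Round one group to the FP4 grid, then map it back to floats."""
    absmax = max(abs(x) for x in group)
    if absmax == 0.0:
        return [0.0] * len(group)
    scale = absmax / FP4_MAX  # the largest magnitude maps to 6.0
    out = []
    for x in group:
        # Pick the nearest representable magnitude on the FP4 grid.
        q = min(FP4_VALUES, key=lambda v: abs(abs(x) / scale - v))
        out.append((q if x >= 0 else -q) * scale)
    return out


def fake_quantize(values: list[float]) -> list[float]:
    """Apply the group-wise rounding over consecutive groups."""
    result = []
    for start in range(0, len(values), GROUP_SIZE):
        result.extend(fake_quantize_group(values[start:start + GROUP_SIZE]))
    return result


weights = [0.01 * i - 0.15 for i in range(32)]  # toy data, not real weights
dequantized = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"max abs error: {max_error:.4f}")
```

Because each group's scale depends on the largest magnitude it contains, the rounding error depends on the value range that is observed. This gives some intuition for why activation scales have to be estimated from representative calibration data.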

from compressed_tensors.offload import load_offloaded_model
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

SAVE_DIR = "Kimi-K2.6-NVFP4"

# Quantize bfloat16 checkpoint to NVFP4, limiting CPU RAM usage to 500GB
with load_offloaded_model():
    model = AutoModelForCausalLM.from_pretrained(
        DEQUANTIZED_SAVE_DIR,
        dtype="auto",
        device_map="auto_offload",
        max_memory={"cpu": 500e9},
        trust_remote_code=True,
        offload_folder="./offload_folder",
    )
    tokenizer = AutoTokenizer.from_pretrained(
        DEQUANTIZED_SAVE_DIR, trust_remote_code=True
    )
    processor = AutoProcessor.from_pretrained(
        DEQUANTIZED_SAVE_DIR, trust_remote_code=True
    )

# Select calibration dataset.
DATASET_ID = "ultrachat-200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 20 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 2048

# Configure the quantization algorithm to run.
#   * quantize Linear layers with the NVFP4 scheme
#   * skip the modules matched by `ignore` (defined in step 1)
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=ignore,
)

# Apply algorithms.
oneshot(
    model=model,
    processor=tokenizer,
    dataset=DATASET_ID,
    splits={"calibration": f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]"},
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save to disk compressed.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

The dequantized model can be deleted once step 2 completes.
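
Since the intermediate checkpoint is large, it can be worth guarding its removal. The following is a minimal sketch, assuming the directory names used above and treating the presence of config.json in the output directory as a sign that step 2 finished saving. The helper name `cleanup_intermediate` and that check are illustrative choices, not part of either library.

```python
import shutil
from pathlib import Path


def cleanup_intermediate(intermediate_dir: str, final_dir: str) -> bool:
    """Delete the intermediate checkpoint only if the final one looks saved.

    Returns True when the intermediate directory was removed.
    """
    final_config = Path(final_dir) / "config.json"
    intermediate = Path(intermediate_dir)
    if not final_config.is_file():
        # Step 2 has not produced a checkpoint yet; keep the bfloat16 copy.
        return False
    if intermediate.is_dir():
        shutil.rmtree(intermediate)
        return True
    return False


# Example call, using the directory names from the walkthrough:
# cleanup_intermediate("Kimi-K2.6-bf16", "Kimi-K2.6-NVFP4")
```

Checking for a single file is a weak signal of completeness; a stricter check, such as loading the saved checkpoint once, may be preferable before deleting anything.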