BitBLAS
vLLM now supports BitBLAS for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS supports a wider range of precision combinations.
Note
Ensure your hardware supports the selected dtype (torch.bfloat16 or torch.float16). Most recent NVIDIA GPUs support float16, while bfloat16 is more common on newer architectures like Ampere or Hopper. For details, see supported hardware.
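If you are unsure whether your GPU supports bfloat16, you can check with PyTorch before choosing the dtype. This is a minimal sketch; the float16 fallback is an assumption for illustration, not part of this page's workflow:

import torch

# Prefer bfloat16 on GPUs that support it (e.g. Ampere, Hopper);
# otherwise fall back to float16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print(f"Selected dtype: {dtype}")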
Below are the steps to utilize BitBLAS with vLLM.
vLLM reads the model's config file and supports pre-quantized checkpoints.
You can find pre-quantized models on Hugging Face.
Usually, these repositories have a quantize_config.json file that includes a quantization_config section.
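If you want to confirm what a checkpoint declares before loading it, you can download and print that file. The sketch below assumes the repository actually ships a quantize_config.json; some checkpoints instead embed the quantization_config section in config.json:

import json
from huggingface_hub import hf_hub_download

# Fetch the quantization settings of the checkpoint used later on this page.
path = hf_hub_download("hxbgsyxh/llama-13b-4bit-g-1-bitblas", "quantize_config.json")
with open(path) as f:
    print(json.dumps(json.load(f), indent=2))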
Read bitblas format checkpoint
from vllm import LLM
import torch

# "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint.
model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitblas",
)
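Once loaded, the model is used like any other vLLM model. A short example follows; the prompt and sampling parameters are arbitrary:

from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["What is 4-bit quantization?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)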