TurboQuant KV Cache Compression¶

vllm-metal supports TurboQuant-based KV cache compression: a Walsh–Hadamard rotation followed by per-block quantization that shrinks the KV cache by 2.5x–5x with minimal quality loss. Quantize/dequantize runs natively on Apple Silicon via MLX + a Metal kernel.

Quick Start¶

VLLM_METAL_USE_PAGED_ATTENTION=1 vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --additional-config '{"turboquant": true, "k_quant": "q8_0", "v_quant": "q3_0"}'

TurboQuant is controlled via vLLM's --additional-config JSON, not a separate environment variable. Paged attention (VLLM_METAL_USE_PAGED_ATTENTION=1) is required.

Configuration¶

Key	Default	Description
`turboquant`	`false`	Enable TurboQuant KV cache compression
`k_quant`	`"q8_0"`	Key quantization type (see table below)
`v_quant`	`"q3_0"`	Value quantization type (Lloyd-Max)

Supported Key Quant Types¶

K uses per-block affine quantization with a Walsh–Hadamard rotation.

`k_quant`	Bits	Notes
`q8_0`, `int8`, `uint8`	8	Near-lossless
`q5_0`	5	Good quality / size trade-off
`q4_0`, `int4`, `uint4`	4	Matches TurboQuant paper config
`int2`, `uint2`	2	Aggressive; noticeable quality loss

Supported Value Quant Types¶

V uses Lloyd-Max (non-uniform) quantization with a Walsh–Hadamard rotation. Values are mapped to precomputed centroids per bitwidth.

`v_quant`	Bits
`q2_0`	2
`q3_0`	3
`q4_0`	4
`q5_0`	5
`q8_0`	8

Compression¶

Measured on a Qwen3-0.6B-shaped KV cache (28 layers, 4 KV heads, head_dim=128, block_size=16) vs fp16:

Config	Compression	K mse	V mse
`k_quant=q8_0`, `v_quant=q3_0` (default)	2.56x	0.00002	0.03241
`k_quant=q5_0`, `v_quant=q3_0`	3.37x	0.00154	0.03241
`k_quant=q4_0`, `v_quant=q3_0`	3.76x	0.00658	0.03241
`k_quant=uint2`, `v_quant=q3_0`	4.92x	0.16639	0.03241

At max_model_len=32768 on Llama-3.2-1B, the default q8_0/q3_0 configuration frees roughly 2.5x more context for the same KV memory budget.

Requirements and Caveats¶

Paged attention is required (VLLM_METAL_USE_PAGED_ATTENTION=1). TurboQuant cannot run on the MLX KV cache path.
MHA and hybrid (SDPA + GDN linear attention) models are supported. In hybrid models, only the SDPA layers are compressed; GDN recurrent state stays at fp16 (it has no paged KV cache to quantize).
MLA models are not supported. Enabling turboquant on an MLA model raises NotImplementedError at startup rather than silently falling back.
Head dim must be 64, 128, or 256 — sizes supported by the FWHT Metal kernel. Models outside this set are not supported yet.
Quality is model-dependent. For production use, spot-check perplexity with your target config before rolling out aggressive settings (int2, q2_0).

Known Quality Floors¶

Not every supported (k_quant, v_quant) combination is production-grade. K precision is load-bearing in a way V precision is not — K participates in the softmax, so quantization error gets exponentially amplified by exp(QK^T), while V errors get linearly averaged across the context and largely wash out. This asymmetry means aggressive K quantization fails long before aggressive V quantization does.

Observed behavior:

Config	Compression	Qualitative quality	Recommended use
`q8_0` / `q3_0`	2.56x	Matches bf16 within noise	Default
`q8_0` / `q2_0`	~2.9x	Usable; minor fluency dip	Tight-memory serving
`q4_0` / `q3_0`	3.76x	Paper-validated config; mild quality regression	Memory-bound workloads
`int2` / `q3_0`	~4.1x	Degraded: semi-coherent, topic-drift, numeric artefacts ("2018 2018")	Capacity benchmarks only
`int2` / `q2_0`	~4.8x	Broken: degenerate repetition loops ("concept concept concept…")	Not for serving

Examples¶

Normal Compression¶

For production use with minimal quality impact:

VLLM_METAL_USE_PAGED_ATTENTION=1 vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --additional-config '{"turboquant": true, "k_quant": "q8_0", "v_quant": "q3_0"}'

Aggressive Compression¶

For memory-bound workloads where some quality loss is acceptable:

VLLM_METAL_USE_PAGED_ATTENTION=1 vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --additional-config '{"turboquant": true, "k_quant": "q4_0", "v_quant": "q3_0"}'