KV Cache Offloading#
Eenable KV cache offloading using LMCache in a vLLM deployment. KV cache offloading moves large KV caches from GPU memory to CPU or disk, enabling more potential KV cache hits. vLLM Production Stack uses LMCache for KV cache offloading. For more details, see the LMCache GitHub repository.
KV Cache Offloading Configuration#
Use the following yaml
servingEngineSpec:
modelSpec:
- name: "mistral"
repository: "lmcache/vllm-openai"
tag: "latest"
modelURL: "mistralai/Mistral-7B-Instruct-v0.2"
replicaCount: 1
requestCPU: 10
requestMemory: "40Gi"
requestGPU: 1
pvcStorage: "50Gi"
vllmConfig:
enableChunkedPrefill: false
enablePrefixCaching: false
maxModelLen: 16384
lmcacheConfig:
enabled: true
cpuOffloadingBufferSize: "20"
hf_token: <YOUR HF TOKEN>
Note
Note: Replace <YOUR HF TOKEN> with your actual Hugging Face token.
Note
The lmcacheConfig field enables LMCache and sets the CPU offloading buffer size to 20GB. You can adjust this value based on your workload.
Deploy the Stack using Helm as shown in the Minimal Example section.