KV Cache Aware Routing

In this tutorial, you’ll learn how to enable and use KV cache aware routing in the vLLM Production Stack. With KV cache aware routing, each incoming request is routed to the instance with the highest KV cache hit rate for that request, which helps maximize cache efficiency and improve overall performance. Prefix aware routing, by contrast, always sends requests with the same prefix to the same instance, even if the cached entries have since been evicted. KV cache aware routing instead prioritizes actual cache hits to optimize resource usage.
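
To make the contrast concrete, here is a minimal Python sketch. It is not the Production Stack's router: it assumes a toy model in which each instance's cache is a set of cached prompt strings and the hit rate is approximated by the longest shared character prefix, and it uses a hash of the leading characters as a stand-in for prefix aware routing. All names (`kv_aware_route`, `prefix_aware_route`, `caches`) are hypothetical.

```python
import hashlib


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def kv_aware_route(prompt: str, caches: dict[str, set[str]]) -> str:
    """Pick the instance whose current cache covers the most of the prompt."""
    def best_hit(instance: str) -> int:
        return max((shared_prefix_len(prompt, c) for c in caches[instance]), default=0)

    # sorted() makes ties deterministic: the first name wins on equal hits.
    return max(sorted(caches), key=best_hit)


def prefix_aware_route(prompt: str, instances: list[str], prefix_chars: int = 16) -> str:
    """Pick an instance from a hash of the prompt prefix, ignoring cache state."""
    digest = hashlib.sha256(prompt[:prefix_chars].encode()).digest()
    return instances[digest[0] % len(instances)]


first = "What is the capital of France?"
follow_up = first + " And what is its population?"

# instance-a served the first prompt, so it holds the shared prefix.
caches = {"instance-a": {first}, "instance-b": set()}
print(kv_aware_route(follow_up, caches))

# Simulate eviction on instance-a while instance-b now holds the prefix:
# the KV cache aware choice follows the cache, the prefix-hash choice does not.
caches["instance-a"].clear()
caches["instance-b"].add(first)
print(kv_aware_route(follow_up, caches))
print(prefix_aware_route(follow_up, sorted(caches)))
```

In this toy model the first call selects `instance-a` and the second selects `instance-b`, while `prefix_aware_route` returns the same instance both before and after the eviction because it never consults the cache contents.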

Table of Contents

Prerequisites
Step 1: Deploy with KV Cache Aware Routing
Step 2: Port Forwarding
Step 3: Testing KV Cache Aware Routing
Step 4: Clean Up

Prerequisites

- Completion of the following tutorials:
  - Prerequisite
  - Quick Start
- A Kubernetes environment with GPU support
- Basic familiarity with Kubernetes and Helm

Step 1: Deploy with KV Cache Aware Routing

We’ll use the predefined configuration file values-17-kv-aware.yaml, which sets up two vLLM instances with KV cache aware routing enabled.

Deploy the Helm chart with the configuration:
```
helm install vllm helm/ -f tutorials/assets/values-17-kv-aware.yaml
```
Note that to add more instances, you must give each one a different instanceId in its lmcacheConfig.
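
If you edit the values file to add instances, a quick check for repeated IDs can catch mistakes before installing. The sketch below is illustrative only: it assumes the values file has already been parsed into a Python dict (for example with a YAML library) and that instances are listed under `servingEngineSpec.modelSpec`, each with a `name` and an `lmcacheConfig.instanceId`. That layout and the sample values are assumptions, so adjust the keys to match your file.

```python
from collections import Counter


def duplicate_instance_ids(values: dict) -> list[str]:
    """Return instanceId values that appear on more than one model spec."""
    # Assumed layout: servingEngineSpec.modelSpec is a list of instance specs.
    specs = values.get("servingEngineSpec", {}).get("modelSpec", [])
    ids = [
        spec["lmcacheConfig"]["instanceId"]
        for spec in specs
        if "instanceId" in spec.get("lmcacheConfig", {})
    ]
    return sorted(i for i, count in Counter(ids).items() if count > 1)


# Hypothetical parsed values with a clash: both specs reuse the same ID.
values = {
    "servingEngineSpec": {
        "modelSpec": [
            {"name": "llama-1", "lmcacheConfig": {"instanceId": "default1"}},
            {"name": "llama-2", "lmcacheConfig": {"instanceId": "default1"}},
        ]
    }
}
print(duplicate_instance_ids(values))
```

An empty list means every instance has its own ID; any value in the list is shared by at least two instances and should be changed.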

Wait for the deployment to complete:
```
kubectl get pods -w
```

Step 2: Port Forwarding

Forward the router service port to your local machine:
```
kubectl port-forward svc/vllm-router-service 30080:80
```
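
Because `kubectl port-forward` keeps running in the foreground, the next step needs a second terminal. Before sending requests from there, you may want to confirm that the forwarded port is accepting connections. The helper below is a small sketch using only the Python standard library; `port_is_open` is a hypothetical name, and it only checks that a TCP connection succeeds, not that the router is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 30080 is the local port used in the port-forward command above.
    print("router reachable:", port_is_open("localhost", 30080))
```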

Step 3: Testing KV Cache Aware Routing

First, send a request to the router:
```
curl http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 100
  }'
```

Then, send another request with the same prompt prefix:

```
curl http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "What is the capital of France? And what is its population?",
    "max_tokens": 100
  }'
```

You should observe that the second request is routed to the same instance as the first. The second prompt begins with the first prompt, so the instance that served the first request already holds KV cache entries for that shared prefix. The KV cache aware router detects that this instance offers the highest KV cache hit rate for the new request and routes it there to maximize KV cache utilization.
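
If you prefer to script the two requests, the following sketch sends the same payloads as the curl commands using only the Python standard library and prints how long each call took. The helper names (`build_request`, `timed_completion`) are hypothetical. Treat the timings as a rough, indirect signal at best: with prompts this short, any benefit from a cache hit may be lost in normal latency noise, and timing alone does not show which instance served a request.

```python
import json
import time
import urllib.request

ROUTER_URL = "http://localhost:30080/v1/completions"  # port-forwarded router from Step 2
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_request(prompt: str, max_tokens: int = 100) -> urllib.request.Request:
    """Build the same POST request that the curl commands send."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        ROUTER_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_completion(prompt: str) -> tuple[dict, float]:
    """Send one completion request; return the parsed JSON reply and elapsed seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        reply = json.load(resp)
    return reply, time.perf_counter() - start


if __name__ == "__main__":
    base = "What is the capital of France?"
    try:
        for prompt in (base, base + " And what is its population?"):
            _reply, seconds = timed_completion(prompt)
            print(f"{seconds:.2f}s  {prompt!r}")
    except OSError as exc:
        # Raised when the router is unreachable, e.g. port-forward is not running.
        print("request failed:", exc)
```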

Step 4: Clean Up

To clean up the deployment:
```
helm uninstall vllm
```

Conclusion

In this tutorial, we’ve demonstrated how to:

- Deploy the vLLM Production Stack with KV cache aware routing
- Set up port forwarding to access the router
- Test the KV cache aware routing functionality

KV cache aware routing improves performance by sending each request to the instance with the highest KV cache hit rate, which maximizes KV cache utilization.