MooncakeStoreConnector Usage Guide¶
MooncakeStoreConnector is a KV cache connector that uses MooncakeDistributedStore as a shared KV cache pool. Unlike MooncakeConnector which does direct point-to-point KV transfer between prefiller and decoder, MooncakeStoreConnector enables KV cache offloading to an external distributed store, supporting:
- CPU offloading: Extend effective KV cache capacity by offloading to CPU memory via Mooncake's transfer engine.
- Prefix caching across instances: Hash-based deduplication allows multiple vLLM instances to share cached KV blocks through the store.
- Single-node and multi-node deployment: Works both as a standalone KV cache extension and in disaggregated prefill-decode setups.
Prerequisites¶
Install Mooncake¶
Install mooncake through pip:
Refer to the Mooncake official repository for more installation instructions and building from source.
Start the Mooncake Master Server¶
The Mooncake master manages metadata and coordinates the distributed store. Start it before launching vLLM:
Default ports:
- RPC: 50051
Multiple vLLM instances can share the same master server.
Configure Mooncake¶
Create a JSON configuration file (e.g., mooncake_config.json):
{
"metadata_server": "P2PHANDSHAKE",
"master_server_address": "127.0.0.1:50051",
"global_segment_size": "80GB",
"local_buffer_size": "4GB",
"protocol": "rdma",
"device_name": ""
}
protocol: Use"rdma"for best performance."tcp"works as a fallback.global_segment_size: CPU memory contributed to the distributed pool (per GPU).local_buffer_size: Private buffer for this node's own operations (per GPU).
Set the config path via environment variable:
Usage¶
Single-Node KV Cache Offloading¶
Use MooncakeStoreConnector to offload KV cache to CPU memory, extending the effective cache size:
MOONCAKE_CONFIG_PATH=mooncake_config.json \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'
Disaggregated Prefill-Decode (XpYd)¶
In disaggregated prefill-decode mode, use MultiConnector to combine MooncakeConnector (point-to-point KV transfer) with MooncakeStoreConnector (shared KV cache pool). This enables both direct P2P transfer between prefiller and decoder, and cross-instance prefix cache sharing via the distributed store. Prefiller Node:
MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50052 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8100 \
--kv-transfer-config '{
"kv_connector": "MultiConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnector",
"kv_role": "kv_producer"
},
{
"kv_connector": "MooncakeStoreConnector",
"kv_role": "kv_producer"
}
]
}
}'
Decoder Node:
MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50053 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8200 \
--kv-transfer-config '{
"kv_connector": "MultiConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnector",
"kv_role": "kv_consumer"
},
{
"kv_connector": "MooncakeStoreConnector",
"kv_role": "kv_consumer"
}
]
}
}'
Proxy:
A disaggregation proxy is required to route requests between prefiller and decoder nodes. The proxy assigns do_remote_prefill=True / do_remote_decode=True to coordinate P2P transfer via MooncakeConnector. Refer to the MooncakeConnector usage guide for proxy setup details.
Environment Variables¶
| Variable | Description | Default |
|---|---|---|
MOONCAKE_CONFIG_PATH | Path to Mooncake JSON config file | (required) |
VLLM_MOONCAKE_BOOTSTRAP_PORT | Bootstrap port for MooncakeConnector P2P transfer (disagg mode only) | 8998 |
KV Transfer Config¶
KV Role Options¶
- kv_producer: For prefiller instances that store KV caches to the pool.
- kv_consumer: For decoder instances that load KV caches from the pool.
- kv_both: The instance both stores and loads KV caches. Use this for single-node CPU offloading.
kv_connector_extra_config¶
load_async(bool): Enable asynchronous loading for better compute-I/O overlap. Default:true.enable_cross_layers_blocks(bool): Enable cross-layer block packing for reduced store operations. Default:false.discard_partial_chunks(bool): Discard partial block chunks during store. Default:true.lookup_rpc_port(int): Custom port for the ZMQ lookup RPC socket. Default:0.
Notes¶
Cross-DP Prefix Cache Hits¶
When running with data parallelism, set a fixed PYTHONHASHSEED so that block hashes are consistent across DP ranks:
Without this, identical prompts may produce different block hashes on different DP ranks, preventing cross-instance prefix cache hits.