Welcome to production-stack!

K8S-native cluster-wide deployment for vLLM.
The vLLM Production Stack project provides a reference implementation of how to build an inference stack on top of vLLM, which allows you to:
🚀 Scale from a single vLLM instance to a distributed vLLM deployment without changing any application code
💻 Monitor the stack through a web dashboard
😄 Enjoy the performance benefits brought by request routing and KV cache offloading
📈 Easily deploy the stack on AWS, GCP, or any other cloud provider
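As a rough sketch of what deployment looks like, the stack is typically installed onto an existing Kubernetes cluster via Helm. The commands below are a minimal example; the repository URL, chart name, and values file path are assumptions based on the project's Helm chart and may differ from your setup:

```shell
# Add the production-stack Helm repository (assumed URL).
helm repo add vllm https://vllm-project.github.io/production-stack

# Install the stack with a custom values file describing your models
# and GPU resources (values.yaml is a placeholder for your own config).
helm install vllm vllm/vllm-stack -f values.yaml

# Verify that the router and serving engine pods come up.
kubectl get pods
```

Because the stack is K8s-native, the same Helm release works on any managed Kubernetes service (EKS, GKE, etc.) or a self-hosted cluster.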
Documentation
Getting Started
Developer Guide
Benchmarks