Welcome to production-stack!

K8S-native cluster-wide deployment for vLLM.

The vLLM Production Stack project provides a reference implementation of how to build an inference stack on top of vLLM, allowing you to:

  • 🚀 Scale from a single vLLM instance to a distributed vLLM deployment without changing any application code (see the sketch after this list)

  • 💻 Monitor the stack through a web dashboard

  • 😄 Enjoy the performance benefits brought by request routing and KV cache offloading

  • 📈 Easily deploy the stack on AWS, GCP, or any other cloud provider
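The scaling claim in the first bullet holds because the stack's request router speaks the same OpenAI-compatible API as a standalone vLLM server, so an application migrates by pointing its base URL at the router service. The sketch below assumes a hypothetical endpoint and model name; substitute the service address and model that your deployment actually exposes.

```python
import requests

# Hypothetical values: replace with the router service address and the
# model name configured in your own deployment.
ROUTER_URL = "http://localhost:30080/v1/completions"
MODEL_NAME = "facebook/opt-125m"

# The router forwards this OpenAI-style completion request to one of the
# vLLM instances it manages; the application code is identical to what it
# would send to a single vLLM server.
response = requests.post(
    ROUTER_URL,
    json={
        "model": MODEL_NAME,
        "prompt": "Explain what a KV cache is in one sentence.",
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```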

Documentation

Developer Guide

Benchmarks