Welcome to production-stack!

K8S-native cluster-wide deployment for vLLM.
The vLLM Production Stack project provides a reference implementation of how to build an inference stack on top of vLLM, which allows you to:
🚀 Scale from a single vLLM instance to a distributed vLLM deployment without changing any application code
💻 Monitor the stack through a web dashboard
😄 Enjoy the performance benefits brought by request routing and KV cache offloading
📈 Easily deploy the stack on AWS, GCP, or any other cloud provider
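As a rough sketch of what deployment looks like, the stack is typically installed onto an existing Kubernetes cluster via Helm. The commands below are a minimal example; the repository URL, chart name, and values file path are assumptions based on the project's Helm chart and may differ from your setup:

```shell
# Add the production-stack Helm repository (assumed URL).
helm repo add vllm https://vllm-project.github.io/production-stack

# Install the stack with a custom values file describing your models
# and GPU resources (values.yaml is a placeholder for your own config).
helm install vllm vllm/vllm-stack -f values.yaml

# Verify that the router and serving engine pods come up.
kubectl get pods
```

Because the stack is K8s-native, the same Helm release works on any managed Kubernetes service (EKS, GKE, etc.) or a self-hosted cluster.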
Documentation
Getting Started
Developer Guide
Benchmarks