Production stack#

Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the vLLM production stack. Born out of a Berkeley-UChicago collaboration, vLLM production stack is an officially released, production-optimized codebase under the vLLM project, designed for LLM deployment with:

  • Upstream vLLM compatibility – It wraps around upstream vLLM without modifying its code.

  • Ease of use – Simplified deployment via Helm charts and observability through Grafana dashboards.

  • High performance – Optimized for LLM workloads with features like multi-model support, model-aware and prefix-aware routing, fast vLLM bootstrapping, and KV cache offloading with LMCache, among others.

If you are new to Kubernetes, don’t worry: in the vLLM production stack repo, we provide a step-by-step guide and a short video to set up everything and get started in 4 minutes!

Pre-requisite#

Ensure that you have a running Kubernetes environment with GPUs (you can follow this tutorial to install a Kubernetes environment on a bare-metal GPU machine).
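If your cluster is already up, a quick sanity check is to confirm that the GPUs are advertised as allocatable resources (this assumes the NVIDIA device plugin is installed):

sudo kubectl describe nodes | grep nvidia.com/gpu

If no nvidia.com/gpu entries appear, pods that request a GPU cannot be scheduled and will remain in the Pending state.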

Deployment using vLLM production stack#

The standard vLLM production stack install uses a Helm chart. You can run this bash script to install Helm on your GPU server.
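Alternatively, Helm can be installed with the official installer script from the Helm project:

curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version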

To install the vLLM production stack, run the following commands on your desktop:

sudo helm repo add vllm https://vllm-project.github.io/production-stack
sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml

This will instantiate a vLLM-production-stack-based deployment named vllm that runs a small LLM (the facebook/opt-125m model).
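Before checking the pods, you can confirm that the Helm release itself was created:

sudo helm list

The output should list a release named vllm with a deployed status.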

Validate Installation#

Monitor the deployment status using:

sudo kubectl get pods

You will see the pods for the vllm deployment transition to the Running state.

NAME                                           READY   STATUS    RESTARTS   AGE
vllm-deployment-router-859d8fb668-2x2b7        1/1     Running   0          2m38s
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs   1/1     Running   0          2m38s

NOTE: It may take some time for the containers to download the Docker images and LLM weights.
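If a pod stays in Pending or ContainerCreating for longer than expected, the standard Kubernetes debugging commands help; for example (substitute the pod name reported by kubectl get pods):

sudo kubectl describe pod vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs
sudo kubectl logs -f vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs

describe shows scheduling and image-pull events, while logs streams the serving engine output, including model download progress.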

Send a Query to the Stack#

Forward the vllm-router-service port to the host machine:

sudo kubectl port-forward svc/vllm-router-service 30080:80
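If the port-forward command fails, you can first confirm that the router service exists (and check its name and port) with:

sudo kubectl get svc

Keep the port-forward process running in this terminal, or append & to run it in the background, and send the queries below from another shell.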

You can then send a query to the OpenAI-compatible API to list the available models:

curl -o- http://localhost:30080/models

Expected output:

{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1737428424,
      "owned_by": "vllm",
      "root": null
    }
  ]
}

To send an actual text-completion request, you can issue a curl request to the OpenAI-compatible /completions endpoint:

curl -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'

Expected output:

{
  "id": "completion-id",
  "object": "text_completion",
  "created": 1737428424,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "text": " there was a brave knight who...",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}

Uninstall#

To remove the deployment, run:

sudo helm uninstall vllm
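Depending on the chart settings, the persistent volume claim holding the model weights may not be removed together with the release. You can check for and clean up any leftover claims with (here <pvc-name> is a placeholder for the name reported by the first command):

sudo kubectl get pvc
sudo kubectl delete pvc <pvc-name>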

(Advanced) Configuring vLLM production stack#

The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"

    replicaCount: 1

    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1

    pvcStorage: "10Gi"

In this YAML configuration:

  • modelSpec includes:

    • name: The name used to identify this model deployment.

    • repository: The Docker repository hosting the vLLM image.

    • tag: Docker image tag.

    • modelURL: The LLM to serve (here, the Hugging Face model identifier facebook/opt-125m).

  • replicaCount: Number of replicas.

  • requestCPU and requestMemory: Specify the CPU and memory resource requests for the pod.

  • requestGPU: Specifies the number of GPUs required.

  • pvcStorage: Allocates persistent storage for the model.
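If you edit the values file, the change can be applied to a running deployment with helm upgrade; here custom-values.yaml is a placeholder for your edited file:

sudo helm upgrade vllm vllm/vllm-stack -f custom-values.yaml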

NOTE: If you intend to set up two pods, please refer to this YAML file.
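As a rough sketch (not the referenced file itself), a second model is added as another entry in the modelSpec list; the facebook/opt-350m entry below is only an example:

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
    pvcStorage: "10Gi"
  - name: "opt350m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-350m"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
    pvcStorage: "10Gi"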

NOTE: vLLM production stack offers many more features (e.g. CPU offloading and a wide range of routing algorithms). Please check out these examples and tutorials and our repo for more details!