Minimal Example

Introduction

This is a minimal working example of the vLLM Production Stack: a single vLLM instance serving the facebook/opt-125m model. The goal is a working vLLM deployment in a Kubernetes environment with a GPU.

Prerequisites

Steps to follow

1. Deploy vLLM Instance

1.1 Use Existing Configuration

The vLLM Production Stack repository provides a predefined configuration file, values-01-minimal-example.yaml, under tutorials/assets/. It contains:

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"

    replicaCount: 1

    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
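
To try variations of this configuration (for example, a different replica count), the same structure can be generated programmatically. The Python sketch below is illustrative only: the helper names minimal_values and write_values are hypothetical, and the keys simply mirror the file shown above. It writes JSON rather than YAML, on the assumption that Helm accepts it because JSON documents are valid YAML; verify this with your Helm version before relying on it.

```python
import json


def minimal_values(name: str, model_url: str, replicas: int = 1, gpus: int = 1) -> dict:
    """Build a mapping with the same keys as values-01-minimal-example.yaml."""
    return {
        "servingEngineSpec": {
            "runtimeClassName": "",
            "modelSpec": [
                {
                    "name": name,
                    "repository": "vllm/vllm-openai",
                    "tag": "latest",
                    "modelURL": model_url,
                    "replicaCount": replicas,
                    "requestCPU": 6,
                    "requestMemory": "16Gi",
                    "requestGPU": gpus,
                }
            ],
        }
    }


def write_values(path: str, values: dict) -> None:
    """Write the mapping as JSON (assumed to be readable by Helm as YAML)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(values, fh, indent=2)
        fh.write("\n")
```

For instance, write_values("my-values.json", minimal_values("opt125m", "facebook/opt-125m", replicas=2)) would produce a file (the name my-values.json is arbitrary) that could be passed to helm install with -f in place of the predefined one.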

1.2 Deploy the Stack

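Add the Helm repository, then install the chart with the predefined configuration file. The relative path in the second command assumes it is run from the root of the repository: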

helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml

2. Validate Installation

2.1 Monitor Deployment Status

Monitor the deployment status using:

kubectl get pods

Expected output (pod name suffixes and ages will differ):

NAME                                           READY   STATUS    RESTARTS   AGE
vllm-deployment-router-859d8fb668-2x2b7        1/1     Running   0          2m38s
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs   1/1     Running   0          2m38s

Note

It may take some time for the containers to download the Docker images and LLM weights.
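
Rather than re-running the command by hand, the readiness check can be scripted. The following Python sketch is not part of the Production Stack: the helper names pods_ready and wait_until_ready are hypothetical, and it assumes kubectl is on the PATH and already pointed at the right cluster and namespace. It treats the stack as ready once every row reports STATUS Running and a fully satisfied READY count such as 1/1.

```python
import subprocess
import time


def pods_ready(table: str) -> bool:
    """Return True if every pod row of `kubectl get pods` output is Running and fully ready."""
    # Skip the header line and ignore blank lines.
    rows = [line.split() for line in table.strip().splitlines()[1:] if line.strip()]
    if not rows:
        return False  # no pods listed yet
    for cols in rows:
        if len(cols) < 3:
            return False  # unexpected row layout
        have, _, want = cols[1].partition("/")  # READY column, e.g. "1/1"
        if cols[2] != "Running" or have != want:
            return False
    return True


def wait_until_ready(timeout_s: float = 600.0, poll_s: float = 10.0) -> bool:
    """Poll `kubectl get pods` until all pods are ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["kubectl", "get", "pods"], capture_output=True, text=True, check=True
        ).stdout
        if pods_ready(out):
            return True
        time.sleep(poll_s)
    return False
```

Calling wait_until_ready() blocks until both the router and the vLLM pod are ready, or returns False after the timeout. The default timeout and polling interval are arbitrary choices, not recommendations from the project.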

3. Send a Query to the Stack

3.1 Forward the Service Port

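Expose the vllm-router-service port to the host machine. The command keeps running in the foreground, so run the queries in the following steps from a second terminal: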

kubectl port-forward svc/vllm-router-service 30080:80
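
The tunnel can take a moment to open after the command starts. A small helper such as the one sketched below can confirm that port 30080 accepts connections before any query is sent. It uses only plain Python sockets, not a vLLM or Kubernetes API, and the names port_open and wait_for_port are hypothetical.

```python
import socket
import time


def port_open(host: str = "127.0.0.1", port: int = 30080, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_for_port(host: str = "127.0.0.1", port: int = 30080, attempts: int = 30) -> bool:
    """Retry once per second until the forwarded port answers or attempts run out."""
    for _ in range(attempts):
        if port_open(host, port):
            return True
        time.sleep(1.0)
    return False
```

A successful connection only shows that the forward is listening locally; it does not prove that the router behind it is healthy.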

3.2 Query the OpenAI-Compatible API to List the Available Models

Test the stack’s OpenAI-compatible API by querying the available models:

curl -o- http://localhost:30080/v1/models

Expected output:

{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1737428424,
      "owned_by": "vllm",
      "root": null
    }
  ]
}
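
The same check can be made from a program. The sketch below, which uses only the Python standard library, fetches /v1/models and extracts the model ids from a response shaped like the one above. The function names are hypothetical, and the base URL assumes the port-forward from step 3.1 is still running.

```python
import json
import urllib.request


def model_ids(listing: dict) -> list[str]:
    """Extract the "id" of every entry in a /v1/models response body."""
    return [entry["id"] for entry in listing.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:30080") -> list[str]:
    """GET {base_url}/v1/models and return the served model ids."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return model_ids(json.load(resp))
```

With the minimal example deployed, fetch_model_ids() should include "facebook/opt-125m", which is the value to pass as "model" in later requests.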

3.3 Query the OpenAI Completion Endpoint

Send a query to the OpenAI-compatible /v1/completions endpoint to generate a completion for a prompt:

curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'

Example output (the id, timestamp, and generated text will vary):

{
  "id": "completion-id",
  "object": "text_completion",
  "created": 1737428424,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "text": " there was a brave knight who...",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
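
For callers that prefer a script to curl, the request can be reproduced with the Python standard library. This is a minimal sketch under the same assumptions as the curl command (port-forward active on localhost:30080, model facebook/opt-125m). The helper names are hypothetical and no vLLM client library is involved.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    model: str = "facebook/opt-125m",
    max_tokens: int = 10,
    base_url: str = "http://localhost:30080",
) -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_choice_text(response: dict) -> str:
    """Return the generated text of the first choice in a completion response."""
    return response["choices"][0]["text"]


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs the port-forward running)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return first_choice_text(json.load(resp))
```

Calling complete("Once upon a time,") returns only the generated continuation. Splitting request construction from sending keeps the payload easy to inspect without a running cluster.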

4. Uninstall

To remove the deployment, run:

helm uninstall vllm