CRD Deployment

Deploy a minimal vLLM production stack on Kubernetes using Custom Resource Definitions (CRDs) and Custom Resources (CRs).

Note

This deployment method is recommended for production environments as it provides better resource management, monitoring, and lifecycle management through Kubernetes operators.

Prerequisites

kubectl version v1.11.3+
Access to a Kubernetes v1.11.3+ cluster
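
The version requirement above can be checked from a script. The following is a minimal sketch, assuming a `sort` that supports version ordering (`-V`, as in GNU coreutils); `version_at_least` is a hypothetical helper name and not part of kubectl.

```shell
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_at_least() {
  have="${1#v}"   # strip a leading "v", e.g. v1.28.2 -> 1.28.2
  want="${2#v}"
  # The smaller of the two versions sorts first; if that is $want,
  # then $have is at least $want.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Example: compare a client version string against the documented minimum.
if version_at_least "v1.28.2" "v1.11.3"; then
  echo "kubectl version is new enough"
fi
```

The installed client version can be read with `kubectl version --client` and passed as the first argument.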

Installation

Clone the repository

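First, clone the vLLM production stack repository and change into it. The commands in the rest of this guide use paths relative to the repository root:
```
git clone https://github.com/vllm-project/production-stack.git
cd production-stack
```

Deploy the Operator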

Deploy the production stack operator by running:
```
kubectl create -f operator/config/default.yaml
```
This command does the following:
- Namespace Creation: Creates a namespace called production-stack-system where the operator will run
- Custom Resource Definitions (CRDs): Defines 4 new custom resources that can be managed by this operator:
  - CacheServer: For managing cache servers
  - LoraAdapter: For managing LoRA adapters (used for model fine-tuning)
  - VLLMRouter: For managing vLLM routing
  - VLLMRuntime: For managing vLLM runtime instances
- RBAC (Role-Based Access Control): Creates various roles and role bindings to control access to these resources with permissions for:
  - Admin roles (full access)
  - Editor roles (create/update/delete)
  - Viewer roles (read-only)
  - Metrics and leader election
- Service Account: Creates a service account production-stack-controller-manager for the operator
- Deployment: Deploys the operator controller manager as a deployment with health checks, resource limits, and security settings using the image lmcache/production-stack-operator:latest
- Service: Creates a metrics service for monitoring the operator
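
To confirm that the four custom resource types listed above were registered, one option is to count the resource types in the operator's API group. This is a sketch that assumes the group name `production-stack.vllm.ai` (taken from the `apiVersion` used in the sample manifests); `count_stack_resources` is a hypothetical helper.

```shell
# Hypothetical helper: prints how many resource types the API server
# exposes under the given API group (default: production-stack.vllm.ai,
# an assumption based on the sample manifests).
count_stack_resources() {
  group="${1:-production-stack.vllm.ai}"
  # `-o name` prints one resource name per line; count non-empty lines.
  # `|| true` keeps the exit status 0 when the count is zero.
  kubectl api-resources --api-group="$group" -o name | grep -c . || true
}

# Usage (requires cluster access):
#   [ "$(count_stack_resources)" -eq 4 ] && echo "all four CRDs registered"
```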

Verify the Operator Deployment

Check the status of the operator deployment:

```
kubectl get pods -n production-stack-system
kubectl get deployment -n production-stack-system
```

You should see output similar to:

```
NAME                                                              READY   STATUS    RESTARTS   AGE
production-stack-production-stack-controller-manager-65b86brxm6   1/1     Running   0          21s

NAME                                                   READY   UP-TO-DATE   AVAILABLE   AGE
production-stack-production-stack-controller-manager   1/1     1            1           25s
```
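
Instead of polling `kubectl get` by hand, a script can block until the operator rollout finishes. The sketch below assumes the deployment name shown in the sample output above; `operator_rollout_status` is a hypothetical wrapper around `kubectl rollout status`.

```shell
# Hypothetical helper: waits for the operator deployment to finish
# rolling out, or fails after the timeout.
# The default deployment name is copied from the sample output and may
# differ in other installations.
operator_rollout_status() {
  deployment="${1:-production-stack-production-stack-controller-manager}"
  timeout="${2:-120s}"
  kubectl rollout status "deployment/$deployment" \
    --namespace production-stack-system --timeout="$timeout"
}

# Usage (requires cluster access):
#   operator_rollout_status && echo "operator is ready"
```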

Deploying vLLM Resources

Deploy vLLM Runtime

(Optional) If your model is gated and requires a Hugging Face token (for example, Llama-3.1-8B), create a secret:

```
kubectl create secret generic huggingface-token \
  --from-literal=token=<your-hf-token> \
  --namespace=default
```
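
To keep the token out of shell history, the same secret can be created from an environment variable. This is a sketch only: `HF_TOKEN` is an assumed variable name and `create_hf_secret` is a hypothetical helper.

```shell
# Hypothetical helper: creates the huggingface-token secret from the
# HF_TOKEN environment variable (an assumed name), refusing to run
# when the variable is empty.
create_hf_secret() {
  namespace="${1:-default}"
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  kubectl create secret generic huggingface-token \
    --from-literal=token="$HF_TOKEN" \
    --namespace="$namespace"
}

# Usage (requires cluster access):
#   export HF_TOKEN=<your-hf-token>
#   create_hf_secret default
```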

Deploy the vLLM runtime:

```
kubectl apply -f operator/config/samples/production-stack_v1alpha1_vllmruntime.yaml
```

This creates a vLLM runtime instance in your Kubernetes cluster.
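
Model download and loading can take several minutes, so a script may want to wait for the runtime pod to become ready. The sketch below assumes the runtime pods carry the label `app=vllmruntime-sample` (the selector used in the router sample); `wait_for_runtime` is a hypothetical helper.

```shell
# Hypothetical helper: blocks until pods matching the selector report
# the Ready condition, or the timeout expires.
# The default selector is an assumption based on the router sample's
# k8sLabelSelector.
wait_for_runtime() {
  selector="${1:-app=vllmruntime-sample}"
  timeout="${2:-600s}"
  kubectl wait --for=condition=Ready pod \
    --selector "$selector" --timeout="$timeout"
}

# Usage (requires cluster access):
#   wait_for_runtime app=vllmruntime-sample 900s
```

`kubectl wait` returns an error when no pod matches the selector yet, so the call may need a retry if it runs immediately after `kubectl apply`.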

Deploy vLLM Router

Start the vLLM router:

```
kubectl apply -f operator/config/samples/production-stack_v1alpha1_vllmrouter.yaml
```

Verify both components are running:

```
kubectl get pods
```

You should see:

```
NAME                                  READY   STATUS    RESTARTS   AGE
vllmrouter-sample-6fc78b7f85-lt5n7    1/1     Running   0          3m31s
vllmruntime-sample-7448f7547c-pdfml   1/1     Running   0          6m10s
```

Troubleshooting Initial Deployment

If a pod reports a RunContainerError, inspect its logs and events:
```
kubectl get pods
kubectl logs <pod-name>
kubectl describe pod <pod-name>
```
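
When a container has already restarted, the current logs may be empty. The following sketch collects the previous container's logs and the events that reference the pod; `debug_pod` is a hypothetical helper name.

```shell
# Hypothetical helper: gathers common diagnostics for one pod.
debug_pod() {
  pod="$1"
  namespace="${2:-default}"
  # Logs from the previous container instance, useful after a restart.
  # `|| true` continues when no previous instance exists.
  kubectl logs "$pod" --namespace "$namespace" --previous || true
  # Events that reference this pod, oldest first.
  kubectl get events --namespace "$namespace" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}

# Usage (requires cluster access):
#   debug_pod <pod-name> default
```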

Sample Configurations

VLLMRuntime Sample (production-stack_v1alpha1_vllmruntime.yaml)

```
apiVersion: production-stack.vllm.ai/v1alpha1
kind: VLLMRuntime
metadata:
  labels:
    app.kubernetes.io/name: production-stack
    app.kubernetes.io/managed-by: kustomize
  name: vllmruntime-sample
spec:
  # Model configuration
  model:
    modelURL: "meta-llama/Llama-3.1-8B"
    enableLoRA: false
    enableTool: false
    toolCallParser: ""
    maxModelLen: 4096
    dtype: "bfloat16"
    maxNumSeqs: 32
    # HuggingFace token secret (optional)
    hfTokenSecret:
      name: "huggingface-token"
    hfTokenName: "token"

  # vLLM server configuration
  vllmConfig:
    # vLLM specific configurations
    enableChunkedPrefill: false
    enablePrefixCaching: false
    tensorParallelSize: 1
    gpuMemoryUtilization: "0.8"
    maxLoras: 4
    extraArgs: ["--disable-log-requests"]
    v1: true
    port: 8000
    # Environment variables
    env:
      - name: HF_HOME
        value: "/data"

  # LM Cache configuration
  lmCacheConfig:
    enabled: true
    cpuOffloadingBufferSize: "15"
    diskOffloadingBufferSize: "0"
    remoteUrl: "lm://cacheserver-sample.default.svc.cluster.local:80"
    remoteSerde: "naive"

  # Deployment configuration
  deploymentConfig:
    # Resource requirements
    resources:
      cpu: "10"
      memory: "32Gi"
      gpu: "1"

    # Image configuration
    image:
      registry: "docker.io"
      name: "lmcache/vllm-openai:2025-05-27-v1"
      pullPolicy: "IfNotPresent"
      pullSecretName: ""

    # Number of replicas
    replicas: 1

    # Deployment strategy
    deploymentStrategy: "Recreate"
```
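
Fields in the sample can also be changed on a live resource without editing the file. As an illustration, the sketch below builds a JSON merge patch for `spec.deploymentConfig.replicas`. It assumes the operator reconciles that field after a patch, which this guide does not confirm; `scale_runtime` is a hypothetical helper.

```shell
# Hypothetical helper: patches spec.deploymentConfig.replicas on a
# VLLMRuntime resource using a JSON merge patch.
# Assumes the operator reacts to changes of this field.
scale_runtime() {
  name="$1"
  replicas="$2"
  # Reject anything that is not a non-negative integer.
  case "$replicas" in
    ''|*[!0-9]*)
      echo "replicas must be a non-negative integer" >&2
      return 1
      ;;
  esac
  kubectl patch vllmruntime "$name" --type merge \
    --patch "{\"spec\":{\"deploymentConfig\":{\"replicas\":$replicas}}}"
}

# Usage (requires cluster access):
#   scale_runtime vllmruntime-sample 2
```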

VLLMRouter Sample (production-stack_v1alpha1_vllmrouter.yaml)

```
apiVersion: production-stack.vllm.ai/v1alpha1
kind: VLLMRouter
metadata:
  labels:
    app.kubernetes.io/name: production-stack
    app.kubernetes.io/managed-by: kustomize
  name: vllmrouter-sample
spec:
  # Enable the router deployment
  enableRouter: true

  # Number of router replicas
  replicas: 1

  # Service discovery method (k8s or static)
  serviceDiscovery: k8s

  # Label selector for vLLM runtime pods
  k8sLabelSelector: "app=vllmruntime-sample"

  # Routing strategy (roundrobin or session)
  routingLogic: roundrobin

  # Engine statistics collection interval
  engineScrapeInterval: 30

  # Request statistics window
  requestStatsWindow: 60

  # Container port for the router service
  port: 80

  # Service account name
  serviceAccountName: vllmrouter-sa

  # Image configuration
  image:
    registry: docker.io
    name: lmcache/lmstack-router
    pullPolicy: IfNotPresent

  # Resource requirements
  resources:
    cpu: "2"
    memory: "8Gi"

  # Environment variables
  env:
    - name: LOG_LEVEL
      value: "info"
    - name: METRICS_ENABLED
      value: "true"

  # Node selector for pod scheduling
  nodeSelectorTerms:
    - matchExpressions:
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
```
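
After applying the manifest, individual settings can be read back from the cluster to confirm what the API server stored. This is a minimal sketch using `kubectl get` with a JSONPath expression; `router_field` is a hypothetical helper, and the field names come from the sample above.

```shell
# Hypothetical helper: prints one top-level field from the spec of a
# VLLMRouter resource, for example routingLogic or serviceDiscovery.
router_field() {
  name="$1"
  field="$2"
  kubectl get vllmrouter "$name" -o "jsonpath={.spec.$field}"
}

# Usage (requires cluster access):
#   router_field vllmrouter-sample routingLogic
```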

Testing the Deployment

Port Forward the Router

Expose the router service locally:

```
kubectl port-forward svc/vllmrouter-sample 30080:80 --address 0.0.0.0
```
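
The port-forward takes a moment to become usable, so a script can poll the forwarded port before sending requests. The sketch below assumes the router answers on the OpenAI-style `/v1/models` path, which this guide does not state; `wait_for_router` is a hypothetical helper.

```shell
# Hypothetical helper: polls a URL once per second until it returns a
# successful HTTP status or the attempt budget is used up.
# The default URL assumes the router serves /v1/models.
wait_for_router() {
  url="${1:-http://localhost:30080/v1/models}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP errors; --silent hides progress.
    if curl --silent --fail --output /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage (with the port-forward running in another terminal):
#   wait_for_router && echo "router is reachable"
```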

Test with a Simple Request

In a separate terminal, test the deployment with a curl command:

```
curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B",
    "prompt": "1 plus 1 equals to",
    "max_tokens": 100
  }'
```

A successful response should look like:

```
{
  "id": "cmpl-0c3a06af79df4cb2a5e6f8c3fb1f1215",
  "object": "text_completion",
  "created": 1750121964,
  "model": "meta-llama/Llama-3.1-8B",
  "choices": [
    {
      "index": 0,
      "text": " 2\nThis is a very simple equation...",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 108,
    "completion_tokens": 100,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
```
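
In the response above, `total_tokens` (108) is the sum of `prompt_tokens` (8) and `completion_tokens` (100). A smoke test can check that relationship. The sketch below uses plain text matching with `sed` rather than a real JSON parser, so it only suits simple responses like this one; `usage_field` and `check_usage` are hypothetical helpers.

```shell
# Hypothetical helper: reads a completion response on stdin and prints
# the first integer value found for the given field name.
# This is text matching, not a JSON parser.
usage_field() {
  sed -n "s/.*\"$1\": *\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Hypothetical helper: succeeds when
# prompt_tokens + completion_tokens equals total_tokens.
check_usage() {
  response="$(cat)"
  prompt="$(printf '%s\n' "$response" | usage_field prompt_tokens)"
  completion="$(printf '%s\n' "$response" | usage_field completion_tokens)"
  total="$(printf '%s\n' "$response" | usage_field total_tokens)"
  # Fail if any of the three fields is missing.
  [ -n "$prompt" ] && [ -n "$completion" ] && [ -n "$total" ] || return 1
  [ $((prompt + completion)) -eq "$total" ]
}

# Usage: pipe the curl output from the request above into check_usage:
#   curl -s ... | check_usage && echo "usage totals are consistent"
```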

Uninstall

Remove Custom Resources

```
kubectl delete vllmrouter vllmrouter-sample
kubectl delete vllmruntime vllmruntime-sample
```

Remove Secrets (if created)

```
kubectl delete secret huggingface-token --namespace=default
```

Remove the Operator and CRDs

Remove the entire operator deployment and custom resource definitions:
```
kubectl delete -f operator/config/default.yaml
```

Verify Cleanup

Confirm that all resources have been removed:

```
kubectl get namespace production-stack-system
kubectl get crd | grep production-stack
kubectl get pods --all-namespaces | grep -E "(vllmruntime|vllmrouter)"
```

The first command should report that the namespace is not found, and the other two should print nothing, indicating successful cleanup.
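
The same checks can be wrapped in a function that returns a non-zero status when anything is left behind, which is convenient in CI teardown steps. This is a sketch under the same assumptions as the commands above; `verify_cleanup` is a hypothetical helper.

```shell
# Hypothetical helper: returns 0 only when no operator CRDs and no
# runtime or router pods remain in the cluster.
verify_cleanup() {
  leftovers=0
  if kubectl get crd -o name | grep -q production-stack; then
    echo "CRDs still present" >&2
    leftovers=1
  fi
  if kubectl get pods --all-namespaces | grep -Eq '(vllmruntime|vllmrouter)'; then
    echo "vLLM pods still present" >&2
    leftovers=1
  fi
  [ "$leftovers" -eq 0 ]
}

# Usage (requires cluster access):
#   verify_cleanup && echo "cleanup complete"
```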

Happy deploying! 🚀