CRD Deployment#

Deploy minimal vLLM production stack on Kubernetes using Custom Resource Definitions (CRDs) and Custom Resources (CRs).

Note

This deployment method is recommended for production environments as it provides better resource management, monitoring, and lifecycle management through Kubernetes operators.

Prerequisites#

  • kubectl version v1.11.3+

  • Access to a Kubernetes v1.11.3+ cluster

Installation#

  1. Clone the repository

    First, clone the vLLM production stack repository:

    git clone https://github.com/vllm-project/production-stack.git
    
  2. Deploy the Operator

    Deploy the production stack operator by running:

    kubectl create -f operator/config/default.yaml
    

    This command achieves the following:

    • Namespace Creation: Creates a namespace called production-stack-system where the operator will run

    • Custom Resource Definitions (CRDs): Defines 4 new custom resources that can be managed by this operator:

      • CacheServer: For managing cache servers

      • LoraAdapter: For managing LoRA adapters (used for model fine-tuning)

      • VLLMRouter: For managing vLLM routing

      • VLLMRuntime: For managing vLLM runtime instances

    • RBAC (Role-Based Access Control): Creates various roles and role bindings to control access to these resources with permissions for:

      • Admin roles (full access)

      • Editor roles (create/update/delete)

      • Viewer roles (read-only)

      • Metrics and leader election

    • Service Account: Creates a service account production-stack-controller-manager for the operator

    • Deployment: Deploys the operator controller manager as a deployment with health checks, resource limits, and security settings using the image lmcache/production-stack-operator:latest

    • Service: Creates a metrics service for monitoring the operator

  3. Verify the Operator Deployment

    Check the status of the operator deployment:

    kubectl get pods -n production-stack-system
    kubectl get deployment -n production-stack-system
    

    You should see output similar to:

    NAME                                                              READY   STATUS    RESTARTS   AGE
    production-stack-production-stack-controller-manager-65b86brxm6   1/1     Running   0          21s
    
    NAME                                                   READY   UP-TO-DATE   AVAILABLE   AGE
    production-stack-production-stack-controller-manager   1/1     1            1           25s
    

Deploying vLLM Resources#

  1. Deploy vLLM Runtime

    (Optional) If your model requires a Hugging Face token (like Llama-3.1-8B), create a secret:

    kubectl create secret generic huggingface-token \
      --from-literal=token=<your-hf-token> \
      --namespace=default
    

    Deploy the vLLM runtime:

    kubectl apply -f operator/config/samples/production-stack_v1alpha1_vllmruntime.yaml
    

    This creates a vLLM runtime instance in your Kubernetes cluster.

  2. Deploy vLLM Router

    Start the vLLM router:

    kubectl apply -f operator/config/samples/production-stack_v1alpha1_vllmrouter.yaml
    

    Verify both components are running:

    kubectl get pods
    

    You should see:

          NAME                                  READY   STATUS    RESTARTS   AGE
    vllmrouter-sample-6fc78b7f85-lt5n7    1/1     Running   0          3m31s
    vllmruntime-sample-7448f7547c-pdfml   1/1     Running   0          6m10s
    
  3. Troubleshooting Initial Deployment

    If you encounter a RunContainerError, check the logs:

    kubectl get pods
    kubectl logs <pod-name>
    kubectl describe pod <pod-name>
    

Sample Configurations#

VLLMRuntime Sample (production-stack_v1alpha1_vllmruntime.yaml)

apiVersion: production-stack.vllm.ai/v1alpha1
kind: VLLMRuntime
metadata:
  labels:
    app.kubernetes.io/name: production-stack
    app.kubernetes.io/managed-by: kustomize
  name: vllmruntime-sample
spec:
  # Model configuration
  model:
    modelURL: "meta-llama/Llama-3.1-8B"
    enableLoRA: false
    enableTool: false
    toolCallParser: ""
    maxModelLen: 4096
    dtype: "bfloat16"
    maxNumSeqs: 32
    # HuggingFace token secret (optional)
    hfTokenSecret:
      name: "huggingface-token"
    hfTokenName: "token"

  # vLLM server configuration
  vllmConfig:
    # vLLM specific configurations
    enableChunkedPrefill: false
    enablePrefixCaching: false
    tensorParallelSize: 1
    gpuMemoryUtilization: "0.8"
    maxLoras: 4
    extraArgs: ["--disable-log-requests"]
    v1: true
    port: 8000
    # Environment variables
    env:
      - name: HF_HOME
        value: "/data"

  # LM Cache configuration
  lmCacheConfig:
    enabled: true
    cpuOffloadingBufferSize: "15"
    diskOffloadingBufferSize: "0"
    remoteUrl: "lm://cacheserver-sample.default.svc.cluster.local:80"
    remoteSerde: "naive"

  # Deployment configuration
  deploymentConfig:
    # Resource requirements
    resources:
      cpu: "10"
      memory: "32Gi"
      gpu: "1"

    # Image configuration
    image:
      registry: "docker.io"
      name: "lmcache/vllm-openai:2025-05-27-v1"
      pullPolicy: "IfNotPresent"
      pullSecretName: ""

    # Number of replicas
    replicas: 1

    # Deployment strategy
    deploymentStrategy: "Recreate"

VLLMRouter Sample (production-stack_v1alpha1_vllmrouter.yaml)

apiVersion: production-stack.vllm.ai/v1alpha1
kind: VLLMRouter
metadata:
  labels:
    app.kubernetes.io/name: production-stack
    app.kubernetes.io/managed-by: kustomize
  name: vllmrouter-sample
spec:
  # Enable the router deployment
  enableRouter: true

  # Number of router replicas
  replicas: 1

  # Service discovery method (k8s or static)
  serviceDiscovery: k8s

  # Label selector for vLLM runtime pods
  k8sLabelSelector: "app=vllmruntime-sample"

  # Routing strategy (roundrobin or session)
  routingLogic: roundrobin

  # Engine statistics collection interval
  engineScrapeInterval: 30

  # Request statistics window
  requestStatsWindow: 60

  # Container port for the router service
  port: 80

  # Service account name
  serviceAccountName: vllmrouter-sa

  # Image configuration
  image:
    registry: docker.io
    name: lmcache/lmstack-router
    pullPolicy: IfNotPresent

  # Resource requirements
  resources:
    cpu: "2"
    memory: "8Gi"

  # Environment variables
  env:
    - name: LOG_LEVEL
      value: "info"
    - name: METRICS_ENABLED
      value: "true"

  # Node selector for pod scheduling
  nodeSelectorTerms:
    - matchExpressions:
        - key: kubernetes.io/os
          operator: In
          values:
            - linux

Testing the Deployment#

  1. Port Forward the Router

    Expose the router service locally:

    kubectl port-forward svc/vllmrouter-sample 30080:80 --address 0.0.0.0
    
  2. Test with a Simple Request

    In a separate terminal, test the deployment with a curl command:

    curl -X POST http://localhost:30080/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.1-8B",
        "prompt": "1 plus 1 equals to",
        "max_tokens": 100
      }'
    

    A successful response should look like:

    {
      "id": "cmpl-0c3a06af79df4cb2a5e6f8c3fb1f1215",
      "object": "text_completion",
      "created": 1750121964,
      "model": "meta-llama/Llama-3.1-8B",
      "choices": [
        {
          "index": 0,
          "text": " 2\nThis is a very simple equation...",
          "logprobs": null,
          "finish_reason": "length",
          "stop_reason": null,
          "prompt_logprobs": null
        }
      ],
      "usage": {
        "prompt_tokens": 8,
        "total_tokens": 108,
        "completion_tokens": 100,
        "prompt_tokens_details": null
      },
               "kv_transfer_params": null
     }
    

Uninstall#

  1. Remove Custom Resources

    kubectl delete vllmrouter vllmrouter-sample
    kubectl delete vllmruntime vllmruntime-sample
    
  2. Remove Secrets (if created)

    kubectl delete secret huggingface-token --namespace=default
    
  3. Remove the Operator and CRDs

    Remove the entire operator deployment and custom resource definitions:

    kubectl delete -f operator/config/default.yaml
    
  4. Verify Cleanup

    Confirm that all resources have been removed:

    kubectl get namespace production-stack-system
    
    kubectl get crd | grep production-stack
    
    kubectl get pods --all-namespaces | grep -E "(vllmruntime|vllmrouter)"
    

    You should see no results from these commands, indicating successful cleanup.

Happy deploying! 🚀