Helm Charts

Helm Charts#

Production Stack uses Helm charts for deployment. It lets users deploy multiple serving engines and a router into the Kubernetes cluster.

Key features#

Support running multiple serving engines with multiple different models
Load the model weights directly from the existing PersistentVolumes

Prerequisites#

A running Kubernetes cluster with GPU. (You can set it up through minikube)
Helm installed.

Values for Helm charts are found in a values.yaml file.

To configure the file automatically, you can use the a json file as a schema called values.schema.json.

Example `values.yaml` file#

servingEngineSpec:
    modelSpec:
    - name: "opt125m"
        repository: "lmcache/vllm-openai"
        tag: "latest"
        modelURL: "facebook/opt-125m"

        replicaCount: 1

        requestCPU: 6
        requestMemory: "16Gi"
        requestGPU: 1

        pvcStorage: "10Gi"

Explanation of the fields#

name: The name of the model.
repository: The repository of the model to download the weights.
tag: The tag of the model to download the weights.
modelURL: The model URL to download the weights.
replicaCount: The number of replicas to run.
requestCPU: The CPU request for the serving engine.
requestMemory: The memory request for the serving engine.
requestGPU: The GPU request for the serving engine.
pvcStorage: The storage request for the serving engine.