Helm Charts
Production Stack is deployed with Helm charts, which let users deploy multiple serving engines and a router into a Kubernetes cluster.
Key features
Support running multiple serving engines with multiple different models
Load the model weights directly from the existing PersistentVolumes
Prerequisites
Values for the Helm chart are set in a values.yaml file.
To configure the file automatically, you can use a JSON schema file called values.schema.json.
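One way the schema can assist while editing is sketched below. It assumes an editor that runs the Red Hat YAML language server, and that values.schema.json sits in the same directory as values.yaml. Both are assumptions, not something the chart documentation states. With that setup, a modeline comment at the top of values.yaml points the editor at the schema so it can offer completion and flag invalid fields:

```yaml
# yaml-language-server: $schema=./values.schema.json
# The line above is an editor hint only (assumed setup: YAML language server,
# schema file next to this one). YAML parsers, Helm included, read it as a
# plain comment. The rest of values.yaml follows unchanged.
```

Independently of any editor, Helm validates supplied values against a values.schema.json in the chart root when running install, upgrade, lint, or template.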
Example values.yaml file
servingEngineSpec:
  modelSpec:
  - name: "opt125m"
    repository: "lmcache/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
    pvcStorage: "10Gi"
Explanation of the fields
name: The name of the model.
repository: The repository of the container image for the serving engine.
tag: The tag of that container image.
modelURL: The model to serve, given as the location its weights are downloaded from (a Hugging Face model ID in the example above).
replicaCount: The number of replicas to run.
requestCPU: The CPU request for the serving engine.
requestMemory: The memory request for the serving engine.
requestGPU: The GPU request for the serving engine.
pvcStorage: The storage request for the serving engine's PersistentVolumeClaim.
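The key features mention running several models side by side. A minimal sketch of what that might look like follows, assuming each additional model is simply another entry in the modelSpec list with the same fields as above. The second entry (opt350m, facebook/opt-350m) and its resource figures are hypothetical values chosen for illustration, not taken from the chart's documentation:

```yaml
servingEngineSpec:
  modelSpec:
  # First model: same values as the example above.
  - name: "opt125m"
    repository: "lmcache/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
    pvcStorage: "10Gi"
  # Second model: hypothetical entry, resource figures are illustrative only.
  - name: "opt350m"
    repository: "lmcache/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-350m"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
    pvcStorage: "10Gi"
```

Each entry would request its own CPU, memory, GPU, and storage, so the cluster needs enough capacity for the sum of all entries multiplied by their replica counts.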