vLLM-Omni Helm Chart
Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/chart-helm
Helm chart for deploying vLLM-Omni on Kubernetes. vLLM-Omni extends vLLM with omni-modality model serving, supporting text-to-image, multimodal chat, text-to-speech, and more.
Prerequisites
- Kubernetes 1.24+
- Helm 3.x
- NVIDIA GPU nodes with NVIDIA Device Plugin
Quick Start
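A minimal sketch of a first deployment, assuming the chart is checked out at `./chart-helm` and the release is named `my-release` (the names used by the other examples on this page). The Deployment and Service names follow the chart's templates with default values; the local port, the rollout timeout, and the `quick_start` function name are illustrative choices. The commands are wrapped in a function so that nothing touches the cluster until it is called.

```shell
# Assumed names: release "my-release" and chart path "./chart-helm".
RELEASE="my-release"
CHART_DIR="./chart-helm"

# Names the chart derives from the release name when defaults are used.
DEPLOYMENT="${RELEASE}-deployment-vllm-omni"
SERVICE="${RELEASE}-service"
SERVICE_PORT=80     # servicePort default in values.yaml
LOCAL_PORT=8000     # arbitrary free port on the workstation

quick_start() {
    # Install with default values (model Tongyi-MAI/Z-Image-Turbo, 1 GPU).
    helm install "$RELEASE" "$CHART_DIR" || return 1

    # Model download and load can be slow, so wait for the rollout.
    kubectl rollout status "deployment/${DEPLOYMENT}" --timeout=30m || return 1

    # Forward the ClusterIP service locally and probe the health endpoint.
    kubectl port-forward "service/${SERVICE}" "${LOCAL_PORT}:${SERVICE_PORT}" &
    forward_pid=$!
    sleep 3
    curl -fsS "http://localhost:${LOCAL_PORT}/health"
    status=$?
    kill "$forward_pid"
    return "$status"
}
```

Running `quick_start` requires Helm, kubectl, and access to a cluster with GPU nodes; `helm uninstall my-release` removes the release again.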
Configuration
Model Selection
Set the model value to any supported HuggingFace model ID:
| Model | Type | GPUs | Notes |
|---|---|---|---|
| Tongyi-MAI/Z-Image-Turbo | text-to-image | 1 | Small, fast (default) |
| stabilityai/stable-diffusion-3.5-medium | text-to-image | 1 | ~6GB VRAM |
| Qwen/Qwen-Image | text-to-image | 1 | Large, ~40GB+ VRAM |
| Qwen/Qwen2.5-Omni-7B | multimodal | 2 | Text + audio + image + video |
| Qwen/Qwen3-Omni-7B-Chat | multimodal | 2 | Latest omni model |
| Qwen/Qwen3-TTS | text-to-speech | 1 | TTS |
HuggingFace Token
For gated models that require authentication:
helm install my-release ./chart-helm \
--set model=Qwen/Qwen2.5-Omni-7B \
--set hfToken=hf_xxxxx \
--set resources.requests."nvidia\.com/gpu"=2 \
--set resources.limits."nvidia\.com/gpu"=2
Omni-Specific Flags
Enable VAE memory optimizations for diffusion models:
helm install my-release ./chart-helm \
--set model=Qwen/Qwen-Image \
--set omniArgs.vaeUseSlicing=true \
--set omniArgs.vaeUseTiling=true
Enable CPU offloading:
helm install my-release ./chart-helm \
--set model=Qwen/Qwen-Image \
--set omniArgs.enableCpuOffload=true
Pass additional raw CLI flags (the argument is quoted so that shells such as zsh do not treat the brackets as a glob pattern):
helm install my-release ./chart-helm \
--set model=Qwen/Qwen-Image \
--set 'omniArgs.extraArgs[0]=--enable-layerwise-offload'
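The same settings can be kept in a values file instead of a chain of `--set` flags. A minimal sketch, assuming a file named `qwen-image-values.yaml` (a hypothetical name); every key is taken from the chart's values.yaml listed under Example materials. Whether a particular combination of flags suits a given model depends on vLLM-Omni, not on the chart.

```shell
# Collect the overrides in one file; the file name is arbitrary.
cat > qwen-image-values.yaml <<'EOF'
model: "Qwen/Qwen-Image"
omniArgs:
  vaeUseSlicing: true
  vaeUseTiling: true
  # Raw flags are appended to the serve command in list order.
  extraArgs:
    - "--enable-layerwise-offload"
EOF

# Install with the file (needs Helm and cluster access):
#   helm install my-release ./chart-helm -f qwen-image-values.yaml
```

A values file avoids shell quoting issues with list indices and can be kept under version control.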
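Model Cache
By default, the chart creates a PersistentVolumeClaim (50Gi, ReadWriteOnce) and mounts it at /cache, where HF_HOME points to /cache/huggingface. The HuggingFace model cache therefore survives pod restarts and models are not downloaded again.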
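A sketch of resizing the cache volume for a larger model, assuming a storage class named `fast-ssd` (the placeholder used in the chart's own tests; substitute a class that exists in your cluster). The keys are those under `modelCache` in values.yaml; the file name is hypothetical.

```shell
# Override the PVC settings used for the HuggingFace cache.
cat > cache-values.yaml <<'EOF'
modelCache:
  enabled: true
  storageSize: "100Gi"
  storageClassName: "fast-ssd"   # placeholder storage class
  accessModes:
    - ReadWriteOnce
EOF

# Install with the file (needs Helm and cluster access):
#   helm install my-release ./chart-helm -f cache-values.yaml
```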
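To use an ephemeral emptyDir volume instead, set modelCache.enabled to false. No PersistentVolumeClaim is created, and models are downloaded again whenever the pod is recreated.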
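A minimal sketch of that override as a values file, assuming the hypothetical file name `ephemeral-cache-values.yaml`:

```shell
# Disable the PVC; the deployment template falls back to an emptyDir volume.
cat > ephemeral-cache-values.yaml <<'EOF'
modelCache:
  enabled: false
EOF

# Install with the file (needs Helm and cluster access):
#   helm install my-release ./chart-helm -f ephemeral-cache-values.yaml
# The equivalent flag form is: --set modelCache.enabled=false
```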
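Custom Command Override
To fully override the container command, set image.command; the chart then ignores model and omniArgs when building the command. Each argument is quoted so that the shell does not treat the brackets as a glob pattern:
helm install my-release ./chart-helm \
--set 'image.command[0]=vllm' \
--set 'image.command[1]=serve' \
--set 'image.command[2]=my-model' \
--set 'image.command[3]=--omni' \
--set 'image.command[4]=--host' \
--set 'image.command[5]=0.0.0.0'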
API Endpoints
Once deployed, vLLM-Omni serves a health check and the OpenAI-compatible endpoints below on the container port (8000 by default). The chart's Service is of type ClusterIP and exposes them inside the cluster on port 80 by default.
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /v1/models | GET | List available models |
| /v1/chat/completions | POST | Chat completions (text/multimodal) |
| /v1/images/generations | POST | Image generation |
| /v1/images/edits | POST | Image editing |
| /v1/audio/speech | POST | Text-to-speech |
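A sketch of calling three of these endpoints with curl, assuming the service has been made reachable at `http://localhost:8000` (for example through a port-forward) and the default model is loaded. The request body uses field names from the OpenAI Images API (`model`, `prompt`, `n`, `size`); which fields vLLM-Omni honours for a given model is an assumption here, so treat the payload as illustrative. The function names are hypothetical.

```shell
# Base URL is an assumption; override it through the environment if needed.
BASE_URL="${BASE_URL:-http://localhost:8000}"
MODEL="Tongyi-MAI/Z-Image-Turbo"

# JSON body in OpenAI Images API style; prompt and size are example values.
IMAGE_REQUEST=$(printf '{"model": "%s", "prompt": "%s", "n": 1, "size": "1024x1024"}' \
    "$MODEL" "a lighthouse at dusk")

check_health() {
    # -f makes curl exit non-zero on HTTP errors.
    curl -fsS "${BASE_URL}/health"
}

list_models() {
    curl -fsS "${BASE_URL}/v1/models"
}

generate_image() {
    curl -fsS "${BASE_URL}/v1/images/generations" \
        -H "Content-Type: application/json" \
        -d "$IMAGE_REQUEST"
}
```

Calling `check_health` first is a cheap way to confirm that the server has finished loading the model before sending a generation request.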
Files
| File | Description |
|---|---|
| Chart.yaml | Chart metadata (name, version, maintainers) |
| values.yaml | Default configuration values |
| values.schema.json | JSON schema for validating values |
| templates/_helpers.tpl | Helper templates for common configurations |
| templates/deployment.yaml | Kubernetes Deployment |
| templates/service.yaml | Kubernetes Service (ClusterIP) |
| templates/ingress.yaml | Optional Ingress (disabled by default) |
| templates/secrets.yaml | Secrets (generic + HuggingFace token) |
| templates/pvc.yaml | PersistentVolumeClaim for model cache |
| templates/configmap.yaml | Optional ConfigMap |
| templates/hpa.yaml | HorizontalPodAutoscaler |
| templates/poddisruptionbudget.yaml | PodDisruptionBudget |
| templates/custom-objects.yaml | Custom Kubernetes objects |
Running Tests
This chart includes unit tests using helm-unittest. Install the plugin and run tests:
# Install plugin
helm plugin install https://github.com/helm-unittest/helm-unittest
# Run tests
helm unittest .
Example materials
Chart.yaml
apiVersion: v2
name: chart-vllm-omni
description: A Helm chart for deploying vLLM-Omni on Kubernetes for omni-modality model serving
# Application chart that can be packaged and deployed
type: application
# Chart version - increment on each change to the chart
version: 0.1.0
# vllm-omni application version — keep in sync with image.tag in values.yaml
appVersion: "0.16.0"
maintainers:
- name: vllm-omni-team
lintconf.yaml
---
rules:
  braces:
    min-spaces-inside: 0
    max-spaces-inside: 0
    min-spaces-inside-empty: -1
    max-spaces-inside-empty: -1
  brackets:
    min-spaces-inside: 0
    max-spaces-inside: 0
    min-spaces-inside-empty: -1
    max-spaces-inside-empty: -1
  colons:
    max-spaces-before: 0
    max-spaces-after: 1
  commas:
    max-spaces-before: 0
    min-spaces-after: 1
    max-spaces-after: 1
  comments:
    require-starting-space: true
    min-spaces-from-content: 2
  document-end: disable
  document-start: disable
  empty-lines:
    max: 2
    max-start: 0
    max-end: 0
  hyphens:
    max-spaces-after: 1
  indentation:
    spaces: consistent
    indent-sequences: whatever
    check-multi-line-strings: false
  key-duplicates: enable
  line-length: disable
  new-line-at-end-of-file: disable
  new-lines:
    type: unix
  trailing-spaces: enable
  truthy:
    level: warning
templates/_helpers.tpl
{{/*
Define the vllm-omni serve command from model + omniArgs.
If image.command is set, uses that as a full override.
*/}}
{{- define "chart.omni-command" -}}
{{- if .Values.image.command }}
{{- toYaml .Values.image.command }}
{{- else }}
- "vllm"
- "serve"
- {{ .Values.model | quote }}
- "--omni"
- "--host"
- "0.0.0.0"
- "--port"
- {{ include "chart.container-port" . | quote }}
{{- if .Values.omniArgs.vaeUseSlicing }}
- "--vae-use-slicing"
{{- end }}
{{- if .Values.omniArgs.vaeUseTiling }}
- "--vae-use-tiling"
{{- end }}
{{- if .Values.omniArgs.enableCpuOffload }}
- "--enable-cpu-offload"
{{- end }}
{{- if .Values.omniArgs.numGpus }}
- "--num-gpus"
- {{ .Values.omniArgs.numGpus | quote }}
{{- end }}
{{- if .Values.omniArgs.stageConfigsPath }}
- "--stage-configs-path"
- {{ .Values.omniArgs.stageConfigsPath | quote }}
{{- end }}
{{- if and .Values.omniArgs.cacheBackend (ne .Values.omniArgs.cacheBackend "none") }}
- "--cache-backend"
- {{ .Values.omniArgs.cacheBackend | quote }}
{{- end }}
{{- if .Values.omniArgs.defaultSamplingParams }}
- "--default-sampling-params"
- {{ .Values.omniArgs.defaultSamplingParams | quote }}
{{- end }}
{{- if and .Values.omniArgs.workerBackend (ne .Values.omniArgs.workerBackend "multi_process") }}
- "--worker-backend"
- {{ .Values.omniArgs.workerBackend | quote }}
{{- end }}
{{- range .Values.omniArgs.extraArgs }}
- {{ . | quote }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Define HuggingFace environment variables
*/}}
{{- define "chart.hf-env" -}}
- name: HF_HOME
  value: "/cache/huggingface"
- name: HOME
  value: "/cache"
{{- if .Values.hfToken }}
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: {{ .Release.Name }}-hf-token
      key: token
{{- end }}
{{- end }}
{{/*
Define ports for the pods
*/}}
{{- define "chart.container-port" -}}
{{- default "8000" .Values.containerPort }}
{{- end }}
{{/*
Define service name
*/}}
{{- define "chart.service-name" -}}
{{- if .Values.serviceName -}}
{{ .Values.serviceName | lower | trim }}
{{- else -}}
{{ printf "%s-service" .Release.Name }}
{{- end -}}
{{- end }}
{{/*
Define service port
*/}}
{{- define "chart.service-port" -}}
{{- if .Values.servicePort }}
{{- .Values.servicePort }}
{{- else }}
{{- include "chart.container-port" . }}
{{- end }}
{{- end }}
{{/*
Define service port name
*/}}
{{- define "chart.service-port-name" -}}
"service-port"
{{- end }}
{{/*
Define container port name
*/}}
{{- define "chart.container-port-name" -}}
"container-port"
{{- end }}
{{/*
Define deployment strategy
*/}}
{{- define "chart.strategy" -}}
strategy:
{{- if not .Values.deploymentStrategy }}
  type: Recreate
{{- else }}
{{ toYaml .Values.deploymentStrategy | indent 2 }}
{{- end }}
{{- end }}
{{/*
Define additional ports
*/}}
{{- define "chart.extraPorts" }}
{{- with .Values.extraPorts }}
{{ toYaml . }}
{{- end }}
{{- end }}
{{/*
Define chart external ConfigMaps and Secrets
*/}}
{{- define "chart.externalConfigs" -}}
{{- with .Values.externalConfigs -}}
{{ toYaml . }}
{{- end }}
{{- end }}
{{/*
Define startup, liveness and readiness probes
*/}}
{{- define "chart.probes" -}}
{{- if .Values.startupProbe }}
startupProbe:
{{- with .Values.startupProbe }}
{{- toYaml . | nindent 2 }}
{{- end }}
{{- end }}
{{- if .Values.readinessProbe }}
readinessProbe:
{{- with .Values.readinessProbe }}
{{- toYaml . | nindent 2 }}
{{- end }}
{{- end }}
{{- if .Values.livenessProbe }}
livenessProbe:
{{- with .Values.livenessProbe }}
{{- toYaml . | nindent 2 }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Define resources
*/}}
{{- define "chart.resources" -}}
requests:
  memory: {{ required "Value 'resources.requests.memory' must be defined !" .Values.resources.requests.memory | quote }}
  cpu: {{ required "Value 'resources.requests.cpu' must be defined !" .Values.resources.requests.cpu | quote }}
  {{- if and (gt (int (index .Values.resources.requests "nvidia.com/gpu")) 0) (gt (int (index .Values.resources.limits "nvidia.com/gpu")) 0) }}
  nvidia.com/gpu: {{ required "Value 'resources.requests.nvidia.com/gpu' must be defined !" (index .Values.resources.requests "nvidia.com/gpu") }}
  {{- end }}
limits:
  memory: {{ required "Value 'resources.limits.memory' must be defined !" .Values.resources.limits.memory | quote }}
  cpu: {{ required "Value 'resources.limits.cpu' must be defined !" .Values.resources.limits.cpu | quote }}
  {{- if and (gt (int (index .Values.resources.requests "nvidia.com/gpu")) 0) (gt (int (index .Values.resources.limits "nvidia.com/gpu")) 0) }}
  nvidia.com/gpu: {{ required "Value 'resources.limits.nvidia.com/gpu' must be defined !" (index .Values.resources.limits "nvidia.com/gpu") }}
  {{- end }}
{{- end }}
{{/*
Define user for the main container
*/}}
{{- define "chart.user" }}
{{- if .Values.image.runAsUser }}
runAsUser:
{{- with .Values.image.runAsUser }}
{{- toYaml . | nindent 2 }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Define chart labels
*/}}
{{- define "chart.labels" -}}
{{- with .Values.labels -}}
{{ toYaml . }}
{{- end }}
{{- end }}
templates/configmap.yaml
templates/custom-objects.yaml
templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "{{ .Release.Name }}-deployment-vllm-omni"
  namespace: {{ .Release.Namespace }}
  labels:
  {{- include "chart.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  {{- include "chart.strategy" . | nindent 2 }}
  selector:
    matchLabels:
    {{- include "chart.labels" . | nindent 6 }}
  progressDeadlineSeconds: 1800
  template:
    metadata:
      labels:
      {{- include "chart.labels" . | nindent 8 }}
    spec:
      containers:
        - name: "vllm-omni"
          image: "{{ required "Required value 'image.repository' must be defined !" .Values.image.repository }}:{{ required "Required value 'image.tag' must be defined !" .Values.image.tag }}"
          command:
          {{- include "chart.omni-command" . | nindent 12 }}
          securityContext:
            {{- if .Values.image.securityContext }}
            {{- with .Values.image.securityContext }}
            {{- toYaml . | nindent 12 }}
            {{- end }}
            {{- else }}
            runAsNonRoot: false
            {{- include "chart.user" . | indent 12 }}
            {{- end }}
          imagePullPolicy: IfNotPresent
          env:
            {{- include "chart.hf-env" . | nindent 12 }}
            {{- if .Values.image.env }}
            {{- toYaml .Values.image.env | nindent 12 }}
            {{- end }}
          {{- if or .Values.externalConfigs .Values.configs .Values.secrets }}
          envFrom:
            {{- if .Values.configs }}
            - configMapRef:
                name: "{{ .Release.Name }}-configs"
            {{- end }}
            {{- if .Values.secrets }}
            - secretRef:
                name: "{{ .Release.Name }}-secrets"
            {{- end }}
            {{- include "chart.externalConfigs" . | nindent 12 }}
          {{- end }}
          ports:
            - name: {{ include "chart.container-port-name" . }}
              containerPort: {{ include "chart.container-port" . }}
            {{- include "chart.extraPorts" . | nindent 12 }}
          {{- include "chart.probes" . | indent 10 }}
          resources: {{- include "chart.resources" . | nindent 12 }}
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: model-cache
              mountPath: /cache
        {{- with .Values.extraContainers }}
        {{ toYaml . | nindent 8 }}
        {{- end }}
      {{- if .Values.extraInit.initContainers }}
      initContainers:
        {{- toYaml .Values.extraInit.initContainers | nindent 8 }}
      {{- end }}
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: {{ .Values.shmSize }}
        - name: model-cache
          {{- if .Values.modelCache.enabled }}
          persistentVolumeClaim:
            claimName: {{ .Release.Name }}-model-cache
          {{- else }}
          emptyDir: {}
          {{- end }}
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- if .Values.gpuModels }}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.product
                    operator: In
                    values:
                    {{- toYaml .Values.gpuModels | nindent 20 }}
      {{- end }}
templates/hpa.yaml
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: "{{ .Release.Name }}-hpa"
  namespace: {{ .Release.Namespace }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: "{{ .Release.Name }}-deployment-vllm-omni"
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    {{- if .Values.autoscaling.targetCPUUtilizationPercentage }}
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
    {{- end }}
    {{- if .Values.autoscaling.targetMemoryUtilizationPercentage }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
    {{- end }}
{{- end }}
templates/ingress.yaml
{{- if .Values.ingress.enabled }}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: "{{ .Release.Name }}-ingress"
  namespace: {{ .Release.Namespace }}
  labels:
  {{- include "chart.labels" . | nindent 4 }}
  {{- with .Values.ingress.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  {{- if .Values.ingress.ingressClassName }}
  ingressClassName: {{ .Values.ingress.ingressClassName }}
  {{- end }}
  {{- if .Values.ingress.tls }}
  tls:
    {{- toYaml .Values.ingress.tls | nindent 4 }}
  {{- end }}
  rules:
    - host: {{ .Values.ingress.host }}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: "{{ include "chart.service-name" . }}"
                port:
                  number: {{ include "chart.service-port" . }}
{{- end }}
templates/poddisruptionbudget.yaml
{{- if gt (int .Values.replicaCount) 1 }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: "{{ .Release.Name }}-pdb"
  namespace: {{ .Release.Namespace }}
spec:
  maxUnavailable: {{ default 1 .Values.maxUnavailablePodDisruptionBudget }}
  selector:
    matchLabels:
    {{- include "chart.labels" . | nindent 6 }}
{{- end }}
templates/pvc.yaml
{{- if .Values.modelCache.enabled }}
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "{{ .Release.Name }}-model-cache"
  namespace: {{ .Release.Namespace }}
spec:
  accessModes:
    {{- toYaml .Values.modelCache.accessModes | nindent 4 }}
  {{- if .Values.modelCache.storageClassName }}
  storageClassName: {{ .Values.modelCache.storageClassName }}
  {{- end }}
  resources:
    requests:
      storage: {{ .Values.modelCache.storageSize }}
{{- end }}
templates/secrets.yaml
{{- if .Values.secrets }}
apiVersion: v1
kind: Secret
metadata:
  name: "{{ .Release.Name }}-secrets"
  namespace: {{ .Release.Namespace }}
type: Opaque
data:
  {{- range $key, $val := .Values.secrets }}
  {{ $key }}: {{ $val | b64enc | quote }}
  {{- end }}
---
{{- end }}
{{- if .Values.hfToken }}
apiVersion: v1
kind: Secret
metadata:
  name: "{{ .Release.Name }}-hf-token"
  namespace: {{ .Release.Namespace }}
type: Opaque
data:
  token: {{ .Values.hfToken | b64enc | quote }}
{{- end }}
templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: "{{ include "chart.service-name" . }}"
  namespace: {{ .Release.Namespace }}
spec:
  type: ClusterIP
  ports:
    - name: {{ include "chart.service-port-name" . }}
      port: {{ include "chart.service-port" . }}
      targetPort: {{ include "chart.container-port-name" . }}
      protocol: TCP
  selector:
  {{- include "chart.labels" . | nindent 4 }}
tests/deployment_test.yaml
suite: test deployment
templates:
  - deployment.yaml
tests:
  - it: should create deployment with default omni command
    asserts:
      - hasDocuments:
          count: 1
      - isKind:
          of: Deployment
      - equal:
          path: spec.template.spec.containers[0].name
          value: vllm-omni
      - equal:
          path: spec.template.spec.containers[0].command[0]
          value: vllm
      - equal:
          path: spec.template.spec.containers[0].command[1]
          value: serve
      - equal:
          path: spec.template.spec.containers[0].command[2]
          value: Tongyi-MAI/Z-Image-Turbo
      - equal:
          path: spec.template.spec.containers[0].command[3]
          value: "--omni"
      - equal:
          path: spec.template.spec.containers[0].command[4]
          value: "--host"
      - equal:
          path: spec.template.spec.containers[0].command[5]
          value: "0.0.0.0"
      - equal:
          path: spec.template.spec.containers[0].command[6]
          value: "--port"
      - equal:
          path: spec.template.spec.containers[0].command[7]
          value: "8000"

  - it: should use custom model when set
    set:
      model: "Qwen/Qwen2.5-Omni-7B"
    asserts:
      - equal:
          path: spec.template.spec.containers[0].command[2]
          value: Qwen/Qwen2.5-Omni-7B

  - it: should include omniArgs flags in command
    set:
      omniArgs:
        vaeUseSlicing: true
        vaeUseTiling: true
        enableCpuOffload: true
        numGpus: 2
        cacheBackend: "tea_cache"
        workerBackend: "ray"
        extraArgs:
          - "--enable-layerwise-offload"
    asserts:
      - contains:
          path: spec.template.spec.containers[0].command
          content: "--vae-use-slicing"
      - contains:
          path: spec.template.spec.containers[0].command
          content: "--vae-use-tiling"
      - contains:
          path: spec.template.spec.containers[0].command
          content: "--enable-cpu-offload"
      - contains:
          path: spec.template.spec.containers[0].command
          content: "--num-gpus"
      - contains:
          path: spec.template.spec.containers[0].command
          content: "--cache-backend"
      - contains:
          path: spec.template.spec.containers[0].command
          content: "--worker-backend"
      - contains:
          path: spec.template.spec.containers[0].command
          content: "--enable-layerwise-offload"

  - it: should use full command override when image.command is set
    set:
      image:
        command:
          - "custom-binary"
          - "--flag"
    asserts:
      - equal:
          path: spec.template.spec.containers[0].command[0]
          value: custom-binary
      - equal:
          path: spec.template.spec.containers[0].command[1]
          value: "--flag"

  - it: should mount shm and model-cache volumes
    asserts:
      - contains:
          path: spec.template.spec.containers[0].volumeMounts
          content:
            name: shm
            mountPath: /dev/shm
      - contains:
          path: spec.template.spec.containers[0].volumeMounts
          content:
            name: model-cache
            mountPath: /cache
      - contains:
          path: spec.template.spec.volumes
          content:
            name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 8Gi

  - it: should use PVC for model-cache when modelCache is enabled
    set:
      modelCache:
        enabled: true
        storageSize: "50Gi"
        accessModes:
          - ReadWriteOnce
    asserts:
      - contains:
          path: spec.template.spec.volumes
          content:
            name: model-cache
            persistentVolumeClaim:
              claimName: RELEASE-NAME-model-cache

  - it: should use emptyDir for model-cache when modelCache is disabled
    set:
      modelCache:
        enabled: false
        storageSize: "50Gi"
        accessModes:
          - ReadWriteOnce
    asserts:
      - contains:
          path: spec.template.spec.volumes
          content:
            name: model-cache
            emptyDir: {}

  - it: should set HF_HOME environment variable
    asserts:
      - contains:
          path: spec.template.spec.containers[0].env
          content:
            name: HF_HOME
            value: /cache/huggingface

  - it: should set HF_TOKEN env when hfToken is provided
    set:
      hfToken: "hf_test_token_123"
    asserts:
      - contains:
          path: spec.template.spec.containers[0].env
          content:
            name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: RELEASE-NAME-hf-token
                key: token

  - it: should create custom init containers when specified
    set:
      extraInit:
        initContainers:
          - name: my-init
            image: busybox:latest
            command: ["echo", "init"]
    asserts:
      - lengthEqual:
          path: spec.template.spec.initContainers
          count: 1
      - equal:
          path: spec.template.spec.initContainers[0].name
          value: my-init
      - equal:
          path: spec.template.spec.initContainers[0].image
          value: busybox:latest
tests/ingress_test.yaml
suite: test ingress
templates:
  - ingress.yaml
tests:
  - it: should not create Ingress when disabled
    set:
      ingress:
        enabled: false
    asserts:
      - hasDocuments:
          count: 0

  - it: should create Ingress when enabled
    set:
      ingress:
        enabled: true
        host: "vllm-omni.example.com"
    asserts:
      - hasDocuments:
          count: 1
      - isKind:
          of: Ingress
      - equal:
          path: spec.rules[0].host
          value: vllm-omni.example.com
      - equal:
          path: spec.rules[0].http.paths[0].path
          value: /
      - equal:
          path: spec.rules[0].http.paths[0].pathType
          value: Prefix
      - equal:
          path: spec.rules[0].http.paths[0].backend.service.name
          value: RELEASE-NAME-service
      - equal:
          path: spec.rules[0].http.paths[0].backend.service.port.number
          value: 80

  - it: should set ingressClassName when specified
    set:
      ingress:
        enabled: true
        host: "vllm-omni.example.com"
        ingressClassName: "nginx"
    asserts:
      - equal:
          path: spec.ingressClassName
          value: nginx

  - it: should not set ingressClassName when empty
    set:
      ingress:
        enabled: true
        host: "vllm-omni.example.com"
        ingressClassName: ""
    asserts:
      - notExists:
          path: spec.ingressClassName

  - it: should include annotations when specified
    set:
      ingress:
        enabled: true
        host: "vllm-omni.example.com"
        annotations:
          nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    asserts:
      - isNotEmpty:
          path: metadata.annotations

  - it: should include TLS configuration when specified
    set:
      ingress:
        enabled: true
        host: "vllm-omni.example.com"
        tls:
          - secretName: vllm-omni-tls
            hosts:
              - vllm-omni.example.com
    asserts:
      - equal:
          path: spec.tls[0].secretName
          value: vllm-omni-tls
      - equal:
          path: spec.tls[0].hosts[0]
          value: vllm-omni.example.com
tests/pvc_test.yaml
suite: test pvc
templates:
  - pvc.yaml
tests:
  - it: should create PVC when modelCache is enabled
    set:
      modelCache:
        enabled: true
        storageSize: "50Gi"
        storageClassName: ""
        accessModes:
          - ReadWriteOnce
    asserts:
      - hasDocuments:
          count: 1
      - isKind:
          of: PersistentVolumeClaim
      - equal:
          path: spec.accessModes[0]
          value: ReadWriteOnce
      - equal:
          path: spec.resources.requests.storage
          value: 50Gi

  - it: should not create PVC when modelCache is disabled
    set:
      modelCache:
        enabled: false
        storageSize: "50Gi"
        accessModes:
          - ReadWriteOnce
    asserts:
      - hasDocuments:
          count: 0

  - it: should use custom storage size
    set:
      modelCache:
        enabled: true
        storageSize: "100Gi"
        storageClassName: ""
        accessModes:
          - ReadWriteOnce
    asserts:
      - equal:
          path: spec.resources.requests.storage
          value: 100Gi

  - it: should set storageClassName when specified
    set:
      modelCache:
        enabled: true
        storageSize: "50Gi"
        storageClassName: "fast-ssd"
        accessModes:
          - ReadWriteOnce
    asserts:
      - equal:
          path: spec.storageClassName
          value: fast-ssd

  - it: should support ReadWriteMany access mode
    set:
      modelCache:
        enabled: true
        storageSize: "50Gi"
        storageClassName: ""
        accessModes:
          - ReadWriteMany
    asserts:
      - equal:
          path: spec.accessModes[0]
          value: ReadWriteMany
tests/secrets_test.yaml
suite: test secrets
templates:
  - secrets.yaml
tests:
  - it: should create HF token secret when hfToken is provided
    set:
      hfToken: "hf_test_token_123"
    asserts:
      - hasDocuments:
          count: 1
      - isKind:
          of: Secret
      - equal:
          path: metadata.name
          value: RELEASE-NAME-hf-token
      - exists:
          path: data.token

  - it: should not create any secrets when hfToken is empty and no secrets defined
    set:
      hfToken: ""
      secrets: {}
    asserts:
      - hasDocuments:
          count: 0

  - it: should create both generic and HF token secrets
    set:
      hfToken: "hf_test_token_123"
      secrets:
        mykey: myvalue
    asserts:
      - hasDocuments:
          count: 2

  - it: should create only generic secrets when no hfToken
    set:
      hfToken: ""
      secrets:
        mykey: myvalue
    asserts:
      - hasDocuments:
          count: 1
      - isKind:
          of: Secret
      - equal:
          path: metadata.name
          value: RELEASE-NAME-secrets
values.yaml
# -- Default values for chart vllm-omni
# -- Declare variables to be passed into your templates.

# -- HuggingFace model ID to serve
model: "Tongyi-MAI/Z-Image-Turbo"

# -- Image configuration
image:
  # -- Image repository
  repository: "vllm/vllm-omni"
  # -- Image tag
  tag: "v0.16.0"
  # -- Override the container command entirely. If empty, the chart constructs
  # -- it automatically from model + omniArgs.
  command: []
  # -- Optional environment variables for the container
  env: []
  # -- Security context override
  securityContext: {}

# -- Container port
containerPort: 8000
# -- Service name (auto-generated from release name if empty)
serviceName:
# -- Service port
servicePort: 80
# -- Additional ports configuration
extraPorts: []

# -- Number of replicas
replicaCount: 1

# -- Deployment strategy configuration
deploymentStrategy: {}

# -- Resource configuration
resources:
  requests:
    # -- Number of CPUs
    cpu: 4
    # -- CPU memory
    memory: 24Gi
    # -- Number of GPUs
    nvidia.com/gpu: 1
  limits:
    # -- Number of CPUs
    cpu: 4
    # -- CPU memory
    memory: 24Gi
    # -- Number of GPUs
    nvidia.com/gpu: 1

# -- GPU model types for node affinity scheduling (optional)
# -- Leave empty to schedule on any GPU node. Set to restrict to specific GPU types.
# -- Example: ["NVIDIA-A100-SXM4-40GB"]
gpuModels: []

# -- vllm-omni specific CLI arguments
omniArgs:
  # -- Additional raw CLI flags appended to the serve command (list of strings)
  extraArgs: []
  # -- Enable VAE slicing for memory optimization
  vaeUseSlicing: false
  # -- Enable VAE tiling for memory optimization
  vaeUseTiling: false
  # -- Enable CPU offloading for diffusion models
  enableCpuOffload: false
  # -- Number of GPUs for diffusion inference (empty = auto)
  numGpus:
  # -- Path to stage configs file (empty = auto-detected from model)
  stageConfigsPath:
  # -- Cache backend: none, tea_cache, or cache_dit
  cacheBackend: "none"
  # -- Default sampling params as a JSON string
  defaultSamplingParams:
  # -- Worker backend: multi_process or ray
  workerBackend: "multi_process"

# -- HuggingFace token for gated models (optional)
hfToken: ""

# -- Shared memory size for PyTorch multiprocessing
shmSize: "8Gi"

# -- Model cache configuration (HuggingFace downloads)
modelCache:
  # -- Use a PersistentVolumeClaim for model cache (recommended for production)
  enabled: true
  # -- Storage size for the PVC
  storageSize: "50Gi"
  # -- Storage class name (empty = cluster default)
  storageClassName: ""
  # -- Access modes
  accessModes:
    - ReadWriteOnce

# -- Autoscaling configuration
autoscaling:
  # -- Enable autoscaling
  enabled: false
  # -- Minimum replicas
  minReplicas: 1
  # -- Maximum replicas
  maxReplicas: 10
  # -- Target CPU utilization for autoscaling
  targetCPUUtilizationPercentage: 80
  # targetMemoryUtilizationPercentage: 80

# -- ConfigMap data (key-value pairs injected as environment variables)
configs: {}

# -- Secrets data (key-value pairs, base64-encoded automatically)
secrets: {}

# -- External ConfigMaps/Secrets references
externalConfigs: []

# -- Custom Kubernetes objects
customObjects: []

# -- PodDisruptionBudget max unavailable
maxUnavailablePodDisruptionBudget: ""

# -- Additional init containers
extraInit:
  initContainers: []

# -- Additional sidecar containers
extraContainers: []

# -- Startup probe configuration
# -- Protects slow-starting containers. Liveness and readiness probes
# -- do not start until the startup probe succeeds.
# -- failureThreshold * periodSeconds = max startup time (40 * 30 = 1200s = 20 min)
startupProbe:
  # -- HTTP check configuration
  httpGet:
    # -- Path to check
    path: /health
    # -- Port to check (uses named port to stay in sync with containerPort)
    port: "container-port"
  # -- Failures before killing the container
  failureThreshold: 40
  # -- How often to perform the probe
  periodSeconds: 30

# -- Readiness probe configuration
readinessProbe:
  # -- How often to perform the probe
  periodSeconds: 10
  # -- Failures before marking unready
  failureThreshold: 3
  # -- HTTP check configuration
  httpGet:
    # -- Path to check
    path: /health
    # -- Port to check (uses named port to stay in sync with containerPort)
    port: "container-port"

# -- Liveness probe configuration
livenessProbe:
  # -- Failures before restarting
  failureThreshold: 3
  # -- How often to perform the probe
  periodSeconds: 15
  # -- HTTP check configuration
  httpGet:
    # -- Path to check
    path: /health
    # -- Port to check (uses named port to stay in sync with containerPort)
    port: "container-port"

# -- Ingress configuration
ingress:
  # -- Enable Ingress
  enabled: false
  # -- Ingress class name (e.g., nginx, traefik)
  ingressClassName: ""
  # -- Hostname for the Ingress rule
  host: "vllm-omni.example.com"
  # -- Additional annotations
  annotations: {}
  # -- TLS configuration
  tls: []
  #  - secretName: vllm-omni-tls
  #    hosts:
  #      - vllm-omni.example.com

# -- Node selector for pod scheduling
nodeSelector: {}

# -- Tolerations for pod scheduling
tolerations: []

# -- Labels applied to all resources
labels:
  app: "vllm-omni"