FAQ

Frequently asked questions about the vLLM Production Stack.

Installation & Setup

This section is to be updated.

Deployment & Configuration

Q: How do I update to a new version of vLLM Production Stack?

Set the new version in your values.yaml file, then upgrade the release. In the command below, my-vllm-stack is the release name and vllm/vllm-stack is the chart:

helm upgrade my-vllm-stack vllm/vllm-stack -f values.yaml

Q: How do I scale my deployment?

You can scale in several ways:

  • Horizontal scaling: Increase replicaCount in your values.yaml

  • Vertical scaling: Allocate more GPUs per replica

  • Auto-scaling: See Autoscaling with KEDA to scale replicas automatically
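As a rough illustration of how the first two options interact, the sketch below computes the total number of GPUs a deployment requests. The function name and the numbers are hypothetical and are not part of the Helm chart; the only point is that replicas and GPUs per replica multiply.

```python
def total_gpus(replica_count: int, gpus_per_replica: int) -> int:
    """Total GPUs requested by a deployment (illustrative helper, not a chart API)."""
    if replica_count < 0 or gpus_per_replica < 0:
        raise ValueError("counts must be non-negative")
    return replica_count * gpus_per_replica


# Hypothetical sizes, chosen only to compare the two scaling directions.
baseline = total_gpus(replica_count=2, gpus_per_replica=1)
horizontal = total_gpus(replica_count=4, gpus_per_replica=1)  # more replicas
vertical = total_gpus(replica_count=2, gpus_per_replica=2)    # more GPUs per replica
print(baseline, horizontal, vertical)
```

In this example both changes double the GPU request, so the cluster needs that much free GPU capacity before the new pods can be scheduled.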

Q: What’s the difference between the router and vLLM instances?

A:

  • Router: Handles request routing, load balancing, and advanced features like KV cache management

  • vLLM instances: Run the actual model inference

  • The router distributes requests across multiple vLLM instances for better performance and availability
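The division of labor above can be sketched with a toy round-robin router. This is only an illustration of distributing requests across instances, not the production stack's actual routing logic; the class and the instance names are hypothetical.

```python
from itertools import cycle


class RoundRobinRouter:
    """Toy router: hands each request to the next backend in turn."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = cycle(backends)

    def route(self, request_id):
        # Pick the next backend in rotation. A real router would also
        # consider health, load, and cache locality.
        backend = next(self._backends)
        return backend, request_id


# Hypothetical instance names, for illustration only.
router = RoundRobinRouter(["vllm-0", "vllm-1", "vllm-2"])
assignments = [router.route(i)[0] for i in range(5)]
print(assignments)
```

Each instance only ever sees the requests routed to it, which is why adding instances behind the router raises throughput without changing how clients connect.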

Performance & Optimization

Q: How can I improve inference performance?

Several optimization strategies are available:

Q: What is KV cache and why does it matter?

The KV (key-value) cache stores the attention keys and values computed for previous tokens, so generating each new token does not require recomputing them. Proper KV cache management significantly improves performance for:

  • Long conversations

  • Similar prompts

  • Batch processing
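To make the benefit concrete, here is a minimal sketch of prefix reuse, assuming a cache that can skip any leading tokens it has already processed. It is a toy model with hypothetical names: it tracks token lists rather than real key and value tensors, and it is not how vLLM implements its cache.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Remembers processed token sequences (a stand-in for cached keys/values)."""

    def __init__(self):
        self._seen = []

    def tokens_to_compute(self, tokens):
        # Tokens covered by the longest cached prefix can be skipped.
        reused = max((common_prefix_len(tokens, s) for s in self._seen), default=0)
        self._seen.append(list(tokens))
        return len(tokens) - reused


cache = ToyPrefixCache()
turn1 = ["system", "hello", "how", "are", "you"]
turn2 = turn1 + ["fine", "thanks"]
first = cache.tokens_to_compute(turn1)   # nothing cached yet
second = cache.tokens_to_compute(turn2)  # shares turn1 as a prefix
print(first, second)
```

The second turn of the conversation only pays for its two new tokens, which is the effect that makes long conversations and prompts with shared prefixes cheaper.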

Q: How do I monitor performance?

Use the built-in monitoring features:

Troubleshooting

Q: Pods are stuck in Pending state

Check the pod's status and events:

kubectl describe pod <pod-name> -n vllm-system

Common causes:

  • Insufficient GPU resources

  • Node selector/affinity issues

  • Resource quotas exceeded

  • Image pull failures
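When the describe output is long, a small helper can point at the likely cause. The sketch below is hypothetical: the substrings it looks for are assumptions based on commonly seen Kubernetes event messages and may differ between cluster versions, so treat the result as a hint rather than a diagnosis.

```python
# (substring, likely cause) pairs. The substrings are assumptions about
# typical event text; adjust them to match what your cluster reports.
CAUSE_HINTS = [
    ("Insufficient nvidia.com/gpu", "Insufficient GPU resources"),
    ("node affinity/selector", "Node selector/affinity issues"),
    ("exceeded quota", "Resource quotas exceeded"),
    ("ImagePullBackOff", "Image pull failures"),
    ("ErrImagePull", "Image pull failures"),
]


def likely_causes(describe_output):
    """Return the causes whose hint appears in the text, without duplicates."""
    found = []
    for hint, cause in CAUSE_HINTS:
        if hint in describe_output and cause not in found:
            found.append(cause)
    return found


# Sample text for illustration; paste real `kubectl describe pod` output instead.
sample = "Warning  FailedScheduling  0/3 nodes are available: 3 Insufficient nvidia.com/gpu."
print(likely_causes(sample))
```

An empty result means none of the assumed patterns matched, in which case the Events section of the describe output is still the place to look.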

Q: Where can I get help?

A:

  • GitHub Issues: Report bugs and feature requests

  • Community meetings: See Community Meetings

  • Documentation: Check other sections of this documentation

  • vLLM Community: Join the broader vLLM community discussions

Q: How can I contribute?

See Contributing for contribution guidelines.

Q: Is there a roadmap?

Check the GitHub repository for the latest roadmap and feature plans.