GPU Memory Calculation and Configuration¶
This guide explains how to calculate GPU memory requirements and properly configure gpu_memory_utilization for vLLM-Omni stages.
Overview¶
gpu_memory_utilization is a critical parameter that controls how much GPU memory each stage can use. It's specified as a fraction between 0.0 and 1.0, where: - 0.8 means 80% of the GPU's total memory - 1.0 means 100% of the GPU's total memory (not recommended, leaves no buffer)
How Memory is Calculated¶
Memory Allocation Formula¶
For each stage, vLLM-Omni calculates the requested memory as:
The system checks that:
If this condition is not met, the stage will fail to initialize with an error message showing the memory requirements.
Memory Components¶
The total memory used by a stage includes:
- Model Weights: The size of the model parameters loaded on the GPU
- KV Cache: Memory for storing key-value cache during generation
- Activation Memory: Temporary memory for intermediate computations
- System Overhead: Memory used by CUDA, PyTorch, and other system components
- Non-Torch Memory: Memory allocated outside of PyTorch (e.g., CUDA graphs)
Example Calculation¶
For a GPU with 80GB total memory: - gpu_memory_utilization: 0.8 → 64GB available for the stage - gpu_memory_utilization: 0.6 → 48GB available for the stage - gpu_memory_utilization: 0.15 → 12GB available for the stage
Setting Up gpu_memory_utilization¶
Step 1: Determine GPU Memory¶
First, check your GPU's total memory:
# Using nvidia-smi
nvidia-smi --query-gpu=memory.total --format=csv
# Or using Python
python -c "import torch; print(f'{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
Step 2: Estimate Model Memory Requirements¶
For Autoregressive (AR) Stages¶
AR stages typically need more memory due to: - Large model weights - KV cache for attention - Activation buffers
For Diffusion/Generation Stages¶
Diffusion stages (like code2wav) typically need less memory: - Smaller model components - Different memory access patterns
Typical values: - 0.1 - 0.3 for most diffusion stages
Step 3: Consider Multi-Stage Scenarios¶
When multiple stages share the same GPU, you must ensure the sum of their gpu_memory_utilization values doesn't exceed 1.0.
Example: Two stages on GPU 0
stage_args:
- stage_id: 0
runtime:
devices: "0"
engine_args:
gpu_memory_utilization: 0.6 # Uses 60% of GPU 0
- stage_id: 1
runtime:
devices: "0"
engine_args:
gpu_memory_utilization: 0.3 # Uses 30% of GPU 0
# Total: 90% of GPU 0 (safe, leaves 10% buffer)
Important: If stages run on different GPUs, each can use up to 1.0 independently.
Step 4: Account for Tensor Parallelism¶
When using tensor_parallel_size > 1, the model is split across multiple GPUs, so each GPU needs less memory.
Example: 2-way tensor parallelism
stage_args:
- stage_id: 0
runtime:
devices: "0,1" # Uses both GPUs
engine_args:
tensor_parallel_size: 2
gpu_memory_utilization: 0.6 # 60% per GPU
# Model is split, so each GPU uses ~30% of model memory
Examples¶
Qwen3-Omni-MoE on 2x H100-80GB¶
stage_args:
- stage_id: 0 # Thinker stage with TP=2
runtime:
devices: "0,1"
engine_args:
tensor_parallel_size: 2
gpu_memory_utilization: 0.6 # 48GB per GPU
- stage_id: 1 # Talker stage
runtime:
devices: "1"
engine_args:
gpu_memory_utilization: 0.3 # 24GB on GPU 1
- stage_id: 2 # Code2Wav stage
runtime:
devices: "0"
engine_args:
gpu_memory_utilization: 0.1 # 8GB on GPU 0
Troubleshooting¶
Error: "Free memory is less than desired GPU memory utilization"¶
This means the GPU doesn't have enough free memory when the stage starts.
Solutions: 1. Free up memory by closing other processes 2. Reduce gpu_memory_utilization for this stage 3. Use a GPU with more memory 4. Move the stage to a different GPU
Error: OOM during inference¶
The stage initialized but ran out of memory during processing.
Solutions: 1. Reduce max_num_batched_tokens 2. Reduce max_num_seqs in engine_args 3. Lower gpu_memory_utilization slightly 4. Enable quantization if supported
Memory Not Fully Utilized¶
If you see low memory usage, you can: 1. Increase gpu_memory_utilization to allow larger KV cache 2. Increase max_num_batched_tokens for better batching 3. Check if other stages are limiting throughput
Useful formula for Memory Calculation¶
KV Cache Memory¶
The KV cache size depends on: - Number of sequences in batch - Sequence length (prompt + generation) - Model hidden size - Number of attention heads - Number of layers
approximate Formula:
2 for k & vModel Weight Memory¶
For example: - 7B parameters in FP16: ~14GB - 7B parameters in FP32: ~28GB - 7B parameters in INT8: ~7GB
Activation Memory¶
Activation memory is typically smaller but varies with: - Batch size - Sequence length - Model architecture
It's usually 10-30% of model weight memory during inference.