vllm_omni.worker.memory_utils
GPU memory utilities for vLLM Omni workers.
Includes a tolerant version of the upstream request_memory() that handles multi-stage GPU sharing by capping the memory budget to available free memory instead of raising ValueError.
request_memory_tolerant
request_memory_tolerant(
init_snapshot: MemorySnapshot, cache_config: CacheConfig
) -> int
Calculate the amount of memory required for this stage.
Like upstream request_memory(), but tolerates multi-stage GPU sharing: if free_memory < requested_memory (because another stage on the same GPU has already consumed memory), it caps the requested budget to the actual free memory instead of raising ValueError. This is safe because the downstream OmniGPUWorkerBase.determine_available_memory() already does per-process NVML accounting and computes the KV cache budget correctly even when the budget has been capped.
Logs a warning when the budget is capped so operators can detect under-provisioned GPU memory.
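
The capping behavior described above can be illustrated with a minimal sketch. This is not the module's actual implementation: `SnapshotStandIn`, `CacheConfigStandIn`, and `request_memory_tolerant_sketch` are hypothetical names, and the sketch assumes the requested budget is `total_memory * gpu_memory_utilization`, which the text above does not state explicitly.

```python
import logging
import math
from dataclasses import dataclass

logger = logging.getLogger("memory_utils_sketch")


@dataclass
class SnapshotStandIn:
    """Hypothetical stand-in for MemorySnapshot (sizes in bytes)."""
    total_memory: int
    free_memory: int


@dataclass
class CacheConfigStandIn:
    """Hypothetical stand-in for CacheConfig."""
    gpu_memory_utilization: float


def request_memory_tolerant_sketch(
    init_snapshot: SnapshotStandIn, cache_config: CacheConfigStandIn
) -> int:
    """Return the memory budget in bytes, capped to free memory if needed."""
    # Assumption: the budget is a fraction of the GPU's total memory.
    requested_memory = math.ceil(
        init_snapshot.total_memory * cache_config.gpu_memory_utilization
    )

    if init_snapshot.free_memory < requested_memory:
        # Another stage on the same GPU may already hold memory, so cap
        # the budget and warn instead of raising ValueError.
        logger.warning(
            "Requested %d bytes but only %d bytes are free; capping budget.",
            requested_memory,
            init_snapshot.free_memory,
        )
        return init_snapshot.free_memory

    return requested_memory
```

For example, on an 80 GiB GPU with a utilization of 0.5, the sketch requests 40 GiB. If another stage has left only 30 GiB free, it returns 30 GiB and logs a warning rather than failing.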