vllm_omni.worker.memory_utils

GPU memory utilities for vLLM Omni workers.

Includes a tolerant version of the upstream request_memory() that handles multi-stage GPU sharing by capping the memory budget to available free memory instead of raising ValueError.

logger module-attribute

logger = init_logger(__name__)

request_memory_tolerant

request_memory_tolerant(
    init_snapshot: MemorySnapshot, cache_config: CacheConfig
) -> int

Calculate the amount of memory required for this stage.

Behaves like the upstream request_memory(), but tolerates multi-stage GPU sharing. If free_memory < requested_memory because another stage on the same GPU has already consumed memory, the requested budget is capped to the actual free memory instead of raising ValueError. The downstream OmniGPUWorkerBase.determine_available_memory() already does per-process NVML accounting and computes the KV cache budget correctly regardless.

Logs a warning when the budget is capped so operators can detect under-provisioned GPU memory.
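
The capping behavior described above can be illustrated with a minimal, self-contained sketch. It is not the actual implementation: `SnapshotSketch`, `CacheConfigSketch`, and `request_memory_tolerant_sketch` are hypothetical stand-ins, and the field names `total_memory`, `free_memory`, and `gpu_memory_utilization`, as well as the rule that the requested budget is `total_memory * gpu_memory_utilization`, are assumptions about how the upstream types behave.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class SnapshotSketch:
    """Hypothetical stand-in for MemorySnapshot (field names are assumed)."""
    total_memory: int  # bytes of device memory in total
    free_memory: int   # bytes currently free on the device


@dataclass
class CacheConfigSketch:
    """Hypothetical stand-in for CacheConfig (field name is assumed)."""
    gpu_memory_utilization: float  # fraction of total memory to request


def request_memory_tolerant_sketch(
    init_snapshot: SnapshotSketch, cache_config: CacheConfigSketch
) -> int:
    # Assumed budget rule: a fixed fraction of the device's total memory.
    requested = int(
        init_snapshot.total_memory * cache_config.gpu_memory_utilization
    )
    if init_snapshot.free_memory < requested:
        # Another stage already holds memory on this GPU: cap instead of
        # raising ValueError, and warn so operators can notice.
        logger.warning(
            "Requested %d bytes but only %d bytes are free; capping budget.",
            requested,
            init_snapshot.free_memory,
        )
        return init_snapshot.free_memory
    return requested
```

Under these assumptions, a stage that asks for half of an 80 GiB device while only 30 GiB is free would receive a 30 GiB budget plus a warning, whereas the same request with 60 GiB free would receive the full 40 GiB.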