vllm_gaudi.patches
Runtime monkey-patches applied when the HPU plugin is loaded.
Currently:
- torch.accelerator.empty_cache: HPU's allocator does not implement the c10::DeviceAllocator interface, so the upstream helper raises RuntimeError: Allocator for hpu is not a DeviceAllocator. We replace it with an HPU-safe variant that routes through current_platform.empty_cache() (a no-op on HPU). This also makes the cleanup_dist_env_and_memory patch resilient to import-order issues.
- torch._C._host_emptyCache: does not exist on HPU; we install a no-op stub to prevent AttributeError in cleanup_dist_env_and_memory.
- vllm.distributed.parallel_state.cleanup_dist_env_and_memory: upstream (since vllm PR #34328) calls torch.accelerator.empty_cache(), which requires the device's allocator to be a c10::DeviceAllocator. We replace it with an HPU-safe variant that uses current_platform.empty_cache() instead (see GAUDISW-247825).
- vllm.v1.sample.ops.logprobs.batched_count_greater_than: upstream decorates this function with @torch.compile(dynamic=True, ...). Habana's recipe_compiler backend cannot handle the symbolic shapes produced by dynamic=True (and by mark_unbacked in the caller), raising TypeError: Cannot convert symbols to int. We replace it with a plain (uncompiled) version of the same function. The replacement is deferred to load_general_plugins time to avoid importing vllm.v1.sample.sampler during early plugin registration, which would trigger a heavy import chain that interferes with platform initialisation.
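The no-op stub described for torch._C._host_emptyCache can be illustrated with a minimal sketch. It does not touch torch: fake_backend is a stand-in namespace, and install_stub_if_missing is a hypothetical helper name. Whether the real patch checks for an existing attribute before installing the stub is an assumption.

```python
from types import SimpleNamespace

# Stand-in for a backend module such as torch._C (NOT the real module).
fake_backend = SimpleNamespace()


def _noop(*args, **kwargs):
    """Do nothing; used where the backend has no equivalent call."""
    return None


def install_stub_if_missing(namespace, name, stub=_noop):
    """Attach `stub` under `name` only if the attribute is absent.

    Returns True when a stub was installed, False when the existing
    attribute was left untouched. (Hypothetical helper for illustration.)
    """
    if hasattr(namespace, name):
        return False
    setattr(namespace, name, stub)
    return True


# The first call installs the stub; the second finds it already present.
installed = install_stub_if_missing(fake_backend, "_host_emptyCache")
again = install_stub_if_missing(fake_backend, "_host_emptyCache")

# Callers can now invoke the attribute without hitting AttributeError.
fake_backend._host_emptyCache()
```

Guarding with hasattr keeps the stub from shadowing a real implementation if a later backend release provides one.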
_hpu_accelerator_empty_cache
HPU-safe replacement for torch.accelerator.empty_cache().
HPU's allocator does not implement the c10::DeviceAllocator
interface, so the upstream torch.accelerator.empty_cache() raises
RuntimeError. Route through current_platform.empty_cache
instead (which is None on HPU, making this a no-op).
Source code in vllm_gaudi/patches.py
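As a rough sketch of the routing described above, the replacement only needs to tolerate a platform whose empty_cache attribute is None. The example uses stand-in platform objects rather than vLLM's current_platform, and make_safe_empty_cache is a hypothetical name introduced for illustration.

```python
from types import SimpleNamespace


def make_safe_empty_cache(platform):
    """Build a cache-release function that tolerates a missing hook.

    `platform` is any object with an optional `empty_cache` attribute;
    here it stands in for vLLM's current_platform (an assumption).
    """
    def safe_empty_cache():
        fn = getattr(platform, "empty_cache", None)
        if fn is not None:
            fn()  # the platform provides a real cache release
        # otherwise fall through: nothing to release, so this is a no-op

    return safe_empty_cache


# A platform without a cache-release hook, as described for HPU.
hpu_like = SimpleNamespace(empty_cache=None)

# A platform with a hook, recorded in a list so the call is observable.
calls = []
other = SimpleNamespace(empty_cache=lambda: calls.append("freed"))

make_safe_empty_cache(hpu_like)()  # does nothing, raises nothing
make_safe_empty_cache(other)()     # invokes the platform hook once
```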
_hpu_batched_count_greater_than
HPU-safe replacement for batched_count_greater_than.
Identical logic to the upstream implementation but not wrapped in
torch.compile. The upstream decorator uses dynamic=True whose
symbolic shapes are incompatible with Habana's recipe_compiler
backend, and mark_unbacked in the caller prevents dynamic=False
from helping.
Source code in vllm_gaudi/patches.py
_hpu_cleanup_dist_env_and_memory
_hpu_cleanup_dist_env_and_memory(
    shutdown_ray: bool = False,
) -> None
HPU-safe replacement for cleanup_dist_env_and_memory.
Mirrors the upstream implementation but routes the device-side cache
release through current_platform.empty_cache() instead of
torch.accelerator.empty_cache() (which is incompatible with the
HPU allocator).
Source code in vllm_gaudi/patches.py
_patch_batched_count_greater_than
Replace batched_count_greater_than in the sampler and logprobs modules.
Called from the load_general_plugins hook so that the heavy
vllm.v1.sample.* import chain runs after platform initialisation.
Source code in vllm_gaudi/patches.py
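One likely reason the function is replaced in both modules is Python's name binding: a module that did `from ... import batched_count_greater_than` holds its own reference, so rebinding the name only in the defining module leaves the importer with the old function. The sketch below shows this with stand-in modules. The module names, the placeholder functions, and the patch_everywhere helper are all hypothetical, and that the real sampler imports the name this way is an assumption.

```python
import types


def compiled_variant(x, values):
    """Placeholder for the upstream torch.compile-wrapped function."""
    return "compiled"


def plain_variant(x, values):
    """Placeholder for the uncompiled replacement."""
    return "plain"


# Stand-in for the module that defines the function.
ops = types.ModuleType("fake_ops_logprobs")
ops.batched_count_greater_than = compiled_variant

# Stand-in for a module that did
# `from fake_ops_logprobs import batched_count_greater_than`.
sampler = types.ModuleType("fake_sampler")
sampler.batched_count_greater_than = ops.batched_count_greater_than

# Patching only the defining module leaves the importer's copy stale.
ops.batched_count_greater_than = plain_variant
stale = sampler.batched_count_greater_than is compiled_variant


def patch_everywhere(name, replacement, modules):
    """Rebind `name` in every module that already exposes it."""
    for mod in modules:
        if hasattr(mod, name):
            setattr(mod, name, replacement)


patch_everywhere("batched_count_greater_than", plain_variant, [ops, sampler])
```

After patch_everywhere runs, both modules resolve the name to the plain variant.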
apply
Install all HPU runtime monkey-patches.
Source code in vllm_gaudi/patches.py
patch_hf3fs_mock_client
Guard CUDA sync in the HF3FS mock client on non-CUDA platforms.
The upstream mock client's batch_write unconditionally calls
torch.cuda.current_stream().wait_event(event), which raises
RuntimeError on platforms without CUDA (e.g. HPU). We wrap
batch_write to stub torch.cuda.current_stream with a no-op
mock for the duration of the call.
Called from register_utils() (general plugin) rather than
apply() (platform plugin) to avoid circular imports: the mock
client transitively imports vllm.config, which is not yet fully
initialized during platform registration.