`vllm.v1.kv_offload.base` ¶

Core abstractions for KV cache offloading in vLLM v1.

Classes:

BlockIDsLoadStoreSpec –

Spec for loading/storing KV blocks from given block numbers.
CanonicalKVCacheRef –

Per-layer (or group of layers) reference to a specific (by index) CanonicalKVCacheTensor, plus the un-padded page size used by that layer.
CanonicalKVCacheTensor –

A canonicalized KV cache tensor whose first dimension is num_blocks.
CanonicalKVCaches –

Canonicalized block-level representation of the KV caches.
GPULoadStoreSpec –

Spec for loading/storing a KV block to GPU memory.
LoadStoreSpec –

Metadata that encapsulates information allowing a worker to load, and optionally also to store, blocks of KV data.
Locality –

Locality of a tier's storage relative to the publishing instance.
LookupResult –

Result of OffloadingManager.lookup().
OffloadingManager –
OffloadingSpec –

Spec for an offloading connector
OffloadingWorker –

Runs in the worker process. Performs async KV transfers for ONE offloaded medium (e.g. CPU).
ScheduleEndContext –

Per-step scheduling info passed to on_schedule_end().

Functions:

get_offload_block_hash –

Extract the block hash from an OffloadKey.
get_offload_group_idx –

Extract the group index from an OffloadKey.
make_offload_key –

Pack a block hash and group index into an OffloadKey.

`BlockIDsLoadStoreSpec` ¶

Bases: LoadStoreSpec, ABC

Spec for loading/storing KV blocks from given block numbers.

Source code in vllm/v1/kv_offload/base.py

class BlockIDsLoadStoreSpec(LoadStoreSpec, ABC):
    """
    Spec for loading/storing KV blocks from given block numbers.
    """

    def __init__(self, block_ids: list[int]):
        self.block_ids = np.array(block_ids, dtype=np.int64)

    def __repr__(self) -> str:
        return repr(self.block_ids)

`CanonicalKVCacheRef` `dataclass` ¶

Per-layer (or group of layers) reference to a specific CanonicalKVCacheTensor (by index), together with the un-padded page size used by that layer.

Source code in vllm/v1/kv_offload/base.py

@dataclass
class CanonicalKVCacheRef:
    """
    Per-layer (or group of layers) reference to a specific (by index)
    CanonicalKVCacheTensor and records the un-padded page size used by that layer.
    """

    # Index into the list of CanonicalKVCacheTensor objects
    tensor_idx: int
    # The un-padded page size per block in bytes
    page_size_bytes: int

`CanonicalKVCacheTensor` `dataclass` ¶

A canonicalized KV cache tensor whose first dimension is num_blocks.

For attention backends where the raw tensor has num_blocks at a non-leading physical dimension (e.g. FlashAttention's (2, num_blocks, ...) layout), the tensor is split so that each resulting CanonicalKVCacheTensor starts with (num_blocks, ...).

Source code in vllm/v1/kv_offload/base.py

@dataclass
class CanonicalKVCacheTensor:
    """
    A canonicalized KV cache tensor whose first dimension is num_blocks.

    For attention backends where the raw tensor has num_blocks at a
    non-leading physical dimension (e.g. FlashAttention's
    (2, num_blocks, ...) layout), the tensor is split so that each
    resulting CanonicalKVCacheTensor starts with (num_blocks, ...).
    """

    # The KV cache tensor with shape (num_blocks, ...)
    tensor: torch.Tensor
    # The (possibly padded) page size per block in bytes
    page_size_bytes: int

`CanonicalKVCaches` `dataclass` ¶

Canonicalized block-level representation of the KV caches.

Composed of:

- Unique list of KV cache data tensors, each with shape (num_blocks, page_size_in_bytes) and int8 dtype.
- Per-group data references of the tensors, i.e. how each KV cache group maps to the tensors.

Source code in vllm/v1/kv_offload/base.py

@dataclass
class CanonicalKVCaches:
    """
    Canonicalized block-level representation of the KV caches.

    Composed of:
        - Unique list of KV cache data tensors,
          each with shape (num_blocks, page_size_in_bytes) and int8 dtype.
        - Per-group data references of the tensors.
          i.e. how each KV cache group maps to the tensors.
    """

    # Ordered list of unique block tensors, each with shape
    # (num_blocks, ...).
    tensors: list[CanonicalKVCacheTensor]
    # Per-KV-cache-group list of data references that map each layer
    # in the group to the appropriate entry in the tensors list.
    group_data_refs: list[list[CanonicalKVCacheRef]]
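
As a rough illustration of how the three dataclasses above fit together, the sketch below mirrors them in plain Python, using a `bytearray` in place of `torch.Tensor` so that it runs without PyTorch. The `TensorSketch` and `RefSketch` names, the `read_block` helper, and the assumption that a layer's un-padded bytes sit at the start of each padded page are all hypothetical and not taken from vLLM.

```python
from dataclasses import dataclass


@dataclass
class TensorSketch:
    # Stand-in for CanonicalKVCacheTensor: a flat byte buffer that is
    # logically shaped (num_blocks, page_size_bytes).
    data: bytearray
    page_size_bytes: int  # possibly padded page size per block


@dataclass
class RefSketch:
    # Stand-in for CanonicalKVCacheRef.
    tensor_idx: int
    page_size_bytes: int  # un-padded page size per block


def read_block(tensors: list[TensorSketch], ref: RefSketch, block_id: int) -> bytes:
    """Return the un-padded bytes of one block (assumes padding trails the data)."""
    t = tensors[ref.tensor_idx]
    start = block_id * t.page_size_bytes
    return bytes(t.data[start : start + ref.page_size_bytes])


# Two blocks, each padded to 8 bytes; the layer only uses the first 6.
tensors = [TensorSketch(data=bytearray(range(16)), page_size_bytes=8)]
group_data_refs = [[RefSketch(tensor_idx=0, page_size_bytes=6)]]

block1 = read_block(tensors, group_data_refs[0][0], block_id=1)
```

Here block 1 starts at byte offset 8 (the padded page size), while only 6 bytes (the un-padded page size) are returned.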

`GPULoadStoreSpec` ¶

Bases: BlockIDsLoadStoreSpec

Spec for loading/storing a KV block to GPU memory.

If there are multiple KV groups, the blocks are expected to be ordered by the group index. In that case, group_sizes[i] determines the number of blocks per the i-th KV group, and thus sum(group_sizes) == len(block_ids). group_sizes=None indicates a single KV group.

If block_indices is given, each group (determined by group_sizes) of block IDs corresponds to logically contiguous blocks, e.g. blocks 5-10 of a request. block_indices[i] is the block index of the first block in group #i. Thus, len(block_indices) == len(group_sizes) == number of KV cache groups. This information is required to support offloading to, and loading from, offloaded blocks which are larger than GPU blocks. In such cases, the first GPU block of each group may be unaligned to the offloaded block size, so knowing block_indices[i] allows the worker to correctly skip part of the first matching offloaded block.

Source code in vllm/v1/kv_offload/base.py

class GPULoadStoreSpec(BlockIDsLoadStoreSpec):
    """
    Spec for loading/storing a KV block to GPU memory.

    If there are multiple KV groups, the blocks are expected to be
    ordered by the group index.
    In that case, group_sizes[i] determines the number of blocks
    per the i-th KV group, and thus sum(group_sizes) == len(block_ids).
    group_sizes=None indicates a single KV group.

    If block_indices is given, each group (determined by group_sizes) of block IDs
    will correspond to logically contiguous blocks, e.g. blocks 5-10 of a request.
    block_indices[i] will represent the block index of the first block in group #i.
    Thus, len(block_indices) == len(group_sizes) == number of KV cache groups.
    This information is required in order to support off/loading from offloaded blocks
    which are larger than GPU blocks.
    In such cases, the first GPU block per each group may be unaligned to the offloaded
    block size, and so knowing block_indices[i] allows the worker to correctly
    skip part of the first matching offloaded block.
    """

    def __init__(
        self,
        block_ids: list[int],
        group_sizes: Sequence[int],
        block_indices: Sequence[int],
    ):
        super().__init__(block_ids)
        assert sum(group_sizes) == len(block_ids)
        assert len(block_indices) == len(group_sizes)
        self.group_sizes: Sequence[int] = group_sizes
        self.block_indices: Sequence[int] = block_indices
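
The grouping rules above can be sketched in plain Python. This is only an illustration: the `split_groups` helper and the `gpu_blocks_per_offloaded_block` ratio are hypothetical names, and the modulo used to derive the number of skipped GPU-block slots is one plausible reading of the alignment remark, not vLLM's actual worker logic.

```python
from collections.abc import Sequence


def split_groups(
    block_ids: Sequence[int],
    group_sizes: Sequence[int],
    block_indices: Sequence[int],
) -> list[tuple[int, list[int]]]:
    """Return (first_block_index, block_ids) for each KV cache group."""
    # Same invariants that GPULoadStoreSpec asserts in its constructor.
    assert sum(group_sizes) == len(block_ids)
    assert len(block_indices) == len(group_sizes)
    groups = []
    offset = 0
    for size, first_index in zip(group_sizes, block_indices):
        groups.append((first_index, list(block_ids[offset : offset + size])))
        offset += size
    return groups


# Two KV cache groups: 3 GPU blocks for group 0, then 2 for group 1.
groups = split_groups(
    block_ids=[40, 41, 42, 7, 8],
    group_sizes=[3, 2],
    block_indices=[5, 2],
)

# Hypothetical ratio: one offloaded block spans 4 GPU blocks.
gpu_blocks_per_offloaded_block = 4
# GPU-block-sized slots to skip inside the first matching offloaded block.
skips = [first % gpu_blocks_per_offloaded_block for first, _ in groups]
```

With these inputs, group 0 starts at logical block 5, which is one slot into an offloaded block of four GPU blocks, so one slot would be skipped under this reading.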

`LoadStoreSpec` ¶

Metadata that encapsulates information allowing a worker to load, and optionally also to store, blocks of KV data.

Source code in vllm/v1/kv_offload/base.py

class LoadStoreSpec:
    """
    Metadata that encapsulates information allowing a worker
    to load, and optionally also to store, blocks of KV data.
    """

`Locality` ¶

Bases: Enum

Locality of a tier's storage relative to the publishing instance.

Source code in vllm/v1/kv_offload/base.py

class Locality(Enum):
    """Locality of a tier's storage relative to the publishing instance."""

    LOCAL = "LOCAL"
    REMOTE = "REMOTE"

`LookupResult` ¶

Bases: Enum

Result of OffloadingManager.lookup().

Source code in vllm/v1/kv_offload/base.py

class LookupResult(Enum):
    """Result of OffloadingManager.lookup()."""

    MISS = auto()
    HIT = auto()
    HIT_PENDING = auto()
    RETRY = auto()
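
To make the four outcomes concrete, here is a minimal sketch of how a caller might scan per-block lookup results for a request. The enum is a local mirror so the snippet runs without vLLM, and `count_ready_prefix` is a hypothetical helper; the policy of stopping at the first non-HIT result is an assumption, not something the source specifies.

```python
from enum import Enum, auto


class LookupResult(Enum):
    # Local mirror of the enum above so the sketch runs without vLLM.
    MISS = auto()
    HIT = auto()
    HIT_PENDING = auto()
    RETRY = auto()


def count_ready_prefix(results: list[LookupResult]) -> tuple[int, bool]:
    """Count leading HITs; also report whether trying again later may help."""
    ready = 0
    for result in results:
        if result is LookupResult.HIT:
            ready += 1
            continue
        # HIT_PENDING and RETRY are treated as "not yet", MISS as "not offloaded".
        return ready, result is not LookupResult.MISS
    return ready, False


ready, try_later = count_ready_prefix(
    [LookupResult.HIT, LookupResult.HIT, LookupResult.HIT_PENDING, LookupResult.HIT]
)
```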

`OffloadingManager` ¶

Bases: ABC

Methods:

complete_load –

Marks blocks that were previously prepared for loading as done loading.
complete_store –

Marks blocks which were previously prepared to be stored, as stored.
get_stats –

Return collected metrics since last call, or None if disabled.
has_pending_work –

Whether this manager needs the engine to keep stepping.
lookup –

Checks whether a single block is offloaded and ready to be read.
on_new_request –

Called when a new request is first seen by the scheduler.
on_request_finished –

Called when a request has finished.
on_schedule_end –

Called once at the end of each scheduler step.
prepare_load –

Prepare the given blocks to be read.
prepare_store –

Prepare the given blocks to be offloaded.
reset_cache –

Evict all tracked blocks and reset internal state.
shutdown –

Shut down the manager and release any resources.
take_events –

Take the offloading events from the manager.
touch –

Mark the given blocks as recently used.

Source code in vllm/v1/kv_offload/base.py

class OffloadingManager(ABC):
    @abstractmethod
    def lookup(self, key: OffloadKey, req_context: ReqContext) -> LookupResult:
        """
        Checks whether a single block is offloaded and ready to be read.

        Args:
            key: the key identifying the block to lookup.
            req_context: per-request context (e.g. kv_transfer_params).

        Returns:
            HIT if the block is offloaded and ready, MISS if not found,
            HIT_PENDING if found but not yet readable, or RETRY if the
            lookup should be retried later.
        """
        pass

    @abstractmethod
    def prepare_load(
        self,
        keys: Collection[OffloadKey],
        req_context: ReqContext,
    ) -> LoadStoreSpec:
        """
        Prepare the given blocks to be read.
        The given blocks will be protected from eviction until
        complete_load is called.
        It assumes all given blocks are offloaded.

        Args:
            keys: the keys identifying the blocks.
            req_context: per-request context (e.g. kv_transfer_params).

        Returns:
            A LoadStoreSpec that can be used by a worker to locate and load
            the actual offloaded KV data.
        """
        pass

    def touch(self, keys: Collection[OffloadKey], req_context: ReqContext):
        """
        Mark the given blocks as recently used.
        This could in practice mean moving them to the end of an LRU list.

        Args:
            keys: the keys identifying the blocks.
            req_context: per-request context (e.g. kv_transfer_params).
        """
        return

    def complete_load(self, keys: Collection[OffloadKey], req_context: ReqContext):
        """
        Marks previous blocks that were prepared to load as done loading.

        Args:
            keys: the keys identifying the blocks.
            req_context: per-request context (e.g. kv_transfer_params).
        """
        return

    @abstractmethod
    def prepare_store(
        self,
        keys: Collection[OffloadKey],
        req_context: ReqContext,
    ) -> PrepareStoreOutput | None:
        """
        Prepare the given blocks to be offloaded.
        The given blocks will be protected from eviction until
        complete_store is called.

        Args:
            keys: the keys identifying the blocks.
            req_context: per-request context (e.g. kv_transfer_params).

        Returns:
            A PrepareStoreOutput indicating which blocks need storing,
            where to store them (LoadStoreSpec), and list of blocks that
            were evicted as a result.
            None is returned if the blocks cannot be stored.
        """
        pass

    def complete_store(
        self,
        keys: Collection[OffloadKey],
        req_context: ReqContext,
        success: bool = True,
    ):
        """
        Marks blocks which were previously prepared to be stored, as stored.
        Following this call, the blocks become loadable.
        If success is False, blocks that were not marked as stored will be
        removed.

        Args:
            keys: the keys identifying the blocks.
            req_context: per-request context (e.g. kv_transfer_params).
            success: whether the blocks were stored successfully.
        """
        return

    @abstractmethod
    def on_new_request(self, req_context: ReqContext) -> RequestOffloadingContext:
        """
        Called when a new request is first seen by the scheduler.

        Returns a RequestOffloadingContext indicating how this request's
        blocks should be offloaded.

        Args:
            req_context: per-request context.
        """
        pass

    def on_request_finished(self, req_context: ReqContext) -> None:
        """
        Called when a request has finished.

        By the time this is called, the scheduler will issue no more
        submit-side calls for this request, such as prepare_store() and
        prepare_load(). Completion callbacks for already-submitted transfers
        (complete_store() and complete_load()) may still arrive afterward.

        This hook does NOT imply the data has been persisted. Asynchronous
        transfers already submitted for this request may still be in flight.
        Managers that cascade to lower tiers should delay those tiers'
        on_request_finished() calls until no more lower-tier submit calls can
        be issued for this request.

        Args:
            req_context: per-request context.
        """
        return

    def take_events(self) -> Iterable[OffloadingEvent]:
        """
        Take the offloading events from the manager.

        A tier manager emits only events for storage state it owns. A
        composing manager may aggregate child event streams, but should not
        synthesize events on behalf of a child tier.

        Yields:
            New OffloadingEvents collected since the last call.
        """
        return ()

    def on_schedule_end(self, context: ScheduleEndContext) -> None:
        """Called once at the end of each scheduler step.

        Managers may override this to flush deferred work accumulated
        during the step (e.g., batched promotions).
        """
        return

    def has_pending_work(self) -> bool:
        """Whether this manager needs the engine to keep stepping.

        While True, on_schedule_end() and get_finished_jobs() continue
        to be called even when no requests are scheduled.
        """
        return False

    def reset_cache(self) -> None:
        """Evict all tracked blocks and reset internal state."""
        return

    def get_stats(self) -> "OffloadingConnectorStats | None":
        """Return collected metrics since last call, or None if disabled."""
        return None

    def shutdown(self) -> None:
        """Shutdown the manager and release any resources."""
        return
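
The store-side lifecycle described in the docstrings above (prepare_store, then complete_store, after which blocks become loadable) can be sketched with a toy in-memory class. Everything here is a simplification under stated assumptions: `ToyManager` does not subclass the real `OffloadingManager`, it omits `req_context`, keys are plain `bytes`, `prepare_store` returns a bare list instead of a `PrepareStoreOutput` (whose fields are not shown on this page), and reporting HIT_PENDING for a block whose store is still in flight is a guess at one reasonable behavior.

```python
from enum import Enum, auto


class LookupResult(Enum):
    # Local mirror so the sketch runs without vLLM.
    MISS = auto()
    HIT = auto()
    HIT_PENDING = auto()
    RETRY = auto()


class ToyManager:
    """In-memory stand-in; not a subclass of the real OffloadingManager."""

    def __init__(self) -> None:
        self.stored: set[bytes] = set()   # blocks that are loadable
        self.pending: set[bytes] = set()  # prepared but not yet stored

    def lookup(self, key: bytes) -> LookupResult:
        if key in self.stored:
            return LookupResult.HIT
        if key in self.pending:
            return LookupResult.HIT_PENDING
        return LookupResult.MISS

    def prepare_store(self, keys: list[bytes]) -> list[bytes]:
        # Hypothetical return value: only the keys that still need storing.
        to_store = [k for k in keys if k not in self.stored and k not in self.pending]
        self.pending.update(to_store)
        return to_store

    def complete_store(self, keys: list[bytes], success: bool = True) -> None:
        for key in keys:
            if key in self.pending:
                self.pending.discard(key)
                if success:
                    self.stored.add(key)  # block becomes loadable


manager = ToyManager()
to_store = manager.prepare_store([b"a", b"b"])
during = manager.lookup(b"a")
manager.complete_store([b"a"], success=True)
manager.complete_store([b"b"], success=False)
after = (manager.lookup(b"a"), manager.lookup(b"b"))
```

The failed store of `b"b"` removes the block again, matching the documented behavior of `success=False`.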

`complete_load(keys, req_context)` ¶

Marks blocks that were previously prepared for loading as done loading.

Parameters:

keys ¶
(Collection[OffloadKey]) –

the keys identifying the blocks.
req_context ¶
(ReqContext) –

per-request context (e.g. kv_transfer_params).

Source code in vllm/v1/kv_offload/base.py

def complete_load(self, keys: Collection[OffloadKey], req_context: ReqContext):
    """
    Marks previous blocks that were prepared to load as done loading.

    Args:
        keys: the keys identifying the blocks.
        req_context: per-request context (e.g. kv_transfer_params).
    """
    return

`complete_store(keys, req_context, success=True)` ¶

Marks blocks which were previously prepared to be stored, as stored. Following this call, the blocks become loadable. If success is False, blocks that were not marked as stored will be removed.

Parameters:

keys ¶
(Collection[OffloadKey]) –

the keys identifying the blocks.
req_context ¶
(ReqContext) –

per-request context (e.g. kv_transfer_params).
success ¶
(bool, default: True ) –

whether the blocks were stored successfully.

Source code in vllm/v1/kv_offload/base.py

def complete_store(
    self,
    keys: Collection[OffloadKey],
    req_context: ReqContext,
    success: bool = True,
):
    """
    Marks blocks which were previously prepared to be stored, as stored.
    Following this call, the blocks become loadable.
    If success is False, blocks that were not marked as stored will be
    removed.

    Args:
        keys: the keys identifying the blocks.
        req_context: per-request context (e.g. kv_transfer_params).
        success: whether the blocks were stored successfully.
    """
    return

`get_stats()` ¶

Return collected metrics since last call, or None if disabled.

Source code in vllm/v1/kv_offload/base.py

def get_stats(self) -> "OffloadingConnectorStats | None":
    """Return collected metrics since last call, or None if disabled."""
    return None

`has_pending_work()` ¶

Whether this manager needs the engine to keep stepping.

While True, on_schedule_end() and get_finished_jobs() continue to be called even when no requests are scheduled.

Source code in vllm/v1/kv_offload/base.py

def has_pending_work(self) -> bool:
    """Whether this manager needs the engine to keep stepping.

    While True, on_schedule_end() and get_finished_jobs() continue
    to be called even when no requests are scheduled.
    """
    return False

`lookup(key, req_context)` `abstractmethod` ¶

Checks whether a single block is offloaded and ready to be read.

Parameters:

key ¶
(OffloadKey) –

the key identifying the block to lookup.
req_context ¶
(ReqContext) –

per-request context (e.g. kv_transfer_params).

Returns:

LookupResult –

HIT if the block is offloaded and ready, MISS if not found, HIT_PENDING if found but not yet readable, or RETRY if the lookup should be retried later.

Source code in vllm/v1/kv_offload/base.py

@abstractmethod
def lookup(self, key: OffloadKey, req_context: ReqContext) -> LookupResult:
    """
    Checks whether a single block is offloaded and ready to be read.

    Args:
        key: the key identifying the block to lookup.
        req_context: per-request context (e.g. kv_transfer_params).

    Returns:
        HIT if the block is offloaded and ready, MISS if not found,
        HIT_PENDING if found but not yet readable, or RETRY if the
        lookup should be retried later.
    """
    pass

`on_new_request(req_context)` `abstractmethod` ¶

Called when a new request is first seen by the scheduler.

Returns a RequestOffloadingContext indicating how this request's blocks should be offloaded.

Parameters:

req_context ¶
(ReqContext) –

per-request context.

Source code in vllm/v1/kv_offload/base.py

@abstractmethod
def on_new_request(self, req_context: ReqContext) -> RequestOffloadingContext:
    """
    Called when a new request is first seen by the scheduler.

    Returns a RequestOffloadingContext indicating how this request's
    blocks should be offloaded.

    Args:
        req_context: per-request context.
    """
    pass

`on_request_finished(req_context)` ¶

Called when a request has finished.

By the time this is called, the scheduler will issue no more submit-side calls for this request, such as prepare_store() and prepare_load(). Completion callbacks for already-submitted transfers (complete_store() and complete_load()) may still arrive afterward.

This hook does NOT imply the data has been persisted. Asynchronous transfers already submitted for this request may still be in flight. Managers that cascade to lower tiers should delay those tiers' on_request_finished() calls until no more lower-tier submit calls can be issued for this request.

Parameters:

req_context ¶
(ReqContext) –

per-request context.

Source code in vllm/v1/kv_offload/base.py

def on_request_finished(self, req_context: ReqContext) -> None:
    """
    Called when a request has finished.

    By the time this is called, the scheduler will issue no more
    submit-side calls for this request, such as prepare_store() and
    prepare_load(). Completion callbacks for already-submitted transfers
    (complete_store() and complete_load()) may still arrive afterward.

    This hook does NOT imply the data has been persisted. Asynchronous
    transfers already submitted for this request may still be in flight.
    Managers that cascade to lower tiers should delay those tiers'
    on_request_finished() calls until no more lower-tier submit calls can
    be issued for this request.

    Args:
        req_context: per-request context.
    """
    return

`on_schedule_end(context)` ¶

Called once at the end of each scheduler step.

Managers may override this to flush deferred work accumulated during the step (e.g., batched promotions).

Source code in vllm/v1/kv_offload/base.py

def on_schedule_end(self, context: ScheduleEndContext) -> None:
    """Called once at the end of each scheduler step.

    Managers may override this to flush deferred work accumulated
    during the step (e.g., batched promotions).
    """
    return

`prepare_load(keys, req_context)` `abstractmethod` ¶

Prepare the given blocks to be read. The given blocks will be protected from eviction until complete_load is called. It assumes all given blocks are offloaded.

Parameters:

keys ¶
(Collection[OffloadKey]) –

the keys identifying the blocks.
req_context ¶
(ReqContext) –

per-request context (e.g. kv_transfer_params).

Returns:

LoadStoreSpec –

A LoadStoreSpec that can be used by a worker to locate and load the actual offloaded KV data.

Source code in vllm/v1/kv_offload/base.py

@abstractmethod
def prepare_load(
    self,
    keys: Collection[OffloadKey],
    req_context: ReqContext,
) -> LoadStoreSpec:
    """
    Prepare the given blocks to be read.
    The given blocks will be protected from eviction until
    complete_load is called.
    It assumes all given blocks are offloaded.

    Args:
        keys: the keys identifying the blocks.
        req_context: per-request context (e.g. kv_transfer_params).

    Returns:
        A LoadStoreSpec that can be used by a worker to locate and load
        the actual offloaded KV data.
    """
    pass

`prepare_store(keys, req_context)` `abstractmethod` ¶

Prepare the given blocks to be offloaded. The given blocks will be protected from eviction until complete_store is called.

Parameters:

keys ¶
(Collection[OffloadKey]) –

the keys identifying the blocks.
req_context ¶
(ReqContext) –

per-request context (e.g. kv_transfer_params).

Returns:

PrepareStoreOutput | None –

A PrepareStoreOutput indicating which blocks need storing, where to store them (LoadStoreSpec), and the list of blocks that were evicted as a result. None is returned if the blocks cannot be stored.

Source code in vllm/v1/kv_offload/base.py

@abstractmethod
def prepare_store(
    self,
    keys: Collection[OffloadKey],
    req_context: ReqContext,
) -> PrepareStoreOutput | None:
    """
    Prepare the given blocks to be offloaded.
    The given blocks will be protected from eviction until
    complete_store is called.

    Args:
        keys: the keys identifying the blocks.
        req_context: per-request context (e.g. kv_transfer_params).

    Returns:
        A PrepareStoreOutput indicating which blocks need storing,
        where to store them (LoadStoreSpec), and list of blocks that
        were evicted as a result.
        None is returned if the blocks cannot be stored.
    """
    pass

`reset_cache()` ¶

Evict all tracked blocks and reset internal state.

Source code in vllm/v1/kv_offload/base.py

def reset_cache(self) -> None:
    """Evict all tracked blocks and reset internal state."""
    return

`shutdown()` ¶

Shut down the manager and release any resources.

Source code in vllm/v1/kv_offload/base.py

def shutdown(self) -> None:
    """Shutdown the manager and release any resources."""
    return

`take_events()` ¶

Take the offloading events from the manager.

A tier manager emits only events for storage state it owns. A composing manager may aggregate child event streams, but should not synthesize events on behalf of a child tier.

Yields:

Iterable[OffloadingEvent] –

New OffloadingEvents collected since the last call.

Source code in vllm/v1/kv_offload/base.py

def take_events(self) -> Iterable[OffloadingEvent]:
    """
    Take the offloading events from the manager.

    A tier manager emits only events for storage state it owns. A
    composing manager may aggregate child event streams, but should not
    synthesize events on behalf of a child tier.

    Yields:
        New OffloadingEvents collected since the last call.
    """
    return ()
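
The aggregation rule for composing managers might look like the sketch below, which chains child event streams without adding events of its own. `ToyTier` and `ToyComposite` are hypothetical classes, and plain strings stand in for `OffloadingEvent`, whose fields are not shown on this page.

```python
from collections.abc import Iterable
from itertools import chain


class ToyTier:
    """Hypothetical tier that buffers string events until they are taken."""

    def __init__(self) -> None:
        self._events: list[str] = []

    def record(self, event: str) -> None:
        self._events.append(event)

    def take_events(self) -> Iterable[str]:
        # Hand over the buffered events and start a fresh buffer.
        events, self._events = self._events, []
        return events


class ToyComposite:
    """Aggregates child event streams without synthesizing events itself."""

    def __init__(self, children: list[ToyTier]) -> None:
        self.children = children

    def take_events(self) -> Iterable[str]:
        return list(chain.from_iterable(c.take_events() for c in self.children))


cpu, disk = ToyTier(), ToyTier()
cpu.record("cpu: stored block 1")
disk.record("disk: evicted block 9")
composite = ToyComposite([cpu, disk])
first = list(composite.take_events())
second = list(composite.take_events())
```

Because each tier hands over its buffer when taken, the second call yields nothing, in line with "collected since the last call".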

`touch(keys, req_context)` ¶

Mark the given blocks as recently used. This could in practice mean moving them to the end of an LRU list.

Parameters:

keys ¶
(Collection[OffloadKey]) –

the keys identifying the blocks.
req_context ¶
(ReqContext) –

per-request context (e.g. kv_transfer_params).

Source code in vllm/v1/kv_offload/base.py

def touch(self, keys: Collection[OffloadKey], req_context: ReqContext):
    """
    Mark the given blocks as recently used.
    This could in practice mean moving them to the end of an LRU list.

    Args:
        keys: the keys identifying the blocks.
        req_context: per-request context (e.g. kv_transfer_params).
    """
    return

`OffloadingSpec` ¶

Bases: ABC

Spec for an offloading connector

Methods:

build_metric_definitions –

Return Prometheus metric definitions emitted by this spec.
get_manager –

Get an OffloadingManager that will be used by the scheduler-side offloading connector to track offloaded blocks and manage evictions.
get_worker –

Get an OffloadingWorker that handles async KV transfers for this spec.

Source code in vllm/v1/kv_offload/base.py

class OffloadingSpec(ABC):
    """Spec for an offloading connector"""

    @classmethod
    def build_metric_definitions(
        cls, extra_config: dict[str, Any]
    ) -> dict[str, "OffloadingMetricMetadata"]:
        """Return Prometheus metric definitions emitted by this spec."""
        return {}

    def __init__(self, config: OffloadingConfig):
        logger.warning(
            "Initializing OffloadingSpec. This API is experimental and "
            "subject to change in the future as we iterate the design."
        )
        self.config = config
        self.extra_config = config.extra_config
        self.kv_events_config = OffloadingKVEventsConfig(
            enable_kv_cache_events=config.enable_kv_cache_events,
            self_describing_kv_events=bool(
                self.extra_config.get("self_describing_kv_events", False)
            ),
        )

        # When True, only prompt (prefill) blocks are offloaded; decode-phase
        # blocks (KV generated after the prompt) are skipped. Useful when prior
        # turns' generated tokens are dropped before the next turn (e.g.
        # reasoning models that strip thinking).
        self.offload_prompt_only: bool = bool(
            self.extra_config.get("offload_prompt_only", True)
        )

        self.tokens_per_block = tuple(group.tokens_per_block for group in config.groups)
        self.tokens_per_hash = config.cache.tokens_per_hash
        self.blocks_per_chunk = config.cache.blocks_per_chunk

    @abstractmethod
    def get_manager(self) -> OffloadingManager:
        """
        Get an OffloadingManager that will be used
        by the scheduler-side offloading connector to track
        offloaded blocks and manage evictions.
        """
        pass

    @abstractmethod
    def get_worker(self, kv_caches: CanonicalKVCaches) -> OffloadingWorker:
        """
        Get an OffloadingWorker that handles async KV transfers for this spec.

        Args:
            kv_caches: Canonicalized KV caches.

        Returns:
            An OffloadingWorker instance for this medium.
        """
        pass
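
The constructor reads boolean options with the `bool(extra_config.get(name, default))` pattern. The short sketch below, using a hypothetical `read_flag` helper, shows how that pattern behaves in standard Python, including one caveat worth keeping in mind if a flag ever arrives as a string.

```python
def read_flag(extra_config: dict, name: str, default: bool) -> bool:
    """Mirror of the bool(extra_config.get(name, default)) pattern above."""
    return bool(extra_config.get(name, default))


# A missing key falls back to the default.
prompt_only_default = read_flag({}, "offload_prompt_only", True)
# A real boolean is passed through unchanged.
prompt_only_off = read_flag({"offload_prompt_only": False}, "offload_prompt_only", True)
# Caveat: any non-empty string is truthy in Python, including "false".
prompt_only_str = read_flag({"offload_prompt_only": "false"}, "offload_prompt_only", True)
```

Whether string values can actually reach `extra_config` depends on how the configuration is parsed upstream, which this page does not show.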

`build_metric_definitions(extra_config)` `classmethod` ¶

Return Prometheus metric definitions emitted by this spec.

Source code in vllm/v1/kv_offload/base.py

@classmethod
def build_metric_definitions(
    cls, extra_config: dict[str, Any]
) -> dict[str, "OffloadingMetricMetadata"]:
    """Return Prometheus metric definitions emitted by this spec."""
    return {}

`get_manager()` `abstractmethod` ¶

Get an OffloadingManager that will be used by the scheduler-side offloading connector to track offloaded blocks and manage evictions.

Source code in vllm/v1/kv_offload/base.py

@abstractmethod
def get_manager(self) -> OffloadingManager:
    """
    Get an OffloadingManager that will be used
    by the scheduler-side offloading connector to track
    offloaded blocks and manage evictions.
    """
    pass

`get_worker(kv_caches)` `abstractmethod` ¶

Get an OffloadingWorker that handles async KV transfers for this spec.

Parameters:

kv_caches ¶
(CanonicalKVCaches) –

Canonicalized KV caches.

Returns:

OffloadingWorker –

An OffloadingWorker instance for this medium.

Source code in vllm/v1/kv_offload/base.py

@abstractmethod
def get_worker(self, kv_caches: CanonicalKVCaches) -> OffloadingWorker:
    """
    Get an OffloadingWorker that handles async KV transfers for this spec.

    Args:
        kv_caches: Canonicalized KV caches.

    Returns:
        An OffloadingWorker instance for this medium.
    """
    pass

`OffloadingWorker` ¶

Bases: ABC

Runs in the worker process. Performs async KV transfers for ONE offloaded medium (e.g. CPU). Direction is explicit via submit_store / submit_load, so there is no (src_medium, dst_medium) routing.

Methods:

submit_load –

Async offloaded medium -> GPU.
submit_store –

Async GPU -> offloaded medium.

Source code in vllm/v1/kv_offload/base.py

class OffloadingWorker(ABC):
    """Runs in the worker process. Performs async KV transfers for ONE
    offloaded medium (e.g. CPU). Direction is explicit via submit_store /
    submit_load, so there is no (src_medium, dst_medium) routing."""

    @abstractmethod
    def submit_store(
        self, job_id: int, src_spec: GPULoadStoreSpec, dst_spec: LoadStoreSpec
    ) -> bool:
        """Async GPU -> offloaded medium."""

    @abstractmethod
    def submit_load(
        self, job_id: int, src_spec: LoadStoreSpec, dst_spec: GPULoadStoreSpec
    ) -> bool:
        """Async offloaded medium -> GPU."""

    @abstractmethod
    def get_finished(self) -> list[TransferResult]: ...

    @abstractmethod
    def wait(self, job_ids: set[int]) -> None: ...

    def shutdown(self) -> None:
        return
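
A toy worker can illustrate the direction-explicit interface. This sketch makes several simplifying assumptions: it completes transfers immediately rather than asynchronously, plain lists stand in for `GPULoadStoreSpec` and `LoadStoreSpec`, and `get_finished` returns hypothetical `(job_id, success)` tuples because the fields of `TransferResult` are not shown on this page.

```python
class ToyWorker:
    """Synchronous stand-in for an OffloadingWorker, backed by dicts."""

    def __init__(self, gpu: dict[int, bytes]) -> None:
        self.gpu = gpu                      # GPU block ID -> block bytes
        self.medium: dict[str, bytes] = {}  # offloaded slot -> block bytes
        self._finished: list[tuple[int, bool]] = []

    def submit_store(self, job_id: int, src: list[int], dst: list[str]) -> bool:
        # GPU -> offloaded medium.
        for block_id, slot in zip(src, dst):
            self.medium[slot] = self.gpu[block_id]
        self._finished.append((job_id, True))
        return True

    def submit_load(self, job_id: int, src: list[str], dst: list[int]) -> bool:
        # Offloaded medium -> GPU.
        for slot, block_id in zip(src, dst):
            self.gpu[block_id] = self.medium[slot]
        self._finished.append((job_id, True))
        return True

    def get_finished(self) -> list[tuple[int, bool]]:
        # Hypothetical result shape: (job_id, success) pairs, drained on read.
        finished, self._finished = self._finished, []
        return finished


worker = ToyWorker(gpu={0: b"kv-0", 1: b"kv-1"})
worker.submit_store(job_id=1, src=[0, 1], dst=["s0", "s1"])
worker.submit_load(job_id=2, src=["s1"], dst=[5])
finished = worker.get_finished()
```

Since the method name already fixes the direction, the source and destination arguments need no medium tag, which is the point the class docstring makes about routing.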

`submit_load(job_id, src_spec, dst_spec)` `abstractmethod` ¶

Async offloaded medium -> GPU.

Source code in vllm/v1/kv_offload/base.py

@abstractmethod
def submit_load(
    self, job_id: int, src_spec: LoadStoreSpec, dst_spec: GPULoadStoreSpec
) -> bool:
    """Async offloaded medium -> GPU."""

`submit_store(job_id, src_spec, dst_spec)` `abstractmethod` ¶

Async GPU -> offloaded medium.

Source code in vllm/v1/kv_offload/base.py

@abstractmethod
def submit_store(
    self, job_id: int, src_spec: GPULoadStoreSpec, dst_spec: LoadStoreSpec
) -> bool:
    """Async GPU -> offloaded medium."""

`ScheduleEndContext` ¶

Bases: NamedTuple

Per-step scheduling info passed to on_schedule_end().

Source code in vllm/v1/kv_offload/base.py

class ScheduleEndContext(NamedTuple):
    """Per-step scheduling info passed to on_schedule_end()."""

    # Request IDs scheduled for the first time this step.
    new_req_ids: Collection[str]
    # Request IDs preempted this step.
    preempted_req_ids: Collection[str]
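
As a sketch of how this context could be consumed, the snippet below pairs a local mirror of `ScheduleEndContext` with a hypothetical `DeferringManager` that batches work during a step and flushes it in `on_schedule_end()`. Dropping deferred work for preempted requests is an illustrative policy chosen for this example, not behavior documented by vLLM.

```python
from collections.abc import Collection
from typing import NamedTuple


class ScheduleEndContext(NamedTuple):
    # Local mirror of the NamedTuple above so the sketch runs without vLLM.
    new_req_ids: Collection[str]
    preempted_req_ids: Collection[str]


class DeferringManager:
    """Hypothetical manager that batches work until the step ends."""

    def __init__(self) -> None:
        self.deferred: list[str] = []
        self.flushed: list[str] = []

    def defer(self, req_id: str) -> None:
        self.deferred.append(req_id)

    def has_pending_work(self) -> bool:
        return bool(self.deferred)

    def on_schedule_end(self, context: ScheduleEndContext) -> None:
        # Drop deferred work for preempted requests, flush the rest.
        keep = [r for r in self.deferred if r not in context.preempted_req_ids]
        self.flushed.extend(keep)
        self.deferred.clear()


manager = DeferringManager()
manager.defer("req-1")
manager.defer("req-2")
pending_before = manager.has_pending_work()
manager.on_schedule_end(
    ScheduleEndContext(new_req_ids=["req-1"], preempted_req_ids=["req-2"])
)
```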

`get_offload_block_hash(key)` ¶

Extract the block hash from an OffloadKey.

Source code in vllm/v1/kv_offload/base.py

def get_offload_block_hash(key: OffloadKey) -> bytes:
    """Extract the block hash from an `OffloadKey`."""
    return key[:-4]

`get_offload_group_idx(key)` ¶

Extract the group index from an OffloadKey.

Source code in vllm/v1/kv_offload/base.py

def get_offload_group_idx(key: OffloadKey) -> int:
    """Extract the group index from an `OffloadKey`."""
    return int.from_bytes(key[-4:], "big", signed=False)

`make_offload_key(block_hash, group_idx)` ¶

Pack a block hash and group index into an OffloadKey.

Source code in vllm/v1/kv_offload/base.py

def make_offload_key(block_hash: bytes, group_idx: int) -> OffloadKey:
    """Pack a block hash and group index into an `OffloadKey`."""
    return OffloadKey(block_hash + group_idx.to_bytes(4, "big", signed=False))
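
The three helpers above form a round trip: the group index is appended to the block hash as a 4-byte big-endian unsigned integer and sliced off again on read. The sketch below copies the function bodies shown above so that it runs without vLLM; the `OffloadKey` definition as a `bytes`-based `NewType` is an assumption, since its declaration is not shown on this page.

```python
from typing import NewType

# Assumption: OffloadKey is a bytes-based NewType; its definition is not shown above.
OffloadKey = NewType("OffloadKey", bytes)


def make_offload_key(block_hash: bytes, group_idx: int) -> OffloadKey:
    # Append the group index as a 4-byte big-endian unsigned suffix.
    return OffloadKey(block_hash + group_idx.to_bytes(4, "big", signed=False))


def get_offload_block_hash(key: OffloadKey) -> bytes:
    # Everything except the 4-byte suffix is the block hash.
    return key[:-4]


def get_offload_group_idx(key: OffloadKey) -> int:
    # Decode the 4-byte suffix back into the group index.
    return int.from_bytes(key[-4:], "big", signed=False)


key = make_offload_key(b"\xaa\xbb\xcc", 3)
block_hash = get_offload_block_hash(key)
group_idx = get_offload_group_idx(key)
```

Because the suffix is a fixed four unsigned bytes, `int.to_bytes` raises `OverflowError` for a negative group index or one of 2**32 and above.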
