vllm.v1.kv_offload.tiering.base ¶
Abstract interfaces and data types for the secondary tiering layer.
Classes:
- JobMetadata – Metadata for an in-flight async transfer job.
- JobResult – Result of an async transfer job (successful or failed).
- SecondaryTierManager – Abstract interface for managing a single non-primary offloading tier.
JobMetadata dataclass ¶
Metadata for an in-flight async transfer job.
Source code in vllm/v1/kv_offload/tiering/base.py
JobResult dataclass ¶
Result of an async transfer job (successful or failed).
SecondaryTierManager ¶
Bases: ABC
Abstract interface for managing a single non-primary offloading tier.
Secondary tiers cannot directly access GPU memory. All data transfers must go through the CPU (primary) tier:
- Store: GPU → CPU (primary) → secondary (cascade)
- Load: secondary → CPU (primary) → GPU (promotion)
IMPORTANT: All methods run in the Scheduler process and must be lightweight and non-blocking. submit_load() and submit_store() submit async jobs; get_finished_jobs() polls for completion.
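The submit-then-poll contract can be sketched as follows. This is a minimal, self-contained illustration and not vLLM code: `ToyJobResult`, `ToyTier`, and the `copy_fn` callables are hypothetical stand-ins, and a `ThreadPoolExecutor` is only one way a tier might move the copy off the scheduler thread.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class ToyJobResult:
    """Hypothetical stand-in for JobResult; the real fields may differ."""
    job_id: int
    success: bool


class ToyTier:
    """Illustrative tier: submit returns immediately, polling collects results."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._inflight: dict[int, Future] = {}

    def submit(self, job_id: int, copy_fn) -> None:
        # Only enqueue here; the copy itself runs on the worker thread.
        self._inflight[job_id] = self._pool.submit(copy_fn)

    def get_finished_jobs(self) -> list[ToyJobResult]:
        # Non-blocking poll: report only jobs whose future is already done.
        done = [jid for jid, fut in self._inflight.items() if fut.done()]
        results = []
        for jid in done:
            fut = self._inflight.pop(jid)
            results.append(ToyJobResult(jid, fut.exception() is None))
        return results

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

Each finished job is reported exactly once, so a second poll with no new completions returns an empty list. A failed copy is reported as a result rather than raised into the caller.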
Methods:
- __init__ – Initialize the tier from its offloading spec, primary KV view, and tier type.
- build_metric_definitions – Return Prometheus metric definitions emitted by this tier.
- drain_jobs – Block until every submitted load/store job has completed or failed.
- get_finished_jobs – Return all jobs (loads and stores) that completed since the last call.
- get_stats – Return and reset metric observations collected by this tier.
- has_pending_work – Whether this tier needs the engine to keep stepping.
- lookup – Check whether a block exists in this secondary tier.
- on_new_request – Called when a new request is first seen by the scheduler.
- on_request_finished – Called when a request has finished.
- on_schedule_end – Called once at the end of each scheduler step.
- shutdown – Release resources held by this tier (threads, connections, etc.).
- submit_load – Submit an async job to load blocks from this secondary tier to the primary tier.
- submit_store – Submit an async job to store blocks from the primary tier to this secondary tier.
- touch – Mark blocks as recently used for eviction policy.
__init__(offloading_spec, primary_kv_view, tier_type) ¶
Parameters:
- offloading_spec (OffloadingSpec) – Offloading configuration.
- primary_kv_view (memoryview) – Memoryview of the primary tier's CPU KV cache.
- tier_type (str) – Tier type identifier, set by SecondaryTierFactory from the registered tier type.
build_metric_definitions(extra_config) classmethod ¶
Return Prometheus metric definitions emitted by this tier.
drain_jobs() abstractmethod ¶
Block until every submitted load/store job has completed or failed.
After this returns, no tier I/O is touching the primary memoryview, and every submitted job's result is available from get_finished_jobs() (yielded by a prior call or queued for the next one). Used by TieringOffloadingManager.reset_cache to release primary slots without racing with in-flight transfers.
Implementations must not abort a mid-flight transfer: a partial copy would corrupt either the primary memoryview or the secondary backing store. Queued (not-yet-started) transfers may be cancelled, but their failure result must still appear in get_finished_jobs().
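One way to satisfy these drain rules is sketched below. It is an assumption-laden toy, not the vLLM implementation: `DrainableTier` and its `(job_id, ok)` result tuples are hypothetical, and it relies on the fact that `Future.cancel()` only succeeds for work that has not started running.

```python
from concurrent.futures import Future, ThreadPoolExecutor, wait


class DrainableTier:
    """Illustrative only: shows drain semantics, not a real vLLM tier."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._inflight: dict[int, Future] = {}

    def submit(self, job_id: int, copy_fn) -> None:
        self._inflight[job_id] = self._pool.submit(copy_fn)

    def drain_jobs(self) -> None:
        # cancel() succeeds only for jobs that have not started, so a
        # running copy is never interrupted halfway.
        for fut in self._inflight.values():
            fut.cancel()
        # Block until every remaining job has finished, failed, or been
        # acknowledged as cancelled by the worker.
        wait(list(self._inflight.values()))

    def get_finished_jobs(self) -> list[tuple[int, bool]]:
        # Cancelled jobs are still reported, as failures.
        results = []
        for jid, fut in list(self._inflight.items()):
            if fut.done():
                ok = not fut.cancelled() and fut.exception() is None
                results.append((jid, ok))
                del self._inflight[jid]
        return results
```

After `drain_jobs()` returns, the next `get_finished_jobs()` call yields a result for every job that was submitted and not yet reported, including the cancelled ones.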
get_finished_jobs() abstractmethod ¶
Return all jobs (loads and stores) that completed since the last call.
The framework uses these results to release resources and finalize transfers.
Returns:
- Iterable[JobResult] – Iterable of JobResult objects for jobs finished since the last call.
get_stats() ¶
Return and reset metric observations collected by this tier.
has_pending_work() ¶
Whether this tier needs the engine to keep stepping.
While True, on_schedule_end() and get_finished_jobs() continue to be called even when no requests are scheduled.
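The effect of has_pending_work() can be illustrated with a toy loop. Everything here is hypothetical: `DeferredTier` and `step_until_idle` are invented for the sketch, and the order in which the real engine calls on_schedule_end() and get_finished_jobs() within a step is an assumption.

```python
class DeferredTier:
    """Toy tier that needs a few extra steps to flush deferred work."""

    def __init__(self, deferred: int) -> None:
        self._deferred = deferred
        self._finished: list[int] = []

    def has_pending_work(self) -> bool:
        return self._deferred > 0

    def on_schedule_end(self) -> None:
        # Flush one deferred item per scheduler step.
        if self._deferred:
            self._deferred -= 1
            self._finished.append(self._deferred)

    def get_finished_jobs(self) -> list[int]:
        # Hand over everything finished since the last call, then reset.
        done, self._finished = self._finished, []
        return done


def step_until_idle(tier: DeferredTier) -> list[int]:
    # Hypothetical engine loop running with no requests scheduled.
    collected = []
    while tier.has_pending_work():
        tier.on_schedule_end()
        collected.extend(tier.get_finished_jobs())
    return collected
```

Without the has_pending_work() check, a tier with deferred work and no active requests would never get another on_schedule_end() call to flush it.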
lookup(key, req_context) abstractmethod ¶
Check whether a block exists in this secondary tier.
Parameters:
- key (OffloadKey) – Offload key to look up.
- req_context (ReqContext) – Per-request context (e.g. kv_transfer_params).
Returns:
- LookupResult – HIT if the block is present and ready, MISS if not found, or RETRY if the block is being transferred (retry later).
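The three outcomes can be modeled with two sets, one for ready blocks and one for blocks still in transfer. This is a hedged sketch: `ToyLookupResult` and `ToyIndex` are hypothetical, plain strings stand in for OffloadKey, and the real LookupResult type may be defined differently.

```python
from enum import Enum, auto


class ToyLookupResult(Enum):
    """Hypothetical mirror of LookupResult's three outcomes."""
    HIT = auto()
    MISS = auto()
    RETRY = auto()


class ToyIndex:
    def __init__(self) -> None:
        self._ready: set[str] = set()      # blocks fully stored in this tier
        self._in_flight: set[str] = set()  # blocks with a transfer in progress

    def begin_store(self, key: str) -> None:
        self._in_flight.add(key)

    def finish_store(self, key: str) -> None:
        self._in_flight.discard(key)
        self._ready.add(key)

    def lookup(self, key: str) -> ToyLookupResult:
        # In-flight is checked first: a block still being written is not
        # safe to read yet, so the caller is told to retry later.
        if key in self._in_flight:
            return ToyLookupResult.RETRY
        if key in self._ready:
            return ToyLookupResult.HIT
        return ToyLookupResult.MISS
```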
on_new_request(req_context) abstractmethod ¶
Called when a new request is first seen by the scheduler.
Returns a RequestOffloadingContext expressing this tier's preference for how blocks should be offloaded for this request.
Parameters:
- req_context (ReqContext) – Per-request context.
on_request_finished(req_context) ¶
Called when a request has finished.
By the time this is called, all per-request calls for this request (submit_store, submit_load, touch) have already been issued, and none will follow. Note this does NOT imply the tier's transfers have completed: jobs already submitted may still be in flight and will report via get_finished_jobs(). This is the right place to release per-request bookkeeping.
Parameters:
- req_context (ReqContext) – Per-request context.
on_schedule_end() ¶
Called once at the end of each scheduler step.
Secondary tiers may override this for per-step cleanup or deferred work submission.
shutdown() ¶
Release resources held by this tier (threads, connections, etc.).
submit_load(job_metadata) abstractmethod ¶
Submit an async job to load blocks from this secondary tier to the primary tier.
This method must be lightweight and non-blocking: mark blocks as in-flight and submit the transfer, but do NOT perform the data copy on the calling thread.
Preconditions (guaranteed by the framework):
- job_metadata.block_ids are allocated primary-tier slots ready to receive data.
The implementation must copy data from this tier into the primary-tier slots identified by block_ids.
Report completion via get_finished_jobs().
Parameters:
- job_metadata (JobMetadata) – Job metadata including job_id, keys, and block_ids identifying the primary-tier slots to write into.
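A load of this shape might look like the sketch below. It is illustrative only: the fixed `BLOCK_BYTES` slot size, the dict-backed `store`, and both function signatures are assumptions made for the example, and the real layout of the primary memoryview is not specified here.

```python
from concurrent.futures import Future, ThreadPoolExecutor

BLOCK_BYTES = 4  # assumed fixed slot size, for illustration only


def copy_into_slots(primary: memoryview, block_ids: list[int],
                    payloads: list[bytes]) -> None:
    # Runs on the worker thread: write each payload into its primary slot.
    for block_id, payload in zip(block_ids, payloads):
        start = block_id * BLOCK_BYTES
        primary[start:start + BLOCK_BYTES] = payload


def submit_load(pool: ThreadPoolExecutor, primary: memoryview,
                store: dict[str, bytes], keys: list[str],
                block_ids: list[int]) -> Future:
    # Calling thread: resolve payloads and enqueue; no copy happens here.
    payloads = [store[k] for k in keys]
    return pool.submit(copy_into_slots, primary, block_ids, payloads)
```

The i-th key is written into the slot named by the i-th block ID, so the framework decides where data lands and the tier only fills the slots it was given.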
submit_store(job_metadata) abstractmethod ¶
Submit an async job to store blocks from the primary tier to this secondary tier.
This method must be lightweight and non-blocking: allocate metadata and submit the transfer, but do NOT perform the data copy on the calling thread.
Preconditions (guaranteed by the framework):
- job_metadata.block_ids are valid primary-tier slots, pinned (ref-counted) for the duration of the transfer.
The implementation is responsible for:
- Filtering out blocks already present in this tier
- Evicting blocks if capacity is needed
- Allocating space in this tier
- Submitting the async transfer (read from primary via block_ids)
Report completion via get_finished_jobs().
Parameters:
- job_metadata (JobMetadata) – Job metadata including job_id, keys, and block_ids identifying the primary-tier slots to read from.
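The four responsibilities listed above can be sketched in order. This is a deliberately simplified toy: `ToyStoreTier` is hypothetical, a dict stands in for the primary tier, eviction is assumed to be least-recently-inserted, and locking, in-flight tracking, and jobs larger than the tier's capacity are all left out.

```python
from collections import OrderedDict
from concurrent.futures import Future, ThreadPoolExecutor


class ToyStoreTier:
    """Illustrative capacity-bounded tier keyed by block key."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks: OrderedDict = OrderedDict()  # key -> stored bytes
        self._pool = ThreadPoolExecutor(max_workers=1)

    def submit_store(self, primary: dict[int, bytes], keys: list[str],
                     block_ids: list[int]) -> Future:
        # 1. Filter out blocks already present in this tier.
        todo = [(k, b) for k, b in zip(keys, block_ids) if k not in self.blocks]
        # 2. Evict the oldest blocks if capacity is needed.
        while self.blocks and len(self.blocks) + len(todo) > self.capacity:
            self.blocks.popitem(last=False)
        # 3. Allocate space by reserving entries before any data arrives.
        for k, _ in todo:
            self.blocks[k] = b""
        # 4. Submit the async transfer; the copy runs off the calling thread.
        return self._pool.submit(self._copy, primary, todo)

    def _copy(self, primary: dict[int, bytes], todo: list) -> None:
        # Worker thread: read each primary slot and fill the reserved entry.
        for k, block_id in todo:
            self.blocks[k] = bytes(primary[block_id])
```

Steps 1 to 3 touch only metadata, which keeps the call cheap on the scheduler thread. Only step 4 involves block data, and it is deferred to the worker.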
touch(keys, req_context) ¶
Mark blocks as recently used for eviction policy.
Parameters:
- keys (Collection[OffloadKey]) – Offload keys to mark as recently used.
- req_context (ReqContext) – Per-request context.
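If a tier uses an LRU eviction policy (an assumption; the interface only says "recently used"), touch could be backed by an ordered map, as in this minimal sketch. `LruTracker` is hypothetical and uses plain strings in place of OffloadKey.

```python
from collections import OrderedDict


class LruTracker:
    """Illustrative recency tracker; oldest entries sit at the front."""

    def __init__(self) -> None:
        self._order: OrderedDict = OrderedDict()

    def add(self, key: str) -> None:
        self._order[key] = None
        self._order.move_to_end(key)

    def touch(self, keys) -> None:
        # Move each known key to the most-recently-used end.
        # Unknown keys are ignored rather than inserted.
        for key in keys:
            if key in self._order:
                self._order.move_to_end(key)

    def evict_one(self) -> str:
        # Pop from the least-recently-used end.
        key, _ = self._order.popitem(last=False)
        return key
```

With keys added in the order a, b, c, touching a makes b the next eviction candidate.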