NIXL push-mode KV transfer¶
The default NIXL connector is pull-based: the decode (D) instance reads KV blocks from the prefill (P) instance via NIXL READ after prefill completes. NixlPushConnector adds a push-based alternative in which P writes the KV blocks directly into D's pre-allocated memory via NIXL WRITE.
This document describes the threading, queues, and scheduling interactions specific to the push design. The pull-mode design is unchanged; the push connector reuses the same handshake, NIXL agent setup, and metadata path wherever possible.
High-level flow¶
sequenceDiagram
autonumber
participant Client
participant Proxy
participant DSched as D Scheduler
participant DWorker as D Worker (main)
participant DWriter as D Writer
participant PWriter as P Writer
participant PWorker as P Worker (main)
participant PSched as P Scheduler
Client->>Proxy: POST /v1/completions
Proxy->>PSched: prefill leg (do_remote_decode=True, max_tokens=1)
Proxy->>DSched: decode leg (do_remote_prefill=True, P coordinates)
note over DSched,DWriter: D side - register blocks with P
DSched->>DSched: update_state_after_alloc, stash registration, arm watchdog
DSched->>DWorker: build_connector_meta -> meta.push_registrations
DWorker->>DWriter: enqueue (req_id, reg_data) on _reg_send_inbox
DWriter->>PWriter: NIXL send_notif PUSH_REG msgpack
note over PSched,PWriter: P side - prefill, stage finished blocks
PSched->>PSched: request_finished, stash blocks
PSched->>PWorker: build_connector_meta -> meta.push_finished_blocks
PWorker->>PWriter: enqueue (req_id, blocks) on _finished_blocks_inbox
note over PWriter: P writer matches and WRITEs
PWriter->>PWriter: get_new_notifs returns PUSH_REG, route via _handle_push_reg_notif
alt PUSH_REG and finished blocks both present
PWriter->>PWriter: pop matching pair, fire WRITE
else only one side present
PWriter->>PWriter: stash and wait, self-poll only when blocks unmatched
end
PWriter->>PWriter: ensure D handshake (one-time)
PWriter->>DWriter: NIXL WRITE direct to D GPU + completion notif
note over DWorker,DWriter: D side - completion accounting
DWriter-->>DWorker: forward HB and completion notifs via _pending_completion_notifs
DWorker->>DWorker: _get_new_notifs drains, HB extends lease, completion marks recv done
DWorker->>DSched: update_connector_output(finished_recving)
DSched->>DSched: clear watchdog deadline
note over PWorker,PWriter: P side - reclaim
PWorker->>PWorker: get_finished, drain _sending_transfers, queue eviction
PWriter->>PWriter: drain _evict_finished_inbox, drop stale state
PWorker->>PSched: update_connector_output(finished_sending)
PSched->>PSched: free lease
DWorker-->>Proxy: stream decode tokens
Proxy-->>Client: response Threads¶
NixlPushConnectorWorker introduces a single dedicated background thread per worker (i.e. per TP rank), named nixl-push-writer. Each owns the new push-specific NIXL operations on its rank:
nixl_wrapper.get_new_notifs()— receive notifications.nixl_wrapper.send_notif(...)for thePUSH_REG:<msgpack>(D side) and for the per-WRITE completion notif (P side).nixl_wrapper.make_prepped_xfer(...) / transfer(...)— submit the WRITE itself.
Heartbeats continue to go out from the engine main thread via the existing base-worker _send_heartbeats plumbing inside start_load_kv.
Wake model¶
The writer thread blocks on _push_writer_wake (a threading.Event) when it has no work. Three callers set the event:
start_load_kv(worker main thread, called once per engine step with the scheduler's metadata) — sets the wake only when the step actually hands the writer new work, i.e. whenmeta.push_registrationsormeta.push_finished_blocksis non-empty. This is the wake for new transfers.get_finished(worker main thread, called once per engine step to report completions) — always sets the wake. The writer is the sole consumer ofnixl_wrapper.get_new_notifs()for push, so this gives it a chance to drain inbound notifs (heartbeats from D, completion notifs after a WRITE, late-arrivingPUSH_REG) even when there is no new metadata to act on.- Handshake-completion callback (background handshake executor thread) — when a deferred D→P handshake finishes successfully, the future's done-callback re-enqueues the registration onto
_reg_send_inboxand sets the wake so the correspondingsend_notifruns on the writer (we never callsend_notiffrom the executor thread). On this second pass_ensure_handshakereturnsNone(the agent is now connected), so the writer sends thePUSH_REGdirectly. If the handshake failed, the callback fails the request instead of re-enqueuing, so there is no retry loop.
In addition to event-driven wakes, the writer self-polls at _PUSH_WRITER_POLL_INTERVAL_MS = 1.0 ms while there are P-side finished blocks waiting for an unmatched PUSH_REG.
When a request completes on P (lease expires or the WRITE finishes), get_finished enqueues the request id onto _evict_finished_inbox, which the writer drains to drop stale _push_finished_blocks / _pending_d_registrations and stop self-polling.
Writer-local matching tables¶
| Table | Owner | Holds |
|---|---|---|
_pending_d_registrations | writer | D registrations received from a remote D, waiting for P's blocks |
_push_finished_blocks | writer | P blocks staged by the scheduler, waiting for a remote D registration |
Either side can arrive first. The writer matches in both directions: when a PUSH_REG arrives we look up _push_finished_blocks, and when finished blocks arrive we look up _pending_d_registrations. Both lookups try an exact request_id match first, then fall back to comparing the ids after stripping the trailing per-engine random suffix (via get_base_request_id). The fallback exists because the proxy hands the same X-Request-Id to both legs, so P and D wrap it into the same cmpl-<uuid>-<index> form and differ only by the 8-hex randomization suffix that input_processor.assign_request_id appends per engine. Stripping just that suffix normalizes both sides to the same id while preserving the completion index (so multi-prompt sub-requests stay distinct). It also works whether or not VLLM_DISABLE_REQUEST_ID_RANDOMIZATION is set, which matters since that env var is slated for removal upstream.
Wire format¶
A push registration is sent as a NIXL notification:
Fields in the dict:
| Field | Set by | Meaning |
|---|---|---|
request_id | D | D's own vLLM request id; P's match key, echoed in the completion notif |
decode_engine_id | D | D's engine id (P uses this for the reverse handshake) |
decode_host | D | D's NIXL side-channel host |
decode_port | D | D's NIXL side-channel port |
decode_tp_size | D | D's tensor-parallel size |
local_block_ids | D | per-group lists of D's logical block ids (preallocated) |
remote_engine_id | D | P's engine id (for the existing P-side handshake) |
remote_host | D | P's NIXL side-channel host |
remote_port | D | P's NIXL side-channel port |
remote_tp_size | D | P's tensor-parallel size |
D ships logical block ids; P expands them to physical block ids at WRITE-submission time using the ratio learned during the NIXL handshake (remote_physical_blocks_per_logical). This matches the pull-mode contract — schedulers ship logical ids, workers expand to physical at submission.
The completion notif sent from P to D after a WRITE is the existing <request_id>:<tp_size> format used in pull mode (here request_id is D's own request id, taken from the registration), so the D-side accounting code is unchanged.
Scheduler-side responsibilities¶
NixlPushConnectorScheduler extends the base scheduler with:
- D side —
update_state_after_allocstashes registration data in_push_pending_registrationsand arms a soft watchdog (_push_registration_deadlines).build_connector_metadrains the stash intometa.push_registrationsand any expired entries are dropped with a warning. - P side —
request_finishedstashes block IDs in_finished_request_blocks(for the lease and forhas_pending_push_work) and_newly_finished_push_blocks(for the next worker step viameta.push_finished_blocks). - Both sides —
has_pending_push_workkeeps the engine main loop stepping while there is in-flight push state, so the writer always gets at least one wake per step.
update_connector_output:
finished_sending(P side) clears the lease entry.finished_recving(D side) clears the watchdog deadline.
Timeouts and watchdogs¶
Two per-request timers are armed on the scheduler:
- D-side registration watchdog —
_push_registration_deadlines. If a registered request does not see a push completion withinpush_registration_timeoutseconds (defaults todecoder_kv_blocks_ttl),build_connector_metadrops the stale registration and the pending entry, logs a warning, and stops trying to resend the registration. The corresponding request remains tracked in_reqs_need_recv; it is the engine's request-level abort path (or the user / proxy timing out the HTTP call) that ultimately fails the request. - P-side block lease — same
_kv_lease_durationused by pull mode.request_finishedsets the expiration in_reqs_need_sendandupdate_connector_output(finished_sending=...)clears it on successful WRITE. Stale leases are reaped byget_finishedin the base worker, which then enqueues the eviction onto_evict_finished_inboxso the writer also stops self-polling.
Failure handling¶
- D-side handshake failure (P→D handshake before sending PUSH_REG) — the future's done-callback calls
_handle_failed_transfer(rid, None), which marks D's pre-allocated blocks invalid and enqueues onto_failed_recv_reqsso the nextget_finishedreports the request as a failed recv. Same recv-side accounting as pull mode. - D-side
send_notiffailure when shipping the PUSH_REG to P — identical handling:_handle_failed_transfermarks the recv as failed. - P-side WRITE submission failure — the WRITE handle (if any) is released and
xfer_stats.record_failed_transfer()bumps the failure counter. We deliberately do not call_handle_failed_transferhere:req_idon the P side has no entry in_recving_metadata(P is not the receiver), so the helper would put a P-local request id into_failed_recv_reqsand trip the assertion in the base worker'sget_finished. The outbound WRITE is dropped on the floor; D's lease watchdog handles the missing completion.
Summary¶
The push design is a small, well-contained extension on top of the existing NIXL connector:
- one new connector class, one new scheduler class, one new worker class — all subclasses of the existing base classes;
- one dedicated background thread per worker;
- a few cross-thread queues, each with a single consumer (the writer); most have one producer, except
_reg_send_inbox, which is fed both by the engine main thread (new registrations) and by the handshake-completion callback (registrations replayed after their D→P handshake finishes); - one new notification type (
PUSH_REG:<msgpack>).
Behavior on the engine main thread is otherwise unchanged. The writer thread is event-driven and idle when there is no push work.