MooncakeTransferEngineConnector¶

When to Use¶

Best for high-performance multi-node data transfer between stages using Mooncake Transfer Engine. Supports both RDMA and TCP protocols with a managed memory pool, zero-copy deserialization, and optional GPUDirect RDMA. Applicable to any inter-stage data (KV caches, request payloads, etc.), not limited to KV cache transfer.

Compared to MooncakeStoreConnector (TCP key-value store), this connector transfers data roughly 58x faster in the benchmark under Performance (~14 ms versus ~810 ms for a ~186 MB KV cache) by using RDMA direct memory access.

Installation¶

pip install mooncake-transfer-engine

Ensure RDMA drivers are installed on all nodes (e.g., Mellanox OFED for InfiniBand/RoCE NICs).

Configuration¶

Define the connector in runtime:

runtime:
  connectors:
    rdma_connector:
      name: MooncakeTransferEngineConnector
      extra:
        host: "auto"                  # Auto-detect local RDMA IP
        zmq_port: 50051               # ZMQ base port (see "Port Offset Scheme" below)
        protocol: "rdma"              # "rdma" or "tcp"
        device_name: ""               # RDMA device (e.g., "mlx5_0"), empty for auto-detect
        memory_pool_size: 4294967296  # 4 GB (CPU); use 2147483648 (2 GB) for GPU
        memory_pool_device: "cpu"     # "cpu" for pinned memory (recommended), "cuda" for GPUDirect RDMA

Wire stages to the connector:

stage_args:
  - stage_id: 0
    output_connectors:
      to_stage_1: rdma_connector

  - stage_id: 1
    input_connectors:
      from_stage_0: rdma_connector

Parameters¶

Required¶

Parameter	Description
`role`	Internal, do not set manually. Auto-injected by the orchestration layer (`"sender"` for `output_connectors`, `"receiver"` for `input_connectors`). Defaults to `"sender"` if omitted.
`host`	Local IP address for RDMA. `"auto"` detects from network interfaces.
`protocol`	Transport protocol: `"rdma"` (InfiniBand/RoCE) or `"tcp"`.

Memory Pool¶

Parameter	Default	Description
`memory_pool_size`	4 GB (CPU) / 2 GB (GPU)	Total size of the RDMA-registered memory pool in bytes. Recommended 4 GB for CPU pinned memory; 2 GB for GPU VRAM to conserve device memory.
`memory_pool_device`	`"cpu"`	`"cpu"`: pinned host memory (recommended, works on all topologies). `"cuda"`: GPU VRAM for GPUDirect RDMA (requires NIC-GPU direct PCIe connectivity, PIX topology).

Networking¶

Parameter	Default	Description
`zmq_port`	50051	ZMQ base port. The orchestration layer derives the actual port from this base by adding the purpose, stage, DP, and TP offsets (see "ZMQ Port Offset Scheme" below). Users only set this base value.
`sender_host`	`None`	Internal. Receiver-side only — dynamically resolved via `update_sender_info()`. Not needed in YAML.
`sender_zmq_port`	`None`	Internal. Receiver-side only — defaults to the sender's adjusted port. Not needed in YAML.
`device_name`	`""`	RDMA device name (e.g., `"mlx5_0"`). Empty for auto-detect. Can also be set via `RDMA_DEVICE_NAME` env var.

ZMQ Port Offset Scheme¶

To avoid port conflicts when multiple edges, purposes, DP replicas, or TP ranks share the same node, the actual ZMQ port is computed as:

side_channel_port = zmq_port + purpose_offset + stage_offset + dp_index * tp_size
sender_listen     = side_channel_port + tp_rank
receiver_connect  = remote_side_channel_port + tp_rank

Component	Value	Description
`zmq_port`	50051 (default)	Base port from YAML config
`purpose_offset`	`request_forwarding` = 0, `kv_transfer` = 100	Separates control-plane vs KV-cache connections
`stage_offset`	`int(from_stage)` (0, 1, 2...)	Separates edges from different source stages
`dp_index * tp_size`	e.g., `dp_index` 1 × `tp_size` 2 = 2	Each DP replica reserves a port range of size `tp_size` (following vLLM convention: `VLLM_MOONCAKE_BOOTSTRAP_PORT + dp_index * tp_size`)
`tp_rank`	0, 1, 2...	Each TP rank within a DP replica uses its own port
orchestrator	+200	Extra offset when caller is the orchestrator (avoids collision with stage workers on the same node)

Example (base=50051, stage 0→1, DP=2, TP=2, kv_transfer):

Caller	DP	TP rank	Port
Stage worker	DP0	rank 0	`50051 + 100 + 0 + 0×2 + 0 = 50151`
Stage worker	DP0	rank 1	`50051 + 100 + 0 + 0×2 + 1 = 50152`
Stage worker	DP1	rank 0	`50051 + 100 + 0 + 1×2 + 0 = 50153`
Stage worker	DP1	rank 1	`50051 + 100 + 0 + 1×2 + 1 = 50154`
Orchestrator	—	—	`50051 + 200 + 0 = 50251`
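
The stage-worker rows above can be reproduced with a short calculation. This is a minimal sketch of the formula as written in this section; the function names and the `PURPOSE_OFFSETS` table are illustrative and not taken from the connector source, and the orchestrator offset is left out.

```python
# Hypothetical helpers mirroring the documented port formula.
PURPOSE_OFFSETS = {"request_forwarding": 0, "kv_transfer": 100}


def side_channel_port(zmq_port, purpose, from_stage, dp_index, tp_size):
    """Base side-channel port for one DP replica on one edge."""
    return zmq_port + PURPOSE_OFFSETS[purpose] + int(from_stage) + dp_index * tp_size


def sender_listen_port(zmq_port, purpose, from_stage, dp_index, tp_size, tp_rank):
    """Port on which a given TP rank of the sender listens."""
    return side_channel_port(zmq_port, purpose, from_stage, dp_index, tp_size) + tp_rank


# base=50051, stage 0 -> 1, DP=2, TP=2, kv_transfer
ports = [
    sender_listen_port(50051, "kv_transfer", 0, dp, 2, tp)
    for dp in range(2)
    for tp in range(2)
]
print(ports)  # [50151, 50152, 50153, 50154]
```

When choosing a base port, keep the whole derived range free on every node, not only the base value.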

Memory Pool Modes¶

Mode	Config	Recommended Pool Size	Data Flow	Best For
CPU Pinned	`memory_pool_device: "cpu"`	4 GB	GPU → CPU pool → RDMA → CPU pool → GPU	Most hardware topologies (recommended)
GPUDirect	`memory_pool_device: "cuda"`	2 GB	GPU → GPU pool → RDMA (NIC reads GPU BAR1) → GPU pool	NIC-GPU direct PCIe (PIX topology)

Note: GPUDirect RDMA requires the NIC and GPU to share a direct PCIe switch (PIX topology). On systems where they are connected via PXB or NODE, CPU pinned memory is faster due to GPU BAR1 bandwidth limitations.

Environment Variables¶

Variable	Description
`RDMA_DEVICE_NAME`	Override RDMA device name (e.g., `mlx5_0`).
`MC_IB_PCI_RELAXED_ORDERING`	Set to `1` to enable PCIe relaxed ordering for GPUDirect.

Docker / Container Setup¶

RDMA requires host-level device access:

docker run -it \
    --cap-add=SYS_PTRACE \
    --cap-add=IPC_LOCK \
    --security-opt seccomp=unconfined \
    --network=host \
    --device=/dev/infiniband \
    -v /sys/class/infiniband:/sys/class/infiniband:ro \
    your-image:tag

Performance¶

Benchmark results on H800 GPUs with mlx5_0 RDMA NIC (~186 MB KV cache):

Metric	MooncakeStoreConnector	MooncakeTransferEngineConnector (CPU)
KV transfer wall time	~810 ms	~14 ms
RDMA throughput	N/A (TCP)	~22 GB/s
Speedup	1x	58x

Troubleshooting¶

Quick Diagnostics¶

# 1. Check RDMA devices and link status
ibdev2netdev
# Expected: "mlx5_X port 1 ==> <iface> (Up)"
# RoCE devices map to Ethernet interfaces (e.g., enp75s0f0)
# IB devices map to ib0, ib1, etc.

# 2. Check InfiniBand device details
ibstat

# 3. Verify /dev/infiniband is accessible (required in containers)
ls /dev/infiniband/

# 4. Check Mooncake installation
python -c "from mooncake.engine import TransferEngine; print('OK')"

# 5. Check environment variables
echo "RDMA_DEVICE_NAME=${RDMA_DEVICE_NAME:-<not set>}"
echo "MC_IB_PCI_RELAXED_ORDERING=${MC_IB_PCI_RELAXED_ORDERING:-<not set>}"

Common Issues¶

Symptom	Cause	Fix
`Failed to modify QP to RTR`	Cross-NIC QP handshake failure (multi-NIC DGX)	Set `device_name` to a single RoCE NIC (e.g., `mlx5_2`) or set `RDMA_DEVICE_NAME` env var
`transport retry counter exceeded`	RDMA path between incompatible NICs	Same as above — restrict to one NIC
`zmq.error.Again: Resource temporarily unavailable`	ZMQ recv timeout (transfer took too long)	Check NIC selection; larger payloads may need a longer timeout
`Mooncake Engine initialization failed`	Missing RDMA drivers or `/dev/infiniband`	Install Mellanox OFED; in Docker add `--device=/dev/infiniband`
`MemoryError` in allocator	Memory pool too small for payload	Increase `memory_pool_size`
GPU transfer slower than CPU	GPU BAR1 bandwidth limitation (PXB/NODE topology)	Use `memory_pool_device: "cpu"` instead of `"cuda"`

Multi-NIC Environments (DGX)¶

On DGX machines with 12+ RDMA NICs, only RoCE NICs (with a bound network interface) reliably support loopback. IB-only NICs may fail cross-NIC QP handshakes. To identify RoCE NICs:

ibdev2netdev | grep -v "ib[0-9]"
# RoCE devices show Ethernet interface names like enp75s0f0

Then configure the connector:

device_name: "mlx5_2"  # or set RDMA_DEVICE_NAME=mlx5_2

See the RDMA Test README in tests/distributed/omni_connectors/README.md for test-specific setup instructions.

For more details on the underlying engine, refer to the Mooncake repository.

Design¶

1. Overview¶

MooncakeTransferEngineConnector is the high-performance remote connector in vllm_omni/distributed/omni_connectors. It is built on top of Mooncake TransferEngine and combines:

a direct data plane for remote memory writes
a ZMQ side channel for metadata lookup, handshake, and completion signaling
a managed local memory pool for both send and receive buffers

Unlike MooncakeStoreConnector, which treats the backend as a distributed store, MooncakeTransferEngineConnector is designed as a peer-to-peer transport. Its goal is to move large stage payloads efficiently while still fitting the common put() / get() API defined by OmniConnectorBase.

It is the most performance-oriented connector in the current OmniConnector family and is intended for large remote payloads such as:

KV cache transfer
stage hidden-state payloads
streaming chunk payloads
other binary-heavy inter-stage artifacts

2. Relationship with the OmniConnector System¶

MooncakeTransferEngineConnector implements the same connector contract as the other backends:

put(from_stage, to_stage, put_key, data)
get(from_stage, to_stage, get_key, metadata=None)
cleanup(request_id, ...)
health()
close()

It is integrated into the system through the standard connector plumbing:

OmniConnectorFactory constructs the connector from ConnectorSpec
load_omni_transfer_config() resolves the edge-level connector configuration
get_connectors_config_for_stage() and resolve_omni_kv_config_for_stage() inject the connector role
All callers (batch forwarding, chunk transfer, KV transfer, etc.) interact with it through the same put() / get() contract

The key system-level distinction is that this connector is role-aware:

sender instances expose data and listen for pull requests
receiver instances allocate buffers and actively pull data from the sender

3. Design Goals¶

The connector is designed around four primary goals:

High-throughput remote transfer: avoid store-mediated round trips and write directly into the receiver memory region.
Fast path for raw payloads: support torch.Tensor, bytes, and ManagedBuffer without forcing all traffic through full object serialization.
Unified connector abstraction: preserve the same put() / get() interface used by the rest of the OmniConnector stack.
Safe lifecycle management: manage allocation, reuse, cleanup, and failure recovery for a registered memory pool.

4. Architecture Overview¶

At a high level, the connector is composed of four main subsystems:

classDiagram
    class OmniConnectorBase {
        <<abstract>>
        +put(from_stage, to_stage, put_key, data)
        +get(from_stage, to_stage, get_key, metadata)
        +cleanup(request_id)
        +health()
        +close()
    }

    class MooncakeTransferEngineConnector {
        +supports_raw_data: bool
        -engine: TransferEngine
        -allocator: BufferAllocator
        -pool: torch.Tensor
        -zmq_ctx: zmq.Context
        -_local_buffers: dict
        -_sender_executor: ThreadPoolExecutor
        -_listener_thread: threading.Thread
        +put(...)
        +get(...)
        +update_sender_info(sender_host, sender_zmq_port)
        +get_connection_info()
        +cleanup(request_id, from_stage, to_stage)
        +close()
    }

    class BufferAllocator {
        -total_size: int
        -alignment: int
        -free_blocks: list
        +alloc(size) int
        +free(offset, size)
    }

    class ManagedBuffer {
        -allocator: BufferAllocator
        -offset: int
        -size: int
        -pool_tensor: torch.Tensor
        +tensor
        +as_tensor(dtype, shape) torch.Tensor
        +to_bytes() bytes
        +release()
    }

    class TransferEngine {
        +initialize(host, handshake, protocol, device_name)
        +register_memory(base_ptr, size)
        +batch_transfer_sync_write(remote_session, src_addrs, dst_addrs, lengths)
        +unregister_memory(base_ptr)
        +get_rpc_port() int
    }

    class QueryRequest {
        +request_id: str
    }

    class QueryResponse {
        +request_id: str
        +data_size: int
        +is_fast_path: bool
    }

    class MooncakeAgentMetadata {
        +remote_hostname: str
        +remote_port: int
        +request_id: str
        +dst_addrs: list[int]
        +lengths: list[int]
    }

    OmniConnectorBase <|-- MooncakeTransferEngineConnector
    MooncakeTransferEngineConnector *-- BufferAllocator
    MooncakeTransferEngineConnector *-- TransferEngine
    ManagedBuffer --> BufferAllocator : releases to
    ManagedBuffer --> "1" torch.Tensor : views
    MooncakeTransferEngineConnector ..> ManagedBuffer : returns / retains
    MooncakeTransferEngineConnector ..> QueryRequest : decodes
    MooncakeTransferEngineConnector ..> QueryResponse : encodes
    MooncakeTransferEngineConnector ..> MooncakeAgentMetadata : exchanges

4.1 Transfer Engine¶

Mooncake TransferEngine is responsible for the actual data-plane transfer. It registers local memory and performs synchronous remote writes through:

batch_transfer_sync_write(...)

4.2 Managed Memory Pool¶

Each connector instance owns a large pre-registered memory pool:

CPU pinned memory when memory_pool_device == "cpu"
GPU memory when memory_pool_device == "cuda"

This avoids repeated memory registration and allows each transfer to allocate subranges from one long-lived pool.

4.3 Buffer Manager¶

Two helper classes control local memory ownership:

BufferAllocator: manages aligned subrange allocation and free-list merging.
ManagedBuffer: represents one live slice of the pool and exposes:
.tensor
.as_tensor(dtype, shape)
.to_bytes()
.release()

4.4 ZMQ Side Channel¶

ZMQ is used for transport coordination, not for the data payload itself. It handles:

metadata query from receiver to sender
pull request submission
completion or error signaling
internal notification from worker threads back to the listener thread

This split makes the control plane lightweight while keeping the bulk payload on the transfer engine data plane.

5. Role Model¶

5.1 Sender Role¶

A sender connector:

accepts put() calls
stores live transfer-ready buffers in _local_buffers
starts a ZMQ listener thread
responds to metadata queries and pull requests from receivers

5.2 Receiver Role¶

A receiver connector:

does not bind the sender-side ZMQ listener
accepts get() calls
allocates receive buffers from its own pool
requests metadata or transfer service from the sender

The role is not inferred dynamically. It is injected by the stage configuration layer:

incoming edge for a stage -> role="receiver"
outgoing edge for a stage -> role="sender"

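This is important because the two roles initialize differently: a sender binds the ZMQ listener and a receiver does not, so an incorrect role assignment would break initialization semantics.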

5.3 Host Auto-Detection¶

The host configuration field supports the special value "auto". When set, the connector auto-detects the local IP address that would be used for external communication (via a UDP socket probe to 8.8.8.8). If that fails, it falls back to hostname resolution, and ultimately to 127.0.0.1.

This is useful in environments where the operator does not want to hard-code IP addresses in the connector config.
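
The fallback chain described above can be sketched with the standard `socket` module. This is an illustration only, not the connector's actual code: the function name `detect_local_ip` and the probe port 80 are assumptions.

```python
import socket


def detect_local_ip(probe_addr=("8.8.8.8", 80)):
    """Best-effort local IP detection: UDP probe, then hostname, then loopback."""
    try:
        # connect() on a UDP socket sends no packet; it only asks the OS
        # which local address would be used to reach probe_addr.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.connect(probe_addr)
            return sock.getsockname()[0]
    except OSError:
        pass
    try:
        # Fallback: resolve the machine's own hostname.
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        # Last resort: loopback.
        return "127.0.0.1"


print(detect_local_ip())
```

On multi-homed hosts the probe returns the address of the default route, which may not be the RDMA-capable interface; in that case setting `host` explicitly is the safer choice.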

5.4 RDMA Device Filtering¶

The device_name configuration field allows the operator to specify which RDMA NICs to use (comma-separated, e.g. "mlx5_0,mlx5_1"). If not set in config, the connector also checks the RDMA_DEVICE_NAME environment variable.

This is important in environments with mixed InfiniBand/RoCE NICs, where not all devices are suitable for the transfer engine.
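
A minimal sketch of the lookup order described above, assuming the config value takes precedence and the environment variable is only consulted when the config value is empty. The helper name `resolve_rdma_devices` is hypothetical.

```python
import os


def resolve_rdma_devices(config_value, environ=None):
    """Return the RDMA device names to use, preferring config over env."""
    environ = os.environ if environ is None else environ
    raw = config_value or environ.get("RDMA_DEVICE_NAME", "")
    # An empty list means "let the engine auto-detect".
    return [name.strip() for name in raw.split(",") if name.strip()]


print(resolve_rdma_devices("mlx5_0, mlx5_1", {}))                 # ['mlx5_0', 'mlx5_1']
print(resolve_rdma_devices("", {"RDMA_DEVICE_NAME": "mlx5_2"}))   # ['mlx5_2']
print(resolve_rdma_devices("", {}))                               # []
```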

6. Local Memory Management¶

6.1 Memory Pool Registration¶

During initialization, the connector:

allocates a large pool tensor
records its base pointer
registers that memory with Mooncake
creates a BufferAllocator for subrange management

This means every later transfer only allocates offsets inside the pre-registered pool rather than registering memory per request.

6.2 BufferAllocator¶

BufferAllocator maintains a sorted free list of (offset, size) blocks and enforces alignment. Its responsibilities include:

aligned allocation
freeing previously allocated blocks
adjacent block merging
double-free detection
overlap detection to catch corruption

This is a critical piece of the connector because both sender and receiver depend on long-lived pool reuse.
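
To make these responsibilities concrete, here is a toy first-fit allocator over a sorted free list. It is a sketch under stated assumptions, not the real BufferAllocator: the class name `SketchAllocator`, the 4096-byte default alignment, and the first-fit policy are all choices made for illustration.

```python
import bisect


class SketchAllocator:
    """Toy first-fit allocator over [0, total_size) with a sorted free list."""

    def __init__(self, total_size, alignment=4096):
        self.total_size = total_size
        self.alignment = alignment
        self.free_blocks = [(0, total_size)]  # sorted (offset, size) pairs

    def _round_up(self, size):
        return -(-size // self.alignment) * self.alignment

    def alloc(self, size):
        need = self._round_up(size)
        for i, (off, blk) in enumerate(self.free_blocks):
            if blk >= need:
                if blk == need:
                    del self.free_blocks[i]
                else:
                    self.free_blocks[i] = (off + need, blk - need)
                return off
        raise MemoryError(f"no free block of {need} bytes")

    def free(self, offset, size):
        size = self._round_up(size)
        i = bisect.bisect_left(self.free_blocks, (offset, 0))
        # Overlap with a neighbour means a double free or corrupted bookkeeping.
        if i > 0:
            p_off, p_size = self.free_blocks[i - 1]
            if p_off + p_size > offset:
                raise ValueError("double free or overlap with previous block")
        if i < len(self.free_blocks) and offset + size > self.free_blocks[i][0]:
            raise ValueError("double free or overlap with next block")
        self.free_blocks.insert(i, (offset, size))
        # Merge with the following block, then with the preceding one.
        if i + 1 < len(self.free_blocks):
            n_off, n_size = self.free_blocks[i + 1]
            if offset + size == n_off:
                self.free_blocks[i] = (offset, size + n_size)
                del self.free_blocks[i + 1]
        if i > 0:
            p_off, p_size = self.free_blocks[i - 1]
            c_off, c_size = self.free_blocks[i]
            if p_off + p_size == c_off:
                self.free_blocks[i - 1] = (p_off, p_size + c_size)
                del self.free_blocks[i]


pool = SketchAllocator(total_size=16384, alignment=4096)
a = pool.alloc(100)    # rounded up to 4096 bytes
b = pool.alloc(5000)   # rounded up to 8192 bytes
pool.free(a, 100)
pool.free(b, 5000)
print(a, b, pool.free_blocks)  # 0 4096 [(0, 16384)]
```

After both frees the free list collapses back into a single block, which is what keeps a long-lived pool from fragmenting under repeated transfers.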

6.3 ManagedBuffer¶

ManagedBuffer is the main fast-path data wrapper. It can:

expose the pool slice as a zero-copy 1D uint8 tensor
reinterpret that slice as a typed tensor
copy out the contents as Python bytes
release the slice back to the allocator

The connector uses ManagedBuffer in two different ways:

as a send-side holder to keep the pool slice alive
as a receive-side return type when is_fast_path=True
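
The ownership pattern can be illustrated without torch. The sketch below uses a `bytearray` pool and a `memoryview` slice as stand-ins for the pool tensor and its zero-copy view; `SketchBuffer`, its `on_release` callback, and the idempotent `release()` are assumptions for illustration and may differ from the real ManagedBuffer.

```python
class SketchBuffer:
    """Toy stand-in for ManagedBuffer over a bytearray pool (no torch)."""

    def __init__(self, pool, offset, size, on_release):
        self._view = memoryview(pool)[offset:offset + size]  # zero-copy slice
        self._on_release = on_release
        self._released = False

    @property
    def view(self):
        return self._view

    def to_bytes(self):
        return bytes(self._view)  # copies the contents out of the pool

    def release(self):
        if not self._released:  # make release safe to call twice
            self._released = True
            self._view.release()
            self._on_release()


pool = bytearray(64)
released = []
buf = SketchBuffer(pool, offset=16, size=4, on_release=lambda: released.append(16))
try:
    buf.view[:] = b"abcd"       # the write lands directly in the pool
    snapshot = buf.to_bytes()
finally:
    buf.release()               # hand the slice back once done

print(bytes(pool[16:20]), snapshot, released)  # b'abcd' b'abcd' [16]
```

The `try` / `finally` shape is the part worth copying: a buffer that is never released keeps its slice of the pool occupied.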

7. Put Flow¶

7.1 High-Level Behavior¶

put() is only valid in sender mode. Its job is to expose a payload for later remote pull by the receiver.

The high-level flow is:

validate connector state and role
convert the input into a pool-backed transferable representation
store the transfer metadata in _local_buffers
return lightweight metadata describing how the receiver can fetch the data

7.2 Payload Type Handling¶

put() supports three payload classes:

A. ManagedBuffer

If the buffer belongs to the same pool, the connector can use it directly without copying. This is the most efficient path.

If the buffer comes from a different pool, the connector falls back to a copy path.

B. torch.Tensor or bytes

These payloads are copied into the local pool and marked as fast-path data:

no Omni object serialization is required
receiver can get a ManagedBuffer back

C. Generic Python object

Any other payload is serialized via OmniSerializer.serialize(...) and then copied into the pool.

In this case:

is_fast_path=False
the receiver will deserialize back into a Python object

7.3 Sender Metadata¶

The sender returns:

{
    "source_host": self.host,
    "source_port": self.zmq_port,
    "data_size": size,
    "is_fast_path": is_fast_path,
}

This metadata is intentionally lightweight. It tells the receiver:

where the sender-side control plane lives
how large the remote transfer will be
whether the payload should be returned as a ManagedBuffer or a deserialized object

7.4 Sender Buffer Table¶

The sender stores each live payload in _local_buffers under the stage-qualified key. Each entry contains:

source addresses
lengths
the holder object
ownership information (should_release)
is_fast_path
creation time

This table is the sender-side truth source for both metadata queries and pull requests.

8. Get Flow¶

8.1 High-Level Behavior¶

get() runs on the receiver side and performs four steps:

resolve metadata
allocate a destination buffer in the local pool
request the sender to write into that destination buffer
return either a ManagedBuffer or a deserialized object

8.2 Metadata Resolution Paths¶

The metadata parameter in get() is optional. The connector supports two resolution modes depending on whether the caller supplies it.

With metadata

When the caller passes metadata, the connector uses it directly. The metadata carries:

source_host / source_port — sender ZMQ endpoint
data_size — payload byte count
is_fast_path — whether the receiver gets a ManagedBuffer or a deserialized object

This mode is suitable when the control plane already forwards the sender's put() output to the receiver.

Without metadata

When get(metadata=None) is called, the connector queries the sender over ZMQ to discover the same fields (data_size, is_fast_path). The caller must first call:

update_sender_info(sender_host, sender_zmq_port)

so that the connector knows where to send the query.

This mode is suitable for polling-based flows (e.g. KV transfer, async chunk transfer) where the receiver does not have metadata from the control plane.

8.3 Destination Allocation¶

Once metadata is resolved, the receiver:

allocates a subrange from its own local pool
wraps it in a ManagedBuffer
builds a MooncakeAgentMetadata request containing:
receiver hostname
receiver RPC port
request ID
destination addresses
transfer lengths

This tells the sender exactly where to write the incoming data.

8.4 Transfer Completion¶

The receiver then sends the pull request over ZMQ and waits for:

TRANS_DONE
or TRANS_ERROR

If the transfer succeeds:

for is_fast_path=True, the receiver returns (ManagedBuffer, size)
for is_fast_path=False, the receiver copies to bytes, deserializes, releases the buffer, and returns (object, size)
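
The two return shapes can be sketched as follows. The helper `finish_get`, its `release` callback, and the use of `pickle` in place of the Omni deserializer are assumptions for illustration; only the branching on `is_fast_path` comes from the description above.

```python
import pickle


def finish_get(recv_buffer, size, is_fast_path, release):
    """Turn a filled receive buffer into the value handed back to the caller."""
    if is_fast_path:
        # The caller now owns the buffer and must release it later.
        return recv_buffer, size
    try:
        # Copy to bytes, then deserialize (pickle is a stand-in here).
        obj = pickle.loads(bytes(recv_buffer[:size]))
    finally:
        release()  # the serialized path releases the buffer internally
    return obj, size


released = []
payload = pickle.dumps({"tokens": [1, 2, 3]})
obj, n = finish_get(bytearray(payload), len(payload), False, lambda: released.append(True))
print(obj, n == len(payload), released)  # {'tokens': [1, 2, 3]} True [True]

raw, n = finish_get(bytearray(b"abcd"), 4, True, lambda: released.append(True))
print(bytes(raw), n, released)  # b'abcd' 4 [True]
```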

9. Sender-Side Listener Design¶

9.1 Listener Thread¶

In sender mode, the connector starts _zmq_listener_loop() after initialization. The listener:

binds tcp://{host}:{zmq_port}
receives incoming requests
uses a poller for socket events and internal notifications
periodically reclaims stale buffers

If the bind fails, initialization fails immediately. The code does not silently downgrade the connector role.

9.2 Worker Thread Pool¶

The listener hands work to _sender_executor so that the listener thread does not block on transfer work.

There are two request types:

metadata query -> _handle_query_request(...)
transfer request -> _handle_pull_request(...)

9.3 Query Handling¶

For metadata queries, the sender looks up the request ID in _local_buffers and returns:

data size
fast-path flag

This supports consumers that only know the sender endpoint but not the original sender metadata.

9.4 Pull Handling¶

For a transfer request, the sender:

locates the source addresses in _local_buffers
constructs the remote session identifier
calls batch_transfer_sync_write(...)
replies TRANS_DONE or TRANS_ERROR

On success, the sender immediately calls cleanup(meta.request_id) and frees the producer-side buffer if it owns it.

This choice is important: it makes the connector effectively a single-consumer transfer model for each successful put/get pair.

10. Fast Path Semantics¶

This connector explicitly advertises:

supports_raw_data = True

That means it can move raw payloads without forcing everything through the Omni object serializer.

Fast Path¶

For torch.Tensor, bytes, or pool-local ManagedBuffer:

sender returns is_fast_path=True
receiver returns a ManagedBuffer
caller is responsible for calling release()

This avoids an unnecessary copy on the receiver side.

Serialized Object Path¶

For arbitrary Python objects:

sender serializes the payload
receiver converts the receive buffer to bytes
receiver deserializes the object
receive buffer is released internally

This preserves a uniform object-oriented API while still allowing optimized raw-data transport when possible.

11. Failure Handling and Cleanup¶

11.1 Timeouts and Socket Recovery¶

The receiver caches ZMQ REQ sockets per thread, but invalidates them after failures. This avoids reusing sockets that may be stuck in a bad state after timeout or receive errors.

Timeout is scaled based on payload size:

a base timeout
plus additional time for large payloads

This is intended to reduce false timeouts for large remote writes.
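
One way to express such a policy is a base value plus a size-proportional term. The constants and the linear shape below are assumptions chosen for illustration; the connector's actual base timeout and scaling rule are not specified here.

```python
GIB = 1 << 30


def scaled_timeout_ms(data_size, base_ms=30_000, extra_ms_per_gib=10_000):
    """Base timeout plus extra time proportional to payload size (hypothetical values)."""
    return base_ms + (data_size * extra_ms_per_gib) // GIB


print(scaled_timeout_ms(0))         # 30000
print(scaled_timeout_ms(2 * GIB))   # 50000
```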

11.2 Stale Buffer Reclamation¶

The sender periodically reclaims old entries from _local_buffers using a TTL policy. This protects the memory pool from permanent leaks if a receiver crashes or never consumes a prepared payload.

This is a practical recovery mechanism, although the code notes that TTL cleanup can still race with very long-running in-flight transfers.
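
A TTL sweep over the sender buffer table can be sketched like this. The entry fields `created_at` and `should_release` follow the table description in section 7.4, while the function name, the key format, and the `on_release` callback are hypothetical.

```python
import time


def reclaim_stale(local_buffers, ttl_seconds, now=None, on_release=None):
    """Drop entries older than ttl_seconds and return the reclaimed keys."""
    now = time.monotonic() if now is None else now
    stale = [
        key for key, entry in local_buffers.items()
        if now - entry["created_at"] > ttl_seconds
    ]
    for key in stale:
        entry = local_buffers.pop(key)
        # Only free pool slices that the connector itself owns.
        if entry.get("should_release") and on_release is not None:
            on_release(entry)
    return stale


buffers = {
    "0_1_req-a": {"created_at": 0.0, "should_release": True},
    "0_1_req-b": {"created_at": 90.0, "should_release": True},
}
freed = []
print(reclaim_stale(buffers, ttl_seconds=60, now=100.0, on_release=freed.append))  # ['0_1_req-a']
print(sorted(buffers))  # ['0_1_req-b']
```

A sweep like this cannot tell an abandoned entry from one whose transfer is still running, which is the race mentioned above; the TTL should therefore comfortably exceed the longest expected transfer.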

11.3 Connector Shutdown¶

close() is a full resource teardown routine. It:

stops the listener thread
shuts down the worker executor
releases all pending buffers
closes cached sockets
unregisters memory from the engine when supported
terminates the ZMQ context
drops the pool reference

This makes MooncakeTransferEngineConnector the most lifecycle-aware connector in the current connector family.

12. Current Implementation Constraints¶

The current code documents several important topology constraints.

12.1 One Sender to One Receiver per Successful Transfer¶

After a successful transfer, the sender-side buffer is cleaned up immediately. This means the same prepared payload is not retained for multiple independent receivers.

12.2 One Receiver to One Active Sender Endpoint¶

The receiver only stores one (sender_host, sender_zmq_port) pair through update_sender_info(...). So the metadata-query mode is currently single-sender at a time.

12.3 Explicit Buffer Ownership Matters¶

When the connector allocates a pool slice internally, it is responsible for releasing it. When a caller passes an externally owned ManagedBuffer, the connector keeps it alive for transfer but does not assume ownership of its eventual release.

These constraints are consistent with the current implementation and should be treated as design assumptions rather than incidental behavior.

13. Data Flow in the Pipeline¶

The end-to-end sender/receiver interaction is:

sequenceDiagram
    participant SenderStage
    participant SenderConnector
    participant ReceiverConnector
    participant ReceiverStage

    SenderStage->>SenderConnector: put(from_stage, to_stage, put_key, data)
    SenderConnector->>SenderConnector: place payload in local memory pool
    SenderConnector-->>SenderStage: metadata(source_host, source_port, data_size, is_fast_path)

    ReceiverStage->>ReceiverConnector: get(..., metadata)
    ReceiverConnector->>ReceiverConnector: allocate destination buffer
    ReceiverConnector->>SenderConnector: ZMQ pull request with dst addr
    SenderConnector->>ReceiverConnector: TransferEngine remote write
    SenderConnector-->>ReceiverConnector: TRANS_DONE
    ReceiverConnector-->>ReceiverStage: ManagedBuffer or deserialized object

For metadata-less polling, the flow simply adds a metadata query step before the pull request.

14. Strengths and Trade-offs¶

Strengths¶

Best remote-transfer design in the current connector stack for large payloads.
Supports raw-data fast path.
Keeps stage communication under the same connector abstraction.
Includes real lifecycle and memory-pool management.
Works for both stage payload transfer and KV transfer scenarios.

Trade-offs¶

More complex than the store-based connector.
Correctness depends on role injection and endpoint coordination.
Caller must release fast-path receive buffers.
Current implementation is optimized for single-consumer transfer semantics.

15. Summary¶

MooncakeTransferEngineConnector is the high-performance peer-to-peer transport in the OmniConnector system. Its design combines:

a registered memory pool
a safe subrange allocator
a ZMQ control plane
a Mooncake transfer-engine data plane

This allows the connector to support both:

a fast path for raw tensors and bytes
a generic object path for arbitrary Python payloads

Within vLLM-Omni, it is the connector that most directly targets performance-sensitive remote transfer, especially for large payloads and KV cache movement. Its additional complexity is deliberate: it is the connector that turns the generic OmniConnector abstraction into a transport capable of efficient remote memory movement rather than simple object storage.