Disaggregated Inference for Omni-Modality Models

This guide explains how to configure and use distributed connectors (vllm_omni/distributed/omni_connectors) in vllm-omni for multi-stage pipelines.

Backend-specific setup is covered in separate docs.

Overview

Connectors enable data transfer between pipeline stages (e.g., Thinker -> Talker). Current connectors operate in D2H2D (device to host to device) mode.

Connector Choices

| Use Case | Recommended Connector | Notes |
| --- | --- | --- |
| Single node | SharedMemoryConnector | Auto-configured if no connector is specified. |
| Multi node (Mooncake Store) | MooncakeStoreConnector | TCP-based; requires Mooncake Master + metadata server. |
| Multi node (Mooncake RDMA) | MooncakeTransferEngineConnector | RDMA/TCP direct transfer with managed memory pool. Fastest. |
| Multi node (Mori RDMA) | MoriTransferEngineConnector | RDMA direct transfer via Mori IOEngine. |
| Multi node (Yuanrong) | YuanrongConnector | Requires Yuanrong Datasystem + etcd. |
| Ascend NPU P2P (Yuanrong TE) | YuanrongTransferEngineConnector | Uses Yuanrong TransferEngine directly. Configure NPU device IPv4 and `memory_pool_device: "npu"`. |

Core API

The connector system is built around OmniConnectorBase.

from abc import ABC, abstractmethod
from typing import Any, Optional


class OmniConnectorBase(ABC):
    @abstractmethod
    def put(
        self, from_stage: str, to_stage: str, put_key: str, data: Any
    ) -> tuple[bool, int, Optional[dict]]:
        """
        Store data for the edge from_stage -> to_stage under put_key.
        Returns: (success, serialized_size, metadata)
        """
        pass

    @abstractmethod
    def get(
        self, from_stage: str, to_stage: str, get_key: str, metadata: Optional[dict] = None
    ) -> Optional[tuple[Any, int]]:
        """
        Retrieve data for the edge from_stage -> to_stage under get_key.
        Args: metadata - transport-specific handles returned by put() (e.g., SHM name).
        Returns: (object, serialized_size)
        """
        pass
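To make the put/get contract concrete, here is a minimal sketch of a toy in-process connector. It is not part of vllm-omni: `ConnectorBase` is a local stand-in that only mirrors the interface shown above (so the snippet runs without vllm-omni installed), and `InMemoryConnector` and its key layout are hypothetical.

```python
import pickle
from abc import ABC, abstractmethod
from typing import Any, Optional


class ConnectorBase(ABC):
    """Local stand-in mirroring the OmniConnectorBase interface (assumption)."""

    @abstractmethod
    def put(self, from_stage: str, to_stage: str, put_key: str,
            data: Any) -> tuple[bool, int, Optional[dict]]: ...

    @abstractmethod
    def get(self, from_stage: str, to_stage: str, get_key: str,
            metadata: Optional[dict] = None) -> Optional[tuple[Any, int]]: ...


class InMemoryConnector(ConnectorBase):
    """Hypothetical connector that keeps serialized payloads in a dict."""

    def __init__(self) -> None:
        self._store = {}  # (from_stage, to_stage, key) -> serialized bytes

    def put(self, from_stage, to_stage, put_key, data):
        payload = pickle.dumps(data)
        self._store[(from_stage, to_stage, put_key)] = payload
        # No transport-specific handle is needed here, so metadata is None.
        return True, len(payload), None

    def get(self, from_stage, to_stage, get_key, metadata=None):
        payload = self._store.pop((from_stage, to_stage, get_key), None)
        if payload is None:
            return None  # nothing stored for this edge and key
        return pickle.loads(payload), len(payload)


connector = InMemoryConnector()
ok, size, meta = connector.put("0", "1", "req-1", {"tokens": [1, 2, 3]})
result = connector.get("0", "1", "req-1", metadata=meta)
```

The producer and consumer agree on the edge (`from_stage`, `to_stage`) and the key; whatever `put()` returns as metadata is handed back to `get()` unchanged.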

Metadata Passing

Some connectors (e.g., SharedMemoryConnector) create transient resources during put() and describe them in the returned metadata (e.g., the SHM name). The control plane must forward this metadata to the consuming stage so that get() can locate the data.
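The following sketch illustrates why the metadata has to travel with the request, using Python's standard `multiprocessing.shared_memory`. It is not the SharedMemoryConnector implementation: the helpers `shm_put` and `shm_get` and the metadata field names `shm_name` and `size` are assumptions chosen for illustration.

```python
import pickle
from multiprocessing import shared_memory
from typing import Any, Optional


def shm_put(data: Any) -> tuple[bool, int, Optional[dict]]:
    """Serialize data into a new shared-memory segment and return its handle."""
    payload = pickle.dumps(data)
    segment = shared_memory.SharedMemory(create=True, size=len(payload))
    segment.buf[: len(payload)] = payload
    # The segment name is generated at put() time, so the consumer can only
    # learn it from metadata. The size is included because a segment may be
    # rounded up to a larger allocation on some platforms.
    metadata = {"shm_name": segment.name, "size": len(payload)}
    segment.close()  # detach this handle; on POSIX the segment lives until unlink()
    return True, len(payload), metadata


def shm_get(metadata: Optional[dict]) -> Optional[tuple[Any, int]]:
    """Attach to the segment named in metadata, read it, then free it."""
    if not metadata:
        return None  # without the handle the consumer cannot find the data
    segment = shared_memory.SharedMemory(name=metadata["shm_name"])
    try:
        size = metadata["size"]
        obj = pickle.loads(bytes(segment.buf[:size]))
    finally:
        segment.close()
        segment.unlink()  # the consumer releases the transient resource
    return obj, size


ok, size, meta = shm_put({"audio": b"\x00\x01"})
obj, read_size = shm_get(meta)
```

If `meta` is dropped anywhere between the two calls, `shm_get` has no way to find the segment, which is the failure mode this section warns about.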

Configuration Model

Define connectors under the runtime key:

runtime:
  connectors:
    connector_of_shared_memory:
      name: SharedMemoryConnector
      extra:
        shm_threshold_bytes: 65536

Wire stages to connectors:

stage_args:
  - stage_id: 0
    output_connectors:
      to_stage_1: connector_of_shared_memory

  - stage_id: 1
    input_connectors:
      from_stage_0: connector_of_shared_memory

If a pipeline edge has no explicit connector, the system auto-creates a SharedMemoryConnector for that edge.
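The fallback rule can be sketched as a small lookup over the parsed config. This is an illustration only: `resolve_edge_connectors` and `DEFAULT_CONNECTOR` are hypothetical names, the config is written as a Python dict mirroring the YAML above, and stage 2 is added purely to show an edge without an explicit connector.

```python
from typing import Any

DEFAULT_CONNECTOR = "SharedMemoryConnector"  # fallback described in the text


def resolve_edge_connectors(
    config: dict[str, Any], edges: list[tuple[int, int]]
) -> dict[tuple[int, int], dict[str, Any]]:
    """Map each (src, dst) edge to a connector name, defaulting when unset."""
    connectors = config.get("runtime", {}).get("connectors", {})
    stages = {s["stage_id"]: s for s in config.get("stage_args", [])}
    resolved = {}
    for src, dst in edges:
        outputs = stages.get(src, {}).get("output_connectors", {})
        ref = outputs.get(f"to_stage_{dst}")
        if ref is None:
            # No explicit wiring for this edge: fall back to the default.
            resolved[(src, dst)] = {"name": DEFAULT_CONNECTOR, "explicit": False}
        else:
            resolved[(src, dst)] = {"name": connectors[ref]["name"], "explicit": True}
    return resolved


config = {
    "runtime": {
        "connectors": {
            "connector_of_shared_memory": {
                "name": "SharedMemoryConnector",
                "extra": {"shm_threshold_bytes": 65536},
            }
        }
    },
    "stage_args": [
        {"stage_id": 0, "output_connectors": {"to_stage_1": "connector_of_shared_memory"}},
        {"stage_id": 1, "input_connectors": {"from_stage_0": "connector_of_shared_memory"}},
        {"stage_id": 2},  # hypothetical extra stage with no connector wiring
    ],
}

resolved = resolve_edge_connectors(config, [(0, 1), (1, 2)])
```

Here edge (0, 1) resolves to the explicitly configured connector, while edge (1, 2) falls back to the default.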

Relationship with vLLM

vLLM provides specialized distributed mechanisms for specific artifacts:

KV Transfer (vllm.distributed.kv_transfer): optimized for KV caches.
EC Transfer (vllm.distributed.ec_transfer): optimized for encoder embeddings.
Device Communicators (vllm.distributed.device_communicators): low-level primitives (NCCL, SHM).

vllm-omni complements this with a generalized connector abstraction:

Unifies transport via a single put/get API for any stage artifact.
Enables DAG-style pipelines across processes or nodes with per-edge transports.
Can wrap vLLM-specific transfers for KV paths while keeping a consistent interface.

Operational Notes

Fail-fast config validation: a missing expected edge causes a startup failure.
Missing payloads halt stages: if a stage stalls waiting for input, verify connector wiring and metadata propagation.
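A wiring check of this kind can be run before launch. The sketch below is an assumption about what such a check might look like, not the validation vllm-omni actually performs: `validate_connector_wiring` is a hypothetical helper, and it assumes the `to_stage_N` / `from_stage_N` key format shown in the configuration example.

```python
from typing import Any


def validate_connector_wiring(config: dict[str, Any]) -> None:
    """Raise ValueError if connector references are undefined or mismatched."""
    defined = set(config.get("runtime", {}).get("connectors", {}))
    stages = {s["stage_id"]: s for s in config.get("stage_args", [])}
    problems: list[str] = []

    for stage_id, stage in stages.items():
        outputs = stage.get("output_connectors", {})
        inputs = stage.get("input_connectors", {})

        # Every referenced connector must be defined under runtime.connectors.
        for ref in list(outputs.values()) + list(inputs.values()):
            if ref not in defined:
                problems.append(f"stage {stage_id}: unknown connector '{ref}'")

        # Every output edge needs a matching input on the destination stage.
        for key, ref in outputs.items():
            dst = int(key.rsplit("_", 1)[-1])  # "to_stage_1" -> 1
            peer = stages.get(dst, {}).get("input_connectors", {})
            if peer.get(f"from_stage_{stage_id}") != ref:
                problems.append(
                    f"edge {stage_id}->{dst}: stage {dst} does not list "
                    f"'{ref}' under from_stage_{stage_id}"
                )

    if problems:
        raise ValueError("invalid connector wiring:\n" + "\n".join(problems))


good_config = {
    "runtime": {"connectors": {"connector_of_shared_memory": {"name": "SharedMemoryConnector"}}},
    "stage_args": [
        {"stage_id": 0, "output_connectors": {"to_stage_1": "connector_of_shared_memory"}},
        {"stage_id": 1, "input_connectors": {"from_stage_0": "connector_of_shared_memory"}},
    ],
}

validate_connector_wiring(good_config)  # passes silently
```

Collecting every problem before raising reports all wiring mistakes in one pass rather than one per restart.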

Future Roadmap: D2D Transport

Current connectors use D2H2D paths. Future versions will introduce direct device-to-device connectors (NCCL, UCX, IPC) to reduce latency for large tensor payloads.