Disaggregated Inference for Omni-Modality Models

This guide explains how to configure and use distributed connectors (vllm_omni/distributed/omni_connectors) in vllm-omni for multi-stage pipelines.

Backend-specific setup lives in separate docs.

Overview

Connectors enable data transfer between pipeline stages (e.g., Thinker -> Talker). Current connectors operate in D2H2D (device to host to device) mode.

Connector Choices

| Use Case | Recommended Connector | Notes |
| --- | --- | --- |
| Single node | SharedMemoryConnector | Auto-configured if no connector is specified. |
| Multi node (Mooncake Store) | MooncakeStoreConnector | TCP-based, requires Mooncake Master + metadata server. |
| Multi node (Mooncake RDMA) | MooncakeTransferEngineConnector | RDMA/TCP direct transfer with managed memory pool. Fastest. |
| Multi node (Yuanrong) | YuanrongConnector | Requires Yuanrong Datasystem + etcd. |
| Ascend NPU P2P (Yuanrong TE) | YuanrongTransferEngineConnector | Uses Yuanrong TransferEngine directly. Configure NPU device IPv4 and memory_pool_device: "npu". |

Core API

The connector system is built around OmniConnectorBase.

from abc import ABC, abstractmethod
from typing import Any, Optional


class OmniConnectorBase(ABC):
    @abstractmethod
    def put(self, from_stage: str, to_stage: str, put_key: str, data: Any) -> tuple[bool, int, Optional[dict]]:
        """
        Store data.
        Returns: (success, serialized_size, metadata)
        """
        pass

    @abstractmethod
    def get(self, from_stage: str, to_stage: str, get_key: str, metadata: Optional[dict] = None) -> Optional[tuple[Any, int]]:
        """
        Retrieve data.
        Args: metadata - transport-specific handles returned by put() (e.g., SHM name).
        Returns: (object, serialized_size)
        """
        pass
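
To make the put/get contract concrete, here is a minimal sketch of a toy connector that follows the same signatures. InMemoryConnector and its key layout are hypothetical and are not part of vllm-omni. The class deliberately does not subclass OmniConnectorBase so that the snippet runs without vllm-omni installed, and it uses pickle purely as an assumed serialization format.

```python
import pickle
from typing import Any, Optional


class InMemoryConnector:
    """Hypothetical toy connector mirroring the put/get signatures above."""

    def __init__(self) -> None:
        # Serialized payloads keyed by edge and request key (assumed layout).
        self._store: dict[str, bytes] = {}

    @staticmethod
    def _full_key(from_stage: str, to_stage: str, key: str) -> str:
        return f"{from_stage}->{to_stage}:{key}"

    def put(self, from_stage: str, to_stage: str, put_key: str, data: Any) -> tuple[bool, int, Optional[dict]]:
        payload = pickle.dumps(data)
        self._store[self._full_key(from_stage, to_stage, put_key)] = payload
        # No transport handle is needed for an in-process dict, so metadata is None.
        return True, len(payload), None

    def get(self, from_stage: str, to_stage: str, get_key: str, metadata: Optional[dict] = None) -> Optional[tuple[Any, int]]:
        payload = self._store.pop(self._full_key(from_stage, to_stage, get_key), None)
        if payload is None:
            return None  # Nothing was stored for this edge and key.
        return pickle.loads(payload), len(payload)


connector = InMemoryConnector()
ok, size, meta = connector.put("0", "1", "req-1", {"tokens": [1, 2, 3]})
result = connector.get("0", "1", "req-1", metadata=meta)
```

The consumer passes back whatever metadata put() returned, even when it is None, so that the calling code stays the same if the connector is later swapped for one that does need a handle.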

Metadata Passing

Some connectors (e.g., SharedMemoryConnector) create transient resources during put() and return metadata describing them (e.g., the SHM name). This metadata must be passed through the control plane so that get() can locate the data.
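
The handoff can be illustrated with Python's standard multiprocessing.shared_memory module. This is only a sketch of the idea, not the SharedMemoryConnector implementation: the helper names shm_put and shm_get and the metadata keys "shm_name" and "size" are assumptions, and the example relies on POSIX semantics, where a segment outlives the handle that created it.

```python
import pickle
from multiprocessing import shared_memory
from typing import Any


def shm_put(data: Any) -> dict:
    """Producer side: write the payload to a new segment and return its handle."""
    payload = pickle.dumps(data)
    segment = shared_memory.SharedMemory(create=True, size=len(payload))
    segment.buf[: len(payload)] = payload
    # The segment may be rounded up to a page size, so the true length is recorded.
    metadata = {"shm_name": segment.name, "size": len(payload)}
    segment.close()  # Assumes POSIX: the segment persists until unlink().
    return metadata


def shm_get(metadata: dict) -> Any:
    """Consumer side: attach by name, copy the payload out, then free the segment."""
    segment = shared_memory.SharedMemory(name=metadata["shm_name"])
    try:
        data = pickle.loads(bytes(segment.buf[: metadata["size"]]))
    finally:
        segment.close()
        segment.unlink()  # The transient resource is released after one read.
    return data


# In a real pipeline the metadata dict would travel over the control plane.
metadata = shm_put({"audio_codes": [7, 8, 9]})
restored = shm_get(metadata)
```

Without the metadata dict the consumer has no way to find the segment, which is why a lost or dropped handle shows up as a missing payload on the receiving stage.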

Configuration Model

Define connectors in runtime:

runtime:
  connectors:
    connector_of_shared_memory:
      name: SharedMemoryConnector
      extra:
        shm_threshold_bytes: 65536

Wire stages to connectors:

stage_args:
  - stage_id: 0
    output_connectors:
      to_stage_1: connector_of_shared_memory

  - stage_id: 1
    input_connectors:
      from_stage_0: connector_of_shared_memory

If a pipeline edge has no explicit connector, the system auto-creates a SharedMemoryConnector for that edge.
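
The fallback rule can be sketched as a small lookup over a dict that mirrors the YAML above. The resolve_edge helper, the DEFAULT_CONNECTOR value, and the lookup order (output side first, then input side) are assumptions made for illustration; the actual loader in vllm-omni may resolve edges differently.

```python
# Assumed default used when an edge has no explicit connector.
DEFAULT_CONNECTOR = {"name": "SharedMemoryConnector", "extra": {}}

# Dict form of the YAML above, plus a stage 2 with no connector wiring.
config = {
    "runtime": {
        "connectors": {
            "connector_of_shared_memory": {
                "name": "SharedMemoryConnector",
                "extra": {"shm_threshold_bytes": 65536},
            }
        }
    },
    "stage_args": [
        {"stage_id": 0, "output_connectors": {"to_stage_1": "connector_of_shared_memory"}},
        {"stage_id": 1, "input_connectors": {"from_stage_0": "connector_of_shared_memory"}},
        {"stage_id": 2},
    ],
}


def resolve_edge(config: dict, src: int, dst: int) -> dict:
    """Return the connector spec for the edge src -> dst (hypothetical helper)."""
    connectors = config["runtime"]["connectors"]
    stages = {stage["stage_id"]: stage for stage in config["stage_args"]}
    # Look at the producer's output wiring first, then the consumer's input wiring.
    alias = stages.get(src, {}).get("output_connectors", {}).get(f"to_stage_{dst}")
    if alias is None:
        alias = stages.get(dst, {}).get("input_connectors", {}).get(f"from_stage_{src}")
    if alias is None:
        return dict(DEFAULT_CONNECTOR)  # No explicit connector on this edge.
    return connectors[alias]


explicit = resolve_edge(config, 0, 1)
implicit = resolve_edge(config, 1, 2)
```

Here the edge 0 -> 1 resolves to the configured connector with its shm_threshold_bytes setting, while the unwired edge 1 -> 2 falls back to a plain SharedMemoryConnector spec.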

Relationship with vLLM

vLLM provides specialized distributed mechanisms for specific artifacts:

  • KV Transfer (vllm.distributed.kv_transfer): optimized for KV caches.
  • EC Transfer (vllm.distributed.ec_transfer): optimized for encoder embeddings.
  • Device Communicators (vllm.distributed.device_communicators): low-level primitives (NCCL, SHM).

vllm-omni complements this with a generalized connector abstraction:

  1. Unifies transport via a single put/get API for any stage artifact.
  2. Enables DAG-style pipelines across processes or nodes with per-edge transports.
  3. Can wrap vLLM-specific transfers for KV paths while keeping a consistent interface.

Operational Notes

  • Fail-fast config validation: missing expected edges cause startup failures.
  • Missing payloads halt stages: verify connector wiring and metadata propagation.
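
A fail-fast wiring check along these lines can be sketched as follows. The validate_wiring function and the two rules it enforces (every referenced connector is defined, and each output edge has a matching input edge on the peer stage) are assumptions for illustration; the exact rules vllm-omni applies at startup are not specified here.

```python
def validate_wiring(connectors: dict, stage_args: list[dict]) -> None:
    """Raise ValueError at startup if stage wiring is inconsistent (hypothetical check)."""
    stages = {stage["stage_id"]: stage for stage in stage_args}
    errors = []
    for stage in stage_args:
        src = stage["stage_id"]
        for edge, alias in stage.get("output_connectors", {}).items():
            if alias not in connectors:
                errors.append(f"stage {src}: unknown connector {alias!r}")
            # Edge names are assumed to follow the to_stage_<id> pattern.
            dst = int(edge.removeprefix("to_stage_"))
            peer_inputs = stages.get(dst, {}).get("input_connectors", {})
            if peer_inputs.get(f"from_stage_{src}") != alias:
                errors.append(f"stage {dst}: missing input from_stage_{src} via {alias!r}")
    if errors:
        raise ValueError("invalid connector wiring: " + "; ".join(errors))


connectors = {"connector_of_shared_memory": {"name": "SharedMemoryConnector"}}
good_stages = [
    {"stage_id": 0, "output_connectors": {"to_stage_1": "connector_of_shared_memory"}},
    {"stage_id": 1, "input_connectors": {"from_stage_0": "connector_of_shared_memory"}},
]
# Stage 1 forgets to declare its input, so the edge is only wired on one side.
bad_stages = [good_stages[0], {"stage_id": 1}]

validate_wiring(connectors, good_stages)  # Passes silently.
try:
    validate_wiring(connectors, bad_stages)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Collecting all problems before raising means a single startup failure reports every broken edge at once, rather than one per restart.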

Future Roadmap: D2D Transport

Current connectors use D2H2D paths. Future versions will introduce direct device-to-device connectors (NCCL, UCX, IPC) to reduce latency for large tensor payloads.