
vllm_omni.engine.output_modality

Output modality types for vLLM-Omni.

This module defines the OutputModality enum and TensorAccumulationStrategy for type-safe multimodal output routing and tensor merging.

DRAINABLE_MODALITIES module-attribute

DRAINABLE_MODALITIES = {
    mod
    for mod in OutputModalityNames
    if mod not in NON_DRAINABLE_MODALITIES
}

FinalOutputModalityType module-attribute

FinalOutputModalityType: TypeAlias = Literal[
    "text", "image", "audio", "video"
]

NON_DRAINABLE_MODALITIES module-attribute

NON_DRAINABLE_MODALITIES = {TEXT, LATENT}
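The relationship between the two module-level sets can be reproduced in isolation. The following is a minimal sketch, not the module's source: it re-declares OutputModalityNames locally (as a str-mixin Enum standing in for StrEnum, so that it also runs on Python versions before 3.11) and assumes that TEXT and LATENT in NON_DRAINABLE_MODALITIES refer to OutputModalityNames members.

```python
from enum import Enum


# Local stand-in for the documented StrEnum (assumption: same member values).
class OutputModalityNames(str, Enum):
    AUDIO = "audio"
    IMAGE = "image"
    LATENT = "latent"
    TEXT = "text"


# Assumed to hold OutputModalityNames members, as the comprehension implies.
NON_DRAINABLE_MODALITIES = {
    OutputModalityNames.TEXT,
    OutputModalityNames.LATENT,
}

# Same comprehension as documented: every name not marked non-drainable.
DRAINABLE_MODALITIES = {
    mod
    for mod in OutputModalityNames
    if mod not in NON_DRAINABLE_MODALITIES
}
```

Under these assumptions the drainable set contains only the audio and image names, and the two sets together cover every member of OutputModalityNames.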

OutputModality

Bases: Flag

Bit-flag enum for output modalities.

Compose freely with | — no need to enumerate every combination.

Single: OutputModality.TEXT, OutputModality.IMAGE, ...

Compound: OutputModality.TEXT | OutputModality.IMAGE (text+image)

Note: POOLING is intentionally excluded. Pooling/embedding is vLLM's native path (pooling_output → PoolingRequestOutput), handled entirely by the base OutputProcessor. vLLM-Omni's layer does not participate.
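Composition with | relies only on standard enum.Flag behavior, so it can be illustrated with a local stand-in. This sketch re-declares the enum rather than importing it; the member declaration order is an assumption and the code does not depend on the numeric values.

```python
from enum import Flag, auto


# Local stand-in for the documented enum; declaration order is an assumption.
class OutputModality(Flag):
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()
    LATENT = auto()


# A compound value is built with |, without a dedicated member for it.
text_and_image = OutputModality.TEXT | OutputModality.IMAGE

# Membership tests work on compound values.
has_image = OutputModality.IMAGE in text_and_image
has_audio = OutputModality.AUDIO in text_and_image

# Masking out a flag with & and ~ leaves the remaining ones.
image_only = text_and_image & ~OutputModality.TEXT
```

Because any combination is representable, callers can test for a single modality with `in` instead of comparing against an enumerated list of combinations.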

AUDIO class-attribute instance-attribute

AUDIO = auto()

IMAGE class-attribute instance-attribute

IMAGE = auto()

LATENT class-attribute instance-attribute

LATENT = auto()

TEXT class-attribute instance-attribute

TEXT = auto()

has_multimodal property

has_multimodal: bool

has_text property

has_text: bool
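The two properties are documented here only by their signatures. One plausible reading, sketched below on a local stand-in enum, is that has_text reports whether the TEXT flag is set and has_multimodal reports whether any non-text flag is set. Both definitions are assumptions, including whether LATENT counts as multimodal; the actual implementation may differ.

```python
from enum import Flag, auto


# Local stand-in; the property bodies are assumptions, not the module's source.
class OutputModality(Flag):
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()
    LATENT = auto()

    @property
    def has_text(self) -> bool:
        # True when the TEXT bit is present in this (possibly compound) value.
        return bool(self & OutputModality.TEXT)

    @property
    def has_multimodal(self) -> bool:
        # Assumption: every flag other than TEXT counts as multimodal.
        non_text = (
            OutputModality.IMAGE | OutputModality.AUDIO | OutputModality.LATENT
        )
        return bool(self & non_text)


combo = OutputModality.TEXT | OutputModality.IMAGE
```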

from_string classmethod

from_string(s: str | None) -> OutputModality

Parse a free-text modality string into an OutputModality flag.

Handles common aliases and compound strings separated by + or ,.

Example:

OutputModality.from_string("text+image")
# → OutputModality.TEXT | OutputModality.IMAGE
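A parser with this behavior could look roughly like the sketch below. It is not the library's implementation: the function name parse_modality, the alias table, the TEXT default for None or empty input, and the ValueError on unknown tokens are all hypothetical choices made for illustration. Only the + and , separators and the text+image result come from the description above.

```python
import re
from enum import Flag, auto


# Local stand-in for the documented enum.
class OutputModality(Flag):
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()
    LATENT = auto()


# Hypothetical alias table; the real aliases are not listed on this page.
_ALIASES = {
    "text": OutputModality.TEXT,
    "image": OutputModality.IMAGE,
    "img": OutputModality.IMAGE,
    "audio": OutputModality.AUDIO,
    "speech": OutputModality.AUDIO,
    "latent": OutputModality.LATENT,
}


def parse_modality(s):
    """Parse strings such as "text+image" or "audio, img" into a flag."""
    result = OutputModality(0)  # empty flag, no bits set
    # Split on either documented separator: "+" or ",".
    for token in re.split(r"[+,]", s or ""):
        key = token.strip().lower()
        if not key:
            continue
        if key not in _ALIASES:
            # Assumption: unknown names are rejected rather than ignored.
            raise ValueError(f"unknown modality: {token!r}")
        result |= _ALIASES[key]
    # Assumption: None or an empty string falls back to TEXT.
    return result if result else OutputModality.TEXT
```

Normalizing with strip() and lower() before the lookup keeps the alias table small, since spacing and case variants do not need their own entries.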

OutputModalityNames

Bases: StrEnum

Keys for output modalities.

TODO: (Alex) Integrate this with the bit-flag enum (OutputModality) and use it throughout the code for better type safety (currently only used by the output processor).

AUDIO class-attribute instance-attribute

AUDIO = 'audio'

IMAGE class-attribute instance-attribute

IMAGE = 'image'

LATENT class-attribute instance-attribute

LATENT = 'latent'

TEXT class-attribute instance-attribute

TEXT = 'text'

TensorAccumulationStrategy

Bases: Enum

Strategy for merging incremental multimodal tensors.

APPEND_LIST class-attribute instance-attribute

APPEND_LIST = 'append_list'

Append to a list (no tensor concatenation).

CONCAT_DIM0 class-attribute instance-attribute

CONCAT_DIM0 = 'concat_dim0'

Concatenate along dimension 0. Used for image/latent tensors.

CONCAT_LAST class-attribute instance-attribute

CONCAT_LAST = 'concat_last'

Concatenate along the last dimension. Used for audio waveforms.

REPLACE class-attribute instance-attribute

REPLACE = 'replace'

Replace previous tensor entirely with the latest one.
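The four strategies can be compared side by side with a small merge helper. This is a sketch under stated assumptions: nested Python lists stand in for tensors so that it runs without a tensor library, and the merge and concat_last helpers are hypothetical names, not part of the module.

```python
from enum import Enum


class TensorAccumulationStrategy(Enum):
    APPEND_LIST = "append_list"
    CONCAT_DIM0 = "concat_dim0"
    CONCAT_LAST = "concat_last"
    REPLACE = "replace"


def concat_last(a, b):
    """Join two equally shaped nested lists along their innermost dimension."""
    if a and isinstance(a[0], list):
        return [concat_last(x, y) for x, y in zip(a, b)]
    return a + b


def merge(strategy, previous, new):
    """Hypothetical helper: fold a new chunk into the accumulated value."""
    S = TensorAccumulationStrategy
    if previous is None:
        # First chunk: APPEND_LIST starts a list, the others keep the chunk.
        return [new] if strategy is S.APPEND_LIST else new
    if strategy is S.APPEND_LIST:
        return previous + [new]  # keep chunks separate
    if strategy is S.CONCAT_DIM0:
        return previous + new  # stack along the outermost dimension
    if strategy is S.CONCAT_LAST:
        return concat_last(previous, new)  # extend the innermost dimension
    return new  # REPLACE: only the latest chunk survives
```

With a tensor library, the two concatenation branches would typically map to a concatenate call along axis 0 and along the last axis, respectively.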

get_accumulation_strategy

get_accumulation_strategy(
    modality: OutputModality,
) -> TensorAccumulationStrategy

Determine tensor merge strategy from the multimodal flags.
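A selection function consistent with the strategy descriptions above (CONCAT_LAST for audio waveforms, CONCAT_DIM0 for image and latent tensors) might be sketched as follows. The function name pick_strategy, the precedence of AUDIO over IMAGE and LATENT for compound flags, and the APPEND_LIST fallback are assumptions; the actual function may decide differently.

```python
from enum import Enum, Flag, auto


# Local stand-ins for the documented enums.
class OutputModality(Flag):
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()
    LATENT = auto()


class TensorAccumulationStrategy(Enum):
    APPEND_LIST = "append_list"
    CONCAT_DIM0 = "concat_dim0"
    CONCAT_LAST = "concat_last"
    REPLACE = "replace"


def pick_strategy(modality):
    """Hypothetical mapping from modality flags to a merge strategy."""
    # Documented: audio waveforms are concatenated along the last dimension.
    # Assumption: AUDIO takes precedence when several flags are set.
    if OutputModality.AUDIO in modality:
        return TensorAccumulationStrategy.CONCAT_LAST
    # Documented: image/latent tensors are concatenated along dimension 0.
    if modality & (OutputModality.IMAGE | OutputModality.LATENT):
        return TensorAccumulationStrategy.CONCAT_DIM0
    # Assumption: anything else falls back to appending to a list.
    return TensorAccumulationStrategy.APPEND_LIST
```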