vllm_omni.utils.mm_outputs

Utilities for handling multimodal outputs and building multimodal output payloads. Most of them are shared by the prefix-cache and non-prefix-cache paths.

logger module-attribute

logger = init_logger(__name__)

build_mm_cpu

build_mm_cpu(multimodal_outputs: dict) -> dict[str, object]

Pre-copies multimodal tensors to CPU once, rather than once per request, to avoid redundant device-to-host (D2H) transfers when gpu_resident_buffer_keys keeps them on the GPU.

With prefix caching, the multimodal outputs passed in contain only the passthrough data.

Parameters:

Name Type Description Default
multimodal_outputs dict

Multimodal dict mapping strings to objects.

required
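The copy-once idea can be illustrated with a small, dependency-free sketch. This is not the vllm_omni implementation: `StandInTensor` and `precopy_to_cpu` are hypothetical names, and the handling of lists is an assumption made for illustration. The stand-in class only counts simulated device-to-host copies so the saving is visible.

```python
class StandInTensor:
    """Hypothetical stand-in for a device tensor; counts simulated D2H copies."""

    d2h_copies = 0

    def __init__(self, data, device="cuda"):
        self.data = list(data)
        self.device = device

    def cpu(self):
        # Already on CPU: no transfer needed.
        if self.device == "cpu":
            return self
        StandInTensor.d2h_copies += 1
        return StandInTensor(self.data, device="cpu")


def precopy_to_cpu(outputs):
    """Hypothetical helper: move every tensor-like value to CPU exactly once."""
    cpu_outputs = {}
    for key, value in outputs.items():
        if isinstance(value, StandInTensor):
            cpu_outputs[key] = value.cpu()
        elif isinstance(value, list):
            # Assumption: lists may hold tensors mixed with other objects.
            cpu_outputs[key] = [
                v.cpu() if isinstance(v, StandInTensor) else v for v in value
            ]
        else:
            # Non-tensor values are passed through unchanged.
            cpu_outputs[key] = value
    return cpu_outputs


outputs = {"hidden": StandInTensor([1, 2, 3, 4]), "meta": "passthrough"}
cpu_outputs = precopy_to_cpu(outputs)

# Three requests read from the same CPU copy; no further transfers happen.
request_views = [cpu_outputs["hidden"].data[i:i + 1] for i in range(3)]
```

Without the up-front copy, each of the three requests would trigger its own transfer of the same data.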

to_payload_element

to_payload_element(
    element: object,
    idx: int,
    start: int,
    end: int,
    pass_lists_through: bool = False,
    seq_len: int | None = None,
)

Builds the multimodal (mm) payload element for one request index from an element containing zero or more CPU tensors.

Parameters:

Name Type Description Default
element object

The object to be added to the payload.

required
idx int

The index of the request.

required
start int

The start index corresponding to the request idx.

required
end int

The end index corresponding to the request idx.

required
pass_lists_through bool

Whether lists should be treated as passthrough data. This should be False in normal cases, but True when nonempty lists must not be split before calling postprocess, which is the case for prefix cache.

False
seq_len int | None

Optional sequence length (i.e., dim 0 of hidden states). When set, a tensor whose first dimension equals seq_len is sliced per request. The prefix cache passthrough also passes the total scheduled token count here so 1D (seq_len,) metadata that is intentionally not cached is still split per request.

None
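The interaction between start, end, pass_lists_through, and seq_len can be sketched without any dependencies. This is a hedged illustration, not the library's code: `Rows` and `slice_for_request` are hypothetical names, and two behaviors are assumptions not confirmed by the descriptions above: that a non-passthrough list holds one entry per request and is indexed by idx, and that a tensor is left whole when seq_len is None.

```python
from dataclasses import dataclass


@dataclass
class Rows:
    """Hypothetical stand-in for a CPU tensor; len(values) plays the role of dim 0."""

    values: list


def slice_for_request(element, idx, start, end,
                      pass_lists_through=False, seq_len=None):
    """Hypothetical sketch of picking one request's share of an element."""
    if isinstance(element, Rows):
        # Documented rule: slice only when dim 0 matches seq_len.
        if seq_len is not None and len(element.values) == seq_len:
            return Rows(element.values[start:end])
        # Assumption: anything else is returned whole.
        return element
    if isinstance(element, list) and not pass_lists_through:
        # Assumption: one list entry per request.
        return element[idx] if idx < len(element) else None
    # Passthrough data (including lists when pass_lists_through=True).
    return element


# Two requests scheduled together: 3 tokens, then 2 tokens (seq_len = 5).
spans = [(0, 3), (3, 5)]
hidden = Rows([10, 11, 12, 13, 14])
per_request = [
    slice_for_request(hidden, i, s, e, seq_len=5)
    for i, (s, e) in enumerate(spans)
]

# Dim 0 does not match seq_len, so this element is not sliced.
unsliced = slice_for_request(Rows([0.5]), 0, 0, 3, seq_len=5)

names = ["a", "b"]
split_item = slice_for_request(names, 1, 3, 5)
kept_list = slice_for_request(names, 1, 3, 5, pass_lists_through=True)
```

In this sketch the first request receives the rows at positions 0 to 2 and the second receives positions 3 and 4, while the list is either indexed per request or handed over intact depending on pass_lists_through.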