vllm_omni.utils.mm_outputs
Utilities for handling multimodal outputs and building multimodal output payloads. Most of them are shared by the prefix-cache and non-prefix-cache paths.
build_mm_cpu
Pre-copies multimodal tensors to the CPU once, rather than once per request, to avoid redundant device-to-host (D2H) transfers when gpu_resident_buffer_keys keeps them on the GPU.
With prefix caching, the multimodal outputs provided contain only the passthrough data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| multimodal_outputs | dict | Multimodal dict mapping strings to objects. | required |
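The copy-once idea described above can be sketched as follows. This is an illustration under stated assumptions, not the vllm_omni implementation: the function name `copy_outputs_to_cpu_once`, the `FakeTensor` stand-in, and the recursive handling of nested containers are all hypothetical. The sketch duck-types on `.detach()` and `.cpu()`, both of which `torch.Tensor` provides, so no GPU is needed to run it.

```python
class FakeTensor:
    """Hypothetical stand-in for a GPU tensor; counts device-to-host copies."""

    copies = 0  # total simulated D2H transfers

    def __init__(self, data, device="cuda"):
        self.data = data
        self.device = device

    def detach(self):
        return self

    def cpu(self):
        if self.device == "cpu":
            return self
        FakeTensor.copies += 1
        return FakeTensor(self.data, device="cpu")


def copy_outputs_to_cpu_once(multimodal_outputs):
    """Return a new dict whose tensor-like values live on the CPU.

    Assumption: non-tensor values are passed through unchanged.
    """

    def to_cpu(value):
        if hasattr(value, "detach") and hasattr(value, "cpu"):
            return value.detach().cpu()
        if isinstance(value, dict):
            return {k: to_cpu(v) for k, v in value.items()}
        if isinstance(value, (list, tuple)):
            return type(value)(to_cpu(v) for v in value)
        return value

    return {key: to_cpu(value) for key, value in multimodal_outputs.items()}


outputs = {"audio": FakeTensor([0.1, 0.2, 0.3]), "meta": {"sample_rate": 16000}}
mm_cpu = copy_outputs_to_cpu_once(outputs)  # single copy for the whole step
per_request = [mm_cpu["audio"] for _ in range(4)]  # requests reuse the CPU copy
```

In this sketch the four per-request entries all refer to the same CPU object, so the transfer cost is paid once regardless of how many requests consume the output.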
to_payload_element
to_payload_element(
element: object,
idx: int,
start: int,
end: int,
pass_lists_through: bool = False,
seq_len: int | None = None,
)
Builds the multimodal (mm) payload element for a single request index from an element containing zero or more CPU tensors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
element | object | The object to be added to the payload. | required |
idx | int | The index of the request. | required |
start | int | The start index corresponding to the request idx. | required |
end | int | The end index corresponding to the request idx. | required |
| pass_lists_through | bool | Whether lists should be treated as passthrough data. This should be False in normal cases, but True when nonempty lists must not be split before postprocess is called, which is the case for prefix caching. | False |
| seq_len | int \| None | Optional sequence length (i.e., dim 0 of hidden states). When set, a tensor whose first dimension equals seq_len is sliced per request. The prefix cache passthrough also passes the total scheduled token count here so that 1D (seq_len,) metadata that is intentionally not cached is still split per request. | None |
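The parameter descriptions above suggest per-request slicing along these lines. The sketch below is a guess at the behavior, not the library's code: `payload_element_sketch` and the `Rows` stand-in for a CPU tensor are hypothetical names, and three details are assumptions: that nonempty lists are split by taking entry `idx`, that dicts are handled recursively, and that tensors whose first dimension does not match `seq_len` (or any tensor when `seq_len` is None) are passed through unchanged.

```python
class Rows:
    """Hypothetical stand-in for a CPU tensor: exposes dim 0 and slicing only."""

    def __init__(self, rows):
        self.rows = list(rows)

    @property
    def shape(self):
        return (len(self.rows),)

    def __getitem__(self, key):
        return Rows(self.rows[key])


def payload_element_sketch(element, idx, start, end,
                           pass_lists_through=False, seq_len=None):
    """Pick out the part of `element` that belongs to request `idx`."""
    if isinstance(element, Rows):
        # Slice only tensors laid out along the token dimension.
        if seq_len is not None and element.shape[0] == seq_len:
            return element[start:end]
        return element  # assumed passthrough
    if isinstance(element, dict):
        return {
            key: payload_element_sketch(value, idx, start, end,
                                        pass_lists_through, seq_len)
            for key, value in element.items()
        }
    if isinstance(element, list):
        if pass_lists_through or not element:
            return element  # left whole for later postprocessing
        return element[idx]  # assumed: one list entry per request
    return element


# Two requests share 5 scheduled tokens: request 0 owns [0, 3), request 1 owns [3, 5).
hidden = Rows(range(5))
sliced = payload_element_sketch(hidden, idx=1, start=3, end=5, seq_len=5)
picked = payload_element_sketch(["a", "b"], idx=1, start=3, end=5)
whole = payload_element_sketch(["a", "b"], idx=1, start=3, end=5,
                               pass_lists_through=True)
```

Here `sliced` holds only the rows for request 1, `picked` is that request's list entry, and `whole` is the unsplit list that the prefix cache path would hand to postprocess.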