Input Definitions#

User-facing inputs#

vllm.multimodal.inputs.MultiModalDataDict[source]#

A dictionary containing an entry for each modality of the input data.

The built-in modalities are defined by MultiModalDataBuiltins.

alias of Mapping[str, Union[Any, list[Any]]]
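
For example, multi-modal data is passed to LLM.generate() under the multi_modal_data key of the prompt dictionary. A minimal sketch, assuming a LLaVA-style model (the model name and prompt template below are illustrative):

from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # illustrative model choice

outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is in this image? ASSISTANT:",
    # The value of "multi_modal_data" is a MultiModalDataDict:
    # one entry per modality, holding a single item or a list of items.
    "multi_modal_data": {"image": Image.open("example.jpg")},
})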

Internal data structures#

class vllm.multimodal.inputs.PlaceholderRange[source]#

Bases: TypedDict

Placeholder location information for multi-modal data.

Example

Prompt: AAAA BBBB What is in these images?

Images A and B will have:

A: { "offset": 0, "length": 4 }
B: { "offset": 5, "length": 4 }

length: int[source]#

The length of the placeholder.

offset: int[source]#

The start index of the placeholder in the prompt.
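
Since PlaceholderRange is a TypedDict, the ranges from the example above can be constructed directly; a short sketch:

from vllm.multimodal.inputs import PlaceholderRange

# The two ranges from the example: image A covers prompt positions
# 0-3, and image B covers positions 5-8.
placeholder_a = PlaceholderRange(offset=0, length=4)
placeholder_b = PlaceholderRange(offset=5, length=4)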

vllm.multimodal.inputs.NestedTensors[source]#

Uses a list instead of a tensor if the dimensions of each element do not match.

alias of Union[list[NestedTensors], list[Tensor], Tensor, tuple[Tensor, ...]]
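
A sketch of the two cases: elements with matching dimensions can be kept as one stacked tensor, while mismatched shapes fall back to a plain list (both values below are valid NestedTensors):

import torch

# Every element has the same shape: a single batched tensor suffices.
uniform = torch.stack([torch.zeros(3, 224, 224), torch.zeros(3, 224, 224)])

# Mismatched shapes, e.g. images of different resolutions: use a list.
ragged = [torch.zeros(3, 224, 224), torch.zeros(3, 336, 336)]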

class vllm.multimodal.inputs.MultiModalFieldElem(field: BaseMultiModalField, data: NestedTensors)[source]#

Contains metadata and data of an item in MultiModalKwargs.

class vllm.multimodal.inputs.MultiModalFieldConfig(field_cls: type[BaseMultiModalField], modality: str, **field_config: Any)[source]#

class vllm.multimodal.inputs.MultiModalKwargsItem(dict=None, /, **kwargs)[source]#

Bases: UserDict[str, MultiModalFieldElem]

A collection of MultiModalFieldElem corresponding to a data item in MultiModalDataItems.

class vllm.multimodal.inputs.MultiModalKwargs(data: Mapping[str, NestedTensors], *, items: Sequence[MultiModalKwargsItem] | None = None)[source]#

Bases: UserDict[str, Union[list[NestedTensors], list[Tensor], Tensor, tuple[Tensor, ...]]]

A dictionary that represents the keyword arguments to forward().

The metadata items enable us to obtain the keyword arguments corresponding to each data item in MultiModalDataItems via get_item() and get_items().

static batch(inputs_list: list[MultiModalKwargs]) → Mapping[str, NestedTensors][source]#

Batch multiple inputs together into a dictionary.

The resulting dictionary has the same keys as the inputs. If the corresponding value from each input is a tensor and they all share the same shape, the output value is a single batched tensor; otherwise, the output value is a list containing the original value from each input.
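
A sketch of this rule with an illustrative pixel_values key:

import torch
from vllm.multimodal.inputs import MultiModalKwargs

a = MultiModalKwargs({"pixel_values": torch.zeros(3, 224, 224)})
b = MultiModalKwargs({"pixel_values": torch.zeros(3, 224, 224)})

# Both inputs share the key and the tensor shape, so the values are
# stacked into one batched tensor of shape (2, 3, 224, 224).
batched = MultiModalKwargs.batch([a, b])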

static from_items(items: Sequence[MultiModalKwargsItem])[source]#

Construct a new MultiModalKwargs from multiple items.

get_item(modality: str, item_index: int) → MultiModalKwargsItem[source]#

Get the keyword arguments corresponding to an item identified by its modality and index.

get_item_count(modality: str) → int[source]#

Get the number of items belonging to a modality.

get_items(modality: str) → Sequence[MultiModalKwargsItem][source]#

Get the keyword arguments corresponding to each item belonging to a modality.
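
Taken together, a hedged sketch of per-item access, assuming mm_kwargs was produced by a multi-modal processor and holds two image items:

n = mm_kwargs.get_item_count("image")      # 2 in this scenario
first = mm_kwargs.get_item("image", 0)     # a MultiModalKwargsItem
images = mm_kwargs.get_items("image")      # all image items

# Rebuild a MultiModalKwargs from a subset of items, e.g. after filtering.
subset = MultiModalKwargs.from_items(list(images)[:1])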

class vllm.multimodal.inputs.MultiModalInputsV2[source]#

Bases: TypedDict

Represents the outputs of vllm.multimodal.processing.BaseMultiModalProcessor, ready to be passed to vLLM internals.

mm_hashes: NotRequired[MultiModalHashDict | None][source]#

The hashes of the multi-modal data.

mm_kwargs: MultiModalKwargs[source]#

Keyword arguments to be directly passed to the model after batching.

mm_placeholders: Mapping[str, Sequence[PlaceholderRange]][source]#

For each modality, information about the placeholder tokens in prompt_token_ids.

prompt: str[source]#

The processed prompt text.

prompt_token_ids: list[int][source]#

The processed token IDs, which include placeholder tokens.

token_type_ids: NotRequired[list[int]][source]#

The token type IDs of the prompt.

type: Literal['multimodal'][source]#

The type of inputs.
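
Putting the fields together, an illustrative payload (the token IDs and placeholder positions below are made up for this sketch):

from vllm.multimodal.inputs import (MultiModalInputsV2, MultiModalKwargs,
                                    PlaceholderRange)

inputs: MultiModalInputsV2 = {
    "type": "multimodal",
    "prompt": "<image> What is in this image?",
    # Made-up token IDs; positions 0-3 stand in for the image placeholder.
    "prompt_token_ids": [32000, 32000, 32000, 32000, 1724, 338, 297],
    "mm_kwargs": MultiModalKwargs({}),
    "mm_placeholders": {"image": [PlaceholderRange(offset=0, length=4)]},
}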