Input Definitions#

User-facing inputs#

vllm.multimodal.inputs.MultiModalDataDict[source]#

A dictionary containing an entry for each modality type to input.

The built-in modalities are defined by MultiModalDataBuiltins.

alias of Mapping[str, Union[Any, list[Any]]]

Internal data structures#

class vllm.multimodal.inputs.PlaceholderRange[source]#

Bases: TypedDict

Placeholder location information for multi-modal data.

Example

Prompt: AAAA BBBB What is in these images?

Images A and B will have:

A: { "offset": 0, "length": 4 }
B: { "offset": 5, "length": 4 }
length: int[source]#

The length of the placeholder.

offset: int[source]#

The start index of the placeholder in the prompt.
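Since PlaceholderRange is a TypedDict, instances are plain dictionaries. A minimal sketch of computing the ranges for the example prompt above (the helper function is illustrative only, not part of vLLM):

```python
# Illustrative helper (not part of vLLM): build a PlaceholderRange-shaped
# dict for the first occurrence of a placeholder run in a prompt.
def placeholder_range(prompt: str, marker: str) -> dict:
    offset = prompt.index(marker)
    return {"offset": offset, "length": len(marker)}

prompt = "AAAA BBBB What is in these images?"
a = placeholder_range(prompt, "AAAA")  # {"offset": 0, "length": 4}
b = placeholder_range(prompt, "BBBB")  # {"offset": 5, "length": 4}
```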

vllm.multimodal.inputs.NestedTensors[source]#

Uses a list instead of a tensor if the dimensions of each element do not match.

alias of Union[list[NestedTensors], list[Tensor], Tensor, tuple[Tensor, …]]

class vllm.multimodal.inputs.MultiModalFieldElem(modality: str, key: str, data: NestedTensors, field: BaseMultiModalField)[source]#

Represents a keyword argument corresponding to a multi-modal item in MultiModalKwargs.

data: NestedTensors[source]#

The tensor data of this field in MultiModalKwargs, i.e. the value of the keyword argument to be passed to the model.

field: BaseMultiModalField[source]#

Defines how to combine the tensor data of this field with others in order to batch multi-modal items together for model inference.

key: str[source]#

The key of this field in MultiModalKwargs, i.e. the name of the keyword argument to be passed to the model.

modality: str[source]#

The modality of the corresponding multi-modal item. Each multi-modal item can consist of multiple keyword arguments.

class vllm.multimodal.inputs.MultiModalFieldConfig(field: BaseMultiModalField, modality: str)[source]#
static batched(modality: str)[source]#

Defines a field where an element in the batch is obtained by indexing into the first dimension of the underlying data.

Parameters:

modality – The modality of the multi-modal item that uses this keyword argument.

Example

Input:
    Data: [[AAAA]
        [BBBB]
        [CCCC]]

Output:
    Element 1: [AAAA]
    Element 2: [BBBB]
    Element 3: [CCCC]
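The example above can be sketched in pure Python, with lists standing in for tensors (an illustrative analogue, not vLLM's implementation):

```python
# Pure-Python analogue of a "batched" field: batch element i is simply
# data[i], i.e. an index into the first dimension of the underlying data.
data = [["A"] * 4, ["B"] * 4, ["C"] * 4]
elements = [data[i] for i in range(len(data))]
# elements[0] == ["A", "A", "A", "A"], one element per multi-modal item
```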
static flat(modality: str, slices: Sequence[slice])[source]#

Defines a field where an element in the batch is obtained by slicing along the first dimension of the underlying data.

Parameters:
  • modality – The modality of the multi-modal item that uses this keyword argument.

  • slices – For each multi-modal item, a slice that is used to extract the data corresponding to it.

Example

Given:
    slices: [slice(0, 3), slice(3, 7), slice(7, 9)]

Input:
    Data: [AAABBBBCC]

Output:
    Element 1: [AAA]
    Element 2: [BBBB]
    Element 3: [CC]
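The same slicing can be sketched in pure Python, with a list standing in for the flat tensor (an illustrative analogue, not vLLM's implementation):

```python
# Pure-Python analogue of a "flat" field: each batch element is a slice
# along the first dimension of the shared flat data.
data = list("AAABBBBCC")
slices = [slice(0, 3), slice(3, 7), slice(7, 9)]
elements = [data[s] for s in slices]
# elements[1] == ["B", "B", "B", "B"]
```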
static flat_from_sizes(modality: str, size_per_item: torch.Tensor)[source]#

Defines a field where an element in the batch is obtained by slicing along the first dimension of the underlying data.

Parameters:
  • modality – The modality of the multi-modal item that uses this keyword argument.

  • size_per_item – For each multi-modal item, the size of the slice that is used to extract the data corresponding to it.

Example

Given:
    size_per_item: [3, 4, 2]

Input:
    Data: [AAABBBBCC]

Output:
    Element 1: [AAA]
    Element 2: [BBBB]
    Element 3: [CC]
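In other words, flat_from_sizes behaves like flat() with the slices derived from the cumulative sums of size_per_item. A pure-Python sketch of that derivation (illustrative only; the real argument is a 1-D tensor):

```python
from itertools import accumulate

# Derive slice boundaries from per-item sizes via cumulative sums.
size_per_item = [3, 4, 2]
ends = list(accumulate(size_per_item))   # [3, 7, 9]
starts = [0] + ends[:-1]                 # [0, 3, 7]
slices = [slice(a, b) for a, b in zip(starts, ends)]

data = list("AAABBBBCC")
elements = [data[s] for s in slices]
```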
static shared(modality: str, batch_size: int)[source]#

Defines a field where an element in the batch is obtained by taking the entirety of the underlying data.

This means that the data is the same for each element in the batch.

Parameters:
  • modality – The modality of the multi-modal item that uses this keyword argument.

  • batch_size – The number of multi-modal items which share this data.

Example

Given:
    batch_size: 4

Input:
    Data: [XYZ]

Output:
    Element 1: [XYZ]
    Element 2: [XYZ]
    Element 3: [XYZ]
    Element 4: [XYZ]
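A pure-Python sketch of the example above, with a list standing in for the tensor (illustrative only; the real field does not copy the data):

```python
# Pure-Python analogue of a "shared" field: every batch element is the
# entire underlying data, repeated batch_size times.
data = list("XYZ")
batch_size = 4
elements = [data] * batch_size
# All four elements refer to the same underlying ["X", "Y", "Z"] data.
```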
class vllm.multimodal.inputs.MultiModalKwargsItem(dict=None, /, **kwargs)[source]#

Bases: UserDict[str, MultiModalFieldElem]

A collection of MultiModalFieldElem corresponding to a data item in MultiModalDataItems.

class vllm.multimodal.inputs.MultiModalKwargs(data: Mapping[str, NestedTensors], *, items: Sequence[MultiModalKwargsItem] | None = None)[source]#

Bases: UserDict[str, Union[list[NestedTensors], list[Tensor], Tensor, tuple[Tensor, …]]]

A dictionary that represents the keyword arguments to forward().

The metadata items enable us to obtain the keyword arguments corresponding to each data item in MultiModalDataItems, via get_item() and get_items().

static batch(inputs_list: list[vllm.multimodal.inputs.MultiModalKwargs]) dict[str, NestedTensors][source]#

Batch multiple inputs together into a dictionary.

The resulting dictionary has the same keys as the inputs. If the corresponding value from each input is a tensor and they all share the same shape, the output value is a single batched tensor; otherwise, the output value is a list containing the original value from each input.
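The shape-based rule can be sketched in pure Python, with lists of equal length standing in for tensors of equal shape (an illustrative toy, assuming the real implementation stacks matching tensors with torch.stack):

```python
# Toy version of the per-key batching rule: equal "shapes" (here: lengths)
# are batched together; mismatched shapes fall back to a plain list.
def batch_key(values):
    same_shape = len({len(v) for v in values}) == 1
    # Tag the result so the two cases are visible in this sketch.
    return ("stacked", values) if same_shape else ("list", values)
```

For example, two 2-element values are "stacked", while a 1-element and a 2-element value stay a list.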

static from_items(items: Sequence[MultiModalKwargsItem])[source]#

Construct a new MultiModalKwargs from multiple items.

get_item(modality: str, item_index: int) MultiModalKwargsItem[source]#

Get the keyword arguments corresponding to an item identified by its modality and index.

get_item_count(modality: str) int[source]#

Get the number of items belonging to a modality.

get_items(modality: str) Sequence[MultiModalKwargsItem][source]#

Get the keyword arguments corresponding to each item belonging to a modality.

class vllm.multimodal.inputs.MultiModalInputs[source]#

Bases: TypedDict

Represents the outputs of vllm.multimodal.processing.BaseMultiModalProcessor, ready to be passed to vLLM internals.

mm_hashes: MultiModalHashDict | None[source]#

The hashes of the multi-modal data.

mm_kwargs: MultiModalKwargs[source]#

Keyword arguments to be directly passed to the model after batching.

mm_placeholders: Mapping[str, Sequence[PlaceholderRange]][source]#

For each modality, information about the placeholder tokens in prompt_token_ids.

prompt: str[source]#

The processed prompt text.

prompt_token_ids: list[int][source]#

The processed token IDs, which include placeholder tokens.

token_type_ids: NotRequired[list[int]][source]#

The token type IDs of the prompt.

type: Literal['multimodal'][source]#

The type of inputs.
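Since MultiModalInputs is a TypedDict, its instances are plain dictionaries with the keys above. A sketch of its shape (the token IDs and mm_kwargs value are made-up placeholders, not real processor output):

```python
# Illustrative MultiModalInputs-shaped dict; values are placeholders only.
inputs = {
    "type": "multimodal",
    "prompt": "AAAA BBBB What is in these images?",
    "prompt_token_ids": [101, 101, 101, 101, 102, 102, 102, 102, 103],
    "mm_kwargs": {},  # stands in for a MultiModalKwargs instance
    "mm_hashes": None,
    "mm_placeholders": {
        "image": [
            {"offset": 0, "length": 4},
            {"offset": 5, "length": 4},
        ]
    },
}
```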