Multi-Modality
vLLM provides experimental support for multi-modal models through the vllm.multimodal package.
Multi-modal inputs can be passed alongside text and token prompts to supported models via the multi_modal_data field in vllm.inputs.PromptType.
Currently, vLLM only has built-in support for image data. You can extend vLLM to process additional modalities by following this guide.
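For example, a minimal sketch of passing an image alongside a text prompt through the multi_modal_data field (the model name, prompt template, and image path are illustrative; any vLLM model with multi-modal support works the same way):

from PIL import Image
from vllm import LLM

# The model name is illustrative; substitute any supported multi-modal model.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Hypothetical local image file.
image = Image.open("example.jpg")

outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is shown in this image?\nASSISTANT:",
    "multi_modal_data": {"image": image},
})

for output in outputs:
    print(output.outputs[0].text)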
Looking to add your own multi-modal model? Please follow the instructions listed here.
Guides
Module Contents
Registry
- vllm.multimodal.MULTIMODAL_REGISTRY = <vllm.multimodal.registry.MultiModalRegistry object>
The global MultiModalRegistry is used by model runners to dispatch data processing according to the data's modality and the target model.
- class vllm.multimodal.MultiModalRegistry(*, plugins: Sequence[MultiModalPlugin] = DEFAULT_PLUGINS)
A registry that dispatches data processing to the MultiModalPlugin for each modality.
- create_input_mapper(model_config: ModelConfig)
Create an input mapper (see map_input()) for a specific model.
- create_processor(model_config: ModelConfig, tokenizer: transformers.PreTrainedTokenizer | transformers.PreTrainedTokenizerFast | MistralTokenizer) → BaseMultiModalProcessor
Create a multi-modal processor for a specific model and tokenizer.
- get_max_multimodal_tokens(model_config: ModelConfig) → int
Get the maximum number of multi-modal tokens for profiling the memory usage of a model.
See MultiModalPlugin.get_max_multimodal_tokens() for more details.
Note
This should be called after init_mm_limits_per_prompt().
- get_max_tokens_by_modality(model_config: ModelConfig) → Mapping[str, int]
Get the maximum number of tokens from each modality for profiling the memory usage of a model.
See MultiModalPlugin.get_max_multimodal_tokens() for more details.
Note
This should be called after init_mm_limits_per_prompt().
- get_mm_limits_per_prompt(model_config: ModelConfig) → Mapping[str, int]
Get the maximum number of multi-modal input instances for each modality that are allowed per prompt for a model class.
Note
This should be called after init_mm_limits_per_prompt().
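The three query methods above share this precondition. A minimal sketch of the expected call order, written as a helper function so that the ModelConfig (whose construction is out of scope here) is passed in by the caller:

from vllm.config import ModelConfig
from vllm.multimodal import MULTIMODAL_REGISTRY

def profile_mm_budget(model_config: ModelConfig) -> None:
    # The per-prompt limits must be initialized before any of the queries below.
    MULTIMODAL_REGISTRY.init_mm_limits_per_prompt(model_config)

    limits = MULTIMODAL_REGISTRY.get_mm_limits_per_prompt(model_config)    # e.g. {"image": 1}
    total = MULTIMODAL_REGISTRY.get_max_multimodal_tokens(model_config)
    by_modality = MULTIMODAL_REGISTRY.get_max_tokens_by_modality(model_config)
    print(limits, total, by_modality)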
- has_processor(model_config: ModelConfig) → bool
Test whether a multi-modal processor is defined for a specific model.
- init_mm_limits_per_prompt(model_config: ModelConfig) → None
Initialize the maximum number of multi-modal input instances for each modality that are allowed per prompt for a model class.
- map_input(model_config: ModelConfig, data: Mapping[str, Any | List[Any]], mm_processor_kwargs: Dict[str, Any] | None = None) → MultiModalKwargs
Apply an input mapper to the data passed to the model.
The data belonging to each modality is passed to the corresponding plugin, which in turn converts the data into keyword arguments via the input mapper registered for that model.
See MultiModalPlugin.map_input() for more details.
Note
This should be called after init_mm_limits_per_prompt().
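A minimal sketch of dispatching image data through the registry (the helper function and its arguments are illustrative; init_mm_limits_per_prompt() is assumed to have been called already):

from PIL import Image

from vllm.config import ModelConfig
from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalKwargs

def map_image_for_model(model_config: ModelConfig, image: Image.Image) -> MultiModalKwargs:
    # The data mapping is keyed by modality; the image plugin dispatches to the
    # input mapper registered for this model and returns forward() keyword arguments.
    return MULTIMODAL_REGISTRY.map_input(model_config, {"image": image})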
- register_image_input_mapper(mapper: Callable[[InputContext, object | List[object]], MultiModalKwargs] | None = None)
Register an input mapper for image data to a model class.
See MultiModalPlugin.register_input_mapper() for more details.
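A minimal sketch of registering a custom image input mapper via the global registry; the mapper body and the model class are hypothetical, and the decorator must run before the model is used:

import torch

from vllm.inputs import InputContext
from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalKwargs

def my_image_input_mapper(ctx: InputContext, data: object) -> MultiModalKwargs:
    # Hypothetical mapper: assumes the incoming image data is already array-like
    # and turns it into the keyword arguments consumed by the model's forward().
    pixel_values = torch.as_tensor(data)
    return MultiModalKwargs({"pixel_values": pixel_values})

@MULTIMODAL_REGISTRY.register_image_input_mapper(my_image_input_mapper)
class MyMultiModalModel(torch.nn.Module):  # hypothetical model class
    ...

Calling the decorator with no mapper registers the default input mapper for image data instead.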
- register_input_mapper(data_type_key: str, mapper: Callable[[InputContext, object | List[object]], MultiModalKwargs] | None = None)
Register an input mapper for a specific modality to a model class.
See MultiModalPlugin.register_input_mapper() for more details.
- register_max_image_tokens(max_mm_tokens: int | Callable[[InputContext], int] | None = None)
Register the maximum number of image tokens, corresponding to a single image, that are passed to the language model for a model class.
- register_max_multimodal_tokens(data_type_key: str, max_mm_tokens: int | Callable[[InputContext], int] | None = None)
Register the maximum number of tokens, corresponding to a single instance of multimodal data belonging to a specific modality, that are passed to the language model for a model class.
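A minimal sketch of registering such a token budget for a model class; the token count and the model class are hypothetical:

import torch

from vllm.inputs import InputContext
from vllm.multimodal import MULTIMODAL_REGISTRY

def get_my_max_image_tokens(ctx: InputContext) -> int:
    # Hypothetical: a fixed number of placeholder tokens per image.
    return 576

# register_max_image_tokens is the image-specific form of
# register_max_multimodal_tokens(data_type_key, ...).
@MULTIMODAL_REGISTRY.register_max_image_tokens(get_my_max_image_tokens)
class MyMultiModalModel(torch.nn.Module):  # hypothetical model class
    ...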
- register_plugin(plugin: MultiModalPlugin) → None
Register a multi-modal plugin so it can be recognized by vLLM.
- register_processor(factory: Callable[[InputProcessingContext], BaseMultiModalProcessor])
Register a multi-modal processor to a model class. The processor is constructed lazily, hence a factory method should be passed.
When the model receives multi-modal data, the provided function is invoked to transform the data into a dictionary of model inputs.
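A minimal sketch of registering a processor factory; the processor class, its constructor call, the model class, and the import paths for InputProcessingContext and BaseMultiModalProcessor are assumptions for illustration:

import torch

from vllm.inputs.registry import InputProcessingContext
from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.multimodal.processing import BaseMultiModalProcessor

class MyMultiModalProcessor(BaseMultiModalProcessor):
    # Hypothetical processor; a real subclass must implement the abstract
    # methods of BaseMultiModalProcessor.
    ...

def build_my_processor(ctx: InputProcessingContext) -> BaseMultiModalProcessor:
    # Invoked lazily, only once the model actually receives multi-modal data;
    # the constructor signature is assumed here.
    return MyMultiModalProcessor(ctx)

@MULTIMODAL_REGISTRY.register_processor(build_my_processor)
class MyMultiModalModel(torch.nn.Module):  # hypothetical model class
    ...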
Base Classes
- vllm.multimodal.NestedTensors
alias of Union[List[NestedTensors], List[Tensor], Tensor, Tuple[Tensor, ...]]
- vllm.multimodal.BatchedTensorInputs
alias of Dict[str, Union[List[NestedTensors], List[Tensor], Tensor, Tuple[Tensor, ...]]]
- final class vllm.multimodal.MultiModalDataBuiltins
Bases: TypedDict
Type annotations for modality types predefined by vLLM.
- audio: ndarray | List[float] | Tuple[ndarray, float] | List[ndarray | List[float] | Tuple[ndarray, float]]
The input audio(s).
- image: Image | ndarray | torch.Tensor | List[Image | ndarray | torch.Tensor]
The input image(s).
- video: List[Image] | ndarray | torch.Tensor | List[ndarray] | List[torch.Tensor] | List[List[Image] | ndarray | torch.Tensor | List[ndarray] | List[torch.Tensor]]
The input video(s).
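For example, a multi-modal data mapping covering the built-in keys might look like the following sketch (file name, array shapes, and sampling rate are illustrative); such a mapping is what goes into the multi_modal_data field of a prompt:

import numpy as np
from PIL import Image

# Each key matches a field of MultiModalDataBuiltins; values follow the
# annotations above.
mm_data = {
    "image": Image.open("photo.jpg"),                        # a single PIL image
    "audio": (np.zeros(16000, dtype=np.float32), 16000.0),   # (waveform, sampling rate)
    "video": np.zeros((8, 224, 224, 3), dtype=np.uint8),     # stacked frames as an ndarray
}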
- class vllm.multimodal.MultiModalKwargs(dict=None, /, **kwargs)
Bases: UserDict[str, Union[List[NestedTensors], List[Tensor], Tensor, Tuple[Tensor, ...]]]
A dictionary that represents the keyword arguments to forward().
- static batch(inputs_list: List[MultiModalKwargs]) → Dict[str, List[List[NestedTensors] | List[torch.Tensor] | torch.Tensor | Tuple[torch.Tensor, ...]] | List[torch.Tensor] | torch.Tensor | Tuple[torch.Tensor, ...]]
Batch multiple inputs together into a dictionary.
The resulting dictionary has the same keys as the inputs. If the corresponding value from each input is a tensor and they all share the same shape, the output value is a single batched tensor; otherwise, the output value is a list containing the original value from each input.
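A minimal sketch of this behaviour (key names and tensor shapes are illustrative):

import torch

from vllm.multimodal import MultiModalKwargs

a = MultiModalKwargs({"pixel_values": torch.zeros(3, 224, 224)})
b = MultiModalKwargs({"pixel_values": torch.zeros(3, 224, 224)})

# Same key, same shape: the two values are combined into one batched tensor
# with a leading batch dimension, i.e. shape (2, 3, 224, 224).
batched = MultiModalKwargs.batch([a, b])
print(batched["pixel_values"].shape)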
- class vllm.multimodal.MultiModalPlugin
Bases: ABC
Base class that defines data processing logic for a specific modality.
In particular, we adopt a registry pattern to dispatch data processing according to the model being used (considering that different models may process the same data differently). This registry is in turn used by MultiModalRegistry, which acts at a higher level (i.e., the modality of the data).
- get_max_multimodal_tokens(model_config: ModelConfig) → int
Get the maximum number of multi-modal tokens for profiling the memory usage of a model.
If this registry is not applicable to the model, 0 is returned.
The model is identified by model_config.
- map_input(model_config: ModelConfig, data: Any | List[Any], mm_processor_kwargs: Dict[str, Any] | None) → MultiModalKwargs
Transform the data into a dictionary of model inputs using the input mapper registered for that model.
The model is identified by model_config.
- Raises:
TypeError – If the data type is not supported.
- register_input_mapper(mapper: Callable[[InputContext, object | List[object]], MultiModalKwargs] | None = None)
Register an input mapper to a model class.
When the model receives input data that matches the modality served by this plugin (see get_data_key()), the provided function is invoked to transform the data into a dictionary of model inputs.
If None is provided, then the default input mapper is used instead.
- register_max_multimodal_tokens(max_mm_tokens: int | Callable[[InputContext], int] | None = None)
Register the maximum number of tokens, corresponding to a single instance of multimodal data, that are passed to the language model for a model class.
If None is provided, then the default calculation is used instead.