vllm.multimodal.processing#
Module Contents#
Classes#
| Class | Description |
| --- | --- |
| `BaseMultiModalProcessor` | Abstract base class to process multi-modal inputs to be used in vLLM. |
| `BaseProcessingInfo` | Base class to provide the information necessary for data processing. |
| `BoundPromptUpdate` | A `PromptUpdate` bound to a tokenizer to automatically convert between token sequence and text representations. |
| `PromptIndex` | Resolves to an index in the prompt. |
| `PromptInsertion` | Defines how to insert placeholder tokens into a prompt. |
| `PromptReplacement` | Defines how to replace portions of an input prompt with placeholder tokens. |
| `PromptUpdate` | Defines how to update a prompt with placeholder tokens. |
| `PromptUpdateDetails` | Details about the token sequence or text that is part of the update. |
Functions#
| Function | Description |
| --- | --- |
| `apply_text_matches` | Apply the updates in `mm_matches` to `prompt`. |
| `apply_token_matches` | Apply the updates in `mm_matches` to `prompt`. |
| `find_text_matches` | Return each target of `prompt_updates` found in `prompt`. |
| `find_token_matches` | Return each target of `prompt_updates` found in `prompt`. |
| `full_groupby_modality` | Convenience function to apply `full_groupby()` based on modality. |
| `iter_token_matches` | Yield each occurrence of `match_ids` in `token_ids`. |
| `replace_token_matches` | Replace each occurrence of `match_ids` in `token_ids`. |
Data#
| Data | Description |
| --- | --- |
| `MultiModalHashes` | A collection of hashes with a similar structure as `MultiModalKwargs`. |
| `PromptSeq` | A token sequence (list of token IDs) or text. |
| `PromptTarget` | The token sequence or text to update. |
| `PromptUpdateContent` | Given the index of the processed item within `modality`, output the corresponding token sequence (or text). |
| `PromptUpdateInfo` | The token sequence or text that is part of the update. |
API#
- class vllm.multimodal.processing.BaseMultiModalProcessor(info: vllm.multimodal.processing._I, dummy_inputs: BaseDummyInputsBuilder[_I], *, cache: Optional[vllm.multimodal.processing.ProcessingCache] = None)[source]#
Bases: abc.ABC, typing.Generic[vllm.multimodal.processing._I]
Abstract base class to process multi-modal inputs to be used in vLLM.
Not to be confused with transformers.ProcessorMixin.
Initialization
- apply(prompt: Union[str, list[int]], mm_data: vllm.multimodal.inputs.MultiModalDataDict, hf_processor_mm_kwargs: collections.abc.Mapping[str, object], return_mm_hashes: bool = False) vllm.multimodal.inputs.MultiModalInputs [source]#
Process multi-modal inputs to be used in vLLM.
The main steps are:
1. Apply the HF processor to the prompt text and multi-modal data together, outputting token IDs and processed tensors.
2. Find and update sequences in the token IDs with placeholder tokens. The number of placeholder tokens equals the feature size of the multi-modal data outputted by the multi-modal encoder.
3. Extract information about the placeholder tokens from the processed token IDs.
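A minimal usage sketch: `processor` is assumed to be a concrete subclass instance (in practice, vLLM constructs it via the multimodal registry), and the chat-style prompt is illustrative:

```python
from PIL import Image

image = Image.open("example.jpg")  # any input image

# `processor` is a concrete BaseMultiModalProcessor subclass instance
# (assumed here; normally built by vLLM's multimodal registry).
mm_inputs = processor.apply(
    prompt="USER: <image>\nWhat is in this picture? ASSISTANT:",
    mm_data={"image": [image]},
    hf_processor_mm_kwargs={},
)
# mm_inputs now carries the token IDs with placeholder tokens expanded,
# the processed tensors, and the placeholder position information.
```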
- class vllm.multimodal.processing.BaseProcessingInfo(ctx: vllm.inputs.InputProcessingContext)[source]#
Base class to provide the information necessary for data processing.
Initialization
- get_allowed_mm_limits() collections.abc.Mapping[str, int] [source]#
Return the maximum allowed number of items for each modality.
- get_hf_processor(**kwargs: object) transformers.processing_utils.ProcessorMixin [source]#
Subclasses can override this method to handle specific kwargs from model config or user inputs.
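A sketch of such an override; `num_crops` is a hypothetical model-specific kwarg used only for illustration:

```python
from typing import Optional

from transformers.processing_utils import ProcessorMixin

from vllm.multimodal.processing import BaseProcessingInfo


class MyProcessingInfo(BaseProcessingInfo):
    def get_hf_processor(
        self,
        *,
        num_crops: Optional[int] = None,  # hypothetical model-specific kwarg
        **kwargs: object,
    ) -> ProcessorMixin:
        # Forward the kwarg to the HF processor only when it is set.
        if num_crops is not None:
            kwargs["num_crops"] = num_crops
        return super().get_hf_processor(**kwargs)
```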
- class vllm.multimodal.processing.BoundPromptUpdate[source]#
A `PromptUpdate` bound to a tokenizer to automatically convert `target` and the result of `get_content()` between token sequence and text representations.
- property content: vllm.multimodal.processing.PromptUpdateContent[source]#
The placeholder tokens that are part of the update.
- get_content(item_idx: int) vllm.multimodal.processing._BoundPromptContent [source]#
Given the index of the processed item within `modality`, output the token sequence (or text) to update.
- property mode: vllm.multimodal.processing.UpdateMode[source]#
Defines how to update the prompt.
- property target: Union[vllm.multimodal.processing._BoundPromptSequence, vllm.multimodal.processing.PromptIndex][source]#
The token sequence (or text) to update.
- class vllm.multimodal.processing.EncDecMultiModalProcessor(info: vllm.multimodal.processing._I, dummy_inputs: BaseDummyInputsBuilder[_I], *, cache: Optional[vllm.multimodal.processing.ProcessingCache] = None)[source]#
Bases: vllm.multimodal.processing.BaseMultiModalProcessor[vllm.multimodal.processing._I]
- apply(prompt: Union[str, list[int]], mm_data: vllm.multimodal.inputs.MultiModalDataDict, hf_processor_mm_kwargs: collections.abc.Mapping[str, object], return_mm_hashes: bool = False) vllm.multimodal.inputs.MultiModalEncDecInputs [source]#
Process multi-modal inputs to be used in vLLM. The main processing steps are modified to fit an encoder-decoder model:
1. Create the encoder prompt from the input prompt text.
2. Apply the HF processor to the encoder prompt.
3. Copy the input prompt text as the decoder prompt inputs.
- create_decoder_prompt(prompt: Union[str, list[int]], mm_data: vllm.multimodal.inputs.MultiModalDataDict) Union[str, list[int]] [source]#
Create input prompt for the decoder.
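A partial override sketch, assuming the model decodes from a lone BOS token while the encoder consumes the original prompt (the token ID is hypothetical):

```python
from typing import Union

from vllm.multimodal.inputs import MultiModalDataDict
from vllm.multimodal.processing import EncDecMultiModalProcessor


class MyEncDecProcessor(EncDecMultiModalProcessor):
    def create_decoder_prompt(
        self,
        prompt: Union[str, list[int]],
        mm_data: MultiModalDataDict,
    ) -> Union[str, list[int]]:
        # Assumed convention: decoding always starts from BOS; the original
        # prompt is used only on the encoder side.
        bos_token_id = 1  # hypothetical; read from the tokenizer in practice
        return [bos_token_id]
```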
- vllm.multimodal.processing.MultiModalHashes[source]#
None
A collection of hashes with a similar structure as `MultiModalKwargs`.
- class vllm.multimodal.processing.PlaceholderFeaturesInfo[source]#
- is_embed: Optional[torch.Tensor][source]#
None
- to_range() vllm.multimodal.inputs.PlaceholderRange [source]#
- class vllm.multimodal.processing.ProcessingCache(capacity_gb: float, *, debug_cache_hit_ratio_steps: Optional[int] = None)[source]#
Initialization
- get(model_id: str, modality: str, input_item: object, input_kwargs: collections.abc.Mapping[str, object]) Optional[vllm.multimodal.inputs.MultiModalKwargsItem] [source]#
Get a processed multi-modal item from the cache according to its dependencies, including:
- The model ID
- The modality of the item
- The original data item passed to the HF processor
- The configuration options of the HF processor
- get_item(model_id: str, modality: str, input_item: object, input_kwargs: collections.abc.Mapping[str, object]) vllm.multimodal.processing.ProcessingCacheOptionalItem [source]#
- static get_lru_cache(capacity_gb: float, value_type: type[vllm.multimodal.processing._V], *, debug: bool = False) vllm.utils.LRUCache[str, vllm.multimodal.processing._V] [source]#
- put(model_id: str, modality: str, input_item: object, input_kwargs: collections.abc.Mapping[str, object], output_kwargs: vllm.multimodal.inputs.MultiModalKwargsItem) None [source]#
Put a processed multi-modal item into the cache according to its dependencies (see `get()`).
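A sketch of the get/put round trip; `image` and `run_hf_processor` are hypothetical stand-ins for the real data item and the HF processing step:

```python
from vllm.multimodal.processing import ProcessingCache

cache = ProcessingCache(capacity_gb=4.0)
model_id = "llava-hf/llava-1.5-7b-hf"  # example model ID

item = cache.get(model_id, "image", image, input_kwargs={})
if item is None:
    # Cache miss: run the HF processor once, then store the result.
    item = run_hf_processor(image)  # hypothetical helper returning MultiModalKwargsItem
    cache.put(model_id, "image", image, {}, item)
```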
- put_item(item: vllm.multimodal.processing.ProcessingCacheItem) None [source]#
- class vllm.multimodal.processing.ProcessingCacheItem[source]#
Bases:
typing.NamedTuple
- class vllm.multimodal.processing.ProcessingCacheOptionalItem[source]#
Bases:
typing.NamedTuple
- class vllm.multimodal.processing.PromptIndex[source]#
Resolves to an index in the prompt.
- get_match_index: collections.abc.Callable[[vllm.transformers_utils.tokenizer.AnyTokenizer, vllm.multimodal.processing.PromptSeq], Optional[int]][source]#
None
- class vllm.multimodal.processing.PromptIndexTargets[source]#
- static end() vllm.multimodal.processing.PromptIndex [source]#
Resolves to the end of the prompt (after the last token).
This results in a match even if the prompt is empty.
- static prefix(seq: vllm.multimodal.processing.PromptSeq) vllm.multimodal.processing.PromptIndex [source]#
Resolves to a location in the prompt after the given prefix.
- static start() vllm.multimodal.processing.PromptIndex [source]#
Resolves to the start of the prompt (before the first token).
This results in a match even if the prompt is empty.
- class vllm.multimodal.processing.PromptInsertion[source]#
Bases:
vllm.multimodal.processing.PromptUpdate
Defines how to insert placeholder tokens into a prompt.
Example:
For each image, insert a number of `<image>` feature placeholders equal to the feature size of the vision encoder after the `<s>` token:

```python
PromptInsertion(
    modality="image",
    target="<s>",
    insertion="<image>" * image_feature_size,
)
```
Insert these tokens at the start of the prompt:
```python
PromptInsertion(
    modality="image",
    target=PromptIndexTargets.start(),
    insertion="<image>" * image_feature_size,
)
```
Insert these tokens after the prefix `Images:`:

```python
PromptInsertion(
    modality="image",
    target=PromptIndexTargets.prefix("Images:"),
    insertion="<image>" * image_feature_size,
)
```
Insert these tokens at the end of the prompt:
```python
PromptInsertion(
    modality="image",
    target=PromptIndexTargets.end(),
    insertion="<image>" * image_feature_size,
)
```
- insertion: vllm.multimodal.processing.PromptUpdateContent[source]#
‘field(…)’
Given the index of the processed item within `modality`, output the token sequence (or text) to insert right after `target`.
For convenience, you can directly pass in the token sequence (or text) instead of a function if it does not depend on the input.
- property mode: vllm.multimodal.processing.UpdateMode[source]#
- class vllm.multimodal.processing.PromptReplacement[source]#
Bases:
vllm.multimodal.processing.PromptUpdate
Defines how to replace portions of an input prompt with placeholder tokens.
Example:
For each image, replace one `<image>` input placeholder in the prompt with a number of `<image>` feature placeholders equal to the feature size of the vision encoder:

```python
PromptReplacement(
    modality="image",
    target="<image>",
    replacement="<image>" * image_feature_size,
)
```
As above, but further pad the feature placeholders with `<image_bos>` and `<image_eos>`, which are not supposed to be passed to the vision encoder:

```python
PromptReplacement(
    modality="image",
    target="<image>",
    replacement=PromptUpdateDetails(
        full="".join([
            "<image_bos>",
            "<image>" * image_feature_size,
            "<image_eos>",
        ]),
        features="<image>" * image_feature_size,
    ),
)
```
To avoid unnecessary tokenization during prompt replacement, we recommend passing token sequences instead of text:

```python
PromptReplacement(
    modality="image",
    target=[image_token_id],
    replacement=PromptUpdateDetails(
        full=([image_bos_id]
              + [image_token_id] * image_feature_size
              + [image_eos_id]),
        features=[image_token_id] * image_feature_size,
    ),
)
```
- property mode: vllm.multimodal.processing.UpdateMode[source]#
- replacement: vllm.multimodal.processing.PromptUpdateContent[source]#
‘field(…)’
Given the index of the processed item within `modality`, output the token sequence (or text) to replace `target`.
For convenience, you can directly pass in the token sequence (or text) instead of a function if it does not depend on the input.
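For example, a sketch of the callable form for a model whose feature size varies per image (`image_feature_sizes` is a hypothetical precomputed list):

```python
PromptReplacement(
    modality="image",
    target=[image_token_id],
    replacement=lambda item_idx: [image_token_id] * image_feature_sizes[item_idx],
)
```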
- class vllm.multimodal.processing.PromptUpdate[source]#
Bases:
abc.ABC
Defines how to update a prompt with placeholder tokens.
- bind(tokenizer: vllm.transformers_utils.tokenizer.AnyTokenizer) vllm.multimodal.processing.BoundPromptUpdate [source]#
- abstract property content: vllm.multimodal.processing.PromptUpdateContent[source]#
The placeholder tokens that are part of the update.
- abstract property mode: vllm.multimodal.processing.UpdateMode[source]#
Defines how to update the prompt.
- vllm.multimodal.processing.PromptUpdateContent[source]#
None
Given the index of the processed item within `modality`, output the corresponding token sequence (or text).
For convenience, you can directly pass in the token sequence (or text) instead of a function if it does not depend on the input.
- class vllm.multimodal.processing.PromptUpdateDetails[source]#
Bases: typing.Generic[vllm.multimodal.processing._S]
Details about the token sequence or text that is part of the update.
- static from_seq(seq: vllm.multimodal.processing._S) PromptUpdateDetails[_S] [source]#
- is_embed: Optional[collections.abc.Callable[[_BoundPromptSequence], torch.Tensor]][source]#
None
Given `full`, return a boolean mask of shape `(len(full),)` indicating which positions of `full` to assign embeddings to.
`None` (default) means to assign embeddings to all positions of `full`.
The embeddings are obtained by calling `SupportsMultiModal.get_multimodal_embeddings`.
- static select_text(seq: vllm.multimodal.processing._S, embed_text: str) PromptUpdateDetails[_S] [source]#
- static select_token_id(seq: vllm.multimodal.processing._S, embed_token_id: int) PromptUpdateDetails[_S] [source]#
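For instance, the padded-replacement example above can be expressed with `select_token_id()` so that only the `<image>` positions receive embeddings (a sketch; token IDs are placeholders):

```python
PromptReplacement(
    modality="image",
    target=[image_token_id],
    replacement=PromptUpdateDetails.select_token_id(
        seq=([image_bos_id]
             + [image_token_id] * image_feature_size
             + [image_eos_id]),
        embed_token_id=image_token_id,
    ),
)
```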
- vllm.multimodal.processing.PromptUpdateInfo[source]#
None
The token sequence or text that is part of the update.
If only part of the content corresponds to feature placeholders, you can use `PromptUpdateDetails` to specify which part.
- vllm.multimodal.processing.apply_text_matches(prompt: str, mm_matches: collections.abc.Mapping[str, collections.abc.Sequence[vllm.multimodal.processing.PromptTargetMatch]], mm_item_counts: collections.abc.Mapping[str, int]) str [source]#
Apply the updates in `mm_matches` to `prompt`.
- vllm.multimodal.processing.apply_token_matches(prompt: list[int], mm_matches: collections.abc.Mapping[str, collections.abc.Sequence[vllm.multimodal.processing.PromptTargetMatch]], mm_item_counts: collections.abc.Mapping[str, int]) list[int] [source]#
Apply the updates in `mm_matches` to `prompt`.
- vllm.multimodal.processing.find_mm_placeholders(mm_prompt_updates: collections.abc.Mapping[str, collections.abc.Sequence[vllm.multimodal.processing.BoundPromptUpdate]], prompt: list[int], mm_item_counts: collections.abc.Mapping[str, int]) collections.abc.Mapping[str, list[vllm.multimodal.processing.PlaceholderFeaturesInfo]] [source]#
- vllm.multimodal.processing.find_text_matches(prompt: str, prompt_updates: collections.abc.Sequence[vllm.multimodal.processing.BoundPromptUpdate]) collections.abc.Sequence[vllm.multimodal.processing.PromptTargetMatch] [source]#
Return each target of `prompt_updates` found in `prompt`.
- vllm.multimodal.processing.find_token_matches(prompt: list[int], prompt_updates: collections.abc.Sequence[vllm.multimodal.processing.BoundPromptUpdate]) collections.abc.Sequence[vllm.multimodal.processing.PromptTargetMatch] [source]#
Return each target of `prompt_updates` found in `prompt`.
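A sketch of how these helpers fit together (`tokenizer`, `prompt_updates`, `prompt_token_ids`, and the item count are hypothetical):

```python
from vllm.multimodal.processing import apply_token_matches, find_token_matches

# Bind the updates to a tokenizer, locate their targets, then rewrite the prompt.
bound_updates = [update.bind(tokenizer) for update in prompt_updates]
mm_matches = {"image": find_token_matches(prompt_token_ids, bound_updates)}
new_token_ids = apply_token_matches(
    prompt_token_ids,
    mm_matches,
    mm_item_counts={"image": 2},  # e.g. two images in the request
)
```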
- vllm.multimodal.processing.full_groupby_modality(values: collections.abc.Iterable[vllm.multimodal.processing._M]) collections.abc.ItemsView[str, list[vllm.multimodal.processing._M]] [source]#
Convenience function to apply `full_groupby()` based on modality.
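A small usage sketch, assuming `bound_updates` is an iterable of objects exposing a modality attribute (such as `BoundPromptUpdate`):

```python
from vllm.multimodal.processing import full_groupby_modality

for modality, updates in full_groupby_modality(bound_updates):
    print(f"{modality}: {len(updates)} update(s)")
```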