Data Processing
Module Contents
- class vllm.multimodal.processing.PromptReplacementDetails(full: str | list[int], features: str | list[int])
Details about the replacement token sequence or text.
- vllm.multimodal.processing.PromptRepl
The replacement token sequence or text.
If only part of the replacement corresponds to feature placeholders, you can use PromptReplacementDetails to specify which part.
alias of Union[str, list[int], PromptReplacementDetails]
- class vllm.multimodal.processing.PromptReplacement(modality: str, target: str | list[int], replacement: Callable[[int], str | list[int] | PromptReplacementDetails] | str | list[int] | PromptReplacementDetails)
Defines how to replace portions of an input prompt with placeholder tokens.
Example
For each image, replace one <image> input placeholder in the prompt with a number of <image> feature placeholders equal to the feature size of the vision encoder:

    PromptReplacement(
        modality="image",
        target="<image>",
        replacement="<image>" * image_feature_size,
    )

As above, but further pad the feature placeholders with <image_bos> and <image_eos>, which are not supposed to be passed to the vision encoder:

    PromptReplacement(
        modality="image",
        target="<image>",
        replacement=PromptReplacementDetails(
            full="".join([
                "<image_bos>",
                "<image>" * image_feature_size,
                "<image_eos>",
            ]),
            features="<image>" * image_feature_size,
        ),
    )

To avoid unnecessary tokenization during prompt replacement, we recommend passing token sequences instead of text:

    PromptReplacement(
        modality="image",
        target=[image_token_id],
        replacement=PromptReplacementDetails(
            full=([image_bos_id] + [image_token_id] * image_feature_size
                  + [image_eos_id]),
            features=[image_token_id] * image_feature_size,
        ),
    )

- replacement: Callable[[int], str | list[int] | PromptReplacementDetails] | str | list[int] | PromptReplacementDetails
Given the index of the processed item within modality, output the replacement token sequence (or text).
For convenience, you can directly pass in the replacement token sequence (or text) instead of a function if it does not depend on the input.
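When the replacement depends on the item, for instance because images of different resolutions produce different numbers of features, a function of the item index can be passed instead. A minimal sketch follows; the token IDs (`IMAGE_TOKEN_ID`, `IMAGE_BOS_ID`, `IMAGE_EOS_ID`) and the `feature_sizes` list are hypothetical values chosen for illustration, and the function returns a plain token list rather than a `PromptReplacementDetails`:

```python
# Hypothetical token IDs and per-image feature sizes, for illustration only.
IMAGE_TOKEN_ID = 32000
IMAGE_BOS_ID = 32001
IMAGE_EOS_ID = 32002
feature_sizes = [4, 2]  # e.g. the first image yields 4 features, the second 2


def get_replacement(item_idx: int) -> list[int]:
    """Return the replacement token sequence for the item_idx-th image."""
    num_features = feature_sizes[item_idx]
    return [IMAGE_BOS_ID] + [IMAGE_TOKEN_ID] * num_features + [IMAGE_EOS_ID]


# A function like this could be passed as `replacement=get_replacement`.
print(get_replacement(0))  # [32001, 32000, 32000, 32000, 32000, 32002]
print(get_replacement(1))  # [32001, 32000, 32000, 32002]
```

To mark only the image tokens as features, such a function could instead return a `PromptReplacementDetails`, in the same way as the padded example for `PromptReplacement`.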
- vllm.multimodal.processing.full_groupby_modality(values: Iterable[_M]) → ItemsView[str, list[_M]]
Convenience function to apply full_groupby() based on modality.
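The grouping can be sketched as follows, assuming that `full_groupby()` collects every value with the same key into one group (rather than only consecutive runs, as `itertools.groupby` does) and that each value exposes a `modality` attribute. `Item` and `group_by_modality` are hypothetical names:

```python
from collections import defaultdict
from collections.abc import ItemsView, Iterable
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical stand-in for any value that exposes a `modality` attribute.
    modality: str
    name: str


def group_by_modality(values: Iterable[Item]) -> ItemsView[str, list[Item]]:
    """Collect all values per modality, even when same-modality values are not adjacent."""
    groups: dict[str, list[Item]] = defaultdict(list)
    for value in values:
        groups[value.modality].append(value)
    return groups.items()


items = [Item("image", "img0"), Item("audio", "aud0"), Item("image", "img1")]
for modality, group in group_by_modality(items):
    print(modality, [item.name for item in group])
```

Because dictionaries preserve insertion order, modalities appear in the order they are first seen, and items keep their original order within each group.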
- class vllm.multimodal.processing.BoundPromptReplacement(tokenizer: transformers.PreTrainedTokenizer | transformers.PreTrainedTokenizerFast | TokenizerBase, modality: str, _target: str | list[int], _replacement: Callable[[int], str | list[int] | PromptReplacementDetails] | str | list[int] | PromptReplacementDetails)
A PromptReplacement bound to a tokenizer to automatically convert target and the result of get_replacement() between token sequence and text representations.
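The idea of binding can be illustrated with a toy tokenizer. In this sketch, `ToyTokenizer` (a whitespace tokenizer over a fixed vocabulary) and `BoundReplacementSketch` are hypothetical stand-ins rather than vLLM classes; the real class works with Hugging Face tokenizers and may differ in its details:

```python
from dataclasses import dataclass
from typing import Callable, Union

PromptSeq = Union[str, list[int]]


class ToyTokenizer:
    """Hypothetical whitespace tokenizer over a fixed vocabulary."""

    def __init__(self, vocab: list[str]) -> None:
        self.id_of = {tok: i for i, tok in enumerate(vocab)}
        self.tok_of = dict(enumerate(vocab))

    def encode(self, text: str) -> list[int]:
        return [self.id_of[tok] for tok in text.split()]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.tok_of[i] for i in ids)


@dataclass
class BoundReplacementSketch:
    tokenizer: ToyTokenizer
    modality: str
    target: PromptSeq
    replacement: Union[Callable[[int], PromptSeq], PromptSeq]

    def target_token_ids(self) -> list[int]:
        # Encode text targets; token targets are returned unchanged.
        if isinstance(self.target, str):
            return self.tokenizer.encode(self.target)
        return self.target

    def target_text(self) -> str:
        # Decode token targets; text targets are returned unchanged.
        if isinstance(self.target, str):
            return self.target
        return self.tokenizer.decode(self.target)

    def get_replacement_token_ids(self, item_idx: int) -> list[int]:
        # Call the replacement first if it is a function of the item index.
        repl = self.replacement
        if callable(repl):
            repl = repl(item_idx)
        return self.tokenizer.encode(repl) if isinstance(repl, str) else repl


tok = ToyTokenizer(["<image>", "describe", "this"])
bound = BoundReplacementSketch(tok, "image", "<image>", "<image> <image> <image>")
print(bound.target_token_ids())            # [0]
print(bound.get_replacement_token_ids(0))  # [0, 0, 0]
```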
- vllm.multimodal.processing.iter_token_matches(token_ids: list[int], match_ids: list[int]) → Generator[_TokenMatch]
Yield each occurrence of match_ids in token_ids.
Note that empty matches are ignored.
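A minimal sketch of the search, assuming a left-to-right scan that yields non-overlapping matches (the description above does not say how overlaps are treated). `TokenMatch` is a hypothetical stand-in for the internal `_TokenMatch` type:

```python
from collections.abc import Generator
from typing import NamedTuple


class TokenMatch(NamedTuple):
    # Hypothetical stand-in for the internal _TokenMatch type.
    start_idx: int
    end_idx: int


def iter_token_matches_sketch(
    token_ids: list[int], match_ids: list[int]
) -> Generator[TokenMatch, None, None]:
    """Yield non-overlapping occurrences of match_ids in token_ids, left to right."""
    match_len = len(match_ids)
    if match_len == 0:
        return  # empty matches are ignored
    start_idx = 0
    while start_idx + match_len <= len(token_ids):
        end_idx = start_idx + match_len
        if token_ids[start_idx:end_idx] == match_ids:
            yield TokenMatch(start_idx, end_idx)
            start_idx = end_idx  # resume after this match, so matches never overlap
        else:
            start_idx += 1


print(list(iter_token_matches_sketch([1, 9, 9, 2, 9, 9, 9], [9, 9])))
# [TokenMatch(start_idx=1, end_idx=3), TokenMatch(start_idx=4, end_idx=6)]
```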
- class vllm.multimodal.processing.PlaceholderFeaturesInfo(modality: str, item_idx: int, start_idx: int, tokens: list[int])
- vllm.multimodal.processing.find_token_matches(prompt: list[int], prompt_repls: Sequence[BoundPromptReplacement]) → list[vllm.multimodal.processing._PromptReplacementTokenMatch]
Return each target of prompt_repls found in prompt.
- vllm.multimodal.processing.find_text_matches(prompt: str, prompt_repls: Sequence[BoundPromptReplacement]) → list[vllm.multimodal.processing._PromptReplacementTextMatch]
Return each target of prompt_repls found in prompt.
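For the text variant, a sketch can rely on Python's `re` module, escaping each target so that any regex metacharacters in it, such as `[` or `.`, are matched literally. Here each replacement is reduced to a hypothetical `(modality, target)` pair, and matches are reported as a hypothetical `TextMatch` record:

```python
import re
from typing import NamedTuple


class TextMatch(NamedTuple):
    # Hypothetical match record: which modality, and where its target was found.
    modality: str
    start_idx: int
    end_idx: int


def find_text_matches_sketch(
    prompt: str, targets: list[tuple[str, str]]
) -> list[TextMatch]:
    """Return every occurrence of each (modality, target) pair in prompt."""
    matches = []
    for modality, target in targets:
        # re.escape makes the target match literally; finditer scans
        # for non-overlapping occurrences from left to right.
        for m in re.finditer(re.escape(target), prompt):
            matches.append(TextMatch(modality, m.start(), m.end()))
    return matches


prompt = "<image> What is in <image> and <audio>?"
for match in find_text_matches_sketch(prompt, [("image", "<image>"), ("audio", "<audio>")]):
    print(match)
```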
- vllm.multimodal.processing.replace_token_matches(prompt: list[int], mm_matches: Mapping[str, Sequence[_PromptReplacementTokenMatch]], mm_item_counts: Mapping[str, int]) → list[int]
Apply the replacements in mm_matches to prompt.
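The replacement step can be sketched as a single left-to-right pass. This sketch assumes that matches do not overlap, simplifies each match to a `(start_idx, end_idx)` span, and uses a hypothetical per-modality `get_replacement` mapping in place of the replacement information carried by the real match objects. It also assumes that `mm_item_counts` caps how many matches of each modality are replaced, with the n-th match of a modality receiving the replacement for item n; extra placeholders are left untouched here, which may not match vLLM's behavior:

```python
from collections import defaultdict
from collections.abc import Callable, Mapping, Sequence


def replace_token_matches_sketch(
    prompt: list[int],
    mm_matches: Mapping[str, Sequence[tuple[int, int]]],
    get_replacement: Mapping[str, Callable[[int], list[int]]],
    mm_item_counts: Mapping[str, int],
) -> list[int]:
    """Replace matched (start_idx, end_idx) spans, one per multi-modal item."""
    # Merge the spans of all modalities and process them by start position.
    ordered = sorted(
        (start, end, modality)
        for modality, spans in mm_matches.items()
        for start, end in spans
    )
    out: list[int] = []
    prev_end = 0
    next_item_idx: dict[str, int] = defaultdict(int)
    for start, end, modality in ordered:
        item_idx = next_item_idx[modality]
        if item_idx >= mm_item_counts.get(modality, 0):
            continue  # more matches than items: keep the original tokens
        out.extend(prompt[prev_end:start])  # unchanged tokens before the match
        out.extend(get_replacement[modality](item_idx))
        prev_end = end
        next_item_idx[modality] += 1
    out.extend(prompt[prev_end:])  # unchanged tail of the prompt
    return out


sizes = [3, 2]  # hypothetical feature sizes of two images
repl = {"image": lambda i: [99] * sizes[i]}  # 99: hypothetical image token ID
prompt_ids = [1, 99, 5, 99, 2]
spans = {"image": [(1, 2), (3, 4)]}
print(replace_token_matches_sketch(prompt_ids, spans, repl, {"image": 2}))
# [1, 99, 99, 99, 5, 99, 99, 2]
```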
- vllm.multimodal.processing.replace_text_matches(prompt: str, mm_matches: Mapping[str, Sequence[_PromptReplacementTextMatch]], mm_item_counts: Mapping[str, int]) → str
Apply the replacements in mm_matches to prompt.
- class vllm.multimodal.processing.BaseProcessingInfo(ctx: InputProcessingContext)
Base class to provide the information necessary for data processing.
- get_hf_processor(**kwargs: object) → transformers.ProcessorMixin
Subclasses can override this method to handle specific kwargs from model config or user inputs.
- class vllm.multimodal.processing.BaseMultiModalProcessor(info: _I, dummy_inputs: BaseDummyInputsBuilder[_I], *, cache: ProcessingCache | None = None, enable_sanity_checks: bool = True)
Abstract base class to process multi-modal inputs to be used in vLLM.
Not to be confused with transformers.ProcessorMixin.
- apply(prompt: str | list[int], mm_data: Mapping[str, Any | list[Any]], hf_processor_mm_kwargs: Mapping[str, object]) → MultiModalInputs
Process multi-modal inputs to be used in vLLM.
The main steps are:
1. Apply the HF processor to the prompt text and multi-modal data together, outputting token IDs and processed tensors.
2. Find and replace sequences in the token IDs with placeholder tokens. The number of placeholder tokens equals the feature size of the multi-modal data output by the multi-modal encoder.
3. Extract information about the placeholder tokens from the processed token IDs.
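The three steps can be illustrated end to end on toy data. Everything here is hypothetical: the HF processor is replaced by a stub that reports a fixed feature size per image, a single token ID (`IMAGE_TOKEN_ID`) serves as both the input and the feature placeholder, and `PlaceholderInfo` merely mirrors the fields of `PlaceholderFeaturesInfo`:

```python
from dataclasses import dataclass

IMAGE_TOKEN_ID = 99  # hypothetical placeholder token


@dataclass
class PlaceholderInfo:
    # Mirrors the fields of PlaceholderFeaturesInfo, for illustration.
    modality: str
    item_idx: int
    start_idx: int
    tokens: list[int]


def stub_hf_processor(token_ids: list[int], num_images: int) -> tuple[list[int], list[int]]:
    """Step 1 stand-in: return the token IDs plus a feature size per image."""
    return list(token_ids), [3] * num_images  # assume 3 features per image


def expand_placeholders(token_ids: list[int], feature_sizes: list[int]) -> list[int]:
    """Step 2: replace the i-th image token with feature_sizes[i] copies."""
    out: list[int] = []
    item_idx = 0
    for tok in token_ids:
        if tok == IMAGE_TOKEN_ID and item_idx < len(feature_sizes):
            out.extend([IMAGE_TOKEN_ID] * feature_sizes[item_idx])
            item_idx += 1
        else:
            out.append(tok)
    return out


def locate_placeholders(token_ids: list[int]) -> list[PlaceholderInfo]:
    """Step 3: record each run of placeholder tokens (assumes runs are not adjacent)."""
    infos: list[PlaceholderInfo] = []
    idx = 0
    while idx < len(token_ids):
        if token_ids[idx] != IMAGE_TOKEN_ID:
            idx += 1
            continue
        end = idx
        while end < len(token_ids) and token_ids[end] == IMAGE_TOKEN_ID:
            end += 1
        infos.append(PlaceholderInfo("image", len(infos), idx, token_ids[idx:end]))
        idx = end
    return infos


prompt_ids = [1, IMAGE_TOKEN_ID, 7, IMAGE_TOKEN_ID, 2]
ids, sizes = stub_hf_processor(prompt_ids, num_images=2)
expanded = expand_placeholders(ids, sizes)
for info in locate_placeholders(expanded):
    print(info)
```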
- class vllm.multimodal.processing.EncDecMultiModalProcessor(info: _I, dummy_inputs: BaseDummyInputsBuilder[_I], *, cache: ProcessingCache | None = None, enable_sanity_checks: bool = True)
- abstract create_encoder_prompt(prompt: str | list[int], mm_data: Mapping[str, Any | list[Any]]) → str | list[int]
Create input prompt for the encoder.
- apply(prompt: str | list[int], mm_data: Mapping[str, Any | list[Any]], hf_processor_mm_kwargs: Mapping[str, object]) → MultiModalEncDecInputs
Process multi-modal inputs to be used in vLLM. The main processing steps are modified to fit encoder-decoder models:
1. Create the encoder prompt from the input prompt text.
2. Apply the HF processor to the encoder prompt.
3. Copy the input prompt text as the decoder prompt inputs.
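A sketch of that flow, in which every component is hypothetical: the toy vocabulary, a `create_encoder_prompt` that prepends one `<image>` token per image (an arbitrary choice for illustration; real models define their own encoder prompt), and a stand-in `hf_process` that only tokenizes text. The keys of the returned dictionary are illustrative names, not necessarily the fields of `MultiModalEncDecInputs`:

```python
from typing import Any, Callable


def enc_dec_apply_sketch(
    prompt: str,
    mm_data: dict[str, Any],
    create_encoder_prompt: Callable[[str, dict[str, Any]], str],
    hf_process: Callable[[str, dict[str, Any]], list[int]],
    tokenize: Callable[[str], list[int]],
) -> dict[str, list[int]]:
    # 1. Create the encoder prompt from the input prompt text.
    encoder_prompt = create_encoder_prompt(prompt, mm_data)
    # 2. Apply the (stand-in) HF processor to the encoder prompt.
    encoder_token_ids = hf_process(encoder_prompt, mm_data)
    # 3. Copy the input prompt text as the decoder prompt.
    decoder_token_ids = tokenize(prompt)
    return {"encoder": encoder_token_ids, "decoder": decoder_token_ids}


# Hypothetical toy vocabulary and helpers.
VOCAB = {"<image>": 0, "describe": 1, "the": 2, "picture": 3}


def tokenize(text: str) -> list[int]:
    return [VOCAB[word] for word in text.split()]


def create_encoder_prompt(prompt: str, mm_data: dict[str, Any]) -> str:
    # Prepend one <image> token per image to form the encoder prompt.
    return "<image> " * len(mm_data.get("image", [])) + prompt


def hf_process(text: str, mm_data: dict[str, Any]) -> list[int]:
    # Stand-in for the HF processor: only tokenizes the text.
    return tokenize(text)


result = enc_dec_apply_sketch(
    "describe the picture", {"image": ["img0"]}, create_encoder_prompt, hf_process, tokenize
)
print(result)  # {'encoder': [0, 1, 2, 3], 'decoder': [1, 2, 3]}
```

Only the encoder side sees the multi-modal placeholders in this sketch; the decoder receives the original prompt unchanged.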