Data Processing#

Module Contents#

vllm.multimodal.processing.PromptSeq[source]#

A token sequence (list of token IDs) or text.

alias of Union[str, list[int]]

class vllm.multimodal.processing.PromptIndex(get_match_index: Callable[[transformers.PreTrainedTokenizer | transformers.PreTrainedTokenizerFast | TokenizerBase, str | list[int]], int | None])[source]#

Resolves to an index in the prompt.

vllm.multimodal.processing.PromptTarget[source]#

The token sequence or text to update.

alias of Union[str, list[int], PromptIndex]

class vllm.multimodal.processing.PromptUpdateDetails(full: str | list[int], features: str | list[int])[source]#

Details about the token sequence or text that is part of the update.

full: str | list[int][source]#

The full content.

features: str | list[int][source]#

The part of the content that corresponds to feature placeholders; this will be replaced by the output of the vision encoder during model inference.

vllm.multimodal.processing.PromptUpdateInfo[source]#

The token sequence or text that is part of the update.

If only part of the content corresponds to feature placeholders, you can use PromptUpdateDetails to specify which part.

alias of Union[str, list[int], PromptUpdateDetails]

vllm.multimodal.processing.PromptUpdateContent[source]#

Given the index of the processed item within modality, output the corresponding token sequence (or text).

For convenience, you can directly pass in the token sequence (or text) instead of a function if it does not depend on the input.

alias of Union[Callable[[int], Union[str, list[int], PromptUpdateDetails]], str, list[int], PromptUpdateDetails]
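
For example, both forms below are valid content values (a minimal sketch; image_feature_size and image_feature_sizes are hypothetical values not defined by this module):

# Constant form: the same content for every processed item
content = "<image>" * image_feature_size

# Callable form: content that depends on the index of the processed item
def content_for_item(item_idx: int) -> str:
    return "<image>" * image_feature_sizes[item_idx]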

class vllm.multimodal.processing.UpdateMode(value, names=_not_given, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

An enumeration of the ways in which a prompt can be updated: by inserting new content or by replacing existing content.

class vllm.multimodal.processing.PromptUpdate(modality: str, target: str | list[int] | PromptIndex)[source]#

Defines how to update a prompt with placeholder tokens.

modality: str[source]#

The modality for which the update is made.

target: str | list[int] | PromptIndex[source]#

The token sequence (or text) to update.

abstract property content: Callable[[int], str | list[int] | PromptUpdateDetails] | str | list[int] | PromptUpdateDetails[source]#

The placeholder tokens that are part of the update.

abstract property mode: UpdateMode[source]#

Defines how to update the prompt.

class vllm.multimodal.processing.PromptInsertion(modality: str, target: str | list[int] | PromptIndex, insertion: Callable[[int], str | list[int] | PromptUpdateDetails] | str | list[int] | PromptUpdateDetails)[source]#

Defines how to insert placeholder tokens into a prompt.

Example

For each image, insert a number of <image> feature placeholders (equal to the feature size of the vision encoder) right after the <s> token:

PromptInsertion(
    modality="image",
    target="<s>",
    insertion="<image>" * image_feature_size,
)

Insert these tokens at the start of the prompt:

PromptInsertion(
    modality="image",
    target=PromptIndexTargets.start(),
    insertion="<image>" * image_feature_size,
)

Insert these tokens after the prefix "Images:":

PromptInsertion(
    modality="image",
    target=PromptIndexTargets.prefix("Images:"),
    insertion="<image>" * image_feature_size,
)

Insert these tokens at the end of the prompt:

PromptInsertion(
    modality="image",
    target=PromptIndexTargets.end(),
    insertion="<image>" * image_feature_size,
)

insertion: Callable[[int], str | list[int] | PromptUpdateDetails] | str | list[int] | PromptUpdateDetails[source]#

Given the index of the processed item within modality, output the token sequence (or text) to insert right after target.

For convenience, you can directly pass in the token sequence (or text) instead of a function if it does not depend on the input.

property content: Callable[[int], str | list[int] | PromptUpdateDetails] | str | list[int] | PromptUpdateDetails[source]#

The placeholder tokens that are part of the update.

property mode: UpdateMode[source]#

Defines how to update the prompt.

class vllm.multimodal.processing.PromptReplacement(modality: str, target: str | list[int] | PromptIndex, replacement: Callable[[int], str | list[int] | PromptUpdateDetails] | str | list[int] | PromptUpdateDetails)[source]#

Defines how to replace portions of an input prompt with placeholder tokens.

Example

For each image, replace one <image> input placeholder in the prompt with a number of <image> feature placeholders equal to the feature size of the vision encoder:

PromptReplacement(
    modality="image",
    target="<image>",
    replacement="<image>" * image_feature_size,
)

As above, but additionally wrap the feature placeholders with <image_bos> and <image_eos>, which should not be passed to the vision encoder:

PromptReplacement(
    modality="image",
    target="<image>",
    replacement=PromptUpdateDetails(
        full="".join([
            "<image_bos>",
            "<image>" * image_feature_size,
            "<image_eos>",
        ]),
        features="<image>" * image_feature_size,
    ),
)

To avoid unnecessary tokenization during prompt replacement, we recommend passing token sequences instead of text:

PromptReplacement(
    modality="image",
    target=[image_token_id],
    replacement=PromptUpdateDetails(
        full=([image_bos_id] + [image_token_id] * image_feature_size
              + [image_eos_id]),
        features=[image_token_id] * image_feature_size,
    ),
)

replacement: Callable[[int], str | list[int] | PromptUpdateDetails] | str | list[int] | PromptUpdateDetails[source]#

Given the index of the processed item within modality, output the token sequence (or text) to replace target.

For convenience, you can directly pass in the token sequence (or text) instead of a function if it does not depend on the input.

property content: Callable[[int], str | list[int] | PromptUpdateDetails] | str | list[int] | PromptUpdateDetails[source]#

The placeholder tokens that are part of the update.

property mode: UpdateMode[source]#

Defines how to update the prompt.

vllm.multimodal.processing.full_groupby_modality(values: Iterable[_M]) ItemsView[str, list[_M]][source]#

Convenience function to apply full_groupby() based on modality.
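
For example, a minimal sketch of grouping prompt updates by their modality attribute:

from vllm.multimodal.processing import (PromptReplacement,
                                        full_groupby_modality)

updates = [
    PromptReplacement("image", "<image>", "<image>" * 4),
    PromptReplacement("audio", "<audio>", "<audio>" * 8),
    PromptReplacement("image", "<image_2>", "<image>" * 4),
]

# Yields ("image", [<2 updates>]) and ("audio", [<1 update>])
for modality, group in full_groupby_modality(updates):
    print(modality, len(group))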

class vllm.multimodal.processing.BoundPromptUpdate(_origin: PromptUpdate, tokenizer: transformers.PreTrainedTokenizer | transformers.PreTrainedTokenizerFast | TokenizerBase)[source]#

A PromptUpdate bound to a tokenizer to automatically convert target and the result of get_content() between token sequence and text representations.

property target: _BoundPromptSequence | PromptIndex[source]#

The token sequence (or text) to update.

property content: Callable[[int], str | list[int] | PromptUpdateDetails] | str | list[int] | PromptUpdateDetails[source]#

The placeholder tokens that are part of the update.

property mode: UpdateMode[source]#

Defines how to update the prompt.

get_content(item_idx: int) _BoundPromptContent[source]#

Given the index of the processed item within modality, output the token sequence (or text) to update.

vllm.multimodal.processing.iter_token_matches(token_ids: list[int], match_ids: list[int]) Generator[_TokenMatch][source]#

Yield each occurrence of match_ids in token_ids.

Note that empty matches are ignored.
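
For example (a minimal sketch; the exact fields of the yielded match objects are internal, but each match corresponds to one occurrence of match_ids):

from vllm.multimodal.processing import iter_token_matches

token_ids = [1, 32000, 32000, 7, 32000, 32000]

# Two occurrences of [32000, 32000]: one starting at index 1,
# one starting at index 4
matches = list(iter_token_matches(token_ids, [32000, 32000]))
print(len(matches))  # 2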

class vllm.multimodal.processing.PlaceholderFeaturesInfo(modality: str, item_idx: int, start_idx: int, tokens: list[int])[source]#

Information about the feature placeholder tokens of a processed item: its modality, its index within that modality, the index in the prompt at which its placeholder tokens start, and the placeholder tokens themselves.

vllm.multimodal.processing.find_token_matches(prompt: list[int], prompt_updates: Sequence[BoundPromptUpdate]) Sequence[_PromptTargetMatch][source]#

Return each target of prompt_updates found in prompt.

vllm.multimodal.processing.find_text_matches(prompt: str, prompt_updates: Sequence[BoundPromptUpdate]) Sequence[_PromptTargetMatch][source]#

Return each target of prompt_updates found in prompt.

vllm.multimodal.processing.apply_token_matches(prompt: list[int], mm_matches: Mapping[str, Sequence[_PromptTargetMatch]], mm_item_counts: Mapping[str, int]) list[int][source]#

Apply the updates in mm_matches to prompt.

vllm.multimodal.processing.apply_text_matches(prompt: str, mm_matches: Mapping[str, Sequence[_PromptTargetMatch]], mm_item_counts: Mapping[str, int]) str[source]#

Apply the updates in mm_matches to prompt.

class vllm.multimodal.processing.BaseProcessingInfo(ctx: InputProcessingContext)[source]#

Base class to provide the information necessary for data processing.

get_hf_processor(**kwargs: object) transformers.ProcessorMixin[source]#

Subclasses can override this method to handle specific kwargs from model config or user inputs.

abstract get_supported_mm_limits() Mapping[str, int | None][source]#

Return the maximum supported number of items for each modality.

A value of None means that the number of items is unlimited.

Omitting a modality from the returned dictionary means that it is not supported at all.
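
For example, a sketch of an override for a hypothetical model that supports any number of images but at most one video, and no other modalities:

def get_supported_mm_limits(self) -> Mapping[str, int | None]:
    # None means unlimited images; "audio" is omitted, so it is
    # not supported at all
    return {"image": None, "video": 1}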

abstract get_mm_max_tokens_per_item(seq_len: int, mm_counts: Mapping[str, int]) Mapping[str, int][source]#

Get the maximum possible number of tokens per data item for each modality.

The dictionary returned by this method should have the same keys as that returned by get_supported_mm_limits().
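
For example, a sketch of an override for a hypothetical vision model whose encoder always emits 576 feature tokens per image:

def get_mm_max_tokens_per_item(
    self,
    seq_len: int,
    mm_counts: Mapping[str, int],
) -> Mapping[str, int]:
    # Keys must match those returned by get_supported_mm_limits()
    return {"image": 576}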

class vllm.multimodal.processing.BaseMultiModalProcessor(info: _I, dummy_inputs: BaseDummyInputsBuilder[_I], *, cache: ProcessingCache | None = None, enable_sanity_checks: bool = True)[source]#

Abstract base class to process multi-modal inputs to be used in vLLM.

Not to be confused with transformers.ProcessorMixin.

apply(prompt: str | list[int], mm_data: Mapping[str, Any | list[Any]], hf_processor_mm_kwargs: Mapping[str, object], return_mm_hashes: bool = False) MultiModalInputs[source]#

Process multi-modal inputs to be used in vLLM.

The main steps are:

  1. Apply the HF processor to the prompt text and multi-modal data together, outputting token IDs and processed tensors.

  2. Find and update sequences in the token IDs with placeholder tokens. The number of placeholder tokens equals the feature size of the multi-modal data output by the multi-modal encoder.

  3. Extract information about the placeholder tokens from the processed token IDs.
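
A minimal usage sketch, assuming processor is an instance of a concrete subclass and image is an image object loaded elsewhere:

inputs = processor.apply(
    prompt="USER: <image>\nWhat is shown in this image?\nASSISTANT:",
    mm_data={"image": [image]},
    hf_processor_mm_kwargs={},
)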

class vllm.multimodal.processing.EncDecMultiModalProcessor(info: _I, dummy_inputs: BaseDummyInputsBuilder[_I], *, cache: ProcessingCache | None = None, enable_sanity_checks: bool = True)[source]#

Abstract base class to process multi-modal inputs for encoder-decoder models.

abstract create_encoder_prompt(prompt: str | list[int], mm_data: Mapping[str, Any | list[Any]]) str | list[int][source]#

Create the input prompt for the encoder. The HF processor will be applied to this prompt during profiling and generation.

create_decoder_prompt(prompt: str | list[int], mm_data: Mapping[str, Any | list[Any]]) str | list[int][source]#

Create the input prompt for the decoder.

apply(prompt: str | list[int], mm_data: Mapping[str, Any | list[Any]], hf_processor_mm_kwargs: Mapping[str, object], return_mm_hashes: bool = False) MultiModalEncDecInputs[source]#

Process multi-modal inputs to be used in vLLM. The main processing steps are modified to fit encoder-decoder models:

  1. Create the encoder prompt from the input prompt text.

  2. Apply the HF processor to the encoder prompt.

  3. Copy the input prompt text as the decoder prompt inputs.
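
For example, a minimal sketch of a subclass for a hypothetical speech model whose encoder consumes only audio features, so the encoder prompt carries no text:

class MySpeechProcessor(EncDecMultiModalProcessor):
    def create_encoder_prompt(
        self,
        prompt: str | list[int],
        mm_data: Mapping[str, Any],
    ) -> str | list[int]:
        # Hypothetical: the text prompt is only used by the decoder
        return ""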