vllm.multimodal.processing#

Module Contents#

Classes#

BaseMultiModalProcessor

Abstract base class to process multi-modal inputs to be used in vLLM.

BaseProcessingInfo

Base class to provide the information necessary for data processing.

BoundPromptUpdate

A PromptUpdate bound to a tokenizer to automatically convert target and the result of get_content() between token sequence and text representations.

EncDecMultiModalProcessor

PlaceholderFeaturesInfo

ProcessingCache

ProcessingCacheItem

ProcessingCacheOptionalItem

PromptIndex

Resolves to an index in the prompt.

PromptIndexTargets

PromptInsertion

Defines how to insert placeholder tokens into a prompt.

PromptReplacement

Defines how to replace portions of an input prompt with placeholder tokens.

PromptTargetMatch

PromptUpdate

Defines how to update a prompt with placeholder tokens.

PromptUpdateDetails

Details about the token sequence or text that is part of the update.

UpdateMode

Functions#

apply_text_matches

Apply the updates in mm_matches to prompt.

apply_token_matches

Apply the updates in mm_matches to prompt.

find_mm_placeholders

find_text_matches

Return each target of prompt_updates found in prompt.

find_token_matches

Return each target of prompt_updates found in prompt.

full_groupby_modality

Convenience function to apply full_groupby() based on modality.

iter_token_matches

Yield each occurrence of match_ids in token_ids.

replace_token_matches

Replace each occurrence of match_ids in token_ids with new_ids.

Data#

MultiModalHashes

A collection of hashes with a similar structure as MultiModalKwargs.

PromptSeq

A token sequence (list of token IDs) or text.

PromptTarget

The token sequence or text to update.

PromptUpdateContent

Given the index of the processed item within modality, output the corresponding token sequence (or text).

PromptUpdateInfo

The token sequence or text that is part of the update.

logger

API#

class vllm.multimodal.processing.BaseMultiModalProcessor(info: vllm.multimodal.processing._I, dummy_inputs: BaseDummyInputsBuilder[_I], *, cache: Optional[vllm.multimodal.processing.ProcessingCache] = None)[source]#

Bases: abc.ABC, typing.Generic[vllm.multimodal.processing._I]

Abstract base class to process multi-modal inputs to be used in vLLM.

Not to be confused with transformers.ProcessorMixin.

Initialization

apply(prompt: Union[str, list[int]], mm_data: vllm.multimodal.inputs.MultiModalDataDict, hf_processor_mm_kwargs: collections.abc.Mapping[str, object], return_mm_hashes: bool = False) vllm.multimodal.inputs.MultiModalInputs[source]#

Process multi-modal inputs to be used in vLLM.

The main steps are:

  1. Apply HF Processor on prompt text and multi-modal data together, outputting token IDs and processed tensors.

  2. Find and update sequences in the token IDs with placeholder tokens. The number of placeholder tokens equals the feature size of the multi-modal data output by the multi-modal encoder.

  3. Extract information about the placeholder tokens from the processed token IDs.
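
For illustration, a concrete processor might be invoked roughly as follows. This is a minimal sketch: the processor instance, the prompt format, and the <image> placeholder are model-dependent assumptions, not part of this module.

from PIL import Image

# Hedged sketch: `processor` is assumed to be a concrete
# BaseMultiModalProcessor obtained for an image-capable model.
mm_inputs = processor.apply(
    prompt="USER: <image>\nWhat is in this picture? ASSISTANT:",
    mm_data={"image": Image.open("example.jpg")},
    hf_processor_mm_kwargs={},
)
# `mm_inputs` bundles the expanded token IDs, the processed tensors,
# and the placeholder positions used later by the model runner.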

class vllm.multimodal.processing.BaseProcessingInfo(ctx: vllm.inputs.InputProcessingContext)[source]#

Base class to provide the information necessary for data processing.

Initialization

get_allowed_mm_limits() collections.abc.Mapping[str, int][source]#

Return the maximum allowed number of items for each modality.

get_hf_config() transformers.configuration_utils.PretrainedConfig[source]#
get_hf_processor(**kwargs: object) transformers.processing_utils.ProcessorMixin[source]#

Subclasses can override this method to handle specific kwargs from model config or user inputs.

abstract get_supported_mm_limits() collections.abc.Mapping[str, Optional[int]][source]#

Return the maximum supported number of items for each modality.

A value of None means unlimited number of items.

Omitting a modality from the returned dictionary means that it is not supported at all.
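
As a hedged illustration, a model that accepts any number of images but at most one video per prompt might implement this as follows (the modality keys shown are assumptions):

from collections.abc import Mapping
from typing import Optional

class MyProcessingInfo(BaseProcessingInfo):
    def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]:
        # None -> unlimited images; at most one video;
        # audio omitted -> not supported at all.
        return {"image": None, "video": 1}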

get_tokenizer() vllm.transformers_utils.tokenizer.AnyTokenizer[source]#
property model_id: str[source]#
class vllm.multimodal.processing.BoundPromptUpdate[source]#

A PromptUpdate bound to a tokenizer to automatically convert target and the result of get_content() between token sequence and text representations.

property content: vllm.multimodal.processing.PromptUpdateContent[source]#

The placeholder tokens that are part of the update.

get_content(item_idx: int) vllm.multimodal.processing._BoundPromptContent[source]#

Given the index of the processed item within modality, output the token sequence (or text) to update.

property modality: str[source]#
property mode: vllm.multimodal.processing.UpdateMode[source]#

Defines how to update the prompt.

property target: Union[vllm.multimodal.processing._BoundPromptSequence, vllm.multimodal.processing.PromptIndex][source]#

The token sequence (or text) to update.

tokenizer: vllm.transformers_utils.tokenizer.AnyTokenizer[source]#

‘field(…)’

class vllm.multimodal.processing.EncDecMultiModalProcessor(info: vllm.multimodal.processing._I, dummy_inputs: BaseDummyInputsBuilder[_I], *, cache: Optional[vllm.multimodal.processing.ProcessingCache] = None)[source]#

Bases: vllm.multimodal.processing.BaseMultiModalProcessor[vllm.multimodal.processing._I]

apply(prompt: Union[str, list[int]], mm_data: vllm.multimodal.inputs.MultiModalDataDict, hf_processor_mm_kwargs: collections.abc.Mapping[str, object], return_mm_hashes: bool = False) vllm.multimodal.inputs.MultiModalEncDecInputs[source]#

Process multi-modal inputs to be used in vLLM. The main processing steps are modified to fit the encoder-decoder model:

  1. Create encoder prompt from input prompt text.

  2. Apply the HF processor on encoder prompt.

  3. Copy the input prompt text as decoder prompt inputs.

create_decoder_prompt(prompt: Union[str, list[int]], mm_data: vllm.multimodal.inputs.MultiModalDataDict) Union[str, list[int]][source]#

Create input prompt for the decoder.

abstract create_encoder_prompt(prompt: Union[str, list[int]], mm_data: vllm.multimodal.inputs.MultiModalDataDict) Union[str, list[int]][source]#

Create input prompt for the encoder. HF processor will be applied on this prompt during profiling and generation.
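
A hedged sketch of an override; the <image> placeholder is an assumption, and real models use their own encoder prompt format:

from typing import Union

from vllm.multimodal.inputs import MultiModalDataDict

class MyEncDecProcessor(EncDecMultiModalProcessor):
    # (Other processor methods required by the base class are omitted here.)
    def create_encoder_prompt(
        self,
        prompt: Union[str, list[int]],
        mm_data: MultiModalDataDict,
    ) -> Union[str, list[int]]:
        # The encoder only sees the multi-modal placeholder; the original
        # text is carried over to the decoder by create_decoder_prompt().
        return "<image>"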

property pad_dummy_encoder_prompt: bool[source]#
vllm.multimodal.processing.MultiModalHashes[source]#

None

A collection of hashes with a similar structure as MultiModalKwargs.

class vllm.multimodal.processing.PlaceholderFeaturesInfo[source]#
is_embed: Optional[torch.Tensor][source]#

None

item_idx: int[source]#

None

property length: int[source]#
modality: str[source]#

None

start_idx: int[source]#

None

to_range() vllm.multimodal.inputs.PlaceholderRange[source]#
tokens: list[int][source]#

None

class vllm.multimodal.processing.ProcessingCache(capacity_gb: float, *, debug_cache_hit_ratio_steps: Optional[int] = None)[source]#

Initialization

get(model_id: str, modality: str, input_item: object, input_kwargs: collections.abc.Mapping[str, object]) Optional[vllm.multimodal.inputs.MultiModalKwargsItem][source]#

Get a processed multi-modal item from the cache according to its dependencies, including:

  • The model ID

  • The modality of the item

  • The original data item passed to the HF processor

  • The configuration options of the HF processor

get_item(model_id: str, modality: str, input_item: object, input_kwargs: collections.abc.Mapping[str, object]) vllm.multimodal.processing.ProcessingCacheOptionalItem[source]#
static get_lru_cache(capacity_gb: float, value_type: type[vllm.multimodal.processing._V], *, debug: bool = False) vllm.utils.LRUCache[str, vllm.multimodal.processing._V][source]#
put(model_id: str, modality: str, input_item: object, input_kwargs: collections.abc.Mapping[str, object], output_kwargs: vllm.multimodal.inputs.MultiModalKwargsItem) None[source]#

Put a processed multi-modal item into the cache according to its dependencies (see get()).
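
A hedged sketch of the put/get round trip. The model ID, the raw image item, and the processed MultiModalKwargsItem are assumed to exist; in normal operation the processor manages the cache itself:

cache = ProcessingCache(capacity_gb=4)

cache.put(
    model_id="org/model",          # illustrative model ID
    modality="image",
    input_item=image,              # raw item passed to the HF processor
    input_kwargs={},               # HF processor options for this item
    output_kwargs=processed_item,  # MultiModalKwargsItem produced earlier
)

hit = cache.get("org/model", "image", image, {})
if hit is not None:
    # Reuse the cached MultiModalKwargsItem instead of re-processing.
    ...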

put_item(item: vllm.multimodal.processing.ProcessingCacheItem) None[source]#
reset() bool[source]#
class vllm.multimodal.processing.ProcessingCacheItem[source]#

Bases: typing.NamedTuple

key: str[source]#

None

value: vllm.multimodal.inputs.MultiModalKwargsItem[source]#

None

class vllm.multimodal.processing.ProcessingCacheOptionalItem[source]#

Bases: typing.NamedTuple

key: str[source]#

None

value: Optional[vllm.multimodal.inputs.MultiModalKwargsItem][source]#

None

class vllm.multimodal.processing.PromptIndex[source]#

Resolves to an index in the prompt.

get_match_index: collections.abc.Callable[[vllm.transformers_utils.tokenizer.AnyTokenizer, vllm.multimodal.processing.PromptSeq], Optional[int]][source]#

None

class vllm.multimodal.processing.PromptIndexTargets[source]#
static end() vllm.multimodal.processing.PromptIndex[source]#

Resolves to the end of the prompt (after the last token).

This results in a match even if the prompt is empty.

static prefix(seq: vllm.multimodal.processing.PromptSeq) vllm.multimodal.processing.PromptIndex[source]#

Resolves to a location in the prompt after the given prefix.

static start() vllm.multimodal.processing.PromptIndex[source]#

Resolves to the start of the prompt (before the first token).

This results in a match even if the prompt is empty.

class vllm.multimodal.processing.PromptInsertion[source]#

Bases: vllm.multimodal.processing.PromptUpdate

Defines how to insert placeholder tokens into a prompt.

Example:

For each image, insert a number of <image> feature placeholders equal to the feature size of the vision encoder after the <s> token:

PromptInsertion(
    modality="image",
    target="<s>",
    insertion="<image>" * image_feature_size,
)

Insert these tokens at the start of the prompt:

PromptInsertion(
    modality="image",
    target=PromptIndexTargets.start(),
    insertion="<image>" * image_feature_size,
)

Insert these tokens after the prefix "Images:":

PromptInsertion(
    modality="image",
    target=PromptIndexTargets.prefix("Images:"),
    insertion="<image>" * image_feature_size,
)

Insert these tokens at the end of the prompt:

PromptInsertion(
    modality="image",
    target=PromptIndexTargets.end(),
    insertion="<image>" * image_feature_size,
)
property content: vllm.multimodal.processing.PromptUpdateContent[source]#
insertion: vllm.multimodal.processing.PromptUpdateContent[source]#

‘field(…)’

Given the index of the processed item within modality, output the token sequence (or text) to insert right after target.

For convenience, you can directly pass in the token sequence (or text) instead of a function if it does not depend on the input.

property mode: vllm.multimodal.processing.UpdateMode[source]#
class vllm.multimodal.processing.PromptReplacement[source]#

Bases: vllm.multimodal.processing.PromptUpdate

Defines how to replace portions of an input prompt with placeholder tokens.

Example:

For each image, replace one <image> input placeholder in the prompt with a number of <image> feature placeholders equal to the feature size of the vision encoder:

PromptReplacement(
    modality="image",
    target="<image>",
    replacement="<image>" * image_feature_size,
)

As above, but further pad the feature placeholders with <image_bos> and <image_eos>, which are not supposed to be passed to the vision encoder:

PromptReplacement(
    modality="image",
    target="<image>",
    replacement=PromptUpdateDetails(
        full="".join([
            "<image_bos>",
            "<image>" * image_feature_size,
            "<image_eos>",
        ]),
        features="<image>" * image_feature_size,
    ),
)

To avoid unnecessary tokenization during prompt replacement, we recommend passing token sequences instead of text:

PromptReplacement(
    modality="image",
    target=[image_token_id],
    replacement=PromptUpdateDetails(
        full=([image_bos_id] + [image_token_id] * image_feature_size
                + [image_eos_id]),
        features=[image_token_id] * image_feature_size,
    ),
)
property content: vllm.multimodal.processing.PromptUpdateContent[source]#
property mode: vllm.multimodal.processing.UpdateMode[source]#
replacement: vllm.multimodal.processing.PromptUpdateContent[source]#

‘field(…)’

Given the index of the processed item within modality, output the token sequence (or text) to replace target.

For convenience, you can directly pass in the token sequence (or text) instead of a function if it does not depend on the input.

vllm.multimodal.processing.PromptSeq[source]#

None

A token sequence (list of token IDs) or text.

vllm.multimodal.processing.PromptTarget[source]#

None

The token sequence or text to update.

class vllm.multimodal.processing.PromptTargetMatch[source]#

Bases: abc.ABC

abstract property end_idx: int[source]#
property modality: str[source]#
abstract property start_idx: int[source]#
class vllm.multimodal.processing.PromptUpdate[source]#

Bases: abc.ABC

Defines how to update a prompt with placeholder tokens.

bind(tokenizer: vllm.transformers_utils.tokenizer.AnyTokenizer) vllm.multimodal.processing.BoundPromptUpdate[source]#
abstract property content: vllm.multimodal.processing.PromptUpdateContent[source]#

The placeholder tokens that are part of the update.

modality: str[source]#

None

The modality for which the update is made.

abstract property mode: vllm.multimodal.processing.UpdateMode[source]#

Defines how to update the prompt.

target: vllm.multimodal.processing.PromptTarget[source]#

None

The token sequence (or text) to update.

vllm.multimodal.processing.PromptUpdateContent[source]#

None

Given the index of the processed item within modality, output the corresponding token sequence (or text).

For convenience, you can directly pass in the token sequence (or text) instead of a function if it does not depend on the input.
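
For example (a hedged sketch), the callable form lets the content vary per item; image_token_id and image_feature_sizes are assumed to come from the model's processing info:

def get_image_content(item_idx: int) -> list[int]:
    # One run of feature tokens per processed image item.
    return [image_token_id] * image_feature_sizes[item_idx]

replacement = PromptReplacement(
    modality="image",
    target=[image_token_id],
    replacement=get_image_content,
)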

class vllm.multimodal.processing.PromptUpdateDetails[source]#

Bases: typing.Generic[vllm.multimodal.processing._S]

Details about the token sequence or text that is part of the update.

static from_seq(seq: vllm.multimodal.processing._S) PromptUpdateDetails[_S][source]#
full: vllm.multimodal.processing._S[source]#

None

The full content.

is_embed: Optional[collections.abc.Callable[[_BoundPromptSequence], torch.Tensor]][source]#

None

Given full, return a boolean mask of shape (len(full),) indicating which positions of full to assign embeddings to.

None (default) means to assign embeddings to all positions of full.

The embeddings are obtained by calling SupportsMultiModal.get_multimodal_embeddings.

static select_text(seq: vllm.multimodal.processing._S, embed_text: str) PromptUpdateDetails[_S][source]#
static select_token_id(seq: vllm.multimodal.processing._S, embed_token_id: int) PromptUpdateDetails[_S][source]#
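
A hedged sketch of select_token_id, which builds the is_embed mask so that only the <image> feature tokens, and not the surrounding marker tokens, receive embeddings (the token IDs and feature size are assumptions):

details = PromptUpdateDetails.select_token_id(
    [image_bos_id]
    + [image_token_id] * image_feature_size
    + [image_eos_id],
    embed_token_id=image_token_id,
)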
vllm.multimodal.processing.PromptUpdateInfo[source]#

None

The token sequence or text that is part of the update.

If only part of the content corresponds to feature placeholders, you can use PromptUpdateDetails to specify which part.

class vllm.multimodal.processing.UpdateMode[source]#

Bases: str, enum.Enum

INSERT[source]#

‘insert’

REPLACE[source]#

‘replace’

vllm.multimodal.processing.apply_text_matches(prompt: str, mm_matches: collections.abc.Mapping[str, collections.abc.Sequence[vllm.multimodal.processing.PromptTargetMatch]], mm_item_counts: collections.abc.Mapping[str, int]) str[source]#

Apply the updates in mm_matches to prompt.

vllm.multimodal.processing.apply_token_matches(prompt: list[int], mm_matches: collections.abc.Mapping[str, collections.abc.Sequence[vllm.multimodal.processing.PromptTargetMatch]], mm_item_counts: collections.abc.Mapping[str, int]) list[int][source]#

Apply the updates in mm_matches to prompt.

vllm.multimodal.processing.find_mm_placeholders(mm_prompt_updates: collections.abc.Mapping[str, collections.abc.Sequence[vllm.multimodal.processing.BoundPromptUpdate]], prompt: list[int], mm_item_counts: collections.abc.Mapping[str, int]) collections.abc.Mapping[str, list[vllm.multimodal.processing.PlaceholderFeaturesInfo]][source]#
vllm.multimodal.processing.find_text_matches(prompt: str, prompt_updates: collections.abc.Sequence[vllm.multimodal.processing.BoundPromptUpdate]) collections.abc.Sequence[vllm.multimodal.processing.PromptTargetMatch][source]#

Return each target of prompt_updates found in prompt.

vllm.multimodal.processing.find_token_matches(prompt: list[int], prompt_updates: collections.abc.Sequence[vllm.multimodal.processing.BoundPromptUpdate]) collections.abc.Sequence[vllm.multimodal.processing.PromptTargetMatch][source]#

Return each target of prompt_updates found in prompt.
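
A hedged sketch of the match-and-apply pipeline on a token prompt; tokenizer, image_token_id, image_feature_size, and prompt_token_ids are assumed to exist, and processors normally drive these steps internally:

update = PromptReplacement(
    modality="image",
    target=[image_token_id],
    replacement=[image_token_id] * image_feature_size,
).bind(tokenizer)

# Locate each occurrence of the target, then expand it in place.
matches = find_token_matches(prompt_token_ids, [update])
new_token_ids = apply_token_matches(
    prompt_token_ids,
    mm_matches={"image": matches},
    mm_item_counts={"image": 1},
)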

vllm.multimodal.processing.full_groupby_modality(values: collections.abc.Iterable[vllm.multimodal.processing._M]) collections.abc.ItemsView[str, list[vllm.multimodal.processing._M]][source]#

Convenience function to apply full_groupby() based on modality.
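
A hedged sketch, assuming updates is an iterable of objects exposing a modality attribute (e.g. BoundPromptUpdate instances):

for modality, group in full_groupby_modality(updates):
    print(modality, len(group))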

vllm.multimodal.processing.iter_token_matches(token_ids: list[int], match_ids: list[int]) collections.abc.Generator[vllm.multimodal.processing._TokenMatch][source]#

Yield each occurrence of match_ids in token_ids.

Note that empty matches are ignored.
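
A hedged sketch with illustrative token IDs:

token_ids = [1, 9, 9, 2, 9, 9]
for match in iter_token_matches(token_ids, [9, 9]):
    # Each match records where the subsequence occurs in token_ids;
    # here, the non-overlapping matches start at indices 1 and 4.
    print(match)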

vllm.multimodal.processing.logger[source]#

‘init_logger(…)’

vllm.multimodal.processing.replace_token_matches(token_ids: list[int], match_ids: list[int], new_ids: list[int]) list[int][source]#

Replace each occurrence of match_ids in token_ids with new_ids.

Note that empty matches are ignored.
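
A hedged sketch, expanding a single placeholder token into a run of feature tokens (the token IDs are illustrative):

new_token_ids = replace_token_matches(
    token_ids=[1, 32000, 2],
    match_ids=[32000],
    new_ids=[32000] * 4,
)
# -> [1, 32000, 32000, 32000, 32000, 2]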