Optional Interfaces

Module Contents

class vllm.model_executor.models.interfaces.SupportsMultiModal(*args, **kwargs)[source]

The interface required for all multi-modal models.

supports_multimodal: ClassVar[Literal[True]] = True[source]

A flag that indicates this model supports multi-modal inputs.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

get_multimodal_embeddings(**kwargs) → T | None[source]

Returns multimodal embeddings generated from multimodal kwargs to be merged with text embeddings.

The output embeddings must be in one of the following formats:

  • A list or tuple of 2D tensors, where each tensor corresponds to each input multimodal data item (e.g., image).

  • A single 3D tensor, with the batch dimension grouping the 2D tensors.

Note

The returned multimodal embeddings must appear in the same order as their corresponding multimodal data items in the input prompt.

get_input_embeddings(input_ids: torch.Tensor, multimodal_embeddings: T | None = None, attn_metadata: 'AttentionMetadata' | None = None) → torch.Tensor[source]
get_input_embeddings(input_ids: torch.Tensor, multimodal_embeddings: T | None = None) → torch.Tensor

Returns the input embeddings for the given token IDs, merging the multimodal embeddings (when provided) into the text embeddings at the positions of their placeholder tokens.
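
To make the contract concrete, here is a minimal sketch of a model opting into this interface. Everything outside the two interface methods (the class name, the stand-in vision encoder, the pixel_values kwarg, and the image_token_id placeholder) is hypothetical:

    import torch
    import torch.nn as nn

    from vllm.model_executor.models.interfaces import SupportsMultiModal

    class ToyVLM(nn.Module, SupportsMultiModal):
        """Hypothetical model; only the interface methods follow the contract."""

        # supports_multimodal is inherited through the MRO; no need to redefine it.

        def __init__(self, vocab_size=32000, hidden_size=512, image_token_id=31999):
            super().__init__()
            self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
            self.vision_encoder = nn.Linear(768, hidden_size)  # stand-in encoder
            self.image_token_id = image_token_id

        def get_multimodal_embeddings(self, **kwargs):
            pixel_values = kwargs.get("pixel_values")  # hypothetical kwarg name
            if pixel_values is None:
                return None
            # One 2D (num_patches, hidden_size) tensor per image, in prompt order.
            return [self.vision_encoder(img) for img in pixel_values]

        def get_input_embeddings(self, input_ids, multimodal_embeddings=None):
            inputs_embeds = self.embed_tokens(input_ids)
            if multimodal_embeddings is not None:
                # Scatter the image embeddings into the placeholder-token
                # positions; assumes one placeholder token per patch row.
                mask = input_ids == self.image_token_id
                inputs_embeds[mask] = torch.cat(multimodal_embeddings, dim=0)
            return inputs_embeds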

class vllm.model_executor.models.interfaces.SupportsLoRA(*args, **kwargs)[source]

The interface required for all models that support LoRA.

supports_lora: ClassVar[Literal[True]] = True[source]

A flag that indicates this model supports LoRA.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.
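
For illustration, a minimal sketch (the model class is hypothetical) of how the MRO note above plays out in practice:

    import torch.nn as nn

    from vllm.model_executor.models.interfaces import SupportsLoRA

    class ToyLoRAModel(nn.Module, SupportsLoRA):
        # supports_lora is inherited through the MRO,
        # so the flag does not need to be redefined here.
        ...

    assert ToyLoRAModel.supports_lora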

class vllm.model_executor.models.interfaces.SupportsPP(*args, **kwargs)[source]

The interface required for all models that support pipeline parallelism.

supports_pp: ClassVar[Literal[True]] = True[source]

A flag that indicates this model supports pipeline parallelism.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

make_empty_intermediate_tensors(batch_size: int, dtype: torch.dtype, device: torch.device) → IntermediateTensors[source]

Called when PP rank > 0 for profiling purposes.

forward(*, intermediate_tensors: IntermediateTensors | None) → torch.Tensor | IntermediateTensors[source]

Accept IntermediateTensors when PP rank > 0.

Return IntermediateTensors only for the last PP rank.
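
Putting the two methods together, a hedged sketch of one pipeline stage; the layer stack, hidden size, is_last_rank flag, and the IntermediateTensors import path are assumptions for illustration:

    import torch
    import torch.nn as nn

    from vllm.model_executor.models.interfaces import SupportsPP
    from vllm.sequence import IntermediateTensors  # assumed import path

    class ToyPPStage(nn.Module, SupportsPP):
        """Hypothetical stage; only the SupportsPP methods follow the interface."""

        def __init__(self, hidden_size=512, is_last_rank=False):
            super().__init__()
            self.layers = nn.Sequential(nn.Linear(hidden_size, hidden_size))
            self.hidden_size = hidden_size
            self.is_last_rank = is_last_rank  # made-up flag for this sketch

        def make_empty_intermediate_tensors(self, batch_size, dtype, device):
            # Placeholder activations with the shape this rank expects to
            # receive; used when profiling PP ranks > 0.
            return IntermediateTensors({
                "hidden_states": torch.zeros(
                    batch_size, self.hidden_size, dtype=dtype, device=device),
            })

        def forward(self, *, intermediate_tensors=None):
            if intermediate_tensors is None:
                # First PP rank: embeddings would be computed from the input
                # here (omitted in this sketch).
                raise NotImplementedError("first-rank path omitted in sketch")
            # PP rank > 0 resumes from the previous rank's activations.
            hidden_states = self.layers(intermediate_tensors["hidden_states"])
            if not self.is_last_rank:
                # Every rank except the last hands IntermediateTensors onward.
                return IntermediateTensors({"hidden_states": hidden_states})
            # Only the last PP rank returns a plain tensor.
            return hidden_states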

class vllm.model_executor.models.interfaces.HasInnerState(*args, **kwargs)[source]

The interface required for all models that have inner state.

has_inner_state: ClassVar[Literal[True]] = True[source]

A flag that indicates this model has inner state. Models with inner state usually need access to the scheduler_config for max_num_seqs, etc. True for both Mamba and Jamba, for example.

class vllm.model_executor.models.interfaces.IsAttentionFree(*args, **kwargs)[source]

The interface required for models that, like Mamba, lack attention but have state whose size is constant with respect to the number of tokens.

is_attention_free: ClassVar[Literal[True]] = True[source]

A flag that indicates this model has no attention. Used to select the block manager and attention backend. True for Mamba but not Jamba.

class vllm.model_executor.models.interfaces.IsHybrid(*args, **kwargs)[source]

The interface required for models that, like Jamba, have both attention and Mamba blocks. It also indicates that the model's hf_config has a 'layers_block_type' attribute.

is_hybrid: ClassVar[Literal[True]] = True[source]

A flag that indicates this model has both Mamba and attention blocks. It also indicates that the model's hf_config has a 'layers_block_type' attribute.
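
To show how these flags combine, a brief hypothetical sketch: a Mamba-style model is attention-free but stateful, while a Jamba-style hybrid is stateful and declares mixed layer types:

    import torch.nn as nn

    from vllm.model_executor.models.interfaces import (HasInnerState,
                                                       IsAttentionFree, IsHybrid)

    class ToyMambaModel(nn.Module, HasInnerState, IsAttentionFree):
        # Inherits has_inner_state = True and is_attention_free = True via the MRO.
        ...

    class ToyJambaModel(nn.Module, HasInnerState, IsHybrid):
        # Inherits has_inner_state = True and is_hybrid = True; the matching
        # hf_config is expected to provide a 'layers_block_type' entry.
        ...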

class vllm.model_executor.models.interfaces.SupportsCrossEncoding(*args, **kwargs)[source]

The interface required for all models that support cross-encoding.