vllm.model_executor.models.interfaces
MultiModalEmbeddings
module-attribute
The output embeddings must be one of the following formats:
- A list or tuple of 2D tensors, where each tensor corresponds to one input multimodal data item (e.g., an image).
- A single 3D tensor, with the batch dimension grouping the 2D tensors.
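For illustration, a minimal sketch of both accepted formats; the shapes (three items, 16 patches, hidden size 4096) are made up:

import torch

# Format 1: a list (or tuple) of 2D tensors, one per multimodal item.
as_list = [torch.randn(16, 4096) for _ in range(3)]

# Format 2: a single 3D tensor whose batch dimension groups the 2D
# tensors; this requires every item to share the same first dimension.
as_tensor = torch.stack(as_list)  # shape: (3, 16, 4096)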
HasInnerState
Bases: Protocol
The interface required for all models that have inner state.
HasNoOps
IsAttentionFree
Bases: Protocol
The interface required for all models like Mamba that lack attention but do have state whose size is constant with respect to the number of tokens.
IsHybrid
Bases: Protocol
The interface required for all models like Jamba that have both attention and Mamba blocks. It also indicates that the model's hf_config defines 'layers_block_type'.
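As a quick illustration, the attribute can be probed on a Hugging Face config; the checkpoint name is only an example and loading it requires network access:

from transformers import AutoConfig

hf_config = AutoConfig.from_pretrained("ai21labs/Jamba-v0.1")

# A hybrid config exposes per-layer block types, e.g.
# ["mamba", "mamba", ..., "attention", ...] for interleaved layers.
print(getattr(hf_config, "layers_block_type", None))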
SupportsCrossEncoding
Bases: Protocol
The interface required for all models that support cross encoding.
SupportsLoRA
Bases: Protocol
The interface required for all models that support LoRA.
SupportsMultiModal
Bases: Protocol
The interface required for all multi-modal models.
supports_multimodal
class-attribute
supports_multimodal: Literal[True] = True
A flag that indicates this model supports multi-modal inputs.
Note
There is no need to redefine this flag if this class is in the MRO of your model class.
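A minimal sketch (the model class is hypothetical) of opting in by inheritance; because SupportsMultiModal then sits in the MRO, the flag is inherited rather than redefined:

import torch.nn as nn

from vllm.model_executor.models.interfaces import SupportsMultiModal

class MyVLM(nn.Module, SupportsMultiModal):
    """Hypothetical vision-language model."""

assert MyVLM.supports_multimodal is True  # inherited via the MRO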
get_language_model
get_language_model() -> Module
Returns the underlying language model used for text generation.
This is typically the torch.nn.Module instance responsible for processing the merged multimodal embeddings and producing hidden states.
Returns:
| Type | Description |
|---|---|
| Module | torch.nn.Module: The core language model component. |
get_multimodal_embeddings
get_multimodal_embeddings(
**kwargs: object,
) -> MultiModalEmbeddings
Returns multimodal embeddings generated from multimodal kwargs to be merged with text embeddings.
Note
The returned multimodal embeddings must be in the same order as the appearances of their corresponding multimodal data item in the input prompt.
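A hedged sketch of an override that honors the ordering requirement; pixel_values and vision_tower are illustrative names, not part of the interface:

import torch
import torch.nn as nn

class MyVLM(nn.Module):
    """Hypothetical model, reduced to what this sketch needs."""

    def __init__(self) -> None:
        super().__init__()
        self.vision_tower = nn.Linear(32, 64)  # stand-in vision encoder

    def get_multimodal_embeddings(self, **kwargs):
        # Hypothetical kwarg: one tensor per image, already in prompt order.
        pixel_values = kwargs["pixel_values"]
        # Return one 2D tensor per item, preserving prompt order so the
        # later merge lines up with the placeholder tokens in the text.
        return [self.vision_tower(pv) for pv in pixel_values]

model = MyVLM()
images = [torch.randn(16, 32) for _ in range(2)]
embeds = model.get_multimodal_embeddings(pixel_values=images)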
SupportsPP
Bases: Protocol
The interface required for all models that support pipeline parallel.
supports_pp
class-attribute
supports_pp: Literal[True] = True
A flag that indicates this model supports pipeline parallel.
Note
There is no need to redefine this flag if this class is in the MRO of your model class.
forward
forward(
*, intermediate_tensors: Optional[IntermediateTensors]
) -> Union[Tensor, IntermediateTensors]
Accept IntermediateTensors when PP rank > 0. Return IntermediateTensors only for the last PP rank.
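A condensed method sketch of that contract, assuming vLLM's get_pp_group helper; embed_tokens and the layer loop are illustrative:

from typing import Optional, Union

import torch

from vllm.distributed import get_pp_group
from vllm.sequence import IntermediateTensors

def forward(
    self,
    input_ids: torch.Tensor,
    *,
    intermediate_tensors: Optional[IntermediateTensors],
) -> Union[torch.Tensor, IntermediateTensors]:
    if get_pp_group().is_first_rank:
        hidden_states = self.embed_tokens(input_ids)
    else:
        # Ranks > 0 resume from the previous rank's activations.
        assert intermediate_tensors is not None
        hidden_states = intermediate_tensors["hidden_states"]

    for layer in self.layers:  # only this rank's shard of the layers
        hidden_states = layer(hidden_states)

    if not get_pp_group().is_last_rank:
        # Hand off to the next pipeline stage.
        return IntermediateTensors({"hidden_states": hidden_states})
    return hidden_states  # the last rank produces the final hidden states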
make_empty_intermediate_tensors
make_empty_intermediate_tensors(
batch_size: int, dtype: dtype, device: device
) -> IntermediateTensors
Called when PP rank > 0 for profiling purposes.
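Ranks > 0 need correctly shaped placeholders during profiling, before any real activations exist. A sketch of the usual pattern; hidden_size comes from a hypothetical model config:

import torch

from vllm.sequence import IntermediateTensors

def make_empty_intermediate_tensors(
    self, batch_size: int, dtype: torch.dtype, device: torch.device
) -> IntermediateTensors:
    # Zero-filled stand-ins shaped like what forward() would hand off.
    return IntermediateTensors({
        "hidden_states": torch.zeros(
            (batch_size, self.config.hidden_size),
            dtype=dtype, device=device),
    })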
SupportsQuant
The interface required for all models that support quantization.
_find_quant_config
staticmethod
_find_quant_config(
*args, **kwargs
) -> Optional[QuantizationConfig]
SupportsTranscription
Bases: Protocol
The interface required for all models that support transcription.
SupportsV0Only
Bases: Protocol
Models with this interface are not compatible with V1 vLLM.
_HasInnerStateType
_HasNoOpsType
_IsAttentionFreeType
_IsHybridType
_SupportsMultiModalType
_SupportsPPType
Bases: Protocol
forward
forward(
*, intermediate_tensors: Optional[IntermediateTensors]
) -> Union[Tensor, IntermediateTensors]
make_empty_intermediate_tensors
make_empty_intermediate_tensors(
batch_size: int, dtype: dtype, device: device
) -> IntermediateTensors
_supports_cross_encoding
_supports_cross_encoding(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsCrossEncoding]],
TypeIs[SupportsCrossEncoding],
]
_supports_lora
_supports_pp_attributes
_supports_pp_inspect
has_inner_state
has_inner_state(model: object) -> TypeIs[HasInnerState]
has_inner_state(
model: type[object],
) -> TypeIs[type[HasInnerState]]
has_inner_state(
model: Union[type[object], object],
) -> Union[
TypeIs[type[HasInnerState]], TypeIs[HasInnerState]
]
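The overloads mean the check accepts either a class or an instance; a small sketch with a hypothetical model:

import torch.nn as nn

from vllm.model_executor.models.interfaces import (
    HasInnerState, has_inner_state)

class StatefulModel(nn.Module, HasInnerState):
    """Hypothetical model that declares inner state."""

assert has_inner_state(StatefulModel)    # works on the class
assert has_inner_state(StatefulModel())  # and on an instance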
has_noops
is_attention_free
is_attention_free(model: object) -> TypeIs[IsAttentionFree]
is_attention_free(
model: type[object],
) -> TypeIs[type[IsAttentionFree]]
is_attention_free(
model: Union[type[object], object],
) -> Union[
TypeIs[type[IsAttentionFree]], TypeIs[IsAttentionFree]
]
is_hybrid
supports_cross_encoding
supports_cross_encoding(
model: type[object],
) -> TypeIs[type[SupportsCrossEncoding]]
supports_cross_encoding(
model: object,
) -> TypeIs[SupportsCrossEncoding]
supports_cross_encoding(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsCrossEncoding]],
TypeIs[SupportsCrossEncoding],
]
supports_lora
supports_lora(
model: type[object],
) -> TypeIs[type[SupportsLoRA]]
supports_lora(model: object) -> TypeIs[SupportsLoRA]
supports_lora(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsLoRA]], TypeIs[SupportsLoRA]
]
supports_multimodal
supports_multimodal(
model: type[object],
) -> TypeIs[type[SupportsMultiModal]]
supports_multimodal(
model: object,
) -> TypeIs[SupportsMultiModal]
supports_multimodal(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsMultiModal]],
TypeIs[SupportsMultiModal],
]
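Because the return type is TypeIs, a successful check also narrows the static type of the argument; a usage sketch:

from vllm.model_executor.models.interfaces import supports_multimodal

def embed_if_multimodal(model: object, **mm_kwargs: object):
    if supports_multimodal(model):
        # `model` is narrowed to SupportsMultiModal here, so the call
        # type-checks without a cast.
        return model.get_multimodal_embeddings(**mm_kwargs)
    return None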
supports_pp
supports_pp(
model: type[object],
) -> TypeIs[type[SupportsPP]]
supports_pp(model: object) -> TypeIs[SupportsPP]
supports_pp(
model: Union[type[object], object],
) -> Union[
bool, TypeIs[type[SupportsPP]], TypeIs[SupportsPP]
]
supports_transcription
supports_transcription(
model: type[object],
) -> TypeIs[type[SupportsTranscription]]
supports_transcription(
model: object,
) -> TypeIs[SupportsTranscription]
supports_transcription(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsTranscription]],
TypeIs[SupportsTranscription],
]
supports_v0_only
supports_v0_only(
model: type[object],
) -> TypeIs[type[SupportsV0Only]]
supports_v0_only(model: object) -> TypeIs[SupportsV0Only]
supports_v0_only(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsV0Only]], TypeIs[SupportsV0Only]
]