vllm_omni.model_executor.layers.rotary_embedding.mrope
Omni-extended MRotaryEmbedding with multimodal position computation methods.
This module extends the upstream vLLM MRotaryEmbedding with additional methods for computing input positions for various multimodal scenarios, including:
- Image/Video inputs (Qwen2.5-VL style)
- Audio inputs (Qwen2.5-Omni style)
- Audio-in-video interleaved inputs
- GLM4V style inputs
OmniMRotaryEmbedding
Bases: MRotaryEmbedding
Omni-extended MRotaryEmbedding with multimodal position computation.
Extends the upstream MRotaryEmbedding with additional class methods for computing input positions for various multimodal scenarios.
get_input_positions classmethod
get_input_positions(
input_tokens: list[int],
hf_config: PretrainedConfig,
image_grid_thw: list[list[int]] | Tensor | None,
video_grid_thw: list[list[int]] | Tensor | None,
second_per_grid_ts: list[float] | None,
context_len: int = 0,
seq_len: int | None = None,
audio_feature_lengths: Tensor | None = None,
use_audio_in_video: bool = False,
) -> tuple[list[list[int]], int]
Get mrope input positions and delta value.
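As a rough illustration of what "positions and delta" can mean here, the following is a minimal pure-Python sketch of the Qwen2-VL style multimodal position scheme for text and image segments. It is not the implementation behind get_input_positions: the function name `sketch_mrope_positions` and the `segments` input format are hypothetical, grid sizes are assumed to be already divided by the spatial merge size, and the temporal scaling applied to video (via second_per_grid_ts) and all audio handling are left out.

```python
def sketch_mrope_positions(segments):
    """Hypothetical helper, for illustration only.

    segments: list of ("text", n_tokens) or ("image", (t, h, w)) tuples,
    where (t, h, w) is the grid size after spatial merging (assumption).
    Returns ([t_positions, h_positions, w_positions], delta).
    """
    pos_t, pos_h, pos_w = [], [], []
    next_start = 0  # first position id available to the next segment

    for kind, spec in segments:
        if kind == "text":
            # Text tokens advance all three axes together.
            idx = [next_start + i for i in range(spec)]
            pos_t += idx
            pos_h += idx
            pos_w += idx
        else:
            # Vision tokens get a (temporal, height, width) index each,
            # offset by the start position of the segment.
            t, h, w = spec
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        pos_t.append(next_start + ti)
                        pos_h.append(next_start + hi)
                        pos_w.append(next_start + wi)
        if pos_t:
            # The next segment starts after the largest id used so far.
            next_start = max(pos_t + pos_h + pos_w) + 1

    # delta: offset between the next position id and the token count.
    delta = next_start - len(pos_t)
    return [pos_t, pos_h, pos_w], delta


# Two text tokens, one 1x2x2 image (4 tokens), one trailing text token.
positions, delta = sketch_mrope_positions(
    [("text", 2), ("image", (1, 2, 2)), ("text", 1)]
)
```

In this sketch the image occupies four tokens but only advances the position counter by two, so the delta is negative. A caller could add such a delta to the running token index during decoding to obtain the position of each newly generated token.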
get_input_positions_tensor classmethod
get_input_positions_tensor(
input_tokens: list[int],
hf_config: PretrainedConfig,
image_grid_thw: list[list[int]] | Tensor,
video_grid_thw: list[list[int]] | Tensor,
second_per_grid_ts: list[float],
context_len: int = 0,
seq_len: int | None = None,
audio_feature_lengths: Tensor | None = None,
use_audio_in_video: bool = False,
) -> tuple[Tensor, int]
omni_get_updates_use_audio_in_video classmethod
omni_get_updates_use_audio_in_video(
thinker_config: PretrainedConfig,
audio_len: int,
video_grid_thw: list[int] | Tensor,
video_second_per_grid_t: float,
) -> list[int]
Get video prompt updates when use_audio_in_video is True.
In this case, the audio and vision update ids are split into chunks and interleaved (details in _omni_get_input_positions_tensor).
<|video_bos|><|VIDEO|><|video_eos|> => <|video_bos|><|audio_bos|>(... chunks ...)<|audio_eos|><|video_eos|>
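The chunked interleaving can be pictured with a small pure-Python sketch. This is an assumption-laden illustration rather than the method's actual logic: the helper name `sketch_interleave`, the per-token time indices, the `chunk_span` parameter, and the choice to place video tokens before audio tokens within each chunk are all hypothetical. Only the surrounding bos/eos layout is taken from the mapping above.

```python
def sketch_interleave(video_times, audio_times, chunk_span):
    """Hypothetical helper, for illustration only.

    video_times / audio_times: one time index per placeholder token.
    chunk_span: width of a chunk in the same time units (assumption).
    Returns the placeholder sequence as a list of token strings.
    """
    out = ["<|video_bos|>", "<|audio_bos|>"]
    last = max(video_times + audio_times, default=-1)
    start = 0
    while start <= last:
        end = start + chunk_span
        # Emit the video tokens of this time window, then the audio tokens.
        out += ["<|VIDEO|>" for t in video_times if start <= t < end]
        out += ["<|AUDIO|>" for t in audio_times if start <= t < end]
        start = end
    out += ["<|audio_eos|>", "<|video_eos|>"]
    return out


# Four video tokens at times 0 and 2, four audio tokens at times 0..3,
# grouped into chunks that each span two time units.
tokens = sketch_interleave([0, 0, 2, 2], [0, 1, 2, 3], chunk_span=2)
```

With these inputs each chunk holds two video placeholders followed by two audio placeholders, all wrapped in the bos/eos markers. The real method returns token ids rather than strings and derives its chunk boundaries from thinker_config, audio_len, video_grid_thw, and video_second_per_grid_t.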