vllm_omni.model_executor.layers.rotary_embedding.mrope

Omni-extended MRotaryEmbedding with multimodal position computation methods.

This module extends the upstream vLLM MRotaryEmbedding with additional methods for computing input positions for various multimodal scenarios, including:

- Image/Video inputs (Qwen2.5-VL style)
- Audio inputs (Qwen2.5-Omni style)
- Audio-in-video interleaved inputs
- GLM4V style inputs

logger module-attribute

logger = init_logger(__name__)

OmniMRotaryEmbedding

Bases: MRotaryEmbedding

Omni-extended MRotaryEmbedding with multimodal position computation.

Extends the upstream MRotaryEmbedding with additional class methods for computing input positions for various multimodal scenarios.

get_input_positions classmethod

get_input_positions(
    input_tokens: list[int],
    hf_config: PretrainedConfig,
    image_grid_thw: list[list[int]] | Tensor | None,
    video_grid_thw: list[list[int]] | Tensor | None,
    second_per_grid_ts: list[float] | None,
    context_len: int = 0,
    seq_len: int | None = None,
    audio_feature_lengths: Tensor | None = None,
    use_audio_in_video: bool = False,
) -> tuple[list[list[int]], int]

Get mrope input positions and delta value.
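
To make the return shape concrete, here is a minimal sketch of how three-row (temporal, height, width) positions and a delta could be derived for a mix of text and image tokens. It is not the library's implementation. The function name `sketch_mrope_positions` and the `segments` input format are hypothetical. The position rule (text tokens advance all three rows together, image tokens index a t × h × w grid) and the definition of delta as "max position + 1 minus the token count" are assumptions based on the Qwen2-VL-style scheme. Video timing, audio, `context_len`, and `seq_len` are ignored.

```python
def sketch_mrope_positions(segments):
    """Illustrative only: build 3-row positions for text and image segments.

    `segments` is a hypothetical input: a list of ("text", n_tokens) or
    ("image", (t, h, w)) entries, where (t, h, w) is assumed to be the
    grid size after spatial merging.
    """
    rows = [[], [], []]  # temporal, height, width
    next_pos = 0
    for kind, spec in segments:
        if kind == "text":
            # Text tokens share the same position in all three rows.
            for i in range(spec):
                for row in rows:
                    row.append(next_pos + i)
            next_pos += spec
        else:
            t, h, w = spec
            # Image tokens are indexed by their grid coordinates.
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        rows[0].append(next_pos + ti)
                        rows[1].append(next_pos + hi)
                        rows[2].append(next_pos + wi)
            # Assumption: the next segment starts after the largest index used.
            next_pos += max(t, h, w)
    n_tokens = len(rows[0])
    # Assumption: delta = (max position + 1) - number of tokens.
    delta = max(max(row) for row in rows) + 1 - n_tokens
    return rows, delta


positions, delta = sketch_mrope_positions(
    [("text", 2), ("image", (1, 2, 2)), ("text", 1)]
)
```

Under these assumptions, a negative delta means the image occupied more token slots than position steps, so later positions run behind the raw token index.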

get_input_positions_tensor classmethod

get_input_positions_tensor(
    input_tokens: list[int],
    hf_config: PretrainedConfig,
    image_grid_thw: list[list[int]] | Tensor,
    video_grid_thw: list[list[int]] | Tensor,
    second_per_grid_ts: list[float],
    context_len: int = 0,
    seq_len: int | None = None,
    audio_feature_lengths: Tensor | None = None,
    use_audio_in_video: bool = False,
) -> tuple[Tensor, int]

omni_get_updates_use_audio_in_video classmethod

omni_get_updates_use_audio_in_video(
    thinker_config: PretrainedConfig,
    audio_len: int,
    video_grid_thw: list[int] | Tensor,
    video_second_per_grid_t: float,
) -> list[int]

Get video prompt updates when use_audio_in_video is True.

In this case, the audio and vision update ids are split into chunks and interleaved (see _omni_get_input_positions_tensor for details).

<|video_bos|><|VIDEO|><|video_eos|> => <|video_bos|><|audio_bos|>(... chunks ...)<|audio_eos|><|video_eos|>
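
The chunked interleaving can be pictured with the following sketch, which builds only the span between <|video_bos|> and <|video_eos|>. It is a simplified illustration, not the method's actual logic. The function name `interleave_chunks`, its parameters, the `<|AUDIO|>` placeholder string, and the choice to emit video tokens before audio tokens within each chunk are all assumptions. The real method returns token ids and derives its chunk sizes from `thinker_config`, `video_grid_thw`, and `video_second_per_grid_t`.

```python
def interleave_chunks(video_tokens_per_chunk, audio_len, audio_tokens_per_chunk):
    """Illustrative only: interleave per-chunk video and audio placeholders.

    All parameter names are hypothetical. `video_tokens_per_chunk` lists how
    many video tokens fall in each time chunk, `audio_len` is the total number
    of audio tokens, and `audio_tokens_per_chunk` caps the audio tokens
    emitted per chunk.
    """
    out = ["<|audio_bos|>"]
    used = 0
    for n_video in video_tokens_per_chunk:
        # Assumption: video tokens of a chunk come first, then its audio tokens.
        out += ["<|VIDEO|>"] * n_video
        n_audio = min(audio_tokens_per_chunk, audio_len - used)
        out += ["<|AUDIO|>"] * n_audio
        used += n_audio
    # Any audio left after the last video chunk is appended at the end.
    out += ["<|AUDIO|>"] * (audio_len - used)
    out.append("<|audio_eos|>")
    return out


tokens = interleave_chunks([2, 2], audio_len=5, audio_tokens_per_chunk=2)
```

With two video chunks of two tokens each and five audio tokens, the sketch alternates video and audio pairs and then appends the one leftover audio token before <|audio_eos|>.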