
vllm_omni.model_executor.models.bagel

Modules:

Name      Description
bagel
pipeline  BAGEL-7B-MoT pipeline topologies (frozen).

OmniBagelForConditionalGeneration

Bases: BagelForConditionalGeneration

Omni version of BagelForConditionalGeneration.

Extends the base model with a VAE encoder so that img2img can embed both VAE latents and ViT features within the AR stage, producing a combined KV cache that is then transferred to the DiT stage.

Position IDs are adjusted so that:
  • VAE tokens all share position 0
  • ViT tokens all share position 1
  • Text tokens use sequential positions starting from 2

This matches the position scheme used by the single-stage DiT pipeline, ensuring that the transferred KV cache and rotary position embeddings (RoPE) are directly compatible with the DiT's denoising loop.
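The position scheme can be sketched in a few lines. This is a minimal illustration, assuming the tokens are packed in the order VAE, ViT, text; the helper name `build_position_ids` and that ordering are assumptions for the example and are not part of the vllm_omni API.

```python
def build_position_ids(num_vae: int, num_vit: int, num_text: int) -> list[int]:
    """Hypothetical helper: position IDs for one img2img request.

    Assumes tokens are laid out as [VAE..., ViT..., text...].
    """
    vae_positions = [0] * num_vae                  # all VAE tokens share position 0
    vit_positions = [1] * num_vit                  # all ViT tokens share position 1
    text_positions = list(range(2, 2 + num_text))  # text is sequential from 2
    return vae_positions + vit_positions + text_positions


if __name__ == "__main__":
    print(build_position_ids(num_vae=2, num_vit=3, num_text=4))
```

Because every image token of one kind shares a single position, the text positions do not depend on the image resolution.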

device instance-attribute

device = get_local_device()

downsample instance-attribute

downsample = get('downsample')

latent_channel instance-attribute

latent_channel = get('z_channels')

latent_downsample instance-attribute

latent_downsample = downsample * latent_patch_size

latent_patch_size instance-attribute

latent_patch_size = getattr(config, 'latent_patch_size', 2)

latent_pos_embed instance-attribute

latent_pos_embed = PositionEmbedding(
    max_latent_size, hidden_size
)

max_latent_size instance-attribute

max_latent_size = getattr(config, 'max_latent_size', 32)
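The size attributes above combine arithmetically. The sketch below shows how they might determine the latent patch grid for an image. It assumes floor division, that `max_latent_size` bounds the number of latent patches per side (suggested by its use in `PositionEmbedding`, but not stated in the source), and a VAE `downsample` of 8 chosen purely for illustration. The function `latent_grid` is hypothetical.

```python
def latent_grid(
    img_h: int,
    img_w: int,
    downsample: int,
    latent_patch_size: int = 2,   # default from getattr(config, 'latent_patch_size', 2)
    max_latent_size: int = 32,    # default from getattr(config, 'max_latent_size', 32)
) -> tuple[int, int, int]:
    """Hypothetical helper: return (latent_downsample, patches_h, patches_w)."""
    # Same product as the latent_downsample attribute.
    latent_downsample = downsample * latent_patch_size
    patches_h = img_h // latent_downsample
    patches_w = img_w // latent_downsample
    # Assumption: the position embedding only covers max_latent_size per side.
    if patches_h > max_latent_size or patches_w > max_latent_size:
        raise ValueError("image exceeds the assumed max_latent_size grid")
    return latent_downsample, patches_h, patches_w


if __name__ == "__main__":
    # downsample=8 is an illustrative value, not read from a real config.
    print(latent_grid(512, 256, downsample=8))
```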

packed_modules_mapping class-attribute instance-attribute

packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
    "qkv_proj_moe_gen": [
        "q_proj_moe_gen",
        "k_proj_moe_gen",
        "v_proj_moe_gen",
    ],
    "mlp_moe_gen.gate_up_proj": [
        "mlp_moe_gen.gate_proj",
        "mlp_moe_gen.up_proj",
    ],
}
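To illustrate what the mapping expresses, the sketch below rewrites a per-projection checkpoint parameter name into the name of the fused module that would hold it. This is not vLLM's loader logic; `fused_param_name` and the example parameter paths are hypothetical and only demonstrate the naming relation encoded by `packed_modules_mapping`.

```python
PACKED_MODULES_MAPPING = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
    "qkv_proj_moe_gen": ["q_proj_moe_gen", "k_proj_moe_gen", "v_proj_moe_gen"],
    "mlp_moe_gen.gate_up_proj": ["mlp_moe_gen.gate_proj", "mlp_moe_gen.up_proj"],
}


def fused_param_name(name: str, mapping: dict[str, list[str]] = PACKED_MODULES_MAPPING) -> str:
    """Hypothetical helper: map a shard parameter name to its fused name."""
    pairs = [(shard, packed) for packed, shards in mapping.items() for shard in shards]
    # Try longer shard names first so dotted entries win over their suffixes.
    pairs.sort(key=lambda pair: len(pair[0]), reverse=True)
    padded = f".{name}."  # pad so only whole dotted components can match
    for shard, packed in pairs:
        needle = f".{shard}."
        if needle in padded:
            return padded.replace(needle, f".{packed}.", 1).strip(".")
    return name  # not a packed shard: name is unchanged


if __name__ == "__main__":
    print(fused_param_name("layers.0.self_attn.q_proj.weight"))
    print(fused_param_name("layers.0.self_attn.k_proj_moe_gen.weight"))
```

Matching on whole dotted components keeps `q_proj` from accidentally matching inside `q_proj_moe_gen`.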

time_embedder instance-attribute

time_embedder = TimestepEmbedder(hidden_size)

vae instance-attribute

vae2llm instance-attribute

vae2llm = Linear(patch_latent_dim, hidden_size)
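The source does not show how `patch_latent_dim` is derived. A plausible reading, stated here as an assumption, is that each latent patch of `latent_patch_size` x `latent_patch_size` positions is flattened across `latent_channel` channels before being projected to `hidden_size`. The function below and the value of 16 channels are illustrative only.

```python
def patch_latent_dim(latent_channel: int, latent_patch_size: int = 2) -> int:
    """Assumed input width of vae2llm: one flattened latent patch."""
    # patch area (p * p) times the number of VAE latent channels
    return latent_patch_size ** 2 * latent_channel


if __name__ == "__main__":
    # 16 channels is an illustrative value for z_channels, not read from a config.
    print(patch_latent_dim(latent_channel=16))
```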

embed_multimodal

embed_multimodal(
    **kwargs: object,
) -> MultiModalEmbeddings | None

flush_pending_metadata

flush_pending_metadata(req_ids: list[str]) -> None

Map pending metadata (batch order) to req_ids after forward().

Guard: if a request already has metadata with image_shape (written during img2img prefill), don't overwrite it with decode-step metadata that lacks image_shape.
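The guard can be sketched as follows. This is a simplified illustration, assuming the pending metadata is a list in batch order and the per-request store is a plain dict; the names `flush_pending`, `pending`, and `store` are hypothetical and do not reflect the class's actual attributes.

```python
from typing import Any


def flush_pending(
    pending: list[dict[str, Any]],
    req_ids: list[str],
    store: dict[str, dict[str, Any]],
) -> None:
    """Hypothetical sketch: assign batch-ordered metadata to request IDs."""
    for req_id, meta in zip(req_ids, pending):
        existing = store.get(req_id)
        # Guard: keep prefill metadata that carries image_shape when the
        # incoming (decode-step) metadata does not have it.
        if existing is not None and "image_shape" in existing and "image_shape" not in meta:
            continue
        store[req_id] = meta
    pending.clear()


if __name__ == "__main__":
    store = {"req-a": {"image_shape": (64, 64)}}
    flush_pending([{"step": 1}, {"step": 1}], ["req-a", "req-b"], store)
    print(store)
```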

forward

forward(
    input_ids: Tensor | None,
    positions: Tensor,
    intermediate_tensors=None,
    inputs_embeds: Tensor | None = None,
    **kwargs: object,
) -> Tensor

get_flattened_position_ids

get_flattened_position_ids(
    img_h, img_w, patch_size, max_num_patches_per_side
)
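The page gives no description for this function. One plausible reading of the signature, offered as an assumption, is that each image patch receives a row-major index within a square grid of `max_num_patches_per_side` patches per side. The pure-Python sketch below follows that reading; the name `flattened_position_ids` is hypothetical and the real method may differ, for example by returning a tensor.

```python
def flattened_position_ids(
    img_h: int, img_w: int, patch_size: int, max_num_patches_per_side: int
) -> list[int]:
    """Assumed behaviour: row-major patch indices in a fixed-width grid."""
    num_patches_h = img_h // patch_size
    num_patches_w = img_w // patch_size
    return [
        row * max_num_patches_per_side + col  # index inside the full grid
        for row in range(num_patches_h)
        for col in range(num_patches_w)
    ]


if __name__ == "__main__":
    print(flattened_position_ids(4, 6, patch_size=2, max_num_patches_per_side=10))
```

Under this reading, a patch at a given row and column receives the same ID whatever the image size.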

get_kv_transfer_metadata

get_kv_transfer_metadata(
    req_id: str, *, num_computed_tokens: int | None = None
) -> dict[str, Any] | None

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

prepare_runner_inputs

prepare_runner_inputs(
    input_ids: Tensor | None,
    positions: Tensor | None,
    inputs_embeds: Tensor | None,
    req_ids: list[str],
    num_computed_tokens: list[int],
    num_scheduled_tokens: list[int],
    input_ids_buffer: Tensor | None = None,
) -> tuple[Tensor | None, Tensor | None]

Restore input_ids so _adjust_positions_for_img2img can locate the <|fim_middle|> placeholder for thinking-mode pre_text_len detection.