vllm_omni.model_executor.models.bagel ¶
Modules:
| Name | Description |
|---|---|
| bagel | |
| pipeline | BAGEL-7B-MoT pipeline topologies (frozen). |
OmniBagelForConditionalGeneration ¶
Bases: BagelForConditionalGeneration
Omni version of BagelForConditionalGeneration.
Extends the base model with a VAE encoder so that img2img can embed both VAE latents and ViT features within the AR stage, producing a combined KV cache that is then transferred to the DiT stage.
Position IDs are adjusted as follows:
- VAE tokens all share position 0
- ViT tokens all share position 1
- Text tokens use sequential positions starting from 2
This matches the position scheme used by the single-stage DiT pipeline, so the transferred KV cache and rotary position embeddings (ropes) are directly compatible with the DiT's denoising loop.
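The position scheme above can be illustrated with a minimal sketch. It is not the model's actual implementation: `TokenKind` and `assign_position_ids` are hypothetical names introduced only for this example, and the sketch assumes the token kinds of a single request are already known in sequence order.

```python
from enum import Enum


class TokenKind(Enum):
    """Hypothetical labels for the three token types in an img2img prompt."""
    VAE = "vae"
    VIT = "vit"
    TEXT = "text"


def assign_position_ids(kinds: list[TokenKind]) -> list[int]:
    """Assign position IDs following the scheme described above.

    VAE tokens share position 0, ViT tokens share position 1,
    and text tokens count up sequentially starting from 2.
    """
    positions: list[int] = []
    next_text_pos = 2  # first text token gets position 2
    for kind in kinds:
        if kind is TokenKind.VAE:
            positions.append(0)
        elif kind is TokenKind.VIT:
            positions.append(1)
        else:
            positions.append(next_text_pos)
            next_text_pos += 1
    return positions


example = [TokenKind.VAE] * 2 + [TokenKind.VIT] * 3 + [TokenKind.TEXT] * 3
print(assign_position_ids(example))
```

Because every image token of a given kind shares one position, only the text tokens advance the position counter, which is what lets the AR-stage cache line up with the positions the DiT stage expects.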
latent_pos_embed instance-attribute ¶
latent_pos_embed = PositionEmbedding(
    max_latent_size, hidden_size
)
packed_modules_mapping class-attribute instance-attribute ¶
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
    "qkv_proj_moe_gen": [
        "q_proj_moe_gen",
        "k_proj_moe_gen",
        "v_proj_moe_gen",
    ],
    "mlp_moe_gen.gate_up_proj": [
        "mlp_moe_gen.gate_proj",
        "mlp_moe_gen.up_proj",
    ],
}
flush_pending_metadata ¶
Map pending metadata (batch order) to req_ids after forward().
Guard: if a request already has metadata with image_shape (written during img2img prefill), don't overwrite it with decode-step metadata that lacks image_shape.
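The guard described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method's real body: the function name `flush_pending`, the plain-dict metadata store, and the `ropes` key are hypothetical. Only `image_shape` and the batch-order pairing with `req_ids` come from the description.

```python
from typing import Any


def flush_pending(
    pending: list[dict[str, Any]],
    req_ids: list[str],
    store: dict[str, dict[str, Any]],
) -> None:
    """Move batch-ordered pending metadata into a per-request store.

    An existing entry that carries image_shape (for example, one written
    during img2img prefill) is kept when the incoming metadata lacks it.
    """
    for req_id, meta in zip(req_ids, pending):
        existing = store.get(req_id)
        if (
            existing is not None
            and "image_shape" in existing
            and "image_shape" not in meta
        ):
            continue  # keep the prefill metadata, skip the decode-step entry
        store[req_id] = meta
    pending.clear()


# Request "a" already has prefill metadata; request "b" is new.
store = {"a": {"image_shape": (64, 64), "ropes": [0]}}
flush_pending([{"ropes": [5]}, {"ropes": [7]}], ["a", "b"], store)
print(store)
```

Without the guard, the first decode step after an img2img prefill would replace the stored entry and the image shape would no longer be available to later consumers of the metadata.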
forward ¶
forward(
    input_ids: Tensor | None,
    positions: Tensor,
    intermediate_tensors=None,
    inputs_embeds: Tensor | None = None,
    **kwargs: object,
) -> Tensor
get_flattened_position_ids ¶
get_kv_transfer_metadata ¶
get_kv_transfer_metadata(
    req_id: str, *, num_computed_tokens: int | None = None
) -> dict[str, Any] | None
prepare_runner_inputs ¶
prepare_runner_inputs(
    input_ids: Tensor | None,
    positions: Tensor | None,
    inputs_embeds: Tensor | None,
    req_ids: list[str],
    num_computed_tokens: list[int],
    num_scheduled_tokens: list[int],
    input_ids_buffer: Tensor | None = None,
) -> tuple[Tensor | None, Tensor | None]
Restore input_ids so _adjust_positions_for_img2img can locate the <|fim_middle|> placeholder for thinking-mode pre_text_len detection.