vllm_omni.diffusion.models.lance ¶
Lance (ByteDance) diffusion model components.
Lance is a unified autoregressive + diffusion multimodal model built on a Qwen2.5-VL-3B backbone. Architecturally it belongs to the BAGEL family (ByteDance Mixture-of-Transformers, MoT): the released Lance_3B checkpoint uses the same *_moe_gen MoT weight layout as BAGEL, plus the vae2llm / llm2vae / time_embedder / latent_pos_embed connectors. The differences from BAGEL are:
- backbone is Qwen2.5-VL (mRoPE) instead of Qwen2,
- understanding ViT is Qwen2.5-VL vision (not SigLIP), loaded from the base Qwen/Qwen2.5-VL-3B-Instruct rather than from the Lance checkpoint,
- VAE is Wan2.2 (reused from the vLLM-Omni WAN path) instead of the BAGEL AE,
- video path adds 3D latent position embeddings (follow-up; this module implements the image path first).
Because Lance is BAGEL-lineage, the transformer core is reused verbatim from vllm_omni.diffusion.models.bagel.bagel_transformer and only the pipeline wiring (ViT / VAE / checkpoint layout) is specialized here.
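To make the checkpoint layout described above more concrete, the following is a minimal sketch of how parameter names could be bucketed into connector weights, MoT generation-expert weights, and everything else. The connector prefixes and the `_moe_gen` marker come from the description above; the function `group_checkpoint_keys` and the example key names are hypothetical and are not taken from the Lance checkpoint or the vLLM-Omni loader.

```python
from collections import defaultdict

# Connector prefixes named in the module description.
CONNECTOR_PREFIXES = ("vae2llm", "llm2vae", "time_embedder", "latent_pos_embed")


def group_checkpoint_keys(keys):
    """Bucket parameter names into connector / MoT-gen / shared groups.

    Hypothetical helper for illustration; not part of vllm_omni.
    """
    groups = defaultdict(list)
    for key in keys:
        if key.startswith(CONNECTOR_PREFIXES):
            groups["connector"].append(key)
        elif "_moe_gen" in key:
            groups["mot_gen"].append(key)
        else:
            groups["shared"].append(key)
    return dict(groups)


# Hypothetical key names, chosen only to exercise each branch.
example_keys = [
    "language_model.model.layers.0.mlp_moe_gen.up_proj.weight",
    "language_model.model.layers.0.mlp.up_proj.weight",
    "vae2llm.weight",
    "time_embedder.mlp.0.weight",
]
grouped = group_checkpoint_keys(example_keys)
```

A grouping like this can be handy when checking that a checkpoint actually follows the BAGEL-style layout before loading it.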
Modules:
| Name | Description |
|---|---|
| lance_transformer | Lance transformer pieces. |
| pipeline_lance | LancePipeline: Lance (ByteDance) packaged for the vLLM-Omni diffusion engine. |
| prompts | Lance chat / system prompts. |
| wan_vae | Wan2.2 VAE used by Lance, ported from upstream so |
LancePipeline ¶
Bases: BagelPipeline
Lance pipeline. Inherits BAGEL's forward/generation; overrides only construction (checkpoint layout, Qwen2.5-VL ViT, Wan2.2 VAE).
bagel instance-attribute ¶
bagel = LanceBagel(
    language_model=language_model,
    vit_model=vit_model,
    parallel_config=parallel_config,
    quant_config=quant_config,
    prefix="bagel",
    config=BagelConfig(
        llm_config=llm_config,
        vae_config=vae_cfg,
        vit_config=vit_cfg,
        vit_max_num_patch_per_side=vit_max_num_patch_per_side,
        connector_act=connector_act,
        interpolate_pos=False,
        latent_patch_size=latent_patch_size_spatial,
        max_latent_size=max_latent_size,
        timestep_shift=timestep_shift,
        visual_gen=True,
        visual_und=und_enabled,
    ),
)
language_model instance-attribute ¶
language_model = Qwen2MoTForCausalLM(
    llm_config,
    parallel_config=parallel_config,
    quant_config=quant_config,
    prefix="bagel.language_model",
)
weights_sources instance-attribute ¶
weights_sources = [
    ComponentSource(
        model_or_path=weights_model,
        subfolder=ckpt_dir if ckpt_path != repo_root else None,
        revision=revision,
        prefix="bagel.",
        fall_back_to_pt=False,
    )
]
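The `subfolder` argument above is set only when the checkpoint lives below the repository root. The sketch below illustrates that conditional in isolation. It assumes `ckpt_dir` is the checkpoint path relative to `repo_root`, which the source does not state; `resolve_subfolder` is a hypothetical helper, not a vLLM-Omni function.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_subfolder(ckpt_path: str, repo_root: str) -> Optional[str]:
    """Return the checkpoint directory relative to the repo root.

    Returns None when the checkpoint sits directly at the root, mirroring
    `ckpt_dir if ckpt_path != repo_root else None`. Assumes ckpt_path is
    located inside repo_root (relative_to raises ValueError otherwise).
    """
    ckpt = PurePosixPath(ckpt_path)
    root = PurePosixPath(repo_root)
    if ckpt == root:
        return None
    return ckpt.relative_to(root).as_posix()


# Hypothetical paths, for illustration only.
at_root = resolve_subfolder("/models/lance", "/models/lance")
nested = resolve_subfolder("/models/lance/Lance_3B", "/models/lance")
```

Passing `None` rather than an empty string makes it explicit that no subfolder lookup is needed when the weights sit at the repository root.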
forward ¶
Dispatch on prompt modality.
- modalities == ["video"] (text-to-video) → :meth:`_forward_t2v` (3-D latents + LanceWanVAE.decode_video).
- modalities == ["text"] + multi_modal_data.video (x2t_video) → :meth:`_forward_x2t_video` (multi-frame Qwen2.5-VL ViT prefill).
- modalities == ["image"] + multi_modal_data.img2img (image_edit) → :meth:`_forward_image_edit` (Lance-native VAE+ViT prefill + image gen).
- modalities == ["video"] + multi_modal_data.video (video_edit) → :meth:`_forward_video_edit` (Lance-native multi-frame VAE+ViT prefill + video gen).
- Everything else falls through to :meth:`BagelPipeline.forward` (t2i, x2t_image).
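The dispatch rules can be sketched as a small routing function. This is an illustration under stated assumptions, not the actual `forward` implementation: it assumes `multi_modal_data` behaves like a dict with optional `video` and `img2img` entries, and it checks for a video input before choosing between video_edit and t2v, since both use modalities == ["video"]. The function name `select_route` and the returned labels are hypothetical.

```python
def select_route(modalities, multi_modal_data=None):
    """Return a label for the handler that a request would be routed to.

    Hypothetical sketch; the real pipeline calls the _forward_* methods.
    """
    mm = multi_modal_data or {}
    has_video = mm.get("video") is not None
    has_img2img = mm.get("img2img") is not None

    if modalities == ["video"]:
        # Both video routes share the same modalities value, so the
        # presence of an input video is what tells them apart.
        return "video_edit" if has_video else "t2v"
    if modalities == ["text"] and has_video:
        return "x2t_video"
    if modalities == ["image"] and has_img2img:
        return "image_edit"
    # t2i, x2t_image, and anything else fall through to the BAGEL path.
    return "bagel_forward"


routes = [
    select_route(["video"]),
    select_route(["video"], {"video": "clip.mp4"}),
    select_route(["text"], {"video": "clip.mp4"}),
    select_route(["image"], {"img2img": "input.png"}),
    select_route(["image"]),
]
```

Keeping the fallback as the final return means any request shape not listed explicitly still reaches the inherited BAGEL behaviour.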
get_lance_post_process_func ¶
get_lance_post_process_func(od_config: OmniDiffusionConfig)
Lance returns PIL.Image.Image directly, same as BAGEL.