vllm_omni.diffusion.models.lance.pipeline_lance ¶
LancePipeline — Lance (ByteDance) packaged for the vLLM-Omni diffusion engine.
Lance is BAGEL-lineage (Qwen2-MoT unified AR+diffusion), so the transformer core and the entire generation/forward machinery are inherited unchanged from :class:BagelPipeline. Only model construction differs, and only in three well-localized places:
- Checkpoint layout: the HF repo bytedance-research/Lance bundles the Lance_3B/ (image) and Lance_3B_Video/ (video) LLM checkpoints, Qwen2.5-VL-ViT/ (understanding ViT) and Wan2.2_VAE.pth (VAE) in a single repo. There is no BAGEL-style top-level config.json carrying vae_config / vit_config / latent_patch_size; those are Lance constants taken from upstream config/config_factory.py and inference_lance.sh and hardcoded in :data:LANCE_DEFAULTS.
- Understanding ViT: Qwen2.5-VL vision tower (bundled Qwen2.5-VL-ViT/vit.safetensors) instead of SigLIP.
- VAE: Wan2.2 (Wan2.2_VAE.pth) instead of the BAGEL autoencoder.
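The bundled layout described above can be checked on a local download before the pipeline is constructed. The following is a minimal sketch, not part of the pipeline: the helper name missing_lance_entries and the EXPECTED_ENTRIES tuple are hypothetical, and only the four relative paths come from the description above.

```python
from pathlib import Path

# Relative paths named in the checkpoint layout description.
EXPECTED_ENTRIES = (
    "Lance_3B",                        # image LLM checkpoint directory
    "Lance_3B_Video",                  # video LLM checkpoint directory
    "Qwen2.5-VL-ViT/vit.safetensors",  # understanding ViT weights
    "Wan2.2_VAE.pth",                  # VAE weights
)


def missing_lance_entries(repo_root):
    """Return the expected entries that do not exist under repo_root.

    Hypothetical helper for illustration; the pipeline itself may
    resolve and validate paths differently.
    """
    root = Path(repo_root)
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]
```

An empty list means every bundled component was found; since this change covers only the image path, a missing Lance_3B_Video entry would not necessarily be an error.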
Scope: this lands the image path (t2i / image_edit / x2t_image), which is the direct BAGEL analogue. The Lance_3B_Video path needs the 3-D latent position embedding (:class:LancePositionEmbedding3D) and temporal VAE handling, and is an explicit follow-up; see the PR description.
Bring-up status (verified on a B300 against bytedance-research/Lance):
- t2i: end-to-end working (1024x1024 image in ~6 s, 0 missing keys).
- x2t (image understanding): plumbing wired (Qwen2.5-VL ViT + Qwen2-VL image processor + no-op connector/vit_pos_embed; VAE prefill skipped for Lance via :class:LanceBagel). The pipeline runs to completion without crashes but currently emits an immediate EOS because the Qwen2.5-VL backbone needs mRoPE position ids on the image+text sequence and we presently force scalar positions (rope_scaling = None); enabling mRoPE end-to-end is the next follow-up.
- image_edit (img2img): blocked on the same VAE-prefill issue (Wan2.2 latents do not map onto BAGEL's latent_pos_embed grid); needs a Lance-specific prepare_vae_images.
- video (Lance_3B_Video): needs the 3-D latent position embedding wired into Bagel plus a multi-frame VAE decode path in the pipeline. :class:LanceWanVAE.decode_video and :class:LancePositionEmbedding3D are already in place.
- Two-stage (AR thinker + DiT): needs LanceConfig / LanceProcessor registered in the vllm package; tracked in a separate PR.
LanceDefaults dataclass ¶
Lance constants that upstream keeps in config/config_factory.py / inference_lance.sh rather than in any shipped JSON.
Verified against bytedance-research/Lance's released Lance_3B/model.safetensors: vae2llm.weight = (2048, 48) ⇒ patch_latent_dim = latent_patch_size**2 * z_channels = 1 * 48, i.e. Lance does not unfold the Wan latent into a 2×2 patch the way BAGEL does (Wan2.2 already patchifies internally), and latent_pos_embed.pos_embed = (4096, 2048) ⇒ max_latent_size = 64.
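The two constants can be re-derived from the quoted tensor shapes with plain arithmetic. This is a minimal sketch that uses only the shapes and the z_channels value of 48 given above; it does not load the checkpoint, and the variable names are illustrative.

```python
import math

# Shapes quoted above from Lance_3B/model.safetensors.
vae2llm_weight_shape = (2048, 48)       # (hidden_size, patch_latent_dim)
latent_pos_embed_shape = (4096, 2048)   # (num_positions, hidden_size)

# z_channels as stated in the text (the Wan2.2 latent channel count).
z_channels = 48

hidden_size, patch_latent_dim = vae2llm_weight_shape
# patch_latent_dim = latent_patch_size**2 * z_channels
latent_patch_size = math.isqrt(patch_latent_dim // z_channels)
assert latent_patch_size ** 2 * z_channels == patch_latent_dim

num_positions, pos_hidden = latent_pos_embed_shape
# The position table covers a square grid of max_latent_size**2 cells.
max_latent_size = math.isqrt(num_positions)
assert max_latent_size ** 2 == num_positions
assert pos_hidden == hidden_size

print(latent_patch_size, max_latent_size)  # prints "1 64"
```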
LancePipeline ¶
Bases: BagelPipeline
Lance pipeline. Inherits BAGEL's forward/generation; overrides only construction (checkpoint layout, Qwen2.5-VL ViT, Wan2.2 VAE).
bagel instance-attribute ¶
bagel = LanceBagel(
    language_model=language_model,
    vit_model=vit_model,
    parallel_config=parallel_config,
    quant_config=quant_config,
    prefix="bagel",
    config=BagelConfig(
        llm_config=llm_config,
        vae_config=vae_cfg,
        vit_config=vit_cfg,
        vit_max_num_patch_per_side=vit_max_num_patch_per_side,
        connector_act=connector_act,
        interpolate_pos=False,
        latent_patch_size=latent_patch_size_spatial,
        max_latent_size=max_latent_size,
        timestep_shift=timestep_shift,
        visual_gen=True,
        visual_und=und_enabled,
    ),
)
language_model instance-attribute ¶
language_model = Qwen2MoTForCausalLM(
    llm_config,
    parallel_config=parallel_config,
    quant_config=quant_config,
    prefix="bagel.language_model",
)
weights_sources instance-attribute ¶
weights_sources = [
    ComponentSource(
        model_or_path=weights_model,
        subfolder=ckpt_dir if ckpt_path != repo_root else None,
        revision=revision,
        prefix="bagel.",
        fall_back_to_pt=False,
    )
]
forward ¶
Dispatch on prompt modality.
- modalities == ["video"] (text-to-video) → :meth:_forward_t2v (3-D latents + LanceWanVAE.decode_video).
- modalities == ["text"] + multi_modal_data.video (x2t_video) → :meth:_forward_x2t_video (multi-frame Qwen2.5-VL ViT prefill).
- modalities == ["image"] + multi_modal_data.img2img (image_edit) → :meth:_forward_image_edit (Lance-native VAE+ViT prefill + image gen).
- modalities == ["video"] + multi_modal_data.video (video_edit) → :meth:_forward_video_edit (Lance-native multi-frame VAE+ViT prefill + video gen).
- Everything else falls through to :meth:BagelPipeline.forward (t2i, x2t_image).
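The dispatch rules above can be sketched as a small routing function. This is an illustration under stated assumptions, not the pipeline's code: select_forward_path is a hypothetical helper, the real forward receives a request object rather than these two arguments, and the order of the checks is assumed. Since text-to-video and video_edit both use modalities == ["video"], the sketch tests for video input first.

```python
def select_forward_path(modalities, multi_modal_data=None):
    """Return the name of the handler the dispatch rules would pick.

    Hypothetical helper: argument shapes and check order are assumptions
    made for illustration only.
    """
    data = multi_modal_data or {}
    has_video = data.get("video") is not None
    has_img2img = data.get("img2img") is not None

    # Video input is checked before the bare ["video"] case so that
    # video_edit is not mistaken for text-to-video.
    if modalities == ["video"] and has_video:
        return "_forward_video_edit"
    if modalities == ["video"]:
        return "_forward_t2v"
    if modalities == ["text"] and has_video:
        return "_forward_x2t_video"
    if modalities == ["image"] and has_img2img:
        return "_forward_image_edit"
    # t2i and x2t_image are handled by the inherited BAGEL forward.
    return "BagelPipeline.forward"
```

For example, select_forward_path(["image"]) falls through to the inherited BAGEL forward, while the same modality with an img2img entry selects the Lance-native edit path.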
get_lance_post_process_func ¶
get_lance_post_process_func(od_config: OmniDiffusionConfig)
Lance returns PIL.Image.Image directly, same as BAGEL.