vllm_omni.diffusion.models.lance

Lance (ByteDance) diffusion model components.

Lance is a unified autoregressive + diffusion multimodal model on a Qwen2.5-VL-3B backbone. Architecturally it is the BAGEL family (ByteDance Mixture-of-Transformers): the released Lance_3B checkpoint uses the exact same *_moe_gen MoT weight layout as BAGEL, plus vae2llm / llm2vae / time_embedder / latent_pos_embed connectors. The deltas vs BAGEL are:

  • backbone is Qwen2.5-VL (mRoPE) instead of Qwen2,
  • understanding ViT is Qwen2.5-VL vision (not SigLIP), loaded from the base Qwen/Qwen2.5-VL-3B-Instruct rather than from the Lance checkpoint,
  • VAE is Wan2.2 (reused from the vLLM-Omni WAN path) instead of the BAGEL AE,
  • video path adds 3D latent position embeddings (follow-up; this module implements the image path first).

Because Lance is BAGEL-lineage, the transformer core is reused verbatim from vllm_omni.diffusion.models.bagel.bagel_transformer and only the pipeline wiring (ViT / VAE / checkpoint layout) is specialized here.

Modules:

Name               Description
lance_transformer  Lance transformer pieces.
pipeline_lance     LancePipeline: Lance (ByteDance) packaged for the vLLM-Omni diffusion engine.
prompts            Lance chat / system prompts.
wan_vae            Wan2.2 VAE used by Lance, ported from upstream so Wan2.2_VAE.pth loads

LancePipeline

Bases: BagelPipeline

Lance pipeline. Inherits BAGEL's forward/generation; overrides only construction (checkpoint layout, Qwen2.5-VL ViT, Wan2.2 VAE).
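
The "inherit generation, specialize construction" pattern described above can be sketched with toy classes. This is only an illustration: ToyBagelPipeline, ToyLancePipeline, build_vit and build_vae are hypothetical names that do not exist in vLLM-Omni, and the real LancePipeline may structure its constructor differently.

```python
class ToyBagelPipeline:
    """Hypothetical stand-in for BagelPipeline (not the real class)."""

    def __init__(self) -> None:
        # Construction is the only place where components are chosen.
        self.vit = self.build_vit()
        self.vae = self.build_vae()

    def build_vit(self) -> str:
        return "SigLIP"

    def build_vae(self) -> str:
        return "BAGEL AE"

    def forward(self, prompt: str) -> str:
        # Generation logic is written once, against whatever was built.
        return f"generate({prompt!r}) with vit={self.vit}, vae={self.vae}"


class ToyLancePipeline(ToyBagelPipeline):
    """Hypothetical stand-in for LancePipeline: swaps components only."""

    def build_vit(self) -> str:
        return "Qwen2.5-VL vision"

    def build_vae(self) -> str:
        return "Wan2.2"


pipe = ToyLancePipeline()
result = pipe.forward("a cat")
```

Here ToyLancePipeline defines no forward of its own, so the call resolves to the base implementation while using the Lance-specific components.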

bagel instance-attribute

bagel = LanceBagel(
    language_model=language_model,
    vit_model=vit_model,
    parallel_config=parallel_config,
    quant_config=quant_config,
    prefix="bagel",
    config=BagelConfig(
        llm_config=llm_config,
        vae_config=vae_cfg,
        vit_config=vit_cfg,
        vit_max_num_patch_per_side=vit_max_num_patch_per_side,
        connector_act=connector_act,
        interpolate_pos=False,
        latent_patch_size=latent_patch_size_spatial,
        max_latent_size=max_latent_size,
        timestep_shift=timestep_shift,
        visual_gen=True,
        visual_und=und_enabled,
    ),
)

device instance-attribute

device = get_local_device()

image_processor instance-attribute

image_processor = _build_image_processor()

language_model instance-attribute

language_model = Qwen2MoTForCausalLM(
    llm_config,
    parallel_config=parallel_config,
    quant_config=quant_config,
    prefix="bagel.language_model",
)

od_config instance-attribute

od_config = od_config

scheduler instance-attribute

scheduler = None

scheduler_kwargs instance-attribute

scheduler_kwargs = {}

tokenizer instance-attribute

tokenizer = _load_tokenizer(ckpt_path)

transformer instance-attribute

transformer = model

vae instance-attribute

vae = _build_wan22_vae(repo_root)

video_processor instance-attribute

video_processor = _build_video_processor()

vit_model instance-attribute

vit_model = _build_qwen2_5_vl_vit(repo_root)

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=weights_model,
        subfolder=ckpt_dir
        if ckpt_path != repo_root
        else None,
        revision=revision,
        prefix="bagel.",
        fall_back_to_pt=False,
    )
]

forward

forward(req)

Dispatch on prompt modality.

  • modalities == ["video"] (text-to-video) → _forward_t2v (3-D latents + LanceWanVAE.decode_video).
  • modalities == ["text"] + multi_modal_data.video (x2t_video) → _forward_x2t_video (multi-frame Qwen2.5-VL ViT prefill).
  • modalities == ["image"] + multi_modal_data.img2img (image_edit) → _forward_image_edit (Lance-native VAE+ViT prefill + image gen).
  • modalities == ["video"] + multi_modal_data.video (video_edit) → _forward_video_edit (Lance-native multi-frame VAE+ViT prefill + video gen).
  • Everything else falls through to BagelPipeline.forward (t2i, x2t_image).
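
The dispatch rules above can be sketched as a small routing function. This is a minimal sketch under stated assumptions: select_lance_route is a hypothetical helper, it takes the modalities and multi-modal data as plain arguments rather than the real req object, and it assumes that a video-output request with no source video is text-to-video.

```python
from typing import Any, Mapping, Optional, Sequence


def select_lance_route(
    modalities: Sequence[str],
    multi_modal_data: Optional[Mapping[str, Any]] = None,
) -> str:
    """Return the name of the handler the rules above would pick.

    Hypothetical helper for illustration; the real method receives `req`.
    """
    mm = multi_modal_data or {}
    mods = list(modalities)
    has_video = mm.get("video") is not None
    has_img2img = mm.get("img2img") is not None

    if mods == ["video"]:
        # Assumption: a source video separates video_edit from plain t2v.
        return "_forward_video_edit" if has_video else "_forward_t2v"
    if mods == ["text"] and has_video:
        return "_forward_x2t_video"
    if mods == ["image"] and has_img2img:
        return "_forward_image_edit"
    # t2i and x2t_image are handled by the inherited BAGEL path.
    return "BagelPipeline.forward"
```

Both video-output cases share modalities == ["video"], so the sketch checks for a source video first in order to tell video_edit apart from text-to-video.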

get_lance_post_process_func

get_lance_post_process_func(od_config: OmniDiffusionConfig)

Lance returns PIL.Image.Image directly, same as BAGEL.