
vllm_omni.diffusion.models.lance.pipeline_lance

LancePipeline — Lance (ByteDance) packaged for the vLLM-Omni diffusion engine.

Lance is BAGEL-lineage (Qwen2-MoT unified AR+diffusion), so the transformer core and the entire generation/forward machinery are inherited unchanged from :class:BagelPipeline. Only model construction differs, and only in three well-localized places:

  1. Checkpoint layout — the HF repo bytedance-research/Lance bundles the Lance_3B/ (image) and Lance_3B_Video/ (video) LLM checkpoints, Qwen2.5-VL-ViT/ (understanding ViT), and Wan2.2_VAE.pth (VAE) side by side. There is no BAGEL-style top-level config.json carrying vae_config / vit_config / latent_patch_size; those values are Lance constants taken from upstream config/config_factory.py and inference_lance.sh and hardcoded in :data:LANCE_DEFAULTS.
  2. Understanding ViT — Qwen2.5-VL vision tower (bundled Qwen2.5-VL-ViT/vit.safetensors) instead of SigLIP.
  3. VAE — Wan2.2 (Wan2.2_VAE.pth) instead of the BAGEL autoencoder.
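
The layout described in the three points above can be sketched as a small path resolver. This is an illustration only: `LANCE_LAYOUT`, `resolve_lance_layout`, and `missing_components` are hypothetical names that do not appear in the module, and the only facts taken from this page are the sub-paths themselves.

```python
from pathlib import Path

# Sub-paths as named in the checkpoint-layout description above.
# The dictionary keys are illustrative labels, not names used by the pipeline.
LANCE_LAYOUT = {
    "llm_image": "Lance_3B",
    "llm_video": "Lance_3B_Video",
    "vit": "Qwen2.5-VL-ViT/vit.safetensors",
    "vae": "Wan2.2_VAE.pth",
}


def resolve_lance_layout(repo_root: str) -> dict[str, Path]:
    """Hypothetical helper: map each component label to its path under repo_root."""
    root = Path(repo_root)
    return {name: root / rel for name, rel in LANCE_LAYOUT.items()}


def missing_components(repo_root: str) -> list[str]:
    """Return the labels whose file or directory does not exist on disk."""
    return [
        name
        for name, path in resolve_lance_layout(repo_root).items()
        if not path.exists()
    ]
```

A check like `missing_components(local_snapshot_dir)` before model construction would surface an incomplete download early, rather than failing partway through weight loading.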

Scope: this lands the image path (t2i / image_edit / x2t_image) which is the direct BAGEL analogue. The Lance_3B_Video path needs the 3-D latent position embedding (:class:LancePositionEmbedding3D) and temporal VAE handling and is an explicit follow-up — see the PR description.

Bring-up status (verified on a B300 against bytedance-research/Lance):

  • t2i: end-to-end working (1024x1024 image in ~6 s, 0 missing keys).
  • x2t (image understanding): plumbing wired (Qwen2.5-VL ViT + Qwen2-VL image processor + no-op connector/vit_pos_embed; VAE prefill skipped for Lance via :class:LanceBagel). The pipeline runs to completion without crashes but currently emits an immediate EOS because the Qwen2.5-VL backbone needs mRoPE position ids on the image+text sequence and we presently force scalar positions (rope_scaling = None); enabling mRoPE end-to-end is the next follow-up.
  • image_edit (img2img): blocked on the same VAE-prefill issue (Wan2.2 latents do not map onto BAGEL's latent_pos_embed grid); needs a Lance-specific prepare_vae_images.
  • video (Lance_3B_Video): needs the 3-D latent position embedding wired into Bagel plus a multi-frame VAE decode path in the pipeline. :class:LanceWanVAE.decode_video and :class:LancePositionEmbedding3D are already in place.
  • Two-stage (AR thinker + DiT): needs LanceConfig / LanceProcessor registered in the vllm package; tracked in a separate PR.

LANCE_DEFAULTS module-attribute

LANCE_DEFAULTS = LanceDefaults()

logger module-attribute

logger = init_logger(__name__)

LanceDefaults dataclass

Lance constants that upstream keeps in config/config_factory.py / inference_lance.sh rather than in any shipped JSON.

Verified against bytedance-research/Lance's released Lance_3B/model.safetensors: vae2llm.weight = (2048, 48) → patch_latent_dim = latent_patch_size**2 * z_channels = 1 * 48, i.e. Lance does not unfold the Wan latent into a 2×2 patch the way BAGEL does (Wan2.2 already patchifies internally); and latent_pos_embed.pos_embed = (4096, 2048) → max_latent_size = 64, since 64**2 = 4096.
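
The arithmetic behind that verification can be worked through as follows. This is a sketch rather than code from the module: `infer_latent_constants` is a hypothetical helper, and reading the two shapes as (hidden_size, patch_latent_dim) and (max_latent_size**2, hidden_size) is an interpretation of the figures quoted above.

```python
import math

# Shapes quoted above from Lance_3B/model.safetensors.
VAE2LLM_WEIGHT_SHAPE = (2048, 48)       # read as (hidden_size, patch_latent_dim)
LATENT_POS_EMBED_SHAPE = (4096, 2048)   # read as (max_latent_size**2, hidden_size)
Z_CHANNELS = 48                         # vae_z_channels from LanceDefaults


def infer_latent_constants(vae2llm_shape, pos_embed_shape, z_channels):
    """Hypothetical helper: recover (latent_patch_size, max_latent_size) from shapes."""
    hidden, patch_latent_dim = vae2llm_shape
    num_positions, pos_hidden = pos_embed_shape
    if hidden != pos_hidden:
        raise ValueError("hidden sizes of the two tensors disagree")

    # patch_latent_dim = latent_patch_size**2 * z_channels
    patch_area, remainder = divmod(patch_latent_dim, z_channels)
    latent_patch_size = math.isqrt(patch_area)
    if remainder or latent_patch_size**2 != patch_area:
        raise ValueError("patch_latent_dim is not a square multiple of z_channels")

    # The position table holds one row per cell of a square latent grid.
    max_latent_size = math.isqrt(num_positions)
    if max_latent_size**2 != num_positions:
        raise ValueError("position table is not a square grid")

    return latent_patch_size, max_latent_size


print(infer_latent_constants(VAE2LLM_WEIGHT_SHAPE, LATENT_POS_EMBED_SHAPE, Z_CHANNELS))
```

With the quoted shapes this yields a patch size of 1 and a maximum latent side of 64, matching the latent_patch_size_spatial and max_latent_size defaults listed below.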

cfg_text_scale class-attribute instance-attribute

cfg_text_scale: float = 4.0

connector_act class-attribute instance-attribute

connector_act: str = 'gelu_pytorch_tanh'

latent_patch_size_spatial class-attribute instance-attribute

latent_patch_size_spatial: int = 1

latent_patch_size_temporal class-attribute instance-attribute

latent_patch_size_temporal: int = 1

max_latent_size class-attribute instance-attribute

max_latent_size: int = 64

max_num_video_latent_frames class-attribute instance-attribute

max_num_video_latent_frames: int = 31

num_timesteps class-attribute instance-attribute

num_timesteps: int = 30

timestep_shift class-attribute instance-attribute

timestep_shift: float = 3.5
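
For orientation, the sketch below shows one common way num_timesteps and timestep_shift interact in flow-matching samplers. It rests on an assumption: this page does not state the schedule, and the formula t' = s·t / (1 + (s − 1)·t) is taken as what a BAGEL-style sampler applies, so treat it as illustrative rather than as the pipeline's implementation. `shifted_timesteps` is a hypothetical name.

```python
def shifted_timesteps(num_timesteps: int = 30, shift: float = 3.5) -> list[float]:
    """Hypothetical helper: a shifted schedule running from 1.0 down to 0.0."""
    if num_timesteps < 2:
        raise ValueError("need at least two timesteps")

    # Uniform grid from 1.0 down to 0.0, endpoints included.
    base = [1.0 - i / (num_timesteps - 1) for i in range(num_timesteps)]

    # Assumed shift formula: t' = s * t / (1 + (s - 1) * t).
    # It fixes the endpoints and, for s > 1, pushes interior steps toward 1.0,
    # so more of the step budget is spent at high noise levels.
    return [shift * t / (1.0 + (shift - 1.0) * t) for t in base]
```

Under this assumed formula a shift of 1.0 leaves the uniform grid unchanged, which makes it easy to compare the shifted and unshifted schedules side by side.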

vae_downsample_spatial class-attribute instance-attribute

vae_downsample_spatial: int = 16

vae_downsample_temporal class-attribute instance-attribute

vae_downsample_temporal: int = 4

vae_z_channels class-attribute instance-attribute

vae_z_channels: int = 48

vit_max_num_patch_per_side class-attribute instance-attribute

vit_max_num_patch_per_side: int = 70
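
To make the size-related defaults concrete, here is a sketch of how a pixel resolution maps to a latent grid. `LanceDefaultsSketch`, `latent_grid`, and `max_pixel_frames` are hypothetical names that copy only a few of the values listed above, and the frame-count rule is an assumption based on the usual causal video VAE convention (first frame encoded on its own), not something this page states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanceDefaultsSketch:
    """Illustrative copy of a few fields listed above; not the real class."""

    latent_patch_size_spatial: int = 1
    max_latent_size: int = 64
    max_num_video_latent_frames: int = 31
    vae_downsample_spatial: int = 16
    vae_downsample_temporal: int = 4


def latent_grid(height: int, width: int, d: LanceDefaultsSketch) -> tuple[int, int]:
    """Hypothetical helper: latent token grid (rows, cols) for an image size in pixels."""
    stride = d.vae_downsample_spatial * d.latent_patch_size_spatial
    if height % stride or width % stride:
        raise ValueError(f"image sides must be multiples of {stride}")
    rows, cols = height // stride, width // stride
    if max(rows, cols) > d.max_latent_size:
        raise ValueError("image exceeds the latent position-embedding grid")
    return rows, cols


def max_pixel_frames(d: LanceDefaultsSketch) -> int:
    """Assumed rule: 1 frame for the first latent, then
    vae_downsample_temporal frames for each further latent."""
    return 1 + (d.max_num_video_latent_frames - 1) * d.vae_downsample_temporal
```

With these values a 1024x1024 image fills the 64x64 grid exactly, which is consistent with the t2i resolution reported in the bring-up status.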

LancePipeline

Bases: BagelPipeline

Lance pipeline. Inherits BAGEL's forward/generation; overrides only construction (checkpoint layout, Qwen2.5-VL ViT, Wan2.2 VAE).

bagel instance-attribute

bagel = LanceBagel(
    language_model=language_model,
    vit_model=vit_model,
    parallel_config=parallel_config,
    quant_config=quant_config,
    prefix="bagel",
    config=BagelConfig(
        llm_config=llm_config,
        vae_config=vae_cfg,
        vit_config=vit_cfg,
        vit_max_num_patch_per_side=vit_max_num_patch_per_side,
        connector_act=connector_act,
        interpolate_pos=False,
        latent_patch_size=latent_patch_size_spatial,
        max_latent_size=max_latent_size,
        timestep_shift=timestep_shift,
        visual_gen=True,
        visual_und=und_enabled,
    ),
)

device instance-attribute

device = get_local_device()

image_processor instance-attribute

image_processor = _build_image_processor()

language_model instance-attribute

language_model = Qwen2MoTForCausalLM(
    llm_config,
    parallel_config=parallel_config,
    quant_config=quant_config,
    prefix="bagel.language_model",
)

od_config instance-attribute

od_config = od_config

scheduler instance-attribute

scheduler = None

scheduler_kwargs instance-attribute

scheduler_kwargs = {}

tokenizer instance-attribute

tokenizer = _load_tokenizer(ckpt_path)

transformer instance-attribute

transformer = model

vae instance-attribute

vae = _build_wan22_vae(repo_root)

video_processor instance-attribute

video_processor = _build_video_processor()

vit_model instance-attribute

vit_model = _build_qwen2_5_vl_vit(repo_root)

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=weights_model,
        subfolder=ckpt_dir
        if ckpt_path != repo_root
        else None,
        revision=revision,
        prefix="bagel.",
        fall_back_to_pt=False,
    )
]

forward

forward(req)

Dispatch on prompt modality.

  • modalities == ["video"] (text-to-video) → :meth:_forward_t2v (3-D latents + LanceWanVAE.decode_video).
  • modalities == ["text"] + multi_modal_data.video (x2t_video) → :meth:_forward_x2t_video (multi-frame Qwen2.5-VL ViT prefill).
  • modalities == ["image"] + multi_modal_data.img2img (image_edit) → :meth:_forward_image_edit (Lance-native VAE+ViT prefill + image gen).
  • modalities == ["video"] + multi_modal_data.video (video_edit) → :meth:_forward_video_edit (Lance-native multi-frame VAE+ViT prefill + video gen).
  • Everything else falls through to :meth:BagelPipeline.forward (t2i, x2t_image).
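
The dispatch rules above can be summarised as a pure routing function. This is a sketch, not the method body: `pick_lance_route` is a hypothetical name, the request is reduced to a modality list plus a plain mapping, and the precedence between the two `["video"]` cases (presence of a video input selects video_edit) is an assumption the list does not spell out.

```python
from typing import Any, Mapping, Optional, Sequence


def pick_lance_route(
    modalities: Sequence[str],
    multi_modal_data: Optional[Mapping[str, Any]] = None,
) -> str:
    """Hypothetical helper: name the handler that the rules above would select."""
    data = multi_modal_data or {}
    mods = list(modalities)
    has_video = data.get("video") is not None
    has_img2img = data.get("img2img") is not None

    if mods == ["video"]:
        # Assumed precedence: a supplied video means editing, otherwise text-to-video.
        return "_forward_video_edit" if has_video else "_forward_t2v"
    if mods == ["text"] and has_video:
        return "_forward_x2t_video"
    if mods == ["image"] and has_img2img:
        return "_forward_image_edit"
    # t2i and x2t_image fall through to the inherited BAGEL path.
    return "BagelPipeline.forward"
```

Keeping the routing decision separate from the handlers in this way makes it straightforward to unit-test the modality table without loading any weights.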

get_lance_post_process_func

get_lance_post_process_func(od_config: OmniDiffusionConfig)

Lance returns PIL.Image.Image directly, same as BAGEL.

get_lance_pre_process_func

get_lance_pre_process_func(od_config: OmniDiffusionConfig)