vllm_omni.diffusion.models.lance.pipeline_lance ¶

LancePipeline — Lance (ByteDance) packaged for the vLLM-Omni diffusion engine.

Lance is BAGEL-lineage (Qwen2-MoT unified AR+diffusion), so the transformer core and the entire generation/forward machinery are inherited unchanged from :class:BagelPipeline. Only model construction differs, and only in three well-localized places:

Checkpoint layout — the HF repo bytedance-research/Lance bundles Lance_3B/ (image) and Lance_3B_Video/ (video) LLM checkpoints, Qwen2.5-VL-ViT/ (understanding ViT) and Wan2.2_VAE.pth (VAE) in a single repo. There is no BAGEL-style top-level config.json carrying vae_config / vit_config / latent_patch_size — those are Lance constants taken from upstream config/config_factory.py and inference_lance.sh and hardcoded in :data:LANCE_DEFAULTS.
Understanding ViT — Qwen2.5-VL vision tower (bundled Qwen2.5-VL-ViT/vit.safetensors) instead of SigLIP.
VAE — Wan2.2 (Wan2.2_VAE.pth) instead of the BAGEL autoencoder.
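
The repo layout described above can be summarized with a small path sketch. The helper name `lance_component_paths` and the dictionary keys are hypothetical (they are not part of vllm_omni); only the directory and file names come from this page.

```python
from pathlib import PurePosixPath


def lance_component_paths(repo_root: str, video: bool = False) -> dict:
    """Hypothetical helper: map each Lance component to its location in the repo.

    Directory and file names are the ones listed on this page; the function
    itself and the keys "llm" / "vit" / "vae" are illustrative only.
    """
    root = PurePosixPath(repo_root)
    # The image and video LLM checkpoints live in sibling directories.
    llm_dir = "Lance_3B_Video" if video else "Lance_3B"
    return {
        "llm": root / llm_dir,
        "vit": root / "Qwen2.5-VL-ViT" / "vit.safetensors",
        "vae": root / "Wan2.2_VAE.pth",
    }


paths = lance_component_paths("/models/Lance")
for name, path in paths.items():
    print(f"{name}: {path}")
```

Pure paths are used so the sketch runs without a downloaded checkpoint; nothing here touches the filesystem.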

Scope: this lands the image path (t2i / image_edit / x2t_image) which is the direct BAGEL analogue. The Lance_3B_Video path needs the 3-D latent position embedding (:class:LancePositionEmbedding3D) and temporal VAE handling and is an explicit follow-up — see the PR description.

Bring-up status (verified on a B300 against bytedance-research/Lance):
* t2i: end-to-end working (1024x1024 image in ~6 s, 0 missing keys).
* x2t (image understanding): plumbing wired (Qwen2.5-VL ViT + Qwen2-VL image processor + no-op connector/vit_pos_embed; VAE prefill skipped for Lance via :class:LanceBagel). The pipeline runs to completion without crashes but currently emits an immediate EOS because the Qwen2.5-VL backbone needs mRoPE position ids on the image+text sequence and we presently force scalar positions (rope_scaling = None); enabling mRoPE end-to-end is the next follow-up.
* image_edit (img2img): blocked on the same VAE-prefill issue (Wan2.2 latents do not map onto BAGEL's latent_pos_embed grid); needs a Lance-specific prepare_vae_images.
* video (Lance_3B_Video): needs the 3-D latent position embedding wired into Bagel plus a multi-frame VAE decode path in the pipeline. :class:LanceWanVAE.decode_video and :class:LancePositionEmbedding3D are already in place.
* Two-stage (AR thinker + DiT): needs LanceConfig / LanceProcessor registered in the vllm package; tracked in a separate PR.

LANCE_DEFAULTS `module-attribute` ¶

LANCE_DEFAULTS = LanceDefaults()

logger `module-attribute` ¶

logger = init_logger(__name__)

LanceDefaults `dataclass` ¶

Lance constants that upstream keeps in config/config_factory.py / inference_lance.sh rather than in any shipped JSON.

Verified against bytedance-research/Lance's released Lance_3B/model.safetensors: vae2llm.weight = (2048, 48) ⇒ patch_latent_dim = latent_patch_size**2 * z_channels = 1 * 48, i.e. Lance does not unfold the Wan latent into a 2×2 patch the way BAGEL does (Wan2.2 already patchifies internally), and latent_pos_embed.pos_embed = (4096, 2048) ⇒ max_latent_size = 64.
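
The arithmetic behind that check can be replayed in a few lines. This is only a sketch: the two shapes are copied from the paragraph above rather than read from the checkpoint, and the variable names are illustrative.

```python
import math

# Tensor shapes quoted above for Lance_3B/model.safetensors.
vae2llm_weight_shape = (2048, 48)        # (out_features, in_features)
latent_pos_embed_shape = (4096, 2048)    # (num_positions, hidden_size)

z_channels = 48
latent_patch_size = 1

# The projection's input width must equal latent_patch_size**2 * z_channels.
patch_latent_dim = latent_patch_size**2 * z_channels
assert patch_latent_dim == vae2llm_weight_shape[1]

# The position table covers a square grid, so its side is sqrt(num_positions).
max_latent_size = math.isqrt(latent_pos_embed_shape[0])
assert max_latent_size**2 == latent_pos_embed_shape[0]

# For contrast: a 2x2 unfold of the same latent would need 4 * 48 input features,
# which does not match the 48-wide weight.
unfolded_dim = 2**2 * z_channels

print(patch_latent_dim, max_latent_size, unfolded_dim)
```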

cfg_text_scale `class-attribute` `instance-attribute` ¶

cfg_text_scale: float = 4.0

connector_act `class-attribute` `instance-attribute` ¶

connector_act: str = 'gelu_pytorch_tanh'

latent_patch_size_spatial `class-attribute` `instance-attribute` ¶

latent_patch_size_spatial: int = 1

latent_patch_size_temporal `class-attribute` `instance-attribute` ¶

latent_patch_size_temporal: int = 1

max_latent_size `class-attribute` `instance-attribute` ¶

max_latent_size: int = 64

max_num_video_latent_frames `class-attribute` `instance-attribute` ¶

max_num_video_latent_frames: int = 31

num_timesteps `class-attribute` `instance-attribute` ¶

num_timesteps: int = 30

timestep_shift `class-attribute` `instance-attribute` ¶

timestep_shift: float = 3.5

vae_downsample_spatial `class-attribute` `instance-attribute` ¶

vae_downsample_spatial: int = 16

vae_downsample_temporal `class-attribute` `instance-attribute` ¶

vae_downsample_temporal: int = 4

vae_z_channels `class-attribute` `instance-attribute` ¶

vae_z_channels: int = 48

vit_max_num_patch_per_side `class-attribute` `instance-attribute` ¶

vit_max_num_patch_per_side: int = 70
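
As a rough illustration of how the spatial constants relate, the sketch below computes the latent grid side for a square image. `LanceDefaultsSketch` and `latent_grid_side` are hypothetical stand-ins, not the real classes, and the formula (pixels divided by VAE downsample times patch size) is an assumption based on the BAGEL convention, not confirmed from the Lance source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanceDefaultsSketch:
    """Stand-in holding only the spatial fields documented above."""

    latent_patch_size_spatial: int = 1
    max_latent_size: int = 64
    vae_downsample_spatial: int = 16


def latent_grid_side(pixels: int, d: LanceDefaultsSketch) -> int:
    """Latent tokens per side for a square image (assumed formula)."""
    stride = d.vae_downsample_spatial * d.latent_patch_size_spatial
    if pixels % stride:
        raise ValueError(f"{pixels} px is not a multiple of the stride {stride}")
    side = pixels // stride
    if side > d.max_latent_size:
        raise ValueError(f"{side} exceeds max_latent_size={d.max_latent_size}")
    return side


defaults = LanceDefaultsSketch()
side = latent_grid_side(1024, defaults)
print(side, side * side)
```

Under this assumption a 1024x1024 image fills the position table exactly: 64 tokens per side, 4096 in total, which matches the first dimension of latent_pos_embed.pos_embed quoted above.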

LancePipeline ¶

Bases: BagelPipeline

Lance pipeline. Inherits BAGEL's forward/generation; overrides only construction (checkpoint layout, Qwen2.5-VL ViT, Wan2.2 VAE).

bagel `instance-attribute` ¶

bagel = LanceBagel(
    language_model=self.language_model,
    vit_model=self.vit_model,
    parallel_config=parallel_config,
    quant_config=quant_config,
    prefix="bagel",
    config=BagelConfig(
        llm_config=llm_config,
        vae_config=vae_cfg,
        vit_config=vit_cfg,
        vit_max_num_patch_per_side=LANCE_DEFAULTS.vit_max_num_patch_per_side,
        connector_act=LANCE_DEFAULTS.connector_act,
        interpolate_pos=False,
        latent_patch_size=LANCE_DEFAULTS.latent_patch_size_spatial,
        max_latent_size=LANCE_DEFAULTS.max_latent_size,
        timestep_shift=LANCE_DEFAULTS.timestep_shift,
        visual_gen=True,
        visual_und=und_enabled,
    ),
)

device `instance-attribute` ¶

device = get_local_device()

image_processor `instance-attribute` ¶

image_processor = self._build_image_processor()

language_model `instance-attribute` ¶

language_model = Qwen2MoTForCausalLM(
    llm_config,
    parallel_config=parallel_config,
    quant_config=quant_config,
    prefix="bagel.language_model",
)

od_config `instance-attribute` ¶

od_config = od_config

scheduler `instance-attribute` ¶

scheduler = None

scheduler_kwargs `instance-attribute` ¶

scheduler_kwargs = {}

tokenizer `instance-attribute` ¶

tokenizer = self._load_tokenizer(ckpt_path)

transformer `instance-attribute` ¶

transformer = self.language_model.model

vae `instance-attribute` ¶

vae = self._build_wan22_vae(repo_root)

video_processor `instance-attribute` ¶

video_processor = self._build_video_processor()

vit_model `instance-attribute` ¶

vit_model = self._build_qwen2_5_vl_vit(repo_root)

weights_sources `instance-attribute` ¶

weights_sources = [
    DiffusersPipelineLoader.ComponentSource(
        model_or_path=weights_model,
        subfolder=ckpt_dir
        if ckpt_path != repo_root
        else None,
        revision=od_config.revision,
        prefix="bagel.",
        fall_back_to_pt=False,
    )
]

forward ¶

forward(req)

Dispatch on prompt modality.

modalities == ["video"] (text-to-video) → :meth:_forward_t2v (3-D latents + LanceWanVAE.decode_video).
modalities == ["text"] + multi_modal_data.video (x2t_video) → :meth:_forward_x2t_video (multi-frame Qwen2.5-VL ViT prefill).
modalities == ["image"] + multi_modal_data.img2img (image_edit) → :meth:_forward_image_edit (Lance-native VAE+ViT prefill + image gen).
modalities == ["video"] + multi_modal_data.video (video_edit) → :meth:_forward_video_edit (Lance-native multi-frame VAE+ViT prefill + video gen).
Everything else falls through to :meth:BagelPipeline.forward (t2i, x2t_image).
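
The dispatch rules above can be sketched as a pure function. This is not the real `forward`: the function name `route_lance_request` and the plain list/dict arguments are hypothetical, and the sketch assumes that a `["video"]` request is treated as video_edit when a source video is attached and as text-to-video otherwise.

```python
from typing import Any, Mapping, Optional, Sequence


def route_lance_request(
    modalities: Sequence[str],
    multi_modal_data: Optional[Mapping[str, Any]] = None,
) -> str:
    """Return the name of the handler the rules above would select.

    Hypothetical helper: the real request object and its fields may differ.
    """
    mm = multi_modal_data or {}
    mods = list(modalities)
    has_video = mm.get("video") is not None
    has_img2img = mm.get("img2img") is not None

    if mods == ["video"]:
        # Assumption: an attached video means editing, none means t2v.
        return "_forward_video_edit" if has_video else "_forward_t2v"
    if mods == ["text"] and has_video:
        return "_forward_x2t_video"
    if mods == ["image"] and has_img2img:
        return "_forward_image_edit"
    # t2i and x2t_image are left to the inherited BAGEL path.
    return "BagelPipeline.forward"


print(route_lance_request(["video"]))
print(route_lance_request(["image"], {"img2img": object()}))
print(route_lance_request(["image"]))
```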

get_lance_post_process_func ¶

get_lance_post_process_func(od_config: OmniDiffusionConfig)

Lance returns PIL.Image.Image directly, same as BAGEL.

get_lance_pre_process_func ¶

get_lance_pre_process_func(od_config: OmniDiffusionConfig)
