Lance: Offline inference

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/lance.

Lance is a 3B unified autoregressive + diffusion multimodal model built on a Qwen2.5-VL backbone. It is BAGEL-lineage (ByteDance Mixture-of-Transformers): the released Lance_3B checkpoint uses the same *_moe_gen MoT weight layout as BAGEL. vLLM-Omni therefore reuses the BAGEL transformer core and specializes only the ViT (Qwen2.5-VL vision), the VAE (Wan2.2), and the checkpoint layout.

This example covers all six Lance modalities listed on the upstream HF model card: t2i (text-to-image), t2v (text-to-video), image_edit, video_edit, x2t_image (image understanding), and x2t_video (video understanding).

Hardware

A single NVIDIA GPU with at least 16 GB of VRAM, running in BF16 (tested on B300 and A100). CUDA 12.4 or later is required.
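
The requirements above can be checked before launching a run. The following is a minimal sketch, not part of the example scripts: the helper names are hypothetical, the 16 GB threshold is read as decimal gigabytes (an assumption), and `torch.version.cuda` reports the CUDA version PyTorch was built against rather than the installed driver.

```python
"""Pre-flight check for the stated hardware requirements (illustrative sketch)."""

# Assumption: "16 GB" is read as decimal gigabytes. Cards marketed as 16 GB
# can report slightly less usable memory, so adjust the threshold if needed.
MIN_VRAM_BYTES = 16 * 10**9
MIN_CUDA = (12, 4)  # CUDA >= 12.4


def parse_cuda_version(version: str) -> tuple[int, int]:
    """Turn a string such as '12.4' or '12.8.1' into (major, minor)."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def meets_requirements(total_vram_bytes: int, cuda_version: str) -> bool:
    """Return True when both the VRAM and the CUDA minimums are satisfied."""
    return (
        total_vram_bytes >= MIN_VRAM_BYTES
        and parse_cuda_version(cuda_version) >= MIN_CUDA
    )


def check_local_gpu() -> bool:
    """Query device 0 through PyTorch; False if no CUDA GPU is visible."""
    import torch  # imported lazily so the pure helpers work without PyTorch

    if not torch.cuda.is_available() or torch.version.cuda is None:
        return False
    props = torch.cuda.get_device_properties(0)
    # torch.version.cuda is the build-time CUDA version, not the driver's.
    return meets_requirements(props.total_memory, torch.version.cuda)
```

Calling `check_local_gpu()` on the target machine gives a quick yes/no answer; the pure `meets_requirements` helper can also be fed numbers read from `nvidia-smi`.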

Run

# Text-to-image
python examples/offline_inference/lance/end2end.py \
    --model bytedance-research/Lance \
    --prompts "a corgi astronaut on the moon, cinematic" \
    --steps 30 --cfg-text-scale 4.0 --timestep-shift 3.5 \
    --height 1024 --width 1024 \
    --output ./out

# Text-to-video (uses the Lance_3B_Video subfolder; see ``--modality``
# choices for all six task variants)
python examples/offline_inference/lance/end2end.py \
    --model bytedance-research/Lance/Lance_3B_Video --modality text2video \
    --num-frames 25 --video-height 480 --video-width 768 \
    --prompts "a cat playing piano, cinematic" \
    --steps 30 --fps 8 --output ./out

video_edit requires --model bytedance-research/Lance/Lance_3B_Video so the 3-D latent_pos_embed table is loaded; the other paths can point at the top-level bytedance-research/Lance repo and resolve the right sub-checkpoint automatically.

The HF repo bundles everything (Lance_3B/, Lance_3B_Video/, Qwen2.5-VL-ViT/, Wan2.2_VAE.pth); no separate downloads are required.
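
To confirm that a local copy of the repo contains the four bundled entries named above, a check along these lines could be used. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the entries sit directly at the snapshot root. The download step uses `huggingface_hub.snapshot_download`, which returns the local snapshot path.

```python
from pathlib import Path

# Entries the text above says the HF repo bundles. Placing them directly at
# the snapshot root is an assumption of this sketch.
EXPECTED_ENTRIES = (
    "Lance_3B",
    "Lance_3B_Video",
    "Qwen2.5-VL-ViT",
    "Wan2.2_VAE.pth",
)


def missing_entries(snapshot_root: str) -> list[str]:
    """Return the expected entries that are absent under snapshot_root."""
    root = Path(snapshot_root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


def download_and_check(repo_id: str = "bytedance-research/Lance") -> list[str]:
    """Download (or reuse the cached) snapshot and list any missing entries."""
    from huggingface_hub import snapshot_download  # lazy: needs network access

    local_path = snapshot_download(repo_id=repo_id)
    return missing_entries(local_path)
```

An empty list from `download_and_check()` would indicate that all four entries are present in the local snapshot.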

Defaults

The defaults match upstream inference_lance.sh: 30 denoising steps, timestep-shift 3.5, text CFG 4.0, seed 42, and 1024×1024 output (override with --height / --width). For the understanding paths (img2text / video2text), sampling is enabled by default at --text-temperature 0.8, because Lance's greedy decoder emits an immediate EOS for many prompts.
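
For scripted runs, it can help to spell the text-to-image defaults out explicitly instead of relying on the script's built-in values. The sketch below builds the command line from the flags shown in the Run section; the dataclass and function names are hypothetical, and the seed is omitted because no seed flag appears in this page.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class LanceT2IDefaults:
    """Text-to-image defaults as listed above (hypothetical container)."""
    steps: int = 30
    cfg_text_scale: float = 4.0
    timestep_shift: float = 3.5
    height: int = 1024
    width: int = 1024
    # Seed 42 is also a default, but no seed flag is documented on this page,
    # so this sketch leaves it to the script.


def build_t2i_argv(prompt, model="bytedance-research/Lance", output="./out",
                   defaults=LanceT2IDefaults()):
    """Assemble an argv list using only flags shown in the Run section."""
    return [
        "python", "examples/offline_inference/lance/end2end.py",
        "--model", model,
        "--prompts", prompt,
        "--steps", str(defaults.steps),
        "--cfg-text-scale", str(defaults.cfg_text_scale),
        "--timestep-shift", str(defaults.timestep_shift),
        "--height", str(defaults.height),
        "--width", str(defaults.width),
        "--output", output,
    ]


def as_shell_command(argv):
    """Quote the argv list so it can be pasted into a shell."""
    return shlex.join(argv)
```

The resulting list can be passed to `subprocess.run(argv, check=True)` from the repository root, or printed with `as_shell_command` for inspection. Overriding a value is a matter of passing, for example, `LanceT2IDefaults(height=768, width=768)`.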

Example materials

end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/lance/end2end.py.

gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/lance/gradio_demo.py.