vllm_omni.diffusion.models.lance.prompts

Lance chat / system prompts.

Matches the upstream Lance training distribution (see bytedance/Lance/data/system_prompt_render.py and data/common.py::generate_system_prompt):

  • each task has a specific system prompt describing what to attend to;
  • the prompt is wrapped in the Qwen chat template <|im_start|>system\n…<|im_end|>\n<|im_start|>user\n…<|im_end|>\n<|im_start|>assistant\n;
  • vision tokens are framed by <|vision_start|><|video_pad|><|vision_end|> (upstream uses <|video_pad|> even for image type by default).

These are pure string formatting helpers — the runtime pipeline still does the actual VAE / ViT prefill separately.
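The chat-template wrapping described above can be sketched as follows. This is a minimal illustration, not the module's implementation: wrap_chat is a hypothetical helper name, and it assumes the turns are separated by single newlines with no other whitespace.

```python
def wrap_chat(system_prompt: str, user_message: str) -> str:
    """Wrap a system prompt and a user message in the Qwen chat template.

    Hypothetical helper for illustration; the trailing assistant header
    leaves the prompt open for the model to generate its reply.
    """
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = wrap_chat("Describe the image:", "a red bicycle")
```

The result is one flat string; tokenization and the VAE / ViT prefill happen elsewhere in the pipeline.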

IMAGE_PAD module-attribute

IMAGE_PAD = '<|image_pad|>'

SYSTEM_PROMPTS module-attribute

SYSTEM_PROMPTS: dict[tuple[str, str], str] = {
    (
        "t2i",
        "image",
    ): "Describe the image by detailing the color, quantity, text, shape, size, texture, spatial relationships of the objects and background:",
    (
        "t2v",
        "video",
    ): "Describe the video by detailing the color, quantity, visible text, shape, size, texture, spatial relationships and motion/camera movements of the objects and background:",
    (
        "i2v",
        "video",
    ): "Describe the video by detailing the color, quantity, visible text, shape, size, texture, spatial relationships and motion/camera movements of the objects and background:",
    (
        "image_edit",
        "image",
    ): "Describe the key features of the input image (color, shape, size, texture, objects, background), then explain how the user's text instruction should alter or modify the image. Generate a new image that meets the user's requirements while maintaining consistency with the original input where appropriate.",
    (
        "video_edit",
        "video",
    ): "Describe the key features of the input video (color, shape, size, texture, objects, background), then explain how the user's text instruction should alter or modify the video. Generate a new video that meets the user's requirements while maintaining consistency with the original input where appropriate.",
    (
        "x2t_image",
        "image",
    ): "Generate a detailed and accurate description of the image, including all the key moments and visual details.",
    (
        "x2t_video",
        "video",
    ): "Generate a detailed and accurate description of the video, including all the key moments and visual details.",
}
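
Since the keys are (task, modality) tuples, a default prompt is looked up with both parts. The sketch below copies two entries from the table into a local excerpt so that it runs on its own; default_system_prompt is a hypothetical helper, and returning None for unknown pairs is an assumption about how a caller might want to handle misses.

```python
from typing import Optional

# Two entries copied verbatim from SYSTEM_PROMPTS; the full table has seven.
SYSTEM_PROMPTS_EXCERPT = {
    ("t2i", "image"): "Describe the image by detailing the color, quantity, text, shape, size, texture, spatial relationships of the objects and background:",
    ("x2t_image", "image"): "Generate a detailed and accurate description of the image, including all the key moments and visual details.",
}


def default_system_prompt(task: str, modality: str) -> Optional[str]:
    """Return the default prompt for a (task, modality) pair, or None if absent."""
    return SYSTEM_PROMPTS_EXCERPT.get((task, modality))


caption_prompt = default_system_prompt("x2t_image", "image")
missing = default_system_prompt("t2i", "video")  # no such pair in the table
```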

VIDEO_PAD module-attribute

VIDEO_PAD = '<|video_pad|>'

VISION_END module-attribute

VISION_END = '<|vision_end|>'

VISION_START module-attribute

VISION_START = '<|vision_start|>'
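
The four constants compose into the framed vision placeholder mentioned in the module description. A minimal sketch, with the constant values redefined locally so it runs without the package; vision_block is a hypothetical helper, and its default of VIDEO_PAD follows the note that upstream uses the video pad even for images.

```python
# Values copied from the module attributes documented on this page.
IMAGE_PAD = "<|image_pad|>"
VIDEO_PAD = "<|video_pad|>"
VISION_START = "<|vision_start|>"
VISION_END = "<|vision_end|>"


def vision_block(pad: str = VIDEO_PAD) -> str:
    """Frame a pad token with the vision start/end markers (hypothetical helper)."""
    return VISION_START + pad + VISION_END


video_block = vision_block()
image_block = vision_block(IMAGE_PAD)
```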

render_lance_prompt

render_lance_prompt(
    task: str,
    user_text: str,
    *,
    vision_token: str | None = None,
    system_prompt: str | None = None,
) -> str

Render a Lance-compatible user-side prompt.

  • task is one of t2i, t2v, image_edit, video_edit, x2t_image, x2t_video.
  • vision_token should be the visual placeholder to embed inside the user message (e.g. VIDEO_PAD or a VISION_START + VIDEO_PAD + VISION_END block); pass None for text-only inputs.
  • system_prompt overrides the default task system prompt. It is required for x2t QA examples whose upstream JSON carries a per-example instruction (e.g. "Look at the image carefully and answer the question."); without it x2t falls back to the caption-style default and the model describes instead of answering.

Returns a single string ready to be tokenized; no further wrapping is needed by the caller.
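
The effect of the system_prompt override on an x2t QA example can be illustrated with a stand-in. The sketch below is not the real render_lance_prompt: sketch_render is a hypothetical function that mirrors the documented signature, and two details are assumptions, namely that the default prompt is found by matching the task half of the key and that the vision placeholder is placed directly before the user text.

```python
from typing import Optional

# One default copied from SYSTEM_PROMPTS so the sketch is self-contained.
_DEFAULTS = {
    ("x2t_image", "image"): "Generate a detailed and accurate description of the image, including all the key moments and visual details.",
}


def sketch_render(
    task: str,
    user_text: str,
    *,
    vision_token: Optional[str] = None,
    system_prompt: Optional[str] = None,
) -> str:
    """Stand-in with the documented signature (illustrative only)."""
    if system_prompt is None:
        # Assumption: defaults are resolved by the task part of the key.
        matches = [p for (t, _), p in _DEFAULTS.items() if t == task]
        if not matches:
            raise ValueError(f"no default system prompt for task {task!r}")
        system_prompt = matches[0]
    # Assumption: the vision placeholder precedes the user text.
    user_message = (vision_token or "") + user_text
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


block = "<|vision_start|><|video_pad|><|vision_end|>"
# QA example: the per-example instruction replaces the caption-style default.
qa = sketch_render(
    "x2t_image",
    "What color is the car?",
    vision_token=block,
    system_prompt="Look at the image carefully and answer the question.",
)
# Without the override, the caption-style default is used.
caption = sketch_render("x2t_image", "What color is the car?", vision_token=block)
```

Comparing qa and caption shows why the override matters: the user turn is identical, and only the system turn decides whether the model is asked to answer or to describe.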