vllm_omni.diffusion.models.lance.prompts
Lance chat / system prompts.
Matches the upstream Lance training distribution (see bytedance/Lance/data/system_prompt_render.py and data/common.py::generate_system_prompt):
- each task has a specific system prompt describing what to attend to;
- the prompt is wrapped in the Qwen chat template
  <|im_start|>system\n…<|im_end|>\n<|im_start|>user\n…<|im_end|>\n<|im_start|>assistant\n;
- vision tokens are framed by
  <|vision_start|><|video_pad|><|vision_end|> (upstream uses <|video_pad|> even for the image type by default).
These are pure string-formatting helpers; the runtime pipeline still performs the actual VAE / ViT prefill separately.
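The layout described above can be sketched in a few lines. This is a minimal illustration, not the module's code: `wrap_chat` and the constant names are hypothetical, and only the special-token strings are taken from the description. Placing the vision block before the user text is also an assumption.

```python
# Hypothetical names; only the special-token strings come from the module description.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"
VISION_BLOCK = "<|vision_start|><|video_pad|><|vision_end|>"


def wrap_chat(system_text: str, user_text: str) -> str:
    """Wrap system and user text in the Qwen chat layout.

    The result ends with an open assistant turn so generation can continue from it.
    """
    return (
        f"{IM_START}system\n{system_text}{IM_END}\n"
        f"{IM_START}user\n{user_text}{IM_END}\n"
        f"{IM_START}assistant\n"
    )


# Text-only input.
text_only = wrap_chat("Describe the image ...", "a red bicycle")

# Input with a visual placeholder; putting it before the text is an assumption.
with_vision = wrap_chat("Describe the video ...", VISION_BLOCK + "What is happening?")
```

Because the output is a plain string, it can be inspected or diffed against upstream-rendered prompts before any tokenization happens.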
SYSTEM_PROMPTS module-attribute
SYSTEM_PROMPTS: dict[tuple[str, str], str] = {
    ("t2i", "image"): "Describe the image by detailing the color, quantity, text, shape, size, texture, spatial relationships of the objects and background:",
    ("t2v", "video"): "Describe the video by detailing the color, quantity, visible text, shape, size, texture, spatial relationships and motion/camera movements of the objects and background:",
    ("i2v", "video"): "Describe the video by detailing the color, quantity, visible text, shape, size, texture, spatial relationships and motion/camera movements of the objects and background:",
    ("image_edit", "image"): "Describe the key features of the input image (color, shape, size, texture, objects, background), then explain how the user's text instruction should alter or modify the image. Generate a new image that meets the user's requirements while maintaining consistency with the original input where appropriate.",
    ("video_edit", "video"): "Describe the key features of the input video (color, shape, size, texture, objects, background), then explain how the user's text instruction should alter or modify the video. Generate a new video that meets the user's requirements while maintaining consistency with the original input where appropriate.",
    ("x2t_image", "image"): "Generate a detailed and accurate description of the image, including all the key moments and visual details.",
    ("x2t_video", "video"): "Generate a detailed and accurate description of the video, including all the key moments and visual details.",
}
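Since the table is keyed by a (task, modality) tuple, a caller that only knows the task name has to match on the first element of the key. The sketch below shows one way to do that; `default_system_prompt` is a hypothetical helper, not part of this module, and the table is a two-entry copy so that the example runs on its own.

```python
# Two entries copied from SYSTEM_PROMPTS so the example is self-contained.
SYSTEM_PROMPTS_SUBSET: dict[tuple[str, str], str] = {
    ("x2t_image", "image"): "Generate a detailed and accurate description of the image, including all the key moments and visual details.",
    ("x2t_video", "video"): "Generate a detailed and accurate description of the video, including all the key moments and visual details.",
}


def default_system_prompt(task: str, table: dict[tuple[str, str], str]) -> str:
    """Return the prompt whose key starts with `task` (hypothetical helper)."""
    for (key_task, _modality), prompt in table.items():
        if key_task == task:
            return prompt
    raise KeyError(f"no system prompt registered for task {task!r}")


video_caption_prompt = default_system_prompt("x2t_video", SYSTEM_PROMPTS_SUBSET)
```

When the modality is already known, indexing directly with the tuple, for example `SYSTEM_PROMPTS_SUBSET[("x2t_image", "image")]`, is simpler and avoids the scan.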
render_lance_prompt
render_lance_prompt(
    task: str,
    user_text: str,
    *,
    vision_token: str | None = None,
    system_prompt: str | None = None,
) -> str
Render a Lance-compatible user-side prompt.
task is one of t2i, t2v, image_edit, video_edit, x2t_image, x2t_video.
vision_token should be the visual placeholder to embed inside the user message (e.g. VIDEO_PAD or a VISION_START + VIDEO_PAD + VISION_END block); pass None for text-only inputs.
system_prompt overrides the default task system prompt. It is required for x2t QA examples whose upstream JSON carries a per-example instruction (e.g. "Look at the image carefully and answer the question."); without it, x2t falls back to the caption-style default and the model describes instead of answering.
Returns a single string ready to be tokenized; no further wrapping is needed by the caller.