vllm_omni.diffusion.models.hunyuan_image3.prompt_utils

Shared prompt-template construction for HunyuanImage-3.0-Instruct.

Single source of truth for the AR-prefill prompt format used by the example scripts and any downstream caller that needs to build HunyuanImage3 chat-template token sequences without invoking the full diffusion pipeline tokenizer wrapper.

The DiT pipeline (pipeline_hunyuan_image3.py) builds prompts through TokenizerWrapper.apply_chat_template, which eagerly consumes JointImageInfo objects produced by image preprocessing. The example flow uses an <img> placeholder plus multi_modal_data instead, so it needs a lighter-weight builder that only requires an HF tokenizer. This module provides that builder; the (task, bot_task) -> template mapping below is the canonical mapping for both flows.

Two orthogonal axes:

  • task selects the I/O modality combination, which only controls whether <img> placeholders are emitted between User: and the user prompt: i2t / it2i produce them, t2t / t2i do not.

  • bot_task selects the prompting mode and drives both the system prompt and the trigger tag appended after Assistant:. None (default) gives a plain Assistant turn under the unified prompt; think / recaption switch the trigger tag to <think> / <recaption>; think_recaption swaps the system prompt for the dedicated combined-mode template; vanilla drops the chat structure entirely (pretrain template, t2i only).
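The two axes can be illustrated with a minimal sketch. It is not the module's implementation: the helper names `image_placeholders` and `trigger_tag` are hypothetical, and concatenating one `<img>` per image is an assumption. Only the behaviour stated above is covered, so `think_recaption` and `vanilla` are deliberately left out.

```python
from typing import Optional

# Tasks whose input includes images, per the description above.
IMAGE_INPUT_TASKS = frozenset({"i2t", "it2i"})

# Trigger tag appended after "Assistant:" for the modes described above.
# think_recaption and vanilla are omitted: their exact output is not
# specified in this section.
TRIGGER_TAGS = {None: "", "think": "<think>", "recaption": "<recaption>"}


def image_placeholders(task: str, num_images: int = 1) -> str:
    """Return <img> placeholders for image-input tasks, else an empty string.

    Assumption: one placeholder per image, concatenated with no separator.
    """
    return "<img>" * num_images if task in IMAGE_INPUT_TASKS else ""


def trigger_tag(bot_task: Optional[str]) -> str:
    """Return the tag that follows "Assistant:" for a given bot_task."""
    if bot_task not in TRIGGER_TAGS:
        raise ValueError(f"bot_task {bot_task!r} is not covered by this sketch")
    return TRIGGER_TAGS[bot_task]
```

Because the axes are independent, any task can in principle be paired with any of these bot_task values; the sketch keeps the two lookups separate for that reason.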

HUNYUAN_IMAGE3_SPECIAL_TOKEN_IDS module-attribute

HUNYUAN_IMAGE3_SPECIAL_TOKEN_IDS: dict[str, int] = {
    "<|endoftext|>": 127957,
    "<|startoftext|>": 127958,
    "<boi>": 128000,
    "<eoi>": 128001,
    "<img>": 128006,
    "<cfg>": 128010,
    "<recaption>": 128018,
    "</recaption>": 128019,
    "<think>": 128023,
    "</think>": 128024,
    "<answer>": 128025,
    "</answer>": 128026,
    "<img_size_1024>": 128037,
    "<img_ratio_0>": 128044,
    "<img_ratio_32>": 128076,
    "<img_ratio_33>": 130103,
    "<img_ratio_36>": 130106,
}
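A table like this is handy for debugging sampled sequences. The sketch below copies a subset of the ids listed above and builds a reverse lookup; the names `SPECIAL_SUBSET`, `ID_TO_TOKEN`, and `describe` are hypothetical and not part of the module.

```python
# Subset of the ids listed above, copied here so the sketch is self-contained.
SPECIAL_SUBSET = {
    "<boi>": 128000,
    "<eoi>": 128001,
    "<img>": 128006,
    "<think>": 128023,
    "</think>": 128024,
    "<answer>": 128025,
    "</answer>": 128026,
}

# Reverse lookup: token id -> token string.
ID_TO_TOKEN = {token_id: token for token, token_id in SPECIAL_SUBSET.items()}


def describe(token_ids: list[int]) -> list[str]:
    """Replace known special-token ids with their names; keep other ids as text."""
    return [ID_TO_TOKEN.get(token_id, str(token_id)) for token_id in token_ids]


print(describe([128023, 42, 128024]))  # ['<think>', '42', '</think>']
```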

MAX_IMAGES_PER_REQUEST module-attribute

MAX_IMAGES_PER_REQUEST = 3

PromptTokensResult dataclass

system_prompt_type instance-attribute

system_prompt_type: str

token_ids instance-attribute

token_ids: list[int]

available_bot_tasks

available_bot_tasks() -> list[str | None]

Sorted list of bot_task values (with None first).

available_tasks

available_tasks() -> list[str]

Sorted list of task values accepted by the prompt builders.

build_prompt

build_prompt(
    user_prompt: str,
    task: str = "it2i",
    bot_task: str | None | _DefaultBotTask = _DEFAULT_BOT_TASK,
    sys_type: str | None = None,
    custom_system_prompt: str | None = None,
    num_images: int = 1,
) -> str

Build a HunyuanImage-3.0 prompt as a string (legacy/compat path).

build_prompt_tokens

build_prompt_tokens(
    user_prompt: str,
    tokenizer,
    task: str = "it2i",
    bot_task: str | None | _DefaultBotTask = _DEFAULT_BOT_TASK,
    sys_type: str | None = None,
    custom_system_prompt: str | None = None,
    num_images: int = 1,
) -> PromptTokensResult

Segment-by-segment tokenization that matches HF apply_chat_template.
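The general idea of segment-by-segment tokenization can be sketched as follows. This is an illustration under stated assumptions, not the function's actual code: `Segment`, `ToyTokenizer`, and `tokenize_segments` are hypothetical, and the toy tokenizer simply maps each character to its code point. The two methods it exposes, `encode(..., add_special_tokens=False)` and `convert_tokens_to_ids`, mirror calls available on HF tokenizers.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One piece of the prompt: plain text, or a single special token."""
    text: str
    special: bool = False


class ToyTokenizer:
    """Hypothetical stand-in for an HF tokenizer, for illustration only."""

    def __init__(self, special_ids: dict[str, int]) -> None:
        self.special_ids = special_ids

    def encode(self, text: str, add_special_tokens: bool = False) -> list[int]:
        # Toy behaviour: one id per character.
        return [ord(ch) for ch in text]

    def convert_tokens_to_ids(self, token: str) -> int:
        return self.special_ids[token]


def tokenize_segments(segments: list[Segment], tokenizer) -> list[int]:
    """Tokenize each segment on its own and concatenate the ids."""
    ids: list[int] = []
    for seg in segments:
        if seg.special:
            # Special tokens are looked up directly, never split as text.
            ids.append(tokenizer.convert_tokens_to_ids(seg.text))
        else:
            ids.extend(tokenizer.encode(seg.text, add_special_tokens=False))
    return ids


tok = ToyTokenizer({"<img>": 128006})
print(tokenize_segments([Segment("Hi"), Segment("<img>", True), Segment("!")], tok))
# [72, 105, 128006, 33]
```

Tokenizing per segment keeps each special token as a single id and prevents the tokenizer from merging text across a segment boundary.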

resolve_stop_token_ids

resolve_stop_token_ids(
    task: str = "it2i",
    bot_task: str | None | _DefaultBotTask = _DEFAULT_BOT_TASK,
    tokenizer: Any | None = None,
    image_size: str | None = None,
) -> list[int]

AR stop-token ids for a given (task, bot_task) generation request.

Image-output tasks (it2i / t2i) stop on any <img_ratio_*> token. Upstream modeling_hunyuan_image_3.py::generate_image (lines 3289-3303) sets final_stop_tokens to the full ratio-token range when need_ratio is true, then strips the trailing ratio token before passing the CoT to the image stage. Under _stage_transitions, the AR stage's natural trajectory is </recaption><answer><boi><img_size_base><img_ratio_X>. Stopping at the ratio token means the KV cache ends exactly at the prefix that the DiT reuses, and ar2diffusion can read the ratio off the last sampled token without the AR stage spending decode steps on <|endoftext|>.

Text-output tasks (i2t / t2t) stop on <answer>: the AR stage is the final stage, and because the comprehension response sits inside the <answer> body, the opening <answer> tag is the natural CoT/recaption terminator.
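The selection rule described above can be sketched as follows. This is a simplified illustration, not the function itself: it ignores bot_task, tokenizer, and image_size, and the names `RATIO_IDS`, `sketch_stop_token_ids`, and `split_trailing_ratio` are hypothetical. It also assumes the ratio ids are contiguous between the anchors in HUNYUAN_IMAGE3_SPECIAL_TOKEN_IDS (ratios 0 to 32 at 128044 to 128076, ratios 33 to 36 at 130103 to 130106).

```python
ANSWER_OPEN_ID = 128025  # "<answer>" in the special-token table

# Assumption: ratio ids are contiguous between the anchors in the table.
RATIO_IDS = list(range(128044, 128076 + 1)) + list(range(130103, 130106 + 1))


def sketch_stop_token_ids(task: str) -> list[int]:
    """Pick stop ids by output modality, following the description above."""
    if task in {"it2i", "t2i"}:   # image output: stop on any ratio token
        return list(RATIO_IDS)
    if task in {"i2t", "t2t"}:    # text output: stop on <answer>
        return [ANSWER_OPEN_ID]
    raise ValueError(f"unknown task {task!r}")


def split_trailing_ratio(token_ids: list[int]) -> tuple[list[int], int]:
    """Strip the trailing ratio token and return (prefix, ratio index)."""
    if not token_ids or token_ids[-1] not in RATIO_IDS:
        raise ValueError("sequence does not end with a ratio token")
    return token_ids[:-1], RATIO_IDS.index(token_ids[-1])


print(len(sketch_stop_token_ids("t2i")))      # 37
print(split_trailing_ratio([1, 2, 128076]))   # ([1, 2], 32)
```

The second helper mirrors the hand-off described above: the prefix goes to the image stage, and the ratio index is read from the last sampled token.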

resolve_sys_type

resolve_sys_type(bot_task: str | None) -> str

Default system-prompt type for a given bot_task.