vllm_omni.diffusion.models.sensenova_u1.pipeline_sensenova_u1 ¶
SenseNova-U1 Pipeline for vLLM-Omni.
SenseNova-U1 is a unified Qwen3-based model that uses Mixture-of-Tokenizers (MoT) attention for text-to-image generation via flow matching in patch space. It has no separate VAE or text encoder — the Qwen3 LLM itself serves as both the text encoder (via KV cache) and the denoising backbone (via MoT branches).
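Flow-matching generation can be pictured as integrating a learned velocity field from noise to data. The following is a minimal sketch of that idea using plain Euler steps. It assumes a time axis running from 0 (noise) to 1 (data); the pipeline's actual time convention, step schedule, and patch layout are not described here, and `toy_velocity` is a hypothetical stand-in for the denoising backbone.

```python
def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for the model: on a straight path from the current
# point to a fixed target, the ideal velocity is (target - x) / (1 - t).
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


noise = [1.0, 1.0, 1.0]
sample = euler_sample(toy_velocity, noise, num_steps=8)
```

With an exact straight-line velocity the Euler loop lands on the target after the final step; a real model only approximates this field, which is why the number of steps affects quality.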
Key integration points:
- Transformer layers ported with TP support (QKVParallelLinear, MergedColumnParallelLinear, RowParallelLinear) in sensenova_u1_transformer.py.
- Vision model (NEOVisionModel) and FM modules kept as standard nn.Module since they are lightweight (no transformer blocks).
- Weight loading uses stacked_params_mapping for fused QKV and gate_up.
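The fused-weight loading mentioned above can be sketched as a name-remapping step. In vLLM, `stacked_params_mapping` is conventionally a list of `(fused_param_name, checkpoint_shard_name, shard_id)` tuples. The entries below follow that convention, but the exact entries used for SenseNova-U1 are an assumption, and `resolve_stacked_name` is a hypothetical helper written for illustration.

```python
# vLLM-style mapping: (fused_param_name, checkpoint_shard_name, shard_id).
# ASSUMPTION: the real entries in sensenova_u1_transformer.py may differ.
STACKED_PARAMS_MAPPING = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def resolve_stacked_name(checkpoint_name):
    """Return (fused_name, shard_id), or (checkpoint_name, None) if not fused."""
    for fused, shard, shard_id in STACKED_PARAMS_MAPPING:
        if shard in checkpoint_name:
            return checkpoint_name.replace(shard, fused), shard_id
    return checkpoint_name, None


q_name = resolve_stacked_name("layers.0.self_attn.q_proj.weight")
up_name = resolve_stacked_name("layers.0.mlp.up_proj.weight")
down_name = resolve_stacked_name("layers.0.mlp.down_proj.weight")
```

The shard id tells the fused parameter's weight loader which slice of the fused tensor a checkpoint tensor belongs to; parameters that are not fused (such as `down_proj`) pass through unchanged.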
SYSTEM_MESSAGE_FOR_GEN module-attribute ¶
SYSTEM_MESSAGE_FOR_GEN = "You are an image generation and editing assistant that accurately understands and executes user intent.\n\nYou support two modes:\n\n1. Think Mode:\nIf the task requires reasoning, you MUST start with a <think></think> block. Put all reasoning inside the block using plain text. DO NOT include any image tags. Keep it reasonable and directly useful for producing the final image.\n\n2. Non-Think Mode:\nIf no reasoning is needed, directly produce the final image.\n\nTask Types:\n\nA. Text-to-Image Generation:\n- Generate a high-quality image based on the user's description.\n- Ensure visual clarity, semantic consistency, and completeness.\n- DO NOT introduce elements that contradict or override the user's intent.\n\nB. Image Editing:\n- Use the provided image(s) as input or reference for modification or transformation.\n- The result can be an edited image or a new image based on the reference(s).\n- Preserve all unspecified attributes unless explicitly changed.\n\nGeneral Rules:\n- For any visible text in the image, follow the language specified for the rendered text in the user's description, not the language of the prompt. If no language is specified, use the user's input language."
ConvDecoder ¶
Bases: Module
NEOVisionEmbeddings ¶
NEOVisionModel ¶
Bases: Module
SenseNovaU1Pipeline ¶
Bases: Module, SupportsComponentDiscovery, DiffusionPipelineProfilerMixin
SenseNova-U1 text-to-image and image-to-image pipeline for vllm-omni.
Builds the full model graph internally:
- language_model: SenseNovaU1ForCausalLM (TP-aware)
- vision_model: NEOVisionModel (understanding branch)
- fm_modules: ModuleDict with vision_model_mot_gen, timestep_embedder, fm_head, etc.
img2img (image editing) is triggered when multi_modal_data["image"] is present in the prompt dict. The pipeline then uses triple KV caches (condition / img_condition / uncondition) with dual CFG (cfg_scale + img_cfg_scale).
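One common way to combine three conditional predictions with two guidance scales is the two-scale formulation popularized by InstructPix2Pix. The sketch below assumes that formulation and assumes each KV cache (condition, img_condition, uncondition) yields one velocity prediction; the pipeline's actual combination rule, and any interaction with `cfg_norm`, may differ. `dual_cfg` is a hypothetical helper.

```python
def dual_cfg(v_cond, v_img, v_uncond, cfg_scale, img_cfg_scale):
    """Combine three predictions elementwise with two guidance scales.

    ASSUMPTION: InstructPix2Pix-style guidance, where
      v_cond   uses text + image conditioning,
      v_img    uses image conditioning only,
      v_uncond uses no conditioning.
    """
    return [
        u + img_cfg_scale * (i - u) + cfg_scale * (c - i)
        for c, i, u in zip(v_cond, v_img, v_uncond)
    ]


# Illustrative values only.
guided = dual_cfg([2.0], [1.0], [0.0], cfg_scale=3.0, img_cfg_scale=1.5)
neutral = dual_cfg([2.0], [1.0], [0.0], cfg_scale=1.0, img_cfg_scale=1.0)
```

Under this formulation, setting both scales to 1.0 reduces the result to the fully conditioned prediction, so guidance is effectively disabled.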
EXTRA_BODY_PARAMS class-attribute ¶
EXTRA_BODY_PARAMS: frozenset[str] = frozenset(
    {
        "think",
        "cfg_scale",
        "cfg_norm",
        "timestep_shift",
        "t_eps",
        "img_cfg_scale",
        "max_tokens",
    }
)
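A set like this lends itself to allow-list filtering of request parameters. The sketch below shows that pattern; how vLLM-Omni actually consumes `EXTRA_BODY_PARAMS` is not shown in this reference, so `split_extra_body` is a hypothetical helper and the request values are purely illustrative.

```python
# Copied from the class attribute above.
EXTRA_BODY_PARAMS = frozenset(
    {
        "think",
        "cfg_scale",
        "cfg_norm",
        "timestep_shift",
        "t_eps",
        "img_cfg_scale",
        "max_tokens",
    }
)


def split_extra_body(extra_body):
    """Separate recognized pipeline parameters from unrecognized keys."""
    accepted = {k: v for k, v in extra_body.items() if k in EXTRA_BODY_PARAMS}
    ignored = sorted(k for k in extra_body if k not in EXTRA_BODY_PARAMS)
    return accepted, ignored


# Illustrative request body; "sampler" is deliberately not in the allow-list.
accepted, ignored = split_extra_body(
    {"cfg_scale": 4.0, "think": False, "sampler": "euler"}
)
```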
EXTRA_OUTPUT_PARAMS class-attribute ¶
fm_modules instance-attribute ¶
fm_modules = ModuleDict(
    {
        "vision_model_mot_gen": vision_model_mot_gen,
        "timestep_embedder": timestep_embedder,
        "fm_head": fm_head,
    }
)
img_context_token_id instance-attribute ¶
img_context_token_id = convert_tokens_to_ids(
    IMG_CONTEXT_TOKEN
)
language_model instance-attribute ¶
language_model = SenseNovaU1ForCausalLM(
    llm_cfg, prefix="language_model"
)
weights_sources instance-attribute ¶
weights_sources = [
    ComponentSource(
        model_or_path=local_model_path,
        subfolder=None,
        revision=revision,
        prefix="",
        fall_back_to_pt=False,
    )
]
TimestepEmbedder ¶
Bases: Module
mlp instance-attribute ¶
mlp = Sequential(
    Linear(
        frequency_embedding_size, hidden_size, bias=True
    ),
    SiLU(),
    Linear(hidden_size, hidden_size, bias=True),
)
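The first `Linear` layer takes an input of width `frequency_embedding_size`, which suggests the scalar timestep is first expanded into a frequency embedding. The sketch below assumes the DiT-style sinusoidal embedding (cosine half followed by sine half, with an even embedding size); the embedding actually used by `TimestepEmbedder` is not shown in this reference, and `max_period` is an assumed parameter.

```python
import math


def sinusoidal_timestep_embedding(t, frequency_embedding_size, max_period=10000.0):
    """Map a scalar timestep to a vector of cosines and sines.

    ASSUMPTION: DiT-style layout, geometrically spaced frequencies from 1 down
    toward 1/max_period, and an even frequency_embedding_size.
    """
    half = frequency_embedding_size // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    return [math.cos(a) for a in args] + [math.sin(a) for a in args]


emb_zero = sinusoidal_timestep_embedding(0.0, frequency_embedding_size=8)
emb_half = sinusoidal_timestep_embedding(0.5, frequency_embedding_size=8)
```

The resulting vector would then pass through the `mlp` above to produce a conditioning vector of width `hidden_size`.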
get_sensenova_u1_post_process_func ¶
get_sensenova_u1_post_process_func(
    od_config: OmniDiffusionConfig,
)