vllm_omni.diffusion.models.sensenova_u1.pipeline_sensenova_u1

SenseNova-U1 Pipeline for vLLM-Omni.

SenseNova-U1 is a unified Qwen3-based model that uses Mixture-of-Tokenizers (MoT) attention for text-to-image generation via flow matching in patch space. It has no separate VAE or text encoder: the Qwen3 LLM itself serves as both the text encoder (via KV cache) and the denoising backbone (via MoT branches).

Key integration points:

- Transformer layers are ported with TP support (QKVParallelLinear, MergedColumnParallelLinear, RowParallelLinear) in sensenova_u1_transformer.py.
- The vision model (NEOVisionModel) and FM modules are kept as standard nn.Module since they are lightweight (no transformer blocks).
- Weight loading uses stacked_params_mapping for fused QKV and gate_up.

IMAGENET_MEAN module-attribute

IMAGENET_MEAN = (0.485, 0.456, 0.406)

IMAGENET_STD module-attribute

IMAGENET_STD = (0.229, 0.224, 0.225)

IMG_CONTEXT_TOKEN module-attribute

IMG_CONTEXT_TOKEN = '<IMG_CONTEXT>'

IMG_END_TOKEN module-attribute

IMG_END_TOKEN = '</img>'

IMG_START_TOKEN module-attribute

IMG_START_TOKEN = '<img>'
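The three image tokens above are typically combined into a placeholder span inside the prompt. The sketch below shows one plausible layout (start tag, a run of context tokens, end tag). The helper name `build_image_placeholder` and the layout itself are assumptions for illustration; the pipeline's actual prompt construction is not shown on this page.

```python
IMG_START_TOKEN = "<img>"
IMG_END_TOKEN = "</img>"
IMG_CONTEXT_TOKEN = "<IMG_CONTEXT>"


def build_image_placeholder(num_tokens: int) -> str:
    """Hypothetical helper: start tag + num_tokens context tokens + end tag."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return IMG_START_TOKEN + IMG_CONTEXT_TOKEN * num_tokens + IMG_END_TOKEN


# Example: reserve four image positions in a prompt string.
placeholder = build_image_placeholder(4)
print(placeholder)
```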

NORM_MEAN module-attribute

NORM_MEAN = (0.5, 0.5, 0.5)

NORM_STD module-attribute

NORM_STD = (0.5, 0.5, 0.5)
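As a rough illustration of what the two constant pairs do, the sketch below applies per-channel `(x - mean) / std` normalization to a single 8-bit RGB pixel and inverts it. With `NORM_MEAN` and `NORM_STD` the result lands in [-1, 1]. Which pair the pipeline applies to which branch is not stated here, and the helper names are hypothetical.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
NORM_MEAN = (0.5, 0.5, 0.5)
NORM_STD = (0.5, 0.5, 0.5)


def normalize_pixel(rgb, mean, std):
    """Scale 8-bit RGB to [0, 1], then apply per-channel (x - mean) / std."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


def denormalize_pixel(values, mean, std):
    """Invert normalize_pixel and clamp the result to the 8-bit range."""
    out = []
    for v, m, s in zip(values, mean, std):
        c = round((v * s + m) * 255.0)
        out.append(max(0, min(255, c)))
    return tuple(out)


# White and black map to the ends of [-1, 1] under NORM_MEAN / NORM_STD.
print(normalize_pixel((255, 255, 255), NORM_MEAN, NORM_STD))
print(normalize_pixel((0, 0, 0), NORM_MEAN, NORM_STD))
```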

SYSTEM_MESSAGE_FOR_GEN module-attribute

SYSTEM_MESSAGE_FOR_GEN = "You are an image generation and editing assistant that accurately understands and executes user intent.\n\nYou support two modes:\n\n1. Think Mode:\nIf the task requires reasoning, you MUST start with a <think></think> block. Put all reasoning inside the block using plain text. DO NOT include any image tags. Keep it reasonable and directly useful for producing the final image.\n\n2. Non-Think Mode:\nIf no reasoning is needed, directly produce the final image.\n\nTask Types:\n\nA. Text-to-Image Generation:\n- Generate a high-quality image based on the user's description.\n- Ensure visual clarity, semantic consistency, and completeness.\n- DO NOT introduce elements that contradict or override the user's intent.\n\nB. Image Editing:\n- Use the provided image(s) as input or reference for modification or transformation.\n- The result can be an edited image or a new image based on the reference(s).\n- Preserve all unspecified attributes unless explicitly changed.\n\nGeneral Rules:\n- For any visible text in the image, follow the language specified for the rendered text in the user's description, not the language of the prompt. If no language is specified, use the user's input language."

logger module-attribute

logger = init_logger(__name__)

ConvDecoder

Bases: Module

act1 instance-attribute

act1 = GELU()

conv1 instance-attribute

conv1 = Conv2d(
    input_dim // 4, hidden_dim, kernel_size=3, padding=1
)

conv2 instance-attribute

conv2 = Conv2d(
    hidden_dim // 4, 192, kernel_size=3, padding=1
)

ps1 instance-attribute

ps1 = PixelShuffle(2)

ps2 instance-attribute

ps2 = PixelShuffle(2)

ps3 instance-attribute

ps3 = PixelShuffle(8)

forward

forward(x)
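The channel counts in the attributes above suggest how shapes flow through ConvDecoder. The sketch below traces them under an assumed layer order (ps1, conv1, act1, ps2, conv2, ps3), inferred only from the fact that conv1 expects `input_dim // 4` channels and conv2 expects `hidden_dim // 4`. The body of `forward` is not shown on this page, and the example dimensions are hypothetical.

```python
def pixel_shuffle_shape(c, h, w, r):
    """Shape rule of PixelShuffle(r): (C*r*r, H, W) -> (C, H*r, W*r)."""
    if c % (r * r) != 0:
        raise ValueError(f"channels {c} not divisible by {r * r}")
    return c // (r * r), h * r, w * r


def conv_decoder_output_shape(input_dim, hidden_dim, h, w):
    """Trace (C, H, W) through the assumed ConvDecoder layer order."""
    c, h, w = pixel_shuffle_shape(input_dim, h, w, 2)   # ps1: C -> input_dim // 4
    c = hidden_dim                                      # conv1: 3x3, padding=1 keeps H, W
    c, h, w = pixel_shuffle_shape(c, h, w, 2)           # ps2: C -> hidden_dim // 4
    c = 192                                             # conv2: 3x3, padding=1 keeps H, W
    c, h, w = pixel_shuffle_shape(c, h, w, 8)           # ps3: 192 / 64 = 3 channels
    return c, h, w


# Hypothetical sizes: a 16x16 feature grid with 1024 channels.
print(conv_decoder_output_shape(1024, 1024, 16, 16))  # (3, 512, 512)
```

Under this reading the decoder upsamples by 2 x 2 x 8 = 32 per side and ends with three channels, which is consistent with producing an RGB image.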

NEOVisionEmbeddings

Bases: Module

config instance-attribute

config = config

dense_embedding instance-attribute

dense_embedding = Conv2d(
    embed_dim,
    llm_embed_dim,
    kernel_size=downsample_factor,
    stride=downsample_factor,
)

downsample_factor instance-attribute

downsample_factor = int(1 / ds_ratio)

embed_dim instance-attribute

embed_dim = hidden_size

gelu instance-attribute

gelu = GELU()

llm_embed_dim instance-attribute

llm_embed_dim = llm_hidden

patch_embedding instance-attribute

patch_embedding = Conv2d(
    num_channels,
    embed_dim,
    kernel_size=patch_size,
    stride=patch_size,
)

patch_size instance-attribute

patch_size = patch_size

forward

forward(pixel_values, grid_hw=None)
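To make the two strided convolutions concrete, the sketch below computes the token grid they would produce for a dense image: `patch_embedding` reduces each side by `patch_size`, then `dense_embedding` reduces it again by `downsample_factor = int(1 / ds_ratio)`. It assumes both convolutions use no padding and ignores `grid_hw`. The function name and the example values (`patch_size=16`, `ds_ratio=0.5`) are hypothetical.

```python
def vision_token_grid(height, width, patch_size, ds_ratio):
    """Return the (rows, cols) token grid after both strided convolutions."""
    downsample_factor = int(1 / ds_ratio)  # same expression as in the listing
    # A Conv2d with kernel_size == stride and no padding yields floor(size / k).
    rows, cols = height // patch_size, width // patch_size
    rows, cols = rows // downsample_factor, cols // downsample_factor
    return rows, cols


rows, cols = vision_token_grid(512, 512, patch_size=16, ds_ratio=0.5)
print(rows, cols, rows * cols)  # 16 16 256
```

Note that `int(1 / ds_ratio)` truncates, so a ratio that is not the reciprocal of an integer silently rounds the factor down.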

NEOVisionModel

Bases: Module

embeddings instance-attribute

embeddings = NEOVisionEmbeddings(config)

forward

forward(pixel_values=None, grid_hw=None, **_kwargs)

SenseNovaU1Pipeline

Bases: Module, SupportsComponentDiscovery, DiffusionPipelineProfilerMixin

SenseNova-U1 text-to-image and image-to-image pipeline for vllm-omni.

Builds the full model graph internally:

- language_model: SenseNovaU1ForCausalLM (TP-aware)
- vision_model: NEOVisionModel (understanding branch)
- fm_modules: ModuleDict with vision_model_mot_gen, timestep_embedder, fm_head, etc.

img2img (image editing) is triggered when multi_modal_data["image"] is present in the prompt dict. The pipeline then uses triple KV caches (condition / img_condition / uncondition) with dual CFG (cfg_scale + img_cfg_scale).
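The page does not state how the three predictions are combined. One widely used formulation for dual classifier-free guidance with a text condition and an image condition is sketched below; treat it as an assumption about the general technique rather than this pipeline's exact formula. The function name and the plain-list representation are illustrative only.

```python
def dual_cfg(cond, img_cond, uncond, cfg_scale, img_cfg_scale):
    """Combine three predictions elementwise (assumed formulation).

    cond:     prediction with text + image conditioning
    img_cond: prediction with image conditioning only
    uncond:   prediction with no conditioning
    """
    return [
        u + img_cfg_scale * (i - u) + cfg_scale * (c - i)
        for c, i, u in zip(cond, img_cond, uncond)
    ]


# With both scales at 1.0 the result collapses to the fully conditioned output.
print(dual_cfg([1.0, 2.0], [0.5, 1.0], [0.0, 0.0], cfg_scale=1.0, img_cfg_scale=1.0))
```

Raising `cfg_scale` pushes the result further along the text direction, while `img_cfg_scale` controls how strongly the reference image is followed.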

EXTRA_BODY_PARAMS class-attribute

EXTRA_BODY_PARAMS: frozenset[str] = frozenset(
    {
        "think",
        "cfg_scale",
        "cfg_norm",
        "timestep_shift",
        "t_eps",
        "img_cfg_scale",
        "max_tokens",
    }
)
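A set like `EXTRA_BODY_PARAMS` is usually used as an allow-list for request fields. The sketch below shows that idea with a hypothetical helper, `split_extra_body`; how vLLM-Omni actually consumes the set is not shown on this page, and the example values are made up.

```python
EXTRA_BODY_PARAMS = frozenset(
    {
        "think",
        "cfg_scale",
        "cfg_norm",
        "timestep_shift",
        "t_eps",
        "img_cfg_scale",
        "max_tokens",
    }
)


def split_extra_body(extra_body):
    """Hypothetical helper: separate recognised keys from unrecognised ones."""
    accepted = {k: v for k, v in extra_body.items() if k in EXTRA_BODY_PARAMS}
    ignored = sorted(k for k in extra_body if k not in EXTRA_BODY_PARAMS)
    return accepted, ignored


accepted, ignored = split_extra_body(
    {"cfg_scale": 4.0, "think": True, "unknown_flag": 1}
)
print(accepted)  # {'cfg_scale': 4.0, 'think': True}
print(ignored)   # ['unknown_flag']
```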

EXTRA_OUTPUT_PARAMS class-attribute

EXTRA_OUTPUT_PARAMS: frozenset[str] = frozenset(
    {"think_text"}
)

device instance-attribute

device = get_local_device()

downsample_ratio instance-attribute

downsample_ratio = downsample_ratio

fm_modules instance-attribute

fm_modules = ModuleDict(
    {
        "vision_model_mot_gen": vision_model_mot_gen,
        "timestep_embedder": timestep_embedder,
        "fm_head": fm_head,
    }
)

img_context_token_id instance-attribute

img_context_token_id = convert_tokens_to_ids(
    IMG_CONTEXT_TOKEN
)

img_start_token_id instance-attribute

img_start_token_id = convert_tokens_to_ids(IMG_START_TOKEN)

language_model instance-attribute

language_model = SenseNovaU1ForCausalLM(
    llm_cfg, prefix="language_model"
)

local_model_path instance-attribute

local_model_path = _resolve_model_path(model_path)

merge_size instance-attribute

merge_size = merge_size

od_config instance-attribute

od_config = od_config

patch_size instance-attribute

patch_size = patch_size

support_image_input class-attribute instance-attribute

support_image_input = True

tokenizer instance-attribute

tokenizer = from_pretrained(local_model_path)

vision_model instance-attribute

vision_model = NEOVisionModel(vis_cfg)

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=local_model_path,
        subfolder=None,
        revision=revision,
        prefix="",
        fall_back_to_pt=False,
    )
]

forward

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]
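The module overview mentions that weight loading uses stacked_params_mapping for fused QKV and gate_up. The sketch below illustrates the general idea: checkpoint names for separate projections are rewritten to the fused parameter name, together with a shard id that says which slice to fill. The tuple layout follows the convention commonly seen in vLLM model code, but the exact entries, the helper `remap_weight_name`, and the example parameter names are assumptions.

```python
# (fused_param_name, checkpoint_param_name, shard_id)
STACKED_PARAMS_MAPPING = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def remap_weight_name(name):
    """Map a checkpoint weight name to (model_param_name, shard_id).

    shard_id is None when the weight is not part of a fused parameter.
    """
    for fused, original, shard_id in STACKED_PARAMS_MAPPING:
        if original in name:
            return name.replace(original, fused), shard_id
    return name, None


print(remap_weight_name("language_model.model.layers.0.self_attn.k_proj.weight"))
print(remap_weight_name("language_model.model.layers.0.mlp.down_proj.weight"))
```

In a real loader the shard id would be passed to the fused parameter's weight loader so that each checkpoint tensor is copied into the correct slice. `load_weights` returns the set of parameter names it loaded.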

TimestepEmbedder

Bases: Module

frequency_embedding_size instance-attribute

frequency_embedding_size = frequency_embedding_size

mlp instance-attribute

mlp = Sequential(
    Linear(
        frequency_embedding_size, hidden_size, bias=True
    ),
    SiLU(),
    Linear(hidden_size, hidden_size, bias=True),
)

forward

forward(t)

timestep_embedding staticmethod

timestep_embedding(t, dim, max_period=10000.0)
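The signature matches the sinusoidal timestep embedding used in many diffusion and flow-matching models. The scalar, pure-Python sketch below follows that common formulation (geometrically spaced frequencies, cosine half followed by sine half, zero padding for odd `dim`). The actual method body is not shown here and presumably operates on batched tensors, so the details are an assumption.

```python
import math


def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into a list of length dim."""
    half = dim // 2
    # Frequencies decay geometrically from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    emb = [math.cos(a) for a in args] + [math.sin(a) for a in args]
    if dim % 2:
        emb.append(0.0)  # pad so the output length equals dim
    return emb


print(timestep_embedding(0.0, 4))  # [1.0, 1.0, 0.0, 0.0]
```

The resulting vector would then be passed through the `mlp` listed above, which projects from `frequency_embedding_size` to `hidden_size`.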

get_sensenova_u1_post_process_func

get_sensenova_u1_post_process_func(
    od_config: OmniDiffusionConfig,
)