SenseNova-U1-8B-MoT

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/sensenova_u1

Architecture

SenseNova-U1 is a unified Qwen3-based LLM with Mixture-of-Transformers (MoT) attention. Unlike two-stage pipelines (e.g., BAGEL), it handles text encoding, optional reasoning ("think mode"), and flow-matching-based image denoising entirely within a single diffusion stage.

| Feature | Description |
| --- | --- |
| Model type | Unified LLM (single-stage diffusion pipeline) |
| Base LLM | Qwen3 with MoT attention and 3D RoPE |
| Image generation | Flow-matching Euler sampler, no separate VAE |
| Think mode | Optional chain-of-thought reasoning before image generation |
| Parallelism | Tensor Parallelism (TP) with fused QKV and fused gate/up projections |
| Modalities | text2img, img2img, img2text (understanding), text2text |
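
The "Flow-matching Euler sampler" entry can be illustrated with a toy sketch in plain Python. It assumes a time convention where t = 0 is noise and t = 1 is data, and it uses a hypothetical `toy_velocity` function in place of the model's velocity prediction; neither detail is taken from the SenseNova-U1 implementation.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(x0: Vector,
                 velocity: Callable[[Vector, float], Vector],
                 timesteps: List[float]) -> Vector:
    """Integrate dx/dt = velocity(x, t) with explicit Euler steps."""
    x = list(x0)
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        v = velocity(x, t_cur)
        dt = t_next - t_cur
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for the model: a fixed target "image" of two values.
TARGET = [1.0, -2.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    # On a straight path x_t = (1 - t) * noise + t * target, the velocity
    # that reaches the target at t = 1 is (target - x_t) / (1 - t).
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]


steps = 50
timesteps = [i / steps for i in range(steps + 1)]
sample = euler_sample([0.3, 0.7], toy_velocity, timesteps)
```

The velocity is only evaluated at the start of each interval, so the singular point t = 1 is never queried.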

Quick Start

cd examples/offline_inference/sensenova_u1

# Text-to-image
python end2end.py --prompt "A cute cat" --think

# Image-to-image editing
python end2end.py --prompt "Turn this into an oil painting" \
                  --image input.png --think

# Image understanding (img2text)
python end2end.py --modality img2text \
                  --prompt "Describe this image in detail" \
                  --image photo.jpg

# Text-to-text (pure chat)
python end2end.py --modality text2text \
                  --prompt "What is the capital of France?"

# Custom resolution (image generation)
python end2end.py --prompt "A futuristic cityscape at sunset" \
                  --width 2048 --height 1024 --think

# Text-to-image with Cache-DiT acceleration
python end2end.py --prompt "A cute cat" \
                  --cache-backend cache_dit

Note: Default configuration works on a single NVIDIA A100 (80GB) or H100 GPU.
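
For batch runs it can be convenient to assemble these command lines from Python. The sketch below builds an argument list using only flags that appear in this document; the `build_command` helper itself is hypothetical and is not part of the example directory.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(prompt: str,
                  images: Sequence[str] = (),
                  width: Optional[int] = None,
                  height: Optional[int] = None,
                  think: bool = False) -> List[str]:
    """Return an argv list for end2end.py (hypothetical helper)."""
    cmd = ["python", "end2end.py", "--prompt", prompt]
    if images:
        # One or more paths follow a single --image flag.
        cmd += ["--image", *images]
    if width is not None:
        cmd += ["--width", str(width)]
    if height is not None:
        cmd += ["--height", str(height)]
    if think:
        cmd.append("--think")
    return cmd


cmd = build_command("A cute cat", think=True)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside `examples/offline_inference/sensenova_u1`.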

Text-to-Image (t2i)

Standard text-to-image with optional think mode:

python end2end.py \
    --prompt "Close portrait of an elderly woman by a farmhouse window, textured skin, gentle smile, warm natural light" \
    --width 1536 --height 2720 \
    --think --print-think

Image-to-Image Editing (img2img)

Pass one or more --image paths to trigger img2img mode. The model uses the input image(s) as visual context for editing:

# Single input image
python end2end.py \
    --prompt "Add a sunset sky in the background" \
    --image photo.jpg \
    --width 2048 --height 2048 \
    --think

# Multiple reference images
python end2end.py \
    --prompt "Combine the style of Image-1 with the content of Image-2" \
    --image style_ref.png content_ref.png \
    --width 2048 --height 2048 \
    --think

Dual CFG for img2img

img2img supports dual classifier-free guidance (text CFG + image CFG):

python end2end.py \
    --prompt "Make the person smile" \
    --image portrait.jpg \
    --cfg-scale 4.0 \
    --img-cfg-scale 2.0 \
    --think

| img_cfg_scale | Behavior |
| --- | --- |
| 1.0 (default) | Text-only CFG: guidance = image-condition → full-condition |
| == cfg_scale | Standard CFG: guidance = unconditional → full-condition |
| Other value | Dual CFG: separate text and image guidance strengths |
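
The three cases can be reproduced by a single combination rule. The sketch below is one formulation that is consistent with the behaviors listed above; it is an assumption, not necessarily the expression the pipeline uses. The names `v_uncond`, `v_img`, and `v_full` are hypothetical and stand for the model outputs with no condition, image condition only, and image plus text condition.

```python
from typing import List

Vector = List[float]


def dual_cfg(v_uncond: Vector, v_img: Vector, v_full: Vector,
             cfg_scale: float, img_cfg_scale: float = 1.0) -> Vector:
    """Combine three predictions with separate image and text guidance."""
    return [
        u + img_cfg_scale * (i - u) + cfg_scale * (f - i)
        for u, i, f in zip(v_uncond, v_img, v_full)
    ]


v_uncond, v_img, v_full = [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]

# img_cfg_scale == 1.0: reduces to v_img + cfg_scale * (v_full - v_img)
text_only = dual_cfg(v_uncond, v_img, v_full, cfg_scale=4.0, img_cfg_scale=1.0)

# img_cfg_scale == cfg_scale: reduces to v_uncond + cfg_scale * (v_full - v_uncond)
standard = dual_cfg(v_uncond, v_img, v_full, cfg_scale=4.0, img_cfg_scale=4.0)
```

With `img_cfg_scale=1.0` the unconditional term cancels, so under this formulation the unconditional prediction would not be needed at all.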

Full Parameter Reference

General Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --modality | auto | Task modality: auto, text2img, img2img, img2text, text2text |
| --prompt | "A cute cat..." | Text prompt / editing instruction / question |
| --image | None | Input image path(s) for img2img or img2text |
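
As a rough illustration of how `--modality auto` could be resolved, the sketch below follows the rule stated earlier that passing `--image` triggers img2img. The `resolve_modality` function is hypothetical, and the fallback to text2img when no image is given is an assumption; the script's actual logic may differ.

```python
from typing import Sequence

VALID_MODALITIES = ("auto", "text2img", "img2img", "img2text", "text2text")


def resolve_modality(modality: str, images: Sequence[str]) -> str:
    """Map the --modality value to a concrete task (hypothetical logic)."""
    if modality not in VALID_MODALITIES:
        raise ValueError(f"unknown modality: {modality!r}")
    if modality != "auto":
        # An explicit choice always wins, e.g. img2text must be requested.
        return modality
    # Assumption: auto only distinguishes the two image-generation tasks.
    return "img2img" if images else "text2img"
```

Under this reading, img2text and text2text are reached only through an explicit `--modality` value, which matches how the examples in this document invoke them.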

Image Generation Parameters (text2img / img2img)

| Parameter | Default | Description |
| --- | --- | --- |
| --height | 2048 | Height of generated image (pixels) |
| --width | 2048 | Width of generated image (pixels) |
| --seed | 42 | Random seed for reproducibility |
| --num-steps | 50 | Number of denoising steps |
| --cfg-scale | 4.0 | Text classifier-free guidance scale |
| --img-cfg-scale | 1.0 | Image CFG scale (img2img only, 1.0 = disabled) |
| --cfg-norm | "none" | CFG normalization: none, global, channel, cfg_zero_star (t2i only) |
| --timestep-shift | 3.0 | Timestep shift for flow-matching schedule |
| --t-eps | 0.02 | Epsilon for timestep schedule |
| --think | False | Enable think mode |
| --print-think | False | Print think text to stdout |
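
To give a feel for what a timestep shift does, the sketch below applies a shift formula commonly used with flow-matching samplers to a uniform noise-level grid. Whether SenseNova-U1 uses exactly this formula is an assumption, and `--t-eps` is deliberately left out because its exact role is not described here. The function names are hypothetical.

```python
from typing import List


def shift_sigma(sigma: float, shift: float) -> float:
    """Warp a noise level in [0, 1]; shift > 1 spends more steps at high noise."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(num_steps: int, shift: float = 3.0) -> List[float]:
    """Noise levels from 1.0 (pure noise) down to 0.0 (clean image)."""
    uniform = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [shift_sigma(s, shift) for s in uniform]


schedule = shifted_schedule(num_steps=50, shift=3.0)
midpoint = shift_sigma(0.5, 3.0)  # the uniform midpoint moves toward high noise
```

With `shift=1.0` the schedule stays uniform; larger values keep the endpoints fixed but concentrate steps near the noisy end.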

Text Generation Parameters (text2text / img2text)

| Parameter | Default | Description |
| --- | --- | --- |
| --max-tokens | 512 | Maximum number of tokens to generate |
| --do-sample | False | Use sampling instead of greedy decoding |
| --temperature | 0.7 | Sampling temperature (higher = more diverse) |
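
The difference between greedy decoding and temperature sampling can be sketched for a single decoding step as follows. This is a generic illustration, not the script's decoding code; `pick_token` is a hypothetical name, and the assumption that `--temperature` only matters when `--do-sample` is set is inferred from the descriptions above.

```python
import math
import random
from typing import List, Optional


def pick_token(logits: List[float],
               do_sample: bool = False,
               temperature: float = 0.7,
               rng: Optional[random.Random] = None) -> int:
    """Choose the next token id from raw logits."""
    if not do_sample:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]
greedy_id = pick_token(logits)
sampled_id = pick_token(logits, do_sample=True, rng=random.Random(42))
```

Lower temperatures sharpen the distribution toward the greedy choice, while higher temperatures flatten it.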

Infrastructure Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --model | SenseNova/SenseNova-U1-8B-MoT | HuggingFace model ID or local path |
| --output | . | Output directory for saved images |
| --tensor-parallel-size | 1 | Number of GPUs for tensor parallelism |
| --enforce-eager | False | Disable torch.compile |
| --enable-cpu-offload | False | Enable module-wise (sequential) CPU offload to reduce peak VRAM |
| --cache-backend | None | Set to cache_dit for Cache-DiT acceleration |
| --enable-cache-dit-summary | False | Print Cache-DiT cache statistics after generation |

Reducing GPU Memory Usage

For hardware with limited VRAM, enable module-wise CPU offload with --enable-cpu-offload. The pipeline implements SupportsModuleOffload, so the vision encoder (vision_model) and the Qwen3 LLM (language_model) are swapped between CPU and GPU on demand:

  • During text/vision encoding, the LLM is on CPU.
  • During the diffusion loop, the vision encoder is on CPU.
  • Lightweight flow-matching (FM) modules stay resident on GPU.
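
The swap pattern in the list above can be mimicked with a toy simulation that only tracks where each module lives. `ToyModule` and `activate` are hypothetical stand-ins and do not reflect the `SupportsModuleOffload` interface or any real tensor movement.

```python
class ToyModule:
    """Stand-in for a large submodule; only records its device."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.device = "cpu"

    def to(self, device: str) -> "ToyModule":
        self.device = device
        return self


def activate(active: ToyModule, idle: ToyModule) -> None:
    # Move the idle module out first so both are never on the GPU together.
    idle.to("cpu")
    active.to("cuda")


vision_model = ToyModule("vision_model")
language_model = ToyModule("language_model")
trace = []

activate(vision_model, language_model)   # encoding phase
trace.append(("encode", vision_model.device, language_model.device))

activate(language_model, vision_model)   # diffusion loop
trace.append(("denoise", vision_model.device, language_model.device))
```

Because only one of the two large modules is resident at a time, peak VRAM is bounded by the larger of the two rather than by their sum.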

This lowers peak VRAM at the cost of extra CPU<->GPU transfers (use pinned memory), which makes it useful for running SenseNova-U1 on consumer-grade GPUs.

# Text-to-image with CPU offload
python end2end.py \
    --prompt "A cute cat sitting on a windowsill" \
    --width 2048 --height 2048 \
    --enable-cpu-offload --think

# Image-to-image editing with CPU offload
python end2end.py \
    --prompt "Turn this into an oil painting" \
    --image input.png \
    --width 2048 --height 2048 \
    --enable-cpu-offload --think

# Image understanding (img2text) with CPU offload
python end2end.py \
    --modality img2text \
    --prompt "Describe this image in detail" \
    --image photo.jpg \
    --enable-cpu-offload

Notes:

  • CPU offload is single-GPU only (incompatible with --tensor-parallel-size > 1).
  • First-step latency is higher because of the cold-start CPU<->GPU transfers.
  • For more details and other offloading strategies, see CPU Offloading for Diffusion Models.
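
When wrapping the script in a larger job launcher, the single-GPU restriction can be checked before any model is loaded. The sketch below is a hypothetical pre-flight check; `check_offload_config` is not part of vllm-omni, and the script may report this conflict in its own way.

```python
def check_offload_config(enable_cpu_offload: bool,
                         tensor_parallel_size: int) -> None:
    """Raise early if CPU offload is combined with tensor parallelism."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    if enable_cpu_offload and tensor_parallel_size > 1:
        raise ValueError(
            "CPU offload is single-GPU only; "
            "use --tensor-parallel-size 1 or drop --enable-cpu-offload"
        )


check_offload_config(enable_cpu_offload=True, tensor_parallel_size=1)  # passes
```

Failing fast here avoids spending time on model download and weight loading for a configuration that cannot run.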

Reproducing the E2E Test

The following command reproduces the pixel-validated CI test case:

python end2end.py \
    --prompt "Close portrait of an elderly woman by a farmhouse window, textured skin, gentle smile, warm natural light, emotional documentary look. The portrait should feel polished and natural, with sharp eyes, realistic skin texture, accurate facial anatomy, and premium lighting that keeps the face as the main focus." \
    --width 1536 --height 2720 \
    --seed 42 --num-steps 50 \
    --cfg-scale 4.0 --timestep-shift 3.0 --cfg-norm none \
    --think --print-think \
    --output outputs

The corresponding pytest:

pytest -s -v tests/e2e/offline_inference/test_sensenova_u1_text2img.py \
    -m "advanced_model" --run-level "advanced_model"

The img2img counterpart lives at tests/e2e/offline_inference/test_sensenova_u1_img2img.py.

Online Serving

For OpenAI-compatible API serving, see examples/online_serving/sensenova_u1/.

Example materials

end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/sensenova_u1/end2end.py.