Text-To-Image

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/text_to_image.

Generate images from text prompts using vLLM-Omni's diffusion pipeline entrypoints.

  • text_to_image.py: command-line script for single image generation with advanced options.
  • gradio_demo.py: lightweight Gradio UI for interactive prompt/seed/CFG exploration.

Overview

This folder provides entrypoints for experimenting with text-to-image diffusion models using vLLM-Omni. NextStep-1.1 uses a different architecture, so it runs through its own pipeline and takes additional model-specific arguments.

Supported Models

| Model | Image Shape | Peak VRAM (GiB) * | Model Weights (GiB) |
|---|---|---|---|
| Qwen/Qwen-Image | 1024 x 1024 | 60.0 | 53.7 |
| Qwen/Qwen-Image-2512 | 1024 x 1024 | 60.0 | 53.7 |
| Tongyi-MAI/Z-Image-Turbo | 1024 x 1024 | 24.8 | 19.2 |
| stepfun-ai/NextStep-1.1 | 512 x 512 | 71.8 | 28.1 |
| meituan-longcat/LongCat-Image | 1024 x 1024 | 71.2 | 27.3 |
| AIDC-AI/Ovis-Image-7B | 1024 x 1024 | 71.8 | 17.1 |
| OmniGen2/OmniGen2 | 1024 x 1024 | 20.1 | 14.7 |
| stabilityai/stable-diffusion-3.5-medium | 1024 x 1024 | 20.1 | 15.6 |
| black-forest-labs/FLUX.1-dev | 1024 x 1024 | 33.9 | 31.4 |
| black-forest-labs/FLUX.1-schnell | 1024 x 1024 | 33.9 | 31.4 |
| black-forest-labs/FLUX.2-klein-4B | 1024 x 1024 | 72.7 | 14.9 |
| black-forest-labs/FLUX.2-klein-9B | 1024 x 1024 | 37.1 | 32.3 |
| black-forest-labs/FLUX.2-dev | 1024 x 1024 | 65.7 | >80 (CPU offload required) |
| HunyuanImage-3.0 | 1024 x 1024 | 80.0 (TP≥3) | 160 |
| HiDream-I1-Full | 1024 x 1024 | 63.7 | 57.7 |

Info

*Peak VRAM: measured with basic single-card usage, batch size = 1, and no acceleration or optimization features enabled. FLUX.2-dev requires --enable-cpu-offload on a single 80 GiB GPU.

Default model: Qwen/Qwen-Image
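
As a rough aid for picking a model, the peak VRAM figures above can be compared against the memory of your GPU. The sketch below copies the single-card numbers from the table. The helper name `models_that_fit` and the 2 GiB headroom are illustrative assumptions, not part of vLLM-Omni. FLUX.2-dev and HunyuanImage-3.0 are left out because they need CPU offload or tensor parallelism.

```python
# Peak VRAM (GiB) copied from the Supported Models table:
# single card, batch size 1, no acceleration/optimization features.
PEAK_VRAM_GIB = {
    "Qwen/Qwen-Image": 60.0,
    "Qwen/Qwen-Image-2512": 60.0,
    "Tongyi-MAI/Z-Image-Turbo": 24.8,
    "stepfun-ai/NextStep-1.1": 71.8,
    "meituan-longcat/LongCat-Image": 71.2,
    "AIDC-AI/Ovis-Image-7B": 71.8,
    "OmniGen2/OmniGen2": 20.1,
    "stabilityai/stable-diffusion-3.5-medium": 20.1,
    "black-forest-labs/FLUX.1-dev": 33.9,
    "black-forest-labs/FLUX.1-schnell": 33.9,
    "black-forest-labs/FLUX.2-klein-4B": 72.7,
    "black-forest-labs/FLUX.2-klein-9B": 37.1,
    "HiDream-I1-Full": 63.7,
}


def models_that_fit(gpu_memory_gib, headroom_gib=2.0):
    """Return model names whose listed peak VRAM fits within the budget.

    headroom_gib is a hypothetical safety margin, not a measured value.
    """
    budget = gpu_memory_gib - headroom_gib
    return sorted(name for name, peak in PEAK_VRAM_GIB.items() if peak <= budget)


if __name__ == "__main__":
    for name in models_that_fit(40):
        print(name)
```

The figures are measurements at the listed image shape, so larger resolutions or batch sizes may need more memory than this check suggests.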

Quick Start

Python API

Single-prompt generation:

from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(model="Qwen/Qwen-Image")
    prompt = "a cup of coffee on the table"
    outputs = omni.generate(prompt)
    images = outputs[0].request_output.images
    images[0].save("coffee.png")

Local CLI Usage

python text_to_image.py \
  --model Qwen/Qwen-Image \
  --prompt "a cup of coffee on the table" \
  --output coffee.png

Key Arguments

Common arguments:

| Argument | Type | Default | Description |
|---|---|---|---|
| --prompt | str | "a cup of coffee on the table" | Text description for image generation |
| --seed | int | 142 | Integer seed for deterministic sampling |
| --negative-prompt | str | None | Negative prompt for classifier-free conditional guidance |
| --cfg-scale | float | 4.0 | True CFG scale (model-specific guidance strength) |
| --guidance-scale | float | 1.0 | Classifier-free guidance scale |
| --num-images-per-prompt | int | 1 | Number of images per prompt (saved as output, output_1, ...) |
| --num-inference-steps | int | 50 | Diffusion sampling steps (more steps = higher quality, slower) |
| --height | int | 1024 | Output image height in pixels |
| --width | int | 1024 | Output image width in pixels |
| --output | str | "qwen_image_output.png" | Path to save the generated image |
| --vae-use-slicing | flag | off | Enable VAE slicing for memory optimization |
| --vae-use-tiling | flag | off | Enable VAE tiling for memory optimization |
| --cfg-parallel-size | int | 1 | Set to 2 to enable CFG Parallel |
| --ulysses-degree | int | 1 | Ulysses sequence parallel degree for multi-GPU inference |
| --ring-degree | int | 1 | Ring sequence parallel degree for hybrid Ulysses + Ring inference |
| --ulysses-mode | str | "strict" | Ulysses SP mode: "strict" or "advanced_uaa" |
| --enable-cpu-offload | flag | off | Enable CPU offloading for diffusion models |
| --lora-path | str | | Path to PEFT LoRA adapter folder |
| --lora-scale | float | 1.0 | Scale factor for LoRA weights |
| --use-system-prompt | str | None | System prompt preset: en_unified, en_vanilla, en_recaption, en_think_recaption, dynamic, None, or custom text. Recommended: en_unified. Only for HunyuanImage-3.0. |
| --system-prompt | str | None | Custom system prompt text. Only used when --use-system-prompt is set to custom. Only for HunyuanImage-3.0. |
| --auxiliary-text-encoder | str | None | Model name or path of an auxiliary text encoder (needed in particular for HiDream-I1-Full). |
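
The --num-images-per-prompt description says extra images are saved as output, output_1, and so on. A minimal sketch of such a naming scheme is shown below. The helper `output_paths` is hypothetical, and placing the index before the file extension is an assumption, so check text_to_image.py for the exact behavior.

```python
from pathlib import Path


def output_paths(output, num_images):
    """Derive one path per image: output.png, output_1.png, output_2.png, ...

    Assumption: the index is appended to the file stem and the extension
    is kept. The real script may name files differently.
    """
    base = Path(output)
    paths = [base]
    for index in range(1, num_images):
        paths.append(base.with_name(f"{base.stem}_{index}{base.suffix}"))
    return paths


if __name__ == "__main__":
    for path in output_paths("outputs/coffee.png", 3):
        print(path)
```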

NextStep-1.1 specific arguments:

| Argument | Type | Default | Description |
|---|---|---|---|
| --guidance-scale-2 | float | 1.0 | Secondary guidance scale (e.g. image-level CFG) |
| --timesteps-shift | float | 1.0 | Timesteps shift parameter for sampling |
| --cfg-schedule | str | "constant" | CFG schedule type: "constant" or "linear" |
| --use-norm | flag | off | Apply layer normalization to sampled tokens |

If you encounter OOM errors, try using --vae-use-slicing and --vae-use-tiling to reduce memory usage.

Qwen-Image currently publishes best-effort presets at 1328x1328, 1664x928, 928x1664, 1472x1140, 1140x1472, 1584x1056, and 1056x1584. Adjust --height/--width accordingly for the most reliable outcomes.
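
If you start from an arbitrary target size, one way to choose among these presets is to pick the one with the closest aspect ratio. The sketch below assumes the presets are written as width x height. The helper `nearest_preset` is illustrative and not part of vLLM-Omni.

```python
import math

# Presets listed above, interpreted as (width, height). That ordering is an assumption.
QWEN_IMAGE_PRESETS = [
    (1328, 1328),
    (1664, 928),
    (928, 1664),
    (1472, 1140),
    (1140, 1472),
    (1584, 1056),
    (1056, 1584),
]


def nearest_preset(width, height):
    """Return the preset whose aspect ratio is closest to width / height.

    Ratios are compared in log space so that portrait and landscape
    deviations are treated symmetrically.
    """
    target = math.log(width / height)
    return min(
        QWEN_IMAGE_PRESETS,
        key=lambda wh: abs(math.log(wh[0] / wh[1]) - target),
    )


if __name__ == "__main__":
    width, height = nearest_preset(1920, 1080)
    print(f"--width {width} --height {height}")
```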

More CLI Examples

Tongyi Models

python text_to_image.py \
  --model Tongyi-MAI/Z-Image-Turbo \
  --prompt "a cup of coffee on the table" \
  --seed 42 \
  --guidance-scale 0.0 \
  --num-images-per-prompt 1 \
  --num-inference-steps 9 \
  --height 1024 \
  --width 1024 \
  --output outputs/coffee.png

Tongyi-MAI/Z-Image-Turbo is a distilled version of Z-Image. Distilled diffusion models usually need fewer inference steps (4 to 9) and usually do not apply Classifier-Free Guidance (CFG). Similar distilled models include black-forest-labs/FLUX.2-klein-4B and black-forest-labs/FLUX.2-klein-9B.

Advanced UAA example (requires 2 GPUs):

python text_to_image.py \
  --model Tongyi-MAI/Z-Image-Turbo \
  --prompt "a cup of coffee on the table" \
  --ulysses-degree 2 \
  --ulysses-mode advanced_uaa \
  --height 1024 \
  --width 1024 \
  --output outputs/coffee_hybrid.png

NextStep Models

NextStep-1.1 supports extra arguments for dual-level CFG control:

python text_to_image.py \
  --model stepfun-ai/NextStep-1.1 \
  --prompt "A baby panda wearing an Iron Man mask, holding a board with 'NextStep-1' written on it" \
  --height 512 \
  --width 512 \
  --num-inference-steps 28 \
  --guidance-scale 7.5 \
  --guidance-scale-2 1.0 \
  --cfg-schedule constant \
  --output nextstep_output.png \
  --seed 42

FLUX.2-dev Models

To run FLUX.2-dev on a single GPU, --enable-cpu-offload is required because the model weights exceed 80 GiB:

python examples/offline_inference/text_to_image/text_to_image.py \
  --model black-forest-labs/FLUX.2-dev \
  --prompt "a lovely bunny holding a sign that says 'vllm-omni'" \
  --seed 42 \
  --tensor-parallel-size 1 \
  --num-images-per-prompt 1 \
  --num-inference-steps 50 \
  --guidance-scale 4.0 \
  --height 1024 \
  --width 1024 \
  --enable-cpu-offload \
  --output flux2-dev.png

HiDream-I1-Full Models

The --auxiliary-text-encoder parameter is required when running HiDream-I1-Full:

python examples/offline_inference/text_to_image/text_to_image.py \
  --model HiDream-ai/HiDream-I1-Full \
  --prompt "The setting sun of late autumn dyes the riverside with a warm orange hue" \
  --seed 42 \
  --guidance-scale 5.0 \
  --tensor-parallel-size 1 \
  --num-images-per-prompt 1 \
  --num-inference-steps 50 \
  --auxiliary-text-encoder meta-llama/Meta-Llama-3.1-8B-Instruct \
  --output output.png

Batch Requests (Multiple Prompts)

You can pass multiple prompts in a single generate call.

from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(model="Qwen/Qwen-Image")
    prompts = [
        "a cup of coffee on a table",
        "a toy dinosaur on a sandy beach",
        "a fox waking up in bed and yawning",
    ]
    outputs = omni.generate(prompts)
    for i, output in enumerate(outputs):
        output.request_output.images[0].save(f"{i}.jpg")

Info

Not all models support batch inference, and batching requests usually does not improve performance significantly. This feature exists primarily for interface compatibility with vLLM and to allow for future improvements.

Info

For diffusion pipelines, the stage config field stage_args.[].runtime.max_batch_size is 1 by default, so the input list is sliced into single-item requests before it is fed into the diffusion pipeline. For models that support batched inputs internally, you can raise this value to let the model accept a larger batch of prompts.
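
The slicing described here can be pictured as simple chunking of the prompt list. This is a sketch of the described behavior, not the vLLM-Omni implementation. The function name `slice_into_batches` is hypothetical.

```python
def slice_into_batches(prompts, max_batch_size=1):
    """Split a prompt list into consecutive chunks of at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        prompts[start:start + max_batch_size]
        for start in range(0, len(prompts), max_batch_size)
    ]


if __name__ == "__main__":
    prompts = [
        "a cup of coffee on a table",
        "a toy dinosaur on a sandy beach",
        "a fox waking up in bed and yawning",
    ]
    # With the default of 1, every prompt becomes its own request.
    print(slice_into_batches(prompts))
    # With a larger value, prompts are grouped before reaching the pipeline.
    print(slice_into_batches(prompts, max_batch_size=2))
```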

Negative Prompts

vLLM-Omni supports dictionary prompts for models that accept negative prompts:

from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(model="Qwen/Qwen-Image")
    outputs = omni.generate([
        {
            "prompt": "a cup of coffee on a table",
            "negative_prompt": "low resolution"
        },
        {
            "prompt": "a toy dinosaur on a sandy beach",
            "negative_prompt": "cinematic, realistic"
        }
    ])
    for i, output in enumerate(outputs):
        output.request_output.images[0].save(f"{i}.jpg")
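
When prompts and negative prompts come from separate sources, a small helper can assemble the dictionary form shown above. The helper `build_prompts` is illustrative. Only the "prompt" and "negative_prompt" keys come from this page, and whether a dictionary without "negative_prompt" is accepted is an assumption worth verifying for your model.

```python
def build_prompts(pairs):
    """Turn (prompt, negative_prompt) pairs into dictionary prompts.

    A negative prompt of None or "" is omitted from the dictionary.
    """
    requests = []
    for prompt, negative in pairs:
        request = {"prompt": prompt}
        if negative:
            request["negative_prompt"] = negative
        requests.append(request)
    return requests


if __name__ == "__main__":
    requests = build_prompts([
        ("a cup of coffee on a table", "low resolution"),
        ("a toy dinosaur on a sandy beach", None),
    ])
    print(requests)
```

The resulting list has the same shape as the argument passed to omni.generate in the example above.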

You can also pass a negative prompt via the CLI argument --negative-prompt:

python examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image \
  --prompt "a cup of coffee on a table" \
  --negative-prompt "low resolution, blurry" \
  --output coffee.png

Advanced Features

CFG Parallel

Set --cfg-parallel-size 2 to enable CFG Parallel for faster inference on multi-GPU setups. See more examples in the cfg_parallel user guide.
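
To see why two workers help, recall that classifier-free guidance commonly combines a conditional and an unconditional prediction at each step, and the two predictions do not depend on each other. The toy sketch below uses the widely used combination uncond + scale * (cond - uncond) and runs the two branches concurrently in threads. `toy_denoiser` is a stand-in, not a real model, and this does not show how vLLM-Omni distributes the work across GPUs.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_denoiser(latent, conditioning):
    """Stand-in for one model forward pass; purely illustrative arithmetic."""
    return [value * 0.5 + conditioning for value in latent]


def cfg_step(latent, cond, uncond, scale):
    """Run both CFG branches concurrently and combine their predictions."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The two passes are independent, so they can run side by side.
        cond_future = pool.submit(toy_denoiser, latent, cond)
        uncond_future = pool.submit(toy_denoiser, latent, uncond)
        eps_cond = cond_future.result()
        eps_uncond = uncond_future.result()
    # Standard CFG combination: move from the unconditional prediction
    # toward the conditional one by the guidance scale.
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    print(cfg_step([2.0, 4.0], cond=1.0, uncond=0.0, scale=4.0))
```

With a scale of 1.0 in this formulation the result equals the conditional prediction, so the unconditional pass adds nothing.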

LoRA

This example supports PEFT-compatible LoRA (Low-Rank Adaptation) adapters for diffusion models. Pass --lora-path to load a LoRA adapter, and optionally --lora-scale (default 1.0) to set its strength. Omit --lora-path to use the base model only.

python text_to_image.py \
  --model Tongyi-MAI/Z-Image-Turbo \
  --prompt "A piece of cheesecake" \
  --lora-path /path/to/lora/ \
  --lora-scale 1.0 \
  --output output.png

LoRA adapters must be in PEFT format. A typical adapter directory structure:

lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
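
Before launching a run, it can be useful to confirm that the adapter folder matches this layout. The check below is a minimal sketch that looks only for the two files listed above. The helper `missing_adapter_files` is hypothetical, and adapters that store weights under another filename (PEFT can also save adapter_model.bin) would need the list adjusted.

```python
from pathlib import Path

# File names taken from the typical directory structure shown above.
EXPECTED_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(lora_path):
    """Return the expected adapter files that are absent from lora_path."""
    root = Path(lora_path)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_adapter_files("/path/to/lora/")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Adapter folder looks complete.")
```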

Web UI Demo

Launch the Gradio demo:

python gradio_demo.py --port 7862

Then open http://localhost:7862/ in your local browser to interact with the web UI.

Example materials

gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_image/gradio_demo.py.

text_to_image.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_image/text_to_image.py.