
vllm_omni.diffusion.models.omnigen2.pipeline_omnigen2

logger module-attribute

logger = getLogger(__name__)

FlowMatchEulerDiscreteScheduler

Bases: SchedulerMixin, ConfigMixin

Euler scheduler.

This scheduler inherits from [SchedulerMixin] and [ConfigMixin]. See the superclass documentation for the generic methods the library implements for all schedulers, such as loading and saving.

Parameters:

Name Type Description Default
num_train_timesteps `int`, defaults to 1000

The number of diffusion steps to train the model.

1000
dynamic_time_shift `bool`, defaults to `True`

Whether to use dynamic time shifting for the timestep schedule.

True

begin_index property

begin_index

The index for the first timestep. It should be set from the pipeline with the set_begin_index method.

order class-attribute instance-attribute

order = 1

step_index property

step_index

The index counter for the current timestep. It increases by 1 after each scheduler step.

timesteps instance-attribute

timesteps = timesteps

index_for_timestep

index_for_timestep(timestep, schedule_timesteps=None)

set_begin_index

set_begin_index(begin_index: int = 0)

Sets the begin index for the scheduler. This function should be called from the pipeline before inference.

Parameters:

Name Type Description Default
begin_index `int`

The begin index for the scheduler.

0

set_timesteps

set_timesteps(
    num_inference_steps: int = None,
    device: str | device = None,
    timesteps: list[float] | None = None,
    num_tokens: int | None = None,
)

Sets the discrete timesteps used for the diffusion chain (to be run before inference).

Parameters:

Name Type Description Default
num_inference_steps `int`

The number of diffusion steps used when generating samples with a pre-trained model.

None
device `str` or `torch.device`, *optional*

The device to which the timesteps should be moved. If None, the timesteps are not moved.

None
timesteps `list[float]`, *optional*

Custom timesteps to use. If provided, num_inference_steps is ignored.

None
num_tokens `int`, *optional*

Number of tokens, used for dynamic time shifting.

None

step

step(
    model_output: FloatTensor,
    timestep: float | FloatTensor,
    sample: FloatTensor,
    generator: Generator | None = None,
    return_dict: bool = True,
) -> FlowMatchEulerDiscreteSchedulerOutput | tuple

Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion process from the learned model outputs (most often the predicted noise).

Parameters:

Name Type Description Default
model_output `torch.FloatTensor`

The direct output from the learned diffusion model.

required
timestep `float`

The current discrete timestep in the diffusion chain.

required
sample `torch.FloatTensor`

A current instance of a sample created by the diffusion process.

required
generator `torch.Generator`, *optional*

A random number generator.

None
return_dict `bool`

Whether or not to return a [~FlowMatchEulerDiscreteSchedulerOutput] or tuple.

True

Returns:

Type Description
FlowMatchEulerDiscreteSchedulerOutput | tuple

[~FlowMatchEulerDiscreteSchedulerOutput] or tuple: If return_dict is True, [~FlowMatchEulerDiscreteSchedulerOutput] is returned, otherwise a tuple is returned where the first element is the sample tensor.
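
The update performed by an Euler-type scheduler can be illustrated with a minimal sketch. It assumes the model output is treated as a velocity and that the schedule is a plain list of scalar times. The function name `euler_step`, the list-based samples, and the toy schedule are all illustrative assumptions, not the library's implementation.

```python
def euler_step(sample, model_output, t_current, t_next):
    """One explicit Euler update: x_next = x + (t_next - t_current) * v."""
    dt = t_next - t_current
    return [x + dt * v for x, v in zip(sample, model_output)]


# Toy schedule and a constant "velocity" standing in for the model output.
schedule = [0.0, 0.25, 0.5, 0.75, 1.0]
sample = [1.0, -2.0]
velocity = [4.0, 8.0]

# Walk the schedule pairwise, feeding each result back in as the next sample,
# much as prev_sample is fed back into the denoising loop.
for t_current, t_next in zip(schedule[:-1], schedule[1:]):
    sample = euler_step(sample, velocity, t_current, t_next)
```

In a real denoising loop the velocity would be recomputed by the model at every step; it is held constant here only to keep the arithmetic easy to follow.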

FlowMatchEulerDiscreteSchedulerOutput dataclass

Bases: BaseOutput

Output class for the scheduler's step function output.

Parameters:

Name Type Description Default
prev_sample `torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images

Computed sample (x_{t-1}) of the previous timestep. prev_sample should be used as the next model input in the denoising loop.

required

prev_sample instance-attribute

prev_sample: FloatTensor

OmniGen2ImageProcessor

Bases: VaeImageProcessor

Image processor for OmniGen2 image resize and crop.

Parameters:

Name Type Description Default
do_resize `bool`, *optional*, defaults to `True`

Whether to downscale the image's (height, width) dimensions to multiples of vae_scale_factor. Can accept height and width arguments from [image_processor.VaeImageProcessor.preprocess] method.

True
vae_scale_factor `int`, *optional*, defaults to `16`

VAE scale factor. If do_resize is True, the image is automatically resized to multiples of this factor.

16
resample `str`, *optional*, defaults to `lanczos`

Resampling filter to use when resizing the image.

'lanczos'
max_pixels `int`, *optional*, defaults to `1048576`

Maximum number of pixels allowed in the image. Images exceeding this limit are downscaled proportionally.

1024 * 1024
max_side_length `int`, *optional*, defaults to `1024`

Maximum length of the longer side of the image. Images exceeding this limit are downscaled proportionally.

1024
do_normalize `bool`, *optional*, defaults to `True`

Whether to normalize the image to [-1,1].

True
do_binarize `bool`, *optional*, defaults to `False`

Whether to binarize the image to 0/1.

False
do_convert_grayscale `bool`, *optional*, defaults to `False`

Whether to convert the images to grayscale format.

False

max_pixels instance-attribute

max_pixels = max_pixels

max_side_length instance-attribute

max_side_length = max_side_length

get_new_height_width

get_new_height_width(
    image: Image | ndarray | Tensor,
    height: int | None = None,
    width: int | None = None,
    max_pixels: int | None = None,
    max_side_length: int | None = None,
) -> tuple[int, int]

Returns the height and width of the image, downscaled to the next integer multiple of vae_scale_factor.

Parameters:

Name Type Description Default
image `Union[PIL.Image.Image, np.ndarray, torch.Tensor]`

The image input, which can be a PIL image, NumPy array, or PyTorch tensor. If it is a NumPy array, it should have shape [batch, height, width] or [batch, height, width, channels]. If it is a PyTorch tensor, it should have shape [batch, channels, height, width].

required
height `Optional[int]`, *optional*, defaults to `None`

The height of the preprocessed image. If None, the height of the image input will be used.

None
width `Optional[int]`, *optional*, defaults to `None`

The width of the preprocessed image. If None, the width of the image input will be used.

None

Returns:

Type Description
tuple[int, int]

Tuple[int, int]: A tuple containing the height and width, both resized to the nearest integer multiple of vae_scale_factor.
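
The sizing rule described above can be sketched as follows. This is a hedged approximation, not the method's actual code: it assumes the image is only ever shrunk (never enlarged), that both limits are applied through a single scale ratio, and that each side is then floored to a multiple of the scale factor. The function name `fit_height_width` and the `multiple` parameter are hypothetical.

```python
import math


def fit_height_width(height, width, max_pixels=1024 * 1024,
                     max_side_length=1024, multiple=16):
    """Shrink (height, width) to respect both limits, then floor to `multiple`."""
    # The ratio is capped at 1.0 so images are never enlarged (an assumption).
    ratio = 1.0
    ratio = min(ratio, max_side_length / max(height, width))
    ratio = min(ratio, math.sqrt(max_pixels / (height * width)))
    # Floor each side to a multiple of `multiple`, keeping at least one block.
    new_height = max(multiple, int(height * ratio) // multiple * multiple)
    new_width = max(multiple, int(width * ratio) // multiple * multiple)
    return new_height, new_width


large = fit_height_width(2048, 1024)   # limited by max_side_length
small = fit_height_width(500, 300)     # only rounded down to multiples of 16
```

With the defaults shown, a 2048 x 1024 input is halved by the side-length limit, while a 500 x 300 input is left at its scale and only rounded down.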

preprocess

preprocess(
    image: PipelineImageInput,
    height: int | None = None,
    width: int | None = None,
    max_pixels: int | None = None,
    max_side_length: int | None = None,
    resize_mode: str = "default",
    crops_coords: tuple[int, int, int, int] | None = None,
) -> Tensor

Preprocess the image input.

Parameters:

Name Type Description Default
image `PipelineImageInput`

The image input. Accepted formats are PIL images, NumPy arrays, and PyTorch tensors, as well as lists of any of these.

required
height `int`, *optional*

The height of the preprocessed image. If None, get_default_height_width() is used to determine the default height.

None
width `int`, *optional*

The width of the preprocessed image. If None, get_default_height_width() is used to determine the default width.

None
resize_mode `str`, *optional*, defaults to `default`

The resize mode, one of default, fill, or crop. If default, the image is resized to fit within the specified width and height, and the original aspect ratio may not be maintained. If fill, the image is resized to fit within the specified width and height while maintaining the aspect ratio, then centered within the dimensions, with empty areas filled using data from the image. If crop, the image is resized to fit within the specified width and height while maintaining the aspect ratio, then centered within the dimensions, with the excess cropped. Note that the fill and crop modes are only supported for PIL image input.

'default'
crops_coords `List[Tuple[int, int, int, int]]`, *optional*, defaults to `None`

The crop coordinates for each image in the batch. If None, will not crop the image.

None

Returns:

Type Description
Tensor

torch.Tensor: The preprocessed image.

OmniGen2Pipeline

Bases: CFGParallelMixin, Module, SupportsComponentDiscovery

Pipeline for text-to-image generation using OmniGen2.

This pipeline implements a text-to-image generation model that uses:

- Qwen2.5-VL for text encoding
- A custom transformer architecture for image generation
- VAE for image encoding/decoding
- FlowMatchEulerDiscreteScheduler for noise scheduling

Parameters:

Name Type Description Default
od_config OmniDiffusionConfig

The OmniDiffusion configuration.

required

cfg_range property

cfg_range

default_sample_size instance-attribute

default_sample_size = 128

device instance-attribute

device = get_local_device()

image_guidance_scale property

image_guidance_scale

image_processor instance-attribute

image_processor = OmniGen2ImageProcessor(
    vae_scale_factor=vae_scale_factor * 2, do_resize=True
)

mllm instance-attribute

mllm = to(device)

num_timesteps property

num_timesteps

od_config instance-attribute

od_config = od_config

processor instance-attribute

processor = from_pretrained_with_prefetch(
    from_pretrained,
    model,
    subfolder="processor",
    prefetch_list=omnigen2_subfolders,
    local_files_only=local_files_only,
)

scheduler instance-attribute

scheduler = from_pretrained(
    model,
    subfolder="scheduler",
    local_files_only=local_files_only,
)

text_guidance_scale property

text_guidance_scale

transformer instance-attribute

transformer = OmniGen2Transformer2DModel(
    **transformer_kwargs, quant_config=quantization_config
)

vae instance-attribute

vae = to(device)

vae_scale_factor instance-attribute

vae_scale_factor = (
    2 ** (len(block_out_channels) - 1)
    if hasattr(self, "vae") and vae is not None
    else 8
)
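
The arithmetic behind vae_scale_factor can be worked through with a small sketch. The helper name `vae_scale_factor_from_channels` and the four-entry channel tuple are hypothetical; only the formula and the fallback value of 8 come from the expression above.

```python
def vae_scale_factor_from_channels(block_out_channels=None, default=8):
    """Return 2 ** (number of VAE blocks - 1), or `default` when no VAE is available."""
    if block_out_channels is None:
        return default
    return 2 ** (len(block_out_channels) - 1)


# Hypothetical four-block VAE configuration.
vae_factor = vae_scale_factor_from_channels((128, 256, 512, 512))
# image_processor is constructed with twice the VAE factor.
processor_factor = vae_factor * 2
```

Under this assumed configuration the VAE factor is 8, so the image processor would work in multiples of 16, which matches the documented default of OmniGen2ImageProcessor.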

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=model,
        subfolder="transformer",
        revision=None,
        prefix="transformer.",
        fall_back_to_pt=True,
    )
]

combine_multi_branch_cfg_noise

combine_multi_branch_cfg_noise(
    predictions, true_cfg_scale, cfg_normalize=False
)

Override: 3-branch dual scale or 2-branch standard CFG.
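
The difference between the two modes can be illustrated with a hedged sketch of one common formulation of classifier-free guidance. The signature here is deliberately different from the method above: `combine_cfg`, its arguments, and the exact three-branch formula are assumptions for illustration, and the real override may order or weight the branches differently.

```python
def combine_cfg(cond, uncond, text_scale, ref=None, image_scale=1.0):
    """Blend branch predictions given as equal-length lists of floats.

    cond:   prediction conditioned on text (and reference images, if any)
    ref:    prediction conditioned on reference images only (optional)
    uncond: fully unconditional prediction
    """
    if ref is None:
        # 2-branch standard CFG: push away from the unconditional prediction.
        return [u + text_scale * (c - u) for c, u in zip(cond, uncond)]
    # 3-branch dual scale: one scale for the image step, one for the text step.
    return [
        u + image_scale * (r - u) + text_scale * (c - r)
        for c, r, u in zip(cond, ref, uncond)
    ]


two_branch = combine_cfg([1.0], [0.0], text_scale=4.0)
three_branch = combine_cfg([3.0], [0.0], text_scale=4.0, ref=[1.0], image_scale=2.0)
```

When both scales equal 1.0 the three-branch form reduces to the conditional prediction, which is a quick sanity check for any variant of this formula.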

encode_prompt

encode_prompt(
    prompt: str | list[str],
    do_classifier_free_guidance: bool = True,
    negative_prompt: str | list[str] | None = None,
    num_images_per_prompt: int = 1,
    device: device | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    prompt_attention_mask: Tensor | None = None,
    negative_prompt_attention_mask: Tensor | None = None,
    max_sequence_length: int = 256,
) -> tuple[Tensor, Tensor, Tensor, Tensor]

Encodes the prompt into text encoder hidden states.

Parameters:

Name Type Description Default
prompt `str` or `List[str]`, *optional*

The prompt or prompts to encode.

required
negative_prompt `str` or `List[str]`, *optional*

The prompt not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). For Lumina-T2I, this should be "".

None
do_classifier_free_guidance `bool`, *optional*, defaults to `True`

Whether to use classifier-free guidance.

True
num_images_per_prompt `int`, *optional*, defaults to 1

The number of images to generate per prompt.

1
device device | None

(torch.device, optional): torch device to place the resulting embeddings on

None
prompt_embeds `torch.Tensor`, *optional*

Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

None
negative_prompt_embeds `torch.Tensor`, *optional*

Pre-generated negative text embeddings. For Lumina-T2I, it should be the embeddings of the "" string.

None
max_sequence_length `int`, defaults to `256`

Maximum sequence length to use for the prompt.

256

encode_vae

encode_vae(img: FloatTensor) -> FloatTensor

Encode an image into the VAE latent space.

Parameters:

Name Type Description Default
img FloatTensor

The input image tensor to encode.

required

Returns:

Type Description
FloatTensor

torch.FloatTensor: The encoded latent representation.

forward

forward(
    req: OmniDiffusionRequest,
    prompt: str | list[str] | None = None,
    negative_prompt: str | list[str] | None = None,
    prompt_embeds: FloatTensor | None = None,
    negative_prompt_embeds: FloatTensor | None = None,
    prompt_attention_mask: LongTensor | None = None,
    negative_prompt_attention_mask: LongTensor
    | None = None,
    max_sequence_length: int | None = 1024,
    input_images: list[Image] | None = None,
    num_images_per_prompt: int = 1,
    height: int | None = None,
    width: int | None = None,
    max_pixels: int = 1024 * 1024,
    max_input_image_side_length: int = 1024,
    align_res: bool = True,
    num_inference_steps: int = 28,
    text_guidance_scale: float = 4.0,
    image_guidance_scale: float = 1.0,
    cfg_range: tuple[float, float] = (0.0, 1.0),
    attention_kwargs: dict[str, Any] | None = None,
    timesteps: list[int] = None,
    generator: Generator | list[Generator] | None = None,
    latents: FloatTensor | None = None,
    verbose: bool = False,
    step_func=None,
) -> DiffusionOutput

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

predict

predict(
    t,
    latents,
    prompt_embeds,
    freqs_cis,
    prompt_attention_mask,
    ref_image_hidden_states,
)

predict_noise

predict_noise(**kwargs)

Override CFGParallelMixin.predict_noise to use self.predict.

prepare_image

prepare_image(
    images: list[Image] | Image,
    batch_size: int,
    num_images_per_prompt: int,
    max_pixels: int,
    max_side_length: int,
    device: device,
    dtype: dtype,
) -> list[FloatTensor | None]

Prepare input images for processing by encoding them into the VAE latent space.

Parameters:

Name Type Description Default
images list[Image] | Image

Single image or list of images to process.

required
batch_size int

The number of images to generate per prompt.

required
num_images_per_prompt int

The number of images to generate for each prompt.

required
device device

The device to place the encoded latents on.

required
dtype dtype

The data type of the encoded latents.

required

Returns:

Type Description
list[FloatTensor | None]

List[Optional[torch.FloatTensor]]: List of encoded latent representations for each image.

prepare_latents

prepare_latents(
    batch_size: int,
    num_channels_latents: int,
    height: int,
    width: int,
    dtype: dtype,
    device: device,
    generator: Generator | None,
    latents: FloatTensor | None = None,
) -> FloatTensor

Prepare the initial latents for the diffusion process.

Parameters:

Name Type Description Default
batch_size int

The number of images to generate.

required
num_channels_latents int

The number of channels in the latent space.

required
height int

The height of the generated image.

required
width int

The width of the generated image.

required
dtype dtype

The data type of the latents.

required
device device

The device to place the latents on.

required
generator Generator | None

The random number generator to use.

required
latents FloatTensor | None

Optional pre-computed latents to use instead of random initialization.

None

Returns:

Type Description
FloatTensor

torch.FloatTensor: The prepared latents tensor.

processing

processing(
    latents,
    ref_latents,
    prompt_embeds,
    freqs_cis,
    negative_prompt_embeds,
    prompt_attention_mask,
    negative_prompt_attention_mask,
    num_inference_steps,
    timesteps,
    device,
    dtype,
    verbose,
    step_func=None,
)

get_omnigen2_post_process_func

get_omnigen2_post_process_func(
    od_config: OmniDiffusionConfig,
)

get_omnigen2_pre_process_func

get_omnigen2_pre_process_func(
    od_config: OmniDiffusionConfig,
)

Pre-processing function for OmniGen2Pipeline.

retrieve_timesteps

retrieve_timesteps(
    scheduler,
    num_inference_steps: int | None = None,
    device: str | device | None = None,
    timesteps: list[int] | None = None,
    **kwargs: Any,
)

Calls the scheduler's set_timesteps method and retrieves timesteps from the scheduler after the call. Handles custom timesteps. Any kwargs will be supplied to scheduler.set_timesteps.

Parameters:

Name Type Description Default
scheduler `SchedulerMixin`

The scheduler to get timesteps from.

required
num_inference_steps `int`

The number of diffusion steps used when generating samples with a pre-trained model. If used, timesteps must be None.

None
device `str` or `torch.device`, *optional*

The device to which the timesteps should be moved. If None, the timesteps are not moved.

None
timesteps `List[int]`, *optional*

Custom timesteps used to override the timestep spacing strategy of the scheduler. If timesteps is passed, num_inference_steps must be None.

None
**kwargs `Any`

Additional keyword arguments passed to scheduler.set_timesteps.

{}

Returns:

Name Type Description
timesteps `torch.Tensor`

The timestep schedule from the scheduler.

num_inference_steps `int`

The number of inference steps.
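
The contract described above can be sketched with a toy scheduler. `ToyScheduler` and `retrieve_timesteps_sketch` are hypothetical names, the evenly spaced schedule is made up, and raising `ValueError` when both arguments are supplied is an assumption about how the "must be None" rule is enforced.

```python
class ToyScheduler:
    """Hypothetical stand-in exposing set_timesteps and a timesteps attribute."""

    def set_timesteps(self, num_inference_steps=None, device=None,
                      timesteps=None, **kwargs):
        if timesteps is None:
            # Made-up evenly spaced schedule, for illustration only.
            timesteps = [i / num_inference_steps for i in range(num_inference_steps)]
        self.timesteps = list(timesteps)


def retrieve_timesteps_sketch(scheduler, num_inference_steps=None,
                              device=None, timesteps=None, **kwargs):
    # Assumed enforcement of the rule that only one of the two may be given.
    if timesteps is not None and num_inference_steps is not None:
        raise ValueError("Pass either `timesteps` or `num_inference_steps`, not both.")
    if timesteps is not None:
        # Custom timesteps override the scheduler's own spacing strategy.
        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
        num_inference_steps = len(scheduler.timesteps)
    else:
        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
    return scheduler.timesteps, num_inference_steps


default_ts, default_n = retrieve_timesteps_sketch(ToyScheduler(), num_inference_steps=4)
custom_ts, custom_n = retrieve_timesteps_sketch(ToyScheduler(), timesteps=[0.0, 0.5])
```

Returning the step count alongside the schedule lets the caller rely on a single value even when custom timesteps were supplied.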