vllm_omni.diffusion.models.omnigen2.pipeline_omnigen2 ¶

logger `module-attribute` ¶

logger = logging.getLogger(__name__)

FlowMatchEulerDiscreteScheduler ¶

Bases: SchedulerMixin, ConfigMixin

Euler scheduler for flow-matching diffusion models.

This model inherits from [SchedulerMixin] and [ConfigMixin]. Check the superclass documentation for the generic methods the library implements for all schedulers such as loading and saving.

Parameters:

Name	Type	Description	Default
`num_train_timesteps`	`int`, defaults to 1000	The number of diffusion steps to train the model.	`1000`
`dynamic_time_shift`	`bool`, defaults to `True`	Whether to use dynamic time shifting for the timestep schedule.	`True`

begin_index `property` ¶

begin_index

The index for the first timestep. It should be set from the pipeline with the `set_begin_index` method.

order `class-attribute` `instance-attribute` ¶

order = 1

step_index `property` ¶

step_index

The index counter for the current timestep. It increases by 1 after each scheduler step.

timesteps `instance-attribute` ¶

timesteps = timesteps

index_for_timestep ¶

index_for_timestep(timestep, schedule_timesteps=None)

set_begin_index ¶

set_begin_index(begin_index: int = 0)

Sets the begin index for the scheduler. This function should be called from the pipeline before inference.

Parameters:

Name	Type	Description	Default
`begin_index`	`int`	The begin index for the scheduler.	`0`
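The interplay between `set_begin_index`, `begin_index`, and `step_index` can be illustrated with a small stand-in class. This is a minimal sketch, not the scheduler's actual code: the class name `StepIndexTracker` is hypothetical, and falling back to index 0 when no begin index was set is a simplifying assumption (a real scheduler may instead look the index up from the current timestep, as `index_for_timestep` suggests).

```python
class StepIndexTracker:
    """Hypothetical stand-in for the scheduler's index bookkeeping."""

    def __init__(self, timesteps):
        self.timesteps = list(timesteps)
        self._begin_index = None  # set by the pipeline via set_begin_index()
        self._step_index = None   # initialised lazily on the first step

    @property
    def begin_index(self):
        return self._begin_index

    @property
    def step_index(self):
        return self._step_index

    def set_begin_index(self, begin_index: int = 0):
        self._begin_index = begin_index

    def step(self):
        # Assumption: start at begin_index if it was set, otherwise at 0.
        if self._step_index is None:
            self._step_index = self._begin_index if self._begin_index is not None else 0
        current = self.timesteps[self._step_index]
        self._step_index += 1  # the counter advances by 1 per step
        return current
```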

set_timesteps ¶

set_timesteps(
    num_inference_steps: int = None,
    device: str | device = None,
    timesteps: list[float] | None = None,
    num_tokens: int | None = None,
)

Sets the discrete timesteps used for the diffusion chain (to be run before inference).

Parameters:

Name	Type	Description	Default
`num_inference_steps`	`int`	The number of diffusion steps used when generating samples with a pre-trained model.	`None`
`device`	`str` or `torch.device`, optional	The device to which the timesteps should be moved. If `None`, the timesteps are not moved.	`None`
`timesteps`	`list[float]`, optional	Custom timesteps to use. If provided, `num_inference_steps` is ignored.	`None`
`num_tokens`	`int`, optional	Number of tokens, used for dynamic time shifting.	`None`
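A rough sketch of how a flow-matching schedule with dynamic time shifting might be built from `num_inference_steps` and `num_tokens`. The uniform grid on [0, 1), the shift formula `t / (m - m*t + t)`, and the token-dependent strength `m = sqrt(num_tokens / base_tokens)` with `base_tokens=1600` are all assumptions for illustration, not values confirmed by this page. With `m > 1` the shift pulls every timestep toward 0, so larger images (more tokens) spend more steps near that end of the schedule.

```python
import math


def sketch_timesteps(num_inference_steps, num_tokens=None,
                     dynamic_time_shift=True, base_tokens=1600):
    """Hypothetical flow-matching schedule with a token-dependent time shift."""
    # Uniform grid on [0, 1), excluding the endpoint 1.0.
    ts = [i / num_inference_steps for i in range(num_inference_steps)]
    if dynamic_time_shift and num_tokens is not None:
        # Assumed shift strength: grows with the square root of the token count.
        m = math.sqrt(num_tokens / base_tokens)
        ts = [t / (m - m * t + t) for t in ts]
    return ts
```

With `num_tokens=1600` the strength is 1 and the grid is unchanged; quadrupling the token count doubles `m`.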

step ¶

step(
    model_output: FloatTensor,
    timestep: float | FloatTensor,
    sample: FloatTensor,
    generator: Generator | None = None,
    return_dict: bool = True,
) -> FlowMatchEulerDiscreteSchedulerOutput | tuple

Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion process from the learned model outputs (most often the predicted noise).

Parameters:

Name	Type	Description	Default
`model_output`	`torch.FloatTensor`	The direct output from learned diffusion model.	required
`timestep`	`float` or `torch.FloatTensor`	The current discrete timestep in the diffusion chain.	required
`sample`	`torch.FloatTensor`	A current instance of a sample created by the diffusion process.	required
`generator`	`torch.Generator`, optional	A random number generator.	`None`
`return_dict`	`bool`	Whether to return a [`~FlowMatchEulerDiscreteSchedulerOutput`] or a tuple.	`True`

Returns:

Type	Description
`FlowMatchEulerDiscreteSchedulerOutput | tuple`	If `return_dict` is `True`, a [`~FlowMatchEulerDiscreteSchedulerOutput`] is returned; otherwise a tuple is returned whose first element is the sample tensor.
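Under the common flow-matching reading, where `model_output` is a predicted velocity rather than noise, one scheduler step is a single explicit Euler update `x_next = x + (t_next - t) * v`. The sketch below assumes that reading and uses plain Python lists in place of tensors; `StepOutput` is a hypothetical mirror of `FlowMatchEulerDiscreteSchedulerOutput`, and the direction in which `t` runs is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class StepOutput:
    """Hypothetical mirror of FlowMatchEulerDiscreteSchedulerOutput."""
    prev_sample: list


def euler_step(model_output, sample, t, t_next, return_dict=True):
    # Explicit Euler update of dx/dt = v(x, t), with model_output taken as v.
    dt = t_next - t
    prev_sample = [x + dt * v for x, v in zip(sample, model_output)]
    if not return_dict:
        return (prev_sample,)  # tuple whose first element is the sample
    return StepOutput(prev_sample=prev_sample)
```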

FlowMatchEulerDiscreteSchedulerOutput `dataclass` ¶

Bases: BaseOutput

Output class for the scheduler's step function output.

Parameters:

Name	Type	Description	Default
`prev_sample`	`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images	Computed sample `(x_{t-1})` of the previous timestep. `prev_sample` should be used as the next model input in the denoising loop.	required

prev_sample `instance-attribute` ¶

prev_sample: FloatTensor

OmniGen2ImageProcessor ¶

Bases: VaeImageProcessor

Image processor for OmniGen2 image resize and crop.

Parameters:

Name	Type	Description	Default
`do_resize`	`bool`, optional, defaults to `True`	Whether to downscale the image's (height, width) dimensions to multiples of `vae_scale_factor`. Can accept `height` and `width` arguments from the [`image_processor.VaeImageProcessor.preprocess`] method.	`True`
`vae_scale_factor`	`int`, optional, defaults to `16`	VAE scale factor. If `do_resize` is `True`, the image is automatically resized to multiples of this factor.	`16`
`resample`	`str`, optional, defaults to `lanczos`	Resampling filter to use when resizing the image.	`'lanczos'`
`max_pixels`	`int`, optional, defaults to `1048576`	Maximum number of pixels allowed in the image. Images exceeding this limit are downscaled proportionally.	`1024 * 1024`
`max_side_length`	`int`, optional, defaults to `1024`	Maximum length of the longer side of the image. Images exceeding this limit are downscaled proportionally.	`1024`
`do_normalize`	`bool`, optional, defaults to `True`	Whether to normalize the image to [-1,1].	`True`
`do_binarize`	`bool`, optional, defaults to `False`	Whether to binarize the image to 0/1.	`False`
`do_convert_grayscale`	`bool`, optional, defaults to `False`	Whether to convert the images to grayscale format.	`False`

max_pixels `instance-attribute` ¶

max_pixels = max_pixels

max_side_length `instance-attribute` ¶

max_side_length = max_side_length

get_new_height_width ¶

get_new_height_width(
    image: Image | ndarray | Tensor,
    height: int | None = None,
    width: int | None = None,
    max_pixels: int | None = None,
    max_side_length: int | None = None,
) -> tuple[int, int]

Returns the height and width of the image, downscaled to the next integer multiple of vae_scale_factor.

Parameters:

Name	Type	Description	Default
`image`	`Union[PIL.Image.Image, np.ndarray, torch.Tensor]`	The image input, which can be a PIL image, NumPy array, or PyTorch tensor. If it is a NumPy array, it should have shape `[batch, height, width]` or `[batch, height, width, channels]`. If it is a PyTorch tensor, it should have shape `[batch, channels, height, width]`.	required
`height`	`Optional[int]`, optional, defaults to `None`	The height of the preprocessed image. If `None`, the height of the `image` input will be used.	`None`
`width`	`Optional[int]`, optional, defaults to `None`	The width of the preprocessed image. If `None`, the width of the `image` input will be used.	`None`

Returns:

Type	Description
`tuple[int, int]`	`Tuple[int, int]`: A tuple containing the height and width, both rounded down to an integer multiple of `vae_scale_factor`.
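The size computation can be sketched as a single uniform shrink ratio, capped at 1.0 so the image is never upscaled, followed by flooring each side to a multiple of the scale factor. This is a minimal sketch assuming that combination of `max_pixels` and `max_side_length`; the function name is hypothetical and the defaults are taken from the `OmniGen2ImageProcessor` parameter table.

```python
import math


def new_height_width(height, width, vae_scale_factor=16,
                     max_pixels=1024 * 1024, max_side_length=1024):
    """Sketch: shrink to fit both limits, then floor to a multiple of the factor."""
    # One ratio for both sides keeps the aspect ratio; min with 1.0 prevents upscaling.
    ratio = min(1.0,
                math.sqrt(max_pixels / (height * width)),
                max_side_length / max(height, width))
    new_h = int(height * ratio) // vae_scale_factor * vae_scale_factor
    new_w = int(width * ratio) // vae_scale_factor * vae_scale_factor
    return new_h, new_w
```

For a 2048x1024 input, the side-length limit dominates (ratio 0.5), giving 1024x512; a 1000x700 input is already within both limits and is only floored to 992x688.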

preprocess ¶

preprocess(
    image: PipelineImageInput,
    height: int | None = None,
    width: int | None = None,
    max_pixels: int | None = None,
    max_side_length: int | None = None,
    resize_mode: str = "default",
    crops_coords: tuple[int, int, int, int] | None = None,
) -> Tensor

Preprocess the image input.

Parameters:

Name	Type	Description	Default
`image`	`PipelineImageInput`	The image input. Accepted formats are PIL images, NumPy arrays, and PyTorch tensors, as well as lists of these formats.	required
`height`	`int`, optional	The height of the preprocessed image. If `None`, `get_default_height_width()` is used to determine the default height.	`None`
`width`	`int`, optional	The width of the preprocessed image. If `None`, `get_default_height_width()` is used to determine the default width.	`None`
`resize_mode`	`str`, optional, defaults to `default`	The resize mode: one of `default`, `fill`, or `crop`. If `default`, the image is resized to fit within the specified width and height and may not maintain its original aspect ratio. If `fill`, the image is resized to fit within the specified width and height while maintaining the aspect ratio, then centered within the dimensions, with the empty area filled with data from the image. If `crop`, the image is resized to fit within the specified width and height while maintaining the aspect ratio, then centered within the dimensions, with the excess cropped. Note that the `fill` and `crop` modes are only supported for PIL image input.	`'default'`
`crops_coords`	`tuple[int, int, int, int]`, optional, defaults to `None`	The crop coordinates applied to each image in the batch. If `None`, the image is not cropped.	`None`

Returns:

Type	Description
`Tensor`	`torch.Tensor`: The preprocessed image.
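For the `crop` resize mode, the geometry reduces to two steps: scale the image so that it covers the target size while keeping its aspect ratio, then take a centered box of the target size. This sketch computes only the sizes and the box; the function name is hypothetical, the box follows the `(left, upper, right, lower)` convention that PIL's `Image.crop` accepts, and rounding details in the real processor may differ.

```python
def crop_mode_geometry(src_w, src_h, dst_w, dst_h):
    """Hypothetical helper: resized size and centered crop box for 'crop' mode."""
    # Scale so the resized image covers the target in both dimensions.
    scale = max(dst_w / src_w, dst_h / src_h)
    res_w, res_h = round(src_w * scale), round(src_h * scale)
    # Center a dst_w x dst_h window inside the resized image.
    left = (res_w - dst_w) // 2
    top = (res_h - dst_h) // 2
    return (res_w, res_h), (left, top, left + dst_w, top + dst_h)
```

A 2000x1000 landscape image cropped to 512x512 is first resized to 1024x512, and the middle 512 columns are kept.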

OmniGen2Pipeline ¶

Bases: CFGParallelMixin, Module, SupportsComponentDiscovery

Pipeline for text-to-image generation using OmniGen2.

This pipeline implements a text-to-image generation model that uses:

- Qwen2.5-VL for text encoding
- A custom transformer architecture for image generation
- VAE for image encoding/decoding
- FlowMatchEulerDiscreteScheduler for noise scheduling

Parameters:

Name	Type	Description	Default
`od_config`	`OmniDiffusionConfig`	The OmniDiffusion configuration.	required

cfg_range `property` ¶

cfg_range

default_sample_size `instance-attribute` ¶

default_sample_size = 128

device `instance-attribute` ¶

device = get_local_device()

image_guidance_scale `property` ¶

image_guidance_scale

image_processor `instance-attribute` ¶

image_processor = OmniGen2ImageProcessor(
    vae_scale_factor=self.vae_scale_factor * 2,
    do_resize=True,
)

mllm `instance-attribute` ¶

mllm = from_pretrained_with_prefetch(
    Qwen2_5_VLForConditionalGeneration.from_pretrained,
    model,
    subfolder="mllm",
    prefetch_list=omnigen2_subfolders,
    local_files_only=local_files_only,
).to(self.device)

num_timesteps `property` ¶

num_timesteps

od_config `instance-attribute` ¶

od_config = od_config

processor `instance-attribute` ¶

processor = from_pretrained_with_prefetch(
    Qwen2_5_VLProcessor.from_pretrained,
    model,
    subfolder="processor",
    prefetch_list=omnigen2_subfolders,
    local_files_only=local_files_only,
)

scheduler `instance-attribute` ¶

scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(
    model,
    subfolder="scheduler",
    local_files_only=local_files_only,
)

text_guidance_scale `property` ¶

text_guidance_scale

transformer `instance-attribute` ¶

transformer = OmniGen2Transformer2DModel(
    **transformer_kwargs,
    quant_config=od_config.quantization_config,
)

vae `instance-attribute` ¶

vae = from_pretrained_with_prefetch(
    AutoencoderKL.from_pretrained,
    model,
    subfolder="vae",
    prefetch_list=omnigen2_subfolders,
    local_files_only=local_files_only,
).to(self.device)

vae_scale_factor `instance-attribute` ¶

vae_scale_factor = (
    2 ** (len(self.vae.config.block_out_channels) - 1)
    if hasattr(self, "vae") and self.vae is not None
    else 8
)
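The `vae_scale_factor` expression above, and the doubling applied when `image_processor` is constructed, can be checked with a few lines. The four-entry `block_out_channels` list is a hypothetical VAE configuration; with it the VAE factor is 8 and the image processor factor is 16, which matches the `vae_scale_factor` default listed for `OmniGen2ImageProcessor`.

```python
def vae_scale_factor(block_out_channels=None):
    # 2 ** (number of VAE blocks - 1); 8 when no VAE is available.
    if block_out_channels is None:
        return 8
    return 2 ** (len(block_out_channels) - 1)


# Hypothetical four-block VAE configuration.
vae_factor = vae_scale_factor([128, 256, 512, 512])
processor_factor = vae_factor * 2  # doubled as in the image_processor attribute
```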

weights_sources `instance-attribute` ¶

weights_sources = [
    DiffusersPipelineLoader.ComponentSource(
        model_or_path=od_config.model,
        subfolder="transformer",
        revision=None,
        prefix="transformer.",
        fall_back_to_pt=True,
    )
]

combine_multi_branch_cfg_noise ¶

combine_multi_branch_cfg_noise(
    predictions, true_cfg_scale, cfg_normalize=False
)

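Overrides the mixin's noise combination: with three branches, it applies dual-scale guidance (separate text and image guidance scales); with two branches, it applies standard classifier-free guidance (CFG).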
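A sketch of what the two combination rules could look like, assuming the three branches are ordered (fully conditioned, image-only reference, unconditional) and that the dual-scale rule steps from unconditional to image-conditioned with `image_guidance_scale`, then from image-conditioned to fully conditioned with `text_guidance_scale`. The function name, argument names, and branch ordering are hypothetical; the documented method takes `true_cfg_scale` and `cfg_normalize` instead.

```python
def combine_guidance(predictions, text_guidance_scale, image_guidance_scale=1.0):
    """Hypothetical CFG combination on flat lists of floats."""
    if len(predictions) == 3:
        # Assumed order: (text+image conditioned, image-only reference, unconditional).
        cond, ref, uncond = predictions
        return [u + image_guidance_scale * (r - u) + text_guidance_scale * (c - r)
                for c, r, u in zip(cond, ref, uncond)]
    if len(predictions) == 2:
        # Standard CFG: (conditioned, unconditional).
        cond, uncond = predictions
        return [u + text_guidance_scale * (c - u) for c, u in zip(cond, uncond)]
    raise ValueError(f"expected 2 or 3 branches, got {len(predictions)}")
```

With both scales set to 1.0 the three-branch rule collapses to the fully conditioned prediction, which is a quick sanity check on the ordering.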

encode_prompt ¶

encode_prompt(
    prompt: str | list[str],
    do_classifier_free_guidance: bool = True,
    negative_prompt: str | list[str] | None = None,
    num_images_per_prompt: int = 1,
    device: device | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    prompt_attention_mask: Tensor | None = None,
    negative_prompt_attention_mask: Tensor | None = None,
    max_sequence_length: int = 256,
) -> tuple[Tensor, Tensor, Tensor, Tensor]

Encodes the prompt into text encoder hidden states.

Parameters:

Name	Type	Description	Default
`prompt`	`str` or `List[str]`	The prompt to be encoded.	required
`negative_prompt`	`str` or `List[str]`, optional	The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`). For Lumina-T2I, this should be "".	`None`
`do_classifier_free_guidance`	`bool`, optional, defaults to `True`	Whether to use classifier-free guidance.	`True`
`num_images_per_prompt`	`int`, optional, defaults to 1	The number of images that should be generated per prompt.	`1`
`device`	`torch.device`, optional	The torch device on which to place the resulting embeddings.	`None`
`prompt_embeds`	`torch.Tensor`, optional	Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.	`None`
`negative_prompt_embeds`	`torch.Tensor`, optional	Pre-generated negative text embeddings. For Lumina-T2I, it should be the embeddings of the "" string.	`None`
`max_sequence_length`	`int`, defaults to `256`	Maximum sequence length to use for the prompt.	`256`
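The effect of `num_images_per_prompt` on the returned embeddings can be sketched as consecutive repetition per prompt (on tensors this corresponds to `repeat_interleave(num_images_per_prompt, dim=0)`). The helper names below are hypothetical, and plain lists stand in for embedding tensors; the negative batch is only built when classifier-free guidance is enabled.

```python
def repeat_per_prompt(items, num_images_per_prompt):
    """Repeat each prompt's entry consecutively: [p0, p0, p1, p1, ...]."""
    return [item for item in items for _ in range(num_images_per_prompt)]


def expand_prompt_batch(prompt_embeds, negative_prompt_embeds,
                        num_images_per_prompt, do_classifier_free_guidance=True):
    # Hypothetical helper: positive and (optionally) negative batches of equal length.
    pos = repeat_per_prompt(prompt_embeds, num_images_per_prompt)
    if not do_classifier_free_guidance:
        return pos, None
    neg = repeat_per_prompt(negative_prompt_embeds, num_images_per_prompt)
    return pos, neg
```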

encode_vae ¶

encode_vae(img: FloatTensor) -> FloatTensor

Encode an image into the VAE latent space.

Parameters:

Name	Type	Description	Default
`img`	`FloatTensor`	The input image tensor to encode.	required

Returns:

Type	Description
`FloatTensor`	torch.FloatTensor: The encoded latent representation.
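diffusers' `AutoencoderKL` configuration exposes a `scaling_factor` and an optional `shift_factor`. Assuming `encode_vae` normalizes sampled latents with them (subtract the shift, then multiply by the scale), the forward transform and its inverse look like this on plain lists; the helper names are hypothetical.

```python
def normalize_latents(z, scaling_factor, shift_factor=None):
    # Encode side: subtract the optional shift, then multiply by the scale.
    if shift_factor is not None:
        z = [v - shift_factor for v in z]
    return [v * scaling_factor for v in z]


def denormalize_latents(z, scaling_factor, shift_factor=None):
    # Decode side: exact inverse, divide by the scale, then add the shift back.
    z = [v / scaling_factor for v in z]
    if shift_factor is not None:
        z = [v + shift_factor for v in z]
    return z
```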

forward ¶

forward(
    req: DiffusionRequestBatch,
    prompt: str | list[str] | None = None,
    negative_prompt: str | list[str] | None = None,
    prompt_embeds: FloatTensor | None = None,
    negative_prompt_embeds: FloatTensor | None = None,
    prompt_attention_mask: LongTensor | None = None,
    negative_prompt_attention_mask: LongTensor | None = None,
    max_sequence_length: int | None = 1024,
    input_images: list[Image] | None = None,
    num_images_per_prompt: int = 1,
    height: int | None = None,
    width: int | None = None,
    max_pixels: int = 1024 * 1024,
    max_input_image_side_length: int = 1024,
    align_res: bool = True,
    num_inference_steps: int = 28,
    text_guidance_scale: float = 4.0,
    image_guidance_scale: float = 1.0,
    cfg_range: tuple[float, float] = (0.0, 1.0),
    attention_kwargs: dict[str, Any] | None = None,
    timesteps: list[int] = None,
    generator: Generator | list[Generator] | None = None,
    latents: FloatTensor | None = None,
    verbose: bool = False,
    step_func=None,
) -> DiffusionOutput

load_weights ¶

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

predict ¶

predict(
    t,
    latents,
    prompt_embeds,
    freqs_cis,
    prompt_attention_mask,
    ref_image_hidden_states,
)

predict_noise ¶

predict_noise(**kwargs)

Override CFGParallelMixin.predict_noise to use self.predict.

prepare_image ¶

prepare_image(
    images: list[Image] | Image,
    batch_size: int,
    num_images_per_prompt: int,
    max_pixels: int,
    max_side_length: int,
    device: device,
    dtype: dtype,
) -> list[FloatTensor | None]

Prepare input images for processing by encoding them into the VAE latent space.

Parameters:

Name	Type	Description	Default
`images`	`list[Image] | Image`	A single image or a list of images to process.	required
`batch_size`	`int`	The number of prompts in the batch.	required
`num_images_per_prompt`	`int`	The number of images to generate for each prompt.	required
`device`	`device`	The device to place the encoded latents on.	required
`dtype`	`dtype`	The data type of the encoded latents.	required

Returns:

Type	Description
`list[FloatTensor | None]`	List[Optional[torch.FloatTensor]]: List of encoded latent representations for each image.

prepare_latents ¶

prepare_latents(
    batch_size: int,
    num_channels_latents: int,
    height: int,
    width: int,
    dtype: dtype,
    device: device,
    generator: Generator | None,
    latents: FloatTensor | None = None,
) -> FloatTensor

Prepare the initial latents for the diffusion process.

Parameters:

Name	Type	Description	Default
`batch_size`	`int`	The number of images to generate.	required
`num_channels_latents`	`int`	The number of channels in the latent space.	required
`height`	`int`	The height of the generated image.	required
`width`	`int`	The width of the generated image.	required
`dtype`	`dtype`	The data type of the latents.	required
`device`	`device`	The device to place the latents on.	required
`generator`	`Generator | None`	The random number generator to use.	required
`latents`	`FloatTensor | None`	Optional pre-computed latents to use instead of random initialization.	`None`

Returns:

Type	Description
`FloatTensor`	torch.FloatTensor: The prepared latents tensor.
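A minimal sketch of the two cases `prepare_latents` distinguishes: caller-supplied latents are passed through, otherwise seeded standard-normal noise is drawn with the latent shape derived from the pixel size. The VAE downsampling factor of 8, the 16 latent channels in the check, and the helper names are hypothetical; on tensors, `torch.randn(shape, generator=generator)` would play the role of the nested `gauss` calls.

```python
import random


def latent_shape(batch_size, num_channels_latents, height, width, vae_scale_factor=8):
    # Spatial size is the pixel size divided by the VAE downsampling factor.
    return (batch_size, num_channels_latents,
            height // vae_scale_factor, width // vae_scale_factor)


def prepare_latents_sketch(shape, seed=None, latents=None):
    """Return caller-supplied latents unchanged, else standard-normal noise."""
    if latents is not None:
        return latents
    rng = random.Random(seed)  # seeded generator for reproducibility
    b, c, h, w = shape
    return [[[[rng.gauss(0.0, 1.0) for _ in range(w)] for _ in range(h)]
             for _ in range(c)] for _ in range(b)]
```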

processing ¶

processing(
    latents,
    ref_latents,
    prompt_embeds,
    freqs_cis,
    negative_prompt_embeds,
    prompt_attention_mask,
    negative_prompt_attention_mask,
    num_inference_steps,
    timesteps,
    device,
    dtype,
    verbose,
    step_func=None,
)
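The `cfg_range` argument of `forward` restricts guidance to part of the denoising schedule. A sketch of the per-step decision, assuming the range is compared against the fraction of steps completed (`step / num_steps`) and that "no guidance" is expressed as a scale of 1.0, under which the standard CFG update `u + s * (c - u)` returns the conditional prediction `c` unchanged. The helper name is hypothetical.

```python
def guidance_scales_for_step(step, num_steps, text_guidance_scale,
                             image_guidance_scale, cfg_range=(0.0, 1.0)):
    """Hypothetical per-step scale selection driven by cfg_range."""
    progress = step / num_steps  # fraction of the schedule already completed
    start, end = cfg_range
    if start <= progress <= end:
        return text_guidance_scale, image_guidance_scale
    # Outside the range, a scale of 1.0 makes CFG return the conditional prediction.
    return 1.0, 1.0
```

With the `forward` defaults (`num_inference_steps=28`, `cfg_range=(0.0, 1.0)`), guidance stays active on every step.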

get_omnigen2_post_process_func ¶

get_omnigen2_post_process_func(
    od_config: OmniDiffusionConfig,
)

get_omnigen2_pre_process_func ¶

get_omnigen2_pre_process_func(
    od_config: OmniDiffusionConfig,
)

Pre-processing function for OmniGen2Pipeline.

retrieve_timesteps ¶

retrieve_timesteps(
    scheduler,
    num_inference_steps: int | None = None,
    device: str | device | None = None,
    timesteps: list[int] | None = None,
    **kwargs: Any,
)

Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.

Parameters:

Name	Type	Description	Default
`scheduler`	`SchedulerMixin`	The scheduler to get timesteps from.	required
`num_inference_steps`	`int`	The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps` must be `None`.	`None`
`device`	`str` or `torch.device`, optional	The device to which the timesteps should be moved. If `None`, the timesteps are not moved.	`None`
`timesteps`	`List[int]`, optional	Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed, `num_inference_steps` must be `None`.	`None`
`**kwargs`	`Any`	Additional keyword arguments passed to `scheduler.set_timesteps`.	`{}`

Returns:

Name	Type	Description
`timesteps`	`torch.Tensor`	The timestep schedule from the scheduler.
`num_inference_steps`	`int`	The number of inference steps.
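A minimal sketch of the control flow described above, run against a hypothetical `ToyScheduler`. Raising `ValueError` when both arguments are passed is a choice made here to enforce the documented "must be `None`" contract; the actual function may handle that case differently. Extra keyword arguments such as `num_tokens` are forwarded unchanged to `set_timesteps`.

```python
class ToyScheduler:
    """Hypothetical scheduler exposing set_timesteps() and .timesteps."""

    def set_timesteps(self, num_inference_steps=None, device=None,
                      timesteps=None, **kwargs):
        if timesteps is None:
            timesteps = [i / num_inference_steps for i in range(num_inference_steps)]
        self.timesteps = list(timesteps)


def retrieve_timesteps_sketch(scheduler, num_inference_steps=None, device=None,
                              timesteps=None, **kwargs):
    # Enforce the documented contract: only one of the two may be given.
    if timesteps is not None and num_inference_steps is not None:
        raise ValueError("Pass either `timesteps` or `num_inference_steps`, not both.")
    if timesteps is not None:
        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
        timesteps = scheduler.timesteps
        num_inference_steps = len(timesteps)  # derived from the custom schedule
    else:
        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
        timesteps = scheduler.timesteps
    return timesteps, num_inference_steps
```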
