vllm_omni.diffusion.models.omnigen2.pipeline_omnigen2 ¶
FlowMatchEulerDiscreteScheduler ¶
Bases: SchedulerMixin, ConfigMixin
Euler scheduler.
This model inherits from `SchedulerMixin` and `ConfigMixin`. Check the superclass documentation for the generic methods the library implements for all schedulers, such as loading and saving.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
num_train_timesteps | `int`, defaults to 1000 | The number of diffusion steps to train the model. | 1000 |
dynamic_time_shift | `bool`, defaults to `True` | Whether to use dynamic time shifting for the timestep schedule. | True |
begin_index property ¶
The index of the first timestep. It should be set from the pipeline with the `set_begin_index` method.
step_index property ¶
The index counter for the current timestep. It increases by 1 after each scheduler step.
set_begin_index ¶
set_begin_index(begin_index: int = 0)
Sets the begin index for the scheduler. This function should be called from the pipeline before inference.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
begin_index | `int` | The begin index for the scheduler. | 0 |
set_timesteps ¶
set_timesteps(
num_inference_steps: int = None,
device: str | device = None,
timesteps: list[float] | None = None,
num_tokens: int | None = None,
)
Sets the discrete timesteps used for the diffusion chain (to be run before inference).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
num_inference_steps | `int` | The number of diffusion steps used when generating samples with a pre-trained model. | None |
device | `str` or `torch.device`, *optional* | The device to which the timesteps should be moved. If | None |
timesteps | `list[float]`, *optional* | Custom timesteps to use. If provided, | None |
num_tokens | `int`, *optional* | Number of tokens, used for dynamic time shifting. | None |
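To make the role of `num_tokens` concrete, here is a minimal sketch of how a token-dependent time shift could reshape an evenly spaced schedule. The shift formula, the constant `40.0`, the 0-to-1 direction of the schedule, and the function name `make_timesteps` are all assumptions for illustration; this page does not state the formula the scheduler actually uses.

```python
import math


def make_timesteps(num_inference_steps, num_tokens=None, dynamic_time_shift=True):
    """Return an increasing list of timesteps in [0, 1) (hypothetical helper)."""
    # Evenly spaced points from 0 (inclusive) to 1 (exclusive).
    ts = [i / num_inference_steps for i in range(num_inference_steps)]
    if dynamic_time_shift and num_tokens is not None:
        # ASSUMPTION: the shift strength grows with the square root of the
        # token count; the real scheduler may use a different rule.
        m = math.sqrt(num_tokens) / 40.0
        ts = [t / (m - m * t + t) for t in ts]
    return ts


uniform = make_timesteps(4)
shifted = make_timesteps(4, num_tokens=6400)
```

With this assumed formula, a larger `num_tokens` pulls intermediate timesteps toward 0, so more of the step budget is spent near the start of the schedule.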
step ¶
step(
model_output: FloatTensor,
timestep: float | FloatTensor,
sample: FloatTensor,
generator: Generator | None = None,
return_dict: bool = True,
) -> FlowMatchEulerDiscreteSchedulerOutput | tuple
Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion process from the learned model outputs (most often the predicted noise).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model_output | `torch.FloatTensor` | The direct output from learned diffusion model. | required |
timestep | `float` | The current discrete timestep in the diffusion chain. | required |
sample | `torch.FloatTensor` | A current instance of a sample created by the diffusion process. | required |
generator | `torch.Generator`, *optional* | A random number generator. | None |
return_dict | `bool` | Whether or not to return a [ | True |
Returns:
| Type | Description |
|---|---|
FlowMatchEulerDiscreteSchedulerOutput | tuple | [ |
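The update that a flow-matching Euler scheduler performs in `step` can be sketched in a few lines. This is an illustrative sketch, assuming the model output is a velocity and that each step moves the sample by `(t_next - t) * model_output`; the helper name `euler_step` and the toy constant velocity are hypothetical, and the real method operates on tensors and tracks `step_index` internally.

```python
def euler_step(sample, model_output, t, t_next):
    """One explicit Euler step: x_next = x + (t_next - t) * v."""
    dt = t_next - t
    return [x + dt * v for x, v in zip(sample, model_output)]


# Toy setup: a constant velocity field that carries `start` to `target`
# as t goes from 0 to 1 (stands in for the learned model).
start = [1.0, -2.0]
target = [3.0, 0.0]
velocity = [b - a for a, b in zip(start, target)]

timesteps = [0.0, 0.25, 0.5, 0.75, 1.0]
sample = list(start)
for t, t_next in zip(timesteps[:-1], timesteps[1:]):
    sample = euler_step(sample, velocity, t, t_next)
```

Because the toy velocity is constant, the Euler integration is exact here and `sample` ends at `target`; with a learned model the result depends on the number of steps.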
FlowMatchEulerDiscreteSchedulerOutput dataclass ¶
Bases: BaseOutput
Output class for the scheduler's step function output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
prev_sample | `torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images | Computed sample | required |
OmniGen2ImageProcessor ¶
Bases: VaeImageProcessor
Image processor for OmniGen2 image resize and crop.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
do_resize | `bool`, *optional*, defaults to `True` | Whether to downscale the image's (height, width) dimensions to multiples of | True |
vae_scale_factor | `int`, *optional*, defaults to `16` | VAE scale factor. If | 16 |
resample | `str`, *optional*, defaults to `lanczos` | Resampling filter to use when resizing the image. | 'lanczos' |
max_pixels | `int`, *optional*, defaults to `1048576` | Maximum number of pixels allowed in the image. Images exceeding this limit are downscaled proportionally. | 1024 * 1024 |
max_side_length | `int`, *optional*, defaults to `1024` | Maximum length of the longer side of the image. Images exceeding this limit are downscaled proportionally. | 1024 |
do_normalize | `bool`, *optional*, defaults to `True` | Whether to normalize the image to [-1,1]. | True |
do_binarize | `bool`, *optional*, defaults to `False` | Whether to binarize the image to 0/1. | False |
do_convert_grayscale | `bool`, *optional*, defaults to `False` | Whether to convert the images to grayscale format. | False |
get_new_height_width ¶
get_new_height_width(
image: Image | ndarray | Tensor,
height: int | None = None,
width: int | None = None,
max_pixels: int | None = None,
max_side_length: int | None = None,
) -> tuple[int, int]
Returns the height and width of the image, downscaled to the next integer multiple of vae_scale_factor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
image | `Union[PIL.Image.Image, np.ndarray, torch.Tensor]` | The image input, which can be a PIL image, NumPy array, or PyTorch tensor. If it is a NumPy array, it should have shape | required |
height | `Optional[int]`, *optional*, defaults to `None` | The height of the preprocessed image. If | None |
width | `Optional[int]`, *optional*, defaults to `None` | The width of the preprocessed image. If | None |
Returns:
| Type | Description |
|---|---|
tuple[int, int] | |
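The sizing rule described above (cap the pixel count, cap the longer side, then round down to a multiple of `vae_scale_factor`) can be sketched as follows. The function name `fit_size` and the exact order of operations are assumptions; the actual method may round or clamp differently.

```python
import math


def fit_size(height, width, max_pixels=1024 * 1024, max_side_length=1024, multiple=16):
    """Hypothetical sketch: shrink (height, width) to satisfy both limits,
    then round each side down to a multiple of `multiple`."""
    ratio = 1.0
    # Shrink proportionally if the image has too many pixels.
    ratio = min(ratio, math.sqrt(max_pixels / (height * width)))
    # Shrink proportionally if the longer side is too long.
    ratio = min(ratio, max_side_length / max(height, width))
    new_height = int(height * ratio) // multiple * multiple
    new_width = int(width * ratio) // multiple * multiple
    return new_height, new_width


large = fit_size(2048, 2048)
small = fit_size(500, 300)
```

In this sketch an image already within both limits is only rounded down to the nearest multiple; it is never upscaled.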
preprocess ¶
preprocess(
image: PipelineImageInput,
height: int | None = None,
width: int | None = None,
max_pixels: int | None = None,
max_side_length: int | None = None,
resize_mode: str = "default",
crops_coords: tuple[int, int, int, int] | None = None,
) -> Tensor
Preprocess the image input.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
image | `PipelineImageInput` | The image input. Accepted formats are PIL images, NumPy arrays, and PyTorch tensors; a list of any supported format is also accepted. | required |
height | `int`, *optional* | The height of the preprocessed image. If | None |
width | `int`, *optional* | The width of the preprocessed image. If | None |
resize_mode | `str`, *optional*, defaults to `default` | The resize mode, can be one of | 'default' |
crops_coords | `List[Tuple[int, int, int, int]]`, *optional*, defaults to `None` | The crop coordinates for each image in the batch. If | None |
Returns:
| Type | Description |
|---|---|
Tensor | |
OmniGen2Pipeline ¶
Bases: CFGParallelMixin, Module, SupportsComponentDiscovery
Pipeline for text-to-image generation using OmniGen2.
This pipeline implements a text-to-image generation model that uses:
- Qwen2.5-VL for text encoding
- A custom transformer architecture for image generation
- VAE for image encoding/decoding
- FlowMatchEulerDiscreteScheduler for noise scheduling
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
od_config | OmniDiffusionConfig | The OmniDiffusion configuration. | required |
image_processor instance-attribute ¶
image_processor = OmniGen2ImageProcessor(
vae_scale_factor=vae_scale_factor * 2, do_resize=True
)
processor instance-attribute ¶
processor = from_pretrained_with_prefetch(
from_pretrained,
model,
subfolder="processor",
prefetch_list=omnigen2_subfolders,
local_files_only=local_files_only,
)
scheduler instance-attribute ¶
transformer instance-attribute ¶
transformer = OmniGen2Transformer2DModel(
**transformer_kwargs, quant_config=quantization_config
)
vae_scale_factor instance-attribute ¶
vae_scale_factor = (
2 ** (len(block_out_channels) - 1)
if hasattr(self, "vae") and vae is not None
else 8
)
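As a quick worked example of the expression above: each VAE block after the first halves the spatial resolution, so the scale factor is a power of two. The `block_out_channels` value below is a hypothetical four-block configuration, not one read from an OmniGen2 checkpoint.

```python
# Hypothetical VAE config with four blocks (three downsampling stages).
block_out_channels = [128, 256, 512, 512]

# Same expression as the attribute: 2 ** (number of blocks - 1).
vae_scale_factor = 2 ** (len(block_out_channels) - 1)

# The image processor is constructed with twice this factor, which lines up
# with its documented default of 16.
processor_scale_factor = vae_scale_factor * 2
```

Under this assumed configuration the VAE factor is 8 and the image processor rounds image sizes to multiples of 16.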
weights_sources instance-attribute ¶
weights_sources = [
ComponentSource(
model_or_path=model,
subfolder="transformer",
revision=None,
prefix="transformer.",
fall_back_to_pt=True,
)
]
combine_multi_branch_cfg_noise ¶
Override: 3-branch dual scale or 2-branch standard CFG.
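The two modes named above can be sketched as below. This is an illustration under assumptions: the function name `combine_cfg`, its argument order, and the specific dual-scale formula are not taken from this page, and the real method works on tensors gathered from the parallel CFG branches.

```python
def combine_cfg(cond, uncond, text_scale, ref=None, image_scale=1.0):
    """Hypothetical sketch of 2-branch and 3-branch guidance.

    cond:   prediction with text (and reference images, if any)
    ref:    prediction with reference images only (3-branch mode)
    uncond: prediction with neither
    """
    if ref is None:
        # Standard classifier-free guidance with a single scale.
        return [u + text_scale * (c - u) for c, u in zip(cond, uncond)]
    # ASSUMED dual-scale form: image guidance moves from uncond toward ref,
    # text guidance moves from ref toward cond.
    return [
        u + image_scale * (r - u) + text_scale * (c - r)
        for c, r, u in zip(cond, ref, uncond)
    ]


two_branch = combine_cfg([1.0], [0.0], text_scale=4.0)
three_branch = combine_cfg([1.0], [0.0], text_scale=4.0, ref=[0.5], image_scale=2.0)
```

With both scales set to 1.0, either form reduces to the conditional prediction, which is a convenient sanity check.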
encode_prompt ¶
encode_prompt(
prompt: str | list[str],
do_classifier_free_guidance: bool = True,
negative_prompt: str | list[str] | None = None,
num_images_per_prompt: int = 1,
device: device | None = None,
prompt_embeds: Tensor | None = None,
negative_prompt_embeds: Tensor | None = None,
prompt_attention_mask: Tensor | None = None,
negative_prompt_attention_mask: Tensor | None = None,
max_sequence_length: int = 256,
) -> tuple[Tensor, Tensor, Tensor, Tensor]
Encodes the prompt into text encoder hidden states.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
prompt | `str` or `List[str]`, *optional* | The prompt to be encoded. | required |
negative_prompt | `str` or `List[str]`, *optional* | The prompt not to guide the image generation. If not defined, one has to pass | None |
do_classifier_free_guidance | `bool`, *optional*, defaults to `True` | Whether to use classifier-free guidance. | True |
num_images_per_prompt | `int`, *optional*, defaults to 1 | The number of images to generate per prompt. | 1 |
device | device | None | ( | None |
prompt_embeds | `torch.Tensor`, *optional* | Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from | None |
negative_prompt_embeds | `torch.Tensor`, *optional* | Pre-generated negative text embeddings. For Lumina-T2I, it should be the embeddings of the empty string (""). | None |
max_sequence_length | `int`, defaults to `256` | Maximum sequence length to use for the prompt. | 256 |
encode_vae ¶
Encode an image into the VAE latent space.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
img | FloatTensor | The input image tensor to encode. | required |
Returns:
| Type | Description |
|---|---|
FloatTensor | torch.FloatTensor: The encoded latent representation. |
forward ¶
forward(
req: OmniDiffusionRequest,
prompt: str | list[str] | None = None,
negative_prompt: str | list[str] | None = None,
prompt_embeds: FloatTensor | None = None,
negative_prompt_embeds: FloatTensor | None = None,
prompt_attention_mask: LongTensor | None = None,
negative_prompt_attention_mask: LongTensor | None = None,
max_sequence_length: int | None = 1024,
input_images: list[Image] | None = None,
num_images_per_prompt: int = 1,
height: int | None = None,
width: int | None = None,
max_pixels: int = 1024 * 1024,
max_input_image_side_length: int = 1024,
align_res: bool = True,
num_inference_steps: int = 28,
text_guidance_scale: float = 4.0,
image_guidance_scale: float = 1.0,
cfg_range: tuple[float, float] = (0.0, 1.0),
attention_kwargs: dict[str, Any] | None = None,
timesteps: list[int] = None,
generator: Generator | list[Generator] | None = None,
latents: FloatTensor | None = None,
verbose: bool = False,
step_func=None,
) -> DiffusionOutput
predict ¶
predict_noise ¶
Override CFGParallelMixin.predict_noise to use self.predict.
prepare_image ¶
prepare_image(
images: list[Image] | Image,
batch_size: int,
num_images_per_prompt: int,
max_pixels: int,
max_side_length: int,
device: device,
dtype: dtype,
) -> list[FloatTensor | None]
Prepare input images for processing by encoding them into the VAE latent space.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
images | list[Image] | Image | Single image or list of images to process. | required |
batch_size | int | The number of images to generate per prompt. | required |
num_images_per_prompt | int | The number of images to generate for each prompt. | required |
device | device | The device to place the encoded latents on. | required |
dtype | dtype | The data type of the encoded latents. | required |
Returns:
| Type | Description |
|---|---|
list[FloatTensor | None] | List[Optional[torch.FloatTensor]]: List of encoded latent representations for each image. |
prepare_latents ¶
prepare_latents(
batch_size: int,
num_channels_latents: int,
height: int,
width: int,
dtype: dtype,
device: device,
generator: Generator | None,
latents: FloatTensor | None = None,
) -> FloatTensor
Prepare the initial latents for the diffusion process.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
batch_size | int | The number of images to generate. | required |
num_channels_latents | int | The number of channels in the latent space. | required |
height | int | The height of the generated image. | required |
width | int | The width of the generated image. | required |
dtype | dtype | The data type of the latents. | required |
device | device | The device to place the latents on. | required |
generator | Generator | None | The random number generator to use. | required |
latents | FloatTensor | None | Optional pre-computed latents to use instead of random initialization. | None |
Returns:
| Type | Description |
|---|---|
FloatTensor | torch.FloatTensor: The prepared latents tensor. |
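The relationship between the requested image size and the latent tensor can be sketched as a shape computation. This assumes the latent grid is the pixel size divided by the VAE scale factor; the helper name `latent_shape` and the default factor of 8 are hypothetical, and the real method also samples noise with `generator` or reuses `latents` when they are provided.

```python
def latent_shape(batch_size, num_channels_latents, height, width, vae_scale_factor=8):
    """Hypothetical sketch: shape of the initial latents for a given image size."""
    return (
        batch_size,
        num_channels_latents,
        height // vae_scale_factor,  # latent rows
        width // vae_scale_factor,   # latent columns
    )


shape = latent_shape(batch_size=2, num_channels_latents=16, height=1024, width=768)
```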
processing ¶
processing(
latents,
ref_latents,
prompt_embeds,
freqs_cis,
negative_prompt_embeds,
prompt_attention_mask,
negative_prompt_attention_mask,
num_inference_steps,
timesteps,
device,
dtype,
verbose,
step_func=None,
)
get_omnigen2_pre_process_func ¶
get_omnigen2_pre_process_func(
od_config: OmniDiffusionConfig,
)
Pre-processing function for OmniGen2Pipeline.
retrieve_timesteps ¶
retrieve_timesteps(
scheduler,
num_inference_steps: int | None = None,
device: str | device | None = None,
timesteps: list[int] | None = None,
**kwargs: Any,
)
Calls the scheduler's set_timesteps method and retrieves timesteps from the scheduler after the call. Handles custom timesteps. Any kwargs will be supplied to scheduler.set_timesteps.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
scheduler | `SchedulerMixin` | The scheduler to get timesteps from. | required |
num_inference_steps | `int` | The number of diffusion steps used when generating samples with a pre-trained model. If used, | None |
device | `str` or `torch.device`, *optional* | The device to which the timesteps should be moved. If | None |
timesteps | `List[int]`, *optional* | Custom timesteps used to override the timestep spacing strategy of the scheduler. If | None |
**kwargs | `Any` | Additional keyword arguments passed to | {} |
Returns:
| Name | Type | Description |
|---|---|---|
timesteps | `torch.Tensor` | The timestep schedule from the scheduler. |
num_inference_steps | `int` | The number of inference steps. |
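The control flow described above (use custom timesteps when given, otherwise a step count, then read the schedule back from the scheduler) can be sketched with a stand-in scheduler. `ToyScheduler` and `retrieve_timesteps_sketch` are hypothetical names for illustration; the real function may perform extra validation that is not shown here.

```python
class ToyScheduler:
    """Stand-in scheduler exposing only what the sketch needs."""

    def __init__(self):
        self.timesteps = []

    def set_timesteps(self, num_inference_steps=None, device=None, timesteps=None, **kwargs):
        if timesteps is not None:
            self.timesteps = list(timesteps)
        else:
            self.timesteps = [i / num_inference_steps for i in range(num_inference_steps)]


def retrieve_timesteps_sketch(scheduler, num_inference_steps=None, device=None,
                              timesteps=None, **kwargs):
    if timesteps is not None:
        # Custom timesteps override the scheduler's own spacing.
        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
        num_inference_steps = len(scheduler.timesteps)
    else:
        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
    return scheduler.timesteps, num_inference_steps


default_result = retrieve_timesteps_sketch(ToyScheduler(), num_inference_steps=4)
custom_result = retrieve_timesteps_sketch(ToyScheduler(), timesteps=[0.0, 0.5])
```

Extra keyword arguments such as `num_tokens` would simply be forwarded to `set_timesteps` in this sketch.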