vllm_omni.diffusion.models.sd3

Stable Diffusion 3 model components.

Modules:

Name Description
pipeline_sd3
sd3_transformer

SD3Transformer2DModel

Bases: Module

The Transformer model introduced in Stable Diffusion 3.

attention_head_dim instance-attribute

attention_head_dim = model_config.attention_head_dim

caption_projection_dim instance-attribute

caption_projection_dim = model_config.caption_projection_dim

context_embedder instance-attribute

context_embedder = ReplicatedLinear(
    self.joint_attention_dim, self.caption_projection_dim
)

dual_attention_layers instance-attribute

dual_attention_layers = (
    model_config.dual_attention_layers
    if hasattr(model_config, "dual_attention_layers")
    else ()
)
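The `hasattr` fallback above behaves like a `getattr` call with a default value. A minimal sketch, assuming `types.SimpleNamespace` as a hypothetical stand-in for `model_config` (the real config class is not shown on this page) and illustrative layer indices:

```python
from types import SimpleNamespace

# Hypothetical stand-ins for model_config; the field values are illustrative only.
config_with_dual = SimpleNamespace(dual_attention_layers=(0, 1, 2))
config_without_dual = SimpleNamespace()


def resolve_dual_attention_layers(model_config):
    """Return the configured layer indices, or an empty tuple if the field is absent."""
    return getattr(model_config, "dual_attention_layers", ())


layers_present = resolve_dual_attention_layers(config_with_dual)
layers_absent = resolve_dual_attention_layers(config_without_dual)
print(layers_present, layers_absent)
```

With an empty tuple as the default, no block is built with dual attention when the config omits the field.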

in_channels instance-attribute

in_channels = model_config.in_channels

inner_dim instance-attribute

inner_dim = (
    model_config.num_attention_heads
    * model_config.attention_head_dim
)

joint_attention_dim instance-attribute

joint_attention_dim = model_config.joint_attention_dim

norm_out instance-attribute

norm_out = AdaLayerNormContinuous(
    self.inner_dim,
    self.inner_dim,
    elementwise_affine=False,
    eps=1e-06,
)

num_attention_heads instance-attribute

num_attention_heads = model_config.num_attention_heads

num_layers instance-attribute

num_layers = model_config.num_layers

out_channels instance-attribute

out_channels = model_config.out_channels

parallel_config instance-attribute

parallel_config = od_config.parallel_config

patch_size instance-attribute

patch_size = model_config.patch_size

pooled_projection_dim instance-attribute

pooled_projection_dim = model_config.pooled_projection_dim

pos_embed instance-attribute

pos_embed = PatchEmbed(
    height=self.sample_size,
    width=self.sample_size,
    patch_size=self.patch_size,
    in_channels=self.in_channels,
    embed_dim=self.inner_dim,
    pos_embed_max_size=self.pos_embed_max_size,
)

pos_embed_max_size instance-attribute

pos_embed_max_size = model_config.pos_embed_max_size

proj_out instance-attribute

proj_out = ReplicatedLinear(
    self.inner_dim,
    self.patch_size * self.patch_size * self.out_channels,
    bias=True,
)
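The sizes of `inner_dim` and of the `proj_out` output follow directly from the config fields. A worked example with illustrative values (the numbers below are assumptions, not values read from any particular checkpoint):

```python
# Illustrative values only; the real ones are read from model_config.
num_attention_heads = 24
attention_head_dim = 64
patch_size = 2
out_channels = 16

# Width of every token inside the transformer.
inner_dim = num_attention_heads * attention_head_dim

# Number of output features of proj_out, as computed in the attribute above.
patch_dim = patch_size * patch_size * out_channels

print(inner_dim, patch_dim)
```

Under these assumed values, `proj_out` would map each token from 1536 features to 64 features, which presumably corresponds to one flattened `patch_size` x `patch_size` patch with `out_channels` channels.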

qk_norm instance-attribute

qk_norm = (
    model_config.qk_norm
    if hasattr(model_config, "qk_norm")
    else ""
)

sample_size instance-attribute

sample_size = model_config.sample_size

time_text_embed instance-attribute

time_text_embed = CombinedTimestepTextProjEmbeddings(
    embedding_dim=self.inner_dim,
    pooled_projection_dim=self.pooled_projection_dim,
)

transformer_blocks instance-attribute

transformer_blocks = nn.ModuleList(
    [
        (
            SD3TransformerBlock(
                dim=self.inner_dim,
                num_attention_heads=self.num_attention_heads,
                attention_head_dim=self.attention_head_dim,
                context_pre_only=i == self.num_layers - 1,
                qk_norm=self.qk_norm,
                use_dual_attention=True
                if i in self.dual_attention_layers
                else False,
            )
        )
        for i in (range(self.num_layers))
    ]
)
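The two per-layer flags in the list comprehension above depend only on the layer index. A minimal sketch that reproduces just that logic, with a hypothetical helper `block_flags` and illustrative arguments (no `SD3TransformerBlock` is constructed here):

```python
def block_flags(num_layers, dual_attention_layers=()):
    """Mirror the per-layer keyword arguments computed in the list comprehension."""
    return [
        {
            # Only the final block is built with context_pre_only=True.
            "context_pre_only": i == num_layers - 1,
            # Dual attention is enabled only for the listed layer indices.
            "use_dual_attention": i in dual_attention_layers,
        }
        for i in range(num_layers)
    ]


flags = block_flags(num_layers=4, dual_attention_layers=(0, 1))
for index, layer_flags in enumerate(flags):
    print(index, layer_flags)
```

Note that `i in self.dual_attention_layers` already evaluates to a `bool`, so the sketch omits the `True if ... else False` wrapper without changing the result.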

forward

forward(
    hidden_states: Tensor,
    encoder_hidden_states: Tensor,
    pooled_projections: Tensor,
    timestep: LongTensor,
    return_dict: bool = True,
) -> Tensor | Transformer2DModelOutput

The [SD3Transformer2DModel] forward method.

Parameters:

Name Type Description Default
hidden_states `torch.Tensor` of shape `(batch_size, image_sequence_length, in_channels)`

Input hidden_states.

required
encoder_hidden_states `torch.Tensor` of shape `(batch_size, text_sequence_length, joint_attention_dim)`

Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.

required
pooled_projections `torch.Tensor` of shape `(batch_size, projection_dim)`

Embeddings projected from the embeddings of input conditions.

required
timestep `torch.LongTensor`

Indicates the current denoising step.

required
return_dict `bool`, *optional*, defaults to `True`

Whether to return a [~models.transformer_2d.Transformer2DModelOutput] instead of a plain tuple.

True

Returns:

Type Description
Tensor | Transformer2DModelOutput

If return_dict is True, a [~models.transformer_2d.Transformer2DModelOutput] is returned; otherwise, a tuple whose first element is the sample tensor.

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

StableDiffusion3Pipeline

Bases: Module, CFGParallelMixin, DiffusionPipelineProfilerMixin, SupportsComponentDiscovery

current_timestep property

current_timestep

default_sample_size instance-attribute

default_sample_size = 128

device instance-attribute

device = get_local_device()

guidance_scale property

guidance_scale

image_processor instance-attribute

image_processor = VaeImageProcessor(
    vae_scale_factor=self.vae_scale_factor
)

interrupt property

interrupt

num_timesteps property

num_timesteps

od_config instance-attribute

od_config = od_config

output_type instance-attribute

output_type = self.od_config.output_type

patch_size instance-attribute

patch_size = 2

scheduler instance-attribute

scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(
    model,
    subfolder="scheduler",
    local_files_only=local_files_only,
)

text_encoder instance-attribute

text_encoder = from_pretrained_with_prefetch(
    CLIPTextModelWithProjection.from_pretrained,
    model,
    subfolder="text_encoder",
    prefetch_list=sd3_subfolders,
    local_files_only=local_files_only,
    torch_dtype=dtype,
)

text_encoder_2 instance-attribute

text_encoder_2 = from_pretrained_with_prefetch(
    CLIPTextModelWithProjection.from_pretrained,
    model,
    subfolder="text_encoder_2",
    prefetch_list=sd3_subfolders,
    local_files_only=local_files_only,
    torch_dtype=dtype,
)

text_encoder_3 instance-attribute

text_encoder_3 = from_pretrained_with_prefetch(
    T5EncoderModel.from_pretrained,
    model,
    subfolder="text_encoder_3",
    prefetch_list=sd3_subfolders,
    local_files_only=local_files_only,
    torch_dtype=dtype,
)

tokenizer instance-attribute

tokenizer = CLIPTokenizer.from_pretrained(
    model,
    subfolder="tokenizer",
    local_files_only=local_files_only,
)

tokenizer_2 instance-attribute

tokenizer_2 = CLIPTokenizer.from_pretrained(
    model,
    subfolder="tokenizer_2",
    local_files_only=local_files_only,
)

tokenizer_3 instance-attribute

tokenizer_3 = T5Tokenizer.from_pretrained(
    model,
    subfolder="tokenizer_3",
    local_files_only=local_files_only,
)

tokenizer_max_length instance-attribute

tokenizer_max_length = (
    self.tokenizer.model_max_length
    if hasattr(self, "tokenizer")
    and self.tokenizer is not None
    else 77
)

transformer instance-attribute

transformer = SD3Transformer2DModel(od_config=od_config)

vae instance-attribute

vae = from_pretrained_with_prefetch(
    DistributedAutoencoderKL.from_pretrained,
    model,
    subfolder="vae",
    prefetch_list=sd3_subfolders,
    local_files_only=local_files_only,
    torch_dtype=dtype,
).to(self.device)

vae_scale_factor instance-attribute

vae_scale_factor = (
    2 ** (len(self.vae.config.block_out_channels) - 1)
    if getattr(self, "vae", None)
    else 8
)
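The scale factor doubles for every VAE block after the first. A minimal sketch of the same computation, assuming a hypothetical helper that takes the `block_out_channels` sequence directly instead of a VAE object (the channel values are illustrative):

```python
def vae_scale_factor(block_out_channels=None):
    """Return 2 ** (number of blocks - 1), or 8 when no VAE config is available."""
    if block_out_channels:
        return 2 ** (len(block_out_channels) - 1)
    return 8


# Illustrative four-block configuration: three doublings give a factor of 8.
four_block_factor = vae_scale_factor((128, 256, 512, 512))
fallback_factor = vae_scale_factor(None)
print(four_block_factor, fallback_factor)
```

Only the number of blocks matters here; the channel widths themselves do not affect the result.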

weights_sources instance-attribute

weights_sources = [
    DiffusersPipelineLoader.ComponentSource(
        model_or_path=od_config.model,
        subfolder="transformer",
        revision=None,
        prefix="transformer.",
        fall_back_to_pt=True,
    )
]

check_inputs

check_inputs(
    prompt,
    prompt_2,
    prompt_3,
    height,
    width,
    negative_prompt=None,
    negative_prompt_2=None,
    negative_prompt_3=None,
    prompt_embeds=None,
    negative_prompt_embeds=None,
    max_sequence_length=None,
)

diffuse

diffuse(
    latents: Tensor,
    timesteps: Tensor,
    prompt_embeds: Tensor,
    pooled_prompt_embeds: Tensor | None,
    negative_prompt_embeds: Tensor | None,
    negative_pooled_prompt_embeds: Tensor | None,
    do_true_cfg: bool,
    guidance_scale: float,
    cfg_normalize: bool = False,
) -> Tensor

Diffusion loop with optional classifier-free guidance.

Parameters:

Name Type Description Default
latents Tensor

Noisy latents to denoise

required
timesteps Tensor

Diffusion timesteps

required
prompt_embeds Tensor

Positive prompt embeddings

required
pooled_prompt_embeds Tensor | None

Pooled positive prompt embeddings

required
negative_prompt_embeds Tensor | None

Negative prompt embeddings

required
negative_pooled_prompt_embeds Tensor | None

Pooled negative prompt embeddings

required
do_true_cfg bool

Whether to apply CFG

required
guidance_scale float

CFG scale factor

required
cfg_normalize bool

Whether to normalize CFG output (default: False)

False

Returns:

Type Description
Tensor

Denoised latents
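To illustrate how `guidance_scale` and `cfg_normalize` could enter the loop, here is a minimal sketch of the standard classifier-free guidance combination on plain Python lists. The function name `combine_cfg` is hypothetical, and the normalization shown (rescaling the combined prediction to the norm of the positive prediction) is an assumption about one common approach; this page does not confirm how `diffuse` implements it.

```python
import math


def combine_cfg(positive, negative, guidance_scale, cfg_normalize=False):
    """Standard CFG: negative + scale * (positive - negative), element-wise."""
    combined = [n + guidance_scale * (p - n) for p, n in zip(positive, negative)]
    if cfg_normalize:
        # Assumed normalization: match the L2 norm of the positive prediction.
        positive_norm = math.sqrt(sum(p * p for p in positive))
        combined_norm = math.sqrt(sum(c * c for c in combined))
        if combined_norm > 0:
            combined = [c * positive_norm / combined_norm for c in combined]
    return combined


positive_pred = [1.0, 2.0]
negative_pred = [0.0, 1.0]
guided = combine_cfg(positive_pred, negative_pred, guidance_scale=3.0)
guided_normalized = combine_cfg(
    positive_pred, negative_pred, guidance_scale=3.0, cfg_normalize=True
)
print(guided, guided_normalized)
```

With `guidance_scale=1.0` the combination reduces to the positive prediction alone, which is why guidance is typically skipped at that scale.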

encode_prompt

encode_prompt(
    prompt: str | list[str],
    prompt_2: str | list[str],
    prompt_3: str | list[str],
    prompt_embeds: Tensor | None = None,
    max_sequence_length: int = 256,
    num_images_per_prompt: int = 1,
)

Parameters:

Name Type Description Default
prompt `str` or `List[str]`, *optional*

The prompt or prompts to be encoded.

required
prompt_2 `str` or `List[str]`, *optional*

The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used for all text encoders.

required
prompt_3 `str` or `List[str]`, *optional*

The prompt or prompts to be sent to tokenizer_3 and text_encoder_3. If not defined, prompt is used for all text encoders.

required
num_images_per_prompt `int`

Number of images to generate per prompt.

1
prompt_embeds `torch.FloatTensor`, *optional*

Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

None

forward

forward(
    req: OmniDiffusionRequest,
    prompt: str | list[str] = "",
    prompt_2: str | list[str] = "",
    prompt_3: str | list[str] = "",
    negative_prompt: str | list[str] = "",
    negative_prompt_2: str | list[str] = "",
    negative_prompt_3: str | list[str] = "",
    height: int | None = None,
    width: int | None = None,
    num_inference_steps: int = 28,
    sigmas: list[float] | None = None,
    num_images_per_prompt: int = 1,
    generator: Generator | list[Generator] | None = None,
    latents: Tensor | None = None,
    prompt_embeds: Tensor | None = None,
    negative_prompt_embeds: Tensor | None = None,
    pooled_prompt_embeds: Tensor | None = None,
    negative_pooled_prompt_embeds: Tensor | None = None,
    max_sequence_length: int = 256,
) -> DiffusionOutput

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]
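The signature suggests a loader that consumes `(name, tensor)` pairs and reports which parameter names it handled. A minimal sketch of that contract, assuming a hypothetical parameter dictionary and plain Python values in place of tensors; skipping unknown names is an assumption, and the real method may remap or reject them instead.

```python
from collections.abc import Iterable


def load_weights_sketch(params: dict, weights: Iterable[tuple]) -> set[str]:
    """Copy matching entries into params and return the names that were loaded."""
    loaded = set()
    for name, value in weights:
        if name not in params:
            # Assumption: names without a matching parameter are ignored.
            continue
        params[name] = value
        loaded.add(name)
    return loaded


# Hypothetical parameter names, for illustration only.
params = {"proj_out.weight": None, "proj_out.bias": None}
checkpoint = [("proj_out.weight", [1.0]), ("unrelated.weight", [2.0])]
loaded_names = load_weights_sketch(params, checkpoint)
print(loaded_names)
```

Returning the set of loaded names lets a caller compare it against the expected parameter names and detect weights that were never initialized.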

prepare_latents

prepare_latents(
    batch_size,
    num_channels_latents,
    height,
    width,
    generator,
    latents=None,
) -> Tensor

prepare_timesteps

prepare_timesteps(
    num_inference_steps, sigmas, image_seq_len
)
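This page does not say how `image_seq_len` is derived. One plausible derivation, sketched below as an assumption, counts the patches in the latent grid: the pixel size is divided by `vae_scale_factor` to get the latent size, then by `patch_size` to get the token grid. The helper name and default values are hypothetical.

```python
def image_seq_len(height, width, vae_scale_factor=8, patch_size=2):
    """Number of image tokens, assuming one token per patch of the latent grid."""
    latent_height = height // vae_scale_factor
    latent_width = width // vae_scale_factor
    return (latent_height // patch_size) * (latent_width // patch_size)


# 1024 px -> 128 latent rows/columns -> 64 patches per side -> 64 * 64 tokens.
seq_len = image_seq_len(1024, 1024)
print(seq_len)
```

Under the same assumption, a 128 x 128 latent (matching `default_sample_size = 128`) with `patch_size = 2` yields 4096 image tokens.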

get_sd3_image_post_process_func

get_sd3_image_post_process_func(
    od_config: OmniDiffusionConfig,
)