vllm_omni.diffusion.models.glm_image.pipeline_glm_image

GlmImagePipeline implementation for vLLM-Omni.

This pipeline implements GLM-Image text-to-image generation with:

- AR stage (vLLM): GLM-Image AR stage generates prior tokens
- DiT stage: GlmImageTransformer2DModel performs diffusion denoising
- VAE: AutoencoderKL decodes latents to images

logger module-attribute

logger = getLogger(__name__)

GlmImagePipeline

Bases: Module, DiffusionPipelineProfilerMixin

GLM-Image Pipeline for text-to-image and image-to-image generation.

This pipeline integrates:

- AR stage (vLLM): Generates prior image tokens
- Text encoder (T5EncoderModel): Encodes glyph/text embeddings
- DiT model (GlmImageTransformer2DModel): Diffusion transformer
- VAE (AutoencoderKL): Encodes/decodes images to/from latent space

The pipeline flow:

1. AR stage provides prior_token_ids (and optionally prior_token_image_ids)
2. T5 encodes glyph text for text rendering
3. DiT performs iterative denoising conditioned on prior tokens
4. VAE decodes final latents to image

default_sample_size instance-attribute

default_sample_size = 128

device instance-attribute

device = get_local_device()

image_processor instance-attribute

image_processor = VaeImageProcessor(
    vae_scale_factor=vae_scale_factor
)

od_config instance-attribute

od_config = od_config

parallel_config instance-attribute

parallel_config = parallel_config

scheduler instance-attribute

scheduler = from_pretrained(
    model_path, subfolder="scheduler", local_files_only=True
)

text_encoder instance-attribute

text_encoder = to(device)

tokenizer instance-attribute

tokenizer = from_pretrained(
    model_path, subfolder="tokenizer", local_files_only=True
)

transformer instance-attribute

transformer = GlmImageTransformer2DModel(
    od_config=od_config, quant_config=quantization_config
)

vae instance-attribute

vae = to(device)

vae_scale_factor instance-attribute

vae_scale_factor = 2 ** (len(block_out_channels) - 1)
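As a worked example of the formula above, the sketch below evaluates it for a hypothetical four-entry `block_out_channels` list. The channel values are illustrative assumptions, not values read from the GLM-Image checkpoint.

```python
# Hypothetical VAE config values, for illustration only.
block_out_channels = [128, 256, 512, 512]

# The formula from the attribute above: one factor of 2 per block
# after the first.
vae_scale_factor = 2 ** (len(block_out_channels) - 1)

# A 1024-pixel side then maps to this many latent positions per side.
latent_side = 1024 // vae_scale_factor

print(vae_scale_factor, latent_side)
```

With four blocks the factor is 8, so a 1024-pixel side corresponds to 128 latent positions.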

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=model,
        subfolder="transformer",
        revision=revision,
        prefix="transformer.",
        fall_back_to_pt=True,
    )
]

check_inputs

check_inputs(
    prompt: str | list[str] | None,
    height: int | None,
    width: int | None,
    prompt_embeds: Tensor | None = None,
) -> None

Validate input arguments before generation.

diffuse

diffuse(
    latents: Tensor,
    prior_token_id: Tensor,
    prompt_embeds: Tensor,
    negative_prompt_embeds: Tensor | None,
    timesteps: Tensor,
    target_size: Tensor,
    crop_coords: Tensor,
    guidance_scale: float,
    do_classifier_free_guidance: bool,
    kv_caches: GlmImageKVCache | None = None,
) -> Tensor

Denoising loop for diffusion process with CFG-Parallel support.

Parameters:

- latents (Tensor, required): Initial noise latents
- prior_token_id (Tensor, required): Prior tokens generated by AR model
- prompt_embeds (Tensor, required): Encoded positive prompt embeddings (glyph embeddings)
- negative_prompt_embeds (Tensor | None, required): Encoded negative prompt embeddings
- timesteps (Tensor, required): Denoising timesteps
- target_size (Tensor, required): Target image size tensor [[height, width]]
- crop_coords (Tensor, required): Crop coordinates tensor
- guidance_scale (float, required): CFG scale
- do_classifier_free_guidance (bool, required): Whether to apply CFG
- kv_caches (GlmImageKVCache | None, default None): Optional KV cache for Image Edit mode

Returns:

- Tensor: Denoised latents ready for VAE decode
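The role of `guidance_scale` and `do_classifier_free_guidance` can be illustrated with the standard classifier-free guidance combination. This is a minimal sketch on plain Python lists, not the pipeline's tensor code. `apply_cfg` is a hypothetical helper name, and how the conditional and unconditional predictions are produced under CFG-Parallel is not shown here.

```python
def apply_cfg(cond, uncond, guidance_scale):
    """Standard CFG: move the unconditional prediction toward the
    conditional one by a factor of guidance_scale, element-wise."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


# Toy noise predictions for a two-element "latent".
cond_pred = [1.0, 2.0]
uncond_pred = [0.0, 1.0]

# A scale of 1.0 reproduces the conditional prediction unchanged.
same = apply_cfg(cond_pred, uncond_pred, 1.0)

# Larger scales extrapolate past the conditional prediction.
guided = apply_cfg(cond_pred, uncond_pred, 2.0)
print(same, guided)
```

When CFG is disabled, a loop of this kind would typically use the conditional prediction directly and skip the unconditional pass.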

encode_prompt

encode_prompt(
    prompt: str | list[str],
    do_classifier_free_guidance: bool = True,
    num_images_per_prompt: int = 1,
    prompt_embeds: Tensor | None = None,
    device: device | None = None,
    dtype: dtype | None = None,
    max_sequence_length: int = 2048,
) -> tuple[Tensor, Tensor | None]

Encode prompt into glyph embeddings for text rendering.

forward

Main generation forward pass.

Parameters:

- req (OmniDiffusionRequest, required): OmniDiffusionRequest with generation parameters

Returns:

- DiffusionOutput: DiffusionOutput containing generated image

get_glyph_texts

get_glyph_texts(prompt: str | list[str]) -> list[str]

Extract text within quotes for glyph rendering.
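Quote extraction of this kind can be sketched with regular expressions. The example below is an assumption about the behavior, not the method's actual code. `get_glyph_texts_sketch` is a hypothetical name, and the set of quote styles it recognizes (straight and curly double quotes) is chosen for illustration.

```python
import re

# Quote styles handled by this sketch; the real method may recognize
# a different set.
_QUOTE_PATTERNS = [
    r'"([^"]*)"',        # straight double quotes
    r"“([^“”]*)”",       # curly double quotes
]


def get_glyph_texts_sketch(prompt):
    """Return the substrings found inside quotes, pattern by pattern."""
    prompts = [prompt] if isinstance(prompt, str) else list(prompt)
    found = []
    for text in prompts:
        for pattern in _QUOTE_PATTERNS:
            found.extend(re.findall(pattern, text))
    return found


glyphs = get_glyph_texts_sketch('A shop sign that says "OPEN" and “24h”')
print(glyphs)
```

Single quotes are left out of the sketch on purpose, since apostrophes in ordinary prose would otherwise produce false matches.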

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load transformer weights.

prepare_latents

prepare_latents(
    batch_size: int,
    num_channels_latents: int,
    height: int,
    width: int,
    dtype: dtype,
    device: device,
    generator: Generator | None,
    latents: Tensor | None = None,
) -> Tensor

Prepare random noise latents.
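The latent shape implied by these arguments can be sketched as follows, assuming the conventional layout in which each spatial dimension is divided by the VAE scale factor. `latent_shape` is a hypothetical helper, and the channel count and scale factor in the usage line are illustrative values.

```python
def latent_shape(batch_size, num_channels_latents, height, width, vae_scale_factor):
    """Shape of the noise tensor under the assumed (B, C, H/f, W/f) layout."""
    return (
        batch_size,
        num_channels_latents,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


# Illustrative values: 16 latent channels and a scale factor of 8.
shape = latent_shape(1, 16, 1024, 1024, 8)
print(shape)
```

If `latents` is passed in, a method of this kind would usually return it (moved to the requested device) instead of sampling new noise.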

calculate_shift

calculate_shift(
    image_seq_len: int,
    base_seq_len: int = 256,
    base_shift: float = 0.25,
    max_shift: float = 0.75,
) -> float

Calculate timestep shift based on image sequence length.
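One form consistent with this signature and its defaults is a square-root scaling of the sequence-length ratio, as used by related diffusion pipelines. The sketch below assumes that form. It is not confirmed to be this module's exact formula, and `calculate_shift_sketch` is a hypothetical name.

```python
def calculate_shift_sketch(
    image_seq_len,
    base_seq_len=256,
    base_shift=0.25,
    max_shift=0.75,
):
    """Assumed form: the shift grows with the square root of the
    ratio between the image sequence length and the base length."""
    m = (image_seq_len / base_seq_len) ** 0.5
    return m * max_shift + base_shift


# At the base sequence length the ratio is 1, giving max_shift + base_shift.
mu_base = calculate_shift_sketch(256)

# Four times as many tokens doubles the square-root term.
mu_large = calculate_shift_sketch(1024)
print(mu_base, mu_large)
```

Under this assumption, longer image sequences (larger images) receive a larger shift.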

get_glm_image_post_process_func

get_glm_image_post_process_func(
    od_config: OmniDiffusionConfig,
)

Get post-processing function for GLM-Image pipeline.

get_glm_image_pre_process_func

get_glm_image_pre_process_func(
    od_config: OmniDiffusionConfig,
)

Get pre-processing function for GLM-Image pipeline.

Pre-processes condition images before they are sent to the pipeline. This is called by DiffusionEngine before batching requests.

retrieve_latents

retrieve_latents(
    encoder_output: Tensor,
    generator: Generator | None = None,
    sample_mode: str = "sample",
) -> Tensor

Extract latents from VAE encoder output.
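Helpers with this signature commonly dispatch on the attributes of the encoder output. The sketch below follows that common pattern with stand-in objects instead of real tensors. It is an assumption about the behavior, and `retrieve_latents_sketch` and the stub objects are hypothetical.

```python
from types import SimpleNamespace


def retrieve_latents_sketch(encoder_output, generator=None, sample_mode="sample"):
    """Pick latents from a VAE encoder output by inspecting its attributes."""
    dist = getattr(encoder_output, "latent_dist", None)
    if dist is not None and sample_mode == "sample":
        # Draw a random sample from the posterior.
        return dist.sample(generator)
    if dist is not None and sample_mode == "argmax":
        # Use the distribution's mode (deterministic).
        return dist.mode()
    if hasattr(encoder_output, "latents"):
        return encoder_output.latents
    raise AttributeError("Could not access latents of provided encoder_output")


# Stand-in for a posterior distribution; strings replace tensors here.
fake_dist = SimpleNamespace(
    sample=lambda generator=None: "sampled-latents",
    mode=lambda: "mode-latents",
)
with_dist = SimpleNamespace(latent_dist=fake_dist)
with_latents = SimpleNamespace(latents="precomputed-latents")

print(retrieve_latents_sketch(with_dist))
print(retrieve_latents_sketch(with_dist, sample_mode="argmax"))
print(retrieve_latents_sketch(with_latents))
```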

retrieve_timesteps

retrieve_timesteps(
    scheduler,
    num_inference_steps: int | None = None,
    device: str | device | None = None,
    timesteps: list[int] | None = None,
    sigmas: list[float] | None = None,
    **kwargs,
) -> tuple[Tensor, int]

Calls the scheduler's set_timesteps method and retrieves timesteps. Handles custom timesteps and sigmas schedules.
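The dispatch described above can be sketched with a stub scheduler. Everything here is illustrative: `FakeScheduler` and `retrieve_timesteps_sketch` are hypothetical, plain lists stand in for tensors, and giving `timesteps` precedence over `sigmas` is an assumption, since the source does not say how the two interact.

```python
class FakeScheduler:
    """Stub exposing only set_timesteps and a timesteps attribute."""

    def set_timesteps(self, num_inference_steps=None, device=None,
                      timesteps=None, sigmas=None):
        if timesteps is not None:
            self.timesteps = list(timesteps)
        elif sigmas is not None:
            # Toy mapping from sigma in [0, 1] to a 0..1000 timestep scale.
            self.timesteps = [s * 1000 for s in sigmas]
        else:
            step = 1000 // num_inference_steps
            self.timesteps = list(range(999, -1, -step))[:num_inference_steps]


def retrieve_timesteps_sketch(scheduler, num_inference_steps=None, device=None,
                              timesteps=None, sigmas=None, **kwargs):
    """Configure the scheduler, then return (timesteps, number of steps)."""
    if timesteps is not None:
        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
    elif sigmas is not None:
        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
    else:
        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
    return scheduler.timesteps, len(scheduler.timesteps)


default_ts, default_n = retrieve_timesteps_sketch(FakeScheduler(), num_inference_steps=4)
custom_ts, custom_n = retrieve_timesteps_sketch(FakeScheduler(), timesteps=[900, 500, 100])
print(default_ts, default_n)
print(custom_ts, custom_n)
```

When a custom schedule is supplied, the returned step count reflects its length, not the `num_inference_steps` argument.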