vllm_omni.diffusion.models.glm_image

GLM Image diffusion model components.

Modules:

- glm_image_transformer
- pipeline_glm_image: GlmImagePipeline implementation for vLLM-Omni.

GlmImageKVCache

Container for all layers' KV caches.

Manages KV cache for all transformer layers in GLM-Image model. Provides a unified interface for setting mode and clearing cache.

Parameters:

- num_layers (int, required): Number of transformer layers in the model.

Example

kv_cache = GlmImageKVCache(num_layers=28)
kv_cache.set_mode(KVCacheMode.WRITE)
# ... process condition image ...
kv_cache.set_mode(KVCacheMode.READ)
# ... process target image ...
kv_cache.clear()

caches instance-attribute

caches = [GlmImageLayerKVCache() for _ in range(num_layers)]

is_empty property

is_empty: bool

Check if all layer caches are empty.

mode property

mode: KVCacheMode | None

Get current cache mode.

num_layers instance-attribute

num_layers = num_layers

clear

clear() -> None

Clear cache for all layers and reset mode.

set_mode

set_mode(mode: KVCacheMode | str | None) -> None

Set cache mode for all layers.

Parameters:

- mode (KVCacheMode | str | None, required): Cache mode (WRITE, READ, SKIP) or string ("write", "read", "skip"). Use None to disable cache operations.

Raises:

- ValueError: If mode is an invalid string.
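
The input contract described for set_mode (enum member, string, or None, with ValueError for an unknown string) can be illustrated with a small stand-alone sketch. ToyKVCacheMode and coerce_mode are hypothetical names used only here, and the lowercase enum values are an assumption based on the strings listed above; the real KVCacheMode may be defined differently.

```python
from enum import Enum


class ToyKVCacheMode(Enum):
    # Stand-in for KVCacheMode; the lowercase values are an assumption.
    WRITE = "write"
    READ = "read"
    SKIP = "skip"


def coerce_mode(mode):
    """Normalize an enum member, a string, or None to a mode (or None)."""
    if mode is None or isinstance(mode, ToyKVCacheMode):
        return mode
    try:
        return ToyKVCacheMode(mode)
    except ValueError:
        valid = [m.value for m in ToyKVCacheMode]
        raise ValueError(
            f"Invalid cache mode {mode!r}; expected one of {valid}"
        ) from None


print(coerce_mode("read"))
```

Normalizing once at the container level would let every layer cache compare against enum members only, which is one plausible reason for accepting both forms.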

GlmImagePipeline

Bases: Module, DiffusionPipelineProfilerMixin

GLM-Image Pipeline for text-to-image and image-to-image generation.

This pipeline integrates:

- AR stage (vLLM): Generates prior image tokens
- Text encoder (T5EncoderModel): Encodes glyph/text embeddings
- DiT model (GlmImageTransformer2DModel): Diffusion transformer
- VAE (AutoencoderKL): Encodes/decodes images to/from latent space

The pipeline flow:

1. AR stage provides prior_token_ids (and optionally prior_token_image_ids)
2. T5 encodes glyph text for text rendering
3. DiT performs iterative denoising conditioned on prior tokens
4. VAE decodes final latents to image
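
The order of the four stages can be sketched with placeholder callables. Everything in this block (run_flow and the lambda stand-ins) is hypothetical and only records the call order; it does not reflect how GlmImagePipeline wires its components internally.

```python
def run_flow(ar_stage, encode_glyphs, denoise, decode, prompt):
    """Call four hypothetical stage callables in the documented order."""
    prior_token_ids = ar_stage(prompt)                 # 1. AR stage
    glyph_embeds = encode_glyphs(prompt)               # 2. T5 glyph encoding
    latents = denoise(prior_token_ids, glyph_embeds)   # 3. DiT denoising
    return decode(latents)                             # 4. VAE decode


# Toy stand-ins that only record which stage ran.
calls = []
image = run_flow(
    ar_stage=lambda p: calls.append("ar") or [1, 2, 3],
    encode_glyphs=lambda p: calls.append("t5") or "glyph-embeds",
    denoise=lambda ids, emb: calls.append("dit") or "latents",
    decode=lambda lat: calls.append("vae") or "image",
    prompt='A poster that says "HELLO"',
)
print(calls, image)
```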

default_sample_size instance-attribute

default_sample_size = 128

device instance-attribute

device = get_local_device()

image_processor instance-attribute

image_processor = VaeImageProcessor(
    vae_scale_factor=vae_scale_factor
)

od_config instance-attribute

od_config = od_config

parallel_config instance-attribute

parallel_config = parallel_config

scheduler instance-attribute

scheduler = from_pretrained(
    model_path, subfolder="scheduler", local_files_only=True
)

text_encoder instance-attribute

text_encoder = to(device)

tokenizer instance-attribute

tokenizer = from_pretrained(
    model_path, subfolder="tokenizer", local_files_only=True
)

transformer instance-attribute

transformer = GlmImageTransformer2DModel(
    od_config=od_config, quant_config=quantization_config
)

vae instance-attribute

vae = to(device)

vae_scale_factor instance-attribute

vae_scale_factor = 2 ** (len(block_out_channels) - 1)
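
As a worked example of the expression above, the following sketch evaluates the scale factor for a hypothetical four-block VAE configuration. The channel list is illustrative, and the relation between default_sample_size and a default pixel resolution follows the usual diffusers convention, which is an assumption here rather than something this page states.

```python
def vae_scale_factor(block_out_channels):
    # All but one of the VAE blocks are assumed to downsample by a factor of 2.
    return 2 ** (len(block_out_channels) - 1)


# Hypothetical VAE config with four blocks.
block_out_channels = [128, 256, 512, 512]
scale = vae_scale_factor(block_out_channels)

# Assumed diffusers-style convention: default pixel size = sample size * scale.
default_sample_size = 128
default_pixels = default_sample_size * scale
print(scale, default_pixels)  # prints: 8 1024
```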

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=model,
        subfolder="transformer",
        revision=revision,
        prefix="transformer.",
        fall_back_to_pt=True,
    )
]

check_inputs

check_inputs(
    prompt: str | list[str] | None,
    height: int | None,
    width: int | None,
    prompt_embeds: Tensor | None = None,
) -> None

Validate input arguments before generation.

diffuse

diffuse(
    latents: Tensor,
    prior_token_id: Tensor,
    prompt_embeds: Tensor,
    negative_prompt_embeds: Tensor | None,
    timesteps: Tensor,
    target_size: Tensor,
    crop_coords: Tensor,
    guidance_scale: float,
    do_classifier_free_guidance: bool,
    kv_caches: GlmImageKVCache | None = None,
) -> Tensor

Denoising loop for diffusion process with CFG-Parallel support.

Parameters:

- latents (Tensor, required): Initial noise latents
- prior_token_id (Tensor, required): Prior tokens generated by AR model
- prompt_embeds (Tensor, required): Encoded positive prompt embeddings (glyph embeddings)
- negative_prompt_embeds (Tensor | None, required): Encoded negative prompt embeddings
- timesteps (Tensor, required): Denoising timesteps
- target_size (Tensor, required): Target image size tensor [[height, width]]
- crop_coords (Tensor, required): Crop coordinates tensor
- guidance_scale (float, required): CFG scale
- do_classifier_free_guidance (bool, required): Whether to apply CFG
- kv_caches (GlmImageKVCache | None, default None): Optional KV cache for Image Edit mode

Returns:

- Tensor: Denoised latents ready for VAE decode
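
For readers unfamiliar with the guidance_scale parameter, the textbook classifier-free guidance combination is sketched below on plain Python lists. This is the standard formula, not code taken from diffuse; whether the pipeline uses exactly this expression, and how CFG-Parallel distributes the two branches across devices, is not shown on this page.

```python
def cfg_combine(noise_cond, noise_uncond, guidance_scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(noise_cond, noise_uncond)
    ]


cond = [1.0, 2.0]    # toy prediction with the positive prompt
uncond = [0.0, 1.0]  # toy prediction with the negative/empty prompt

# A scale of 1.0 reproduces the conditional prediction unchanged.
print(cfg_combine(cond, uncond, 1.0))
# Larger scales push the result further away from the unconditional branch.
print(cfg_combine(cond, uncond, 3.0))
```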

encode_prompt

encode_prompt(
    prompt: str | list[str],
    do_classifier_free_guidance: bool = True,
    num_images_per_prompt: int = 1,
    prompt_embeds: Tensor | None = None,
    device: device | None = None,
    dtype: dtype | None = None,
    max_sequence_length: int = 2048,
) -> tuple[Tensor, Tensor | None]

Encode prompt into glyph embeddings for text rendering.

forward

Main generation forward pass.

Parameters:

- req (OmniDiffusionRequest, required): OmniDiffusionRequest with generation parameters

Returns:

- DiffusionOutput: DiffusionOutput containing generated image

get_glyph_texts

get_glyph_texts(prompt: str | list[str]) -> list[str]

Extract text within quotes for glyph rendering.
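
A minimal sketch of quote-based extraction is shown below. The function name extract_quoted and the set of recognized quote characters (straight and curly double quotes) are assumptions; the page does not say which quote styles get_glyph_texts actually handles.

```python
import re

# Matches "..." or curly-quoted text; the quote set is an assumption.
_QUOTED = re.compile(r'"([^"]+)"|\u201c([^\u201d]+)\u201d')


def extract_quoted(prompt):
    """Return quoted substrings in reading order for one or more prompts."""
    prompts = [prompt] if isinstance(prompt, str) else list(prompt)
    texts = []
    for p in prompts:
        for match in _QUOTED.finditer(p):
            # Exactly one of the two groups is set for each match.
            texts.append(match.group(1) or match.group(2))
    return texts


print(extract_quoted('A shop sign that says "OPEN" above "24h"'))
```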

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load transformer weights.

prepare_latents

prepare_latents(
    batch_size: int,
    num_channels_latents: int,
    height: int,
    width: int,
    dtype: dtype,
    device: device,
    generator: Generator | None,
    latents: Tensor | None = None,
) -> Tensor

Prepare random noise latents.

GlmImageTransformer2DModel

Bases: CachedTransformer

GLM-Image Transformer model for 2D image generation.

This is the vLLM-Omni-optimized version of the GLM-Image DiT model.

Parameters:

- od_config (OmniDiffusionConfig, required): OmniDiffusionConfig containing model configuration. Transformer hyper-parameters (e.g. patch size / channels / heads) are read from od_config.tf_model_config.

dtype property

dtype: dtype

Return dtype of model parameters.

glyph_projector instance-attribute

glyph_projector = GlmImageFeedForward(
    dim=text_embed_dim,
    dim_out=inner_dim,
    inner_dim=inner_dim,
    activation_fn="gelu",
    quant_config=quant_config,
    prefix="glyph_projector",
)

image_projector instance-attribute

image_projector = GlmImageImageProjector(
    in_channels, inner_dim, patch_size
)

in_channels instance-attribute

in_channels = in_channels

norm_out instance-attribute

norm_out = GlmImageAdaLayerNormContinuous(
    inner_dim,
    time_embed_dim,
    elementwise_affine=False,
    quant_config=None,
    prefix="norm_out",
)

num_layers property

num_layers: int

Return number of transformer layers.

od_config instance-attribute

od_config = od_config

out_channels instance-attribute

out_channels = out_channels

parallel_config instance-attribute

parallel_config = parallel_config

patch_size instance-attribute

patch_size = patch_size

prepare instance-attribute

prepare = GlmImagePrepare(image_projector, rope, patch_size)

prior_projector instance-attribute

prior_projector = GlmImageFeedForward(
    dim=inner_dim,
    dim_out=inner_dim,
    inner_dim=inner_dim,
    activation_fn="linear-silu",
    quant_config=quant_config,
    prefix="prior_projector",
)

prior_token_embedding instance-attribute

prior_token_embedding = Embedding(
    prior_vq_quantizer_codebook_size, inner_dim
)

proj_out instance-attribute

proj_out = Linear(
    inner_dim,
    patch_size * patch_size * out_channels,
    bias=True,
)
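
The output width of proj_out, patch_size * patch_size * out_channels, is the number of values each patch token must predict so that the tokens can be folded back into a full latent. The arithmetic can be checked with a short sketch; patch_layout and the concrete sizes are hypothetical and chosen only for illustration.

```python
def patch_layout(height, width, patch_size, out_channels):
    """Return (number of patch tokens, values predicted per token)."""
    if height % patch_size or width % patch_size:
        raise ValueError("latent height and width must be divisible by patch_size")
    num_tokens = (height // patch_size) * (width // patch_size)
    values_per_token = patch_size * patch_size * out_channels
    return num_tokens, values_per_token


# Hypothetical sizes: a 128x128 latent, 2x2 patches, 16 output channels.
tokens, per_token = patch_layout(128, 128, patch_size=2, out_channels=16)
print(tokens, per_token)

# Tokens times values per token covers every element of the output latent.
assert tokens * per_token == 16 * 128 * 128
```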

rope instance-attribute

rope = GlmImageRotaryPosEmbed(
    attention_head_dim, patch_size, theta=10000.0
)

time_condition_embed instance-attribute

time_condition_embed = (
    GlmImageCombinedTimestepSizeEmbeddings(
        embedding_dim=time_embed_dim,
        condition_dim=condition_dim,
        pooled_projection_dim=pooled_projection_dim,
        timesteps_dim=time_embed_dim,
    )
)

transformer_blocks instance-attribute

transformer_blocks = ModuleList(
    [
        GlmImageTransformerBlock(
            inner_dim,
            num_attention_heads,
            attention_head_dim,
            time_embed_dim,
            ffn_hidden_dim=ffn_hidden_dim,
            parallel_config=parallel_config,
            quant_config=quant_config,
            prefix=f"transformer_blocks.{i}",
        )
        for i in range(num_layers)
    ]
)

create_kv_cache

create_kv_cache() -> GlmImageKVCache

Create a KV cache for image editing.

Returns a new GlmImageKVCache instance sized for this model's number of transformer layers. Use this for image editing workflows.

Example

kv_cache = transformer.create_kv_cache()
kv_cache.set_mode("write")
transformer(condition_image, kv_cache=kv_cache)
kv_cache.set_mode("read")
for t in timesteps:
    transformer(noisy_target, kv_cache=kv_cache)
kv_cache.clear()

Returns:

- GlmImageKVCache: GlmImageKVCache instance with correct number of layers.

forward

forward(
    hidden_states: Tensor,
    encoder_hidden_states: Tensor,
    prior_token_id: Tensor,
    prior_token_drop: Tensor,
    timestep: LongTensor,
    target_size: Tensor,
    crop_coords: Tensor,
    attention_kwargs: dict[str, Any] | None = None,
    return_dict: bool = True,
    attention_mask: Tensor | None = None,
    image_rotary_emb: tuple[Tensor, Tensor] | None = None,
    kv_cache: GlmImageKVCache | None = None,
) -> Tensor | Transformer2DModelOutput

Forward pass of the GLM-Image Transformer.

Parameters:

- hidden_states (Tensor, required): Input latent tensor of shape [B, C, H, W].
- encoder_hidden_states (Tensor, required): Text embeddings of shape [B, S, D].
- prior_token_id (Tensor, required): Prior VQ token IDs.
- prior_token_drop (Tensor, required): Mask for dropping prior tokens (CFG).
- timestep (LongTensor, required): Diffusion timestep.
- target_size (Tensor, required): Target image size for conditioning.
- crop_coords (Tensor, required): Crop coordinates for conditioning.
- attention_kwargs (dict[str, Any] | None, default None): Additional attention arguments.
- return_dict (bool, default True): Whether to return a dataclass.
- attention_mask (Tensor | None, default None): Optional attention mask for text tokens.
- image_rotary_emb (tuple[Tensor, Tensor] | None, default None): Pre-computed rotary embeddings.
- kv_cache (GlmImageKVCache | None, default None): Optional KV cache for image editing. When provided, the cache's mode determines behavior:
    - WRITE: Store KV from condition images
    - READ: Use cached KV during generation
    - SKIP: No caching (same as None)

Returns:

- Tensor | Transformer2DModelOutput: Output tensor or Transformer2DModelOutput.
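
The three cache modes can be illustrated with a toy cache that uses Python lists in place of key/value tensors. ToyLayerCache and ToyKVCache are hypothetical, and the READ behavior shown (cached condition entries placed before the current ones) is an assumption about how such a cache is typically consumed by attention, not a description of GlmImageLayerKVCache internals.

```python
class ToyLayerCache:
    """Toy per-layer cache; keys/values are plain lists instead of tensors."""

    def __init__(self):
        self.mode = None
        self.keys, self.values = [], []

    def apply(self, k, v):
        if self.mode == "write":
            # Store the condition image's K/V for later reuse.
            self.keys.extend(k)
            self.values.extend(v)
            return k, v
        if self.mode == "read":
            # Attend over cached condition K/V followed by the current K/V.
            return self.keys + k, self.values + v
        # "skip" or None: behave as if no cache was passed.
        return k, v


class ToyKVCache:
    """Toy container applying one mode to every layer cache."""

    def __init__(self, num_layers):
        self.caches = [ToyLayerCache() for _ in range(num_layers)]

    def set_mode(self, mode):
        for cache in self.caches:
            cache.mode = mode

    @property
    def is_empty(self):
        return all(not cache.keys for cache in self.caches)

    def clear(self):
        for cache in self.caches:
            cache.keys, cache.values = [], []
            cache.mode = None


cache = ToyKVCache(num_layers=2)
cache.set_mode("write")
cache.caches[0].apply(["k_cond"], ["v_cond"])    # condition image pass
cache.set_mode("read")
k, v = cache.caches[0].apply(["k_tgt"], ["v_tgt"])  # target image pass
print(k, v)
cache.clear()
```

In this sketch the condition image is processed once in write mode, and every later denoising step reuses its keys and values in read mode instead of recomputing them.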

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights from pretrained checkpoint.

This method handles the mapping from diffusers weight names to vllm-omni weight names, especially for fused QKV projections.
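
One common way to express such a name mapping is a table of (fused name, source name, shard id) triples, as sketched below. The parameter names to_q, to_k, to_v, and to_qkv and the helper remap_weight_name are hypothetical; the actual names used by load_weights are not listed on this page.

```python
# (fused parameter fragment, checkpoint fragment, shard id); names are hypothetical.
STACKED_PARAMS_MAPPING = [
    (".to_qkv.", ".to_q.", "q"),
    (".to_qkv.", ".to_k.", "k"),
    (".to_qkv.", ".to_v.", "v"),
]


def remap_weight_name(name):
    """Map a checkpoint weight name to (model parameter name, shard id or None)."""
    for fused, source, shard_id in STACKED_PARAMS_MAPPING:
        if source in name:
            return name.replace(source, fused), shard_id
    # Weights that are not part of a fused projection keep their name.
    return name, None


print(remap_weight_name("transformer_blocks.0.attn1.to_k.weight"))
print(remap_weight_name("proj_out.weight"))
```

The shard id tells a fused parameter's loader which slice of the combined QKV weight a given checkpoint tensor should be copied into.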

get_glm_image_post_process_func

get_glm_image_post_process_func(
    od_config: OmniDiffusionConfig,
)

Get post-processing function for GLM-Image pipeline.

get_glm_image_pre_process_func

get_glm_image_pre_process_func(
    od_config: OmniDiffusionConfig,
)

Get pre-processing function for GLM-Image pipeline.

Pre-processes condition images before they are sent to the pipeline. This is called by DiffusionEngine before batching requests.