vllm_omni.diffusion.models.glm_image ¶
GLM Image diffusion model components.
Modules:
| Name | Description |
|---|---|
| glm_image_transformer | |
| pipeline_glm_image | GlmImagePipeline implementation for vLLM-Omni. |
GlmImageKVCache ¶
Container for all layers' KV caches.
Manages the KV cache for every transformer layer in the GLM-Image model and provides a single interface for setting the cache mode and clearing the cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| num_layers | int | Number of transformer layers in the model. | required |
Example
kv_cache = GlmImageKVCache(num_layers=28)
kv_cache.set_mode(KVCacheMode.WRITE)
# ... process condition image ...
kv_cache.set_mode(KVCacheMode.READ)
# ... process target image ...
kv_cache.clear()
set_mode ¶
set_mode(mode: KVCacheMode | str | None) -> None
Set cache mode for all layers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mode | KVCacheMode \| str \| None | Cache mode (WRITE, READ, SKIP) or the equivalent string ("write", "read", "skip"). Use None to disable cache operations. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If mode is an invalid string. |
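The mode handling described above can be sketched as follows. This is a minimal illustration, not the vLLM-Omni implementation: `KVCacheSketch`, `LayerCacheSketch`, the enum values, and the case-insensitive string matching are all assumptions made for the example.

```python
from enum import Enum
from typing import Optional, Union


class KVCacheMode(Enum):
    # Hypothetical stand-in for the real enum; the member values are assumed.
    WRITE = "write"
    READ = "read"
    SKIP = "skip"


class LayerCacheSketch:
    """One slot per layer: a mode plus whatever key/value entries were stored."""

    def __init__(self) -> None:
        self.mode: Optional[KVCacheMode] = None
        self.entries: list = []


class KVCacheSketch:
    """Illustrative container that fans a single call out to every layer."""

    def __init__(self, num_layers: int) -> None:
        self.layers = [LayerCacheSketch() for _ in range(num_layers)]

    def set_mode(self, mode: Union[KVCacheMode, str, None]) -> None:
        if isinstance(mode, str):
            try:
                # Assumption: strings are matched case-insensitively.
                mode = KVCacheMode(mode.lower())
            except ValueError:
                raise ValueError(f"invalid KV cache mode: {mode!r}") from None
        for layer in self.layers:
            layer.mode = mode

    def clear(self) -> None:
        # Drop stored entries on every layer; the mode is left untouched here.
        for layer in self.layers:
            layer.entries.clear()
```

Normalizing strings to the enum at the container boundary means each layer only ever sees an enum member or None.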
GlmImagePipeline ¶
Bases: Module, DiffusionPipelineProfilerMixin
GLM-Image Pipeline for text-to-image and image-to-image generation.
This pipeline integrates:
- AR stage (vLLM): Generates prior image tokens
- Text encoder (T5EncoderModel): Encodes glyph/text embeddings
- DiT model (GlmImageTransformer2DModel): Diffusion transformer
- VAE (AutoencoderKL): Encodes/decodes images to/from latent space
The pipeline flow:
1. AR stage provides prior_token_ids (and optionally prior_token_image_ids)
2. T5 encodes glyph text for text rendering
3. DiT performs iterative denoising conditioned on prior tokens
4. VAE decodes final latents to image
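The ordering of the four stages can be sketched with plain callables. Everything here is hypothetical scaffolding: `run_pipeline_sketch` and its `encode_glyphs`, `denoise`, and `decode` parameters are illustrative names, not part of the GlmImagePipeline API, and the real stages operate on tensors.

```python
from typing import Any, Callable, Sequence


def run_pipeline_sketch(
    prompt: str,
    prior_token_ids: Sequence[int],
    encode_glyphs: Callable[[str], Any],
    denoise: Callable[[Sequence[int], Any], Any],
    decode: Callable[[Any], Any],
) -> Any:
    # Step 1 happens upstream: the AR stage has already produced prior_token_ids.
    glyph_embeds = encode_glyphs(prompt)               # step 2: text encoder
    latents = denoise(prior_token_ids, glyph_embeds)   # step 3: DiT denoising
    return decode(latents)                             # step 4: VAE decode


# Usage with trivial stand-ins that only record the call order.
calls = []
result = run_pipeline_sketch(
    'a poster that says "SALE"',
    [1, 2, 3],
    encode_glyphs=lambda p: calls.append("encode") or "embeds",
    denoise=lambda ids, emb: calls.append("denoise") or "latents",
    decode=lambda lat: calls.append("decode") or "image",
)
```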
image_processor instance-attribute ¶
scheduler instance-attribute ¶
tokenizer instance-attribute ¶
transformer instance-attribute ¶
transformer = GlmImageTransformer2DModel(
od_config=od_config, quant_config=quantization_config
)
weights_sources instance-attribute ¶
weights_sources = [
ComponentSource(
model_or_path=model,
subfolder="transformer",
revision=revision,
prefix="transformer.",
fall_back_to_pt=True,
)
]
check_inputs ¶
check_inputs(
prompt: str | list[str] | None,
height: int | None,
width: int | None,
prompt_embeds: Tensor | None = None,
) -> None
Validate input arguments before generation.
diffuse ¶
diffuse(
latents: Tensor,
prior_token_id: Tensor,
prompt_embeds: Tensor,
negative_prompt_embeds: Tensor | None,
timesteps: Tensor,
target_size: Tensor,
crop_coords: Tensor,
guidance_scale: float,
do_classifier_free_guidance: bool,
kv_caches: GlmImageKVCache | None = None,
) -> Tensor
Denoising loop for diffusion process with CFG-Parallel support.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| latents | Tensor | Initial noise latents | required |
| prior_token_id | Tensor | Prior tokens generated by AR model | required |
| prompt_embeds | Tensor | Encoded positive prompt embeddings (glyph embeddings) | required |
| negative_prompt_embeds | Tensor \| None | Encoded negative prompt embeddings | required |
| timesteps | Tensor | Denoising timesteps | required |
| target_size | Tensor | Target image size tensor [[height, width]] | required |
| crop_coords | Tensor | Crop coordinates tensor | required |
| guidance_scale | float | CFG scale | required |
| do_classifier_free_guidance | bool | Whether to apply CFG | required |
| kv_caches | GlmImageKVCache \| None | Optional KV cache for Image Edit mode | None |
Returns:
| Type | Description |
|---|---|
| Tensor | Denoised latents ready for VAE decode |
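How `guidance_scale` enters the loop is not spelled out above. A minimal sketch, assuming the standard classifier-free guidance rule (the pipeline's exact formula and its CFG-Parallel scheduling are not shown in this reference), combines the conditional and unconditional predictions like this. `apply_cfg` is a hypothetical helper that works on plain lists of floats rather than tensors.

```python
def apply_cfg(cond, uncond, guidance_scale):
    """Combine two noise predictions element-wise.

    Assumed rule: uncond + guidance_scale * (cond - uncond).
    """
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


# A scale of 1.0 reproduces the conditional prediction;
# larger values push further away from the unconditional one.
same_as_cond = apply_cfg([1.0, 2.0], [0.0, 1.0], 1.0)
amplified = apply_cfg([1.0, 2.0], [0.0, 1.0], 3.0)
```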
encode_prompt ¶
encode_prompt(
prompt: str | list[str],
do_classifier_free_guidance: bool = True,
num_images_per_prompt: int = 1,
prompt_embeds: Tensor | None = None,
device: device | None = None,
dtype: dtype | None = None,
max_sequence_length: int = 2048,
) -> tuple[Tensor, Tensor | None]
Encode prompt into glyph embeddings for text rendering.
forward ¶
forward(req: OmniDiffusionRequest) -> DiffusionOutput
Main generation forward pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| req | OmniDiffusionRequest | OmniDiffusionRequest with generation parameters | required |
Returns:
| Type | Description |
|---|---|
| DiffusionOutput | DiffusionOutput containing generated image |
get_glyph_texts ¶
Extract text within quotes for glyph rendering.
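A rough idea of quote extraction, assuming only straight and curly double quotes are recognized. The real method's signature and supported quote styles are not documented here, so `extract_quoted_texts` and its pattern are illustrative only.

```python
import re

# Assumption: glyph text is delimited by "..." or “...”.
_QUOTED = re.compile(r'"([^"]*)"|“([^“”]*)”')


def extract_quoted_texts(prompt: str) -> list[str]:
    """Return quoted substrings in order of appearance."""
    texts = []
    for match in _QUOTED.finditer(prompt):
        # Exactly one of the two groups participates in each match.
        straight, curly = match.group(1), match.group(2)
        texts.append(straight if straight is not None else curly)
    return texts


glyphs = extract_quoted_texts('A sign that says "OPEN" next to “Café”')
```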
load_weights ¶
Load transformer weights.
GlmImageTransformer2DModel ¶
Bases: CachedTransformer
GLM-Image Transformer model for 2D image generation.
This is the vllm-omni optimized version of the GLM-Image DiT model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| od_config | OmniDiffusionConfig | OmniDiffusionConfig containing model configuration. Transformer hyper-parameters (e.g. patch size / channels / heads) are read from | required |
glyph_projector instance-attribute ¶
glyph_projector = GlmImageFeedForward(
dim=text_embed_dim,
dim_out=inner_dim,
inner_dim=inner_dim,
activation_fn="gelu",
quant_config=quant_config,
prefix="glyph_projector",
)
image_projector instance-attribute ¶
image_projector = GlmImageImageProjector(
in_channels, inner_dim, patch_size
)
norm_out instance-attribute ¶
norm_out = GlmImageAdaLayerNormContinuous(
inner_dim,
time_embed_dim,
elementwise_affine=False,
quant_config=None,
prefix="norm_out",
)
prior_projector instance-attribute ¶
prior_projector = GlmImageFeedForward(
dim=inner_dim,
dim_out=inner_dim,
inner_dim=inner_dim,
activation_fn="linear-silu",
quant_config=quant_config,
prefix="prior_projector",
)
prior_token_embedding instance-attribute ¶
proj_out instance-attribute ¶
rope instance-attribute ¶
rope = GlmImageRotaryPosEmbed(
attention_head_dim, patch_size, theta=10000.0
)
time_condition_embed instance-attribute ¶
time_condition_embed = (
GlmImageCombinedTimestepSizeEmbeddings(
embedding_dim=time_embed_dim,
condition_dim=condition_dim,
pooled_projection_dim=pooled_projection_dim,
timesteps_dim=time_embed_dim,
)
)
transformer_blocks instance-attribute ¶
transformer_blocks = ModuleList(
[
(
GlmImageTransformerBlock(
inner_dim,
num_attention_heads,
attention_head_dim,
time_embed_dim,
ffn_hidden_dim=ffn_hidden_dim,
parallel_config=parallel_config,
quant_config=quant_config,
prefix=f"transformer_blocks.{i}",
)
)
for i in (range(num_layers))
]
)
create_kv_cache ¶
create_kv_cache() -> GlmImageKVCache
Create a KV cache for image editing.
Returns a new GlmImageKVCache instance sized for this model's number of transformer layers. Use this for image editing workflows.
Example
kv_cache = transformer.create_kv_cache()
kv_cache.set_mode("write")
transformer(condition_image, kv_cache=kv_cache)
kv_cache.set_mode("read")
for t in timesteps:
    transformer(noisy_target, kv_cache=kv_cache)
kv_cache.clear()
Returns:
| Type | Description |
|---|---|
| GlmImageKVCache | GlmImageKVCache instance with correct number of layers. |
forward ¶
forward(
hidden_states: Tensor,
encoder_hidden_states: Tensor,
prior_token_id: Tensor,
prior_token_drop: Tensor,
timestep: LongTensor,
target_size: Tensor,
crop_coords: Tensor,
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
attention_mask: Tensor | None = None,
image_rotary_emb: tuple[Tensor, Tensor] | None = None,
kv_cache: GlmImageKVCache | None = None,
) -> Tensor | Transformer2DModelOutput
Forward pass of the GLM-Image Transformer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | Tensor | Input latent tensor of shape [B, C, H, W]. | required |
| encoder_hidden_states | Tensor | Text embeddings of shape [B, S, D]. | required |
| prior_token_id | Tensor | Prior VQ token IDs. | required |
| prior_token_drop | Tensor | Mask for dropping prior tokens (CFG). | required |
| timestep | LongTensor | Diffusion timestep. | required |
| target_size | Tensor | Target image size for conditioning. | required |
| crop_coords | Tensor | Crop coordinates for conditioning. | required |
| attention_kwargs | dict[str, Any] \| None | Additional attention arguments. | None |
| return_dict | bool | Whether to return a dataclass. | True |
| attention_mask | Tensor \| None | Optional attention mask for text tokens. | None |
| image_rotary_emb | tuple[Tensor, Tensor] \| None | Pre-computed rotary embeddings. | None |
| kv_cache | GlmImageKVCache \| None | Optional KV cache for image editing. When provided, the cache's mode determines behavior: WRITE stores KV from condition images; READ uses cached KV during generation; SKIP disables caching (same as None). | None |
Returns:
| Type | Description |
|---|---|
| Tensor \| Transformer2DModelOutput | Output tensor or Transformer2DModelOutput. |
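For intuition about how a [B, C, H, W] latent becomes a token sequence, the arithmetic below assumes non-overlapping square patches and dimensions divisible by the patch size. That is a common DiT convention, not something this reference confirms for GLM-Image, and `patch_grid` is a hypothetical helper.

```python
def patch_grid(height: int, width: int, patch_size: int) -> tuple[int, int, int]:
    """Return (rows, cols, tokens) for a latent of the given spatial size."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    rows = height // patch_size
    cols = width // patch_size
    return rows, cols, rows * cols


# Example: a 64x64 latent split into 2x2 patches.
rows, cols, tokens = patch_grid(64, 64, 2)
```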
get_glm_image_post_process_func ¶
get_glm_image_post_process_func(
od_config: OmniDiffusionConfig,
)
Get post-processing function for GLM-Image pipeline.
get_glm_image_pre_process_func ¶
get_glm_image_pre_process_func(
od_config: OmniDiffusionConfig,
)
Get pre-processing function for GLM-Image pipeline.
Pre-processes condition images before they are sent to the pipeline. This is called by DiffusionEngine before batching requests.