vllm_omni.diffusion.models.glm_image.pipeline_glm_image ¶
GlmImagePipeline implementation for vLLM-Omni.
This pipeline implements GLM-Image text-to-image generation with:
- AR stage (vLLM): GLM-Image AR stage generates prior tokens
- DiT stage: GlmImageTransformer2DModel performs diffusion denoising
- VAE: AutoencoderKL decodes latents to images
GlmImagePipeline ¶
Bases: Module, DiffusionPipelineProfilerMixin
GLM-Image Pipeline for text-to-image and image-to-image generation.
This pipeline integrates:
- AR stage (vLLM): Generates prior image tokens
- Text encoder (T5EncoderModel): Encodes glyph/text embeddings
- DiT model (GlmImageTransformer2DModel): Diffusion transformer
- VAE (AutoencoderKL): Encodes/decodes images to/from latent space

The pipeline flow:
1. AR stage provides prior_token_ids (and optionally prior_token_image_ids)
2. T5 encodes glyph text for text rendering
3. DiT performs iterative denoising conditioned on prior tokens
4. VAE decodes final latents to image
image_processor instance-attribute ¶
scheduler instance-attribute ¶
tokenizer instance-attribute ¶
transformer instance-attribute ¶
transformer = GlmImageTransformer2DModel(
    od_config=od_config, quant_config=quantization_config
)
weights_sources instance-attribute ¶
weights_sources = [
    ComponentSource(
        model_or_path=model,
        subfolder="transformer",
        revision=revision,
        prefix="transformer.",
        fall_back_to_pt=True,
    )
]
check_inputs ¶
check_inputs(
    prompt: str | list[str] | None,
    height: int | None,
    width: int | None,
    prompt_embeds: Tensor | None = None,
) -> None
Validate input arguments before generation.
diffuse ¶
diffuse(
    latents: Tensor,
    prior_token_id: Tensor,
    prompt_embeds: Tensor,
    negative_prompt_embeds: Tensor | None,
    timesteps: Tensor,
    target_size: Tensor,
    crop_coords: Tensor,
    guidance_scale: float,
    do_classifier_free_guidance: bool,
    kv_caches: GlmImageKVCache | None = None,
) -> Tensor
Denoising loop for diffusion process with CFG-Parallel support.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| latents | Tensor | Initial noise latents | required |
| prior_token_id | Tensor | Prior tokens generated by AR model | required |
| prompt_embeds | Tensor | Encoded positive prompt embeddings (glyph embeddings) | required |
| negative_prompt_embeds | Tensor \| None | Encoded negative prompt embeddings | required |
| timesteps | Tensor | Denoising timesteps | required |
| target_size | Tensor | Target image size tensor [[height, width]] | required |
| crop_coords | Tensor | Crop coordinates tensor | required |
| guidance_scale | float | CFG scale | required |
| do_classifier_free_guidance | bool | Whether to apply CFG | required |
| kv_caches | GlmImageKVCache \| None | Optional KV cache for Image Edit mode | None |
Returns:
| Type | Description |
|---|---|
| Tensor | Denoised latents ready for VAE decode |
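The classifier-free guidance (CFG) step in a denoising loop like this one is commonly written as a linear combination of the conditional and unconditional noise predictions. The following is a minimal sketch of that standard formula on plain Python lists. `apply_cfg` is a hypothetical helper name, not part of this module, and the sketch does not show how `diffuse` distributes work in CFG-Parallel mode.

```python
def apply_cfg(noise_cond, noise_uncond, guidance_scale):
    """Standard CFG combination: uncond + scale * (cond - uncond).

    Hypothetical helper for illustration only. It works on flat lists of
    floats instead of tensors so it runs without any dependencies.
    """
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(noise_cond, noise_uncond)
    ]


cond = [1.0, 2.0, 3.0]    # prediction conditioned on the positive prompt
uncond = [0.0, 1.0, 1.0]  # prediction conditioned on the negative prompt
guided = apply_cfg(cond, uncond, guidance_scale=2.0)
print(guided)
```

With `guidance_scale=1.0` the result equals the conditional prediction, which is why CFG is usually skipped when the scale is not greater than 1.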
encode_prompt ¶
encode_prompt(
    prompt: str | list[str],
    do_classifier_free_guidance: bool = True,
    num_images_per_prompt: int = 1,
    prompt_embeds: Tensor | None = None,
    device: device | None = None,
    dtype: dtype | None = None,
    max_sequence_length: int = 2048,
) -> tuple[Tensor, Tensor | None]
Encode prompt into glyph embeddings for text rendering.
forward ¶
forward(req: OmniDiffusionRequest) -> DiffusionOutput
Main generation forward pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| req | OmniDiffusionRequest | OmniDiffusionRequest with generation parameters | required |
Returns:
| Type | Description |
|---|---|
| DiffusionOutput | DiffusionOutput containing generated image |
get_glyph_texts ¶
Extract text within quotes for glyph rendering.
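As an illustration of quote-based extraction, the sketch below pulls quoted substrings out of a prompt with regular expressions. It is an assumption about how such a helper could work, not the method's actual code: `extract_quoted_texts` is a hypothetical name, and the set of quote styles handled here (straight single, straight double, curly double) is chosen for the example.

```python
import re


def extract_quoted_texts(prompt: str) -> list[str]:
    """Return substrings enclosed in quotes (hypothetical stand-in).

    Results are grouped by quote style, not by position in the prompt.
    """
    patterns = [
        r"'([^']*)'",                       # straight single quotes
        r'"([^"]*)"',                       # straight double quotes
        r"\u201c([^\u201c\u201d]*)\u201d",  # curly double quotes
    ]
    texts = []
    for pattern in patterns:
        texts.extend(re.findall(pattern, prompt))
    return texts


texts = extract_quoted_texts(
    'A poster that says "Hello World" above a sign reading \'OPEN\''
)
print(texts)
```

Note that a prompt containing apostrophes (for example "the dog's bowl") would confuse the single-quote pattern in this sketch.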
load_weights ¶
Load transformer weights.
calculate_shift ¶
calculate_shift(
    image_seq_len: int,
    base_seq_len: int = 256,
    base_shift: float = 0.25,
    max_shift: float = 0.75,
) -> float
Calculate timestep shift based on image sequence length.
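This page does not state the formula. One plausible form, assumed here purely for illustration, scales the shift with the square root of the ratio between the image sequence length and the base sequence length. The function name `calculate_shift_sketch` is hypothetical; only the parameter names and defaults come from the signature above.

```python
import math


def calculate_shift_sketch(
    image_seq_len: int,
    base_seq_len: int = 256,
    base_shift: float = 0.25,
    max_shift: float = 0.75,
) -> float:
    # ASSUMPTION: square-root scaling of the sequence-length ratio.
    # The real calculate_shift may use a different formula.
    m = math.sqrt(image_seq_len / base_seq_len)
    return m * max_shift + base_shift


# Longer token sequences (larger images) get a larger shift.
shift_base = calculate_shift_sketch(256)
shift_large = calculate_shift_sketch(1024)
print(shift_base, shift_large)
```

Under this assumption the shift grows with resolution, so larger images spend more of the schedule at high noise levels.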
get_glm_image_post_process_func ¶
get_glm_image_post_process_func(
    od_config: OmniDiffusionConfig,
)
Get post-processing function for GLM-Image pipeline.
get_glm_image_pre_process_func ¶
get_glm_image_pre_process_func(
    od_config: OmniDiffusionConfig,
)
Get pre-processing function for GLM-Image pipeline.
Pre-processes condition images before they are sent to the pipeline. This is called by DiffusionEngine before batching requests.
retrieve_latents ¶
retrieve_latents(
    encoder_output: Tensor,
    generator: Generator | None = None,
    sample_mode: str = "sample",
) -> Tensor
Extract latents from VAE encoder output.
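Helpers with this signature usually dispatch on `sample_mode` and on which attributes the encoder output exposes. The sketch below shows that pattern with stand-in objects instead of a real VAE. The attribute names `latent_dist` and `latents` and the `"argmax"` mode are assumptions borrowed from common diffusion-pipeline conventions; `FakeLatentDist` and `retrieve_latents_sketch` are hypothetical names.

```python
from types import SimpleNamespace


class FakeLatentDist:
    """Stand-in for a VAE posterior; returns labels instead of tensors."""

    def sample(self, generator=None):
        return "sampled"

    def mode(self):
        return "mode"


def retrieve_latents_sketch(encoder_output, generator=None, sample_mode="sample"):
    # ASSUMPTION: the output carries either a distribution or raw latents.
    if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
        return encoder_output.latent_dist.sample(generator)
    if hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
        return encoder_output.latent_dist.mode()
    if hasattr(encoder_output, "latents"):
        return encoder_output.latents
    raise AttributeError("Could not access latents of provided encoder_output")


dist_output = SimpleNamespace(latent_dist=FakeLatentDist())
raw_output = SimpleNamespace(latents="raw")
print(retrieve_latents_sketch(dist_output))
print(retrieve_latents_sketch(dist_output, sample_mode="argmax"))
print(retrieve_latents_sketch(raw_output))
```

Taking the distribution's mode instead of sampling makes the encoding of a condition image deterministic, which can be useful for image-to-image requests.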
retrieve_timesteps ¶
retrieve_timesteps(
    scheduler,
    num_inference_steps: int | None = None,
    device: str | device | None = None,
    timesteps: list[int] | None = None,
    sigmas: list[float] | None = None,
    **kwargs,
) -> tuple[Tensor, int]
Calls the scheduler's set_timesteps method and retrieves timesteps. Handles custom timesteps and sigmas schedules.
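The dispatch described above can be sketched as follows. `ToyScheduler` is a hypothetical stand-in that stores plain lists rather than tensors, and `retrieve_timesteps_sketch` is not the module's function. In particular, raising an error when both `timesteps` and `sigmas` are given is an assumption; the real helper may handle that case differently.

```python
class ToyScheduler:
    """Minimal stand-in scheduler; real schedulers return tensors."""

    def set_timesteps(self, num_inference_steps=None, device=None,
                      timesteps=None, sigmas=None):
        if timesteps is not None:
            self.timesteps = list(timesteps)
        elif sigmas is not None:
            # Toy sigma-to-timestep mapping, for illustration only.
            self.timesteps = [int(s * 1000) for s in sigmas]
        else:
            step = 1000 // num_inference_steps
            self.timesteps = list(range(1000 - step, -1, -step))


def retrieve_timesteps_sketch(scheduler, num_inference_steps=None, device=None,
                              timesteps=None, sigmas=None, **kwargs):
    if timesteps is not None and sigmas is not None:
        # ASSUMPTION: custom schedules are mutually exclusive in this sketch.
        raise ValueError("Pass only one of `timesteps` or `sigmas`.")
    if timesteps is not None:
        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
    elif sigmas is not None:
        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
    else:
        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
        return scheduler.timesteps, num_inference_steps
    # For custom schedules the step count is derived from the schedule.
    return scheduler.timesteps, len(scheduler.timesteps)


default_ts, default_n = retrieve_timesteps_sketch(ToyScheduler(), num_inference_steps=4)
custom_ts, custom_n = retrieve_timesteps_sketch(ToyScheduler(), timesteps=[900, 500, 100])
print(default_ts, default_n)
print(custom_ts, custom_n)
```

When a custom schedule is supplied, the returned step count comes from the schedule's length, so any `num_inference_steps` passed alongside it is ignored in this sketch.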