vllm_omni.diffusion.models.ming_flash_omni ¶
Modules:
| Name | Description |
|---|---|
| byte5_encoder | ByT5 glyph/text encoder for Ming-flash-omni-2.0 image generation. |
| condition_encoder | Ming-flash-omni-2.0 condition encoder for image generation. |
| ming_zimage_transformer | Ming-specific subclass of ZImageTransformer2DModel that supports |
| pipeline_ming_imagegen | Ming-flash-omni-2.0 imagegen (text-to-image / img2img) diffusion pipeline. |
| t5_block_mapper | T5EncoderBlockByT5Mapper: Ming's per-block T5 stack mapping byte5 features |
MingByT5Encoder ¶
Bases: Module
Bundles byte5 tokenizer + T5 encoder + T5EncoderBlockByT5Mapper.
Build with MingByT5Encoder.from_checkpoint(<model>/byte5) when the checkpoint ships byte5 weights; otherwise callers can skip this and the pipeline falls back to no-byte5 conditioning.
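The optional-loading rule above can be sketched as a small directory check. This is a minimal illustration, not the pipeline's actual code: `find_byte5_dir` is a hypothetical helper, and the subfolder name is passed as a parameter because the exact checkpoint layout is an assumption here.

```python
from pathlib import Path
from typing import Optional


def find_byte5_dir(model_dir: Path, subfolder: str = "byte5") -> Optional[Path]:
    """Return the ByT5 subfolder if the checkpoint ships one, else None.

    Hypothetical helper: the default subfolder name is an assumption
    about the checkpoint layout, not something read from the library.
    """
    candidate = Path(model_dir) / subfolder
    return candidate if candidate.is_dir() else None


# Intended use (sketch): a non-None result is the path that would be handed
# to MingByT5Encoder.from_checkpoint(byte5_dir, device=..., dtype=...);
# None means the caller skips the encoder and runs without byte5 conditioning.
```

Returning `None` rather than raising keeps the "no byte5" case an ordinary branch for the caller.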
forward ¶
Tokenize → T5 encode → mapper; masks out padded positions.
Returns [B, max_length, sdxl_channels]. Padded positions are zeroed so the downstream torch.cat with cap_feats doesn't inject garbage.
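To make the padding rule concrete, the sketch below zeroes every feature vector whose mask entry is 0. It uses plain Python lists in place of tensors so that it runs without any dependency; the function name and the `[B][max_length][channels]` list layout are illustrative assumptions, not the encoder's real code.

```python
from typing import List

Vector = List[float]


def zero_padded_positions(
    features: List[List[Vector]], attention_mask: List[List[int]]
) -> List[List[Vector]]:
    """Zero every feature vector whose mask entry is 0 (padding).

    features: [B][max_length][channels]; attention_mask: [B][max_length].
    """
    return [
        [
            # Replace the whole vector with zeros at padded positions so that
            # even NaN/inf values produced there cannot leak downstream.
            [value if keep else 0.0 for value in vector]
            for vector, keep in zip(sequence, mask_row)
        ]
        for sequence, mask_row in zip(features, attention_mask)
    ]
```

With tensors, the same effect is commonly obtained by broadcasting the mask over the channel dimension.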
from_checkpoint classmethod ¶
from_checkpoint(
    byte5_dir: Path, *, device: device, dtype: dtype
) -> MingByT5Encoder
MingConditionEncoder ¶
Bases: Module
Wraps a Qwen2 connector + norm/projection, producing DiT condition embeds.
The connector is a Qwen2ForCausalLM loaded from the connector/ subfolder of the Ming checkpoint. We run its base model in a non-causal (bidirectional) mode, because the connector is used as an encoder over the pre-baked query-token hidden states, not as an autoregressive decoder.
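The difference between the causal and the non-causal mode comes down to which key positions each query may attend to. The following sketch builds both masks as boolean matrices. It is only an illustration of the concept: the function names are hypothetical, and how the connector actually constructs or passes its mask is not shown in this reference.

```python
from typing import List


def causal_mask(seq_len: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Decoder-style: each position sees itself and earlier positions only.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def bidirectional_mask(key_is_valid: List[bool]) -> List[List[bool]]:
    """Encoder-style: every query attends to every non-padded key."""
    return [list(key_is_valid) for _ in key_is_valid]
```

In the bidirectional case only padding is masked, so early query tokens can use information from later ones, which is what an encoder over a fixed set of hidden states needs.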
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| image_gen_config | MingImageGenConfig | | required |
| thinker_hidden_size | int | Hidden size of the thinker (BailingMoeV2) model. Used to build a | 4096 |
| device | device \| str \| None | Placement for the module. | None |
| dtype | dtype \| None | Parameter dtype (typically bfloat16 / float16). | None |
forward ¶
Encode thinker hidden states into DiT condition embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| thinker_hidden_states | Tensor | | required |
| attention_mask | Tensor \| None | Optional | None |
Returns:
| Type | Description |
|---|---|
| Tensor | |
| Tensor | ZImage transformer's |
load_from_checkpoint ¶
Load the Qwen2 connector + optional projection/norm weights.
This uses HF transformers directly (not vllm's weight loader) because the connector is small (~1.5B params) and only runs once per request as an encoder — vllm's distributed loading machinery is overkill.
zero_negative ¶
Return a zero tensor shaped like cap_feats for CFG negatives.
MingImagePipeline ¶
Bases: ZImagePipeline
Ming-flash-omni-2.0 text-to-image diffusion pipeline.
Ming-specific components added on top of the inherited contract:
- condition_encoder: Qwen2 connector + proj_in/out + F.normalize×1000
- byte5: optional ByT5 glyph encoder (loaded if checkpoint ships byt5/)
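The F.normalize×1000 step named above can be illustrated on a single feature vector: L2-normalize it, then multiply by 1000. This is a dependency-free sketch; it assumes normalization runs over the feature dimension and mirrors the default `eps` of `torch.nn.functional.normalize`. The function name is hypothetical.

```python
import math
from typing import List


def normalize_and_scale(
    vector: List[float], scale: float = 1000.0, eps: float = 1e-12
) -> List[float]:
    """L2-normalize one feature vector, then multiply by `scale`.

    Follows v / max(||v||_2, eps), so an all-zero vector stays all-zero
    instead of dividing by zero.
    """
    norm = math.sqrt(sum(v * v for v in vector))
    denom = max(norm, eps)
    return [v / denom * scale for v in vector]
```

After this step every non-zero vector has the same L2 norm (`scale`), so only its direction carries information into the DiT.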
condition_encoder instance-attribute ¶
condition_encoder = MingConditionEncoder(
    image_gen_config,
    thinker_hidden_size=thinker_hidden_size,
    device=device,
    dtype=dtype,
)
image_processor instance-attribute ¶
scheduler instance-attribute ¶
scheduler = from_pretrained(
    model_path,
    subfolder=scheduler_subfolder,
    local_files_only=local_files_only,
)
weights_sources instance-attribute ¶
weights_sources = [
    ComponentSource(
        model_or_path=model_path,
        subfolder=transformer_subfolder,
        revision=revision,
        prefix="transformer.",
        fall_back_to_pt=True,
    ),
    ComponentSource(
        model_or_path=model_path,
        subfolder=vae_subfolder,
        revision=revision,
        prefix="vae.",
    ),
]
forward ¶
forward(req: OmniDiffusionRequest) -> DiffusionOutput
Run one text-to-image generation request.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| req | OmniDiffusionRequest | Diffusion request. The cross-stage thinker hidden states must be present at | required |
Returns:
| Type | Description |
|---|---|
| DiffusionOutput | DiffusionOutput with |
| DiffusionOutput | image tensor in |
| DiffusionOutput | output adapter converts this to PIL/base64 downstream. |
MingZImageTransformer2DModel ¶
T5EncoderBlockByT5Mapper ¶
Bases: ModelMixin
Stacks num_layers T5 encoder blocks on top of byte5 features and projects them to sdxl_channels (= Ming's diffusion_c_input_dim).
blocks instance-attribute ¶
blocks = ModuleList(
    [
        T5Block(
            byte5_config,
            has_relative_attention_bias=i == 0,
            prefix=f"blocks.{i}",
        )
        for i in range(num_layers)
    ]
)
final_layer_norm instance-attribute ¶
get_extended_attention_mask ¶
load_weights ¶
Load Ming's HF-format byte5_mapper checkpoint into the fused TP-aware layers.
Source format (from byte5_mapper.pt):
- blocks.{i}.layer.0.SelfAttention.{q,k,v,o}.weight
- blocks.{i}.layer.1.DenseReluDense.{wi_0,wi_1,wo}.weight
- blocks.{i}.layer.{0,1}.layer_norm.weight
- {layer_norm, channel_mapper, final_layer_norm}.{weight,bias}

Target format (after T5Block from t5_encoder.py):
- blocks.{i}.layer.0.SelfAttention.qkv_proj.weight (fused q+k+v)
- blocks.{i}.layer.1.DenseReluDense.wi.weight (fused wi_0+wi_1)
- (others identical)
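The source-to-target key mapping can be sketched as a plain dictionary transform. This is a simplified illustration under stated assumptions: weights are lists of rows standing in for tensors, the fused weight is formed by stacking the parts along the output dimension in the listed order (q, k, v and wi_0, wi_1), and tensor-parallel sharding is ignored entirely. `FUSED_GROUPS` and `fuse_state_dict` are hypothetical names, not part of the module.

```python
from typing import Dict, List

Matrix = List[List[float]]  # rows = output features, like a Linear weight

# Fused target suffix -> ordered source suffixes that are stacked to build it.
FUSED_GROUPS = {
    "SelfAttention.qkv_proj.weight": [
        "SelfAttention.q.weight",
        "SelfAttention.k.weight",
        "SelfAttention.v.weight",
    ],
    "DenseReluDense.wi.weight": [
        "DenseReluDense.wi_0.weight",
        "DenseReluDense.wi_1.weight",
    ],
}


def fuse_state_dict(source: Dict[str, Matrix]) -> Dict[str, Matrix]:
    """Rename and stack source weights into the fused target layout."""
    fused: Dict[str, Matrix] = {}
    consumed = set()
    for name in source:
        for target_suffix, part_suffixes in FUSED_GROUPS.items():
            # The first part of each group (q or wi_0) triggers the fusion.
            if name.endswith(part_suffixes[0]):
                prefix = name[: -len(part_suffixes[0])]
                rows: Matrix = []
                for part in part_suffixes:
                    rows.extend(source[prefix + part])  # stack along rows
                    consumed.add(prefix + part)
                fused[prefix + target_suffix] = rows
    # Everything that was not fused keeps its original name.
    for name, weight in source.items():
        if name not in consumed:
            fused[name] = weight
    return fused
```

Keys such as `SelfAttention.o.weight`, `DenseReluDense.wo.weight`, and the layer norms pass through unchanged, matching the "others identical" rule.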
get_ming_image_post_process_func ¶
get_ming_image_post_process_func(
    od_config: OmniDiffusionConfig,
)
Return a post-process callable that converts the raw VAE tensor to PIL.
The diffusion engine calls post_process_func(output_data), where output_data is the DiffusionOutput.output tensor returned by MingImagePipeline.forward. It has shape [B, 3, H, W] with values in [-1, 1] (Z-image VAE convention). We run the standard VaeImageProcessor postprocess to convert it to list[PIL.Image], which vllm-omni's OmniRequestOutput.from_diffusion then surfaces as omni_outputs.images for serving_chat to base64-encode.
Registered via _DIFFUSION_POST_PROCESS_FUNCS["MingImagePipeline"] in vllm_omni/diffusion/registry.py.
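The value-range conversion behind the post-processing step can be sketched without any dependency. The code below assumes the usual diffusers-style convention (shift [-1, 1] to [0, 1], clamp, scale to 0..255, round) and a channel-first to channel-last reorder before building an image. It operates on nested lists for a single image, and both function names are hypothetical.

```python
from typing import List


def to_uint8(value: float) -> int:
    """Map one VAE output value from [-1, 1] to an 8-bit pixel in [0, 255]."""
    unit = min(max(value * 0.5 + 0.5, 0.0), 1.0)  # denormalize, then clamp
    return int(round(unit * 255.0))


def chw_to_hwc_uint8(image: List[List[List[float]]]) -> List[List[List[int]]]:
    """Convert one [C][H][W] float image to [H][W][C] 8-bit pixels."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    return [
        [[to_uint8(image[c][y][x]) for c in range(channels)] for x in range(width)]
        for y in range(height)
    ]
```

Clamping before scaling means slightly out-of-range VAE outputs saturate to pure black or white instead of wrapping around.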