vllm_omni.diffusion.models.ming_flash_omni

Modules:

Name                     Description
byte5_encoder            ByT5 glyph/text encoder for Ming-flash-omni-2.0 image generation.
condition_encoder        Ming-flash-omni-2.0 condition encoder for image generation.
ming_zimage_transformer  Ming-specific subclass of ZImageTransformer2DModel that supports ref_x.
pipeline_ming_imagegen   Ming-flash-omni-2.0 imagegen (text-to-image / img2img) diffusion pipeline.
t5_block_mapper          T5EncoderBlockByT5Mapper: Ming's per-block T5 stack mapping byte5 features to sdxl_channels.

MingByT5Encoder

Bases: Module

Bundles byte5 tokenizer + T5 encoder + T5EncoderBlockByT5Mapper.

Build with MingByT5Encoder.from_checkpoint(<model>/byte5) when the checkpoint ships byte5 weights; otherwise callers can skip this and the pipeline falls back to no-byte5 conditioning.

mapper instance-attribute

mapper = mapper

max_length instance-attribute

max_length = max_length

text_encoder instance-attribute

text_encoder = text_encoder

tokenizer instance-attribute

tokenizer = tokenizer

forward

forward(texts: list[str]) -> Tensor

Tokenize → T5 encode → mapper; masks out padded positions.

Returns [B, max_length, sdxl_channels]. Padded positions are zeroed so the downstream torch.cat with cap_feats doesn't inject garbage.
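The masking step described above can be sketched as follows. This is a minimal illustration, not the encoder's actual code: the helper name zero_padded_positions and the toy shapes are assumptions made for the example.

```python
import torch


def zero_padded_positions(
    features: torch.Tensor, attention_mask: torch.Tensor
) -> torch.Tensor:
    """Zero the feature rows whose attention_mask entry is 0.

    features:       [B, L, C] mapper output
    attention_mask: [B, L], 1 for real tokens and 0 for padding
    (hypothetical helper, named for this example only)
    """
    # Broadcast the mask over the channel dimension: [B, L] -> [B, L, 1].
    mask = attention_mask.unsqueeze(-1).to(features.dtype)
    return features * mask


# Toy batch: the first text has two real tokens, the second has four.
features = torch.ones(2, 4, 3)
attention_mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])
masked = zero_padded_positions(features, attention_mask)
```

Because every padded row is exactly zero, concatenating the result with cap_feats along the sequence dimension adds only zero rows for those positions.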

from_checkpoint classmethod

from_checkpoint(
    byte5_dir: Path, *, device: device, dtype: dtype
) -> MingByT5Encoder

MingConditionEncoder

Bases: Module

Wraps a Qwen2 connector + norm/projection, producing DiT condition embeds.

The connector is a Qwen2ForCausalLM loaded from the connector/ subfolder of the Ming checkpoint. We run its base model in a non-causal (bidirectional) mode, because the connector is used as an encoder over the pre-baked query-token hidden states, not as an autoregressive decoder.
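To make the causal versus bidirectional distinction concrete, the sketch below computes single-head attention both ways on random tensors. It illustrates only the masking difference; it does not show how the connector switches modes, and all names and sizes are assumptions for the example.

```python
import torch

torch.manual_seed(0)
seq_len, dim = 4, 8
q = torch.randn(seq_len, dim)
k = torch.randn(seq_len, dim)
v = torch.randn(seq_len, dim)

# Scaled dot-product scores, shape [seq_len, seq_len].
scores = q @ k.T / dim**0.5

# Causal (decoder-style): position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
causal_scores = scores.masked_fill(~causal_mask, float("-inf"))
causal_out = torch.softmax(causal_scores, dim=-1) @ v

# Bidirectional (encoder-style): every position attends to every position.
bidir_out = torch.softmax(scores, dim=-1) @ v
```

Under the causal mask the first position sees only itself, so its output equals v[0]; in the bidirectional case it mixes information from all positions, which is what an encoder over a fixed set of query tokens needs.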

Parameters:

Name Type Description Default
image_gen_config MingImageGenConfig

MingImageGenConfig from MingFlashOmniConfig.

required
thinker_hidden_size int

Hidden size of the thinker (BailingMoeV2) model. Used to build a proj_in layer when the connector embedding dim differs. For the released checkpoint this is 4096.

4096
device device | str | None

Placement for the module.

None
dtype dtype | None

Parameter dtype (typically bfloat16 / float16).

None

config instance-attribute

config = image_gen_config

connector instance-attribute

connector: Module | None = None

connector_hidden_size instance-attribute

connector_hidden_size: int | None = None

norm instance-attribute

norm: Module = Identity()

proj_in instance-attribute

proj_in: Module = Identity()

proj_out instance-attribute

proj_out: Module = Identity()

thinker_hidden_size instance-attribute

thinker_hidden_size = thinker_hidden_size

extra_repr

extra_repr() -> str

forward

forward(
    thinker_hidden_states: Tensor,
    attention_mask: Tensor | None = None,
) -> Tensor

Encode thinker hidden states into DiT condition embeddings.

Parameters:

Name Type Description Default
thinker_hidden_states Tensor

[B, N, thinker_hidden_size] — sliced at the learnable query-token positions by the stage input processor before being passed here.

required
attention_mask Tensor | None

Optional [B, N] mask. Defaults to all-ones.

None

Returns:

Type Description
Tensor

[B, N, diffusion_c_input_dim] condition tensor ready for the ZImage transformer's cap_feats input.

load_from_checkpoint

load_from_checkpoint(model_path: str | Path) -> None

Load the Qwen2 connector + optional projection/norm weights.

This uses HF transformers directly (not vLLM's weight loader) because the connector is small (~1.5B params) and runs only once per request as an encoder, so vLLM's distributed loading machinery is unnecessary.

zero_negative

zero_negative(cap_feats: Tensor) -> Tensor

Return a zero tensor shaped like cap_feats for CFG negatives.

MingImagePipeline

Bases: ZImagePipeline

Ming-flash-omni-2.0 text-to-image diffusion pipeline.

Ming-specific components added on top of the inherited contract:
  • condition_encoder — Qwen2 connector + proj_in/out + F.normalize×1000
  • byte5 — Optional ByT5 glyph encoder (loaded if checkpoint ships byt5/)
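The "F.normalize×1000" step named in the first bullet can be read as an L2 normalization followed by a fixed rescale. The sketch below assumes the normalization runs over the channel dimension (dim=-1); the helper name and the toy shape are hypothetical, and where exactly the step sits relative to proj_out is not shown here.

```python
import torch
import torch.nn.functional as F


def normalize_and_scale(embeds: torch.Tensor, scale: float = 1000.0) -> torch.Tensor:
    """L2-normalize each token vector, then multiply by a constant.

    Assumption: normalization is over the last (channel) dimension.
    """
    return F.normalize(embeds, dim=-1) * scale


torch.manual_seed(0)
embeds = torch.randn(1, 5, 8)      # toy [B, N, C] condition embeddings
cond = normalize_and_scale(embeds)
norms = cond.norm(dim=-1)          # per-token L2 norm after rescaling
```

After this step every token vector has the same L2 norm (the scale value), so only its direction carries information.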

byte5 instance-attribute

byte5 = from_checkpoint(
    byte5_dir, device=device, dtype=dtype
)

condition_encoder instance-attribute

condition_encoder = MingConditionEncoder(
    image_gen_config,
    thinker_hidden_size=thinker_hidden_size,
    device=device,
    dtype=dtype,
)

device instance-attribute

device = _execution_device

image_gen_config instance-attribute

image_gen_config = MingImageGenConfig()

image_processor instance-attribute

image_processor = VaeImageProcessor(
    vae_scale_factor=vae_scale_factor * 2,
    do_convert_rgb=True,
)

od_config instance-attribute

od_config = od_config

scheduler instance-attribute

scheduler = from_pretrained(
    model_path,
    subfolder=scheduler_subfolder,
    local_files_only=local_files_only,
)

text_encoder instance-attribute

text_encoder = None

tokenizer instance-attribute

tokenizer = None

transformer instance-attribute

transformer = MingZImageTransformer2DModel(
    quant_config=None
)

vae instance-attribute

vae = to(_execution_device, dtype=dtype)

vae_scale_factor instance-attribute

vae_scale_factor = 2 ** (len(block_out_channels) - 1)
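As a worked example of the expression above, the sketch below uses a hypothetical VAE with four entries in block_out_channels. The channel values and the helper round_down_to_multiple are illustrative assumptions; the doubling mirrors the vae_scale_factor * 2 passed to VaeImageProcessor, presumably so that sizes also divide by the transformer's patch_size of 2.

```python
# Hypothetical VAE config with four blocks (values are illustrative only).
block_out_channels = [128, 256, 512, 512]

# Each block after the first halves the spatial resolution: 2 ** 3 = 8.
vae_scale_factor = 2 ** (len(block_out_channels) - 1)

# Value handed to the image processor in the pipeline: 8 * 2 = 16.
processor_scale = vae_scale_factor * 2


def round_down_to_multiple(size: int, multiple: int) -> int:
    """Largest multiple of `multiple` that is <= size (hypothetical helper)."""
    return size - size % multiple


# A requested height of 1030 pixels would be trimmed to a multiple of 16.
height = round_down_to_multiple(1030, processor_scale)
```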

weights_sources instance-attribute

weights_sources = [
    ComponentSource(
        model_or_path=model_path,
        subfolder=transformer_subfolder,
        revision=revision,
        prefix="transformer.",
        fall_back_to_pt=True,
    ),
    ComponentSource(
        model_or_path=model_path,
        subfolder=vae_subfolder,
        revision=revision,
        prefix="vae.",
    ),
]

forward

Run one text-to-image generation request.

Parameters:

Name Type Description Default
req OmniDiffusionRequest

Diffusion request. The cross-stage thinker hidden states must be present at req.prompts[0]["extra"]["thinker_hidden_states"] as a [N, H] (or [1, N, H]) tensor, placed there by thinker2imagegen.

required
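The accepted input shapes can be handled with a small normalization step, sketched below. The helper as_batched and the plain dict standing in for one entry of req.prompts are assumptions for illustration; the real request object and its validation may differ.

```python
import torch


def as_batched(hidden: torch.Tensor) -> torch.Tensor:
    """Accept [N, H] or [1, N, H] and always return [1, N, H].

    Hypothetical helper, not part of the pipeline's public API.
    """
    if hidden.dim() == 2:
        hidden = hidden.unsqueeze(0)  # [N, H] -> [1, N, H]
    if hidden.dim() != 3 or hidden.shape[0] != 1:
        raise ValueError(
            f"expected [N, H] or [1, N, H], got {tuple(hidden.shape)}"
        )
    return hidden


# Plain dict standing in for req.prompts[0]; N=16 and H=4096 are toy values.
prompt = {"extra": {"thinker_hidden_states": torch.zeros(16, 4096)}}
hidden = as_batched(prompt["extra"]["thinker_hidden_states"])
```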

Returns:

Type Description
DiffusionOutput

DiffusionOutput with .output set to a [B, 3, H, W] image tensor in [-1, 1]. The vllm-omni diffusion engine's output adapter converts this to PIL/base64 downstream.

MingZImageTransformer2DModel

Bases: ZImageTransformer2DModel

ZImage DiT with Ming's reference-latent support.

forward

forward(
    x: list[Tensor],
    t,
    cap_feats: list[Tensor],
    patch_size=2,
    f_patch_size=1,
)

unpatchify

unpatchify(
    x: list[Tensor],
    size: list[tuple],
    patch_size,
    f_patch_size,
) -> list[Tensor]

T5EncoderBlockByT5Mapper

Bases: ModelMixin

Stacks num_layers T5 encoder blocks on top of byte5 features and projects them to sdxl_channels (= Ming's diffusion_c_input_dim).

blocks instance-attribute

blocks = ModuleList(
    [
        (
            T5Block(
                byte5_config,
                has_relative_attention_bias=i == 0,
                prefix=f"blocks.{i}",
            )
        )
        for i in (range(num_layers))
    ]
)

channel_mapper instance-attribute

channel_mapper = Linear(d_model, sdxl_channels)

final_layer_norm instance-attribute

final_layer_norm = RMSNorm(
    sdxl_channels, eps=layer_norm_epsilon
)

layer_norm instance-attribute

layer_norm = RMSNorm(d_model, eps=layer_norm_epsilon)

forward

forward(
    inputs_embeds: Tensor, attention_mask: Tensor
) -> Tensor

get_extended_attention_mask

get_extended_attention_mask(
    attention_mask: Tensor, dtype: dtype
) -> Tensor

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load Ming's HF-format byte5_mapper checkpoint into the fused TP-aware layers.

Source format (from byte5_mapper.pt):
  blocks.{i}.layer.0.SelfAttention.{q,k,v,o}.weight
  blocks.{i}.layer.1.DenseReluDense.{wi_0,wi_1,wo}.weight
  blocks.{i}.layer.{0,1}.layer_norm.weight
  {layer_norm, channel_mapper, final_layer_norm}.{weight,bias}

Target format (after T5Block from t5_encoder.py):
  blocks.{i}.layer.0.SelfAttention.qkv_proj.weight (fused q+k+v)
  blocks.{i}.layer.1.DenseReluDense.wi.weight (fused wi_0+wi_1)
  (others identical)
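The name translation between the two formats can be sketched with a small lookup table. The function remap and the shard identifiers ("q"/"k"/"v" and 0/1) are assumptions for this example, modeled on how fused layers are commonly addressed; the actual load_weights implementation may differ.

```python
# Source fragment -> (fused target fragment, shard id). The name layout is
# taken from the formats listed above; the shard ids are illustrative.
_FUSED = {
    "SelfAttention.q": ("SelfAttention.qkv_proj", "q"),
    "SelfAttention.k": ("SelfAttention.qkv_proj", "k"),
    "SelfAttention.v": ("SelfAttention.qkv_proj", "v"),
    "DenseReluDense.wi_0": ("DenseReluDense.wi", 0),
    "DenseReluDense.wi_1": ("DenseReluDense.wi", 1),
}


def remap(name: str):
    """Return (target_name, shard_id); shard_id is None for unfused params."""
    for src, (dst, shard) in _FUSED.items():
        needle = f".{src}.weight"
        if needle in name:
            return name.replace(needle, f".{dst}.weight"), shard
    # Everything else (o, wo, layer norms, channel_mapper) keeps its name.
    return name, None


example = remap("blocks.0.layer.0.SelfAttention.q.weight")
```

A loader built on such a mapping would copy each source tensor into the slice of the fused parameter selected by the shard id, and copy unfused tensors through unchanged.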

get_ming_image_post_process_func

get_ming_image_post_process_func(
    od_config: OmniDiffusionConfig,
)

Return a post-process callable that converts the raw VAE tensor to PIL.

The diffusion engine calls post_process_func(output_data), where output_data is the DiffusionOutput.output tensor returned by MingImagePipeline.forward. It has shape [B, 3, H, W] with values in [-1, 1] (Z-image VAE convention). We run the standard VaeImageProcessor postprocess to convert it to list[PIL.Image], which vllm-omni's OmniRequestOutput.from_diffusion then surfaces as omni_outputs.images for serving_chat to base64-encode.

Registered via _DIFFUSION_POST_PROCESS_FUNCS["MingImagePipeline"] in vllm_omni/diffusion/registry.py.
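The value-range conversion behind the post-process step can be illustrated per pixel in plain Python. This is a sketch of the usual [-1, 1] to 8-bit mapping (denormalize, clamp, scale, round), not the VaeImageProcessor code itself; the function name to_uint8 is hypothetical.

```python
def to_uint8(value: float) -> int:
    """Map one pixel value from [-1, 1] to an integer in [0, 255]."""
    # Denormalize from [-1, 1] to [0, 1], clamping out-of-range values.
    unit = min(max(value / 2 + 0.5, 0.0), 1.0)
    # Scale to the 8-bit range and round to the nearest integer.
    return round(unit * 255)


# Includes one out-of-range sample (1.3) to show the clamp.
samples = [-1.0, 0.0, 1.0, 1.3]
pixels = [to_uint8(v) for v in samples]
```

The real post-process applies the same arithmetic to the whole [B, 3, H, W] tensor and then builds one PIL image per batch element.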