vllm_omni.diffusion.models.ming_flash_omni ¶

Modules:

Name	Description
`byte5_encoder`	ByT5 glyph/text encoder for Ming-flash-omni-2.0 image generation.
`condition_encoder`	Ming-flash-omni-2.0 condition encoder for image generation.
`ming_zimage_transformer`	Ming-specific subclass of ZImageTransformer2DModel that supports `ref_x`.
`pipeline_ming_imagegen`	Ming-flash-omni-2.0 imagegen (text-to-image / img2img) diffusion pipeline.
`t5_block_mapper`	T5EncoderBlockByT5Mapper: Ming's per-block T5 stack that maps byte5 features to `sdxl_channels`.

MingByT5Encoder ¶

Bases: Module

Bundles byte5 tokenizer + T5 encoder + T5EncoderBlockByT5Mapper.

Build with MingByT5Encoder.from_checkpoint(<model>/byte5) when the checkpoint ships byte5 weights; otherwise callers can skip this and the pipeline falls back to no-byte5 conditioning.
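The optional-loading behaviour described above might be sketched as follows. This is a minimal illustration, not the pipeline's own code: `find_byte5_dir` is a hypothetical helper, and the subfolder name is passed in by the caller rather than assumed.

```python
from __future__ import annotations

from pathlib import Path


def find_byte5_dir(model_path: Path, subfolder: str) -> Path | None:
    """Return the byte5 subfolder if the checkpoint ships one, else None.

    Hypothetical helper: a caller would pass a non-None result to
    MingByT5Encoder.from_checkpoint and skip byte5 conditioning otherwise.
    """
    candidate = model_path / subfolder
    return candidate if candidate.is_dir() else None
```

A `None` result corresponds to the no-byte5 fallback: the pipeline simply runs without glyph conditioning.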

mapper `instance-attribute` ¶

mapper = mapper

max_length `instance-attribute` ¶

max_length = max_length

text_encoder `instance-attribute` ¶

text_encoder = text_encoder

tokenizer `instance-attribute` ¶

tokenizer = tokenizer

forward ¶

forward(texts: list[str]) -> Tensor

Tokenize → T5 encode → mapper; masks out padded positions.

Returns [B, max_length, sdxl_channels]. Padded positions are zeroed so the downstream torch.cat with cap_feats doesn't inject garbage.
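The zeroing of padded positions can be illustrated in plain Python. Nested lists stand in for tensors here, and `zero_padded` is a hypothetical name; with real tensors the same effect is usually a broadcasted multiply such as `feats * mask.unsqueeze(-1)`.

```python
from __future__ import annotations


def zero_padded(
    features: list[list[list[float]]],
    attention_mask: list[list[int]],
) -> list[list[list[float]]]:
    """Zero every feature vector whose mask entry is 0.

    features is [B][L][C]; attention_mask is [B][L] with 1 marking a real token.
    """
    return [
        [[value * keep for value in vector] for vector, keep in zip(row, mask_row)]
        for row, mask_row in zip(features, attention_mask)
    ]


feats = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]  # B=1, L=3, C=2
mask = [[1, 1, 0]]                              # last position is padding
print(zero_padded(feats, mask))
# [[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]]
```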

from_checkpoint `classmethod` ¶

from_checkpoint(
    byte5_dir: Path, *, device: device, dtype: dtype
) -> MingByT5Encoder

MingConditionEncoder ¶

Bases: Module

Wraps a Qwen2 connector + norm/projection, producing DiT condition embeds.

The connector is a Qwen2ForCausalLM loaded from the connector/ subfolder of the Ming checkpoint. We run its base model in a non-causal (bidirectional) mode, because the connector is used as an encoder over the pre-baked query-token hidden states, not as an autoregressive decoder.
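The difference between causal and bidirectional attention can be illustrated with plain boolean masks. This is only a conceptual sketch: `attention_allowed` is a hypothetical helper and does not show how the connector is actually configured.

```python
from __future__ import annotations


def attention_allowed(seq_len: int, causal: bool) -> list[list[bool]]:
    """allowed[i][j] is True when query position i may attend to key position j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Causal (decoder-style): lower-triangular, each token sees itself and earlier tokens.
for row in attention_allowed(3, causal=True):
    print(row)

# Bidirectional (encoder-style): every query token sees every other query token.
for row in attention_allowed(3, causal=False):
    print(row)
```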

Parameters:

Name	Type	Description	Default
`image_gen_config`	`MingImageGenConfig`	`MingImageGenConfig` from `MingFlashOmniConfig`.	required
`thinker_hidden_size`	`int`	Hidden size of the thinker (BailingMoeV2) model. Used to build a `proj_in` layer when the connector embedding dim differs. For the released checkpoint this is 4096.	`4096`
`device`	`device \| str \| None`	Placement for the module.	`None`
`dtype`	`dtype \| None`	Parameter dtype (typically bfloat16 / float16).	`None`

config `instance-attribute` ¶

config = image_gen_config

connector `instance-attribute` ¶

connector: Module | None = None

connector_hidden_size `instance-attribute` ¶

connector_hidden_size: int | None = None

norm `instance-attribute` ¶

norm: Module = nn.Identity()

proj_in `instance-attribute` ¶

proj_in: Module = nn.Identity()

proj_out `instance-attribute` ¶

proj_out: Module = nn.Identity()

thinker_hidden_size `instance-attribute` ¶

thinker_hidden_size = thinker_hidden_size

extra_repr ¶

extra_repr() -> str

forward ¶

forward(
    thinker_hidden_states: Tensor,
    attention_mask: Tensor | None = None,
) -> Tensor

Encode thinker hidden states into DiT condition embeddings.

Parameters:

Name	Type	Description	Default
`thinker_hidden_states`	`Tensor`	`[B, N, thinker_hidden_size]` — sliced at the learnable query-token positions by the stage input processor before being passed here.	required
`attention_mask`	`Tensor \| None`	Optional `[B, N]` mask. Defaults to all-ones.	`None`

Returns:

Type	Description
`Tensor`	`[B, N, diffusion_c_input_dim]` condition tensor ready for the ZImage transformer's `cap_feats` input.

load_from_checkpoint ¶

load_from_checkpoint(model_path: str | Path) -> None

Load the Qwen2 connector + optional projection/norm weights.

This uses HF transformers directly (not vLLM's weight loader) because the connector is small (~1.5B params) and runs only once per request as an encoder, so vLLM's distributed loading machinery would be overkill.

zero_negative ¶

zero_negative(cap_feats: Tensor) -> Tensor

Return a zero tensor shaped like cap_feats for CFG negatives.

MingImagePipeline ¶

Bases: ZImagePipeline

Ming-flash-omni-2.0 text-to-image diffusion pipeline.

Ming-specific components added on top of the inherited contract:

condition_encoder: Qwen2 connector + proj_in/out + F.normalize×1000
byte5: optional ByT5 glyph encoder (loaded if checkpoint ships byt5/)
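`F.normalize×1000` presumably means L2-normalizing each condition vector and then scaling it by 1000. The plain-Python sketch below follows that reading. The helper name `normalize_and_scale` and the choice of normalization axis are assumptions; the `eps` default mirrors that of `torch.nn.functional.normalize`.

```python
from __future__ import annotations

import math


def normalize_and_scale(
    vector: list[float], scale: float = 1000.0, eps: float = 1e-12
) -> list[float]:
    """L2-normalize one feature vector, then multiply by `scale`.

    The norm is clamped to at least `eps` so an all-zero vector stays zero
    instead of dividing by zero.
    """
    norm = math.sqrt(sum(v * v for v in vector))
    denom = max(norm, eps)
    return [scale * v / denom for v in vector]


print(normalize_and_scale([3.0, 4.0]))  # [600.0, 800.0]
```

After this step every condition vector has the same length (1000 under these assumptions), so only its direction carries information.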

byte5 `instance-attribute` ¶

byte5 = MingByT5Encoder.from_checkpoint(
    byte5_dir, device=self.device, dtype=dtype
)

condition_encoder `instance-attribute` ¶

condition_encoder = MingConditionEncoder(
    self.image_gen_config,
    thinker_hidden_size=self.image_gen_config.thinker_hidden_size,
    device=self.device,
    dtype=dtype,
)

device `instance-attribute` ¶

device = self._execution_device

image_gen_config `instance-attribute` ¶

image_gen_config = MingImageGenConfig()

image_processor `instance-attribute` ¶

image_processor = VaeImageProcessor(
    vae_scale_factor=self.vae_scale_factor * 2,
    do_convert_rgb=True,
)

od_config `instance-attribute` ¶

od_config = od_config

scheduler `instance-attribute` ¶

scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(
    model_path,
    subfolder=self.image_gen_config.scheduler_subfolder,
    local_files_only=local_files_only,
)

supports_request_batch `class-attribute` `instance-attribute` ¶

supports_request_batch = False

text_encoder `instance-attribute` ¶

text_encoder = None

tokenizer `instance-attribute` ¶

tokenizer = None

transformer `instance-attribute` ¶

transformer = MingZImageTransformer2DModel(
    quant_config=None
)

vae `instance-attribute` ¶

vae = DistributedAutoencoderKL.from_config(vae_config).to(
    self._execution_device, dtype=dtype
)

vae_scale_factor `instance-attribute` ¶

vae_scale_factor = 2 ** (
    len(self.vae.config.block_out_channels) - 1
)
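As a worked example of the expression above: the four-entry `block_out_channels` below is an assumed, illustrative value, not one read from the Ming checkpoint.

```python
block_out_channels = (128, 256, 512, 512)  # assumed example VAE config

# One factor of 2 per entry after the first: 2 ** 3 = 8.
vae_scale_factor = 2 ** (len(block_out_channels) - 1)

# image_processor is constructed with twice that value.
processor_scale = vae_scale_factor * 2

print(vae_scale_factor, processor_scale)  # 8 16
```

The extra factor of 2 passed to `VaeImageProcessor` matches the `patch_size=2` default of the transformer's `forward`, which suggests it keeps image sizes divisible by both the VAE downsampling and the DiT patch size.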

weights_sources `instance-attribute` ¶

weights_sources = [
    DiffusersPipelineLoader.ComponentSource(
        model_or_path=model_path,
        subfolder=self.image_gen_config.transformer_subfolder,
        revision=od_config.revision,
        prefix="transformer.",
        fall_back_to_pt=True,
    ),
    DiffusersPipelineLoader.ComponentSource(
        model_or_path=model_path,
        subfolder=self.image_gen_config.vae_subfolder,
        revision=od_config.revision,
        prefix="vae.",
    ),
]

encode_prompt ¶

encode_prompt(*args, **kwargs)

Return Ming's precomputed conditioning instead of encoding text.

NOTE: Ming has no Z-Image text_encoder; its conditioning (cap_feats, optionally ByT5-augmented) is computed in forward and stashed on self._pending_* immediately before the super().forward call, so we simply hand it back here.
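The stash-and-hand-back pattern in the note can be sketched with two toy classes. `BasePipeline` and `StashingPipeline` are hypothetical stand-ins, not the real `ZImagePipeline` or `MingImagePipeline`, and clearing the stash in `finally` is a choice made for this sketch only.

```python
from __future__ import annotations


class BasePipeline:
    """Stand-in parent: its forward() asks encode_prompt() for conditioning."""

    def encode_prompt(self, prompt: str) -> list[float]:
        raise NotImplementedError("the parent would run a text encoder here")

    def forward(self, prompt: str) -> dict:
        cond = self.encode_prompt(prompt)
        return {"conditioning": cond}


class StashingPipeline(BasePipeline):
    def __init__(self) -> None:
        self._pending_cond: list[float] | None = None

    def encode_prompt(self, *args, **kwargs) -> list[float]:
        # Ignore the arguments and hand back whatever forward() stashed.
        if self._pending_cond is None:
            raise RuntimeError("encode_prompt called before forward() stashed conditioning")
        return self._pending_cond

    def forward(self, precomputed: list[float]) -> dict:
        self._pending_cond = precomputed       # stash right before delegating
        try:
            return super().forward(prompt="")  # parent calls self.encode_prompt(...)
        finally:
            self._pending_cond = None          # avoid leaking state across requests
```

The parent's control flow is reused unchanged; only the source of the conditioning differs.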

forward ¶

forward(req: DiffusionRequestBatch) -> DiffusionOutput

Run one text-to-image generation request.

Parameters:

Name	Type	Description	Default
`req`	`DiffusionRequestBatch`	Single-request batch. The cross-stage thinker hidden states must be present at `req.prompts[0]["extra"]["thinker_hidden_states"]` as a `[N, H]` (or `[1, N, H]`) tensor, placed there by `thinker2imagegen`.	required

Returns:

Type	Description
`DiffusionOutput`	One DiffusionOutput with `.output` set to a `[B, 3, H, W]` image tensor in `[-1, 1]`. The vllm-omni diffusion engine's output adapter converts this to PIL/base64 downstream.
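Reading the cross-stage payload might look like the following sketch. `SimpleNamespace` stands in for `DiffusionRequestBatch`, nested lists stand in for the tensor, and `get_thinker_hidden_states` is a hypothetical helper that is not part of vllm-omni.

```python
from __future__ import annotations

from types import SimpleNamespace


def get_thinker_hidden_states(req):
    """Fetch the thinker hidden states from a single-request batch."""
    if len(req.prompts) != 1:
        raise ValueError("expected a single-request batch")
    extra = req.prompts[0].get("extra") or {}
    if "thinker_hidden_states" not in extra:
        raise KeyError("thinker_hidden_states missing from prompts[0]['extra']")
    return extra["thinker_hidden_states"]


# Toy request: one prompt whose payload is a [N=1, H=2] nested list.
req = SimpleNamespace(prompts=[{"extra": {"thinker_hidden_states": [[0.0, 1.0]]}}])
print(get_thinker_hidden_states(req))  # [[0.0, 1.0]]
```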

MingZImageTransformer2DModel ¶

Bases: ZImageTransformer2DModel

ZImage DiT with Ming's reference-latent support.

forward ¶

forward(
    x: list[Tensor],
    t,
    cap_feats: list[Tensor],
    patch_size=2,
    f_patch_size=1,
)

unpatchify ¶

unpatchify(
    x: list[Tensor],
    size: list[tuple],
    patch_size,
    f_patch_size,
) -> list[Tensor]

T5EncoderBlockByT5Mapper ¶

Bases: ModelMixin

Stacks num_layers T5 encoder blocks on top of byte5 features and projects them to sdxl_channels (= Ming's diffusion_c_input_dim).

blocks `instance-attribute` ¶

blocks = nn.ModuleList(
    [
        (
            T5Block(
                byte5_config,
                has_relative_attention_bias=i == 0,
                prefix=f"blocks.{i}",
            )
        )
        for i in (range(num_layers))
    ]
)

channel_mapper `instance-attribute` ¶

channel_mapper = nn.Linear(
    byte5_config.d_model, sdxl_channels
)

final_layer_norm `instance-attribute` ¶

final_layer_norm = RMSNorm(
    sdxl_channels, eps=byte5_config.layer_norm_epsilon
)

layer_norm `instance-attribute` ¶

layer_norm = RMSNorm(
    byte5_config.d_model,
    eps=byte5_config.layer_norm_epsilon,
)

forward ¶

forward(
    inputs_embeds: Tensor, attention_mask: Tensor
) -> Tensor

get_extended_attention_mask ¶

get_extended_attention_mask(
    attention_mask: Tensor, dtype: dtype
) -> Tensor

load_weights ¶

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load Ming's HF-format byte5_mapper checkpoint into the fused TP-aware layers.

Source format (from byte5_mapper.pt):

    blocks.{i}.layer.0.SelfAttention.{q,k,v,o}.weight
    blocks.{i}.layer.1.DenseReluDense.{wi_0,wi_1,wo}.weight
    blocks.{i}.layer.{0,1}.layer_norm.weight
    {layer_norm, channel_mapper, final_layer_norm}.{weight,bias}

Target format (after T5Block from t5_encoder.py):

    blocks.{i}.layer.0.SelfAttention.qkv_proj.weight    (fused q+k+v)
    blocks.{i}.layer.1.DenseReluDense.wi.weight         (fused wi_0+wi_1)
    (others identical)
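The key renaming implied by the two formats can be sketched as a pure string mapping. This is an illustration only: `map_key` is a hypothetical helper, and returning a shard identifier (`"q"`/`"k"`/`"v"` or `0`/`1`) is an assumed convention for telling a loader which slice of the fused parameter to fill.

```python
from __future__ import annotations

import re

# Separate q/k/v projections fold into a single fused qkv_proj parameter.
_QKV = re.compile(r"^(blocks\.\d+\.layer\.0\.SelfAttention)\.([qkv])\.weight$")
# wi_0 and wi_1 fold into a single fused wi parameter.
_WI = re.compile(r"^(blocks\.\d+\.layer\.1\.DenseReluDense)\.wi_([01])\.weight$")


def map_key(source_key: str) -> tuple[str, str | int | None]:
    """Map a source checkpoint key to (target_key, shard_id).

    shard_id is None for parameters that are copied over unchanged.
    """
    m = _QKV.match(source_key)
    if m:
        return f"{m.group(1)}.qkv_proj.weight", m.group(2)
    m = _WI.match(source_key)
    if m:
        return f"{m.group(1)}.wi.weight", int(m.group(2))
    return source_key, None


print(map_key("blocks.0.layer.0.SelfAttention.k.weight"))
# ('blocks.0.layer.0.SelfAttention.qkv_proj.weight', 'k')
```

Keys such as `SelfAttention.o.weight`, `DenseReluDense.wo.weight`, and the norm and `channel_mapper` parameters pass through unchanged.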

get_ming_image_post_process_func ¶

get_ming_image_post_process_func(
    od_config: OmniDiffusionConfig,
)

Return a post-process callable that converts the raw VAE tensor to PIL.

The diffusion engine calls post_process_func(output_data), where output_data is the DiffusionOutput.output tensor returned by MingImagePipeline.forward, with shape [B, 3, H, W] and values in [-1, 1] (Z-image VAE convention). The standard VaeImageProcessor postprocess converts it to list[PIL.Image], which vllm-omni's OmniRequestOutput.from_diffusion then surfaces as omni_outputs.images for serving_chat to base64-encode.

Registered via _DIFFUSION_POST_PROCESS_FUNCS["MingImagePipeline"] in vllm_omni/diffusion/registry.py.
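The value conversion and the name-keyed lookup can be sketched together in plain Python. Everything here is a stand-in: `POST_PROCESS_FUNCS` imitates a registry keyed by pipeline name, a flat list of floats replaces the `[B, 3, H, W]` tensor, and 8-bit integers replace PIL images. The real implementation delegates to `VaeImageProcessor`.

```python
from __future__ import annotations

from typing import Callable


def to_uint8(value: float) -> int:
    """Map one sample from [-1, 1] to an 8-bit pixel value, clamping out-of-range input."""
    unit = min(max(value / 2 + 0.5, 0.0), 1.0)  # [-1, 1] -> [0, 1]
    return round(unit * 255)


# Hypothetical registry: pipeline class name -> factory returning a post-process callable.
POST_PROCESS_FUNCS: dict[str, Callable[..., Callable]] = {}


def get_post_process_func(od_config: object) -> Callable[[list[float]], list[int]]:
    def post_process(output_data: list[float]) -> list[int]:
        return [to_uint8(v) for v in output_data]

    return post_process


POST_PROCESS_FUNCS["MingImagePipeline"] = get_post_process_func

post_process = POST_PROCESS_FUNCS["MingImagePipeline"](od_config=None)
print(post_process([-1.0, 0.5, 1.0, 1.7]))  # [0, 191, 255, 255]
```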
