vllm_omni.diffusion.models.ming_flash_omni.condition_encoder ¶
Ming-flash-omni-2.0 condition encoder for image generation.
Pipeline (runs inside the imagegen stage):
thinker hidden states at query-token positions [B, N, 4096]
│
▼ proj_in (Linear, bias=True)
[B, N, 1536]
│
▼ Qwen2ForCausalLM connector (is_causal=False)
loaded from <checkpoint>/connector/
[B, N, 1536]
│
▼ proj_out (Linear, bias=True)
[B, N, 2560]
│
▼ F.normalize(dim=-1) × 1000 (text_encoder_norm)
[B, N, 2560]
│
▼
cap_feats consumed by ZImageTransformer2DModel
MingConditionEncoder ¶
Bases: Module
Wraps a Qwen2 connector + norm/projection, producing DiT condition embeds.
The connector is a Qwen2ForCausalLM loaded from the connector/ subfolder of the Ming checkpoint. We run its base model in a non-causal (bidirectional) mode, because the connector is used as an encoder over the pre-baked query-token hidden states, not as an autoregressive decoder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
image_gen_config | MingImageGenConfig |
| required |
thinker_hidden_size | int | Hidden size of the thinker (BailingMoeV2) model. Used to build a | 4096 |
device | device | str | None | Placement for the module. | None |
dtype | dtype | None | Parameter dtype (typically bfloat16 / float16). | None |
forward ¶
Encode thinker hidden states into DiT condition embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
thinker_hidden_states | Tensor |
| required |
attention_mask | Tensor | None | Optional | None |
Returns:
| Type | Description |
|---|---|
Tensor |
|
Tensor | ZImage transformer's |
load_from_checkpoint ¶
Load the Qwen2 connector + optional projection/norm weights.
This uses HF transformers directly (not vllm's weight loader) because the connector is small (~1.5B params) and only runs once per request as an encoder — vllm's distributed loading machinery is overkill.
zero_negative ¶
Return a zero tensor shaped like cap_feats for CFG negatives.