vllm_omni.transformers_utils.configs.ming_flash_omni ¶
Configuration for the Ming-flash-omni-2.0 model.
BailingMM2Config ¶
Bases: PretrainedConfig
audio_config instance-attribute ¶
audio_config = (
    WhisperEncoderConfig(**audio_config)
    if isinstance(audio_config, dict)
    else audio_config
)
ignore_keys_at_rope_validation class-attribute instance-attribute ¶
llm_config instance-attribute ¶
llm_config = (
    BailingMoeV2Config(**llm_config)
    if isinstance(llm_config, dict)
    else llm_config
)
vision_config instance-attribute ¶
vision_config = (
    Qwen3VLMoeVisionConfig(**vision_config)
    if isinstance(vision_config, dict)
    else vision_config
)
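The audio_config, llm_config, and vision_config attributes all follow the same rule: a dict is expanded into the matching config class, and any other value is stored unchanged. The sketch below reproduces only that rule. ToyAudioConfig, its d_model field, and the coerce helper are hypothetical stand-ins introduced for illustration; they are not part of vllm_omni.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyAudioConfig:
    # Hypothetical stand-in for WhisperEncoderConfig; the field is illustrative only.
    d_model: int = 1280


def coerce(value: Any, config_cls: type) -> Any:
    """Build config_cls from a dict; return any other value unchanged."""
    return config_cls(**value) if isinstance(value, dict) else value


# A plain dict (for example, one parsed from JSON) becomes a config object.
from_dict = coerce({"d_model": 512}, ToyAudioConfig)

# An existing config object passes through untouched.
existing = ToyAudioConfig(d_model=256)
from_instance = coerce(existing, ToyAudioConfig)

# Non-dict values such as None are also passed through, not replaced by defaults.
unset = coerce(None, ToyAudioConfig)
```

One consequence of this pattern is that a missing sub-config is not silently replaced by a default instance, so callers that rely on defaults would need to handle that case themselves.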
BailingMoeV2Config ¶
MingFlashOmniConfig ¶
Bases: PretrainedConfig
Configuration class for the unified Ming-flash-omni-2.0 model.
sub_configs class-attribute instance-attribute ¶
sub_configs: ClassVar = {
    "thinker_config": BailingMM2Config,
    "image_gen_config": MingImageGenConfig,
    "talker_config": MingFlashOmniTalkerConfig,
}
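The sub_configs mapping pairs each stage attribute with the class used to build it. As a rough illustration of how such a mapping can drive construction of a composite config, the sketch below walks the mapping and converts nested dicts into stage objects. All Toy* classes are hypothetical stand-ins, and the constructor logic is an assumption for illustration, not the actual MingFlashOmniConfig implementation.

```python
from typing import Any, ClassVar


class ToyStageConfig:
    """Hypothetical stand-in for a per-stage config class."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


class ToyThinkerConfig(ToyStageConfig):
    pass


class ToyImageGenConfig(ToyStageConfig):
    pass


class ToyTalkerConfig(ToyStageConfig):
    pass


class ToyOmniConfig:
    # Same keys as the documented mapping, with toy classes as values.
    sub_configs: ClassVar[dict] = {
        "thinker_config": ToyThinkerConfig,
        "image_gen_config": ToyImageGenConfig,
        "talker_config": ToyTalkerConfig,
    }

    def __init__(self, **kwargs: Any) -> None:
        # Assumed behavior: dicts are expanded into the mapped class,
        # anything else (including a missing key) is stored as given.
        for name, config_cls in self.sub_configs.items():
            value = kwargs.get(name)
            if isinstance(value, dict):
                value = config_cls(**value)
            setattr(self, name, value)


cfg = ToyOmniConfig(
    thinker_config={"hidden_size": 8},
    talker_config=ToyTalkerConfig(sample_rate=16000),
)
```

Keeping the stage names and classes in a single mapping means a new stage can be added in one place without touching the construction loop.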
MingFlashOmniTalkerConfig ¶
Bases: PretrainedConfig
Configuration class for Ming-flash-omni-2.0 talker (TTS) stage.
The talker uses a Qwen2 LLM backbone with Conditional Flow Matching (CFM) through a DiT (Diffusion Transformer), plus an Aggregator that maps generated audio latents back into the LLM embedding space for autoregressive generation.
MingImageGenConfig ¶
Bases: PretrainedConfig
Configuration for Ming-flash-omni-2.0 image generation stage.
Mirrors the layout of the Hugging Face checkpoint at https://huggingface.co/inclusionAI/Ming-flash-omni-2.0, where the image-generation components live in sibling subfolders (connector/, transformer/, vae/, scheduler/, mlp/).
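To make the sibling-subfolder layout concrete, the sketch below resolves one path per component under a local checkpoint root. The subfolder names come from the description above; the image_gen_paths helper and the example root directory are hypothetical, and the code only composes paths without checking that they exist or reading any files.

```python
from pathlib import Path

# Subfolder names as listed in the MingImageGenConfig description.
IMAGE_GEN_SUBFOLDERS = ("connector", "transformer", "vae", "scheduler", "mlp")


def image_gen_paths(checkpoint_root: str) -> dict:
    """Map each image-gen component name to its subfolder under the root."""
    root = Path(checkpoint_root)
    return {name: root / name for name in IMAGE_GEN_SUBFOLDERS}


# Hypothetical local directory holding a downloaded checkpoint.
paths = image_gen_paths("Ming-flash-omni-2.0")
```

Because the components are siblings, every resolved path shares the same parent directory, which is the checkpoint root itself.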
Qwen3VLMoeVisionConfig ¶
Bases: PretrainedConfig
Configuration class for the Qwen3 MoE Vision Transformer.
WhisperEncoderConfig ¶
Bases: PretrainedConfig
Configuration class for the Whisper audio encoder.