vllm_omni.model_executor.models.ming_flash_omni ¶
Modules:
| Name | Description |
|---|---|
| audio_encoder | |
| audio_vae | |
| ming_flash_omni | Ming-flash-omni-2.0 thinker / image-gen wrapper. |
| ming_flash_omni_talker | Ming-flash-omni-2.0 talker (TTS) stage model. |
| ming_flash_omni_thinker | Ming-flash-omni-2.0 Thinker stage implementation (multimodal understanding). |
| modeling_bailing_moe_v2 | |
| pipeline | Ming-flash-omni-2.0 pipeline topology (frozen). |
| projectors | |
| prompt_utils | Ming-flash-omni-2.0 prompt utilities. |
| spk_embedding | |
| talker_module | |
| text_processing | Text segmentation and normalization utilities for Ming TTS. |
| vision_encoder | |
| voice_presets | |
MingFlashOmniForConditionalGeneration ¶
Bases: Module, SupportsMultiModal, SupportsPP, SupportsMRoPE, CustomProcessMixin
Ming-flash-omni-2.0 thinker + image-gen wrapper.
make_empty_intermediate_tensors instance-attribute ¶
requires_raw_input_tokens class-attribute instance-attribute ¶
requires_raw_input_tokens: bool = True
thinker instance-attribute ¶
thinker = init_vllm_registered_model(
vllm_config=thinker_vllm_config,
prefix=maybe_prefix(prefix, "thinker"),
architectures=[
"MingFlashOmniThinkerForConditionalGeneration"
],
)
embed_input_ids ¶
forward ¶
forward(
input_ids: Tensor,
positions: Tensor,
intermediate_tensors: IntermediateTensors | None = None,
inputs_embeds: Tensor | None = None,
**kwargs,
) -> OmniOutput
get_language_model ¶
Return the language model for upstream MoE detection.
MingFlashOmniTalkerForConditionalGeneration ¶
Bases: Module, CustomProcessMixin
Ming-flash-omni-2.0 talker stage: text -> audio waveform.
Uses Qwen2 LLM + CFM (Conditional Flow Matching with DiT) + Aggregator in an autoregressive loop to produce continuous audio latents, then AudioVAE decodes latents to waveforms.
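The loop described above can be pictured with a toy sketch. Everything here is a hypothetical stand-in: the callables `llm_step`, `cfm_sample`, `aggregate`, and `should_stop` only mimic the roles of the LLM, CFM, Aggregator, and stop head, and the order in which the stop check runs is an assumption, not something this page states.

```python
from typing import Callable, List

Vector = List[float]


def generate_latents(
    llm_step: Callable[[Vector], Vector],
    cfm_sample: Callable[[Vector], List[Vector]],
    aggregate: Callable[[List[Vector]], Vector],
    should_stop: Callable[[Vector], bool],
    first_input: Vector,
    max_steps: int = 64,
) -> List[Vector]:
    """Toy autoregressive loop: hidden state -> latent patch -> next input."""
    latents: List[Vector] = []
    step_input = first_input
    for _ in range(max_steps):
        hidden = llm_step(step_input)      # one LLM step (hypothetical)
        patch = cfm_sample(hidden)         # a patch of continuous latents
        latents.extend(patch)
        if should_stop(hidden):            # assumed: stop check after sampling
            break
        step_input = aggregate(patch)      # fold the patch back into an input
    return latents


# Toy stand-ins so the sketch runs without any model weights.
def toy_llm(x: Vector) -> Vector:
    return [v + 1.0 for v in x]


def toy_cfm(h: Vector) -> List[Vector]:
    return [[h[0]], [h[0] * 2]]            # a patch of two 1-dim frames


def toy_agg(patch: List[Vector]) -> Vector:
    return [sum(frame[0] for frame in patch) / len(patch)]


def toy_stop(h: Vector) -> bool:
    return h[0] >= 3.0


latents = generate_latents(toy_llm, toy_cfm, toy_agg, toy_stop, [0.0])
```

In the real model the collected latents would then be decoded to a waveform by the AudioVAE; that step is left out of the sketch.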
aggregator instance-attribute ¶
aggregator = Aggregator(
llm_input_dim=hidden_size, **(aggregator)
)
allow_patterns_overrides instance-attribute ¶
audio_generator instance-attribute ¶
audio_generator = MingAudioGenerator(
config=config,
llm_config=llm_config,
model=model,
cfm=cfm,
aggregator=aggregator,
stop_head=stop_head,
audio_vae=audio_vae,
patch_size=patch_size,
his_patch_size=his_patch_size,
latent_dim=latent_dim,
cfg_strength=cfg_strength,
use_cuda_graphs=_use_cuda_graphs,
)
talker_dir instance-attribute ¶
voice_presets instance-attribute ¶
voice_presets = VoicePresetRegistry(
talker_dir=talker_dir,
model_path=_model_path,
download_dir=download_dir,
audio_vae=audio_vae,
aggregator=aggregator,
spk_head=spk_head,
patch_size=patch_size,
)
embed_input_ids ¶
forward ¶
forward(
input_ids: Tensor,
positions: Tensor,
intermediate_tensors: IntermediateTensors | None = None,
inputs_embeds: Tensor | None = None,
runtime_additional_information: list[dict]
| None = None,
**kwargs,
) -> OmniOutput
Run TTS generation and return audio output.
The full autoregressive generation loop is executed inside this method.
get_dummy_runtime_additional_information ¶
load_weights ¶
Load weights for all talker components.
The talker's HF checkpoint (talker/model.safetensors) stores weights with prefixes that match this module's submodule names directly. The AudioVAE weights live in a separate file under talker/vae/
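A minimal sketch of prefix-based routing, assuming weight names are dotted paths whose first component is the submodule name. The helper `group_by_submodule` and the example weight names are hypothetical and are not taken from the real checkpoint.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def group_by_submodule(
    weights: Iterable[Tuple[str, Any]],
    submodules: Iterable[str],
) -> Dict[str, List[Tuple[str, Any]]]:
    """Bucket (name, tensor) pairs by the leading component of the name."""
    known = set(submodules)
    grouped: Dict[str, List[Tuple[str, Any]]] = defaultdict(list)
    for name, value in weights:
        prefix, _, rest = name.partition(".")
        if prefix in known:
            # Strip the prefix so the submodule sees its own local name.
            grouped[prefix].append((rest, value))
        else:
            # Keep unknown names visible instead of dropping them silently.
            grouped["<unmatched>"].append((name, value))
    return dict(grouped)


# Hypothetical weight names, with integers standing in for tensors.
checkpoint = [
    ("aggregator.proj.weight", 1),
    ("stop_head.bias", 2),
    ("mystery.weight", 3),
]
grouped = group_by_submodule(checkpoint, ["aggregator", "stop_head"])
```

Collecting unmatched names in their own bucket makes it easy to spot a checkpoint that does not line up with the module tree.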
MingFlashOmniThinkerDummyInputsBuilder ¶
MingFlashOmniThinkerForConditionalGeneration ¶
Bases: Module, SupportsMultiModal, SupportsPP, SupportsMRoPE, CustomProcessMixin
Ming Thinker stage: multimodal understanding (text + image + video + audio) -> text generation.
hf_to_vllm_mapper class-attribute instance-attribute ¶
language_model instance-attribute ¶
language_model = BailingMoeV2ForCausalLM(
vllm_config=llm_vllm_config,
prefix=maybe_prefix(prefix, "llm"),
)
linear_proj instance-attribute ¶
linear_proj = VisionProjector(
vision_dim=image_emb_dim,
llm_dim=hidden_size,
mlp_depth=getattr(thinker_config, "mlp_depth", 2),
)
linear_proj_audio instance-attribute ¶
linear_proj_audio = AudioProjector(
audio_dim=audio_emb_dim,
llm_dim=hidden_size,
ds_kernel_size=getattr(audio_cfg, "ds_kernel_size", 3),
ds_stride=getattr(audio_cfg, "ds_stride", 2),
mlp_depth=getattr(thinker_config, "mlp_depth", 1),
)
make_empty_intermediate_tensors instance-attribute ¶
vision instance-attribute ¶
vision = MingVisionEncoder(
vision_config=vision_config,
quant_config=quant_config,
prefix=maybe_prefix(prefix, "vision"),
)
embed_input_ids ¶
embed_input_ids(
input_ids: Tensor,
multimodal_embeddings: MultiModalEmbeddings
| None = None,
*,
is_multimodal: Tensor | None = None,
handle_oov_mm_token: bool = False,
) -> Tensor
extract_audio_feature ¶
extract_audio_feature(
audio_feats: Tensor, audio_feats_lengths: Tensor
) -> tuple[Tensor, ...]
Extract and project audio features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| audio_feats | Tensor | [B, L_total, n_mels] wrapped mel features: multiple audio clips per batch item are concatenated along the time dimension (time-first, as produced by MingWhisperFeatureExtractor). | required |
| audio_feats_lengths | Tensor | [B, N] lengths of each audio clip per batch item. N is the max number of clips per item; zero-padded entries are skipped. | required |
Returns:
| Type | Description |
|---|---|
| tuple[Tensor, ...] | Tuple of per-clip [T'_i, hidden_size] projected audio embeddings. |
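The input layout can be illustrated with a small sketch that only covers the splitting step: cutting each batch item's concatenated frames back into clips and skipping zero-length padding. It uses nested Python lists in place of tensors, the helper name `split_clips` is hypothetical, and the encoder and projector that shorten each clip from T_i to T'_i are not modeled.

```python
from typing import List

Frame = List[float]


def split_clips(
    audio_feats: List[List[Frame]],
    audio_feats_lengths: List[List[int]],
) -> List[List[Frame]]:
    """Split [B, L_total, n_mels] frames into a flat list of per-clip frames."""
    clips: List[List[Frame]] = []
    for frames, lengths in zip(audio_feats, audio_feats_lengths):
        start = 0
        for length in lengths:
            if length == 0:
                continue  # zero-padded slot: this item has fewer than N clips
            clips.append(frames[start:start + length])
            start += length
    return clips


# B=2, L_total=5, n_mels=1; the second item holds one clip plus padding.
feats = [
    [[0.0], [1.0], [2.0], [3.0], [4.0]],
    [[5.0], [6.0], [7.0], [0.0], [0.0]],
]
lengths = [[2, 3], [3, 0]]
clips = split_clips(feats, lengths)
clip_sizes = [len(c) for c in clips]
```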
extract_image_feature ¶
Extract and project image features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pixel_values | Tensor | Flattened pixel values from vision processor. | required |
| grid_thw | Tensor | [num_images, 3] tensor of (t, h, w) grid dimensions. | required |
Returns:
| Type | Description |
|---|---|
| Tensor | [seq_len, hidden_size] L2-normalized image embeddings. |
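For readers unfamiliar with the term, L2 normalization scales each embedding row to unit Euclidean length. The sketch below shows that operation on plain Python lists; it assumes normalization runs over the last (hidden) dimension, and the `eps` guard is an illustrative choice rather than a value taken from the source.

```python
import math
from typing import List


def l2_normalize(rows: List[List[float]], eps: float = 1e-12) -> List[List[float]]:
    """Scale each row to unit L2 norm; eps guards against all-zero rows."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        denom = max(norm, eps)
        normalized.append([v / denom for v in row])
    return normalized


embeddings = [[3.0, 4.0], [0.0, 0.0]]
unit = l2_normalize(embeddings)
first_norm = math.sqrt(sum(v * v for v in unit[0]))
```

With PyTorch tensors the same effect is commonly obtained with `torch.nn.functional.normalize(x, p=2, dim=-1)`; whether this method uses that call is not stated here.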
forward ¶
forward(
input_ids: Tensor,
positions: Tensor,
intermediate_tensors: IntermediateTensors | None = None,
inputs_embeds: Tensor | None = None,
**kwargs,
) -> OmniOutput
get_mrope_input_positions ¶
get_mrope_input_positions(
input_tokens: list[int],
mm_features: list[MultiModalFeatureSpec] | None = None,
**kwargs: object,
) -> tuple[Tensor, int]
Compute M-RoPE input positions using mm_features directly.
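As a rough illustration of what M-RoPE positions look like, the sketch below follows a Qwen2-VL-style scheme, which is an assumption about this method. Text and audio tokens get the same sequential index on all three axes, while an image or video grid gets separate temporal, height, and width indices. It ignores spatial merging and `second_per_grid_t`, treats the returned integer as a position delta, and uses a hypothetical `segments` input format.

```python
from typing import List, Tuple, Union

Segment = Tuple[str, Union[int, Tuple[int, int, int]]]


def sequential_positions(start: int, length: int) -> List[List[int]]:
    """Text/audio: identical index on the T, H and W axes."""
    axis = list(range(start, start + length))
    return [axis[:], axis[:], axis[:]]


def grid_positions(start: int, t: int, h: int, w: int) -> List[List[int]]:
    """Image/video: one index per axis for every (t, h, w) grid cell."""
    t_axis, h_axis, w_axis = [], [], []
    for ti in range(t):
        for hi in range(h):
            for wi in range(w):
                t_axis.append(start + ti)
                h_axis.append(start + hi)
                w_axis.append(start + wi)
    return [t_axis, h_axis, w_axis]


def build_positions(segments: List[Segment]) -> Tuple[List[List[int]], int]:
    positions: List[List[int]] = [[], [], []]
    next_pos = 0
    for kind, spec in segments:
        if kind in ("text", "audio"):
            chunk = sequential_positions(next_pos, spec)
        else:
            chunk = grid_positions(next_pos, *spec)
        if not chunk[0]:
            continue  # empty segment contributes nothing
        for axis, values in zip(positions, chunk):
            axis.extend(values)
        # The next segment starts after the largest index used so far.
        next_pos = max(max(axis) for axis in chunk) + 1
    delta = next_pos - len(positions[0])
    return positions, delta


positions, delta = build_positions(
    [("text", 2), ("image", (1, 2, 2)), ("text", 1)]
)
```

The delta is negative whenever a grid occupies more tokens than position steps, as the 2x2 image does in this example.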
iter_mm_features ¶
iter_mm_features(
mm_features: list[MultiModalFeatureSpec],
) -> Iterator[tuple[int, str, dict[str, Any]]]
Iterate over image/video features sorted by token position.
Yields: (offset, modality, feature_data), where feature_data contains:
- image: {"grid_t", "grid_h", "grid_w", "second_per_grid_t"}
- video: {"grid_t", "grid_h", "grid_w", "second_per_grid_t"}
Audio features are not yielded: Ming assigns them sequential text positions (same T/H/W value) rather than 3D grid positions.
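The iteration contract can be sketched as follows. `ToyFeature` is a hypothetical stand-in for `MultiModalFeatureSpec`, whose real fields may differ, and the grid values are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Tuple


@dataclass
class ToyFeature:
    """Hypothetical stand-in for MultiModalFeatureSpec."""
    offset: int
    modality: str
    data: Dict[str, Any] = field(default_factory=dict)


def iter_grid_features(
    features: List[ToyFeature],
) -> Iterator[Tuple[int, str, Dict[str, Any]]]:
    """Yield image/video features in token order; audio is skipped."""
    for feature in sorted(features, key=lambda f: f.offset):
        if feature.modality not in ("image", "video"):
            continue  # audio takes sequential text positions instead
        yield feature.offset, feature.modality, feature.data


features = [
    ToyFeature(40, "video", {"grid_t": 2, "grid_h": 4, "grid_w": 4}),
    ToyFeature(10, "audio"),
    ToyFeature(3, "image", {"grid_t": 1, "grid_h": 2, "grid_w": 2}),
]
ordered = [(offset, modality) for offset, modality, _ in iter_grid_features(features)]
```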
MingFlashOmniThinkerMultiModalProcessor ¶
Bases: BaseMultiModalProcessor[MingFlashOmniThinkerProcessingInfo]
Multimodal processor for Ming-flash-omni Thinker stage.
Handles preprocessing of 1) image, 2) video, and 3) audio inputs, and expands placeholder tokens to the correct number of patch tokens.
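Placeholder expansion can be sketched in a few lines. This is only an illustration of the idea, not the processor's actual code path: the helper `expand_placeholders`, the token id `99`, and the per-item counts are all hypothetical.

```python
from typing import List


def expand_placeholders(
    token_ids: List[int],
    placeholder_id: int,
    counts: List[int],
) -> List[int]:
    """Replace the i-th placeholder token with counts[i] copies of itself."""
    expanded: List[int] = []
    item = 0
    for token in token_ids:
        if token != placeholder_id:
            expanded.append(token)
            continue
        if item >= len(counts):
            raise ValueError("more placeholders than multimodal items")
        expanded.extend([placeholder_id] * counts[item])
        item += 1
    if item != len(counts):
        raise ValueError("more multimodal items than placeholders")
    return expanded


# Two placeholders: the first item needs 3 patch tokens, the second needs 2.
prompt = [1, 99, 2, 99]
expanded = expand_placeholders(prompt, placeholder_id=99, counts=[3, 2])
```

After expansion, the number of placeholder positions matches the number of embeddings the encoders will produce for each item.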
MingFlashOmniThinkerProcessingInfo ¶
Bases: Qwen2VLProcessingInfo
get_feature_extractor ¶
get_feature_extractor(**kwargs: object)
Return the audio feature extractor from the processor.
The processor may be loaded via trust_remote_code paths that return a stock transformers WhisperFeatureExtractor rather than vllm-omni's subclass MingWhisperFeatureExtractor. We accept both as long as the caller-needed attributes (sampling_rate) exist.
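The "accept both" behaviour amounts to duck typing: check for the attributes the caller needs rather than for a specific class. A minimal sketch, with a hypothetical helper name and toy classes standing in for the two extractor types:

```python
from typing import Any, Iterable


def ensure_feature_extractor(
    extractor: Any,
    required: Iterable[str] = ("sampling_rate",),
) -> Any:
    """Accept any object that exposes the attributes the caller relies on."""
    missing = [name for name in required if not hasattr(extractor, name)]
    if missing:
        raise TypeError(
            f"{type(extractor).__name__} lacks required attributes: {missing}"
        )
    return extractor


class StockExtractor:          # toy stand-in for the stock extractor
    sampling_rate = 16000


class CustomExtractor(StockExtractor):  # toy stand-in for the subclass
    pass


class NotAnExtractor:
    pass


accepted = [
    ensure_feature_extractor(StockExtractor()).sampling_rate,
    ensure_feature_extractor(CustomExtractor()).sampling_rate,
]
try:
    ensure_feature_extractor(NotAnExtractor())
    rejected = False
except TypeError:
    rejected = True
```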