
vllm_omni.model_executor.models.ming_flash_omni

Modules:

| Name | Description |
| --- | --- |
| audio_encoder | |
| audio_vae | |
| ming_flash_omni | Ming-flash-omni-2.0 thinker / image-gen wrapper. |
| ming_flash_omni_talker | Ming-flash-omni-2.0 talker (TTS) stage model. |
| ming_flash_omni_thinker | Ming-flash-omni-2.0 Thinker stage implementation (multimodal understanding). |
| modeling_bailing_moe_v2 | |
| pipeline | Ming-flash-omni-2.0 pipeline topology (frozen). |
| projectors | |
| prompt_utils | Ming-flash-omni-2.0 prompt utilities. |
| spk_embedding | |
| talker_module | |
| text_processing | Text segmentation and normalization utilities for Ming TTS. |
| vision_encoder | |
| voice_presets | |

MingFlashOmniForConditionalGeneration

Bases: Module, SupportsMultiModal, SupportsPP, SupportsMRoPE, CustomProcessMixin

Ming-flash-omni-2.0 thinker + image-gen wrapper.

config instance-attribute

config = config

has_postprocess instance-attribute

has_postprocess = False

has_preprocess instance-attribute

has_preprocess = False

have_multimodal_outputs instance-attribute

have_multimodal_outputs = True

make_empty_intermediate_tensors instance-attribute

make_empty_intermediate_tensors = (
    make_empty_intermediate_tensors
)

model instance-attribute

model = thinker

model_stage instance-attribute

model_stage = model_stage

requires_raw_input_tokens class-attribute instance-attribute

requires_raw_input_tokens: bool = True

sampler property

sampler

supports_multimodal class-attribute instance-attribute

supports_multimodal = True

thinker instance-attribute

thinker = init_vllm_registered_model(
    vllm_config=thinker_vllm_config,
    prefix=maybe_prefix(prefix, "thinker"),
    architectures=[
        "MingFlashOmniThinkerForConditionalGeneration"
    ],
)

compute_logits

compute_logits(
    hidden_states: Tensor, sampling_metadata=None
) -> Tensor | None

embed_input_ids

embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings=None,
    *,
    is_multimodal=None,
) -> Tensor

embed_multimodal

embed_multimodal(**kwargs)

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    **kwargs,
) -> OmniOutput

get_language_model

get_language_model() -> Module

Return the language model for upstream MoE detection.

get_mm_mapping

get_mm_mapping() -> MultiModelKeys

get_mrope_input_positions

get_mrope_input_positions(*args, **kwargs)

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

sample

sample(logits: Tensor, sampling_metadata)

MingFlashOmniTalkerForConditionalGeneration

Bases: Module, CustomProcessMixin

Ming-flash-omni-2.0 talker stage: text -> audio waveform.

Uses a Qwen2 LLM, CFM (Conditional Flow Matching with DiT), and an Aggregator in an autoregressive loop to produce continuous audio latents; AudioVAE then decodes the latents to waveforms.

aggregator instance-attribute

aggregator = Aggregator(
    llm_input_dim=hidden_size, **(aggregator)
)

allow_patterns_overrides instance-attribute

allow_patterns_overrides = ['talker/model*.safetensors']

audio_generator instance-attribute

audio_generator = MingAudioGenerator(
    config=config,
    llm_config=llm_config,
    model=model,
    cfm=cfm,
    aggregator=aggregator,
    stop_head=stop_head,
    audio_vae=audio_vae,
    patch_size=patch_size,
    his_patch_size=his_patch_size,
    latent_dim=latent_dim,
    cfg_strength=cfg_strength,
    use_cuda_graphs=_use_cuda_graphs,
)

cfg_strength instance-attribute

cfg_strength = cfg_strength

cfm instance-attribute

cfm = CFM(
    DiT(llm_input_dim=hidden_size, **(flowmodel)),
    steps=steps,
)

config instance-attribute

config = config

device property

device: device

dtype property

dtype: dtype

fall_back_to_pt_during_load instance-attribute

fall_back_to_pt_during_load = False

has_postprocess instance-attribute

has_postprocess = False

has_preprocess instance-attribute

has_preprocess = False

have_multimodal_outputs instance-attribute

have_multimodal_outputs = True

hidden_size instance-attribute

hidden_size = hidden_size

his_patch_size instance-attribute

his_patch_size = history_patch_size

latent_dim instance-attribute

latent_dim = latent_dim

llm_config instance-attribute

llm_config = llm_config

model instance-attribute

model = Qwen2Model(llm_config)

patch_size instance-attribute

patch_size = patch_size

spk_head instance-attribute

spk_head = Linear(192, hidden_size, bias=True)

stop_head instance-attribute

stop_head = Linear(hidden_size, 2, bias=True)

talker_dir instance-attribute

talker_dir = (
    join(model_path, "talker")
    if isdir(join(model_path, "talker"))
    else model_path
)
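The fallback in `talker_dir` can be illustrated in isolation. The following is a minimal sketch using only the standard library; `resolve_talker_dir` is a hypothetical helper name introduced for illustration and is not part of this module.

```python
import os
import tempfile


def resolve_talker_dir(model_path: str) -> str:
    """Prefer <model_path>/talker when that directory exists, else model_path."""
    candidate = os.path.join(model_path, "talker")
    return candidate if os.path.isdir(candidate) else model_path


# Demo with a throwaway directory (hypothetical checkpoint layout).
with tempfile.TemporaryDirectory() as root:
    flat = resolve_talker_dir(root)          # no talker/ subdirectory yet
    os.mkdir(os.path.join(root, "talker"))
    nested = resolve_talker_dir(root)        # talker/ now exists
    flat_is_root = flat == root
    nested_is_subdir = nested == os.path.join(root, "talker")
```

This means the same code path works whether the checkpoint is the full model directory or the talker directory itself.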

tokenizer cached property

tokenizer

vllm_config instance-attribute

vllm_config = vllm_config

voice_presets instance-attribute

voice_presets = VoicePresetRegistry(
    talker_dir=talker_dir,
    model_path=_model_path,
    download_dir=download_dir,
    audio_vae=audio_vae,
    aggregator=aggregator,
    spk_head=spk_head,
    patch_size=patch_size,
)

compute_logits

compute_logits(
    hidden_states: Tensor, sampling_metadata=None
) -> Tensor | None

embed_input_ids

embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings=None,
    is_multimodal=None,
) -> Tensor

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    runtime_additional_information: list[dict]
    | None = None,
    **kwargs,
) -> OmniOutput

Run TTS generation and return audio output.

The full autoregressive generation loop is executed inside this method.

get_dummy_runtime_additional_information

get_dummy_runtime_additional_information(
    num_reqs: int,
) -> list[dict[str, object]]

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights for all talker components.

The talker's HF checkpoint (talker/model.safetensors) stores weights with prefixes that match this module's submodule names directly. AudioVAE weights live in a separate file under talker/vae/.
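The prefix convention can be sketched as a simple grouping step. This is an illustration only, not the actual `load_weights` logic: the submodule names are taken from the attributes documented above, while the weight names and the `group_by_submodule` helper are hypothetical.

```python
from collections import defaultdict
from typing import Iterable

# Submodule names as documented on the talker class.
TALKER_SUBMODULES = ("model", "cfm", "aggregator", "stop_head", "spk_head")


def group_by_submodule(names: Iterable[str]) -> dict[str, list[str]]:
    """Group checkpoint weight names by their leading submodule prefix."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        prefix, _, _ = name.partition(".")
        key = prefix if prefix in TALKER_SUBMODULES else "<unmatched>"
        groups[key].append(name)
    return dict(groups)


# Hypothetical weight names, for illustration only.
grouped = group_by_submodule(
    [
        "model.layers.0.self_attn.q_proj.weight",
        "cfm.blocks.0.weight",
        "aggregator.proj.weight",
        "stop_head.bias",
        "spk_head.weight",
        "unexpected.tensor",
    ]
)
```

A loader built this way can report the `<unmatched>` group to surface checkpoint entries that no submodule claims.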

make_empty_intermediate_tensors

make_empty_intermediate_tensors(
    batch_size: int, dtype: dtype, device: device
) -> IntermediateTensors | None

sample

sample(logits: Tensor, sampling_metadata)

MingFlashOmniThinkerDummyInputsBuilder

Bases: BaseDummyInputsBuilder[MingFlashOmniThinkerProcessingInfo]

get_dummy_mm_data

get_dummy_mm_data(
    seq_len: int,
    mm_counts: Mapping[str, int],
    mm_options: Mapping[str, BaseDummyOptions]
    | None = None,
) -> MultiModalDataDict

get_dummy_text

get_dummy_text(mm_counts: Mapping[str, int]) -> str

MingFlashOmniThinkerForConditionalGeneration

Bases: Module, SupportsMultiModal, SupportsPP, SupportsMRoPE, CustomProcessMixin

Ming Thinker stage: multimodal understanding (text + image + video + audio) -> text generation.

audio instance-attribute

audio = WhisperAudioEncoder(
    **whisper_cfg, use_flash_attn=True
)

config instance-attribute

config = llm_config

have_multimodal_outputs instance-attribute

have_multimodal_outputs = True

hf_to_vllm_mapper class-attribute instance-attribute

hf_to_vllm_mapper = WeightsMapper(
    orig_to_new_prefix={"model.": "language_model."}
)
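The effect of the `orig_to_new_prefix` mapping can be shown with plain string handling. This sketch does not use or reproduce vLLM's `WeightsMapper`; `remap_prefix` is a hypothetical stand-in and the weight names are made up.

```python
def remap_prefix(name: str, orig_to_new_prefix: dict[str, str]) -> str:
    """Rewrite the first matching leading prefix; leave other names unchanged."""
    for old, new in orig_to_new_prefix.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name


mapping = {"model.": "language_model."}

# Hypothetical checkpoint names, for illustration only.
remapped = remap_prefix("model.layers.0.mlp.gate.weight", mapping)
untouched = remap_prefix("vision.blocks.0.attn.qkv.weight", mapping)
```

Here `remapped` becomes `language_model.layers.0.mlp.gate.weight`, while the vision weight name is passed through as-is.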

language_model instance-attribute

language_model = BailingMoeV2ForCausalLM(
    vllm_config=llm_vllm_config,
    prefix=maybe_prefix(prefix, "llm"),
)

linear_proj instance-attribute

linear_proj = VisionProjector(
    vision_dim=image_emb_dim,
    llm_dim=hidden_size,
    mlp_depth=getattr(thinker_config, "mlp_depth", 2),
)

linear_proj_audio instance-attribute

linear_proj_audio = AudioProjector(
    audio_dim=audio_emb_dim,
    llm_dim=hidden_size,
    ds_kernel_size=getattr(audio_cfg, "ds_kernel_size", 3),
    ds_stride=getattr(audio_cfg, "ds_stride", 2),
    mlp_depth=getattr(thinker_config, "mlp_depth", 1),
)

make_empty_intermediate_tensors instance-attribute

make_empty_intermediate_tensors = (
    make_empty_intermediate_tensors
)

query_tokens_dict instance-attribute

query_tokens_dict = ParameterDict()

sampler property

sampler

thinker_config instance-attribute

thinker_config = thinker_config

vision instance-attribute

vision = MingVisionEncoder(
    vision_config=vision_config,
    quant_config=quant_config,
    prefix=maybe_prefix(prefix, "vision"),
)

compute_logits

compute_logits(
    hidden_states: Tensor, sampling_metadata
) -> Tensor | None

embed_input_ids

embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings: MultiModalEmbeddings
    | None = None,
    *,
    is_multimodal: Tensor | None = None,
    handle_oov_mm_token: bool = False,
) -> Tensor

embed_multimodal

embed_multimodal(**kwargs: object) -> MultiModalEmbeddings

extract_audio_feature

extract_audio_feature(
    audio_feats: Tensor, audio_feats_lengths: Tensor
) -> tuple[Tensor, ...]

Extract and project audio features.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| audio_feats | Tensor | [B, L_total, n_mels] wrapped mel features; multiple audio clips per batch item are concatenated along the time dimension (time-first, as produced by MingWhisperFeatureExtractor). | required |
| audio_feats_lengths | Tensor | [B, N] lengths of each audio clip per batch item. N is the max number of clips per item; zero-padded entries are skipped. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple[Tensor, ...] | Tuple of per-clip [T'_i, hidden_size] projected audio embeddings. |
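The input layout (clips concatenated along time, with zero-padded length entries skipped) can be sketched with nested lists instead of tensors. This covers only the splitting step; the real method also encodes and projects each clip, so the output lengths T'_i differ from the input lengths. `split_clips` is a hypothetical helper.

```python
def split_clips(feats, lengths):
    """Split concatenated per-item frames into individual clips.

    feats:   [B][L_total][n_mels] nested lists (time-first)
    lengths: [B][N] clip lengths, zero-padded up to N clips per item
    """
    clips = []
    for item_frames, item_lengths in zip(feats, lengths):
        start = 0
        for n in item_lengths:
            if n == 0:  # zero-padded entry: no clip in this slot
                continue
            clips.append(item_frames[start:start + n])
            start += n
    return clips


n_mels = 2
frame = [0.0] * n_mels
# Item 0 holds two clips (3 + 2 frames); item 1 holds one clip of 4 frames
# followed by one padding frame so both items share L_total = 5.
feats = [[frame] * 5, [frame] * 5]
lengths = [[3, 2], [4, 0]]

clip_lengths = [len(clip) for clip in split_clips(feats, lengths)]
```

The result is a flat sequence of three clips with 3, 2, and 4 frames, matching the per-clip tuple shape of the return value.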

extract_image_feature

extract_image_feature(
    pixel_values: Tensor, grid_thw: Tensor
) -> Tensor

Extract and project image features.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pixel_values | Tensor | Flattened pixel values from vision processor. | required |
| grid_thw | Tensor | [num_images, 3] tensor of (t, h, w) grid dimensions. | required |

Returns:

| Type | Description |
| --- | --- |
| Tensor | [seq_len, hidden_size] L2-normalized image embeddings. |
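L2 normalization of the returned embeddings can be sketched in plain Python. This assumes normalization is applied per row along the hidden dimension; the `eps` guard and the `l2_normalize` helper are illustrative assumptions, not taken from the source.

```python
import math


def l2_normalize(rows, eps=1e-12):
    """Scale each row to unit Euclidean length (eps guards against zero rows)."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        scale = max(norm, eps)
        normalized.append([x / scale for x in row])
    return normalized


# Two toy embeddings with hidden_size = 2.
embeddings = [[3.0, 4.0], [0.0, 2.0]]
unit_rows = l2_normalize(embeddings)
row_norms = [math.sqrt(sum(x * x for x in row)) for row in unit_rows]
```

After normalization every row has length 1, so dot products between rows behave like cosine similarities.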

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    **kwargs,
) -> OmniOutput

get_mrope_input_positions

get_mrope_input_positions(
    input_tokens: list[int],
    mm_features: list[MultiModalFeatureSpec] | None = None,
    **kwargs: object,
) -> tuple[Tensor, int]

Compute M-RoPE input positions using mm_features directly.

get_placeholder_str classmethod

get_placeholder_str(modality: str, i: int) -> str | None

iter_mm_features

iter_mm_features(
    mm_features: list[MultiModalFeatureSpec],
) -> Iterator[tuple[int, str, dict[str, Any]]]

Iterate over image/video features sorted by token position.

Yields: (offset, modality, feature_data), where feature_data contains:
- image: {"grid_t", "grid_h", "grid_w", "second_per_grid_t"}
- video: {"grid_t", "grid_h", "grid_w", "second_per_grid_t"}

Audio features are not yielded: Ming assigns them sequential text positions (same T/H/W value) rather than 3D grid positions.
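The iteration order and the audio exclusion can be sketched with plain dictionaries. The dict layout below is a hypothetical stand-in for `MultiModalFeatureSpec`, whose real fields are not shown here; `iter_grid_features` is likewise an illustrative name.

```python
from typing import Any, Iterator


def iter_grid_features(
    features: list[dict[str, Any]],
) -> Iterator[tuple[int, str, dict[str, Any]]]:
    """Yield image/video entries ordered by token offset; skip audio."""
    grid_items = [f for f in features if f["modality"] in ("image", "video")]
    for feature in sorted(grid_items, key=lambda f: f["offset"]):
        yield feature["offset"], feature["modality"], feature["data"]


# Hypothetical feature list, deliberately out of order.
features = [
    {"modality": "video", "offset": 40,
     "data": {"grid_t": 2, "grid_h": 4, "grid_w": 4, "second_per_grid_t": 1.0}},
    {"modality": "audio", "offset": 10, "data": {}},
    {"modality": "image", "offset": 5,
     "data": {"grid_t": 1, "grid_h": 2, "grid_w": 2, "second_per_grid_t": 0.0}},
]

order = [(offset, modality) for offset, modality, _ in iter_grid_features(features)]
```

The image at offset 5 is yielded before the video at offset 40, and the audio entry never appears.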

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

sample

sample(logits: Tensor, sampling_metadata)

MingFlashOmniThinkerMultiModalProcessor

Bases: BaseMultiModalProcessor[MingFlashOmniThinkerProcessingInfo]

Multimodal processor for Ming-flash-omni Thinker stage.

Handles preprocessing of 1) image, 2) video, and 3) audio inputs, and expands placeholder tokens to the correct number of patch tokens.
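Placeholder expansion can be sketched on a list of token ids. The token ids and per-item patch counts below are made up, and `expand_placeholders` is a hypothetical helper; the real processor derives the counts from the preprocessed inputs.

```python
def expand_placeholders(token_ids, placeholder_id, counts):
    """Replace each placeholder token with `count` copies of itself, in order."""
    expanded = []
    remaining = iter(counts)
    for token in token_ids:
        if token == placeholder_id:
            expanded.extend([placeholder_id] * next(remaining))
        else:
            expanded.append(token)
    return expanded


# Two placeholders (id 99) that should occupy 3 and 2 patch tokens.
expanded = expand_placeholders([1, 99, 2, 99, 3], placeholder_id=99, counts=[3, 2])
```

After expansion the prompt reserves exactly one position per patch embedding, so the embeddings can be written into the placeholder slots.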

MingFlashOmniThinkerProcessingInfo

Bases: Qwen2VLProcessingInfo

get_data_parser

get_data_parser()

get_feature_extractor

get_feature_extractor(**kwargs: object)

Return the audio feature extractor from the processor.

The processor may be loaded via trust_remote_code paths that return a stock transformers WhisperFeatureExtractor rather than vllm-omni's subclass MingWhisperFeatureExtractor. We accept both as long as the caller-needed attributes (sampling_rate) exist.
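The "accept both" rule amounts to a duck-typed check on the attribute the caller needs. In this minimal sketch the two extractor classes and `check_feature_extractor` are hypothetical stand-ins, not the real transformers or vllm-omni classes.

```python
class StockExtractor:
    """Stand-in for a stock feature extractor exposing sampling_rate."""
    sampling_rate = 16000


class IncompleteExtractor:
    """Stand-in for an object that lacks the required attribute."""


def check_feature_extractor(extractor):
    """Accept any extractor that exposes the attribute callers rely on."""
    if not hasattr(extractor, "sampling_rate"):
        raise TypeError(
            f"{type(extractor).__name__} has no 'sampling_rate' attribute"
        )
    return extractor


accepted = check_feature_extractor(StockExtractor())

try:
    check_feature_extractor(IncompleteExtractor())
    rejected = False
except TypeError:
    rejected = True
```

Checking for attributes rather than a concrete class keeps the code working regardless of which loading path produced the processor.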

get_hf_config

get_hf_config() -> BailingMM2Config

get_hf_processor

get_hf_processor(**kwargs: object)

get_mm_max_tokens_per_item

get_mm_max_tokens_per_item(
    seq_len: int, mm_counts: Mapping[str, int]
) -> Mapping[str, int]

get_supported_mm_limits

get_supported_mm_limits() -> Mapping[str, int | None]

get_target_channels

get_target_channels() -> int