vllm_omni.model_executor.models.minicpmo_4_5

Modules:

Name                    Description
minicpmo_4_5_omni
minicpmo_4_5_omni_llm
minicpmo_4_5_omni_tts   MiniCPM-o 4.5 Talker + Token2Wav: MiniCPMTTS with hidden_text_merge condition.
pipeline                MiniCPM-o 4.5 pipeline topology (frozen).

MiniCPMO45OmniForConditionalGeneration

Bases: Module, SupportsMultiModal, SupportsPP, SupportsMRoPE

MiniCPM-o 4.5 Omni model for conditional generation.

Two-stage pipeline:

- thinker (model_stage="llm"): image / video / audio encoders + 3D resampler + the omni LLM that emits text + hidden states.
- talker (model_stage="tts"): MiniCPMTTS + the in-process Token2Wav vocoder that emits the final audio waveform directly.

config instance-attribute

config = config

have_multimodal_outputs instance-attribute

have_multimodal_outputs = True

make_empty_intermediate_tensors instance-attribute

make_empty_intermediate_tensors = (
    make_empty_intermediate_tensors
    if model_stage == "llm" and thinker is not None
    else (lambda: None)
)

model instance-attribute

model = thinker

model_stage instance-attribute

model_stage = model_stage

multimodal_config instance-attribute

multimodal_config = multimodal_config

sampler cached property

sampler

talker instance-attribute

talker = None

thinker instance-attribute

thinker = init_vllm_registered_model(
    vllm_config=vllm_config,
    prefix=maybe_prefix(prefix, "thinker"),
    hf_config=config,
    architectures=[
        "MiniCPMO45OmniLLMForConditionalGeneration"
    ],
)

vllm_config instance-attribute

vllm_config = vllm_config

compute_logits

compute_logits(
    hidden_states: Tensor | OmniOutput,
) -> Tensor | None

embed_input_ids

embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings=None,
    *,
    is_multimodal=None,
) -> Tensor

embed_multimodal

embed_multimodal(**kwargs: object)

Called by vLLM V1 during encoder profiling. The inherited Protocol stub returns None, so this class provides its own implementation.

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    sampling_metadata: SamplingMetadata | None = None,
    logits_index: int | None = None,
    sampler=None,
    additional_information: dict[str, object] | None = None,
    **kwargs: object,
) -> Tensor | IntermediateTensors | OmniOutput

Forward pass for MiniCPM-o Omni model.

Workflow:

1) Thinker (model_stage="llm"): image / video / audio encoders + 3D resampler + omni LLM → text + hidden states.
2) Talker (model_stage="tts"): MiniCPMTTS + the in-process Token2Wav vocoder → audio waveform (final pipeline output).
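The two-stage flow can be pictured with a toy sketch. Everything below is hypothetical stand-in code: the `run_thinker`, `run_talker`, and `run_stage` helpers and the dictionary keys are illustrative and not part of vllm_omni. It only shows how a `model_stage` value could select which half of the pipeline runs, and how the thinker's hidden states feed the talker.

```python
from typing import Any


def run_thinker(prompt: str) -> dict[str, Any]:
    """Hypothetical stand-in for the thinker: returns text plus hidden states."""
    tokens = prompt.split()
    # One fake 2-dimensional hidden vector per token (not real model output).
    hidden = [[float(len(tok)), 0.0] for tok in tokens]
    return {"text": prompt.upper(), "hidden_states": hidden}


def run_talker(thinker_out: dict[str, Any]) -> list[float]:
    """Hypothetical stand-in for the talker: maps hidden states to a fake waveform."""
    return [vec[0] / 10.0 for vec in thinker_out["hidden_states"]]


def run_stage(model_stage: str, payload: Any) -> Any:
    """Route the payload to the stage named by model_stage ("llm" or "tts")."""
    if model_stage == "llm":
        return run_thinker(payload)
    if model_stage == "tts":
        return run_talker(payload)
    raise ValueError(f"unknown model_stage: {model_stage!r}")


thinker_out = run_stage("llm", "hello omni")
waveform = run_stage("tts", thinker_out)
```

In this toy version the stages are chained by hand; in the real pipeline the hand-off between stages is managed by the serving framework.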

get_input_embeddings

get_input_embeddings(
    input_ids: Tensor, multimodal_embeddings=None
) -> Tensor

get_multimodal_embeddings

get_multimodal_embeddings(**kwargs)

get_placeholder_str classmethod

get_placeholder_str(modality: str, i: int) -> str | None

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights for the active stage of the omni model.
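Stage-aware loading typically means consuming only the checkpoint entries that belong to the submodule in use and reporting which names were loaded. The sketch below shows that pattern with plain Python values. The `thinker.` / `talker.` prefixes and the `load_stage_weights` helper are assumptions made for illustration; they are not the checkpoint layout or API of this model.

```python
from collections.abc import Iterable


def load_stage_weights(
    weights: Iterable[tuple[str, object]],
    stage_prefix: str,
    store: dict[str, object],
) -> set[str]:
    """Copy entries whose name starts with stage_prefix into store.

    Returns the set of names that were loaded, mirroring the set[str]
    return type documented above. The prefix scheme is hypothetical.
    """
    loaded: set[str] = set()
    for name, tensor in weights:
        if not name.startswith(stage_prefix):
            continue  # belongs to the inactive stage, skip it
        store[name] = tensor
        loaded.add(name)
    return loaded


# Placeholder integers stand in for tensors.
checkpoint = [("thinker.llm.w", 1), ("talker.tts.w", 2), ("thinker.vpm.w", 3)]
store: dict[str, object] = {}
loaded = load_stage_weights(checkpoint, "thinker.", store)
```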

move_submodules_to_devices

move_submodules_to_devices(
    *,
    thinker_device: str | device | None = None,
    talker_device: str | device | None = None,
) -> None

Optionally move thinker / talker to different devices.

Example

model.move_submodules_to_devices(
    thinker_device='cuda:0',
    talker_device='cuda:1',
)

sample

sample(
    logits: Tensor, sampling_metadata: SamplingMetadata
) -> SamplerOutput | None

MiniCPMO45OmniLLMForConditionalGeneration

Bases: Module, SupportsMultiModal, SupportsPP, SupportsMRoPE

MiniCPM-o Thinker model: Image preprocessing + Vision encoder + 3D Resampler + LLM.

This model processes images through:

1. Image preprocessing (MiniCPMVImageProcessor)
2. Vision encoder (SiglipVisionTransformer)
3. 3D Resampler (Resampler to convert vision features to fixed-size queries)
4. LLM (Qwen2ForCausalLM for text generation)

apm instance-attribute

apm = MiniCPMWhisperEncoder(audio_config)

audio_avg_pooler instance-attribute

audio_avg_pooler = AvgPool1d(
    audio_pool_step, stride=audio_pool_step
)
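A non-overlapping average pool (kernel size equal to stride) shortens a sequence by roughly a factor of `audio_pool_step`. The pure-Python sketch below mimics that behaviour for a single channel, assuming the default `AvgPool1d` settings (no padding, and trailing frames that do not fill a whole window are dropped). `avg_pool_1d` is an illustrative helper, not the module used by the model.

```python
def avg_pool_1d(frames: list[float], step: int) -> list[float]:
    """Average consecutive, non-overlapping windows of length `step`."""
    if step <= 0:
        raise ValueError("step must be positive")
    n_out = len(frames) // step  # an incomplete trailing window is dropped
    return [
        sum(frames[i * step:(i + 1) * step]) / step
        for i in range(n_out)
    ]


pooled = avg_pool_1d([1.0, 2.0, 3.0, 4.0, 5.0], step=2)
print(pooled)  # [1.5, 3.5]
```

With a step of 2, five input frames become two pooled frames; the fifth frame has no partner and is discarded.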

audio_encoder_layer instance-attribute

audio_encoder_layer = -1

audio_projection_layer instance-attribute

audio_projection_layer = MultiModalProjector(
    in_dim=audio_output_dim, out_dim=embed_dim
)

config instance-attribute

config = config

image_processor instance-attribute

image_processor = MiniCPMVImageProcessor(
    max_slice_nums=max_slice_nums,
    scale_resolution=image_size,
    patch_size=patch_size,
    use_image_id=use_image_id,
    image_feature_size=query_num,
)

llm instance-attribute

llm = init_vllm_registered_model(
    vllm_config=vllm_config,
    prefix=maybe_prefix(prefix, "llm"),
    hf_config=text_config,
    architectures=[llm_arch],
)

make_empty_intermediate_tensors instance-attribute

make_empty_intermediate_tensors = (
    make_empty_intermediate_tensors
)

mm_token_ids instance-attribute

mm_token_ids = set[int]()

multimodal_config instance-attribute

multimodal_config = multimodal_config

resampler instance-attribute

resampler = Resampler(
    num_queries=query_num,
    embed_dim=embed_dim,
    num_heads=embed_dim // 128,
    kv_dim=vision_dim,
    adaptive=True,
)

vpm instance-attribute

vpm = SiglipVisionTransformer(vision_config)

compute_logits

compute_logits(hidden_states: Tensor) -> Tensor | None

Compute logits from hidden states.

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    **kwargs: object,
) -> Tensor | IntermediateTensors

Forward pass through thinker model.

get_audio_hidden_states

get_audio_hidden_states(
    data: MiniCPMOAudioFeatureInputs,
) -> list[Tensor]

get_input_embeddings

get_input_embeddings(
    input_ids: Tensor,
    multimodal_embeddings: MultiModalEmbeddings
    | None = None,
) -> Tensor

Get input embeddings combining text and multimodal features.
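One common way to combine the two sources is to overwrite the text embedding at each multimodal placeholder position with the next multimodal embedding, in order. The sketch below illustrates that idea with plain lists. `merge_embeddings` and the placeholder ID `99` are hypothetical, and the model's own merge logic may differ.

```python
def merge_embeddings(
    input_ids: list[int],
    text_embeds: list[list[float]],
    mm_embeds: list[list[float]],
    mm_token_ids: set[int],
) -> list[list[float]]:
    """Replace rows at placeholder positions with multimodal rows, in order."""
    merged = [row[:] for row in text_embeds]  # copy so the input is untouched
    mm_iter = iter(mm_embeds)
    for pos, tok in enumerate(input_ids):
        if tok in mm_token_ids:
            try:
                merged[pos] = list(next(mm_iter))
            except StopIteration:
                raise ValueError("more placeholders than multimodal embeddings") from None
    if next(mm_iter, None) is not None:
        raise ValueError("unused multimodal embeddings")
    return merged


merged = merge_embeddings(
    input_ids=[1, 99, 99, 2],            # 99 marks a multimodal placeholder
    text_embeds=[[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.2, 0.2]],
    mm_embeds=[[7.0, 7.0], [8.0, 8.0]],
    mm_token_ids={99},
)
```

The count check on both sides catches a mismatch between the number of placeholder tokens and the number of multimodal feature rows, which is a frequent source of silent misalignment.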

get_language_model

get_language_model() -> Module

get_multimodal_embeddings

get_multimodal_embeddings(
    **kwargs: object,
) -> MultiModalEmbeddings

get_placeholder_str classmethod

get_placeholder_str(modality: str, i: int) -> str | None

get_vision_hidden_states

get_vision_hidden_states(
    data: MiniCPMVImagePixelInputs,
) -> Tensor

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

Load weights for thinker model components.

subsequent_chunk_mask

subsequent_chunk_mask(
    size: int,
    chunk_size: int,
    num_left_chunks: int = -1,
    device: device = device("cpu"),
    num_lookhead: int = 0,
) -> Tensor

Create a (size, size) mask for subsequent steps, grouped by chunk size. This is used by the streaming encoder.

Parameters:

Name             Type    Description                                                           Default
size             int     size of mask                                                          required
chunk_size       int     size of chunk                                                         required
num_left_chunks  int     number of left chunks; <0: use full chunk, >=0: use num_left_chunks   -1
device           device  "cpu" or "cuda" or torch.Tensor.device                                device('cpu')

Returns:

Type    Description
Tensor  torch.Tensor: mask
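The parameter descriptions above can be made concrete with a small sketch. It assumes the usual chunk-mask semantics for streaming encoders: each position may attend to every position in its own chunk, plus either all earlier chunks (`num_left_chunks < 0`) or only the given number of earlier chunks. `chunk_mask_sketch` is an illustrative pure-Python helper that returns nested lists instead of a tensor and ignores the `device` and `num_lookhead` arguments; it is not the method itself.

```python
def chunk_mask_sketch(
    size: int, chunk_size: int, num_left_chunks: int = -1
) -> list[list[bool]]:
    """Build a (size, size) boolean mask; True means "may attend"."""
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        chunk_idx = i // chunk_size
        if num_left_chunks < 0:
            start = 0  # full history
        else:
            start = max((chunk_idx - num_left_chunks) * chunk_size, 0)
        end = min((chunk_idx + 1) * chunk_size, size)  # end of own chunk
        for j in range(start, end):
            mask[i][j] = True
    return mask


full = chunk_mask_sketch(4, 2)
# [[True,  True,  False, False],
#  [True,  True,  False, False],
#  [True,  True,  True,  True],
#  [True,  True,  True,  True]]

limited = chunk_mask_sketch(4, 2, num_left_chunks=0)
# [[True,  True,  False, False],
#  [True,  True,  False, False],
#  [False, False, True,  True],
#  [False, False, True,  True]]
```

With `num_left_chunks=0` each chunk sees only itself, which bounds the attention cost per step at the price of losing earlier context.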

MiniCPMO45OmniTTSForConditionalGeneration

Bases: Module, SupportsPP

MiniCPM-o 4.5 Talker: MiniCPMTTS + Token2Wav in a single forward pass.

audio_tokenizer instance-attribute

audio_tokenizer = None

config instance-attribute

config = config

tts instance-attribute

tts = None

vllm_config instance-attribute

vllm_config = vllm_config

compute_logits

compute_logits(hidden_states, *args, **kwargs)

embed_input_ids

embed_input_ids(input_ids, **kwargs)

forward

forward(
    input_ids=None,
    positions=None,
    intermediate_tensors=None,
    inputs_embeds=None,
    additional_information=None,
    **kwargs,
)

generate_speech

generate_speech(
    tts_token_ids: Tensor, tts_hidden_states: Tensor
) -> ndarray | None

Run the full MiniCPM-o 4.5 TTS pipeline using the original MiniCPMTTS.generate.

get_input_embeddings

get_input_embeddings(
    input_ids, multimodal_embeddings=None, **kwargs
)

load_weights

load_weights(weights: Iterable[tuple[str, Tensor]])

sample

sample(logits, sampling_metadata)