vllm_omni.model_executor.models.minicpmo_4_5 ¶
Modules:
| Name | Description |
|---|---|
| minicpmo_4_5_omni | |
| minicpmo_4_5_omni_llm | |
| minicpmo_4_5_omni_tts | MiniCPM-o 4.5 Talker + Token2Wav: MiniCPMTTS with hidden_text_merge condition. |
| pipeline | MiniCPM-o 4.5 pipeline topology (frozen). |
MiniCPMO45OmniForConditionalGeneration ¶
Bases: Module, SupportsMultiModal, SupportsPP, SupportsMRoPE
MiniCPM-o 4.5 Omni model for conditional generation.
Two-stage pipeline:
- thinker (model_stage="llm"): image / video / audio encoders + 3D resampler + the omni LLM that emits text + hidden states.
- talker (model_stage="tts"): MiniCPMTTS + the in-process Token2Wav vocoder that emits the final audio waveform directly.
make_empty_intermediate_tensors instance-attribute ¶
make_empty_intermediate_tensors = (
make_empty_intermediate_tensors
if model_stage == "llm" and thinker is not None
else (lambda: None)
)
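The conditional above selects a real factory only on the thinker stage and falls back to a no-op otherwise. A minimal sketch of that selection follows, assuming the unqualified `make_empty_intermediate_tensors` on the right-hand side refers to the thinker's own factory (the rendered page does not show the qualifier). `FakeThinker` and `select_factory` are hypothetical names used only for illustration.

```python
from typing import Callable, Optional


class FakeThinker:
    """Hypothetical stand-in for the thinker submodule."""

    def make_empty_intermediate_tensors(self) -> dict:
        # The real factory would build IntermediateTensors; a dict is enough here.
        return {"hidden_states": None}


def select_factory(
    model_stage: str, thinker: Optional[FakeThinker]
) -> Callable[[], Optional[dict]]:
    """Return the thinker's factory on the LLM stage, else a no-op factory."""
    if model_stage == "llm" and thinker is not None:
        return thinker.make_empty_intermediate_tensors
    # Any other stage (or a missing thinker) gets a factory that yields None.
    return lambda: None


llm_factory = select_factory("llm", FakeThinker())
tts_factory = select_factory("tts", None)
```

With this shape, callers can always invoke the attribute without first checking which stage is active.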
thinker instance-attribute ¶
thinker = init_vllm_registered_model(
vllm_config=vllm_config,
prefix=maybe_prefix(prefix, "thinker"),
hf_config=config,
architectures=[
"MiniCPMO45OmniLLMForConditionalGeneration"
],
)
embed_input_ids ¶
embed_multimodal ¶
embed_multimodal(**kwargs: object)
vLLM V1 encoder profiling calls this; the inherited Protocol stub returns None.
forward ¶
forward(
input_ids: Tensor,
positions: Tensor,
intermediate_tensors: IntermediateTensors | None = None,
inputs_embeds: Tensor | None = None,
sampling_metadata: SamplingMetadata | None = None,
logits_index: int | None = None,
sampler=None,
additional_information: dict[str, object] | None = None,
**kwargs: object,
) -> Tensor | IntermediateTensors | OmniOutput
Forward pass for MiniCPM-o Omni model.
Workflow:
1) Thinker (model_stage="llm"): Image / video / audio encoders + 3D resampler + omni LLM → text + hidden states.
2) Talker (model_stage="tts"): MiniCPMTTS + the in-process Token2Wav vocoder → audio waveform (final pipeline output).
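The workflow above implies that a single model instance runs exactly one stage, chosen by `model_stage`. The sketch below illustrates that dispatch in plain Python. `StageRouter` and its callables are hypothetical stand-ins, not the actual vllm_omni implementation, and the real `forward` takes the tensor arguments listed in the signature.

```python
from typing import Any, Callable, Optional


class StageRouter:
    """Hypothetical illustration of dispatching on model_stage."""

    def __init__(
        self,
        model_stage: str,
        thinker: Optional[Callable[[Any], Any]] = None,
        talker: Optional[Callable[[Any], Any]] = None,
    ) -> None:
        if model_stage not in ("llm", "tts"):
            raise ValueError(f"unknown model_stage: {model_stage!r}")
        self.model_stage = model_stage
        self.thinker = thinker
        self.talker = talker

    def forward(self, payload: Any) -> Any:
        # Stage "llm" runs the thinker; stage "tts" runs the talker.
        stage_fn = self.thinker if self.model_stage == "llm" else self.talker
        if stage_fn is None:
            raise RuntimeError(f"stage {self.model_stage!r} has no submodule loaded")
        return stage_fn(payload)


# Stand-in stages: the thinker returns text plus hidden states,
# the talker returns a waveform-like list.
thinker_router = StageRouter("llm", thinker=lambda p: {"text": p, "hidden": [0.0]})
talker_router = StageRouter("tts", talker=lambda p: [0.1, 0.2])
```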
get_input_embeddings ¶
load_weights ¶
Load weights for the active stage of the omni model.
move_submodules_to_devices ¶
move_submodules_to_devices(
*,
thinker_device: str | device | None = None,
talker_device: str | device | None = None,
) -> None
Optionally move thinker / talker to different devices.
Example
model.move_submodules_to_devices(
    thinker_device='cuda:0',
    talker_device='cuda:1',
)
MiniCPMO45OmniLLMForConditionalGeneration ¶
Bases: Module, SupportsMultiModal, SupportsPP, SupportsMRoPE
MiniCPM-o Thinker model: Image preprocessing + Vision encoder + 3D Resampler + LLM.
This model processes images through:
1. Image preprocessing (MiniCPMVImageProcessor)
2. Vision encoder (SiglipVisionTransformer)
3. 3D Resampler (Resampler to convert vision features to fixed-size queries)
4. LLM (Qwen2ForCausalLM for text generation)
audio_avg_pooler instance-attribute ¶
audio_projection_layer instance-attribute ¶
audio_projection_layer = MultiModalProjector(
in_dim=audio_output_dim, out_dim=embed_dim
)
image_processor instance-attribute ¶
image_processor = MiniCPMVImageProcessor(
max_slice_nums=max_slice_nums,
scale_resolution=image_size,
patch_size=patch_size,
use_image_id=use_image_id,
image_feature_size=query_num,
)
llm instance-attribute ¶
llm = init_vllm_registered_model(
vllm_config=vllm_config,
prefix=maybe_prefix(prefix, "llm"),
hf_config=text_config,
architectures=[llm_arch],
)
make_empty_intermediate_tensors instance-attribute ¶
resampler instance-attribute ¶
resampler = Resampler(
num_queries=query_num,
embed_dim=embed_dim,
num_heads=embed_dim // 128,
kv_dim=vision_dim,
adaptive=True,
)
compute_logits ¶
Compute logits from hidden states.
forward ¶
forward(
input_ids: Tensor,
positions: Tensor,
intermediate_tensors: IntermediateTensors | None = None,
inputs_embeds: Tensor | None = None,
**kwargs: object,
) -> Tensor | IntermediateTensors
Forward pass through thinker model.
get_audio_hidden_states ¶
get_audio_hidden_states(
data: MiniCPMOAudioFeatureInputs,
) -> list[Tensor]
get_input_embeddings ¶
get_input_embeddings(
input_ids: Tensor,
multimodal_embeddings: MultiModalEmbeddings
| None = None,
) -> Tensor
Get input embeddings combining text and multimodal features.
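One common way to combine text and multimodal features is to overwrite the text embedding at each placeholder token position with the next multimodal embedding row. The sketch below shows that idea on plain Python lists. The function name, the `placeholder_id` parameter, and the list-based representation are assumptions for illustration; the actual method operates on tensors and may merge differently.

```python
from typing import List, Sequence

_MISSING = object()  # sentinel used to detect leftover multimodal rows


def merge_embeddings_sketch(
    input_ids: Sequence[int],
    text_embeds: Sequence[Sequence[float]],
    mm_embeds: Sequence[Sequence[float]],
    placeholder_id: int,
) -> List[List[float]]:
    """Replace rows at placeholder positions with multimodal rows, in order."""
    if len(input_ids) != len(text_embeds):
        raise ValueError("input_ids and text_embeds must have the same length")
    merged = [list(row) for row in text_embeds]  # copy so inputs stay untouched
    mm_iter = iter(mm_embeds)
    for pos, token in enumerate(input_ids):
        if token == placeholder_id:
            row = next(mm_iter, _MISSING)
            if row is _MISSING:
                raise ValueError("more placeholder tokens than multimodal rows")
            merged[pos] = list(row)
    if next(mm_iter, _MISSING) is not _MISSING:
        raise ValueError("more multimodal rows than placeholder tokens")
    return merged


merged = merge_embeddings_sketch(
    input_ids=[1, 99, 99, 2],
    text_embeds=[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    mm_embeds=[[1.0, 1.0], [2.0, 2.0]],
    placeholder_id=99,
)
```

Checking that the placeholder count matches the number of multimodal rows catches mismatched preprocessing early instead of silently misaligning features.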
load_weights ¶
Load weights for thinker model components.
subsequent_chunk_mask ¶
subsequent_chunk_mask(
size: int,
chunk_size: int,
num_left_chunks: int = -1,
device: device = device("cpu"),
num_lookhead: int = 0,
) -> Tensor
Create a (size, size) mask for subsequent steps using the given chunk size. This is used by the streaming encoder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| size | int | size of mask | required |
| chunk_size | int | size of chunk | required |
| num_left_chunks | int | number of left chunks; if < 0, use all left chunks | -1 |
| device | device | "cpu" or "cuda" or torch.Tensor.device | device('cpu') |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: mask |
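To make the chunked mask concrete, the following is a minimal pure-Python sketch of the usual chunk-mask construction: each position may attend to every position up to the end of its own chunk, optionally limited to a fixed number of chunks to the left. It is an assumption that the documented function follows this scheme. The sketch returns nested lists of booleans rather than a tensor and omits `device` and `num_lookhead`, whose behaviour is not described on this page. `chunk_mask_sketch` is a hypothetical name.

```python
from typing import List


def chunk_mask_sketch(
    size: int, chunk_size: int, num_left_chunks: int = -1
) -> List[List[bool]]:
    """Build a (size, size) boolean mask; True means 'may attend'."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        chunk_index = i // chunk_size
        if num_left_chunks < 0:
            start = 0  # unlimited left context
        else:
            start = max((chunk_index - num_left_chunks) * chunk_size, 0)
        # Visible range ends at the end of the current chunk (clipped to size).
        end = min((chunk_index + 1) * chunk_size, size)
        for j in range(start, end):
            mask[i][j] = True
    return mask


full_left = chunk_mask_sketch(4, 2)
no_left = chunk_mask_sketch(4, 2, num_left_chunks=0)
```

For `size=4` and `chunk_size=2`, the first two rows see only the first chunk, while the last two rows see both chunks. With `num_left_chunks=0`, the last two rows see only their own chunk.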
MiniCPMO45OmniTTSForConditionalGeneration ¶
Bases: Module, SupportsPP
MiniCPM-o 4.5 Talker: MiniCPMTTS + Token2Wav in a single forward pass.
forward ¶
forward(
input_ids=None,
positions=None,
intermediate_tensors=None,
inputs_embeds=None,
additional_information=None,
**kwargs,
)