vllm_omni.model_executor.models.common.whisper_vq ¶
WhisperVQEncoder: HF WhisperEncoder + VQ codebook + inter-layer pooling.
Built on the standard WhisperConfig. VQ-specific parameters are patched onto the config object after from_pretrained, so callers never need a separate config class. GLM-TTS (and future VQ-based TTS models) only need::

    from transformers import WhisperConfig
    from vllm_omni.model_executor.models.common.whisper_vq import WhisperVQEncoder

    cfg = WhisperConfig.from_pretrained(checkpoint_dir)
    cfg.pooling_kernel_size = None  # passed through from checkpoint
    cfg.pooling_type = "max"  # or set by caller
    cfg.pooling_position = 0
    cfg.quantize_vocab_size = 32768
    cfg.quantize_position = 16
    cfg.quantize_encoder_only = True
    model = WhisperVQEncoder(cfg)

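The patching step above can be wrapped in a small helper. This is only a sketch: `patch_vq_attrs` and `VQ_DEFAULTS` are hypothetical names that do not exist in vllm_omni, and a `SimpleNamespace` stands in for `WhisperConfig` so the snippet runs without transformers installed. The default values are copied from the usage example above, and the helper assumes that any value already present on the config (for example, loaded from the checkpoint) should be kept unless the caller overrides it.

```python
from types import SimpleNamespace
from typing import Any

# Hypothetical defaults, copied from the usage example above.
VQ_DEFAULTS: dict[str, Any] = {
    "pooling_kernel_size": None,
    "pooling_type": "max",
    "pooling_position": 0,
    "quantize_vocab_size": 32768,
    "quantize_position": 16,
    "quantize_encoder_only": True,
}


def patch_vq_attrs(cfg: Any, **overrides: Any) -> Any:
    """Set VQ attributes on cfg, keeping values the config already carries."""
    unknown = set(overrides) - set(VQ_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown VQ attribute(s): {sorted(unknown)}")
    for name, default in VQ_DEFAULTS.items():
        if name in overrides:
            # An explicit caller override always wins.
            setattr(cfg, name, overrides[name])
        elif not hasattr(cfg, name):
            # Fill in a default only when the attribute is missing.
            setattr(cfg, name, default)
    return cfg


# SimpleNamespace stands in for WhisperConfig.from_pretrained(checkpoint_dir).
cfg = patch_vq_attrs(SimpleNamespace(pooling_kernel_size=4), pooling_type="avg")
```

Rejecting unknown keyword names catches typos such as `pooling_kernal_size` early, instead of silently attaching an attribute that the encoder never reads.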
QuantizedBaseModelOutput dataclass ¶
Bases: BaseModelOutput
quantized_token_ids class-attribute instance-attribute ¶
WhisperVQEncoder ¶
Bases: WhisperEncoder
HF Whisper encoder with optional VQ codebook and pooling.
Uses a standard WhisperConfig with the following VQ-specific attrs patched on by the caller (or loaded from the checkpoint's config.json):
- pooling_kernel_size -- int | None
- pooling_type -- "max" | "avg"
- pooling_position -- int (0-based layer index)
- quantize_vocab_size -- int | None
- quantize_position -- int (0-based layer index)
- quantize_encoder_only -- bool
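Because these attributes are patched on by hand, a quick type check before constructing the encoder can catch mistakes. The sketch below encodes only the types listed above; `validate_vq_config` is a hypothetical helper, not part of vllm_omni, and the non-negative check on the two layer indices is an assumption drawn from the "0-based layer index" wording.

```python
from types import SimpleNamespace
from typing import Any, Callable


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


# Attribute name -> predicate mirroring the documented types.
_VQ_CHECKS: dict[str, Callable[[Any], bool]] = {
    "pooling_kernel_size": lambda v: v is None or _is_int(v),
    "pooling_type": lambda v: v in ("max", "avg"),
    "pooling_position": lambda v: _is_int(v) and v >= 0,
    "quantize_vocab_size": lambda v: v is None or _is_int(v),
    "quantize_position": lambda v: _is_int(v) and v >= 0,
    "quantize_encoder_only": lambda v: isinstance(v, bool),
}


def validate_vq_config(cfg: Any) -> None:
    """Raise if a VQ attribute is missing or does not match its documented type."""
    for name, is_valid in _VQ_CHECKS.items():
        if not hasattr(cfg, name):
            raise AttributeError(f"config is missing VQ attribute {name!r}")
        value = getattr(cfg, name)
        if not is_valid(value):
            raise ValueError(f"invalid value for {name}: {value!r}")


# SimpleNamespace stands in for a patched WhisperConfig.
good_cfg = SimpleNamespace(
    pooling_kernel_size=None,
    pooling_type="max",
    pooling_position=0,
    quantize_vocab_size=32768,
    quantize_position=16,
    quantize_encoder_only=True,
)
validate_vq_config(good_cfg)  # passes silently
```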
forward ¶
forward(
    input_features: Tensor,
    attention_mask: Tensor | None = None,
    **_: Any,
) -> QuantizedBaseModelOutput
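For intuition about what `quantized_token_ids` contains, the sketch below shows generic temporal pooling followed by nearest-neighbor codebook lookup on plain Python lists. It is not the vllm_omni implementation: the function names are hypothetical, the non-overlapping windows (stride equal to kernel size), the dropping of trailing frames, and the squared-Euclidean distance are all assumptions, and the real encoder applies these steps at the layers selected by `pooling_position` and `quantize_position`.

```python
from typing import Sequence

Frame = Sequence[float]


def pool_frames(frames: Sequence[Frame], kernel_size: int, mode: str = "max") -> list[list[float]]:
    """Pool non-overlapping windows of frames along the time axis."""
    if mode not in ("max", "avg"):
        raise ValueError(f"unsupported pooling type: {mode!r}")
    pooled = []
    # Trailing frames that do not fill a whole window are dropped (assumption).
    for start in range(0, len(frames) - kernel_size + 1, kernel_size):
        window = frames[start:start + kernel_size]
        columns = zip(*window)  # iterate per feature dimension
        if mode == "max":
            pooled.append([max(col) for col in columns])
        else:
            pooled.append([sum(col) / kernel_size for col in columns])
    return pooled


def quantize(frames: Sequence[Frame], codebook: Sequence[Frame]) -> list[int]:
    """Return, for each frame, the index of the closest codebook entry."""
    token_ids = []
    for frame in frames:
        dists = [sum((a - b) ** 2 for a, b in zip(frame, code)) for code in codebook]
        token_ids.append(dists.index(min(dists)))
    return token_ids


frames = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.1, 0.8]]
codebook = [[0.0, 0.0], [1.0, 1.0]]
pooled = pool_frames(frames, kernel_size=2)  # [[0.2, 0.1], [1.1, 1.0]]
token_ids = quantize(pooled, codebook)       # [0, 1]
```

Pooling halves the sequence length here (four frames become two), so the number of token IDs shrinks by the same factor.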
remap_legacy_whisper_vq_state_dict ¶
Map CogAudio/GLM-TTS fork parameter names onto HF WhisperEncoder names.
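A remapping of this kind is usually a prefix rewrite over state-dict keys. The sketch below shows the general shape only. The actual CogAudio/GLM-TTS parameter names are not listed on this page, so the legacy prefixes in `HYPOTHETICAL_RENAMES` are placeholders, and `remap_keys` is an illustrative function rather than the signature of `remap_legacy_whisper_vq_state_dict`.

```python
from typing import Any, Mapping, Sequence

# Placeholder (legacy_prefix, hf_prefix) pairs. The legacy names are invented
# for illustration; only the targets follow HF WhisperEncoder attribute names.
HYPOTHETICAL_RENAMES: tuple[tuple[str, str], ...] = (
    ("legacy_encoder.blocks.", "layers."),
    ("legacy_encoder.ln_post.", "layer_norm."),
)


def remap_keys(
    state_dict: Mapping[str, Any],
    renames: Sequence[tuple[str, str]] = HYPOTHETICAL_RENAMES,
) -> dict[str, Any]:
    """Return a new dict with the first matching prefix rewritten on each key."""
    remapped: dict[str, Any] = {}
    for key, value in state_dict.items():
        new_key = key
        for old_prefix, new_prefix in renames:
            if key.startswith(old_prefix):
                new_key = new_prefix + key[len(old_prefix):]
                break
        if new_key in remapped:
            # Two source keys mapping to one target would silently drop weights.
            raise KeyError(f"duplicate key after remapping: {new_key!r}")
        remapped[new_key] = value
    return remapped


legacy = {"legacy_encoder.blocks.0.weight": 1, "codebook.weight": 2}
converted = remap_keys(legacy)
```

Keys that match no prefix pass through unchanged, which keeps the function safe to call on checkpoints that already use the HF names.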