speculators.models
Modules:
-
attention–Shared attention utilities for speculator models.
-
base_components–Shared base model components for all speculator types.
-
dflash–
-
eagle3–
-
metrics–
-
mtp–MTP (Multi-Token Prediction) speculator implementation.
-
peagle–
-
utils–
Classes:
-
DFlashDraftModel–
-
DFlashSpeculatorConfig–Configuration for DFlash speculator with vocabulary mapping.
-
Eagle3DraftModel–
-
Eagle3SpeculatorConfig–Configuration for EAGLE-3 speculator with vocabulary mapping.
-
MTPDraftModel–MTP speculator model for multi-token prediction.
-
MTPSpeculatorConfig–Configuration for MTP (Multi-Token Prediction) speculator.
-
PEagleDraftModel–P-EAGLE (Parallel EAGLE) draft model for speculative decoding.
-
PEagleSpeculatorConfig–Configuration for P-EAGLE (Parallel EAGLE) speculator.
DFlashDraftModel
Bases: DraftVocabMixin, SpeculatorModel
Methods:
-
from_training_args–Create DFlash model from training arguments.
-
get_trainer_kwargs–Get training and validation kwargs for DFlash.
Attributes:
-
target_layer_ids(list[int]) –Target layer IDs for auxiliary hidden states.
Source code in speculators/models/dflash/core.py
from_training_args classmethod
from_training_args(
verifier_config: PretrainedConfig,
t2d: Tensor | None = None,
d2t: Tensor | None = None,
**kwargs,
) -> DFlashDraftModel
Create DFlash model from training arguments.
Args:
    verifier_config: Verifier model configuration. This should be a config with num_hidden_layers set to the number of draft layers (created by create_transformer_layer_config in train.py).
    t2d: Target-to-draft vocabulary mapping tensor (optional).
    d2t: Draft-to-target vocabulary mapping tensor (optional).
    **kwargs: Training arguments with DFlash-specific params:
        - draft_vocab_size: Size of draft vocabulary
        - block_size: Block size for draft predictions (default: 8)
        - max_anchors: Max anchor positions during training (default: 256)
        - verifier_name_or_path: Path to verifier model
Returns: Initialized DFlashDraftModel
Note: The number of draft layers is encoded in verifier_config.num_hidden_layers, following the same pattern as EAGLE3.
Source code in speculators/models/dflash/core.py
get_trainer_kwargs staticmethod
Get training and validation kwargs for DFlash.
Args: **kwargs: Training arguments
Returns: Tuple of (train_call_kwargs, val_call_kwargs)
Source code in speculators/models/dflash/core.py
DFlashSpeculatorConfig
Bases: SpeculatorModelConfig
Configuration for DFlash speculator with vocabulary mapping.
DFlash features vocabulary mapping between draft (64K) and target (128K) vocabularies, enabling cross-tokenizer speculation.
Parameters:
-
transformer_layer_config–Configuration for the transformer decoder layer
-
draft_vocab_size–Size of draft model vocabulary for speculation
Methods:
-
serialize_transformer_config–Serialize transformer config to dict.
-
validate_transformer_config–Validate and convert transformer config.
Attributes:
-
target_vocab_size(int) –Get target vocabulary size from transformer config.
Source code in speculators/config.py
target_vocab_size property
Get target vocabulary size from transformer config.
serialize_transformer_config
validate_transformer_config classmethod
Validate and convert transformer config.
Source code in speculators/models/dflash/config.py
Eagle3DraftModel
Bases: DraftVocabMixin, SpeculatorModel
Methods:
-
from_training_args–Create Eagle3 model from training arguments.
-
get_trainer_kwargs–Get training and validation kwargs for Eagle3.
Attributes:
-
target_layer_ids(list[int]) –Target layer IDs for auxiliary hidden states.
Source code in speculators/models/eagle3/core.py
from_training_args classmethod
from_training_args(
verifier_config: PretrainedConfig,
t2d: Tensor | None = None,
d2t: Tensor | None = None,
**kwargs,
) -> Eagle3DraftModel
Create Eagle3 model from training arguments.
Args:
    verifier_config: Verifier model configuration.
    t2d: Target-to-draft vocabulary mapping tensor (optional).
    d2t: Draft-to-target vocabulary mapping tensor (optional).
    **kwargs: Training arguments with Eagle3-specific params:
        - num_layers: Number of decoder layers
        - norm_before_residual: Whether to normalize before residual connection
        - ttt_steps: Number of TTT steps
        - verifier_name_or_path: Path to verifier model
Returns: Initialized Eagle3DraftModel
Source code in speculators/models/eagle3/core.py
get_trainer_kwargs staticmethod
Get training and validation kwargs for Eagle3.
Args: **kwargs: Training arguments
Returns: Tuple of (train_call_kwargs, val_call_kwargs)
Source code in speculators/models/eagle3/core.py
Eagle3SpeculatorConfig
Bases: SpeculatorModelConfig
Configuration for EAGLE-3 speculator with vocabulary mapping.
EAGLE-3 features vocabulary mapping between draft (32K) and target (128K) vocabularies, enabling cross-tokenizer speculation.
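To make the draft/target vocabulary mapping more concrete, the sketch below builds a pair of mapping tables from token frequency counts using plain Python lists. The helper names `build_vocab_mapping` and `draft_to_target`, the frequency-based selection, and the encoding (`t2d` as a boolean mask over target IDs, `d2t` as per-draft-ID offsets) are assumptions made for this illustration; they are not taken from the library's source.

```python
def build_vocab_mapping(token_counts, draft_vocab_size):
    """Hypothetical helper: keep the most frequent target tokens as the draft vocabulary.

    token_counts[i] is how often target token i appeared in some corpus.
    Returns (t2d, d2t) under the encoding assumed in this sketch.
    """
    target_vocab_size = len(token_counts)
    # Rank target IDs by frequency (ties broken by smaller ID) and keep the top-k.
    ranked = sorted(range(target_vocab_size), key=lambda i: (-token_counts[i], i))
    kept = sorted(ranked[:draft_vocab_size])

    # t2d (assumed encoding): True where a target token exists in the draft vocabulary.
    kept_set = set(kept)
    t2d = [i in kept_set for i in range(target_vocab_size)]

    # d2t (assumed encoding): offset added to a draft ID to recover the target ID.
    d2t = [target_id - draft_id for draft_id, target_id in enumerate(kept)]
    return t2d, d2t


def draft_to_target(draft_id, d2t):
    """Map a draft-vocabulary ID back to its target-vocabulary ID."""
    return draft_id + d2t[draft_id]


counts = [5, 0, 9, 1, 7, 0, 3, 2]  # toy 8-token target vocabulary
t2d, d2t = build_vocab_mapping(counts, draft_vocab_size=4)
target_ids = [draft_to_target(i, d2t) for i in range(4)]
```

With a real model the same idea would be applied to tensors sized by the draft and target vocabularies rather than to short lists.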
Parameters:
-
transformer_layer_config–Configuration for the transformer decoder layer
-
draft_vocab_size–Size of draft model vocabulary for speculation
-
norm_before_residual–Apply hidden_norm before storing residual
Methods:
-
serialize_transformer_config–Serialize transformer config to dict.
-
validate_transformer_config–Validate and convert transformer config.
Attributes:
-
target_vocab_size(int) –Get target vocabulary size from transformer config.
Source code in speculators/config.py
target_vocab_size property
Get target vocabulary size from transformer config.
serialize_transformer_config
validate_transformer_config classmethod
Validate and convert transformer config.
Source code in speculators/models/eagle3/config.py
MTPDraftModel
Bases: DraftVocabMixin, SpeculatorModel
MTP speculator model for multi-token prediction.
Predicts multiple future tokens (default: 3) per forward pass using a single layer with weighted multi-step loss for training.
embed_tokens and lm_head are managed by DraftVocabMixin — initialized to NaN, populated via load_verifier_weights() (called automatically by from_pretrained), and excluded from saved checkpoints. verifier_lm_head is created by DraftVocabMixin but not used in the MTP forward pass.
Methods:
-
forward–Forward pass for MTP multi-token prediction (teacher-forced).
-
get_trainer_kwargs–Get training and validation kwargs for MTP.
-
load_verifier_weights–Re-set NaN sentinel before loading, since meta-device init may clear it.
Attributes:
-
layers(ModuleList) –Expose mtp_layers for FSDP wrapping compatibility.
-
target_layer_ids(list[int]) –MTP only uses the last hidden layer (verifier_last_hidden_states).
Source code in speculators/models/mtp/core.py
target_layer_ids property
MTP only uses the last hidden layer (verifier_last_hidden_states).
forward
forward(
input_ids: Tensor,
hidden_states: Tensor,
attention_mask: Tensor | None = None,
position_ids: Tensor | None = None,
loss_mask: Tensor | None = None,
step_weights: list[float] | None = None,
return_dict: bool = True,
**kwargs: Any,
) -> tuple
Forward pass for MTP multi-token prediction (teacher-forced).
At step k, uses ground-truth input_ids[t+k+1] as the embedding input and the MTP output from step k-1 (or verifier hidden states for step 0) as the hidden state input. Hidden states are passed recursively: each step's MTP output feeds the next step.
Targets are derived from input_ids via per-step offset slicing -- no separate label tensor is needed. Use loss_mask to exclude positions (e.g. prompt tokens) from the loss.
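The offset slicing described above can be sketched on a single token sequence. The function name `mtp_step_views` is hypothetical, and dropping trailing positions that lack a target (rather than padding them) is an assumption of this example; the real method operates on `[batch, seq_len]` tensors.

```python
def mtp_step_views(input_ids, step):
    """Return (embed_ids, target_ids) for one teacher-forced MTP step.

    For position t, step k embeds input_ids[t + k + 1] and is trained to
    predict input_ids[t + k + 2]. Positions without a target are dropped
    (an assumption of this sketch).
    """
    seq_len = len(input_ids)
    valid = max(seq_len - (step + 2), 0)  # number of positions that still have a target
    embed_ids = input_ids[step + 1 : step + 1 + valid]
    target_ids = input_ids[step + 2 : step + 2 + valid]
    return embed_ids, target_ids


tokens = [10, 11, 12, 13, 14, 15]
views = [mtp_step_views(tokens, k) for k in range(3)]
# Each later step sees a window shifted one token further into the future.
```

Because both views are slices of the same sequence, no separate label tensor has to be stored.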
Parameters:
-
input_ids(Tensor) –Token IDs [batch, seq_len]. Serves as both the embedding source and the prediction target (offset by step+2).
-
hidden_states(Tensor) –Hidden states from verifier [batch, seq_len, hidden_size]
-
attention_mask(Tensor | None, default:None) –Optional attention mask [batch, seq_len]
-
position_ids(Tensor | None, default:None) –Optional position IDs [batch, seq_len]
-
loss_mask(Tensor | None, default:None) –Optional binary mask [batch, seq_len]; 1=compute loss, 0=ignore.
-
step_weights(list[float] | None, default:None) –Per-step loss weights (None = uniform). Training only.
-
return_dict(bool, default:True) –Unused, kept for interface compatibility.
-
kwargs(Any, default:{}) –Absorbs unexpected batch keys (lengths, verifier_last_hidden_states)
Returns:
-
tuple–Tuple of (logits_list, loss, metrics)
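As an illustration of how a binary `loss_mask` can exclude positions such as prompt tokens, the sketch below averages per-token losses over the unmasked positions only. The function name `masked_mean` and the choice to return 0.0 when every position is masked are assumptions for this example, not the library's behavior.

```python
def masked_mean(per_token_loss, loss_mask=None):
    """Average per-token losses over positions where loss_mask is 1."""
    if loss_mask is None:
        loss_mask = [1] * len(per_token_loss)  # no mask: every position counts
    if len(loss_mask) != len(per_token_loss):
        raise ValueError("loss_mask and per_token_loss must have the same length")
    kept = sum(loss_mask)
    if kept == 0:
        return 0.0  # assumption: an all-masked sequence contributes no loss
    return sum(l * m for l, m in zip(per_token_loss, loss_mask)) / kept


losses = [2.0, 4.0, 6.0, 8.0]
mask = [0, 0, 1, 1]  # first two positions are prompt tokens
value = masked_mean(losses, mask)
unmasked = masked_mean(losses)
```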
Source code in speculators/models/mtp/core.py
get_trainer_kwargs staticmethod
Get training and validation kwargs for MTP.
Step weights are computed from step_weight_beta and num_speculative_steps using the normalized exponential-decay formula from FastMTP (arXiv:2509.18362), Equation 2.
Pass step_weights to override the computed weights.
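One plausible reading of "normalized exponential decay" is that step k receives a weight proportional to beta raised to the power k, with the weights rescaled to sum to 1. The sketch below follows that reading; the function name `exponential_step_weights`, the zero-based step index, and the exact formula are assumptions and should be checked against the paper and the source before relying on them.

```python
def exponential_step_weights(beta, num_steps):
    """Weights proportional to beta**k for k = 0..num_steps-1, normalized to sum to 1."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if beta <= 0:
        raise ValueError("beta must be positive")
    raw = [beta ** k for k in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]


weights = exponential_step_weights(beta=0.5, num_steps=3)
uniform = exponential_step_weights(beta=1.0, num_steps=4)
```

Under this form, a beta below 1 puts the most weight on the nearest future token, while a beta of exactly 1 reduces to uniform weights.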
Source code in speculators/models/mtp/core.py
load_verifier_weights
Re-set NaN sentinel before loading — meta-device init may clear it. Deletes verifier_lm_head after loading since MTP does not use it.
Source code in speculators/models/mtp/core.py
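The NaN-sentinel idea can be shown with a toy stand-in: values start as NaN, so any entry that was never overwritten by a checkpoint load is easy to detect. The class `SentinelWeights` and its methods are hypothetical and exist only for this sketch; they do not mirror the real `DraftVocabMixin` implementation.

```python
import math


class SentinelWeights:
    """Toy stand-in for a weight that must be filled from a verifier checkpoint."""

    def __init__(self, size):
        self.values = [math.nan] * size  # NaN marks "not loaded yet"

    def reset_sentinel(self):
        """Restore the NaN marker, e.g. after an init path overwrote it."""
        self.values = [math.nan] * len(self.values)

    def load(self, source):
        if len(source) != len(self.values):
            raise ValueError("size mismatch between source and destination")
        self.values = list(source)

    def is_loaded(self):
        return not any(math.isnan(v) for v in self.values)


weights = SentinelWeights(3)
weights.values = [0.0, 0.0, 0.0]  # simulate an init path that wiped the sentinel
weights.reset_sentinel()          # re-set the sentinel before loading
loaded_before = weights.is_loaded()
weights.load([0.1, 0.2, 0.3])
loaded_after = weights.is_loaded()
```

Re-setting the sentinel first means a failed or skipped load leaves NaNs behind instead of plausible-looking zeros.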
MTPSpeculatorConfig
Bases: SpeculatorModelConfig
Configuration for MTP (Multi-Token Prediction) speculator.
Architecture: a single MTP layer with attention and MLP, combining verifier hidden states with token embeddings via an explicit input projection. embed_tokens and lm_head share the verifier's full vocabulary.
Parameters:
-
transformer_layer_config–Configuration for the underlying transformer architecture (e.g., Qwen2Config). All architecture dimensions are derived from this config.
-
num_nextn_predict_layers–Number of MTP prediction heads in the checkpoint. vLLM reads this field directly to instantiate the correct number of MTP head instances. Currently only 1 is supported.
Source code in speculators/config.py
PEagleDraftModel
Bases: Eagle3DraftModel
P-EAGLE (Parallel EAGLE) draft model for speculative decoding.
P-EAGLE extends EAGLE-3 with parallel multi-token prediction using Conditional Drop Token (COD) sampling for memory-efficient training.
Methods:
-
forward–Forward pass for P-EAGLE model training with parallel group prediction.
-
from_training_args–Create P-EAGLE model from training arguments.
-
get_trainer_kwargs–Get training and validation kwargs for P-EAGLE.
Source code in speculators/models/peagle/core.py
forward
forward(
hidden_states: Tensor,
input_ids: Tensor,
document_ids: Tensor,
position_ids: Tensor | None = None,
loss_mask: Tensor | None = None,
verifier_last_hidden_states: Tensor | None = None,
loss_fn=kl_div_loss,
**kwargs,
)
Forward pass for P-EAGLE model training with parallel group prediction.
Args:
    hidden_states: Verifier hidden states [batch, seq_len, 3*hidden_size]
    input_ids: Input token IDs [batch, seq_len]
    document_ids: Document IDs [1, seq_len]; maps each position to its document index, with -1 for padding
    position_ids: Position IDs [batch, seq_len] (optional)
    loss_mask: Loss mask selecting which tokens to compute loss on [batch, seq_len]
    verifier_last_hidden_states: Verifier final hidden states for targets [batch, seq_len, hidden_size]
Returns: Tuple of (draft_tokens, loss, metrics)
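A `document_ids` row like the one described above is commonly used with packed sequences to keep attention inside each document. The sketch below builds such a mask with plain lists; the function name `document_causal_mask` is hypothetical, and whether P-EAGLE consumes `document_ids` in exactly this way is an assumption of this example.

```python
def document_causal_mask(document_ids):
    """mask[q][k] is True when query position q may attend to key position k.

    A query attends only to earlier-or-equal positions in the same document.
    Padding positions (document ID -1) attend to nothing.
    """
    n = len(document_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        if document_ids[q] == -1:
            continue  # padding row stays all False
        for k in range(q + 1):  # causal: keys up to and including q
            mask[q][k] = document_ids[k] == document_ids[q]
    return mask


doc_ids = [0, 0, 1, 1, -1]  # two packed documents followed by one pad position
mask = document_causal_mask(doc_ids)
```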
Source code in speculators/models/peagle/core.py
from_training_args classmethod
from_training_args(
verifier_config: PretrainedConfig,
t2d: Tensor | None = None,
d2t: Tensor | None = None,
**kwargs,
) -> PEagleDraftModel
Create P-EAGLE model from training arguments.
Args:
    verifier_config: Verifier model configuration.
    t2d: Target-to-draft vocabulary mapping tensor (optional).
    d2t: Draft-to-target vocabulary mapping tensor (optional).
    **kwargs: Training arguments with P-EAGLE-specific params:
        - draft_vocab_size: Size of draft vocabulary
        - norm_before_residual: Whether to normalize before residual
        - num_depths: Number of parallel groups (default 8)
        - down_sample_ratio: COD sampling ratio (default 0.7)
        - down_sample_ratio_min: Minimum sampling ratio (default 0.2)
        - mask_token_id: Mask token ID
        - verifier_name_or_path: Path to verifier model
Returns: Initialized PEagleDraftModel
Source code in speculators/models/peagle/core.py
get_trainer_kwargs staticmethod
Get training and validation kwargs for P-EAGLE.
Args: **kwargs: Training arguments
Returns: Tuple of (train_call_kwargs, val_call_kwargs)
Source code in speculators/models/peagle/core.py
PEagleSpeculatorConfig
Bases: Eagle3SpeculatorConfig
Configuration for P-EAGLE (Parallel EAGLE) speculator.
P-EAGLE extends EAGLE-3 with parallel multi-token prediction using Conditional Drop Token (COD) sampling for memory-efficient training.
Parameters:
-
num_depths–Number of parallel prediction groups (typically 8)
-
down_sample_ratio–Geometric decay ratio for COD sampling (r in [0,1])
-
down_sample_ratio_min–Minimum retention ratio floor
-
mask_token_id–Token ID used for masking
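The interaction between `num_depths`, `down_sample_ratio`, and `down_sample_ratio_min` can be pictured as a geometric retention schedule with a floor. The sketch below assumes that depth d keeps a fraction max(r^d, r_min) of positions; both that formula and the function name `cod_retention_ratios` are assumptions for illustration, not the library's implementation.

```python
def cod_retention_ratios(num_depths, down_sample_ratio, down_sample_ratio_min):
    """Assumed schedule: depth d retains max(r**d, r_min) of the positions."""
    if not 0.0 <= down_sample_ratio <= 1.0:
        raise ValueError("down_sample_ratio must lie in [0, 1]")
    return [
        max(down_sample_ratio ** depth, down_sample_ratio_min)
        for depth in range(num_depths)
    ]


# Values match the defaults listed for from_training_args (8, 0.7, 0.2).
ratios = cod_retention_ratios(8, 0.7, 0.2)
```

Under this assumed schedule, deeper prediction groups train on progressively fewer positions, which is where the memory saving would come from, and the floor keeps the deepest groups from being starved of training signal.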