speculators.models
Modules:
- base_components – Shared base model components for all speculator types.
- eagle – Speculators implementations providing a unified implementation
- eagle3
- mlp
Classes:
- Eagle3DraftModel
- Eagle3SpeculatorConfig – Configuration for EAGLE-3 speculator with vocabulary mapping.
- EagleSpeculator – A SpeculatorModel implementation for EAGLE and HASS variants for speculative decoding
- EagleSpeculatorConfig – A SpeculatorModelConfig implementation to be used with the EagleSpeculator
- MLPSpeculatorConfig – TODO
Eagle3DraftModel
Bases: SpeculatorModel
Methods:
- from_training_args – Create Eagle3 model from training arguments.
- get_trainer_kwargs – Get training and validation kwargs for Eagle3.
Source code in speculators/models/eagle3/core.py
from_training_args classmethod
Create Eagle3 model from training arguments.
Args:
- verifier_config: Verifier model configuration
- **kwargs: Training arguments with Eagle3-specific params
  - num_layers: Number of decoder layers
  - norm_before_residual: Whether to normalize before residual connection
  - t2d: Target-to-draft vocabulary mapping tensor
  - d2t: Draft-to-target vocabulary mapping tensor
  - ttt_steps: Number of TTT steps
  - verifier_name_or_path: Path to verifier model
Returns: Initialized Eagle3DraftModel
Source code in speculators/models/eagle3/core.py
get_trainer_kwargs staticmethod
Get training and validation kwargs for Eagle3.
Args: **kwargs: Training arguments
Returns: Tuple of (train_call_kwargs, val_call_kwargs)
Source code in speculators/models/eagle3/core.py
Eagle3SpeculatorConfig
Bases: SpeculatorModelConfig
Configuration for EAGLE-3 speculator with vocabulary mapping.
EAGLE-3 features vocabulary mapping between draft (32K) and target (128K) vocabularies, enabling cross-tokenizer speculation.
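The mapping between the two vocabularies can be illustrated with a toy example. This is a minimal sketch, not the library's implementation: it assumes that d2t stores a per-draft-token offset (so that target_id = draft_id + d2t[draft_id]) and that t2d is a boolean mask over the target vocabulary. The helper names build_mappings and draft_to_target are hypothetical, and the sizes are shrunk from 32K/128K to a handful of tokens.

```python
# Toy vocabulary mapping sketch (pure Python, illustrative sizes only).
# Assumption: d2t holds per-draft-token offsets and t2d is a boolean mask
# over the target vocabulary; the library's tensors may be laid out differently.

TARGET_VOCAB_SIZE = 8            # stands in for the ~128K target vocabulary
KEPT_TARGET_IDS = [0, 2, 3, 7]   # stands in for the ~32K draft vocabulary


def build_mappings(kept_ids, target_vocab_size):
    """Return (d2t offsets, t2d mask) for a sorted list of kept target ids."""
    d2t = [target_id - draft_id for draft_id, target_id in enumerate(kept_ids)]
    kept = set(kept_ids)
    t2d = [token_id in kept for token_id in range(target_vocab_size)]
    return d2t, t2d


def draft_to_target(draft_id, d2t):
    """Map a draft-vocabulary token id to its target-vocabulary id."""
    return draft_id + d2t[draft_id]


d2t, t2d = build_mappings(KEPT_TARGET_IDS, TARGET_VOCAB_SIZE)
mapped = [draft_to_target(i, d2t) for i in range(len(KEPT_TARGET_IDS))]
```

Under these assumptions, every draft token maps back to exactly one target token, while target tokens outside the mask cannot be proposed by the draft model at all.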
Parameters:
- transformer_layer_config – Configuration for the transformer decoder layer
- draft_vocab_size – Size of draft model vocabulary for speculation
- norm_before_residual – Apply hidden_norm before storing residual
Methods:
- serialize_transformer_config – Serialize transformer config to dict.
- validate_transformer_config – Validate and convert transformer config.
Attributes:
- target_vocab_size (int) – Get target vocabulary size from transformer config.
Source code in speculators/config.py
target_vocab_size property
Get target vocabulary size from transformer config.
serialize_transformer_config
validate_transformer_config classmethod
Validate and convert transformer config.
Source code in speculators/models/eagle3/config.py
EagleSpeculator
EagleSpeculator(
    config: EagleSpeculatorConfig,
    verifier: str | PathLike | PreTrainedModel | None = None,
    verifier_attachment_mode: Literal["detached", "full", "train_only"] | None = None,
)
Bases: SpeculatorModel
A SpeculatorModel implementation for EAGLE and HASS variants for speculative decoding:
- Eagle / Eagle v1: https://arxiv.org/abs/2401.15077
- Eagle v2: https://arxiv.org/abs/2406.16858
- HASS: https://arxiv.org/abs/2408.15766
Architecture Overview: The EAGLE speculator consists of:
1. Input embedding layer (shared with verifier)
2. Optional embedding layer normalization
3. Fusion layer: concatenates input embeddings and verifier hidden states, then projects them to a latent space of hidden_size
4. Single transformer decoder layer for candidate token generation
5. Optional pre-LM head layer normalization
6. Language model head (shared with verifier)
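The fusion step in the architecture overview can be sketched in a few lines. This is an illustration in pure Python rather than the model's tensor code: the function name fuse and the hand-picked weights are assumptions, and the optional bias only loosely mirrors the fusion_bias option described for from_training_args.

```python
# Pure-Python sketch of the fusion step: concat(embedding, hidden) -> hidden_size.
# Assumption: the real layer is a learned linear projection over tensors;
# the weights below are hand-picked so the result is easy to check.

def fuse(embedding, hidden_state, weight, bias=None):
    """Project the concatenated [embedding; hidden_state] vector.

    weight has shape (hidden_size, 2 * hidden_size), stored as a list of rows.
    """
    concatenated = embedding + hidden_state  # list concatenation, length 2 * hidden_size
    fused = [sum(w * x for w, x in zip(row, concatenated)) for row in weight]
    if bias is not None:
        fused = [value + b for value, b in zip(fused, bias)]
    return fused


embedding = [1.0, 2.0]       # input token embedding (hidden_size = 2)
hidden_state = [3.0, 4.0]    # verifier hidden state for the same position
# Each output unit averages one embedding feature with one hidden-state feature.
weight = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]
fused = fuse(embedding, hidden_state, weight)
fused_with_bias = fuse(embedding, hidden_state, weight, bias=[1.0, -1.0])
```

The output has length hidden_size even though the concatenated input has length 2 * hidden_size, which is what lets the single decoder layer that follows operate at the verifier's width.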
Speculative Decoding Process:
1. Verifier model processes input and generates hidden states
2. EAGLE speculator uses these hidden states + input embeddings to predict next tokens
3. Multiple candidate tokens generated in parallel using token proposal methods
4. Verifier validates candidates and accepts/rejects based on probability thresholds
5. Process continues iteratively for multi-token speculation
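The propose-then-verify loop can be sketched with toy stand-ins for both models. This is a simplified illustration: draft_next and verifier_next are hypothetical functions, and acceptance uses an exact greedy match instead of the probability thresholds mentioned above.

```python
# Toy greedy speculative decoding loop (pure Python).
# Assumptions: draft_next and verifier_next are hypothetical stand-ins for the
# speculator and verifier; acceptance here is exact greedy match, which is
# only one of several possible acceptance rules.

def verifier_next(tokens):
    """Hypothetical verifier: next token is the previous token plus one, mod 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Hypothetical draft: agrees with the verifier except after a token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def speculative_step(tokens, num_candidates):
    """Propose candidates, keep the verified prefix plus one verifier token."""
    proposal = []
    for _ in range(num_candidates):
        proposal.append(draft_next(tokens + proposal))

    accepted = []
    for candidate in proposal:
        expected = verifier_next(tokens + accepted)
        if candidate != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return tokens + accepted, len(accepted) - 1
        accepted.append(candidate)
    # All candidates accepted; the verifier contributes one bonus token.
    accepted.append(verifier_next(tokens + accepted))
    return tokens + accepted, len(proposal)


def generate(prompt, max_new_tokens, num_candidates=3):
    """Run speculative steps until max_new_tokens tokens have been produced."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens, _ = speculative_step(tokens, num_candidates)
    return tokens[: len(prompt) + max_new_tokens]


output = generate([1], max_new_tokens=6)
rejected_case = speculative_step([3], num_candidates=3)
```

With greedy matching, the final sequence is identical to what the verifier alone would produce; the draft only changes how many verifier steps are needed to get there.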
Example:
    from speculators import SpeculatorsConfig, VerifierConfig
    from speculators.models import EagleSpeculator, EagleSpeculatorConfig
    from speculators.proposals import GreedyTokenProposalConfig
    from transformers import AutoConfig

    # The verifier may be a Hugging Face model identifier, a local path,
    # or a PreTrainedModel instance.
    verifier = "meta-llama/Llama-3.1-8B-Instruct"

    config = EagleSpeculatorConfig(
        transformer_layer_config=AutoConfig.from_pretrained(verifier),
        speculators_config=SpeculatorsConfig(
            algorithm="eagle",
            proposal_methods=[
                GreedyTokenProposalConfig(),
            ],
            default_proposal_method="greedy",
            verifier=VerifierConfig(
                name_or_path=verifier,
                architectures=["LlamaForCausalLM"],
            ),
        ),
    )
    speculator = EagleSpeculator(
        config, verifier=verifier, verifier_attachment_mode="full"
    )
Initializes an EAGLE speculator architecture with configurable components based on the provided configuration. The model starts with verifier-dependent layers (embed_tokens, rotary_emb, lm_head) set to None until a verifier is attached.
Parameters:
- config (EagleSpeculatorConfig) – Configuration object specifying model architecture, layer settings, and speculative decoding parameters. Must be an instance of EagleSpeculatorConfig containing transformer layer configuration and EAGLE-specific settings.
- verifier (str | PathLike | PreTrainedModel | None, default: None) – Optional verifier model to attach for speculative decoding. Can be a path to a model directory, Hugging Face model identifier, or PreTrainedModel instance. If None, must be attached later via attach_verifier() before using the model.
- verifier_attachment_mode (Literal['detached', 'full', 'train_only'] | None, default: None) – Mode for verifier attachment. "detached" prevents attachment even if verifier is provided. "full" enables complete integration for both training and generation. "train_only" attaches only components needed for training, optimizing memory usage.
Methods:
- attach_verifier – Attach a verifier model to the EagleSpeculator for speculative decoding.
- detach_verifier – Removes the reference to the attached verifier model and frees up the associated memory.
- forward – Execute the forward pass for speculative token generation.
- from_training_args – Create EAGLE model from training arguments.
- get_trainer_kwargs – Get training and validation kwargs for EAGLE.
Source code in speculators/models/eagle.py
attach_verifier
attach_verifier(
    verifier: str | PathLike | PreTrainedModel,
    mode: Literal["full", "train_only"] | None = None,
)
Attach a verifier model to the EagleSpeculator for speculative decoding. Utilizes the verifier's embed_tokens, rotary_emb, and lm_head layers for the speculator's forward pass and generation methods. Additionally, for generate, it uses the verifier's hidden states to generate speculative token predictions.
If mode is "full", the verifier is fully integrated for use with both generate and forward methods.
If mode is "train_only", only the verifier's layers required for a forward pass are attached, allowing for better resource utilization during training. generate will not be available until a full verifier is attached.
Example:
    from transformers import AutoModelForCausalLM

    # Create the speculator and load a verifier to attach
    speculator = EagleSpeculator(...)
    verifier = AutoModelForCausalLM.from_pretrained(...)

    # For generation
    speculator.attach_verifier(verifier)
    outputs = speculator.generate(input_ids)
    speculator.detach_verifier()

    # For training
    speculator.attach_verifier(verifier, mode="train_only")
    outputs = speculator(input_ids, hidden_states)
    speculator.detach_verifier()
Parameters:
- verifier (str | PathLike | PreTrainedModel) – The verifier model to attach. This can be a path to a local model directory, a Hugging Face model identifier, or an instance of PreTrainedModel. If a path or identifier is provided, the model will be loaded automatically. If an instance is provided, it will be used directly.
- mode (Literal['full', 'train_only'] | None, default: None) – The mode for attaching the verifier. Can be "full" or "train_only". If None, defaults to "full". In "train_only" mode, only the layers required for a forward pass are attached, and the speculator cannot perform generation until a full verifier is attached.
Returns:
- The PreTrainedModel instance for the verifier that was attached.
Source code in speculators/models/eagle.py
detach_verifier
Removes the reference to the attached verifier model and frees up the associated memory. After calling this method, the speculator will not be able to perform forward passes or generation until a new verifier is attached.
Source code in speculators/models/eagle.py
forward
forward(
    input_ids: LongTensor,
    hidden_states: FloatTensor,
    attention_mask: Tensor | None = None,
    position_ids: LongTensor | None = None,
    past_key_values: tuple[tuple[FloatTensor]] | None = None,
    use_cache: bool | None = None,
    output_attentions: bool | None = None,
    output_hidden_states: bool | None = None,
    return_dict: bool | None = None,
) -> torch.FloatTensor | CausalLMOutputWithPast
Execute the forward pass for speculative token generation.
Processes input tokens and verifier hidden states through the EAGLE architecture to generate candidate tokens for speculative decoding. The method combines input embeddings with verifier hidden states via a fusion layer, processes them through a transformer decoder layer, and produces logits for next token prediction.
Parameters:
- input_ids (LongTensor) – Token IDs for the current input sequence. Shape: (batch_size, sequence_length). These represent the tokens that will be converted to embeddings and combined with verifier hidden states.
- hidden_states (FloatTensor) – Hidden state representations from the verifier model corresponding to the input sequence. Shape: (batch_size, sequence_length, hidden_size). These capture the verifier's understanding of the context.
- attention_mask (Tensor | None, default: None) – Optional attention mask to avoid attending to padding tokens. Shape: (batch_size, sequence_length) for 2D or (batch_size, 1, sequence_length, sequence_length) for 4D causal mask.
- position_ids (LongTensor | None, default: None) – Optional position indices for tokens in the sequence. Shape: (batch_size, sequence_length). If None, auto-generated based on sequence length and past key values.
- past_key_values (tuple[tuple[FloatTensor]] | None, default: None) – Optional cached key-value states from previous forward passes for efficient generation. Tuple of layer key-value pairs.
- use_cache (bool | None, default: None) – Whether to return key-value states for caching in subsequent forward passes. Useful for autoregressive generation efficiency.
- output_attentions (bool | None, default: None) – Whether to return attention weights from the transformer layer. Used for analysis and visualization.
- output_hidden_states (bool | None, default: None) – Whether to return hidden states from the transformer layer. Currently not implemented in this model.
- return_dict (bool | None, default: None) – Whether to return structured CausalLMOutputWithPast instead of raw logits. If None, uses config.use_return_dict default.
Returns:
- FloatTensor | CausalLMOutputWithPast – Either raw logits tensor (batch_size, sequence_length, vocab_size) if return_dict=False, or CausalLMOutputWithPast containing logits, past key values, and optional attention weights.
Raises:
- ValueError – If verifier components (embed_tokens, rotary_emb, lm_head) are not attached. Call attach_verifier() before using forward().
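The position_ids default described in the parameter list (auto-generated from sequence length and past key values) can be pictured as follows. This is a sketch under the assumption that positions simply continue from the number of cached tokens; default_position_ids is a hypothetical helper, and nested lists stand in for tensors.

```python
# Sketch of default position ids when a key-value cache is present.
# Assumption: positions continue from the number of cached tokens; the
# library's actual derivation may differ in detail.

def default_position_ids(batch_size, sequence_length, past_length=0):
    """Return one row of positions per batch element, offset by the cache length."""
    positions = list(range(past_length, past_length + sequence_length))
    return [list(positions) for _ in range(batch_size)]


# First pass over a 4-token prompt, no cache yet.
prefill = default_position_ids(batch_size=2, sequence_length=4)
# One new token per sequence after 4 tokens have been cached.
decode_step = default_position_ids(batch_size=2, sequence_length=1, past_length=4)
```

Both results have shape (batch_size, sequence_length), matching the shape documented for position_ids.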
Source code in speculators/models/eagle.py
from_training_args classmethod
Create EAGLE model from training arguments.
Args:
- verifier_config: Verifier model configuration
- **kwargs: Training arguments with EAGLE-specific params
  - layernorms: Whether to include layer normalization layers
  - fusion_bias: Whether to add bias to fusion layer
  - transformer_layer_architecture: Name of transformer decoder layer class
  - verifier_name_or_path: Path to verifier model
Returns: Initialized EagleSpeculator
Source code in speculators/models/eagle.py
get_trainer_kwargs staticmethod
Get training and validation kwargs for EAGLE.
EAGLE doesn't require any special forward pass arguments during training, so this returns empty dictionaries.
Args: **kwargs: Training arguments (unused)
Returns: Tuple of (train_call_kwargs, val_call_kwargs), both empty dicts
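The return contract can be illustrated with a toy class. Only the "tuple of two empty dicts" shape comes from the description above; ToyEagle and toy_forward are hypothetical stand-ins, and how a real trainer consumes the dictionaries is an assumption.

```python
# Sketch of the (train_call_kwargs, val_call_kwargs) contract.
# Assumptions: ToyEagle and toy_forward are hypothetical stand-ins; only the
# "two empty dicts" return shape comes from the description above.

class ToyEagle:
    @staticmethod
    def get_trainer_kwargs(**kwargs):
        """Return extra forward kwargs for training and validation (none here)."""
        return {}, {}


def toy_forward(batch, **call_kwargs):
    """Hypothetical forward call that records which extra kwargs it received."""
    return {"batch": batch, "extra": dict(call_kwargs)}


train_kwargs, val_kwargs = ToyEagle.get_trainer_kwargs(learning_rate=1e-4)
train_out = toy_forward([1, 2, 3], **train_kwargs)
val_out = toy_forward([4, 5, 6], **val_kwargs)
```

Because both dictionaries are empty, unpacking them into the forward call adds no arguments, so a training loop written this way would treat training and validation calls identically.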
Source code in speculators/models/eagle.py
EagleSpeculatorConfig
Bases: SpeculatorModelConfig
A SpeculatorModelConfig implementation to be used with the EagleSpeculator for EAGLE and HASS variants for speculative decoding:
- Eagle / Eagle v1: https://arxiv.org/abs/2401.15077
- Eagle v2: https://arxiv.org/abs/2406.16858
- HASS: https://arxiv.org/abs/2408.15766
Model Configurations:
- EAGLE1: layernorms=False, fusion_bias=False
- EAGLE2: layernorms=False, fusion_bias=False
- HASS: layernorms=False, fusion_bias=True
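The variant settings listed above can be kept in one lookup table. The flag values are transcribed from that list; the dictionary name, the helper, and the idea of passing the result as keyword arguments to EagleSpeculatorConfig are assumptions made for illustration.

```python
# Flag presets transcribed from the list above; the dict and helper are
# hypothetical conveniences, not part of the library.

EAGLE_VARIANT_FLAGS = {
    "EAGLE1": {"layernorms": False, "fusion_bias": False},
    "EAGLE2": {"layernorms": False, "fusion_bias": False},
    "HASS": {"layernorms": False, "fusion_bias": True},
}


def variant_flags(name):
    """Look up the flags for a variant name, case-insensitively."""
    key = name.upper()
    if key not in EAGLE_VARIANT_FLAGS:
        raise ValueError(f"Unknown variant: {name!r}")
    return dict(EAGLE_VARIANT_FLAGS[key])  # copy so callers cannot mutate the table


hass_flags = variant_flags("hass")
eagle1_flags = variant_flags("EAGLE1")
```

Per the list above, the only flag that separates HASS from the two EAGLE presets is fusion_bias.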
Example:
    from speculators import SpeculatorsConfig, VerifierConfig
    from speculators.models import EagleSpeculatorConfig
    from speculators.proposals import GreedyTokenProposalConfig
    from transformers import AutoConfig

    config = EagleSpeculatorConfig(
        transformer_layer_config=AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B-Instruct"),
        speculators_config=SpeculatorsConfig(
            algorithm="eagle",
            proposal_methods=[
                GreedyTokenProposalConfig(),
            ],
            default_proposal_method="greedy",
            verifier=VerifierConfig(
                name_or_path="meta-llama/Llama-3.1-8B-Instruct",
                architectures=["LlamaForCausalLM"],
            ),
        ),
    )
Methods:
- check_add_architectures – Automatically adds the transformer layer architecture to the architectures list if it's not already present.
- serialize_transformer_layer_config – Serialize the transformer_layer_config to a dictionary for JSON storage.
- validate_transformer_layer_config – Validate and convert transformer_layer_config to a PretrainedConfig instance.
Source code in speculators/config.py
check_add_architectures
Automatically adds the transformer layer architecture to the architectures list if it's not already present.
Returns:
- Self – The validated configuration instance with updated architectures
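The "add if not already present" behavior can be sketched with a small dataclass. ToyConfig is a hypothetical stand-in, the field values are illustrative, and the real method presumably runs as a validator on the configuration model.

```python
# Sketch of an "append if missing" architectures update.
# Assumption: ToyConfig is a hypothetical stand-in; the real method runs on
# the library's configuration model rather than a plain dataclass.

from dataclasses import dataclass, field


@dataclass
class ToyConfig:
    transformer_layer_architecture: str = "LlamaDecoderLayer"
    architectures: list = field(default_factory=lambda: ["EagleSpeculator"])

    def check_add_architectures(self):
        """Add the layer architecture once and return self for chaining."""
        if self.transformer_layer_architecture not in self.architectures:
            self.architectures.append(self.transformer_layer_architecture)
        return self


config = ToyConfig().check_add_architectures()
config.check_add_architectures()  # second call is a no-op
```

Returning the instance itself matches the documented Self return type and makes the check safe to run more than once.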
Source code in speculators/models/eagle.py
serialize_transformer_layer_config
Serialize the transformer_layer_config to a dictionary for JSON storage.
Converts the PretrainedConfig object to its dictionary representation using to_diff_dict() to only include non-default values.
Parameters:
- value (PretrainedConfig) – The PretrainedConfig instance to serialize
Returns:
- dict – Dictionary representation of the transformer layer configuration
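The idea of serializing only non-default values can be shown with a toy dataclass. ToyLayerConfig and its fields are hypothetical; the to_diff_dict() of a real PretrainedConfig handles more cases than this sketch.

```python
# Toy "diff dict" serializer: keep only fields that differ from their defaults.
# Assumption: ToyLayerConfig is a hypothetical stand-in for a PretrainedConfig;
# the real to_diff_dict() handles more cases (nested configs, class defaults).

from dataclasses import dataclass, fields


@dataclass
class ToyLayerConfig:
    hidden_size: int = 4096
    num_attention_heads: int = 32
    rms_norm_eps: float = 1e-6

    def to_diff_dict(self):
        """Return only the fields whose values differ from the dataclass defaults."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) != f.default
        }


serialized = ToyLayerConfig(hidden_size=2048).to_diff_dict()
unchanged = ToyLayerConfig().to_diff_dict()
```

Storing only the differences keeps the saved JSON short, at the cost of depending on the defaults staying the same when the configuration is loaded again.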
Source code in speculators/models/eagle.py
validate_transformer_layer_config classmethod
Validate and convert transformer_layer_config to a PretrainedConfig instance.
Accepts either a dictionary that can be converted to a PretrainedConfig or an existing PretrainedConfig instance.
Parameters:
- value (Any) – The value to validate (dict or PretrainedConfig)
Returns:
- PretrainedConfig – A validated PretrainedConfig instance
Raises:
- ValueError – If the value cannot be converted to a PretrainedConfig
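The "dict or existing instance" contract described above can be sketched as a small function. ToyPretrainedConfig stands in for PretrainedConfig, and validate_layer_config and its error message are hypothetical; the real classmethod may convert dictionaries differently.

```python
# Sketch of a "dict or instance" validator.
# Assumptions: ToyPretrainedConfig is a hypothetical stand-in for
# PretrainedConfig, and the error message is illustrative.

class ToyPretrainedConfig:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def validate_layer_config(value):
    """Return a config instance, converting from a dict when needed."""
    if isinstance(value, ToyPretrainedConfig):
        return value  # already a config, pass it through unchanged
    if isinstance(value, dict):
        return ToyPretrainedConfig(**value)
    raise ValueError(f"Cannot convert {type(value).__name__} to a config")


from_dict = validate_layer_config({"hidden_size": 2048})
existing = ToyPretrainedConfig(hidden_size=1024)
passthrough = validate_layer_config(existing)
```

Accepting both forms lets the same field be populated from a loaded JSON file (a dict) or from code that already holds a config object.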
Source code in speculators/models/eagle.py
MLPSpeculatorConfig
Bases: SpeculatorModelConfig
TODO