Skip to content

Algorithm Decision Guide

Speculators currently supports four speculative decoding algorithms: Eagle-3, P-EAGLE, DFlash, and MTP. All are lossless -- they produce output from the same distribution as the target model.

How They Differ

Eagle-3 predicts draft tokens autoregressively, one at a time.

P-EAGLE extends Eagle-3 with parallel multi-token prediction across multiple depths, using COD sampling for memory-efficient training.

DFlash predicts all draft tokens in a single forward pass using block-based prediction with anchor points.

MTP finetunes the model's native multi-token prediction head on domain-specific data. Unlike the other algorithms, MTP does not train from scratch -- it starts from pre-existing MTP layers and is only available for models with native MTP support.

Eagle-3, P-EAGLE, and DFlash can be paired with any supported verifier model (including quantized variants) -- the draft architecture is independent of the verifier architecture. MTP requires a model with native MTP layers (e.g. Qwen3-Next, Qwen3.5).

Current Support

Eagle-3 P-EAGLE DFlash MTP
Draft layers Llama-style Llama-style Qwen3-style Native MTP layers
Verifier models Any supported Any supported Any supported Models with native MTP only
Training mode From scratch From scratch From scratch Finetune existing MTP head
Speculators Mature Newer, growing fast Newer, growing fast Newer, growing fast
vLLM Mature Newer, growing fast Newer, growing fast Newer, growing fast

Eagle-3 has been available longer and has broader support in both Speculators and vLLM. P-EAGLE, DFlash, and MTP were added more recently and support is improving rapidly.

Which Should I Use?

If you're unsure, start with Eagle-3 -- it has the most mature tooling and documentation. If you want parallel multi-token prediction with an Eagle-3-based architecture, try P-EAGLE. If you want to experiment with DFlash's single-forward-pass block prediction approach, the training workflow is the same. If your model already has native MTP layers (e.g. Qwen3-Next, Qwen3.5), MTP finetuning lets you improve the existing MTP head on domain-specific data without training a separate draft model.

For more details on each algorithm, see: