Getting Started
What is Speculative Decoding?
Speculative decoding is a technique that speeds up LLM inference without changing the output. A small, fast draft model (the "speculator") proposes several tokens at once, and the larger target model verifies them in a single forward pass. Accepted tokens are guaranteed to come from the same distribution as what the target model would have produced on its own, so the speedup is lossless.
Serve a Speculator with vLLM
The fastest way to try speculative decoding is to serve a pre-trained speculator with vLLM:
vLLM reads from the model's config, loads both the speculator and the target model, and enables speculative decoding automatically. See the Serve in vLLM tutorial for more options.
Pre-trained Speculator Models
Browse the full collection of ready-to-use speculator models on Hugging Face:
For a list of models that have been trained and validated end-to-end by our team, see the supported models table.
Supported Algorithms
Speculators supports multiple speculative decoding algorithms:
For help choosing between them, see the Algorithm Decision Guide.
Train Your Own Speculator
If a pre-trained speculator isn't available for your target model, you can train one with Speculators. Training requires internal hidden states from the target model, which are extracted by serving the target model with vLLM. The library supports both online and offline training modes:
- Online training -- Hidden states are generated on-the-fly during training. Easier to get started, lower disk usage.
- Offline training -- Hidden states are pre-generated and cached.
Tutorials
- Train Eagle-3 Online -- Recommended starting point
- Train Eagle-3 Offline
- Train DFlash Online -- Starting point for DFlash
- Evaluating Model Performance -- Benchmark your trained speculator
- Response Regeneration -- Improve training data quality
See all tutorials in the Tutorials section.