Speculators Documentation

Overview

Speculators is a unified library for building, training, and storing speculative decoding algorithms for large language model (LLM) inference, including in frameworks like vLLM. Speculative decoding is a lossless technique that speeds up LLM inference: a smaller, faster draft model (i.e., "the speculator") proposes tokens, and the larger base model verifies them, reducing latency without compromising output quality.

The speculator drafts multiple tokens ahead of time, and the base model verifies them in a single forward pass. Output quality is preserved because every accepted token is guaranteed to follow the same distribution that the base model would produce on its own.
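The draft-and-verify loop described above can be sketched with toy models. This is a minimal illustration of the standard speculative sampling accept/reject rule, not the Speculators or vLLM implementation. The names `speculative_step`, `draft`, and `target` are hypothetical, and each "model" is just a function that maps a token prefix to a next-token distribution.

```python
import random
from typing import Callable, List, Sequence

Dist = List[float]
Model = Callable[[Sequence[int]], Dist]  # prefix -> next-token distribution


def sample(dist: Dist, rng: random.Random) -> int:
    """Draw one token id from an (unnormalized) distribution."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_step(prefix: Sequence[int], draft: Model, target: Model,
                     k: int, rng: random.Random) -> List[int]:
    """Run one draft-and-verify round and return the newly emitted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    ctx = list(prefix)
    drafted, draft_dists = [], []
    for _ in range(k):
        q = draft(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        draft_dists.append(q)
        ctx.append(tok)

    # 2. The target model verifies them. A real verifier scores all positions
    #    in a single forward pass; this toy version calls it once per position.
    ctx = list(prefix)
    emitted: List[int] = []
    for tok, q in zip(drafted, draft_dists):
        p = target(ctx)
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            ctx.append(tok)
            continue
        # On rejection, resample from the residual max(p - q, 0) and stop.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        emitted.append(sample(residual, rng))
        return emitted

    # 3. Every draft was accepted: take one extra token from the target.
    emitted.append(sample(target(ctx), rng))
    return emitted
```

Each round emits between 1 and k + 1 tokens. The closer the draft distribution is to the target distribution, the more drafted tokens are accepted per verification pass.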

Speculators standardizes this process by providing a productionized end-to-end framework to train draft models with reusable formats and tools. Trained models can seamlessly run in vLLM, enabling the deployment of speculative decoding in production-grade inference servers.

[Figure: Speculators user flow diagram]

Key Features

Offline Training Data Generation using vLLM: Generates hidden states with vLLM and saves the data samples to disk for use in draft model training.
Draft Model Training Support: End-to-end (E2E) training support for single-layer and multi-layer draft models. Training is supported for MoE, non-MoE, and vision-language models.
Standardized, Extensible Format: Provides a Hugging Face-compatible format for defining speculative models, with tools to convert from external research repositories into a standard speculators format for easy adoption.
Seamless vLLM Integration: Built for direct deployment into vLLM, enabling low-latency, production-grade inference with minimal overhead.

Tip

Read more about Speculators features in this vLLM blog post.

Quick Start

To try out a speculative decoding model, you can get started by running a pre-trained one with vLLM. After installing vLLM, run:

vllm serve RedHatAI/Qwen3-8B-speculator.eagle3

(Or choose another model from the RedHatAI/speculator-models collection.)
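Once the server is running, it can be queried like any other vLLM deployment. The sketch below assumes vLLM's default OpenAI-compatible endpoint at `http://localhost:8000/v1` and that the served model name matches the name passed to `vllm serve`. The helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL = "RedHatAI/Qwen3-8B-speculator.eagle3"


def build_completion_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    body = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt, max_tokens)) as resp:
        return json.load(resp)["choices"][0]["text"]


if __name__ == "__main__":
    # Requires the `vllm serve` command above to be running.
    print(complete("Speculative decoding is"))
```

The client does not need to know that speculative decoding is enabled: the request and response formats are the same as for a server running the verifier model alone.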

Behind the scenes, this reads the model from Hugging Face, parses the speculators_config, and sets up both the speculator and verifier models to run together.
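To see what is being parsed, you can inspect the configuration of a locally downloaded checkpoint. This sketch assumes that `speculators_config` is stored as a top-level key in the checkpoint's `config.json`; it makes no assumptions about the fields inside it. The function name `read_speculators_config` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def read_speculators_config(model_dir: str) -> Optional[Dict[str, Any]]:
    """Return the `speculators_config` section of config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: the section is a top-level key of the Hugging Face config.
    return config.get("speculators_config")


if __name__ == "__main__":
    import sys

    # Usage: python inspect_config.py /path/to/downloaded/model
    print(json.dumps(read_speculators_config(sys.argv[1]), indent=2))
```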

To create a speculative decoding model for a different verifier model, you can take one of two approaches:

Train a new speculative decoding model - See Getting Started and Tutorials.
Convert an existing model from a third-party library to the Speculators format for easy deployment with vLLM - See Features.

Supported Models

The following table summarizes the models that have been trained end-to-end by our team:

Verifier Architecture	Verifier Size	Training Support	vLLM Deployment Support
Llama	8B-Instruct	Eagle-3 ✅	✅
Llama	70B-Instruct	Eagle-3 ✅	✅
Qwen3	8B	Eagle-3 ✅	✅
Qwen3	14B	Eagle-3 ✅	✅
Qwen3	32B	Eagle-3 ✅	✅
gpt-oss	20b	Eagle-3 ✅	✅
gpt-oss	120b	Eagle-3 ✅	✅
Qwen3 MoE	30B-Instruct	Eagle-3 ✅	✅
Qwen3 MoE	235B-Instruct	Eagle-3 ✅	✅
Qwen3 MoE	235B	Eagle-3 ✅	✅
Qwen3-VL	235B-A22B	Eagle-3 ✅	✅
Mistral 3 Large	675B-Instruct	Eagle-3 ⏳	⏳
Gemma 4	31B-it	Eagle-3 ✅, DFlash ✅	✅
Gemma 4 MoE	26B-A4B-it	Eagle-3 ✅	✅

✅ = Supported, ⏳ = In Progress, ❌ = Not Yet Supported

Installation

Prerequisites

Before installing, ensure you have the following:

Operating System: Linux or macOS
Python: 3.10 or higher
Package Manager: pip (recommended) or conda
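The first two prerequisites can be checked programmatically before installing. This is a small sketch using only the standard library; the function name `check_prerequisites` is hypothetical, and it relies on `platform.system()` reporting macOS as `"Darwin"`.

```python
import platform
import sys
from typing import List, Optional, Tuple

MIN_PYTHON = (3, 10)
SUPPORTED_SYSTEMS = {"Linux", "Darwin"}  # platform.system() reports macOS as "Darwin"


def check_prerequisites(version: Optional[Tuple[int, int]] = None,
                        system: Optional[str] = None) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    version = tuple(sys.version_info[:2]) if version is None else version
    system = platform.system() if system is None else system
    problems = []
    if version < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is older than 3.10")
    if system not in SUPPORTED_SYSTEMS:
        problems.append(f"Unsupported operating system: {system}")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("OK" if not issues else "\n".join(issues))
```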

Install from PyPI (Recommended)

Install the latest stable release from PyPI:

pip install speculators
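To confirm that the package was installed, you can look up its distribution metadata. This sketch uses the standard library's `importlib.metadata` and the distribution name from the `pip install` command above; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("speculators")
    print(f"speculators {version}" if version else "speculators is not installed")
```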

Install from Source

For the latest development version or to contribute to the project:

git clone https://github.com/vllm-project/speculators.git
cd speculators

pip install -e .

For development with additional tools:

pip install -e ".[dev]"

Community & Support

💬 Join us on the vLLM Community Slack and share your questions, thoughts, or ideas in:

#speculators
#feat-spec-decode

🎥 Watch our Office Hours presentation: Video | Slides

For more community resources, see the Community page.

Next Steps

User Guide - Learn how to use Speculators
Getting Started - Quick start guide for training models
Tutorials - Step-by-step walkthroughs
API Reference - Python API documentation
CLI Reference - Command-line tools documentation

License

Speculators is licensed under the Apache License 2.0.

Citation

If you find Speculators helpful in your research or projects, please consider citing it:

@misc{speculators2025,
  title={Speculators: A Unified Library for Speculative Decoding Algorithms in LLM Serving},
  author={Red Hat},
  year={2025},
  howpublished={\url{https://github.com/vllm-project/speculators}},
}