Serve in vLLM
This tutorial shows you how to deploy your trained speculator model for production inference using vLLM.
After training a speculator, deploying it with vLLM is straightforward. vLLM automatically recognizes the speculators_config in your model and enables speculative decoding.
Basic Serving
That's it! vLLM will:
- Load your speculator model
- Read the
speculators_configfrom config.json - Load the target/verifier model
- Enable speculative decoding automatically
Long form command
vllm serve Qwen/Qwen3-8B \
-tp 1 \
--speculative-config '{
"model": "RedHatAI/Qwen3-8B-speculator.eagle3",
"num_speculative_tokens": 3,
"method": "eagle3"
}'
For the long form command, pass in the target model first (e.g. Qwen/Qwen3-8B) and then specify the draft model in the speculative config (e.g. RedHatAI/Qwen3-8B-speculator.eagle3).
The long form command can be used to change the num_speculative_tokens or to use a different target model with the speculator. This can be used to combine a quantized target model, with a speculative decoding model for even better performance.