Dynamic Speculative Decoding¶
Why is Dynamic SD needed?¶
Speculative decoding (SD) methods must verify K draft tokens for each sequence at every decode step. As the batch size (BS) grows, the effective batch size during verification becomes BS*K, which raises the compute requirement. Once BS*K exceeds a critical batch size, SD slows decoding down (higher time per output token, TPOT) instead of speeding it up. Dynamic SD (DSD) addresses this by tuning K to the current batch size so that SD keeps paying off.
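The trade-off can be illustrated with a toy model. This is only a sketch: `critical_bs` is a hypothetical threshold (the real value depends on the model and hardware and is not given here), and `largest_k_within_budget` is an illustrative helper, not part of vLLM. It picks the largest K whose verification batch BS*K still fits under the threshold.

```python
def largest_k_within_budget(bs: int, critical_bs: int, max_k: int) -> int:
    """Return the largest K in [0, max_k] such that bs * K <= critical_bs.

    critical_bs is a hypothetical threshold used purely for illustration.
    """
    if bs <= 0:
        raise ValueError("bs must be positive")
    # Integer division gives the largest K with bs * K <= critical_bs.
    return max(0, min(max_k, critical_bs // bs))


if __name__ == "__main__":
    # With an assumed critical batch size of 256, K shrinks as BS grows.
    for bs in (8, 64, 128, 512):
        print(f"BS={bs:4d} -> K={largest_k_within_budget(bs, 256, 3)}")
```

In practice the optimal K per batch-size range would come from benchmarking the actual deployment rather than from a single threshold like this.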
Use cases¶
- Variable-concurrency workloads served by the same deployment. K decreases as concurrency increases.
- RL rollouts, which start with a high BS but end with a small BS because a few long-tail requests generate many tokens and stall the progress of the current rollout. Here K goes up toward the end of the rollout.
--speculative-config schema¶
To use Dynamic SD, add num_speculative_tokens_per_batch_size to the config of an SD method. Its value is a list of lists in which each entry is [start_bs, end_bs, optimal_K]: when the concurrency falls within the range [start_bs, end_bs], optimal_K draft tokens are used. For example,
--speculative-config '{
"method": "eagle",
"model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
"num_speculative_tokens": 3,
"num_speculative_tokens_per_batch_size": [
[1, 64, 3],
[65, 128, 1],
[129, 512, 0]
]
}'
implies that:
- K=3 will be used when the concurrency is in range [1, 64]
- K=1 will be used when the concurrency is in range [65, 128]
- K=0 will be used when the concurrency is in range [129, 512], i.e., no draft tokens will be produced.
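The range semantics above can be sketched as a simple lookup. This is an illustration of the schema's meaning, not vLLM's implementation: `lookup_k` is a hypothetical helper, ranges are treated as inclusive on both ends (as the example boundaries 64/65 and 128/129 suggest), and returning `default` for a concurrency outside every range is an assumption, since that case is not described here.

```python
from typing import List, Optional

# Each entry is [start_bs, end_bs, optimal_K], as in the schema above.
Schedule = List[List[int]]


def lookup_k(schedule: Schedule, concurrency: int,
             default: Optional[int] = None) -> Optional[int]:
    """Return optimal_K for the first range containing `concurrency`.

    Ranges are assumed inclusive on both ends. `default` is returned when
    no range matches; that fallback is an assumption of this sketch.
    """
    for start_bs, end_bs, optimal_k in schedule:
        if start_bs <= concurrency <= end_bs:
            return optimal_k
    return default


schedule = [[1, 64, 3], [65, 128, 1], [129, 512, 0]]

if __name__ == "__main__":
    for c in (1, 64, 65, 128, 129, 512):
        print(f"concurrency={c:3d} -> K={lookup_k(schedule, c)}")
```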
Online Examples¶
Dynamic SD Eagle Drafter¶
VLLM_USE_V2_MODEL_RUNNER=0 vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-config '{
"method": "eagle",
"model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
"num_speculative_tokens": 3,
"num_speculative_tokens_per_batch_size": [
[1, 64, 3],
[65, 128, 1],
[129, 512, 0]
]
}'
Dynamic SD Eagle3 Drafter¶
VLLM_USE_V2_MODEL_RUNNER=0 vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-config '{
"method": "eagle3",
"model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
"num_speculative_tokens": 3,
"num_speculative_tokens_per_batch_size": [
[1, 16, 5],
[17, 32, 4],
[33, 64, 3],
[65, 128, 1],
[129, 512, 0]
]
}'
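Hand-writing the JSON argument gets error-prone as the schedule grows, so it can help to generate it. The sketch below builds the same JSON string with Python's standard `json` module and shell-quotes it with `shlex.quote`. `build_speculative_config` is a hypothetical helper, and its sanity checks (sorted, non-overlapping ranges, non-negative K) are assumptions of this sketch, not rules that vLLM is known to enforce.

```python
import json
import shlex
from typing import List


def build_speculative_config(method: str, model: str,
                             num_speculative_tokens: int,
                             schedule: List[List[int]]) -> str:
    """Serialize a speculative config with a per-batch-size K schedule.

    The validation below reflects assumptions of this sketch only.
    """
    prev_end = 0
    for start_bs, end_bs, k in schedule:
        if start_bs > end_bs:
            raise ValueError(f"empty range: [{start_bs}, {end_bs}]")
        if start_bs <= prev_end:
            raise ValueError(f"range starting at {start_bs} overlaps or is unsorted")
        if k < 0:
            raise ValueError(f"K must be non-negative, got {k}")
        prev_end = end_bs
    config = {
        "method": method,
        "model": model,
        "num_speculative_tokens": num_speculative_tokens,
        "num_speculative_tokens_per_batch_size": schedule,
    }
    return json.dumps(config)


if __name__ == "__main__":
    arg = build_speculative_config(
        "eagle",
        "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        3,
        [[1, 64, 3], [65, 128, 1], [129, 512, 0]],
    )
    # Prints an argument that can be pasted after `vllm serve <model>`.
    print("--speculative-config", shlex.quote(arg))
```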
Limitations¶
- Only tested with Eagle and Eagle-3. Other SD methods may or may not work out of the box.
- Only usable with Model Runner V1.
- Not compatible with full CUDA graphs, so piecewise CUDA graphs are forced when this feature is enabled.
We are working on enabling it on Model Runner V2 (MRv2) with full CUDA graph support.