Speculative Decoding¶
vllm-metal supports two speculative decoding methods on the paged attention path. Both require synchronous scheduling and greedy sampling.
| MTP | Draft Model | |
|---|---|---|
--speculative-config method |
mtp |
draft_model |
| Target models | Gemma4 (paged) | Any paged-attention model |
| Draft source | MTP assistant checkpoint (reads target KV cache) | Separate smaller model (own KV cache) |
num_speculative_tokens |
1 | Configurable (3–5 typical) |
| Extra memory | None (reads target KV cache) | Second un-budgeted KV cache |
Both methods:
- Require
--no-async-scheduling(vLLM auto-disables async for most spec-decode methods; pass it explicitly to be safe). - Only accelerate greedy requests (
temperature=0). Non-greedy requests skip drafting silently. - Are lossless under greedy decoding.
Gemma4 MTP¶
Gemma4 MTP speculative decoding is experimental on Metal. It uses vLLM's scheduler contract and verifies scheduled draft tokens on the target paged-decode path.
Use matching target and assistant checkpoint families (assistant checkpoints
use gemma4_assistant or gemma4_mtp model config):
| Target | Assistant |
|---|---|
| Gemma4 E2B-it | Gemma4 E2B-it assistant bf16 |
| Gemma4 E4B-it | Gemma4 E4B-it assistant bf16 |
| Gemma4 31B-it bf16 | Gemma4 31B-it assistant bf16 |
The path is intentionally narrow. Unsupported speculative decode behavior fails early instead of falling back silently.
Serve¶
export TARGET=/path/to/gemma-4-E2B-it
export ASSISTANT=/path/to/gemma-4-E2B-it-assistant-bf16
VLLM_METAL_MEMORY_FRACTION=0.5 \
vllm serve "$TARGET" \
--max-model-len 1024 \
--max-num-batched-tokens 1024 \
--max-num-seqs 4 \
--no-async-scheduling \
--speculative-config "{\"method\":\"mtp\",\"model\":\"$ASSISTANT\",\"num_speculative_tokens\":1}"
For remote Hugging Face checkpoints, use the same shape and set model to the
assistant repository name. Pin revision in the JSON when publishing benchmark
numbers.
Benchmark¶
Use the in-tree offline benchmark for before/after runs. It runs one mode per
process: omit --assistant-model for the baseline, then add it for MTP.
source .venv-vllm-metal/bin/activate
export VLLM_ENABLE_V1_MULTIPROCESSING=0
export VLLM_METAL_MEMORY_FRACTION=0.5
export TARGET=/path/to/gemma-4-E2B-it
export ASSISTANT=/path/to/gemma-4-E2B-it-assistant-bf16
python -m tools.benchmark.gemma4_mtp_benchmark \
--model "$TARGET" \
--batch-size 4 \
--max-tokens 64 \
--repeats 1 \
--warmup 0 \
--ignore-eos \
--max-model-len 1024 \
--max-num-batched-tokens 512 \
--label e2b-baseline-bs4-64 \
--output-json /tmp/gemma4-e2b-baseline-bs4-64.json
python -m tools.benchmark.gemma4_mtp_benchmark \
--model "$TARGET" \
--assistant-model "$ASSISTANT" \
--batch-size 4 \
--max-tokens 64 \
--repeats 1 \
--warmup 0 \
--ignore-eos \
--max-model-len 1024 \
--max-num-batched-tokens 512 \
--label e2b-mtp-bs4-64 \
--output-json /tmp/gemma4-e2b-mtp-bs4-64.json
The JSON records the model paths, runtime package versions, relevant environment variables, prompts, generated token IDs, elapsed time, and output tokens per second. Attach those JSON files to performance PRs instead of copying machine-specific results into the docs.
Draft Model¶
Draft-model speculative decoding uses a separate, smaller model to propose
num_speculative_tokens per step, verified greedily against the target. Any
paged-attention model can serve as the target; the draft must share the
target's tokenizer and vocabulary.
Serve¶
VLLM_METAL_MEMORY_FRACTION=0.55 \
vllm serve Qwen/Qwen3-8B \
--max-model-len 2048 \
--no-async-scheduling \
--speculative-config '{"method":"draft_model","model":"Qwen/Qwen3-0.6B","num_speculative_tokens":3}'
Confirm speculative decoding is active: the server log shows
Draft model loaded for speculative decoding: <model> and periodic
SpecDecoding metrics ... Avg Draft acceptance rate.
Limitations¶
- Draft KV cache is un-budgeted. The draft gets its own KV cache sized to
the target's
num_blocks, allocated after the target KV budget and not subtracted from it. KeepVLLM_METAL_MEMORY_FRACTIONwell below 1.0. - SWA draft models untested. The draft block allocator assumes full attention.
- No pipeline parallelism.
pipeline_parallel_size>1with speculative config raises at startup.
Performance¶
On an M5 Pro (64 GB) with Qwen3-8B target + Qwen3-0.6B draft, natural prompts from RedHatAI/speculator_benchmarks:
- Single-stream: 1.36–1.48x TPOT improvement (K=3–5).
- Batched (concurrency 32): K=3 gives +11% throughput; K=5 turns net-negative as draft compute exceeds the savings.