Speculative Decoding Guide¶

This guide shows how to use speculative decoding with vLLM Ascend. Speculative decoding is a technique that improves inter-token latency in memory-bound LLM inference.

Overview¶

vLLM Ascend implements speculative decoding through a proposer-verifier architecture:

Proposer (vllm_ascend/spec_decode/): Generates draft (speculative) tokens using various methods — from simple n-gram matching to neural-network-based draft models.
Rejection Sampler (vllm_ascend/sample/): Verifies draft tokens against the target model's output, accepting matches and rejecting mismatches, with optional optimizations including Block Verify and Entropy Verify.
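
As a rough illustration of the proposer-verifier split, the sketch below handles only the greedy case, where a draft token is accepted exactly when it equals the token the target model would have picked. The function name `verify_greedy` and the list-based interface are assumptions made for this example; the real rejection sampler works on probability distributions and is not reproduced here.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Accept the longest draft prefix that matches the target's greedy tokens.

    target_tokens holds the target model's choice at every draft position
    plus one extra position, so it has len(draft_tokens) + 1 entries.
    """
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # first mismatch: reject this draft and everything after it
        accepted.append(draft)
    # The target's own token at the first unverified position is always emitted,
    # so every round yields at least one token.
    next_token = target_tokens[len(accepted)]
    return accepted + [next_token]


# Two drafts match, the third is replaced by the target's token.
print(verify_greedy([5, 6, 7], [5, 6, 9, 1]))  # [5, 6, 9]
```

Because one target token is always emitted, a round never produces fewer tokens than ordinary decoding would; the gain comes from rounds in which several drafts are accepted.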

The following speculative decoding methods are supported:

| Method | Description |
|---|---|
| `ngram` | Match n-grams from the prompt |
| `suffix` | Suffix-based pattern matching (requires Arctic Inference) |
| `medusa` | Medusa heads embedded in the target model |
| `eagle` | EAGLE-based draft model |
| `eagle3` | EAGLE-3 based draft model |
| `mtp` | Multi-Token Prediction with shared embedding head |
| `dflash` | Block diffusion-based parallel draft model |
| `draft_model` | Generic external draft LLM |
| `extract_hidden_states` | Extract hidden states for EAGLE training |

Common Configuration¶

All speculative decoding methods are configured through the speculative_config parameter when initializing the model or starting the server:

method (str, required): The speculative decoding method. Must be one of the supported method names listed in the table above.
num_speculative_tokens (int, required): Number of speculative tokens to generate per forward pass. Auto-filled from the draft model's n_predict config (e.g., MTP) or suffix_decoding_max_tree_depth (suffix method) when available.
model (str, optional): Path or HF repo ID for the draft model. Required for eagle, eagle3, dflash, medusa, and draft_model. Automatically resolved for mtp (reuses target model), ngram, suffix, and extract_hidden_states.
draft_tensor_parallel_size (int, optional): Tensor parallelism size for the draft model. Can only be 1 or the same as the target model's tensor parallel size.
disable_padded_drafter_batch (bool, default: False): Disable input padding for speculative decoding. If set to True, speculative input batches can contain sequences of different lengths, which may only be supported by certain attention backends. Note: Only effective with eagle, eagle3, mtp, dflash, draft_model, and extract_hidden_states methods.
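
The rules above can be checked before launching a model. The following is a minimal sketch of such a pre-flight check, not part of vLLM Ascend: the function `check_speculative_config` and its return format are hypothetical, and it only encodes the constraints stated in the parameter list.

```python
SUPPORTED_METHODS = {
    "ngram", "suffix", "medusa", "eagle", "eagle3",
    "mtp", "dflash", "draft_model", "extract_hidden_states",
}
# Methods for which the guide says `model` must be given explicitly.
METHODS_REQUIRING_MODEL = {"eagle", "eagle3", "dflash", "medusa", "draft_model"}


def check_speculative_config(config, target_tp_size=1):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    method = config.get("method")
    if method not in SUPPORTED_METHODS:
        problems.append(f"unsupported method: {method!r}")
    if method in METHODS_REQUIRING_MODEL and not config.get("model"):
        problems.append(f"method {method!r} requires 'model'")
    draft_tp = config.get("draft_tensor_parallel_size")
    if draft_tp is not None and draft_tp not in (1, target_tp_size):
        problems.append(
            "draft_tensor_parallel_size must be 1 or equal to the target's "
            f"tensor parallel size ({target_tp_size})"
        )
    return problems


print(check_speculative_config({"method": "eagle3", "num_speculative_tokens": 3}))
```

The check deliberately ignores `num_speculative_tokens`, since that value may be auto-filled for some methods.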

Offline inference — pass speculative_config as a Python dict to LLM():

from vllm import LLM

llm = LLM(
    model="path/to/target/model",
    speculative_config={
        "method": "eagle3",
        "model": "path/to/draft/model",
        "num_speculative_tokens": 3,
    },
)

Online serving — pass --speculative-config (or -sc) as a JSON string:

vllm serve path/to/target/model \
  --speculative-config '{"method": "eagle3", "model": "path/to/draft/model", "num_speculative_tokens": 3}'
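
Hand-written JSON inside shell quotes is easy to get wrong. One option, sketched here with only the Python standard library, is to build the argument from the same dict used for offline inference and let `shlex.quote` handle the shell quoting. The model paths are the placeholders used above.

```python
import json
import shlex

speculative_config = {
    "method": "eagle3",
    "model": "path/to/draft/model",
    "num_speculative_tokens": 3,
}

# Serialize once, then quote for the shell so the JSON reaches vllm unchanged.
json_arg = json.dumps(speculative_config)
command = (
    "vllm serve path/to/target/model --speculative-config "
    + shlex.quote(json_arg)
)
print(command)
```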

[!NOTE] On Ascend NPUs, the npu_fused_infer_attention_score operator supports a maximum of 16 tokens per decode round. Therefore, (num_speculative_tokens + 1) must be ≤ 16, that is, num_speculative_tokens can be at most 15.

Speculating by matching n-grams in the prompt¶

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by matching n-grams in the prompt.

Offline inference

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
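
To make the matching step concrete, here is a simplified sketch of n-gram drafting over a list of token IDs. It is an illustration of the idea only, not the vLLM Ascend implementation; the function name `propose_ngram` and its parameters `max_n` and `min_n` are assumptions for this example.

```python
def propose_ngram(tokens, max_n=4, min_n=1, num_speculative_tokens=5):
    """Draft tokens by finding an earlier occurrence of the trailing n-gram.

    Longer n-grams are tried first; the tokens that followed the most recent
    earlier match are returned as the draft.
    """
    for n in range(min(max_n, len(tokens) - 1), min_n - 1, -1):
        suffix = tokens[-n:]
        # Scan backwards so the most recent earlier match wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                end = start + n + num_speculative_tokens
                return tokens[start + n:end]
    return []  # no match: nothing to speculate this step


# The trailing [1, 2, 3] also appears at the start, followed by 4, 5, ...
print(propose_ngram([1, 2, 3, 4, 5, 1, 2, 3], num_speculative_tokens=2))  # [4, 5]
```

When no earlier occurrence exists the function returns an empty draft, which corresponds to a step with no speculation.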

Speculating using EAGLE based draft models¶

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by an EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) based draft model.

As of vLLM Ascend v0.12.0rc1, the async scheduler is stable enough to enable and has been adapted to support EAGLE. Enable it by setting async_scheduling=True, as in the example below. If you encounter any issues, please open an issue on GitHub; as a workaround, disable the feature by removing async_scheduling=True when initializing the model.

Offline inference

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=4,
    distributed_executor_backend="mp",
    enforce_eager=True,
    async_scheduling=True,
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

A few important things to consider when using the EAGLE based draft models:

The EAGLE draft models available in the HF repository for EAGLE models can be loaded and used directly by vLLM. This functionality was added in PR #4893. If you are using a vLLM version released before this pull request was merged, please update to a more recent version.
The EAGLE based draft models need to be run without tensor parallelism (i.e. draft_tensor_parallel_size is set to 1 in speculative_config), although it is possible to run the main model using tensor parallelism (see example above).
When using an EAGLE-3 based draft model, set "method": "eagle3" in speculative_config.
With EAGLE enabled, the main model verifies (1 + K) tokens per request in each decoding step, where K is num_speculative_tokens: the K draft tokens plus one token from the main model. Fullgraph mode fixes the number of tokens in the verification stage, so cudagraph_capture_sizes must be a list of capture sizes in which each size equals n * (K + 1) for each batch size n you want to support. For instance, to support batch sizes from 1 to 4 with num_speculative_tokens = 4, cudagraph_capture_sizes should be set to [5, 10, 15, 20].
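
The capture-size arithmetic in the last point can be scripted. The helper below is a small sketch (the name `capture_sizes` is made up for this example) that reproduces the n * (K + 1) rule.

```python
def capture_sizes(batch_sizes, num_speculative_tokens):
    """Return one capture size per batch size: n * (K + 1)."""
    tokens_per_request = num_speculative_tokens + 1  # K drafts + 1 target token
    return [n * tokens_per_request for n in batch_sizes]


# Batch sizes 1..4 with num_speculative_tokens = 4
print(capture_sizes(range(1, 5), 4))  # [5, 10, 15, 20]
```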

Speculating using MTP¶

MTP (Multi-Token Prediction) speeds up inference by predicting several tokens per step instead of one. This can substantially increase generation throughput without compromising output quality.

Online inference

vllm serve /deepseek-ai/DeepSeek-V3.2-Exp-W8A8 \
--port 20004 \
--data-parallel-size 1 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 36768 \
--max-num-batched-tokens 5000 \
--max-num-seqs 10 \
--quantization ascend \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--speculative-config '{"num_speculative_tokens": 2, "method":"mtp", "disable_padded_drafter_batch": false}'

[!NOTE] Because DeepSeek's MTP exposes only a single layer of weights, accuracy and performance are not guaranteed when num_speculative_tokens > 1 (especially ≥ 3).

In the fullgraph mode with num_speculative_tokens > 1, the capture size of each ACLGraph must be an integer multiple of (num_speculative_tokens + 1).
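
A quick way to catch a misconfigured list is to test each capture size for divisibility. This is a minimal sketch; `misaligned_capture_sizes` is a hypothetical helper, not a vLLM Ascend function.

```python
def misaligned_capture_sizes(capture_sizes, num_speculative_tokens):
    """Return the capture sizes that are not multiples of (K + 1)."""
    step = num_speculative_tokens + 1
    return [size for size in capture_sizes if size % step != 0]


# With num_speculative_tokens = 2, every size must be a multiple of 3.
print(misaligned_capture_sizes([3, 6, 8, 12], 2))  # [8]
```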

Speculating using Suffix Decoding¶

The following code configures vLLM to use speculative decoding where proposals are generated using Suffix Decoding (SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications).

Like n-gram, Suffix Decoding can generate draft tokens by pattern-matching using the last n generated tokens. Unlike n-gram, Suffix Decoding (1) can pattern-match against both the prompt and previous generations, (2) uses frequency counts to propose the most likely continuations, and (3) speculates an adaptive number of tokens for each request at each iteration to get better acceptance rates.

Suffix Decoding can achieve better performance for tasks with high repetition, such as code-editing, agentic loops (e.g. self-reflection, self-consistency), and RL rollouts.
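
The toy sketch below illustrates two of the ideas above: choosing continuations by frequency count, and stopping early when no pattern is found so that the draft length adapts per request. It is a simplification for intuition only and is not the Arctic Inference algorithm; `propose_by_frequency` and its parameters are assumptions for this example.

```python
from collections import Counter


def propose_by_frequency(history, context, suffix_len=2, max_tokens=4):
    """Greedily extend `context` with the most frequent continuation in `history`.

    history: token sequences seen so far (prompts and previous generations).
    context: tokens of the current request.
    """
    draft = []
    window = list(context)
    for _ in range(max_tokens):
        key = tuple(window[-suffix_len:])
        counts = Counter()
        for seq in history:
            # Count every token that followed an occurrence of `key`.
            for i in range(len(seq) - len(key)):
                if tuple(seq[i:i + len(key)]) == key:
                    counts[seq[i + len(key)]] += 1
        if not counts:
            break  # no pattern found: stop speculating early
        token, _ = counts.most_common(1)[0]
        draft.append(token)
        window.append(token)
    return draft


history = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 3, 4]]
print(propose_by_frequency(history, context=[9, 1, 2]))  # [3, 4]
```

In this example the draft stops after two tokens even though up to four were allowed, because nothing in the history follows the suffix (3, 4).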

[!NOTE] Suffix Decoding requires Arctic Inference. You can install it with pip install arctic-inference.

Offline inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    enforce_eager=True,
    speculative_config={
        "method": "suffix",
        "num_speculative_tokens": 15,
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Extracting Hidden States¶

The extract_hidden_states method is a special speculative decoding mode that does not perform actual speculation. Instead, it extracts hidden states from specified layers of the target model and saves them to disk. This is primarily used for collecting training data for EAGLE-style draft models.

[!NOTE] This method produces only 1 output token per request. The primary output is the hidden states saved to disk, not the generated text.

Offline inference

import tempfile

from safetensors import safe_open
from vllm import LLM, SamplingParams

def main():
    with tempfile.TemporaryDirectory() as tmpdirname:
        llm = LLM(
            model="Qwen/Qwen3-8B",
            tensor_parallel_size=1,
            speculative_config={
                "method": "extract_hidden_states",
                "num_speculative_tokens": 1,
                "draft_model_config": {
                    "hf_config": {
                        # Layer indices to extract hidden states from
                        "eagle_aux_hidden_state_layer_ids": [2, 18, 34],
                    }
                },
            },
            kv_transfer_config={
                "kv_connector": "ExampleHiddenStatesConnector",
                "kv_role": "kv_producer",
                "kv_connector_extra_config": {
                    "shared_storage_path": tmpdirname,
                },
            },
        )

        prompts = ["Hello, how are you?", "What is machine learning?"]
        sampling_params = SamplingParams(max_tokens=1)
        outputs = llm.generate(prompts, sampling_params)

        for output in outputs:
            print("Prompt:", output.prompt)
            print("Prompt token ids:", output.prompt_token_ids)

            hidden_states_path = output.kv_transfer_params.get("hidden_states_path")
            print("Hidden states saved to:", hidden_states_path)

            with safe_open(hidden_states_path, "pt") as f:
                token_ids = f.get_tensor("token_ids")
                hidden_states = f.get_tensor("hidden_states")
                print("Shape:", hidden_states.shape)
                # Shape: (num_tokens, num_layers, hidden_size)

if __name__ == "__main__":
    main()

Key configuration parameters:

num_speculative_tokens: Must be set to 1. This method does not perform actual speculation, so the value is fixed.
eagle_aux_hidden_state_layer_ids: List of layer indices from which to extract hidden states. For example, [2, 18, 34] extracts from layers 2, 18, and 34.
kv_connector: Must be set to "ExampleHiddenStatesConnector" to enable saving hidden states to disk.
kv_role: Must be set to "kv_producer" for the extraction mode.
shared_storage_path: Directory where hidden states will be saved as .safetensors files (one per request).
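
Since several of these values are fixed, a small pre-flight check can catch mistakes before the model is loaded. The sketch below encodes only the requirements listed above; `check_extraction_config` is a hypothetical helper and not part of vLLM Ascend.

```python
def check_extraction_config(speculative_config, kv_transfer_config):
    """Return a list of problems with an extract_hidden_states setup."""
    problems = []
    if speculative_config.get("method") != "extract_hidden_states":
        problems.append("method must be 'extract_hidden_states'")
    if speculative_config.get("num_speculative_tokens") != 1:
        problems.append("num_speculative_tokens must be 1")
    layer_ids = (
        speculative_config.get("draft_model_config", {})
        .get("hf_config", {})
        .get("eagle_aux_hidden_state_layer_ids")
    )
    if not layer_ids:
        problems.append("eagle_aux_hidden_state_layer_ids is missing or empty")
    if kv_transfer_config.get("kv_connector") != "ExampleHiddenStatesConnector":
        problems.append("kv_connector must be 'ExampleHiddenStatesConnector'")
    if kv_transfer_config.get("kv_role") != "kv_producer":
        problems.append("kv_role must be 'kv_producer'")
    extra = kv_transfer_config.get("kv_connector_extra_config", {})
    if not extra.get("shared_storage_path"):
        problems.append("shared_storage_path is not set")
    return problems


spec = {
    "method": "extract_hidden_states",
    "num_speculative_tokens": 1,
    "draft_model_config": {
        "hf_config": {"eagle_aux_hidden_state_layer_ids": [2, 18, 34]}
    },
}
kv = {
    "kv_connector": "ExampleHiddenStatesConnector",
    "kv_role": "kv_producer",
    "kv_connector_extra_config": {"shared_storage_path": "/tmp/hidden_states"},
}
print(check_extraction_config(spec, kv))  # []
```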

Block Verify and Entropy Verify¶

vLLM Ascend provides two optional optimizations for the rejection sampler in speculative decoding: Block Verify and Entropy Verify. These features trade a small amount of output precision for improved inference throughput.

[!WARNING] Both Block Verify and Entropy Verify modify the token acceptance criteria and may cause minor precision degradation (e.g., slightly different output tokens compared to the standard rejection sampler). Evaluate the quality impact on your specific workload before enabling them in production.

Block Verify¶

Block Verify evaluates all draft tokens as a block using cumulative probability products, rather than checking each token independently. This can improve the acceptance rate and reduce the overhead of rejection sampling, especially when num_speculative_tokens >= 3.

Entropy Verify¶

Entropy Verify adjusts the acceptance threshold based on the entropy of the target distribution:

High entropy (uncertain distribution) → lower effective threshold → more tokens accepted
Low entropy (confident distribution) → higher effective threshold → stricter rejection

This entropy-aware threshold is controlled by two parameters:

posterior_threshold (default: 0.95, range: (0, 1]): The upper bound of the modified threshold. Even when entropy is very low, the effective threshold will not exceed this value.
posterior_alpha (default: 0.4, range: >= 0): Controls how strongly entropy influences the threshold. A higher alpha makes the threshold more sensitive to entropy changes, resulting in a higher acceptance rate for speculative tokens but also greater precision loss. You need to tune this value based on your specific model and dataset. When alpha is 0, entropy has no effect and the threshold equals posterior_threshold.
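
The exact formula used by vLLM Ascend is not given here. The sketch below uses one functional form, `posterior_threshold * exp(-posterior_alpha * entropy)`, chosen purely as an assumption because it satisfies the three properties described above: the threshold never exceeds posterior_threshold, it falls as entropy rises, and it equals posterior_threshold when alpha is 0. The actual implementation may differ.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def effective_threshold(probs, posterior_threshold=0.95, posterior_alpha=0.4):
    # ASSUMED form for illustration only, not taken from vLLM Ascend.
    return posterior_threshold * math.exp(-posterior_alpha * entropy(probs))


confident = [0.97, 0.01, 0.01, 0.01]   # low entropy: threshold stays high
uncertain = [0.25, 0.25, 0.25, 0.25]   # high entropy: threshold drops

print(effective_threshold(confident))
print(effective_threshold(uncertain))
print(effective_threshold(uncertain, posterior_alpha=0.0))
```

Under this assumed form, raising posterior_alpha lowers the threshold faster as entropy grows, which matches the described trade-off of more accepted tokens for more precision loss.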

Usage¶

Online inference

vllm serve <model> --additional-config \
    '{"rejection_sampler_config": {"enable_block_verify": true, "enable_entropy_verify": true, "posterior_threshold": 0.95, "posterior_alpha": 0.4}}'

Offline inference

from vllm import LLM

llm = LLM(
    model="path/to/target/model",  # placeholder: replace with your target model
    additional_config={
        "rejection_sampler_config": {
            "enable_block_verify": True,
            "enable_entropy_verify": True,
            "posterior_threshold": 0.95,
            "posterior_alpha": 0.4,
        }
    },
)

Both features can be enabled independently or together. When used together, the cumulative acceptance from Block Verify is combined with the entropy-adjusted threshold from Entropy Verify.