后缀投机解码

后缀投机解码#

简介#

后缀解码是一种基于模式匹配的投机解码优化技术。它同时从提示词和已生成内容中检索重复序列，利用频率统计来预测最有可能的 token 延续。与传统的投机解码方法不同，后缀解码完全在 CPU 上运行，无需额外的 GPU 资源或草稿模型，从而为 AI Agent 和代码生成等重复性任务带来显著的加速效果。

本文档提供了如何在 Atlas A2 硬件上部署和测试 vllm-ascend 所支持的后缀解码投机推理技术的逐步指南。配置使用单台 Atlas 800T A2 节点，以 4 卡方式部署 Qwen3-32B 模型实例。基准测试使用真实的开源数据集进行，涵盖以下类别：

数据集类别	数据集名称
代码生成	HumanEval
常识推理	ARC
数学推理	gsm8k
自然语言理解	SuperGLUE_BoolQ
综合考试	AGIEval
多轮对话	ShareGPT

本教程使用的基准测试工具是 AISBench，它支持上述所有数据集的性能测试。教程的最后一部分展示了在不同数据集和并发级别下，满足 SLO TPOT < 50ms 的条件下，启用与禁用后缀解码的性能对比。验证表明，Qwen3-32B 模型在启用后缀解码后，在各种真实数据集上的吞吐量提升了约 20% 到 80%。

下载 vllm-ascend 镜像#

本教程使用官方镜像，版本 v0.13.0rc1。使用以下命令下载：

docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1

使用 Docker 运行#

容器启动命令：

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.13.0rc1
export NAME=vllm-ascend

# Run the container using the defined variables
# This test uses four Atlas A2 NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.

docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:\
/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

安装 arctic-inference#

在 Ascend 上启用后缀解码投机推理之前，必须安装 Arctic Inference 插件。Arctic Inference 是 Snowflake 推出的开源插件，专门用于优化 LLM 推理速度。详细技术原理请参考以下文章：Fastest Speculative Decoding in vLLM with Arctic Inference and Arctic Training。使用以下命令在容器内安装：

pip install arctic-inference

vLLM 实例部署#

使用以下命令启动容器服务实例。通过 --speculative-config 参数启用投机推理，其中 method 设置为 suffix。本次测试中，num_speculative_tokens 统一设置为 3。

# set the NPU device number:
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
# Set the operator dispatch pipeline level to 1 and disable manual memory control in ACLGraph
export TASK_QUEUE_ENABLE=1
# Enable the AIVector core to directly schedule ROCE communication.
export HCCL_OP_EXPANSION_MODE="AIV"
# Enable FlashComm_v1 optimization when tensor parallel is enabled.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /data/Qwen3-32B \
  --served-model-name qwen3 \
  --trust-remote-code \
  --distributed-executor-backend mp \
  --tensor-parallel-size 4 \
  --max-model-len 5500 \
  --max-num-batched-tokens 40960 \
  --speculative-config '{"method": "suffix", "num_speculative_tokens": 3}' \
  --gpu-memory-utilization 0.9 \
  --additional-config '{"pa_shape_list":[48,64,72,80], "weight_prefetch_config":{"enable":true}}' \
  --port 8011

AISbench 基准测试#

所有开源数据集的性能均使用 AISbench 进行测试。具体说明请参考使用 AISBench 进行性能评估。

模型配置：

# "ignore_eos" must be set to "False", and "max_out_len" should be set to a large value to allow the model to output completely and naturally.

from ais_bench.benchmark.models import VLLMCustomAPIChatStream

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="<path_to_your_model>/Qwen3-32B",
        model="qwen3",
        request_rate = 0,
        retry = 2,
        host_ip = "<your_server_ip>",
        host_port = 8011,
        max_out_len = 4000,
        batch_size= 16,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0,
            ignore_eos = False
        )
    )
]

性能基准测试命令：

# Example command to test gsm8k dataset performance using the first 100 prompts. Commands for other datasets are similar.
ais_bench --models vllm-api-stream-chat \
  --datasets gsm8k_gen_0_shot_cot_str_perf \
  --debug --summarizer default_perf --mode perf --num-prompts 100

测试结果#

以下是本次评估中六个开源数据集的详细测试结果。与基线性能相比，启用后缀解码后，不同并发级别下的 TPOT 和吞吐量性能提升因数据集而异。各数据集启用后缀解码后的提升幅度不尽相同。以下是结果汇总：

数据集类别	典型代表	吞吐量提升（BS=1-10）	SLO TPOT
高收益	AGIEval、GSM8K	> 50%	< 50ms
中低收益	ARC、ShareGPT	20% ~ 30%	< 50ms

以下是原始详细测试结果：

并发数	平均输入长度	平均输出长度	请求数	基线 TPOT（毫秒）	基线吞吐量（TPS）	后缀解码 TPOT（毫秒）	后缀解码吞吐量（TPS）	接受率	TPOT 提升	TPS 提升
Humaneval
1	150	2700	100	55.1	18.1	37.9	26.3	27.0%	45.2%	45.1%
15	150	2700	100	61.6	233.8	45.8	318.2	27.0%	34.6%	36.1%
26	150	2700	100	64.7	403.8	50.9	519.2	27.0%	27.2%	28.6%
ARC
1	76	960	100	52.8	18.9	39.5	25.4	23.9%	33.7%	34.6%
8	76	960	100	59.1	125.4	47.0	163.1	23.9%	25.7%	30.0%
15	76	960	100	59.8	245.8	48.9	311.7	23.9%	22.3%	26.8%
GSM8K
1	67	1570	100	55.5	18.0	35.7	28.5	31.1%	55.6%	58.4%
17	67	1570	100	61.5	279.8	45.4	403.0	31.1%	35.6%	44.0%
26	67	1570	100	63.9	396.4	50.0	527.6	31.1%	27.8%	33.1%
ShareGPT
1	666	231	327	54.1	18.3	39.2	24.1	23.9%	37.9%	31.5%
8	666	231	327	58.8	125.0	46.2	153.2	23.9%	27.1%	22.5%
14	666	231	327	61.8	227.0	49.9	273.9	23.9%	23.8%	20.7%
SuperGLUE_BoolQ
1	207	314	100	54.1	18.4	36.1	26.8	33.4%	49.8%	45.6%
16	207	314	100	60.0	229.7	43.5	303.9	33.4%	38.0%	32.3%
32	207	314	100	62.7	396.4	47.8	507.5	33.4%	31.3%	28.0%
AGIEval
1	735	1880	100	53.1	18.7	31.8	34.1	50.3%	66.8%	81.9%
24	735	1880	100	64.0	381.2	43.3	629.0	50.3%	47.8%	65.0%
34	735	1880	100	70.0	494.6	50.2	768.4	50.3%	39.4%	55.3%