后缀推测解码

后缀推测解码#

简介#

后缀解码是一种基于模式匹配的推测解码优化技术。它同时从提示词和已生成内容中检索重复序列，利用频率统计来预测最可能的后续标记。与传统的推测解码方法不同，后缀解码完全在CPU上运行，无需额外的GPU资源或草稿模型，从而在AI智能体和代码生成等重复性任务上实现卓越的加速效果。

本文档提供了在Atlas A2硬件上部署和基准测试vllm-ascend支持的后缀解码推测推理技术的分步指南。该设置使用单个Atlas 800T A2节点，部署了4卡的Qwen3-32B模型实例。基准测试使用涵盖以下类别的真实开源数据集进行：

数据集类别	数据集名称
代码生成	HumanEval
常识推理	ARC
数学推理	gsm8k
自然语言理解	SuperGLUE_BoolQ
综合评测	AGIEval
多轮对话	ShareGPT

本教程使用的基准测试工具是AISBench，它支持对上述所有数据集进行性能测试。本教程最后一节展示了在不同数据集和并发级别下，满足SLO TPOT < 50ms条件时，启用与禁用后缀解码的性能对比。验证表明，启用后缀解码后，Qwen3-32B模型在各种真实数据集上实现了约20%至80%的吞吐量提升。

下载 vllm-ascend 镜像#

本教程使用官方镜像，版本为v0.13.0rc1。使用以下命令下载：

docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1

使用 Docker 运行#

容器启动命令：

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.13.0rc1
export NAME=vllm-ascend

# Run the container using the defined variables
# This test uses four Atlas A2 NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.

docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:\
/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

安装 arctic-inference#

在Ascend上启用后缀解码推测推理之前，必须安装Arctic Inference插件。Arctic Inference是Snowflake推出的一个开源插件，专门用于优化LLM推理速度。详细技术原理请参考以下文章：Fastest Speculative Decoding in vLLM with Arctic Inference and Arctic Training。在容器内使用以下命令安装：

pip install arctic-inference

vLLM 实例部署#

使用以下命令启动容器服务实例。通过--speculative-config参数启用推测推理，其中method设置为suffix。本次测试中，num_speculative_tokens统一设置为3。

# set the NPU device number:
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
# Set the operator dispatch pipeline level to 1 and disable manual memory control in ACLGraph
export TASK_QUEUE_ENABLE=1
# Enable the AIVector core to directly schedule ROCE communication.
export HCCL_OP_EXPANSION_MODE="AIV"
# Enable FlashComm_v1 optimization when tensor parallel is enabled.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /data/Qwen3-32B \
  --served-model-name qwen3 \
  --trust-remote-code \
  --distributed-executor-backend mp \
  --tensor-parallel-size 4 \
  --max-model-len 5500 \
  --max-num-batched-tokens 40960 \
  --speculative-config '{"method": "suffix", "num_speculative_tokens": 3}' \
  --gpu-memory-utilization 0.9 \
  --additional-config '{"pa_shape_list":[48,64,72,80], "weight_prefetch_config":{"enable":true}}' \
  --port 8011

AISbench 基准测试#

所有开源数据集的性能均使用AISbench进行测试。具体操作说明请参考使用AISBench进行性能评估。

模型配置：

# "ignore_eos" must be set to "False", and "max_out_len" should be set to a large value to allow the model to output completely and naturally.

from ais_bench.benchmark.models import VLLMCustomAPIChatStream

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="<path_to_your_model>/Qwen3-32B",
        model="qwen3",
        request_rate = 0,
        retry = 2,
        host_ip = "<your_server_ip>",
        host_port = 8011,
        max_out_len = 4000,
        batch_size= 16,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0,
            ignore_eos = False
        )
    )
]

性能基准测试命令：

# Example command to test gsm8k dataset performance using the first 100 prompts. Commands for other datasets are similar.
ais_bench --models vllm-api-stream-chat \
  --datasets gsm8k_gen_0_shot_cot_str_perf \
  --debug --summarizer default_perf --mode perf --num-prompts 100

测试结果#

以下是本次评估中六个开源数据集的详细测试结果。与基线性能相比，启用后缀解码后，不同并发级别下的TPOT和吞吐量性能提升程度因数据集而异。启用后缀解码后的提升幅度在不同数据集间存在差异。以下是结果总结：

数据集类别	典型代表	吞吐量提升 (BS=1-10)	SLO TPOT
高增益	AGIEval, GSM8K	> 50%	< 50ms
中低增益	ARC, ShareGPT	20% ~ 30%	< 50ms

以下是原始详细测试结果：

并发数	平均输入长度	平均输出长度	请求数	基线 TPOT(ms)	基线吞吐量(TPS)	后缀解码 TPOT(ms)	后缀解码吞吐量(TPS)	接受率	TPOT 增益	TPS 增益
Humaneval
1	150	2700	100	55.1	18.1	37.9	26.3	27.0%	45.2%	45.1%
15	150	2700	100	61.6	233.8	45.8	318.2	27.0%	34.6%	36.1%
26	150	2700	100	64.7	403.8	50.9	519.2	27.0%	27.2%	28.6%
ARC
1	76	960	100	52.8	18.9	39.5	25.4	23.9%	33.7%	34.6%
8	76	960	100	59.1	125.4	47.0	163.1	23.9%	25.7%	30.0%
15	76	960	100	59.8	245.8	48.9	311.7	23.9%	22.3%	26.8%
GSM8K
1	67	1570	100	55.5	18.0	35.7	28.5	31.1%	55.6%	58.4%
17	67	1570	100	61.5	279.8	45.4	403.0	31.1%	35.6%	44.0%
26	67	1570	100	63.9	396.4	50.0	527.6	31.1%	27.8%	33.1%
ShareGPT
1	666	231	327	54.1	18.3	39.2	24.1	23.9%	37.9%	31.5%
8	666	231	327	58.8	125.0	46.2	153.2	23.9%	27.1%	22.5%
14	666	231	327	61.8	227.0	49.9	273.9	23.9%	23.8%	20.7%
SuperGLUE_BoolQ
1	207	314	100	54.1	18.4	36.1	26.8	33.4%	49.8%	45.6%
16	207	314	100	60.0	229.7	43.5	303.9	33.4%	38.0%	32.3%
32	207	314	100	62.7	396.4	47.8	507.5	33.4%	31.3%	28.0%
AGIEval
1	735	1880	100	53.1	18.7	31.8	34.1	50.3%	66.8%	81.9%
24	735	1880	100	64.0	381.2	43.3	629.0	50.3%	47.8%	65.0%
34	735	1880	100	70.0	494.6	50.2	768.4	50.3%	39.4%	55.3%