Performance Benchmark

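This document describes the benchmarking methodology for vllm-ascend, which evaluates its performance under a variety of workloads. To stay consistent with vLLM, we use the benchmark scripts provided by the vllm project.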

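Benchmark coverage: we measure offline end-to-end latency and throughput, as well as online serving performance at a fixed QPS. For more details, see the vllm-ascend benchmark scripts.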

Legend

  • ✅ = supported

  • 🟡 = partially supported / in development

  • 🚧 = in development

1. Run the Docker Container

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.20.2rc1
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-it $IMAGE \
/bin/bash
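
The command above maps a single NPU into the container. When several devices are needed, the repeated `--device` flags can be generated instead of typed by hand. The following is a minimal sketch, assuming the devices follow the `/dev/davinci[0-7]` naming mentioned in the comment above; `device_flags` is a hypothetical helper and not part of vllm-ascend.

```python
def device_flags(indices):
    """Return docker `--device` arguments for the given NPU indices (0-7)."""
    flags = []
    for idx in indices:
        if not 0 <= idx <= 7:
            raise ValueError(f"NPU index out of range 0-7: {idx}")
        flags += ["--device", f"/dev/davinci{idx}"]
    return flags


# Flags for the first two NPUs, ready to paste into `docker run`.
print(" ".join(device_flags([0, 1])))
# prints: --device /dev/davinci0 --device /dev/davinci1
```

The manager devices (`/dev/davinci_manager`, `/dev/devmm_svm`, `/dev/hisi_hdc`) still need to be passed once, as in the command above.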

2. Install Dependencies

cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt

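3. Run Basic Benchmarks

This section describes how to run performance tests with vLLM's built-in benchmark suite.

3.1 Datasets

vLLM supports a variety of datasets.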

| Dataset | Online | Offline | Data Path |
|---|---|---|---|
| ShareGPT | | | wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json |
| ShareGPT4V (image) | | | wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json Note: the images must be downloaded separately. For example, to download the COCO 2017 training images: wget http://images.cocodataset.org/zips/train2017.zip |
| ShareGPT4Video (video) | | | git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video |
| BurstGPT | | | wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv |
| Sonnet (deprecated) | | | Local file: benchmarks/sonnet.txt |
| Random | | | synthetic |
| Random multimodal (image/video) | 🟡 | 🚧 | synthetic |
| Random reranking | | | synthetic |
| Prefix repetition | | | synthetic |
| HuggingFace-VisionArena | | | lmarena-ai/VisionArena-Chat |
| HuggingFace-MMVU | | | yale-nlp/MMVU |
| HuggingFace-InstructCoder | | | likaixin/InstructCoder |
| HuggingFace-AIMO | | | AI-MO/aimo-validation-aime, AI-MO/NuminaMath-1.5, AI-MO/NuminaMath-CoT |
| HuggingFace-Other | | | lmms-lab/LLaVA-OneVision-Data, Aeala/ShareGPT_Vicuna_unfiltered |
| HuggingFace-MTBench | | | philschmid/mt-bench |
| HuggingFace-Blazedit | | | vdaita/edit_5k_char, vdaita/edit_10k_char |
| Spec Bench | | | wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl |
| Custom | | | Local file: data.jsonl |
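
The last entry above points to a local `data.jsonl` file. A minimal sketch of producing such a file follows. It assumes that each line is a JSON object with a `prompt` field, which is our understanding of what vLLM's custom dataset loader reads; verify this against the vLLM version in use. The prompts and the helper name `write_custom_dataset` are illustrative only.

```python
import json
import os
import tempfile


def write_custom_dataset(prompts, path):
    """Write one JSON object per line, each with a single `prompt` field (assumed schema)."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt}, ensure_ascii=False) + "\n")
    return path


# Illustrative prompts; replace them with your own workload.
dataset_path = write_custom_dataset(
    ["What is the capital of France?", "Summarize the plot of Hamlet."],
    os.path.join(tempfile.gettempdir(), "data.jsonl"),
)
print(dataset_path)
```

The resulting path would then be passed to the benchmark through `--dataset-path`, together with the matching `--dataset-name` value for custom data in your vLLM version.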

Note

For datasets hosted on Hugging Face, set dataset-name to hf. When dataset-path points to a local copy, set hf-name to the corresponding Hugging Face ID, for example:

--dataset-path /datasets/VisionArena-Chat/ --hf-name lmarena-ai/VisionArena-Chat
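
The rule above can be captured in a small helper that builds the dataset arguments for either case. This is a sketch under stated assumptions: `hf_dataset_args` is a hypothetical function, and it only emits the three flags mentioned in this document (`--dataset-name`, `--dataset-path`, `--hf-name`).

```python
import shlex


def hf_dataset_args(hf_id, local_path=None):
    """Build dataset flags for a Hugging Face dataset, remote or local."""
    args = ["--dataset-name", "hf"]
    if local_path is None:
        # Remote dataset: the Hugging Face ID itself is the dataset path.
        args += ["--dataset-path", hf_id]
    else:
        # Local copy: pass the directory, and name the dataset via --hf-name.
        args += ["--dataset-path", local_path, "--hf-name", hf_id]
    return args


print(shlex.join(hf_dataset_args("lmarena-ai/VisionArena-Chat", "/datasets/VisionArena-Chat/")))
# prints: --dataset-name hf --dataset-path /datasets/VisionArena-Chat/ --hf-name lmarena-ai/VisionArena-Chat
```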

3.2 Run Basic Benchmarks

3.2.1 Online Serving

First, start the model server:

export VLLM_USE_MODELSCOPE=True
vllm serve Qwen/Qwen3-8B

Then run the benchmark script:

# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export VLLM_USE_MODELSCOPE=True
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10

If the run succeeds, you will see output similar to the following:

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  19.92     
Total input tokens:                      1374      
Total generated tokens:                  2663      
Request throughput (req/s):              0.50      
Output token throughput (tok/s):         133.67    
Peak output token throughput (tok/s):    312.00    
Peak concurrent requests:                10.00     
Total Token throughput (tok/s):          202.64    
---------------Time to First Token----------------
Mean TTFT (ms):                          127.10    
Median TTFT (ms):                        136.29    
P99 TTFT (ms):                           137.83    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.85     
Median TPOT (ms):                        25.78     
P99 TPOT (ms):                           26.64     
---------------Inter-token Latency----------------
Mean ITL (ms):                           25.78     
Median ITL (ms):                         25.74     
P99 ITL (ms):                            28.85     
==================================================
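
To compare runs, it can help to turn this report into a dictionary. The sketch below is one possible way to do that, assuming the `label: value` layout shown above; `parse_result` is a hypothetical helper, and the sample text is an excerpt of the output above.

```python
import re

SAMPLE = """\
============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  19.92
Total generated tokens:                  2663
Output token throughput (tok/s):         133.67
---------------Time to First Token----------------
Mean TTFT (ms):                          127.10
==================================================
"""

# A metric line is "<label>: <number>"; banner and separator lines do not match.
LINE_RE = re.compile(r"^(.+?):\s+(-?\d+(?:\.\d+)?)\s*$")


def parse_result(text):
    """Return {label: float} for every metric line in a benchmark report."""
    metrics = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


metrics = parse_result(SAMPLE)
# Recompute output throughput from the parsed totals as a sanity check.
recomputed = metrics["Total generated tokens"] / metrics["Benchmark duration (s)"]
print(f"{recomputed:.1f} tok/s vs reported {metrics['Output token throughput (tok/s)']}")
```

The recomputed value differs slightly from the reported one because the duration is printed with only two decimals.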

3.2.2 Offline Throughput Benchmark

export VLLM_USE_MODELSCOPE=True
vllm bench throughput \
  --model Qwen/Qwen3-8B \
  --dataset-name random \
  --num-prompts 10 \
  --input-len 128 \
  --output-len 128

If the run succeeds, you will see output similar to the following:

Processed prompts: 100%|| 10/10 [00:03<00:00,  2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 toks/s]
Throughput: 2.73 requests/s, 699.93 total tokens/s, 349.97 output tokens/s
Total num prompt tokens:  1280
Total num output tokens:  1280
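
The figures in this output are related by simple arithmetic: 10 prompts with 128 input and 128 output tokens each give 2560 tokens in total, and dividing by the reported throughput yields the elapsed time. The sketch below checks that relationship using only the numbers printed above; the variable names are ours.

```python
# Values copied from the sample output above.
num_prompts = 10
prompt_tokens = 1280
output_tokens = 1280
requests_per_s = 2.73
total_tokens_per_s = 699.93
output_tokens_per_s = 349.97

total_tokens = prompt_tokens + output_tokens

# Three independent estimates of the elapsed time, in seconds.
elapsed_from_requests = num_prompts / requests_per_s
elapsed_from_total = total_tokens / total_tokens_per_s
elapsed_from_output = output_tokens / output_tokens_per_s

print(total_tokens)                     # prints: 2560
print(f"{elapsed_from_requests:.2f}")   # prints: 3.66
print(f"{elapsed_from_total:.2f}")      # prints: 3.66
print(f"{elapsed_from_output:.2f}")     # prints: 3.66
```

All three estimates agree at about 3.66 s, which is a quick way to confirm that a report is internally consistent.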

3.2.4 Multimodal Benchmark

export VLLM_USE_MODELSCOPE=True
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --dtype bfloat16 \
  --limit-mm-per-prompt '{"image": 1}' \
  --allowed-local-media-path /path/to/sharegpt4v/images
export HF_ENDPOINT="https://hf-mirror.com"
vllm bench serve --model Qwen/Qwen2.5-VL-7B-Instruct \
--backend "openai-chat" \
--dataset-name hf \
--hf-split train \
--endpoint "/v1/chat/completions" \
--dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
--num-prompts 10 \
--no-stream

If the run succeeds, you will see output similar to the following:

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  4.89      
Total input tokens:                      7191      
Total generated tokens:                  951       
Request throughput (req/s):              2.05      
Output token throughput (tok/s):         194.63    
Peak output token throughput (tok/s):    290.00    
Peak concurrent requests:                10.00     
Total Token throughput (tok/s):          1666.35   
---------------Time to First Token----------------
Mean TTFT (ms):                          722.22    
Median TTFT (ms):                        589.81    
P99 TTFT (ms):                           1377.02   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.13     
Median TPOT (ms):                        34.58     
P99 TPOT (ms):                           124.72    
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.14     
Median ITL (ms):                         28.01     
P99 ITL (ms):                            182.28    
==================================================
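
In the result above, mean TPOT (44.13 ms) is noticeably higher than mean ITL (33.14 ms). One plausible reason is how the two are averaged: as we understand the benchmark, TPOT is computed once per request and then averaged across requests, whereas ITL pools every inter-token gap from all requests. The sketch below uses made-up timings to illustrate that effect; it is not vLLM's implementation, and the per-request gap lists are hypothetical.

```python
from statistics import mean

# Hypothetical inter-token gaps (ms) for two requests.
requests = [
    [100.0, 100.0],   # short, slow request: 3 output tokens, 2 gaps
    [20.0] * 10,      # long, fast request: 11 output tokens, 10 gaps
]


def tpot(gaps):
    """Time per output token for one request, excluding the first token."""
    return sum(gaps) / len(gaps)


# Each request counts once, regardless of its length.
mean_tpot = mean(tpot(gaps) for gaps in requests)

# Each gap counts once, so long requests dominate.
mean_itl = mean(gap for gaps in requests for gap in gaps)

print(f"mean TPOT = {mean_tpot:.2f} ms")  # prints: mean TPOT = 60.00 ms
print(f"mean ITL  = {mean_itl:.2f} ms")   # prints: mean ITL  = 33.33 ms
```

Under this assumption, a few short but slow requests raise mean TPOT much more than mean ITL, which is worth keeping in mind when comparing the two numbers.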

3.2.5 Embedding Benchmark

vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export VLLM_USE_MODELSCOPE=True
vllm bench serve \
  --model Qwen/Qwen3-Embedding-8B \
  --backend openai-embeddings \
  --endpoint /v1/embeddings \
  --dataset-name sharegpt \
  --num-prompts 10 \
  --dataset-path <your dataset path>/datasets/ShareGPT_V3_unfiltered_cleaned_split.json

If the run succeeds, you will see output similar to the following:

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  0.18      
Total input tokens:                      1372      
Request throughput (req/s):              56.32     
Total Token throughput (tok/s):          7726.76   
----------------End-to-end Latency----------------
Mean E2EL (ms):                          154.06    
Median E2EL (ms):                        165.57    
P99 E2EL (ms):                           166.66    
==================================================
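
For very short runs such as this one, the two-decimal duration is too coarse to reproduce the throughput figures: 10 requests divided by 0.18 s gives about 55.6 req/s, not the reported 56.32. The sketch below recovers a more precise duration from the reported throughputs, using only numbers from the output above; the variable names are ours.

```python
# Values copied from the sample output above.
successful_requests = 10
total_input_tokens = 1372
request_throughput = 56.32      # req/s
token_throughput = 7726.76      # tok/s
printed_duration = 0.18         # s, rounded to two decimals

# Naive recomputation from the rounded duration.
naive_throughput = successful_requests / printed_duration

# Invert the reported throughputs to estimate the unrounded duration.
duration_from_requests = successful_requests / request_throughput
duration_from_tokens = total_input_tokens / token_throughput

print(f"{naive_throughput:.2f}")          # prints: 55.56
print(f"{duration_from_requests:.4f}")    # prints: 0.1776
print(f"{duration_from_tokens:.4f}")      # prints: 0.1776
```

Both estimates agree on roughly 0.1776 s, which rounds to the printed 0.18 s and explains the apparent mismatch.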