使用 AISBench

使用 AISBench#

This document guides you to conduct accuracy testing using AISBench. AISBench provides accuracy and performance evaluation for many datasets.

在线服务器#

1.启动 vLLM 服务器#

您可以运行 docker 容器在单个 NPU 上启动 vLLM 服务器：

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash

在 docker 中运行 vLLM 服务器。

vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 35000 &

备注

--max-model-len 应大于 35000，这适用于大多数数据集。否则可能会影响精度评估。

如果看到如下日志，则 vLLM 服务器启动成功：

INFO:     Started server process [9446]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

2.使用 AISBench 运行不同数据集#

安装 AISBench#

Refer to AISBench for details. Install AISBench from source.

git clone https://github.com/AISBench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517

安装额外的 AISBench 依赖项。

pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt

运行 ais_bench -h 以检查安装。

下载数据集#

您可以选择一个或多个数据集来执行精度评估。

C-Eval 数据集。

Take C-Eval dataset as an example. You can refer to Datasets for more datasets. Each dataset has a README.md with detailed download and installation instructions.

下载数据集并安装到指定路径。

cd ais_bench/datasets
mkdir ceval/
mkdir ceval/formal_ceval
cd ceval/formal_ceval
wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
unzip ceval-exam.zip
rm ceval-exam.zip

MMLU 数据集。

cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip
unzip mmlu.zip
rm mmlu.zip

GPQA 数据集。

cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip
unzip gpqa.zip
rm gpqa.zip

MATH 数据集。

cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
unzip math.zip
rm math.zip

LiveCodeBench 数据集。

cd ais_bench/datasets
git lfs install
git clone https://huggingface.co/datasets/livecodebench/code_generation_lite

AIME 2024 数据集。

cd ais_bench/datasets
mkdir aime/
cd aime/
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip
unzip aime.zip
rm aime.zip

GSM8K 数据集。

cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip
unzip gsm8k.zip
rm gsm8k.zip

配置#

更新文件 benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py。有几个参数需要根据您的环境进行更新。

attr：推理后端类型的标识符，固定为 service（基于服务的推理）或 local（本地模型）。
type：用于选择不同的后端 API 类型。
abbr：本地任务的唯一标识符，用于区分多个任务。
path：更新为您的模型权重路径。
model：更新为您的 vLLM 中的模型名称。
host_ip 和 host_port：更新为您的 vLLM 服务器的 IP 和端口。
max_out_len：注意 max_out_len + LLM 输入长度应小于 max_model_len（在您的 vllm 服务器中配置），32768 适用于大多数数据集。
batch_size：根据您的数据集进行更新。
temperature：更新推理参数。

from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="xxxx",
        model="xxxx",
        request_rate = 0,
        retry = 2,
        host_ip = "localhost",
        host_port = 8000,
        max_out_len = xxx,
        batch_size = xxx,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0.6,
            top_k = 10,
            top_p = 0.95,
            seed = None,
            repetition_penalty = 1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]

执行精度评估#

运行以下代码以执行不同的精度评估。

# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --mode all --dump-eval-details --merge-ds

# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --mode all --dump-eval-details --merge-ds

# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run GSM8K dataset
ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

每个数据集执行后，您可以从保存的文件（例如 outputs/default/20250628_151326）中获取结果，示例如下：

20250628_151326/
├── configs # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
│   └── 20250628_151326_29317.py
├── logs # Execution logs; if --debug is added to the command, no intermediate logs are saved to disk (all are printed directly to the screen)
│   ├── eval
│   │   └── vllm-api-general-chat
│   │       └── demo_gsm8k.out # Logs of the accuracy evaluation process based on inference results in the predictions/ folder
│   └── infer
│       └── vllm-api-general-chat
│           └── demo_gsm8k.out # Logs of the inference process
├── predictions
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json # Inference results (all outputs returned by the inference service)
├── results
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json # Raw scores calculated from the accuracy evaluation
└── summary
    ├── summary_20250628_151326.csv # Final accuracy scores (in table format)
    ├── summary_20250628_151326.md # Final accuracy scores (in Markdown format)
    └── summary_20250628_151326.txt # Final accuracy scores (in text format)

执行性能评估#

纯文本基准测试：

# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --summarizer default_perf --mode perf

# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --summarizer default_perf --mode perf

# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --summarizer default_perf --mode perf

# run GSM8K dataset
ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_0_shot_cot_str_perf.py --summarizer default_perf --mode perf

多模态基准测试（文本 + 图像）：

# run textvqa dataset
ais_bench --models vllm_api_stream_chat --datasets textvqa_gen_base64 --summarizer default_perf --mode perf

执行后，您可以从保存的文件中获取结果，示例如下：

20251031_070226/
|-- configs # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
|   `-- 20251031_070226_122485.py
|-- logs
|   `-- performances
|       `-- vllm-api-general-chat
|           `-- cevaldataset.out # Logs of the performance evaluation process
`-- performances
    `-- vllm-api-general-chat
        |-- cevaldataset.csv # Final performance results (in table format)
        |-- cevaldataset.json # Final performance results (in json format)
        |-- cevaldataset_details.h5 # Final performance results in details
        |-- cevaldataset_details.json # Final performance results in details
        |-- cevaldataset_plot.html # Final performance results (in html format)
        `-- cevaldataset_rps_distribution_plot_with_actual_rps.html # Final performance results (in html format)

3.故障排除#

无效图像路径错误#

如果您按照 AISBench 文档下载 TextVQA 数据集：

cd ais_bench/datasets
git lfs install
git clone https://huggingface.co/datasets/maoxx241/textvqa_subset
mv textvqa_subset/ textvqa/
mkdir textvqa/textvqa_json/
mv textvqa/*.json textvqa/textvqa_json/
mv textvqa/*.jsonl textvqa/textvqa_json/

您可能会遇到以下错误：

AISBench - ERROR - /vllm-workspace/benchmark/ais_bench/benchmark/clients/base_client.py - raise_error - 35 - [AisBenchClientException] Request failed: HTTP status 400. Server response: {"error":{"message":"1 validation error for ChatCompletionContentPartImageParam\nimage_url\n  Input should be a valid dictionary [type=dict_type, input_value='data/textvqa/train_images/b2ae0f96dfbea5d8.jpg', input_type=str]\n    For further information visit https://errors.pydantic.dev/2.12/v/dict_type None","type":"BadRequestError","param":null,"code":400}}

您需要手动将数据集图像路径替换为绝对路径，将 /path/to/benchmark/ais_bench/datasets/textvqa/train_images/ 更改为图像存储的实际绝对目录：

cd ais_bench/datasets/textvqa/textvqa_json
sed -i 's#data/textvqa/train_images/#/path/to/benchmark/ais_bench/datasets/textvqa/train_images/#g' textvqa_val.json