Hunyuan-A13B-Instruct#
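Introduction#
Hunyuan-A13B-Instruct is a fine-grained Mixture-of-Experts (MoE) model developed by Tencent. It has 80 billion total parameters, of which 13 billion are active, supports an ultra-long 256K context window, and has native chain-of-thought (CoT) reasoning.
Environment Preparation#
Model Weights#
Hunyuan-A13B-Instruct (BF16 version): download the model weights.
It is recommended to download the model weights to a directory shared across nodes, for example /root/.cache/
Installation#
Run the Docker container: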
# Update the vllm-ascend image
# For Atlas A2 machines:
# export IMAGE=quay.io/ascend/vllm-ascend:v0.20.2rc1
# For Atlas A3 machines:
export IMAGE=quay.io/ascend/vllm-ascend:v0.20.2rc1-a3
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
Build from source:
# Install vLLM.
git clone --depth 1 --branch v0.20.2 https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install -e .
cd ..
# Install vLLM Ascend.
git clone --depth 1 --branch v0.20.2rc1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
git submodule update --init --recursive
pip install -e .
cd ..
Software Stack Version Verification#
This environment uses the CANN toolkit built into the GiteeAI platform. vLLM |vllm_ascend_version| and vLLM-Ascend |vllm_ascend_version| ran successfully in a Python 3.11.6 Conda environment.
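To confirm which versions are actually installed in a given environment, the package metadata can be queried directly. This is a minimal sketch: `installed_versions` is a hypothetical helper, and the distribution names `vllm-ascend`, `torch`, and `torch-npu` are assumptions about how the packages are registered with pip.

```python
from importlib import metadata


def installed_versions(packages=("vllm", "vllm-ascend", "torch", "torch-npu")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed under this name.
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

A `None` entry only means that no distribution is registered under that exact name; an editable install from source should still appear here once `pip install -e .` has completed.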
Deployment#
Single-Node Deployment (4 NPUs)#
export HCCL_INTRA_ROCE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export HF_HOME=/data
export MODEL_PATH="Hunyuan-A13B-Instruct"
vllm serve ${MODEL_PATH} \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000 \
--served-model-name Hunyuan \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
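Loading the weights and capturing graphs takes some time, so a client may want to wait until the server answers before sending requests. The sketch below polls the `/health` endpoint of the vLLM OpenAI-compatible server on the host and port used above. `wait_until_ready` and its `fetch` and `sleep` parameters are hypothetical names introduced for illustration, and the timeout values are arbitrary.

```python
import time
import urllib.request


def wait_until_ready(base_url="http://localhost:8000", timeout_s=600.0,
                     interval_s=5.0, fetch=None, sleep=time.sleep):
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    if fetch is None:
        def fetch(url):
            # Default transport: a plain HTTP GET that returns the status code.
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch(f"{base_url}/health") == 200:
                return True
        except OSError:
            # Covers connection refused, timeouts and urllib's URLError/HTTPError
            # while the server is still starting up.
            pass
        sleep(interval_s)
    return False
```

Calling `wait_until_ready()` in a separate shell after launching `vllm serve` returns `True` once the server is up, or `False` if it does not respond within the timeout.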
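Key Performance Metrics#
Based on verified test logs with CANN 8.5.1:
Weight memory footprint: static memory usage is about 37.46 GB per NPU.
Graph compilation (ACL Graph): with PIECEWISE mode enabled, the system captures graphs automatically in about 18 seconds, which can significantly speed up subsequent inference.
KV cache capacity: the remaining NPU memory provides cache space for about 529,152 tokens across concurrent requests.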
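These figures can be cross-checked with some rough arithmetic. The sketch below assumes 2 bytes per parameter for BF16, an even split of the weights across the 4 tensor-parallel NPUs, and that the nominal 80 billion parameter count is exact; none of these assumptions is confirmed by the test log. It also divides the KV cache capacity by the `--max-model-len` value of 32768 from the deployment command.

```python
TOTAL_PARAMS = 80e9        # nominal total parameter count (assumed exact)
BYTES_PER_PARAM = 2        # BF16 (assumption: no extra buffers counted)
TENSOR_PARALLEL = 4        # --tensor-parallel-size
KV_CACHE_TOKENS = 529_152  # from the test log
MAX_MODEL_LEN = 32_768     # --max-model-len

# Estimated weight bytes held by each NPU.
per_npu_bytes = TOTAL_PARAMS * BYTES_PER_PARAM / TENSOR_PARALLEL
per_npu_gb = per_npu_bytes / 1e9        # decimal gigabytes
per_npu_gib = per_npu_bytes / 2**30     # binary gibibytes

# How many requests of maximum length fit in the KV cache at once.
full_length_requests = KV_CACHE_TOKENS / MAX_MODEL_LEN

print(f"weights per NPU: {per_npu_gb:.2f} GB = {per_npu_gib:.2f} GiB")
print(f"full-length requests that fit in KV cache: {full_length_requests:.2f}")
```

Under these assumptions the estimate is about 37.25 GiB per NPU, which is close to the reported 37.46 GB and would be consistent with the log using binary units plus a small amount of unsharded overhead. The cache holds roughly 16 requests of the full 32768-token length at the same time.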
Functional Verification#
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Hunyuan",
"messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
"max_tokens": 100,
"temperature": 0.7
}'
Expected output:
{"id":"chatcmpl-9a60df2b23bb539f","object":"chat.completion","created":1774751760,"model":"Hunyuan","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOkay, I need to write a short introduction to large language models. Let me start by recalling what I know. First, what are LLMs? They're machine learning models trained on vast amounts of text data. The key here is \"large\"—so they have a huge number of parameters. Maybe mention the scale, like billions or trillions of parameters.\n\nThen, how are they trained? They're trained on diverse text sources—books, websites, articles, etc. The","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":12,"total_tokens":112,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
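In the sample response, `finish_reason` is `"length"` and `completion_tokens` equals the requested `max_tokens` of 100, so the reply is cut off while the model is still inside its `<think>` block. A client might detect this as sketched below. `summarize` is a hypothetical helper, and the response dictionary is an abbreviated copy of the output above with the message content shortened.

```python
# Abbreviated copy of the response above; only the fields used here are kept.
SAMPLE_RESPONSE = {
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "<think>\nOkay, I need to write a short introduction "
                       "to large language models.",
        },
        "finish_reason": "length",
    }],
    "usage": {"prompt_tokens": 12, "total_tokens": 112, "completion_tokens": 100},
}


def summarize(response, max_tokens):
    """Report whether a chat completion was truncated before the reasoning ended."""
    choice = response["choices"][0]
    content = choice["message"]["content"]
    usage = response["usage"]
    return {
        # "length" means generation stopped because of the token limit.
        "truncated": choice["finish_reason"] == "length",
        "hit_max_tokens": usage["completion_tokens"] >= max_tokens,
        # An opened but unclosed <think> block means no final answer was produced.
        "still_thinking": "<think>" in content and "</think>" not in content,
    }


print(summarize(SAMPLE_RESPONSE, max_tokens=100))
```

All three flags are true for the sample, which suggests raising `max_tokens` when the final answer, and not only the reasoning trace, is needed.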
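Accuracy Evaluation#
On the GiteeAI platform, the model was evaluated with the AISBench tool on the GSM8K benchmark. With dataset configuration version 7cd45e, it reached 94.77% accuracy in generation (gen) mode.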
ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_0_shot_cot_chat_prompt --summarizer example --debug
Output:
03/29 03:20:03 - AISBench - INFO - Running 1-th replica of evaluation
03/29 03:20:03 - AISBench - INFO - Task [vllm-api-general-chat/gsm8k]: {'accuracy': 94.76876421531463}
03/29 03:20:03 - AISBench - INFO - time elapsed: 2.15s
03/29 03:20:04 - AISBench - INFO - Evaluation tasks completed.
03/29 03:20:04 - AISBench - INFO - Summarizing evaluation results...
dataset version metric mode vllm-api-general-chat
--------- --------- -------- ------ -----------------------
gsm8k 7cd45e accuracy gen 94.77
03/29 03:20:04 - AISBench - INFO - write summary to /data/outputs/default/20260329_025345/summary/summary_20260329_025345.txt
03/29 03:20:04 - AISBench - INFO - write csv to /data/outputs/default/20260329_025345/summary/summary_20260329_025345.csv
The results in Markdown format are as follows:
| Dataset | Version | Metric | Mode | vllm-api-general-chat |
|---|---|---|---|---|
| gsm8k | 7cd45e | accuracy | gen | 94.77 |
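The unrounded accuracy in the log, 94.76876421531463, matches 1250/1319 to within floating-point precision. The public GSM8K test split contains 1,319 problems, so the score is consistent with 1,250 correct answers. This assumes the run evaluated the full test split, which the log does not state.

```python
GSM8K_TEST_SIZE = 1319                 # size of the public GSM8K test split
LOGGED_ACCURACY = 94.76876421531463    # percent, copied from the AISBench log

# Recover the implied number of correct answers, then recompute the percentage.
correct = round(LOGGED_ACCURACY / 100 * GSM8K_TEST_SIZE)
recomputed = correct / GSM8K_TEST_SIZE * 100

print(f"{correct} / {GSM8K_TEST_SIZE} correct = {recomputed:.2f}%")
```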
Performance#
Using AISBench#
ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer default_perf --mode perf
Output:
[2026-04-08 05:27:40,180] [ais_bench] [INFO] Performance Results of task [vllm-api-stream-chat/demo_gsm8k]:
╒══════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════╕
│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │
╞══════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════╡
│ E2EL │ total │ 29982.6 ms │ 16472.9 ms │ 41147.2 ms │ 30919.1 ms │ 33514.9 ms │ 39413.8 ms │ 40973.9 ms │ 8 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ TTFT │ total │ 238.6 ms │ 107.9 ms │ 276.7 ms │ 254.0 ms │ 265.6 ms │ 272.4 ms │ 276.3 ms │ 8 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ TPOT │ total │ 60.1 ms │ 57.7 ms │ 61.3 ms │ 60.4 ms │ 60.8 ms │ 61.2 ms │ 61.3 ms │ 8 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ ITL │ total │ 59.7 ms │ 0.0 ms │ 219.7 ms │ 51.7 ms │ 64.1 ms │ 81.9 ms │ 146.2 ms │ 8 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ InputTokens │ total │ 1457.5 │ 1426.0 │ 1511.0 │ 1456.5 │ 1465.25 │ 1481.6 │ 1508.06 │ 8 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ OutputTokens │ total │ 497.5 │ 268.0 │ 710.0 │ 508.5 │ 555.75 │ 666.6 │ 705.66 │ 8 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ OutputTokenThroughput │ total │ 16.5261 token/s │ 16.2402 token/s │ 17.2551 token/s │ 16.4461 token/s │ 16.5728 token/s │ 16.9063 token/s │ 17.2202 token/s │ 8 │
╘══════════════════════════╧═════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric │ Stage │ Value │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration │ total │ 41161.2934 ms │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests │ total │ 8 │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests │ total │ 0 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests │ total │ 8 │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency │ total │ 5.8273 │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency │ total │ 16 │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput │ total │ 0.1944 req/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens │ total │ 11660 │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total │ 6108.0184 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Generated Tokens │ total │ 3980 │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput │ total │ 283.2758 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput │ total │ 96.6928 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput │ total │ 379.9686 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
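The aggregate metrics in the second table can be reproduced from the totals and the benchmark duration. The formulas below are inferred from the numbers, not taken from AISBench documentation, so treat them as a consistency check and not as the tool's definition. In particular, the concurrency formula (sum of per-request latencies divided by wall-clock duration) is an assumption that happens to match the reported value.

```python
# Values copied from the AISBench output above.
DURATION_MS = 41161.2934
TOTAL_REQUESTS = 8
TOTAL_INPUT_TOKENS = 11660
TOTAL_GENERATED_TOKENS = 3980
MEAN_E2EL_MS = 29982.6   # average end-to-end latency per request

duration_s = DURATION_MS / 1000

request_throughput = TOTAL_REQUESTS / duration_s
input_token_throughput = TOTAL_INPUT_TOKENS / duration_s
output_token_throughput = TOTAL_GENERATED_TOKENS / duration_s
total_token_throughput = (TOTAL_INPUT_TOKENS + TOTAL_GENERATED_TOKENS) / duration_s

# Assumed definition: average number of requests in flight over the run.
concurrency = TOTAL_REQUESTS * MEAN_E2EL_MS / DURATION_MS

print(f"request throughput:      {request_throughput:.4f} req/s")
print(f"input token throughput:  {input_token_throughput:.4f} token/s")
print(f"output token throughput: {output_token_throughput:.4f} token/s")
print(f"total token throughput:  {total_token_throughput:.4f} token/s")
print(f"concurrency:             {concurrency:.4f}")
```

The recomputed values agree with the table to the displayed precision. The average concurrency of about 5.8 is well below the Max Concurrency of 16 because only 8 requests were issued in total.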
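Using vLLM Benchmark#
Take the performance evaluation of Hunyuan-A13B-Instruct as an example.
For more details, refer to the vllm benchmark documentation.
vllm bench has three subcommands:
latency: benchmarks the latency of a single batch of requests.
serve: benchmarks online serving throughput.
throughput: benchmarks offline inference throughput.
Take serve as an example. Run the following command.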
vllm bench serve \
--model ./Hunyuan-A13B-Instruct/ \
--port 8000 \
--dataset-name random \
--random-input-len 200 \
--num-prompts 200 \
--request-rate 1 \
--save-result \
--result-dir ./perf_results/ \
--trust-remote-code
After a few minutes, you will get the performance evaluation results.
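Because the command passes `--save-result` and `--result-dir ./perf_results/`, the results are also written to disk. The sketch below loads the most recent file and lists its numeric fields. It assumes the result is a single JSON object stored with a `.json` extension and makes no assumption about specific key names; `latest_result` and `numeric_fields` are hypothetical helpers.

```python
import json
from pathlib import Path


def latest_result(result_dir):
    """Return the most recently modified .json file in result_dir."""
    files = sorted(Path(result_dir).glob("*.json"), key=lambda p: p.stat().st_mtime)
    if not files:
        raise FileNotFoundError(f"no .json result files in {result_dir}")
    return files[-1]


def numeric_fields(path):
    """Load a result file and keep only its top-level numeric entries."""
    data = json.loads(Path(path).read_text())
    return {
        key: value
        for key, value in data.items()
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
```

For example, `numeric_fields(latest_result("./perf_results/"))` returns the scalar metrics of the last run. With `--num-prompts 200` and `--request-rate 1`, issuing all requests takes about 200 seconds on average, which is in line with the run lasting a few minutes.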