Hunyuan-A13B-Instruct

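Introduction

Hunyuan-A13B-Instruct is a fine-grained Mixture-of-Experts (MoE) model developed by Tencent. It has 80 billion total parameters, of which 13 billion are active, supports a 256K-token context window, and has native chain-of-thought (CoT) reasoning.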

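Environment Preparation

Model Weights

Hunyuan-A13B-Instruct (BF16 version): download the model weights.

We recommend downloading the model weights to a directory shared across nodes, for example /root/.cache/

Installation

Run the Docker container: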

# Update the vllm-ascend image
# For Atlas A2 machines:
# export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1
# For Atlas A3 machines:
export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1-a3
docker run --rm \
  --name vllm-ascend \
  --shm-size=1g \
  --device /dev/davinci0 \
  --device /dev/davinci1 \
  --device /dev/davinci2 \
  --device /dev/davinci3 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  -p 8000:8000 \
  -it $IMAGE bash

Build from source:

# Install vLLM.
git clone --depth 1 --branch v0.22.1 https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install -e .
cd ..

# Install vLLM Ascend.
git clone --depth 1 --branch v0.22.1rc1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
git submodule update --init --recursive
pip install -e .
cd ..

Software Stack Version Verification

The environment uses the CANN build bundled with the GiteeAI platform. vLLM v0.22.1rc1 and vLLM-Ascend v0.22.1rc1 ran successfully in a Python 3.11.6 Conda environment.
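To confirm which versions are actually installed in the active environment, the package metadata can be queried directly. The following is a minimal sketch; the distribution names `vllm` and `vllm-ascend` are assumptions, so adjust them if `pip list` shows different names.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions; check `pip list` if one is reported as missing.
for name in ("vllm", "vllm-ascend"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Comparing the printed versions with the ones listed above is a quick way to spot a mismatch between the vLLM and vLLM Ascend checkouts.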

Deployment

Single-Node Deployment (4 NPUs)

export HCCL_INTRA_ROCE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export HF_HOME=/data
export MODEL_PATH="Hunyuan-A13B-Instruct"

vllm serve ${MODEL_PATH} \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name Hunyuan \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90
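Loading the weights and capturing graphs takes a while, so requests sent right after launching the command above may fail. The sketch below polls the server until it answers; it assumes the server exposes a `/health` endpoint on the host and port used above, and the helper names `http_ok` and `wait_until_ready` are illustrative, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=5.0):
    """Return True if a GET on url answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, max_wait_s=600.0, interval_s=5.0, probe=http_ok, sleep=time.sleep):
    """Poll url until probe(url) succeeds or max_wait_s of sleeping has elapsed."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

A typical call would be `wait_until_ready("http://localhost:8000/health")`, run from a second shell before sending the first request. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.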

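Key Performance Metrics

Based on verified test logs with CANN 8.5.1:

Weight memory footprint: static memory usage is about 37.46 GB per NPU.
Graph compilation (ACL Graph): with PIECEWISE mode enabled, the system automatically captures the computation graphs in about 18 seconds, which can significantly speed up subsequent inference.
KV cache capacity: the remaining NPU memory provides cache space for about 529,152 tokens shared across concurrent requests.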
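These figures can be sanity-checked with rough arithmetic. The sketch below assumes exactly 80 billion parameters (the rounded figure from the introduction), 2 bytes per BF16 parameter, an even split across the 4 NPUs, and that the logged "GB" is in fact GiB. It also relates the KV cache capacity to the `--max-model-len 32768` setting used in the deployment command.

```python
GIB = 1024 ** 3


def weight_gib_per_device(total_params, bytes_per_param, num_devices):
    """Approximate per-device weight footprint in GiB, assuming an even split."""
    return total_params * bytes_per_param / num_devices / GIB


def full_length_sequences(kv_cache_tokens, max_model_len):
    """Number of maximum-length sequences that fit in the KV cache at the same time."""
    return kv_cache_tokens // max_model_len


print(f"{weight_gib_per_device(80e9, 2, 4):.2f} GiB")  # prints "37.25 GiB"
print(full_length_sequences(529_152, 32_768))          # prints 16
```

The estimate of about 37.25 GiB is close to the logged 37.46 GB; the small gap is plausible given that the parameter count is rounded. The second result means roughly 16 requests could each use the full 32,768-token window at once, and more if they are shorter.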

Functional Verification

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hunyuan",
        "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
        "max_tokens": 100,
        "temperature": 0.7
    }'

Expected output (the reply is cut off at the max_tokens limit of 100, so finish_reason is "length"):

{"id":"chatcmpl-9a60df2b23bb539f","object":"chat.completion","created":1774751760,"model":"Hunyuan","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOkay, I need to write a short introduction to large language models. Let me start by recalling what I know. First, what are LLMs? They're machine learning models trained on vast amounts of text data. The key here is \"large\"—so they have a huge number of parameters. Maybe mention the scale, like billions or trillions of parameters.\n\nThen, how are they trained? They're trained on diverse text sources—books, websites, articles, etc. The","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":12,"total_tokens":112,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
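Because the model emits its chain of thought inline, a client usually wants to separate the reasoning from the final answer. The sketch below does this on the `content` field of a response like the one above. It assumes the reasoning is closed by a `</think>` tag, which the truncated sample does not show; the function names are illustrative.

```python
import json


def split_reasoning(content, open_tag="<think>", close_tag="</think>"):
    """Split content into (reasoning, answer).

    If the closing tag is missing (for example, generation was cut off),
    everything after the opening tag counts as reasoning and the answer is empty.
    """
    text = content.lstrip()
    if not text.startswith(open_tag):
        return "", content.strip()
    body = text[len(open_tag):]
    reasoning, sep, answer = body.partition(close_tag)
    return reasoning.strip(), (answer.strip() if sep else "")


def summarize_completion(response_json):
    """Extract reasoning, answer, and a truncation flag from a chat.completion JSON string."""
    choice = json.loads(response_json)["choices"][0]
    reasoning, answer = split_reasoning(choice["message"]["content"] or "")
    return {
        "reasoning": reasoning,
        "answer": answer,
        "truncated": choice["finish_reason"] == "length",
    }
```

Applied to the sample response, this yields an empty answer with `truncated` set to true, which signals that `max_tokens` should be raised to see the final reply.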

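Accuracy Evaluation

On the GiteeAI platform, the model was evaluated with the AISBench tool on the GSM8K benchmark. With dataset version 7cd45e and in gen (generation) mode, it reached an accuracy of 94.77%.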

ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_0_shot_cot_chat_prompt --summarizer example --debug

Output:

03/29 03:20:03 - AISBench - INFO - Running 1-th replica of evaluation
03/29 03:20:03 - AISBench - INFO - Task [vllm-api-general-chat/gsm8k]: {'accuracy': 94.76876421531463}
03/29 03:20:03 - AISBench - INFO - time elapsed: 2.15s
03/29 03:20:04 - AISBench - INFO - Evaluation tasks completed.
03/29 03:20:04 - AISBench - INFO - Summarizing evaluation results...
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      7cd45e     accuracy  gen                       94.77
03/29 03:20:04 - AISBench - INFO - write summary to /data/outputs/default/20260329_025345/summary/summary_20260329_025345.txt
03/29 03:20:04 - AISBench - INFO - write csv to /data/outputs/default/20260329_025345/summary/summary_20260329_025345.csv

The results in Markdown format:

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| gsm8k | 7cd45e | accuracy | gen | 94.77 |

Performance

Using AISBench

ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer default_perf --mode perf

Output:

[2026-04-08 05:27:40,180] [ais_bench] [INFO] Performance Results of task [vllm-api-stream-chat/demo_gsm8k]: 
╒══════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average         │ Min             │ Max             │ Median          │ P75             │ P90             │ P99             │  N  │
╞══════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════╡
│ E2EL                     │ total   │ 29982.6 ms      │ 16472.9 ms      │ 41147.2 ms      │ 30919.1 ms      │ 33514.9 ms      │ 39413.8 ms      │ 40973.9 ms      │  8  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ TTFT                     │ total   │ 238.6 ms        │ 107.9 ms        │ 276.7 ms        │ 254.0 ms        │ 265.6 ms        │ 272.4 ms        │ 276.3 ms        │  8  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ TPOT                     │ total   │ 60.1 ms         │ 57.7 ms         │ 61.3 ms         │ 60.4 ms         │ 60.8 ms         │ 61.2 ms         │ 61.3 ms         │  8  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ ITL                      │ total   │ 59.7 ms         │ 0.0 ms          │ 219.7 ms        │ 51.7 ms         │ 64.1 ms         │ 81.9 ms         │ 146.2 ms        │  8  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ InputTokens              │ total   │ 1457.5          │ 1426.0          │ 1511.0          │ 1456.5          │ 1465.25         │ 1481.6          │ 1508.06         │  8  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ OutputTokens             │ total   │ 497.5           │ 268.0           │ 710.0           │ 508.5           │ 555.75          │ 666.6           │ 705.66          │  8  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ OutputTokenThroughput    │ total   │ 16.5261 token/s │ 16.2402 token/s │ 17.2551 token/s │ 16.4461 token/s │ 16.5728 token/s │ 16.9063 token/s │ 17.2202 token/s │  8  │
╘══════════════════════════╧═════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric            │ Stage   │ Value             │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration       │ total   │ 41161.2934 ms     │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests           │ total   │ 8                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests          │ total   │ 0                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests         │ total   │ 8                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency              │ total   │ 5.8273            │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency          │ total   │ 16                │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput       │ total   │ 0.1944 req/s      │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens       │ total   │ 11660             │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total   │ 6108.0184 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Generated Tokens   │ total   │ 3980              │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput   │ total   │ 283.2758 token/s  │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput  │ total   │ 96.6928 token/s   │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput   │ total   │ 379.9686 token/s  │
╘══════════════════════════╧═════════╧═══════════════════╛
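Several aggregate values in the second table appear to follow directly from the totals and the benchmark duration. The sketch below recomputes them from the numbers printed above. The formulas are inferred from those numbers, not taken from the AISBench documentation, so treat them as a reading aid.

```python
def per_second(total, duration_ms):
    """Convert a total count over a duration in milliseconds into a per-second rate."""
    return total / (duration_ms / 1000.0)


duration_ms = 41161.2934   # Benchmark Duration
total_requests = 8         # Total Requests
avg_e2el_ms = 29982.6      # average E2EL from the first table

request_tp = per_second(total_requests, duration_ms)    # compare: Request Throughput
input_tp = per_second(11660, duration_ms)               # compare: Input Token Throughput
output_tp = per_second(3980, duration_ms)               # compare: Output Token Throughput
# Average number of in-flight requests: total request time divided by wall-clock time.
avg_concurrency = total_requests * avg_e2el_ms / duration_ms  # compare: Concurrency

print(f"{request_tp:.4f} req/s, {input_tp:.2f} and {output_tp:.2f} token/s, concurrency {avg_concurrency:.2f}")
```

The recomputed values agree with the table to within rounding. Note that the per-request OutputTokenThroughput in the first table (about 16.5 token/s) is a different quantity from the aggregate Output Token Throughput (about 96.7 token/s), which sums over all requests running concurrently.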

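Using vLLM Benchmark

Run the performance evaluation with Hunyuan-A13B-Instruct as an example.

For more details, see the vllm benchmark documentation.

vllm bench provides three subcommands:

latency: benchmarks the latency of a single batch of requests.
serve: benchmarks online serving throughput.
throughput: benchmarks offline inference throughput.

Taking serve as an example, run the following command.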

vllm bench serve \
    --model ./Hunyuan-A13B-Instruct/ \
    --port 8000 \
    --dataset-name random \
    --random-input 200 \
    --num-prompts 200 \
    --request-rate 1 \
    --save-result \
    --result-dir ./perf_results/ \
    --trust-remote-code

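The performance results are available after a few minutes. With 200 prompts at a request rate of 1 request per second, sending the requests alone takes roughly 200 seconds. Because --save-result is set, the results are also written to ./perf_results/.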
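The saved result can then be inspected programmatically. The sketch below picks the newest JSON file in the result directory and prints a few fields. The key names in `KEYS` are assumptions about the result file layout and may differ between vLLM versions; missing keys are reported as `None` rather than raising.

```python
import json
from pathlib import Path

# Assumed key names; open one result file to confirm what your vLLM version writes.
KEYS = ("request_throughput", "output_throughput", "mean_ttft_ms", "mean_tpot_ms")


def latest_result(result_dir):
    """Return the most recently modified *.json file in result_dir, or None if there is none."""
    files = sorted(Path(result_dir).glob("*.json"), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None


def summarize(path, keys=KEYS):
    """Load one result file and return the selected keys (None for keys that are absent)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return {key: data.get(key) for key in keys}
```

For example, `summarize(latest_result("./perf_results/"))` returns a small dictionary that is easy to log or compare across runs.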