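MSProbe Debugging Guide

During inference or training, we often run into accuracy anomalies: outputs drift from expectations, values become numerically unstable (NaN/Inf), or predictions no longer match the labels. To find the root cause, you need to monitor and capture the intermediate data produced while the model runs, including feature maps, weights, activations, and per-layer outputs. By capturing key tensors at specific stages, recording the input/output pairs of core layers, and keeping contextual metadata (prompts, tensor dtypes, hardware configuration, and so on), you can systematically trace where accuracy degradation or numerical error originates. This guide describes an end-to-end workflow for diagnosing accuracy problems in AI models, with a focus on vllm-ascend serving: preparation, data collection, and analysis and verification.

For more details, see Ascend/msprobe.

0. Background concepts

msprobe supports three dump levels:

- L0: dumps tensors at module level and generates construct.json so that the visualization tool can rebuild the network structure. A model or submodule handle must be passed in.
- L1: collects operator-level statistics only; suited to lightweight troubleshooting.
- mix: captures both structural information and operator statistics; suited to cases that need graph reconstruction and numerical comparison at the same time.

1. Prerequisites

1.1 Install `msprobe`

Install msprobe with pip: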

pip install mindstudio-probe

1.2 Graph-mode dump (optional)

To dump data in graph mode, install msprobe from source with the aclgraph_dump module enabled:

git clone https://gitcode.com/Ascend/msprobe.git
cd msprobe
pip install uv
python3 build.py -e include-mod=aclgraph_dump -e no-check=true
pip install artifacts/mindstudio_probe*.whl

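2. Collect data with `msprobe`

We usually collect data from coarse to fine. First identify the token at which the problem appears, then decide how wide a range around that token needs to be sampled. A typical workflow is described below.

2.1 Prepare the dump configuration

Prepare configuration content that PrecisionDebugger can parse. You can use either of the following:

- Pass the configuration object directly through --additional-config.dump_config.
- Pass the path of a configuration file through --additional-config.dump_config_path.

Common fields: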

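Field	Description	Required	Eager mode	Graph mode
`task`	Dump task type. Common PyTorch values are `"statistics"` and `"tensor"`. A statistics task collects tensor statistics (mean, variance, max, min, and so on), while a tensor task captures arbitrary tensors.	Yes	✅	✅
`dump_path`	Directory that stores the dump results. When omitted, `msprobe` uses its default path.	No	✅	✅
`rank`	Device IDs to sample. An empty list collects all ranks. For single-card tasks, this field must be set to `[]`.	No	✅	✅
`step`	Token iterations to sample. An empty list means every iteration.	No	✅	❌
`level`	Dump level string (`"L0"`, `"L1"`, or `"mix"`). `L0` targets `nn.Module`, `L1` targets `torch.api`, and `mix` collects both.	Yes	✅	✅
`async_dump`	Whether to enable asynchronous dump (supported for PyTorch `statistics`/`tensor` tasks). Defaults to `false`.	No	✅	❌
`scope`	Module range to sample. An empty list collects all modules.	No	✅	❌
`dump_enable`	Dynamic switch that enables or disables dumping in `PrecisionDebugger` while a training/inference task is running, so dumping can be turned on or off on demand within the same job.	No	✅	❌
`list`	Operator range to sample. An empty list collects all operators.	No	✅	✅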

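To restrict which modules and operators are captured, configure scope and list:

scope (list[str]): in the PyTorch PyNative scenario, this field limits the dump range. Provide two module or API names that follow the tool's naming convention to pin down a range; only data between these two names is dumped. Examples: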

"scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]
"scope": ["Cell.conv1.Conv2d.forward.0", "Cell.fc2.Dense.forward.0"]
"scope": ["Tensor.add.0.forward", "Functional.square.2.forward"]

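The level setting determines what you can provide: module names when level=L0, API names when level=L1, and either when level=mix.

list (list[str]): custom operator list. Options:
- In the PyTorch PyNative scenario, provide the full names of specific APIs to dump only those APIs. Example: "list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.forward"].
- When level=mix, you can provide a module name so that the dump expands to everything produced while that module runs. Example: "list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"].
- Provide a substring, for example "list": ["relu"], to dump every API whose name contains it. When level=mix, modules whose names contain the substring are expanded as well.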
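The selection rules for scope and list can be illustrated with a small sketch. This is not msprobe's implementation: `select_by_scope` and `select_by_list` are hypothetical helpers that mimic the documented behavior on an ordered list of names, and treating the scope range as inclusive at both ends is an assumption.

```python
from typing import List


def select_by_scope(call_order: List[str], scope: List[str]) -> List[str]:
    """Keep names from scope[0] through scope[1] (assumed inclusive); empty scope keeps all."""
    if not scope:
        return list(call_order)
    start, end = scope
    selected, active = [], False
    for name in call_order:
        if name == start:
            active = True
        if active:
            selected.append(name)
        if name == end:
            active = False
    return selected


def select_by_list(call_order: List[str], patterns: List[str]) -> List[str]:
    """Keep names containing any entry of `patterns`; a full name is the exact-match case."""
    if not patterns:
        return list(call_order)
    return [name for name in call_order if any(p in name for p in patterns)]


# Hypothetical call order, using names in the tool's naming convention.
calls = [
    "Tensor.add.0.forward",
    "Torch.relu.3.forward",
    "Tensor.permute.1.forward",
    "Functional.square.2.forward",
    "Tensor.transpose.2.forward",
]
```

With this call order, a scope of ["Tensor.add.0.forward", "Functional.square.2.forward"] keeps the first four names, while a list of ["relu"] keeps only "Torch.relu.3.forward".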

Configuration examples. Eager mode:

{
  "task": "statistics",
  "dump_path": "/home/data_dump",
  "rank": [],
  "step": [],
  "level": "L1",
  "async_dump": false,

  "statistics": {
    "scope": [],
    "list": [],
    "tensor_list": [],
    "data_mode": ["all"],
    "summary_mode": "statistics"
  }
}

Graph mode:

{
  "task": "statistics",
  "level": "L1",
  "dump_path": "/home/data_dump",
  "statistics": {
    "list": []
  }
}
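If you generate the configuration from a script, a sketch along the following lines may help. It only reproduces the two example configurations shown in this section; `build_dump_config` and `write_dump_config` are hypothetical helper names, not part of msprobe.

```python
import json
from pathlib import Path


def build_dump_config(dump_path, level="L1", graph_mode=False):
    """Return a statistics dump config that mirrors the examples in this section."""
    if level not in ("L0", "L1", "mix"):
        raise ValueError(f"unsupported level: {level}")
    config = {"task": "statistics", "level": level, "dump_path": dump_path}
    if graph_mode:
        # Graph mode: only fields marked as supported in the table are emitted.
        config["statistics"] = {"list": []}
    else:
        config.update({"rank": [], "step": [], "async_dump": False})
        config["statistics"] = {
            "scope": [],
            "list": [],
            "tensor_list": [],
            "data_mode": ["all"],
            "summary_mode": "statistics",
        }
    return config


def write_dump_config(config, path):
    """Write the config as JSON, e.g. for use with dump_config_path."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path
```

Empty lists keep the "collect everything" defaults; narrow them only after the first coarse pass has pointed at a suspicious step or operator.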

3. Enable `msprobe` in vllm-ascend

Start vLLM and pass the dump configuration through --additional-config:

vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --dtype bfloat16 \
  --host 0.0.0.0 \
  --port 8000 \
  --additional-config '{
    "dump_config": {
      "task": "statistics",
      "level": "L1",
      "dump_path": "/data/msprobe_dump",
      "statistics": {
        "list": []
      }
    }
  }' &

The legacy (compatibility) form, which passes a configuration file path, is still supported:

vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --dtype bfloat16 \
  --host 0.0.0.0 \
  --port 8000 \
  --additional-config '{"dump_config_path": "/data/msprobe_config.json"}' &
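When the server is launched from a script, building the JSON by hand inside shell quotes is error-prone. The sketch below assembles the same command line in Python; `build_vllm_command` is a hypothetical helper, and it uses only the flags shown in the commands above.

```python
import json
import shlex


def build_vllm_command(model, dump_config=None, dump_config_path=None, port=8000):
    """Assemble the `vllm serve` argv with dump settings in --additional-config."""
    if (dump_config is None) == (dump_config_path is None):
        raise ValueError("pass exactly one of dump_config or dump_config_path")
    if dump_config is not None:
        additional = {"dump_config": dump_config}
    else:
        additional = {"dump_config_path": dump_config_path}
    return [
        "vllm", "serve", model,
        "--dtype", "bfloat16",
        "--host", "0.0.0.0",
        "--port", str(port),
        "--additional-config", json.dumps(additional),
    ]


argv = build_vllm_command(
    "Qwen/Qwen2.5-0.5B-Instruct",
    dump_config_path="/data/msprobe_config.json",
)
# shlex.join produces a correctly quoted command line for copy and paste.
command_line = shlex.join(argv)
```

Passing `argv` as a list to `subprocess.Popen` avoids shell quoting altogether, which is usually the safer choice.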

4. Send requests and collect dumps

Send inference requests as usual, for example:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "Explain gravity in one sentence.",
        "max_tokens": 32,
        "temperature": 0
      }' | python -m json.tool
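For repeatable experiments it can be convenient to send the same request from Python, so that the problem run and the benchmark run receive identical input. This is a minimal sketch using only the standard library; the helper names are hypothetical, and `max_tokens` is the length parameter of the OpenAI-style completions endpoint.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="Qwen/Qwen2.5-0.5B-Instruct", max_tokens=32):
    """Build (but do not send) a deterministic /v1/completions request."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy decoding keeps the two runs comparable
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request to a running server and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Call `send(build_completion_request("Explain gravity in one sentence."))` once the server is up; each call triggers one dump cycle.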

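Each request drives the msprobe sequence start -> forward -> stop -> step. The runner calls step() on every code path, so you always get a complete data set even when inference returns early.

Dump files are written to dump_path. They typically contain:
- Tensor files grouped by operator/module.
- dump.json, which records metadata such as dtype, shape, min/max, and requires_grad.
- construct.json, generated when level is L0 or mix (required for visualization).

Example directory structure. Eager mode: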

├── dump_path
│   ├── step0
│   │   ├── rank0
│   │   │   ├── dump_tensor_data
│   │   │   │    ├── Tensor.permute.1.forward.input.0.pt               # Format: {api_type}.{api_name}.{call_count}.forward.{input/output}.{arg_index}.
│   │   │   │    │                                              # arg_index is the nth input or output of the API. If an input is a list, keep numbering with decimals (e.g., 1.1 is the first element of the first argument).
│   │   │   │    ├── Module.conv1.Conv2d.forward.0.input.0.pt          # Format: {Module}.{module_name}.{class_name}.forward.{call_count}.{input/output}.{arg_index}.
│   │   │   │    └── Module.conv1.Conv2d.forward.0.parameters.bias.pt  # Module parameter data: {Module}.{module_name}.{class_name}.forward.{call_count}.parameters.{parameter_name}.
│   │   │   │                                                          # When the `model` argument passed to dump is a List[torch.nn.Module] or Tuple[torch.nn.Module], module-level data names also include the index inside the list ({Module}.{index}.*), e.g., Module.0.conv1.Conv2d.forward.0.input.0.pt.
│   │   │   ├── dump.json
│   │   │   ├── stack.json
│   │   │   ├── dump_error_info.log
│   │   │   └── construct.json
│   │   ├── rank1
│   │   │   ├── dump_tensor_data
│   │   │   │   └── ...
│   │   │   ├── dump.json
│   │   │   ├── stack.json
│   │   │   ├── dump_error_info.log
│   │   │   └── construct.json
│   │   ├── ...
│   │   │
│   │   └── rank7
│   ├── step1
│   │   ├── ...
│   ├── step2

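- rank: device ID. Each card writes its data to the matching rank{ID} directory. In non-distributed scenarios the directory is simply named rank.
- dump_tensor_data: the collected tensor payloads.
- dump.json: statistics for the forward data of each API or module, including name, dtype, shape, max, min, mean, and L2 norm (the square root of the sum of squared elements), plus a CRC-32 checksum when summary_mode="md5". See the dump.json file description for details.
- dump_error_info.log: present only when the dump tool hits an error; records the failure log.
- stack.json: call stacks of the APIs/modules.
- construct.json: description of the hierarchy. Empty when level=L1.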
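To navigate a dump programmatically, the layout and naming formats described above can be parsed with a few lines of Python. This is a sketch based only on the documented patterns; `list_rank_dirs` and `parse_tensor_name` are hypothetical helpers, and names that deviate from those patterns are not handled.

```python
from pathlib import Path


def list_rank_dirs(dump_path):
    """Yield (step_name, rank_name, path) for each rank directory under dump_path."""
    for step_dir in sorted(Path(dump_path).glob("step*")):
        for rank_dir in sorted(step_dir.glob("rank*")):
            if rank_dir.is_dir():
                yield step_dir.name, rank_dir.name, rank_dir


def parse_tensor_name(file_name):
    """Split a tensor file name into parts; raises ValueError if 'forward' is absent."""
    stem = file_name[:-3] if file_name.endswith(".pt") else file_name
    parts = stem.split(".")
    fwd = len(parts) - 1 - parts[::-1].index("forward")  # last "forward" token
    rest = parts[fwd + 1:]
    if parts[0] == "Module":
        # Module.{module_name}.{class_name}.forward.{call_count}.{kind}.{index or parameter}
        return {
            "module": ".".join(parts[1:fwd - 1]),
            "class": parts[fwd - 1],
            "call_count": int(rest[0]),
            "kind": rest[1],
            "index": ".".join(rest[2:]),
        }
    # {api_type}.{api_name}.{call_count}.forward.{kind}.{arg_index}
    return {
        "api": ".".join(parts[:fwd - 1]),
        "call_count": int(parts[fwd - 1]),
        "kind": rest[0] if rest else None,
        "index": ".".join(rest[1:]),
    }
```

Directory names are sorted as strings, so step10 sorts before step2; sort on the numeric suffix if the order matters.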

Graph mode:

L0_dump
├── step0
│   └── rank0
│       └── dump.json
├── step1
│   └── rank0
│       └── dump.json
├── step2
│   └── rank0
│       └── dump.json
├── step3
│   └── rank0
│       └── dump.json
├── step4
│   └── rank0
│       └── dump.json
└── step5
    └── rank0
        └── dump.json

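dump.json: statistics for the forward data of each API or module, including name, dtype, shape, max, min, mean, and L2 norm (the square root of the sum of squared elements), plus a CRC-32 checksum when summary_mode="md5". See the dump.json file description for details.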
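A quick first pass over a statistics dump is to scan dump.json for non-finite or unusually large values. The sketch below assumes the layout shown in the appendix (a top-level "data" object whose tensor entries carry "Max", "Min", "Mean", and "Norm"); the helper names and the `abs_limit` threshold are hypothetical choices, not msprobe features.

```python
import json
import math


def load_dump(path):
    """Load a dump.json file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_tensor_stats(node, path=""):
    """Recursively yield (path, stats) for every dict that describes a tensor."""
    if isinstance(node, dict):
        if "dtype" in node and "shape" in node:
            yield path, node
            return
        for key, value in node.items():
            yield from iter_tensor_stats(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from iter_tensor_stats(value, f"{path}.{i}")


def find_suspicious(dump, abs_limit=1e4):
    """Return (path, field, value) for NaN/Inf statistics or magnitudes above abs_limit."""
    hits = []
    for path, stats in iter_tensor_stats(dump.get("data", {})):
        for field in ("Max", "Min", "Mean", "Norm"):
            value = stats.get(field)
            if not isinstance(value, (int, float)):
                continue  # field absent or stored in a non-numeric form
            if not math.isfinite(value) or abs(value) > abs_limit:
                hits.append((path, field, value))
    return hits
```

Because entries are visited in file order, the first hit is usually the best place to start reading; later hits are often just propagation of the same fault.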

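5. Analyze the results

5.1 Prerequisites

You usually need two captured data sets: one from the "problem side" (the run that exposes the accuracy or numerical error) and one from the "benchmark side" (a known-good baseline). The two do not have to be identical: they can come from different branches, framework versions, or even different implementations (operator replacements, different graph-optimization switches, and so on). As long as they use the same or similar inputs, hardware topology, and sampling points (steps/tokens), msprobe can compare them and locate the nodes that diverge. If no fully clean benchmark is available, capture the problem-side data first, build a minimal reproducible case by hand, and then run a self-comparison. In the rest of this section, the problem dump directory is assumed to be problem_dump and the benchmark dump directory bench_dump.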
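Before opening the visual comparison, a rough script-level comparison of the two statistics dumps can already show where the runs start to diverge. The following is a simplified sketch, not msprobe's comparison algorithm: it assumes both dump.json files use the layout shown in the appendix, matches entries purely by name, and uses a hypothetical relative tolerance `rel_tol`.

```python
import json
from pathlib import Path

STAT_FIELDS = ("Max", "Min", "Mean", "Norm")


def load_dump(rank_dir):
    """Load dump.json from a rank directory such as problem_dump/step0/rank0."""
    return json.loads((Path(rank_dir) / "dump.json").read_text(encoding="utf-8"))


def flatten_stats(node, prefix=""):
    """Flatten nested dump data into {"<entry path>.<field>": value}."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            name = f"{prefix}.{key}" if prefix else key
            if key in STAT_FIELDS and isinstance(value, (int, float)):
                flat[name] = float(value)
            else:
                flat.update(flatten_stats(value, name))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            flat.update(flatten_stats(value, f"{prefix}.{i}"))
    return flat


def compare_dumps(problem, bench, rel_tol=1e-3):
    """Return (name, problem_value, bench_value, rel_err) in problem-dump order."""
    p, b = flatten_stats(problem["data"]), flatten_stats(bench["data"])
    diffs = []
    for name, pv in p.items():
        if name not in b or pv == b[name]:
            continue
        rel_err = abs(pv - b[name]) / max(abs(b[name]), 1e-12)
        if not rel_err <= rel_tol:  # also catches NaN, which fails every comparison
            diffs.append((name, pv, b[name], rel_err))
    return diffs
```

The first element of the result is the earliest statistic that differs, which is typically a better lead than the largest difference further downstream.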

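5.2 Visualization

Use msprobe graph_visualize to build or compare graphs, then open the generated *.vis.db file in TensorBoard (tb_graph_ascend plugin).

Make sure the dump data is ready for visualization:
- The dump level must be L0 or mix so that construct.json is not empty.
- Each rank directory should contain dump.json, stack.json, and construct.json.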
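The two readiness conditions can be checked with a short script before running the visualization command. This is a sketch; `check_rank_dir` is a hypothetical helper and not part of msprobe.

```python
import json
from pathlib import Path

REQUIRED = ("dump.json", "stack.json", "construct.json")


def check_rank_dir(rank_dir):
    """Return a list of problems that would block visualization (empty if ready)."""
    rank_dir = Path(rank_dir)
    problems = [f"missing {name}" for name in REQUIRED if not (rank_dir / name).is_file()]
    construct = rank_dir / "construct.json"
    if construct.is_file():
        try:
            content = json.loads(construct.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append("construct.json is not valid JSON")
        else:
            if not content:
                # An empty structure means the dump was taken at level L1.
                problems.append("construct.json is empty (level must be L0 or mix)")
    return problems
```

Run it on every rank directory of both dumps; an empty list for each means the data can be passed to graph_visualize.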
Choose a command mode:

Single-graph build:

msprobe graph_visualize -tp <target_path> -o <output_path>

Graph comparison:

msprobe graph_visualize -tp <target_path> -gp <golden_path> -o <output_path>

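Common optional flags:
- -oc / --overflow_check: enable overflow marking
- -fm / --fuzzy_match: enable fuzzy matching for node mapping
- -lm / --layer_mapping [mapping.yaml]: cross-framework/layer-mapping comparison
- -tensor_log: print per-node comparison logs (tensor dump scenario)
- -progress_log: print detailed progress logs

graph_visualize detects the path granularity automatically:
- Single rank: .../step0/rank0
- Multiple ranks (batch): .../step0
- Multiple steps (batch): the dump root path that contains step*

Output files:
- Single-graph build: build_{timestamp}.vis.db
- Graph comparison: compare_{timestamp}.vis.db

Start TensorBoard with the output directory: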

tensorboard --logdir <output_path> --bind_all --port <optional_port>

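In the visualization UI, inspect structural and numerical differences:
- Switch rank/step to quickly locate unstable nodes.
- Use search/filter to focus on the target operators/modules.
- In comparison mode, start with the highlighted high-difference nodes and trace their surrounding inputs, outputs, and parameters.

6. Troubleshooting

- RuntimeError: Please enforce eager mode: restart vLLM with the --enforce-eager flag.
- No dump files: confirm that the JSON path is correct and that every node has write permission. In distributed scenarios, set keep_all_ranks so that each rank writes its own dump.
- Dump files too large: start with the statistics task, locate the abnormal tensors, then narrow the range with scope/list/tensor_list, filters, token_range, and so on.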

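Appendix

dump.json file description

L0 level

The L0-level dump.json contains each module's forward inputs, outputs, and parameters. Taking PyTorch's Conv2d as an example, the network code is: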

output = self.conv2(input)  # self.conv2 = torch.nn.Conv2d(16, 32, 5, bias=True)

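dump.json contains the following entry:

Module.conv2.Conv2d.forward.0: the module's forward data. input_args holds positional inputs, input_kwargs holds keyword inputs, output stores the forward output, and parameters stores the weight/bias.

Note: when the model argument passed to the dump API is a List[torch.nn.Module] or Tuple[torch.nn.Module], module-level names include the index within the list ({Module}.{index}.*). Example: Module.0.conv1.Conv2d.forward.0.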

{
 "task": "tensor",
 "level": "L0",
 "framework": "pytorch",
 "dump_data_dir": "/dump/path",
 "data": {
  "Module.conv2.Conv2d.forward.0": {
   "input_args": [
    {
     "type": "torch.Tensor",
     "dtype": "torch.float32",
     "shape": [
      8,
      16,
      14,
      14
     ],
     "Max": 1.638758659362793,
     "Min": 0.0,
     "Mean": 0.2544615864753723,
     "Norm": 70.50277709960938,
     "requires_grad": true,
     "data_name": "Module.conv2.Conv2d.forward.0.input.0.pt"
    }
   ],
   "input_kwargs": {},
   "output": [
    {
     "type": "torch.Tensor",
     "dtype": "torch.float32",
     "shape": [
      8,
      32,
      10,
      10
     ],
     "Max": 1.6815717220306396,
     "Min": -1.5120246410369873,
     "Mean": -0.025344856083393097,
     "Norm": 149.65576171875,
     "requires_grad": true,
     "data_name": "Module.conv2.Conv2d.forward.0.output.0.pt"
    }
   ],
   "parameters": {
    "weight": {
     "type": "torch.Tensor",
     "dtype": "torch.float32",
     "shape": [
      32,
      16,
      5,
      5
     ],
     "Max": 0.05992485210299492,
     "Min": -0.05999220535159111,
     "Mean": -0.0006165213999338448,
     "Norm": 3.421217441558838,
     "requires_grad": true,
     "data_name": "Module.conv2.Conv2d.forward.0.parameters.weight.pt"
    },
    "bias": {
     "type": "torch.Tensor",
     "dtype": "torch.float32",
     "shape": [
      32
     ],
     "Max": 0.05744686722755432,
     "Min": -0.04894155263900757,
     "Mean": 0.006410328671336174,
     "Norm": 0.17263513803482056,
     "requires_grad": true,
     "data_name": "Module.conv2.Conv2d.forward.0.parameters.bias.pt"
    }
   }
  }
 }
}

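L1 level

The L1-level dump.json records the forward inputs and outputs of APIs. Taking PyTorch's relu function as an example (output = torch.nn.functional.relu(input)), the file contains:

Functional.relu.0.forward: the API's forward data. input_args holds positional inputs, input_kwargs holds keyword inputs, and output stores the forward output.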

{
 "task": "tensor",
 "level": "L1",
 "framework": "pytorch",
 "dump_data_dir":"/dump/path",
 "data": {
  "Functional.relu.0.forward": {
   "input_args": [
    {
     "type": "torch.Tensor",
     "dtype": "torch.float32",
     "shape": [
      32,
      16,
      28,
      28
     ],
     "Max": 1.3864083290100098,
     "Min": -1.3364859819412231,
     "Mean": 0.03711778670549393,
     "Norm": 236.20692443847656,
     "requires_grad": true,
     "data_name": "Functional.relu.0.forward.input.0.pt"
    }
   ],
   "input_kwargs": {},
   "output": [
    {
     "type": "torch.Tensor",
     "dtype": "torch.float32",
     "shape": [
      32,
      16,
      28,
      28
     ],
     "Max": 1.3864083290100098,
     "Min": 0.0,
     "Mean": 0.16849493980407715,
     "Norm": 175.23345947265625,
     "requires_grad": true,
     "data_name": "Functional.relu.0.forward.output.0.pt"
    }
   ]
  }
 }
}

mix level

The mix-level dump.json contains both L0-level and L1-level data; the file format is the same as in the examples above.
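When post-processing a mix-level file, it is often useful to separate module entries from API entries. The sketch below does this by name prefix; it assumes that module entries start with "Module." (or "Cell.", as in the scope examples earlier in this guide) and that everything else is an API entry. `split_by_level` is a hypothetical helper.

```python
MODULE_PREFIXES = ("Module.", "Cell.")


def split_by_level(dump):
    """Split dump["data"] into (module_entries, api_entries), preserving file order."""
    modules, apis = {}, {}
    for name, entry in dump.get("data", {}).items():
        target = modules if name.startswith(MODULE_PREFIXES) else apis
        target[name] = entry
    return modules, apis
```

The module half can then be read like an L0 dump and the API half like an L1 dump.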