MSProbe Debugging Guide

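During inference or training we often see outputs that drift from expectations, numerical instability such as NaN/Inf, or predictions that disagree with labels. Locating the root cause requires monitoring and capturing the intermediate data produced while the model runs: feature maps, weights, activations, and per-layer outputs. By capturing key tensors at critical stages, recording the input/output pairs of core layers, and keeping contextual metadata (prompts, tensor dtypes, hardware configuration), you can systematically trace where accuracy degradation or numerical errors originate. This guide focuses on vllm-ascend serving and walks through the full workflow for diagnosing model accuracy problems: preparation, data collection, and analysis and verification.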

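0. Background Concepts

msprobe supports three dump levels:

- L0: saves tensors at the nn.Module level and generates construct.json so that the visualization tool can reconstruct the network structure. Requires passing a model or submodule handle.
- L1: collects operator-level statistics only; suitable for lightweight troubleshooting.
- mix: captures both structural information and operator statistics; suited to cases that need graph construction as well as numerical comparison.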

1. Prerequisites

1.1 Install `msprobe`

Install msprobe with pip:

```
pip install mindstudio-probe==8.3.0
```

1.2 Visualization dependencies (optional)

Install the following dependency if you want to visualize the collected data.

Install tb_graph_ascend:
```
pip install tb_graph_ascend
```

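2. Collecting Data with `msprobe`

Collection usually proceeds from coarse to fine: first identify the token at which the problem appears, then decide the sampling range around that token. The usual workflow is as follows.

2.1 Prepare the dump configuration file

Create a config.json that PrecisionDebugger can parse and place it at an accessible path. Common fields: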

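| Field | Description | Required |
| --- | --- | --- |
| `task` | Type of dump task. Common PyTorch values are `"statistics"` and `"tensor"`: a statistics task collects tensor statistics (mean, variance, max, min, and so on), while a tensor task can capture arbitrary tensors. | Yes |
| `dump_path` | Directory where dump results are saved. If not set, the `msprobe` default path is used. | No |
| `rank` | Device ranks to collect. An empty list means all ranks; single-card tasks must use `[]`. | No |
| `step` | Token iterations to collect. An empty list means all iterations. | No |
| `level` | Dump level string (`"L0"`, `"L1"`, or `"mix"`). L0 targets `nn.Module`, L1 targets `torch.api`, and mix collects both. | Yes |
| `async_dump` | Whether to enable asynchronous dump (available for PyTorch `statistics`/`tensor` tasks). Defaults to `false`. | No |
| `scope` | Range of modules to collect. An empty list means all modules. | No |
| `list` | Range of operators to collect. An empty list means all operators. | No |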

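To narrow the collection range further, configure scope and list:

scope (list[str]): In PyTorch dynamic-graph scenarios, limits the dump to an interval. Provide two module or API names in the tool's naming format; only data within that interval is dumped. Examples: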
```
"scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]
"scope": ["Cell.conv1.Conv2d.forward.0", "Cell.fc2.Dense.forward.0"]
"scope": ["Tensor.add.0.forward", "Functional.square.2.forward"]
```
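The value of level determines what may be configured: module names for level=L0, API names for level=L1, and either for level=mix.
list (list[str]): Customizes the range of operators to collect. Common usages:
- In PyTorch dynamic-graph scenarios, give full API names to dump only those APIs, for example "list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.forward"].
- When level=mix, you can give a module name; the tool expands that module and dumps all data produced while it executes, for example "list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"].
- You can also give only a substring (such as "list": ["relu"]). APIs whose names contain that string are dumped, and when level=mix, modules whose names contain it are expanded as well.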
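The selection rules for scope and list can be pictured with a small sketch. This is only an illustration of the semantics described above, not msprobe's implementation: the helper names `select_by_list` and `select_by_scope` and the sample trace are hypothetical, and the sketch assumes both scope endpoints are included in the interval.

```python
def select_by_list(names, patterns):
    """Keep names that contain any pattern as a substring (a full name matches itself)."""
    if not patterns:  # an empty list means "collect everything"
        return list(names)
    return [name for name in names if any(p in name for p in patterns)]


def select_by_scope(names, scope):
    """Keep the names from scope[0] through scope[1] in execution order (assumed inclusive)."""
    if not scope:
        return list(names)
    start, end = scope
    first = names.index(start)
    last = names.index(end, first)
    return names[first:last + 1]


# Hypothetical execution trace, listed in call order.
trace = [
    "Tensor.permute.1.forward",
    "Torch.relu.3.forward",
    "Tensor.transpose.2.forward",
    "Functional.relu.0.forward",
]

print(select_by_list(trace, ["relu"]))
print(select_by_scope(trace, ["Torch.relu.3.forward", "Tensor.transpose.2.forward"]))
```

Running a dry selection like this against the names found in a first, unfiltered statistics dump is one way to check a filter before launching a much larger tensor dump.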

Example configuration:

```
{
  "task": "statistics",
  "dump_path": "/home/data_dump",
  "rank": [],
  "step": [],
  "level": "L1",
  "async_dump": false,

  "statistics": {
    "scope": [],
    "list": [],
    "tensor_list": [],
    "data_mode": ["all"],
    "summary_mode": "statistics"
  }
}
```
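A typo in the configuration is easiest to catch before the server starts. The following is a minimal sketch that checks only the fields and values listed in this guide and then writes the file; `check_config` and `write_config` are hypothetical helpers, and msprobe may accept values that this sketch does not know about.

```python
import json
from pathlib import Path

# Values taken from the field table in this guide; the real tool may allow more.
ALLOWED_TASKS = {"statistics", "tensor"}
ALLOWED_LEVELS = {"L0", "L1", "mix"}


def check_config(cfg):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    if cfg.get("task") not in ALLOWED_TASKS:
        problems.append(f"task must be one of {sorted(ALLOWED_TASKS)}")
    if cfg.get("level") not in ALLOWED_LEVELS:
        problems.append(f"level must be one of {sorted(ALLOWED_LEVELS)}")
    for key in ("rank", "step"):
        if not isinstance(cfg.get(key, []), list):
            problems.append(f"{key} must be a list")
    return problems


def write_config(cfg, path):
    """Write cfg as JSON after the basic checks; raise ValueError if any check fails."""
    problems = check_config(cfg)
    if problems:
        raise ValueError("; ".join(problems))
    Path(path).write_text(json.dumps(cfg, indent=2), encoding="utf-8")


config = {
    "task": "statistics",
    "dump_path": "/home/data_dump",
    "rank": [],
    "step": [],
    "level": "L1",
    "async_dump": False,
    "statistics": {
        "scope": [],
        "list": [],
        "tensor_list": [],
        "data_mode": ["all"],
        "summary_mode": "statistics",
    },
}

print(check_config(config))
```

Calling `write_config(config, "config.json")` would then produce a file equivalent to the example above.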

3. Enable `msprobe` in vllm-ascend

Start vLLM in eager mode by adding --enforce-eager (static graphs are not supported yet), and pass the configuration path through --additional-config:

```
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --dtype float16 \
  --enforce-eager \
  --host 0.0.0.0 \
  --port 8000 \
  --additional-config '{"dump_config_path": "/data/msprobe_config.json"}' &
```
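The JSON value passed to --additional-config is easy to misquote when the command is assembled by a script. As a sketch, the argument list can be built in Python so that `json.dumps` produces the JSON and `shlex.join` handles the shell quoting. The function name `build_serve_command` is hypothetical; the flags are exactly those in the command above.

```python
import json
import shlex


def build_serve_command(model, dump_config_path, port=8000):
    """Return the vllm serve argument list with the dump config embedded as JSON."""
    additional = json.dumps({"dump_config_path": dump_config_path})
    return [
        "vllm", "serve", model,
        "--dtype", "float16",
        "--enforce-eager",            # msprobe dumping requires eager mode
        "--host", "0.0.0.0",
        "--port", str(port),
        "--additional-config", additional,
    ]


cmd = build_serve_command("Qwen/Qwen2.5-0.5B-Instruct", "/data/msprobe_config.json")
print(shlex.join(cmd))  # a shell-safe string; cmd itself can go to subprocess.Popen
```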

4. Send requests and collect dumps

Send inference requests as usual, for example:

```
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "Explain gravity in one sentence.",
        "max_completion_tokens": 32,
        "temperature": 0
      }' | python -m json.tool
```
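When the same prompt has to be replayed against a problem build and a baseline build, a scripted request is easier to keep identical than a hand-typed curl command. The sketch below mirrors the curl call using only the standard library; `build_request` and `send` are hypothetical helper names, and the payload fields are copied from the example above.

```python
import json
import urllib.request


def build_request(prompt, base_url="http://localhost:8000"):
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": prompt,
        "max_completion_tokens": 32,
        "temperature": 0,  # greedy decoding keeps the two runs comparable
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_request("Explain gravity in one sentence.")
# With the server from the previous step running:
#     print(json.dumps(send(request), indent=2))
```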

Each request drives the msprobe sequence start -> forward -> stop -> step. The runner calls step() on every code path, so the dump is complete even when inference ends early.
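The guarantee that step() runs on every path is the kind of behavior a try/finally block provides. The sketch below illustrates the pattern with a stand-in class that only records calls. It is not the vllm-ascend runner code, and `RecordingDebugger` and `run_with_dump` are hypothetical names; in a real service the debugger object would be msprobe's PrecisionDebugger, whose exact signatures are not reproduced here.

```python
class RecordingDebugger:
    """Stand-in that only records which lifecycle calls were made."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")

    def step(self):
        self.calls.append("step")


def run_with_dump(debugger, forward):
    """Run one forward pass; stop() and step() still run if forward raises."""
    debugger.start()
    try:
        return forward()
    finally:
        debugger.stop()
        debugger.step()


def failing_forward():
    raise RuntimeError("request aborted")


debugger = RecordingDebugger()
run_with_dump(debugger, lambda: "ok")      # normal request
try:
    run_with_dump(debugger, failing_forward)  # request that ends early
except RuntimeError:
    pass

print(debugger.calls)
```

Both the normal and the aborted request leave the same start, stop, step sequence in the record, which is why the step directories stay aligned with requests.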

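Dump files are written to dump_path and typically include:

- Tensor files, organized by operator or module.
- dump.json, which describes dtype, shape, min/max values, requires_grad, and similar information.
- construct.json, generated when the level is L0 or mix (required for visualization).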

Directory structure example:

```
├── dump_path
│   ├── step0
│   │   ├── rank0
│   │   │   ├── dump_tensor_data
│   │   │   │    ├── Tensor.permute.1.forward.pt                       # Format: {api_type}.{api_name}.{call_count}.forward.{input/output}.{arg_index}.
│   │   │   │    │                                              # arg_index is the nth input or output of the API. If an input is a list, keep numbering with decimals (e.g., 1.1 is the first element of the first argument).
│   │   │   │    ├── Module.conv1.Conv2d.forward.0.input.0.pt          # Format: {Module}.{module_name}.{class_name}.forward.{call_count}.{input/output}.{arg_index}.
│   │   │   │    └── Module.conv1.Conv2d.forward.0.parameters.bias.pt  # Module parameter data: {Module}.{module_name}.{class_name}.forward.{call_count}.parameters.{parameter_name}.
│   │   │   │                                                          # When the `model` argument passed to dump is a List[torch.nn.Module] or Tuple[torch.nn.Module], module-level data names also include the index inside the list ({Module}.{index}.*), e.g., Module.0.conv1.Conv2d.forward.0.input.0.pt.
│   │   │   ├── dump.json
│   │   │   ├── stack.json
│   │   │   ├── dump_error_info.log
│   │   │   └── construct.json
│   │   ├── rank1
│   │   │   ├── dump_tensor_data
│   │   │   │   └── ...
│   │   │   ├── dump.json
│   │   │   ├── stack.json
│   │   │   ├── dump_error_info.log
│   │   │   └── construct.json
│   │   ├── ...
│   │   │
│   │   └── rank7
│   ├── step1
│   │   ├── ...
│   ├── step2
```
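- rank: Device ID. Each card writes to its own rank{ID} directory; in non-distributed scenarios the directory is named rank.
- dump_tensor_data: The collected tensor data.
- dump.json: Statistics of the forward data for each API or module, including name, dtype, shape, max, min, mean, L2 norm (the square root of the sum of squared elements), and a CRC-32 value when summary_mode="md5". See the dump.json file description in the appendix.
- dump_error_info.log: Generated only when the dump tool hits an error; records the failure log.
- stack.json: Call-stack information for each API or module.
- construct.json: Hierarchical structure description; empty when level=L1.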
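Before moving on to analysis it helps to confirm that every step/rank directory actually contains the files described above. The following sketch walks a dump directory and reports which of the expected JSON files are missing or empty. The helper names are hypothetical, and treating a zero-length file or an empty JSON value as "empty" is an assumption about how an empty construct.json is written.

```python
import json
import tempfile
from pathlib import Path

EXPECTED_FILES = ("dump.json", "stack.json", "construct.json")


def is_blank(path):
    """True if the file is missing, zero-length, or parses to an empty JSON value."""
    if not path.is_file():
        return True
    text = path.read_text(encoding="utf-8").strip()
    if not text:
        return True
    try:
        return not json.loads(text)
    except ValueError:
        return False  # unparseable, but not empty either


def inventory(dump_path):
    """Map 'stepN/rankM' to the expected files that are missing or empty."""
    report = {}
    for step_dir in sorted(Path(dump_path).glob("step*")):
        for rank_dir in sorted(step_dir.glob("rank*")):
            if rank_dir.is_dir():
                key = f"{step_dir.name}/{rank_dir.name}"
                report[key] = [n for n in EXPECTED_FILES if is_blank(rank_dir / n)]
    return report


# Demo on a throwaway tree that mimics an L1 dump with an empty construct.json.
with tempfile.TemporaryDirectory() as root:
    rank0 = Path(root, "step0", "rank0")
    rank0.mkdir(parents=True)
    (rank0 / "dump.json").write_text('{"task": "statistics", "level": "L1", "data": {}}')
    (rank0 / "stack.json").write_text('{"Functional.relu.0.forward": []}')
    (rank0 / "construct.json").write_text("{}")
    demo_report = inventory(root)

print(demo_report)
```

On a real run, `inventory("/home/data_dump")` would be the call to make; a rank that lists construct.json as empty cannot be used for graph visualization.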

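5. Analyze the results

5.1 Prerequisites

You usually need two dump datasets: one from the "problem side" (the run that exposes the accuracy or numerical error) and one from the "benchmark side" (a known-good baseline). The two do not have to be identical setups: they may come from different branches, framework versions, or even alternative implementations (operator substitutions, different graph-optimization switches, and so on). As long as they use the same or similar inputs, hardware topology, and sampling points (step/token), msprobe can compare them and locate the diverging nodes. If no fully clean benchmark is available, capture the problem-side data first, build a minimal reproducible case by hand, and compare it against itself. The rest of this section assumes the problem-side dump is problem_dump and the benchmark-side dump is bench_dump.

5.2 Visualization

Generate the result with msprobe -f pytorch graph, then open it in tb_graph_ascend.

Make sure the dump contains construct.json (that is, level is L0 or mix).

Prepare a comparison file such as compare.json. Its format and generation flow are described in section 3.1.3 of msprobe_visualization.md. Example (minimal runnable snippet):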

```
{
  "npu_path": "./problem_dump",
  "bench_path": "./bench_dump",
  "is_print_compare_log": true
}
```

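Replace the paths with your own dump directories before invoking msprobe -f pytorch graph. To build a single graph, omit bench_path and only one dump is visualized. Single-rank, multi-rank, and multi-step multi-rank layouts are all supported. npu_path or bench_path must contain folders named rank followed by a number, and every rank folder must hold a non-empty construct.json together with dump.json and stack.json. If any construct.json is empty, check that the dump level was L0 or mix. When comparing graphs, npu_path and bench_path must contain the same set of rank folders so that they can be paired one to one.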

```
├── npu_path or bench_path
|   ├── rank0
|   |   ├── dump_tensor_data (only when the `tensor` option is enabled)
|   |   |    ├── Tensor.permute.1.forward.pt
|   |   |    ├── MyModule.0.forward.input.pt
|   |   |    ...
|   |   |    └── Functional.linear.5.forward.output.pt
|   |   ├── dump.json         # Tensor metadata
|   |   ├── stack.json        # Operator call stack information
|   |   └── construct.json    # Hierarchical structure; empty when `level=L1`
|   ├── rank1
|   |   ├── dump_tensor_data
|   |   |   └── ...
|   |   ├── dump.json
|   |   ├── stack.json
|   |   └── construct.json
|   ├── ...
|   |
|   └── rankn
```
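The pairing requirement can be checked before the comparison file is written. The sketch below verifies that both sides contain the same rank folders and then builds the compare.json content. `rank_dirs` and `build_compare_config` are hypothetical helpers, and the sketch covers only the layout shown above, where rank folders sit directly under npu_path and bench_path.

```python
import json
import tempfile
from pathlib import Path


def rank_dirs(path):
    """Return the names of rank<number> folders directly under path."""
    return {
        p.name for p in Path(path).iterdir()
        if p.is_dir() and p.name.startswith("rank") and p.name[4:].isdigit()
    }


def build_compare_config(npu_path, bench_path):
    """Return the compare.json content, or raise ValueError if ranks cannot be paired."""
    npu_ranks, bench_ranks = rank_dirs(npu_path), rank_dirs(bench_path)
    if not npu_ranks:
        raise ValueError(f"no rank<number> folders under {npu_path}")
    if npu_ranks != bench_ranks:
        raise ValueError(f"unpaired rank folders: {sorted(npu_ranks ^ bench_ranks)}")
    return {
        "npu_path": str(npu_path),
        "bench_path": str(bench_path),
        "is_print_compare_log": True,
    }


# Demo with two throwaway dump trees that each hold rank0 and rank1.
with tempfile.TemporaryDirectory() as root:
    for side in ("problem_dump", "bench_dump"):
        for rank in ("rank0", "rank1"):
            Path(root, side, rank).mkdir(parents=True)
    compare = build_compare_config(Path(root, "problem_dump"), Path(root, "bench_dump"))

print(json.dumps(compare, indent=2))
```

Writing the returned dictionary with `json.dump` yields a file in the same shape as the compare.json example.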

Run:
```
msprobe -f pytorch graph \
    --input_path ./compare.json \
    --output_path ./graph_output
```
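When the run finishes, a *.vis.db file is created under graph_output.
- Graph build: build_{timestamp}.vis.db
- Graph comparison: compare_{timestamp}.vis.db
Start tensorboard and load the output directory to inspect structural differences, numerical comparisons, overflow detection results, cross-device communication nodes, and filtering/search. Pass the directory that contains the .vis.db file to --logdir: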
```
tensorboard --logdir out_path --bind_all --port [optional_port]
```
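Inspect the visualization. The UI normally shows the overall model structure, including operators, parameters, and tensor I/O. Click any node to expand its children.
- Difference visualization: the comparison result highlights diverging nodes in color (the larger the difference, the redder the node). Click a node to see its details, including tensor inputs/outputs, parameters, and operator type. Analyze the data differences and the surrounding connections to pinpoint exactly where the divergence begins.
- Helper features:
  - Switch rank/step: quickly check diverging nodes on different ranks and steps.
  - Search/filter: use the search box to filter nodes by operator name and similar criteria.
  - Manual mapping: automatic mapping cannot cover every case, so the tool lets you map nodes between the problem graph and the benchmark graph by hand before generating the comparison result.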

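6. Troubleshooting

- RuntimeError: Please enforce eager mode: restart vLLM and add the --enforce-eager flag.
- No dump files: confirm that the JSON path is correct and that every node has write permission. In distributed scenarios, set keep_all_ranks so that each rank writes its own dump.
- Dump files are too large: start with a statistics task to locate the abnormal tensors, then narrow the range with scope/list/tensor_list, filters, token_range, and similar options.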

Appendix

dump.json file description

L0 level

The L0 dump.json contains the forward I/O of modules along with their parameters. Taking PyTorch's Conv2d as an example, the network code is:

```
output = self.conv2(input)  # self.conv2 = torch.nn.Conv2d(16, 32, 5, bias=True), matching the shapes recorded below
```

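dump.json contains the following entry:

Module.conv2.Conv2d.forward.0: forward data of the module. input_args holds positional inputs, input_kwargs holds keyword inputs, output stores the forward outputs, and parameters stores the weight and bias.

Note: when the model argument passed to the dump API is a List[torch.nn.Module] or Tuple[torch.nn.Module], module-level names include the index inside the list ({Module}.{index}.*). For example: Module.0.conv1.Conv2d.forward.0.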

```
{
 "task": "tensor",
 "level": "L0",
 "framework": "pytorch",
 "dump_data_dir": "/dump/path",
 "data": {
  "Module.conv2.Conv2d.forward.0": {
   "input_args": [
    {
     "type": "torch.Tensor",
     "dtype": "torch.float32",
     "shape": [
      8,
      16,
      14,
      14
     ],
     "Max": 1.638758659362793,
     "Min": 0.0,
     "Mean": 0.2544615864753723,
     "Norm": 70.50277709960938,
     "requires_grad": true,
     "data_name": "Module.conv2.Conv2d.forward.0.input.0.pt"
    }
   ],
   "input_kwargs": {},
   "output": [
    {
     "type": "torch.Tensor",
     "dtype": "torch.float32",
     "shape": [
      8,
      32,
      10,
      10
     ],
     "Max": 1.6815717220306396,
     "Min": -1.5120246410369873,
     "Mean": -0.025344856083393097,
     "Norm": 149.65576171875,
     "requires_grad": true,
     "data_name": "Module.conv2.Conv2d.forward.0.output.0.pt"
    }
   ],
   "parameters": {
    "weight": {
     "type": "torch.Tensor",
     "dtype": "torch.float32",
     "shape": [
      32,
      16,
      5,
      5
     ],
     "Max": 0.05992485210299492,
     "Min": -0.05999220535159111,
     "Mean": -0.0006165213999338448,
     "Norm": 3.421217441558838,
     "requires_grad": true,
     "data_name": "Module.conv2.Conv2d.forward.0.parameters.weight.pt"
    },
    "bias": {
     "type": "torch.Tensor",
     "dtype": "torch.float32",
     "shape": [
      32
     ],
     "Max": 0.05744686722755432,
     "Min": -0.04894155263900757,
     "Mean": 0.006410328671336174,
     "Norm": 0.17263513803482056,
     "requires_grad": true,
     "data_name": "Module.conv2.Conv2d.forward.0.parameters.bias.pt"
    }
   }
  }
 }
}
```
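Because every tensor record carries Max, Min, Mean, and Norm, a first pass for NaN/Inf can be done on dump.json alone, without loading any .pt file. The sketch below walks the structure shown above and reports non-finite statistics. The helper names are hypothetical, and the sketch assumes non-finite values appear as the NaN/Infinity literals that Python's json module accepts; if the tool writes them differently, the check would need adjusting.

```python
import json
import math

STAT_KEYS = ("Max", "Min", "Mean", "Norm")


def iter_tensors(node):
    """Yield every dict that looks like a tensor record (has a 'dtype' key), at any depth."""
    if isinstance(node, dict):
        if "dtype" in node:
            yield node
        else:
            for value in node.values():
                yield from iter_tensors(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tensors(item)


def find_non_finite(dump):
    """Return (entry_name, data_name, stat_key) for every NaN or Inf statistic."""
    hits = []
    for entry_name, entry in dump["data"].items():
        for tensor in iter_tensors(entry):
            for key in STAT_KEYS:
                value = tensor.get(key)
                if isinstance(value, (int, float)) and not math.isfinite(value):
                    hits.append((entry_name, tensor.get("data_name"), key))
    return hits


# Made-up sample in the same shape as the L0 example, with an overflowing output.
sample = json.loads("""
{"data": {"Module.conv2.Conv2d.forward.0": {
  "input_args": [{"type": "torch.Tensor", "dtype": "torch.float32", "Max": 1.6, "Min": 0.0,
                  "Mean": 0.25, "Norm": 70.5,
                  "data_name": "Module.conv2.Conv2d.forward.0.input.0.pt"}],
  "input_kwargs": {},
  "output": [{"type": "torch.Tensor", "dtype": "torch.float32", "Max": Infinity, "Min": -1.5,
              "Mean": NaN, "Norm": Infinity,
              "data_name": "Module.conv2.Conv2d.forward.0.output.0.pt"}],
  "parameters": {}
}}}
""")

for hit in find_non_finite(sample):
    print(hit)
```

An entry whose inputs are finite but whose output is not, as in this sample, is a natural candidate for a follow-up tensor dump restricted with scope or list.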

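L1 level

The L1 dump.json records the forward I/O of APIs. Taking PyTorch's relu function as an example (output = torch.nn.functional.relu(input)), the file contains:

Functional.relu.0.forward: forward data of the API. input_args holds positional inputs, input_kwargs holds keyword inputs, and output stores the forward outputs.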

```
{
 "task": "tensor",
 "level": "L1",
 "framework": "pytorch",
 "dump_data_dir": "/dump/path",
 "data": {
  "Functional.relu.0.forward": {
   "input_args": [
    {
     "type": "torch.Tensor",
     "dtype": "torch.float32",
     "shape": [
      32,
      16,
      28,
      28
     ],
     "Max": 1.3864083290100098,
     "Min": -1.3364859819412231,
     "Mean": 0.03711778670549393,
     "Norm": 236.20692443847656,
     "requires_grad": true,
     "data_name": "Functional.relu.0.forward.input.0.pt"
    }
   ],
   "input_kwargs": {},
   "output": [
    {
     "type": "torch.Tensor",
     "dtype": "torch.float32",
     "shape": [
      32,
      16,
      28,
      28
     ],
     "Max": 1.3864083290100098,
     "Min": 0.0,
     "Mean": 0.16849493980407715,
     "Norm": 175.23345947265625,
     "requires_grad": true,
     "data_name": "Functional.relu.0.forward.output.0.pt"
    }
   ]
  }
 }
}
```

mix 级别#

The mix-level dump.json contains both L0 and L1 data; the file format is the same as in the examples above.
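Since all three levels share this format, a quick text-mode comparison of a problem-side and a benchmark-side dump.json can narrow the search before opening the graph view. The following sketch flattens each dump into per-tensor statistics and lists the entries whose relative gap exceeds a tolerance. It is a rough screening aid under stated assumptions rather than a replacement for msprobe's own comparison: `tensor_stats` and `largest_differences` are hypothetical helpers, the tolerance is arbitrary, and entries are matched purely by name and position.

```python
STAT_KEYS = ("Max", "Min", "Mean", "Norm")


def tensor_stats(dump):
    """Map '<entry>/<field>/<index>' paths to the statistics of each tensor record."""
    stats = {}

    def visit(node, path):
        if isinstance(node, dict):
            if "dtype" in node:
                stats["/".join(path)] = {k: node[k] for k in STAT_KEYS if k in node}
            else:
                for key, value in node.items():
                    visit(value, path + [key])
        elif isinstance(node, list):
            for index, item in enumerate(node):
                visit(item, path + [str(index)])

    visit(dump["data"], [])
    return stats


def largest_differences(problem, bench, rel_tol=1e-3):
    """Return (path, stat, problem_value, bench_value) where the relative gap exceeds rel_tol."""
    problem_stats, bench_stats = tensor_stats(problem), tensor_stats(bench)
    report = []
    for path in problem_stats.keys() & bench_stats.keys():
        for key in STAT_KEYS:
            a, b = problem_stats[path].get(key), bench_stats[path].get(key)
            if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
                continue
            gap = abs(a - b) / max(abs(b), 1e-12)
            if not gap <= rel_tol:  # written this way so NaN gaps are reported too
                report.append((path, key, a, b))
    return sorted(report)


def make_dump(output_mean):
    """Build a tiny made-up L1-style dump with a configurable output mean."""
    return {"data": {"Functional.relu.0.forward": {
        "input_args": [{"dtype": "torch.float32", "Max": 1.0, "Min": -1.0, "Mean": 0.0, "Norm": 10.0}],
        "input_kwargs": {},
        "output": [{"dtype": "torch.float32", "Max": 1.0, "Min": 0.0, "Mean": output_mean, "Norm": 8.0}],
    }}}


differences = largest_differences(make_dump(0.5), make_dump(0.25))
print(differences)
```

With real data, the two arguments would be the parsed dump.json files from the same step and rank of problem_dump and bench_dump.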