Service Profiling Guide#

When running an inference service, you sometimes need to monitor the internal execution flow of the service framework to identify performance issues. Collecting the start and end timestamps of key processes, marking key functions or iterations, and recording critical events makes it possible to locate performance bottlenecks quickly.

This guide walks you through collecting performance data from the vLLM-Ascend service framework and operators. It covers the complete workflow, from preparation and collection through analysis and visualization, so that you can get started with the performance collection tools quickly.

Two performance collection solutions are provided below: Ascend PyTorch Profiler and MS Service Profiler. You can choose the appropriate tool for performance analysis and troubleshooting based on your actual requirements.

Solution Comparison#

Feature	Ascend PyTorch Profiler	MS Service Profiler
Installation Method	Built-in, no additional installation required	Requires pip installation of msserviceprofiler
Collection Granularity	PyTorch operator level	Service framework function level
Control Method	API request control	Configuration file control
Applicable Scenarios	Model operator performance analysis	Service framework workflow analysis
Data Format	ascend_pt format	Chrome Tracing + CSV
Main Advantage	Operator-level performance analysis	Service framework workflow visualization

Quick Selection Guide#

Model Operator Performance → Use Ascend PyTorch Profiler
Service Framework Workflow → Use MS Service Profiler

Ascend PyTorch Profiler#

0. Installation and Configuration#

No additional packages need to be installed; profiling is enabled through command-line configuration. vLLM currently enables Python stack collection by default, which can significantly inflate the collected performance data. If you do not want to collect the Python stack, disable it by setting torch_profiler_with_stack to false in the profiler configuration.

1. Preparation for Collection#

Start the online service and set the --profiler-config parameter to control the path for saving performance files. After the parameter is set, the collection function is enabled.

VLLM_PROMPT_SEQ_BUCKET_MAX=128 \
VLLM_PROMPT_SEQ_BUCKET_MIN=128 \
python3 -m vllm.entrypoints.openai.api_server \
--port 8080 \
--model "facebook/opt-125m" \
--tensor-parallel-size 1 \
--max-num-seqs 128 \
--profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile", "torch_profiler_with_stack": false}' \
--dtype bfloat16 \
--max-model-len 256

Note (January 19, 2026): The vLLM mainline has deprecated the VLLM_TORCH_PROFILER_DIR environment variable (see the related PR). When collecting profiler data with the vLLM Ascend mainline code, use the --profiler-config parameter (online) or the profiler_config parameter (offline) instead.

2. Start Collection#

Performance collection is controlled by sending API requests. You can either start collection once the service is handling steady business traffic, profile for a few seconds, and then stop; or start collection first, send the business requests, and then stop.

Send the following request to start the profiling service:

curl -X POST http://localhost:8080/start_profile

Send the following request to stop the profiling service:

curl -X POST http://localhost:8080/stop_profile
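
The same start and stop calls can be scripted so that profiling is always stopped, even if the workload raises an error. The following is a minimal sketch using only the Python standard library. It assumes the server from step 1 listens on localhost:8080; the helper names `profile_url`, `post`, and `profiling` are hypothetical and not part of vLLM.

```python
import contextlib
import urllib.request


def profile_url(base_url: str, action: str) -> str:
    """Build the URL of the start_profile or stop_profile endpoint."""
    if action not in ("start", "stop"):
        raise ValueError(f"unknown action: {action!r}")
    return f"{base_url.rstrip('/')}/{action}_profile"


def post(url: str, timeout: float = 600.0) -> int:
    """Send an empty POST request and return the HTTP status code."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


@contextlib.contextmanager
def profiling(base_url: str = "http://localhost:8080"):
    """Start profiling on entry and always stop it on exit."""
    post(profile_url(base_url, "start"))
    try:
        yield
    finally:
        post(profile_url(base_url, "stop"))
```

Wrapping the business requests in `with profiling(): ...` then mirrors the "start, send requests, stop" order described above.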

3. Send Requests#

Send requests according to your actual business data. After sending the requests, stop the profiling service, and the data will be automatically saved to the previously configured path:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
}'

curl -X POST http://localhost:8080/stop_profile

4. Analyze Data#

Navigate to the ./vllm_profile directory and locate the generated *ascend_pt folder. This folder needs to be analyzed before profiling data can be examined.

from torch_npu.profiler.profiler import analyse
analyse("./vllm_profile/localhost.localdomain_XXXXXXXXXX_ascend_pt/")
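
If a run leaves more than one *ascend_pt folder under the profile directory, the analysis call can be looped over all of them. This is a sketch only: `find_ascend_pt_dirs` and `analyse_all` are hypothetical helper names, and the only torch_npu call used is the `analyse` function shown above.

```python
from pathlib import Path


def find_ascend_pt_dirs(profile_root: str) -> list[Path]:
    """Return every *_ascend_pt directory under profile_root, sorted by path."""
    root = Path(profile_root)
    return sorted(p for p in root.rglob("*_ascend_pt") if p.is_dir())


def analyse_all(profile_root: str) -> list[Path]:
    """Run the offline analysis on each collected *_ascend_pt directory."""
    # Imported lazily so the directory scan also works without torch_npu installed.
    from torch_npu.profiler.profiler import analyse

    dirs = find_ascend_pt_dirs(profile_root)
    for directory in dirs:
        analyse(str(directory))
    return dirs
```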

5. View Results#

After analysis, the *ascend_pt directory will contain many files, with the main analysis focus being the ASCEND_PROFILER_OUTPUT folder. This directory will include the following files:

analysis.db: Performance data in database format
api_statistic.csv: API call statistics
ascend_pytorch_profiler_0.db: Performance data in database format
kernel_details.csv: Kernel-level related data
operator_details.csv: Operator-level related data
op_statistic.csv: Operator utilization data
step_trace_time.csv: Scheduling data
trace_view.json: Chrome tracing format data, can be opened with MindStudio Insight
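
Besides opening trace_view.json in MindStudio Insight, a quick text summary can be produced directly from the file. The sketch below assumes the file follows the standard Chrome tracing JSON layout (either a bare event array or an object with a "traceEvents" key) and that timed spans are recorded as complete ("X") events with a `dur` field in microseconds. The function names are hypothetical.

```python
import json
from collections import defaultdict


def load_trace_events(path: str) -> list[dict]:
    """Load events from a Chrome-tracing JSON file (array or object form)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # The format allows either a bare event array or {"traceEvents": [...]}.
    if isinstance(data, dict):
        return data.get("traceEvents", [])
    return data


def top_events_by_duration(events: list[dict], limit: int = 10) -> list[tuple[str, float]]:
    """Sum the duration of complete ("X") events per name, largest first."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += float(event["dur"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]
```

Trace files can be large, so loading the whole file into memory this way is only suitable for short collection windows.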


MS Service Profiler#

0. Installation#

Install the msserviceprofiler package with pip:

pip install msserviceprofiler==1.2.2

1. Preparation#

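Before starting the service, set the SERVICE_PROF_CONFIG_PATH environment variable to point to the profiling configuration file, and set the PROFILING_SYMBOLS_PATH environment variable to specify the symbols YAML configuration file to import. Then start the vLLM service according to your deployment method.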

cd ${path_to_store_profiling_files}
# Set environment variable
export SERVICE_PROF_CONFIG_PATH=ms_service_profiler_config.json
export PROFILING_SYMBOLS_PATH=service_profiling_symbols.yaml

# Start vLLM service
vllm serve Qwen/Qwen2.5-0.5B-Instruct &

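ms_service_profiler_config.json is the profiling configuration file. If the file does not exist at the specified path, a default configuration is generated automatically. If needed, you can customize it in advance by following the Profiling Configuration File section below.

service_profiling_symbols.yaml is the configuration file that lists the profiling points to import. You can also leave PROFILING_SYMBOLS_PATH unset, in which case the default configuration file is used. If the file does not exist at the path you specify, a configuration file is likewise generated at that path so that you can modify it later. See the Symbols Configuration File section below for customization details.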

2. Enable Profiling#

To turn on performance data collection, change the enable field in ms_service_profiler_config.json from 0 to 1. The following sed command does this:

sed -i 's/"enable":\s*0/"enable": 1/' ./ms_service_profiler_config.json
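
The sed pattern depends on how the JSON file happens to be formatted. As an alternative, the sketch below toggles the same `enable` field by parsing the file as JSON. `set_profiling_enabled` is a hypothetical helper, not part of msserviceprofiler, and it rewrites the file with two-space indentation.

```python
import json
from pathlib import Path


def set_profiling_enabled(config_path: str, enabled: bool) -> dict:
    """Rewrite the "enable" field of the config file and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["enable"] = 1 if enabled else 0
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `set_profiling_enabled("./ms_service_profiler_config.json", True)` has the same effect on the file as the sed command, and passing `False` switches collection off again.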

3. Send Requests#

Send requests in whatever way matches your actual profiling needs, for example:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "Beijing is a",
        "max_tokens": 5,
        "temperature": 0
    }' | python3 -m json.tool

4. Analyze Data#

# xxxx-xxxx is the directory automatically created based on vLLM startup time
cd /root/.ms_server_profiler/xxxx-xxxx

# Analyze data
msserviceprofiler analyze --input-path=./ --output-path output

5. View Results#

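After analysis, the output directory contains the following files:

chrome_tracing.json: Chrome tracing format data, can be opened with MindStudio Insight
profiler.db: Performance data in database format
request.csv: Request-related data
request_summary.csv: Overall request metrics
kvcache.csv: KV Cache related data
batch.csv: Batch scheduling related data
batch_summary.csv: Overall batch scheduling metrics
service_summary.csv: Overall service-level metrics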
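
Before loading these CSV files into a spreadsheet or notebook, it can help to check which columns each file actually contains. The sketch below makes no assumption about column names: it only reports the header row and the number of data rows per file. `describe_csvs` is a hypothetical helper.

```python
import csv
from pathlib import Path


def describe_csvs(output_dir: str) -> dict[str, tuple[list[str], int]]:
    """Map each CSV file name to its header columns and data-row count."""
    summary = {}
    for csv_path in sorted(Path(output_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader, [])
            row_count = sum(1 for _ in reader)
        summary[csv_path.name] = (header, row_count)
    return summary
```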

6. MS Service Profiler Appendix#

6.1 Profiling Configuration File#

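The profiling configuration file controls the parameters and behavior of profiling.

Configuration File Format#

The configuration file is in JSON format. The main parameters are: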

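Parameter	Description	Required
enable	Profiling switch. 0: disabled; 1: enabled. Default: 0.	Yes
prof_dir	Directory where performance data is stored. Default: $HOME/.ms_service_profiler	No
profiler_level	Data collection level. Default: "INFO" (normal level).	No
host_system_usage_freq	Sampling frequency for host CPU and memory metrics. Disabled by default. Range: integer 1–50, unit: Hz (samples per second). Set to -1 to disable. Note: enabling this feature may consume a significant amount of memory.	No
npu_memory_usage_freq	Sampling frequency for NPU memory usage. Disabled by default. Range: integer 1–50, unit: Hz (samples per second). Set to -1 to disable. Note: enabling this feature may consume a significant amount of memory.	No
acl_task_time	Switch for collecting operator dispatch latency and execution latency. 0: disabled (default; 0 or an invalid value means disabled). 1: enabled; calls `aclprofCreateConfig` with the `ACL_PROF_TASK_TIME_L0` parameter. 2: enabled with MSPTI-based data dumping; profiling uses MSPTI and requires `export LD_PRELOAD=$ASCEND_TOOLKIT_HOME/lib64/libmspti.so`.	No
acl_prof_task_time_level	Profiling level and duration. L0: collects only operator dispatch and execution latency; lower overhead (basic operator information is not collected). L1: collects AscendCL interface performance (host-device and inter-device synchronous/asynchronous memory copy latency) as well as operator dispatch, execution, and basic information, for comprehensive analysis. time: profiling duration, integer 1–999, unit: seconds. If unset, the default is L0 until the program exits; invalid values fall back to the default. Level and duration can be combined, for example `"acl_prof_task_time_level": "L1,10"`.	No
api_filter	Filter that selects which API performance data to dump. For example, specifying "matmul" dumps all API data whose `name` contains "matmul". String, case-sensitive; use ";" to separate multiple targets. Empty means dump all. Effective only when `acl_task_time` is 2.	No
kernel_filter	Filter that selects which kernel performance data to dump. For example, specifying "matmul" dumps all kernel data whose `name` contains "matmul". String, case-sensitive; use ";" to separate multiple targets. Empty means dump all. Effective only when `acl_task_time` is 2.	No
timelimit	Service profiling duration. The process stops automatically after this duration. Range: integer 0–7200, unit: seconds. The default value 0 means no limit.	No
domain	Restricts profiling to the specified domains to reduce data volume. String, semicolon-separated, case-sensitive, for example "Request; KVCache". Empty means all available domains. Available domains: Request, KVCache, ModelExecute, BatchSchedule, Communication. Note: if the selected domains are incomplete, the analysis output may show warnings because of missing data. See Reference Table 1.	No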

Configuration Example#

{
  "enable": 1,
  "prof_dir": "vllm_prof",
  "profiler_level": "INFO",
  "acl_task_time": 0,
  "acl_prof_task_time_level": "",
  "timelimit": 0
}
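
A typo in this file is easy to miss, so a small pre-flight check against the ranges in the parameter table can be useful. The sketch below is illustrative only: `validate_profiler_config` is a hypothetical helper, it covers just the numeric ranges and the level string described above, and it does not reproduce the profiler's own handling of invalid values.

```python
def validate_profiler_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    if config.get("enable") not in (0, 1):
        problems.append('"enable" is required and must be 0 or 1')

    # Sampling frequencies: -1 disables, otherwise an integer in 1..50 Hz.
    for key in ("host_system_usage_freq", "npu_memory_usage_freq"):
        if key in config:
            value = config[key]
            if not isinstance(value, int) or not (value == -1 or 1 <= value <= 50):
                problems.append(f'"{key}" must be -1 or an integer in 1-50')

    if "timelimit" in config:
        value = config["timelimit"]
        if not isinstance(value, int) or not 0 <= value <= 7200:
            problems.append('"timelimit" must be an integer in 0-7200')

    # "acl_prof_task_time_level" combines a level and a duration, e.g. "L1,10".
    level_spec = str(config.get("acl_prof_task_time_level", ""))
    for part in filter(None, level_spec.split(",")):
        part = part.strip()
        is_level = part in ("L0", "L1")
        is_duration = part.isdigit() and 1 <= int(part) <= 999
        if not (is_level or is_duration):
            problems.append(f'unexpected "acl_prof_task_time_level" part: {part!r}')

    return problems
```

Running it on the configuration example above returns an empty list.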

6.2 Symbols Configuration File#

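The symbols configuration file defines the functions and methods to be profiled, and supports flexible configuration through custom attribute collection.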

File Name and Loading#

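Default load path: ~/.config/vllm_ascend/service_profiling_symbols.MAJOR.MINOR.PATCH.yaml (the version in the file name follows the installed version of vLLM).

To customize profiling points, it is strongly recommended to copy a symbols configuration file into your working directory, point the PROFILING_SYMBOLS_PATH environment variable at that copy, and modify it there.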

Field Descriptions#

Field	Description	Example
symbol	Python import path + attribute chain	`"vllm.v1.core.kv_cache_manager:KVCacheManager.free"`
handler	Handler type	`"timer"` (default) or `"pkg.mod:func"` (custom)
domain	Profiling domain identifier	`"KVCache"`, `"ModelExecute"`
name	Event name	`"EngineCoreExecute"`
min_version	Minimum version constraint	`"0.9.1"`
max_version	Maximum version constraint	`"0.11.0"`
attributes	Custom attribute collection	Only supported for the `"timer"` handler; see the Custom Attribute Collection section below
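
The symbol format puts a Python import path before the colon and an attribute chain after it. The sketch below shows one way such a string can be resolved to an object with the standard library, which is handy for checking that a symbol is spelled correctly before adding it to the YAML file. `resolve_symbol` is a hypothetical helper and is not claimed to be how the profiler itself loads symbols.

```python
import importlib


def resolve_symbol(symbol: str):
    """Resolve "package.module:Attr.chain" to the Python object it names."""
    module_path, _, attr_chain = symbol.partition(":")
    obj = importlib.import_module(module_path)
    # Walk the attribute chain after the colon, e.g. "KVCacheManager.free".
    for attr_name in filter(None, attr_chain.split(".")):
        obj = getattr(obj, attr_name)
    return obj
```

A misspelled module raises ImportError and a misspelled attribute raises AttributeError, so either failure points at the part of the symbol that is wrong.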

Examples#

Example 1: Custom handler

- symbol: vllm.v1.core.kv_cache_manager:KVCacheManager.free
  handler: vllm_profiler.config.custom_handler_example:kvcache_manager_free_example_handler
  domain: Example
  name: example_custom

Example 2: Default timer

- symbol: vllm.v1.engine.core:EngineCore.execute_model
  domain: ModelExecute
  name: EngineCoreExecute

Example 3: Version constraint

- symbol: vllm.v1.executor.abstract:Executor.execute_model
  min_version: "0.9.1"
  # No handler specified -> default timer
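
To make the effect of min_version and max_version concrete, the sketch below compares dotted version strings as integer tuples. It is an illustration under stated assumptions: both bounds are treated as inclusive, pre-release suffixes such as "rc1" are ignored, and `parse_version` and `version_allowed` are hypothetical names. The profiler's actual comparison rules may differ.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn "0.9.1" into (0, 9, 1); only the leading numeric parts are kept."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def version_allowed(installed: str, min_version=None, max_version=None) -> bool:
    """Check the installed version against optional bounds (assumed inclusive)."""
    current = parse_version(installed)
    if min_version is not None and current < parse_version(min_version):
        return False
    if max_version is not None and current > parse_version(max_version):
        return False
    return True
```

Comparing tuples rather than raw strings matters here: as strings, "0.11.0" sorts before "0.9.1", which would invert the constraint.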

Custom Attribute Collection#

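The attributes field supports flexible custom attribute collection, and allows function arguments and return values to be operated on and transformed.

Basic Syntax#

Argument access: use the argument name directly, for example input_ids
Return value access: use the return keyword
Pipe operations: use | to chain multiple operations
Attribute access: use attr to access an object attribute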

Example#

- symbol: vllm_ascend.worker.model_runner_v1:NPUModelRunner.execute_model
  name: ModelRunnerExecuteModel
  domain: ModelExecute
  attributes:
  - name: device
    expr: args[0] | attr device | str
  - name: dp
    expr: args[0] | attr dp_rank | str
  - name: batch_size
    expr: args[0] | attr input_batch | attr _req_ids | len

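Expression Notes#

len(input_ids): returns the length of the input_ids argument.
len(return) | str: takes the length of the return value and converts it to a string (equivalent to str(len(return))).
return[0] | attr input_ids | len: returns the length of the input_ids attribute of the first element of the return value.

Supported Expression Types#

Basic operations: len(), str(), int(), float()
Index access: return[0], return['key']
Attribute access: return | attr attr_name
Pipe composition: use | to chain multiple operations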
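
To show how a pipe expression reads from left to right, here is a toy evaluator for a subset of the syntax: a source name with an optional index, followed by `attr NAME`, `len`, `str`, `int`, or `float` stages. It is a teaching sketch, not the profiler's parser; the `evaluate` function, the `context` dictionary layout, and the stand-in `runner` object are all assumptions, and the call form `len(input_ids)` is not handled.

```python
import re
from types import SimpleNamespace

# Conversions usable as pipeline stages, mirroring the basic operations above.
CONVERTERS = {"len": len, "str": str, "int": int, "float": float}
# First stage: a name with an optional integer or quoted-string index.
SOURCE_RE = re.compile(r"""^(\w+)(?:\[(?:(-?\d+)|['"](.+)['"])\])?$""")


def evaluate(expr: str, context: dict):
    """Evaluate a pipe expression such as "args[0] | attr device | str"."""
    stages = [stage.strip() for stage in expr.split("|")]
    match = SOURCE_RE.match(stages[0])
    if match is None:
        raise ValueError(f"unsupported source: {stages[0]!r}")
    name, int_index, key = match.groups()
    value = context[name]
    if int_index is not None:
        value = value[int(int_index)]
    elif key is not None:
        value = value[key]
    # Each later stage transforms the value produced by the previous one.
    for stage in stages[1:]:
        if stage.startswith("attr "):
            value = getattr(value, stage[len("attr "):].strip())
        elif stage in CONVERTERS:
            value = CONVERTERS[stage](value)
        else:
            raise ValueError(f"unsupported stage: {stage!r}")
    return value


# Stand-in objects for a profiled call; the real runner object is far richer.
runner = SimpleNamespace(
    device="npu:0",
    input_batch=SimpleNamespace(_req_ids=["a", "b", "c"]),
)
context = {"args": (runner,), "kwargs": {"batch_size": 3}, "return": ["x", "y"]}

print(evaluate("args[0] | attr device | str", context))                       # npu:0
print(evaluate("args[0] | attr input_batch | attr _req_ids | len", context))  # 3
```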

Advanced Examples#

attributes:
  # Get tensor shape
  - name: tensor_shape
    expr: input_tensor | attr shape | str
  
  # Get specific value from a dict
  - name: batch_size
    expr: kwargs['batch_size']
  
  # Conditional expression (requires custom handler support)
  - name: is_training_mode
    expr: training | bool
  
  # Complex data processing
  - name: processed_data_len
    expr: data | attr items | len | str

Custom Handler#

When the handler field specifies a custom handler, the function must have the following signature:

def custom_handler(original_func, this, *args, **kwargs):
    """
    Custom handler
    
    Args:
        original_func: the original function object
        this: the bound object (for methods)
        *args: positional arguments
        **kwargs: keyword arguments
    
    Returns:
        processing result
    """
    # Custom logic
    pass

If the custom handler fails to import, the system automatically falls back to the default timer mode.
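
As an illustration of the signature above, the sketch below is a handler that times the wrapped call and passes its result through. Two points are assumptions rather than documented behavior: that `original_func` is the unbound function, so the instance `this` must be passed as its first argument, and that the handler is responsible for calling the original and returning its result. `timing_handler` is a hypothetical name.

```python
import time


def timing_handler(original_func, this, *args, **kwargs):
    """Call the wrapped method, report its wall-clock time, return its result."""
    start = time.perf_counter()
    try:
        # Assumption: original_func is unbound, so the instance is passed explicitly.
        return original_func(this, *args, **kwargs)
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{original_func.__qualname__} took {elapsed_ms:.3f} ms")
```

Such a handler would be referenced from the symbols file in the `"pkg.mod:func"` form, with the module path adjusted to wherever the function actually lives.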
