Service Profiling Guide#
When running an inference service, you sometimes need to monitor the internal execution flow of the service framework to identify performance issues. Collecting start and end timestamps of key processes, marking key functions or iterations, and recording critical events makes it possible to locate performance bottlenecks quickly.
This guide will walk you through the process of collecting performance data from the vLLM-Ascend service framework and operators. It covers the complete workflow from preparation, collection, analysis, to visualization, helping you quickly get started with performance collection tools.
Two performance collection solutions are provided below: Ascend PyTorch Profiler and MS Service Profiler. You can choose the appropriate tool for performance analysis and troubleshooting based on your actual requirements.
Solution Comparison#
| Feature | Ascend PyTorch Profiler | MS Service Profiler |
|---|---|---|
| Installation Method | Built-in, no additional installation required | Requires pip installation of msserviceprofiler |
| Collection Granularity | PyTorch operator level | Service framework function level |
| Control Method | API request control | Configuration file control |
| Applicable Scenarios | Model operator performance analysis | Service framework workflow analysis |
| Data Format | ascend_pt format | Chrome Tracing + CSV |
| Main Advantage | Operator-level performance analysis | Service framework workflow visualization |
Quick Selection Guide#
Ascend PyTorch Profiler#
0. Installation and Configuration#
No additional packages need to be installed; the profiler is enabled through command-line configuration. vLLM currently collects the Python call stack by default, which can significantly inflate the collected performance data. If you do not need the Python stack, disable it by setting torch_profiler_with_stack to false in the profiler configuration.
1. Preparation for Collection#
Start the online service and set the --profiler-config parameter to control the path for saving performance files. After the parameter is set, the collection function is enabled.
# Export the variables so that the server process inherits them
export VLLM_PROMPT_SEQ_BUCKET_MAX=128
export VLLM_PROMPT_SEQ_BUCKET_MIN=128
python3 -m vllm.entrypoints.openai.api_server \
--port 8080 \
--model "facebook/opt-125m" \
--tensor-parallel-size 1 \
--max-num-seqs 128 \
--profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile", "torch_profiler_with_stack": false}' \
--dtype bfloat16 \
--max-model-len 256
Note: As of January 19, 2026, the vLLM mainline has deprecated the VLLM_TORCH_PROFILER_DIR environment variable (see the related PR). When using the vLLM Ascend mainline code to collect profiler data, use the --profiler-config parameter (online) or the profiler_config parameter (offline) instead.
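The JSON value passed to --profiler-config is easy to mis-quote on the command line. The following is a minimal sketch that builds the value programmatically and shell-quotes it. It assumes only the three keys shown in the launch command above; the helper name build_profiler_config is hypothetical and not part of vLLM.

```python
import json
import shlex


def build_profiler_config(profile_dir: str, with_stack: bool = False) -> str:
    """Return the JSON string for --profiler-config.

    Only the keys that appear in the launch command above are used;
    this helper is illustrative and not part of vLLM.
    """
    config = {
        "profiler": "torch",
        "torch_profiler_dir": profile_dir,
        "torch_profiler_with_stack": with_stack,
    }
    return json.dumps(config)


config_json = build_profiler_config("./vllm_profile")
# shlex.quote wraps the JSON in single quotes so it can be pasted into a shell command.
print("--profiler-config " + shlex.quote(config_json))
```

Generating the string this way keeps the boolean as a JSON false rather than a Python False, which is a common slip when the value is typed by hand.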
2. Start Collection#
Performance collection is controlled by sending API requests. There are two common patterns: start collection once the service is handling steady, realistic traffic, profile for a few seconds, and then stop; or start collection first, send the requests of interest, and then stop.
Send the following request to start the profiling service:
curl -X POST http://localhost:8080/start_profile
Send the following request to stop the profiling service:
curl -X POST http://localhost:8080/stop_profile
3. Send Requests#
Send requests according to your actual business data. After sending the requests, stop the profiling service, and the data will be automatically saved to the previously configured path:
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
curl -X POST http://localhost:8080/stop_profile
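The start, request, and stop sequence can also be scripted so that profiling is always stopped, even if a request fails. This is a sketch under stated assumptions: it reuses the endpoints, port, and payload from the curl commands above, while the names post, profiling_window, and run_profiled_completion are hypothetical helpers rather than part of vLLM.

```python
import contextlib
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8080"  # port taken from the launch command above


def post(url: str, payload: Optional[dict] = None) -> bytes:
    """Send a POST request (JSON body if payload is given) and return the raw response."""
    data = json.dumps(payload).encode() if payload is not None else b""
    headers = {"Content-Type": "application/json"} if payload is not None else {}
    request = urllib.request.Request(url, data=data, headers=headers, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read()


@contextlib.contextmanager
def profiling_window(base_url: str = BASE_URL, send=post):
    """Start profiling on entry and always stop it on exit."""
    send(f"{base_url}/start_profile")
    try:
        yield
    finally:
        send(f"{base_url}/stop_profile")


def run_profiled_completion(base_url: str = BASE_URL, send=post) -> bytes:
    """Profile a single completion request, mirroring the curl commands above."""
    payload = {
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0,
    }
    with profiling_window(base_url, send):
        return send(f"{base_url}/v1/completions", payload)
```

Calling run_profiled_completion() against a running server performs the three steps in order. The send parameter exists only so the request function can be swapped out, for example in a dry run.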
4. Analyze Data#
Navigate to the ./vllm_profile directory and locate the generated *ascend_pt folder. This folder needs to be analyzed before profiling data can be examined.
from torch_npu.profiler.profiler import analyse
analyse("./vllm_profile/localhost.localdomain_XXXXXXXXXX_ascend_pt/")
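When several collections have been saved to the same directory, each *ascend_pt folder has to be parsed separately. The sketch below finds them and applies an analysis function to each one. The helper names are hypothetical, and the analysis function is passed in as a parameter so that the discovery logic does not itself depend on torch_npu.

```python
from pathlib import Path
from typing import Callable, List


def find_ascend_pt_dirs(profile_root: str) -> List[Path]:
    """Return every directory directly under profile_root whose name ends in 'ascend_pt'."""
    root = Path(profile_root)
    return sorted(p for p in root.glob("*ascend_pt") if p.is_dir())


def analyse_all(profile_root: str, analyse_fn: Callable[[str], object]) -> List[Path]:
    """Call analyse_fn on each *ascend_pt directory and return the directories processed."""
    dirs = find_ascend_pt_dirs(profile_root)
    for directory in dirs:
        analyse_fn(str(directory))
    return dirs


# Usage on a machine with torch_npu installed:
#   from torch_npu.profiler.profiler import analyse
#   analyse_all("./vllm_profile", analyse)
```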
5. View Results#
After analysis, the *ascend_pt directory contains many files. The ASCEND_PROFILER_OUTPUT folder is the main one to examine, and it includes the following files:
- analysis.db: Performance data in database format
- api_statistic.csv: API call statistics
- ascend_pytorch_profiler_0.db: Performance data in database format
- kernel_details.csv: Kernel-level related data
- operator_details.csv: Operator-level related data
- op_statistic.csv: Operator utilization data
- step_trace_time.csv: Scheduling data
- trace_view.json: Chrome tracing format data, can be opened with MindStudio Insight
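For a quick look at the CSV outputs without opening a GUI, a short script can total a numeric column per name and list the largest entries. This is only a sketch: the default column names "Name" and "Duration(us)" are assumptions about kernel_details.csv and should be checked against the header row of your own file, and the helper names are hypothetical.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def total_by_key(rows: Iterable[Dict[str, str]], key_col: str, value_col: str) -> List[Tuple[str, float]]:
    """Sum value_col per distinct key_col and return pairs sorted by descending total."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        try:
            key = row[key_col]
            value = float(row[value_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        totals[key] += value
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def top_entries(csv_path: str, key_col: str = "Name", value_col: str = "Duration(us)",
                limit: int = 10) -> List[Tuple[str, float]]:
    """Read a CSV file and return the `limit` largest totals.

    The default column names are assumptions; pass the names used in your file.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return total_by_key(csv.DictReader(handle), key_col, value_col)[:limit]
```

For example, top_entries("ASCEND_PROFILER_OUTPUT/kernel_details.csv") would list the ten names with the largest summed duration, provided the assumed columns exist.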
MS Service Profiler#
0. Installation#
Install the msserviceprofiler package with pip:
pip install msserviceprofiler==1.2.2
1. Preparation#
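Before starting the service, set the environment variable SERVICE_PROF_CONFIG_PATH to point to the profiling configuration file, and set the environment variable PROFILING_SYMBOLS_PATH to specify the YAML symbol configuration file to import. Then start the vLLM service according to your deployment method.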
cd ${path_to_store_profiling_files}
# Set environment variable
export SERVICE_PROF_CONFIG_PATH=ms_service_profiler_config.json
export PROFILING_SYMBOLS_PATH=service_profiling_symbols.yaml
# Start vLLM service
vllm serve Qwen/Qwen2.5-0.5B-Instruct &
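ms_service_profiler_config.json is the profiling configuration file. If the file does not exist at the specified path, a default configuration is generated automatically. If needed, you can customize it in advance by following the Profiling Configuration File section below.
service_profiling_symbols.yaml is the configuration file for the profiling points to import. You may also leave PROFILING_SYMBOLS_PATH unset, in which case the default configuration file is used. If the file does not exist at the path you specify, a configuration file is likewise generated at that path for later modification. See the Symbol Configuration File section below for customization.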
2. Enable Profiling#
To turn on performance data collection, change the enable field in ms_service_profiler_config.json from 0 to 1. The following sed command does this:
sed -i 's/"enable":\s*0/"enable": 1/' ./ms_service_profiler_config.json
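The same switch can be flipped by parsing the file as JSON, which does not depend on how the file is formatted. This is a minimal sketch, assuming the configuration file is a JSON object with a top-level enable field as described above; the function name set_profiling_enabled is hypothetical.

```python
import json
from pathlib import Path


def set_profiling_enabled(config_path: str, enabled: bool = True) -> dict:
    """Set the top-level `enable` field to 1 or 0 and write the file back.

    Assumes the configuration is a JSON object; all other fields are preserved.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["enable"] = 1 if enabled else 0
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

Calling set_profiling_enabled("./ms_service_profiler_config.json") corresponds to the sed command, and passing enabled=False sets the field back to 0.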
3. Send Requests#
Send requests in whatever way suits your profiling needs:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "Beijing is a",
"max_tokens": 5,
"temperature": 0
}' | python3 -m json.tool
4. Analyze Data#
# xxxx-xxxx is the directory automatically created based on vLLM startup time
cd /root/.ms_server_profiler/xxxx-xxxx
# Analyze data
msserviceprofiler analyze --input-path=./ --output-path output
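Because the run directory name is generated at startup, it can be convenient to locate the newest one automatically before running the analysis. The sketch below picks the most recently modified subdirectory. It assumes that modification time is a reasonable proxy for the latest run, and latest_run_dir is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(profiler_root: str = "/root/.ms_server_profiler") -> Optional[Path]:
    """Return the most recently modified subdirectory, or None if there is none."""
    root = Path(profiler_root)
    if not root.is_dir():
        return None
    candidates = [p for p in root.iterdir() if p.is_dir()]
    # Assumption: the newest modification time belongs to the latest service run.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

The returned path can then be passed as the --input-path value of the analyze command shown above.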
5. View Results#
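After analysis completes, the output directory contains the following files:
- chrome_tracing.json: Chrome tracing format data, can be opened in MindStudio Insight
- profiler.db: Performance data in database format
- request.csv: Request-related data
- request_summary.csv: Overall request metrics
- kvcache.csv: KV cache related data
- batch.csv: Batch scheduling related data
- batch_summary.csv: Overall batch scheduling metrics
- service_summary.csv: Overall service-level metrics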
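The Chrome tracing file can also be summarized from a script, for example to see which spans account for the most time. The sketch below follows the general Chrome Trace Event Format, where complete events have ph set to "X" and a dur field in microseconds. Whether msserviceprofiler emits its spans as complete events is an assumption, and the helper names are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple


def load_trace_events(path: str) -> List[dict]:
    """Load events from a Chrome trace file.

    The format allows either a bare JSON list of events or an object
    with a "traceEvents" list; both are handled here.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return data.get("traceEvents", []) if isinstance(data, dict) else data


def duration_by_name(events: List[dict]) -> List[Tuple[str, float]]:
    """Sum `dur` (microseconds) of complete events (ph == "X") per event name."""
    totals: Dict[str, float] = defaultdict(float)
    for event in events:
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += float(event["dur"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

For example, duration_by_name(load_trace_events("output/chrome_tracing.json"))[:10] would give the ten event names with the largest total duration, if the file uses complete events.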