Profiling
In vime, the profiling interface provided by vLLM can be used to analyze the **rollout (vLLM inference)** path in detail. Profiling targets the vLLM engine side, not the Megatron training side.
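Typical flow:
- Launch train (sleep_rollout + vllm-profiler-config)
- Wait until the vLLM engine and the router are ready
- Confirm the router/worker addresses from the log
- start_profile
- Send a small number of inference requests
- (Optional) stop_profile; otherwise the trace is flushed automatically once max_iterations is reached
- Inspect the trace files in torch_profiler_dir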
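1. Put the rollout into a waiting state (sleep_rollout)
To make load testing and profiling more flexible, the rollout is usually made to wait once initialization completes instead of starting generation immediately.
Replace rollout_function_path in the train.py launch arguments; no code change is needed:
python train.py \
  --rollout-function-path vime.rollout.sleep_rollout.sleep \
  ... (other arguments)
This function puts the rollout process into an infinite wait loop, which makes it easy to send HTTP requests by hand or to run a load-testing tool.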
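2. Enable the vLLM profiler (configured when launching train)
vLLM registers the /start_profile and /stop_profile routes only if --profiler-config was set at startup. In vime, **--vllm-profiler-config** forwards this setting to the vllm serve subprocess.
2.1 Passing the whole config as JSON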
--vllm-profiler-config '{"profiler":"torch","torch_profiler_dir":"/root/logs/vllm_profile","max_iterations":3,"ignore_frontend":true}'
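Common JSON fields:
| Field | Description |
|---|---|
| profiler | Profiler type; the example above uses "torch" |
| torch_profiler_dir | Trace output directory (absolute path) |
| max_iterations | Once a worker has recorded more than N steps, it stops automatically and flushes the trace (condition: …) |
| ignore_frontend | Recommended: true, as in the example above |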
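Hand-writing the JSON inside shell quotes is easy to get wrong, so the value can also be generated. The following is a minimal sketch, assuming only the four fields shown in the example above; the helper names `build_profiler_config` and `as_cli_argument` are hypothetical and not part of vime.

```python
import json
import shlex


def build_profiler_config(trace_dir: str, max_iterations: int = 3,
                          ignore_frontend: bool = True) -> str:
    """Serialize the profiler settings from this section into compact JSON."""
    # The field table asks for an absolute trace directory.
    if not trace_dir.startswith("/"):
        raise ValueError("torch_profiler_dir should be an absolute path")
    config = {
        "profiler": "torch",
        "torch_profiler_dir": trace_dir,
        "max_iterations": max_iterations,
        "ignore_frontend": ignore_frontend,
    }
    # Compact separators reproduce the single-line form used on the command line.
    return json.dumps(config, separators=(",", ":"))


def as_cli_argument(config_json: str) -> str:
    """Quote the JSON so a POSIX shell passes it through as one argument."""
    return "--vllm-profiler-config " + shlex.quote(config_json)


if __name__ == "__main__":
    print(as_cli_argument(build_profiler_config("/root/logs/vllm_profile")))
```

Generating the string with `json.dumps` guarantees valid JSON, and `shlex.quote` takes care of the surrounding single quotes.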
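Preventing RPC timeouts on stop_profile: the vLLM APIServer talks to EngineCore/workers over an internal RPC channel. A manual stop_profile call triggers a trace flush that can take several minutes, while the default VLLM_RPC_TIMEOUT is only 10 seconds (10000 ms), so the flush is easily interrupted or the trace left incomplete. For profiling, setting it to 30 minutes (1800000 ms) is recommended.
This variable must reach the Ray worker environment before train is launched and vLLM is started (an export in the local shell alone does not necessarily propagate into the Ray job). Set it in the runtime-env-json of ray job submit, for example: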
export VLLM_RPC_TIMEOUT="${VLLM_RPC_TIMEOUT:-1800000}"
RUNTIME_ENV_JSON="{
\"env_vars\": {
\"PYTHONPATH\": \"/root/Megatron-LM\",
\"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
\"VLLM_RPC_TIMEOUT\": \"${VLLM_RPC_TIMEOUT}\"
}
}"
ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json="${RUNTIME_ENV_JSON}" \
  -- python3 train.py \
  ... \
  --vllm-profiler-config '{"profiler":"torch","torch_profiler_dir":"/root/logs/vllm_profile",...}'
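The escaped quotes in RUNTIME_ENV_JSON are the fragile part of the shell snippet above. As a rough alternative, the same payload can be produced with a JSON serializer. This sketch assumes the three environment variables shown above; `minutes_to_ms` and `build_runtime_env` are hypothetical helper names.

```python
import json


def minutes_to_ms(minutes: float) -> int:
    """Convert minutes to the millisecond unit used by VLLM_RPC_TIMEOUT."""
    return int(minutes * 60 * 1000)


def build_runtime_env(rpc_timeout_ms: int = 1_800_000) -> str:
    """Return the --runtime-env-json payload as a JSON string."""
    env_vars = {
        "PYTHONPATH": "/root/Megatron-LM",
        "CUDA_DEVICE_MAX_CONNECTIONS": "1",
        # Environment variable values are kept as strings, as in the shell example.
        "VLLM_RPC_TIMEOUT": str(rpc_timeout_ms),
    }
    return json.dumps({"env_vars": env_vars})


if __name__ == "__main__":
    # 30 minutes, the value recommended above for profiling runs.
    print(build_runtime_env(minutes_to_ms(30)))
```

The printed string can be passed to `--runtime-env-json` in place of the hand-escaped variable.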
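2.2 Verifying that it took effect
After launching train, confirm the following three points in the log (if any one of them is missing, the profiler is not correctly enabled):
- Arguments parsed: vllm_profiler_config ... profiler='torch' appears (together with the torch_profiler_dir path).
- Forwarded to the vLLM subprocess: Launching vLLM server: ... --profiler-config {"profiler":"torch",...} appears.
- HTTP routes registered: the route list printed at vLLM startup contains /start_profile and /stop_profile (otherwise POST /start_profile returns 404).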
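The three checks above can be automated with a small log scan. This is a heuristic sketch: the patterns are derived only from the log fragments quoted in this section, the exact log format may differ between versions, and `missing_profiler_markers` is a hypothetical helper.

```python
import re

# One pattern per marker described above; "." does not cross line breaks,
# so each pattern has to match within a single log line.
CHECKS = {
    "arguments parsed": re.compile(r"vllm_profiler_config.*profiler='torch'"),
    "forwarded to vLLM": re.compile(r"Launching vLLM server:.*--profiler-config"),
    "start route": re.compile(r"/start_profile"),
    "stop route": re.compile(r"/stop_profile"),
}


def missing_profiler_markers(log_text: str) -> list:
    """Return the names of the markers that do not appear in the log text."""
    return [name for name, pattern in CHECKS.items()
            if not pattern.search(log_text)]
```

An empty result suggests the profiler is enabled; any returned name points at the step that needs attention.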
3. Getting the router and worker addresses
vLLM engines (workers) register with vllm-router. Example startup log:
Router launched at 127.0.0.1:3521, Prometheus port: 4153
Ports for engine 0: {'host': '127.0.0.1', 'port': 15000, ...}
Starting vLLM server on http://127.0.0.1:15000
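Note: the router port may change on every job (by default it is picked at random from 3000–4000), so do not reuse the port from a previous run. Verify it with curl: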
curl http://127.0.0.1:3521/workers
The response lists the url and is_healthy of each worker.
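Both lookups can be scripted. The sketch below assumes the log line format shown above and a /workers response of the form {"workers": [{"url": ..., "is_healthy": ...}]}; that shape is inferred from this document rather than from a router specification, and the function names are hypothetical.

```python
import json
import re
import urllib.request
from typing import Optional

ROUTER_LINE = re.compile(r"Router launched at ([^:\s]+):(\d+)")


def parse_router_url(log_text: str) -> Optional[str]:
    """Return the URL from the last 'Router launched at' line, if any."""
    matches = ROUTER_LINE.findall(log_text)
    if not matches:
        return None
    host, port = matches[-1]  # the most recent launch wins
    return f"http://{host}:{port}"


def healthy_worker_urls(payload: dict) -> list:
    """Pick the URLs of workers reported as healthy (assumed payload shape)."""
    return [w["url"] for w in payload.get("workers", []) if w.get("is_healthy")]


def fetch_workers(router_url: str, timeout: float = 10.0) -> dict:
    """GET <router>/workers and decode the JSON body."""
    with urllib.request.urlopen(f"{router_url}/workers", timeout=timeout) as resp:
        return json.load(resp)
```

Taking the last matching line matters when a log file contains output from several jobs.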
4. Using tools/profile_rollout.py
The script reads the router's /workers list and calls /start_profile or /stop_profile on every worker.
Starting profiling
cd /root/vime
python tools/profile_rollout.py \
--router-url http://127.0.0.1:3521 \
--action start
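Stopping profiling (optional)
If max_iterations is set in --vllm-profiler-config, each worker stops automatically and flushes its trace after recording enough steps. In practice the trace usually shows up in torch_profiler_dir right after the inference requests have been sent, so a manual stop_profile is not needed. Run the following only when the capture has to end early: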
python tools/profile_rollout.py \
--router-url http://127.0.0.1:3521 \
--action stop
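To make the fan-out behaviour concrete, here is a rough sketch of the same idea: list the workers, then POST to each worker's profile route. It is not the source of tools/profile_rollout.py; the /workers payload shape, the empty POST body, and all function names are assumptions.

```python
import json
import urllib.request


def profile_endpoints(worker_urls, action: str) -> list:
    """Map worker base URLs to their /start_profile or /stop_profile routes."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    return [f"{url.rstrip('/')}/{action}_profile" for url in worker_urls]


def post_empty(url: str, timeout: float) -> int:
    """Send a POST with an empty body and return the HTTP status code."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def run_action(router_url: str, action: str, timeout: float = 1800.0) -> dict:
    """Call the profile route on every worker registered with the router."""
    with urllib.request.urlopen(f"{router_url}/workers", timeout=10) as resp:
        workers = json.load(resp).get("workers", [])
    urls = [w["url"] for w in workers]
    return {ep: post_empty(ep, timeout) for ep in profile_endpoints(urls, action)}
```

The long default timeout reflects the note in section 2 that a stop can take minutes while the trace is flushed.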
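5. Sending inference requests
While sleep_rollout is waiting, proceed as follows:
- Run profile_rollout.py --action start
- Send a few completion requests to the router or directly to a worker (2 to 4 are enough; the trace gets large quickly)
- (Optional) Run profile_rollout.py --action stop, or wait for max_iterations to trigger the automatic flush
- Inspect the trace in torch_profiler_dir
Request example (model is the HF checkpoint path):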
curl -X POST http://127.0.0.1:15000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"/root/models/Qwen3-4B","prompt":"Hello","max_tokens":32}'
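The same request can be issued from Python when a few prompts need to be sent in a loop. This is a minimal sketch using only the standard library; the endpoint and body fields mirror the curl example above, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 32) -> urllib.request.Request:
    """Build a POST /v1/completions request without sending it."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completions(base_url: str, model: str, prompts, timeout: float = 120.0) -> list:
    """Send one request per prompt and return the decoded JSON responses."""
    results = []
    for prompt in prompts:
        req = build_completion_request(base_url, model, prompt)
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            results.append(json.load(resp))
    return results
```

For example, `send_completions("http://127.0.0.1:15000", "/root/models/Qwen3-4B", ["Hello 1", "Hello 2", "Hello 3"])` keeps the request count in the small range suggested above.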
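6. Viewing the trace
Perfetto
In https://ui.perfetto.dev/, click Open trace file and select the *.trace.json.gz file to inspect GPU kernels, CPU operators, and the timeline.
Chrome Tracing
Open chrome://tracing in the browser and click Load to load the trace file.
Analysis tool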
cd /root/vime
python tools/analyze_profile.py --profile-dir /root/logs/vllm_profile --all-ranks
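For a quick look without a UI, a trace can also be summarized directly. The sketch below is not tools/analyze_profile.py; it assumes the trace is in the Chrome Trace Event format (a "traceEvents" list, where complete events have ph == "X" and a "dur" in microseconds), and the function names are hypothetical.

```python
import gzip
import json
from collections import Counter


def load_trace_events(path: str) -> list:
    """Load events from a .json or .json.gz trace file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    # The format allows either {"traceEvents": [...]} or a bare event list.
    return data["traceEvents"] if isinstance(data, dict) else data


def top_events_by_duration(events, limit: int = 10) -> list:
    """Sum 'dur' per event name over complete ('X') events, largest first."""
    totals = Counter()
    for ev in events:
        if ev.get("ph") == "X":
            totals[ev.get("name", "?")] += ev.get("dur", 0)
    return totals.most_common(limit)
```

Large traces are loaded fully into memory here, which is another reason to keep the number of profiled requests small.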
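7. Troubleshooting
| Symptom | Fix |
|---|---|
| … | Pass … as JSON |
| start succeeds but the directory is empty | Confirm that curl reaches a worker and returns 200; increase … as appropriate |
| router returns 503 | Confirm the router port of the current job, or connect to a worker directly |
| stop is very slow or times out | Increase VLLM_RPC_TIMEOUT |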
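8. Complete runnable example
The script below assumes that it runs inside the container, that the vime repository is at /root/vime, and that models and data are under /root/models and /root/data. It has two parts:
- launch_train_for_profiling: launches train with the profiler enabled (sleep_rollout, a minimal single-GPU colocate example; adjust the GPU count to your machine).
- run_profiling_session: once train is ready, runs the profiling session from another terminal.
Save the script as /root/vime/run_profiling_demo.sh and run it.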
#!/usr/bin/env bash
#
# vime rollout profiling: complete example
# Usage:
#   bash /root/vime/run_profiling_demo.sh launch   # terminal 1: launch train
#   bash /root/vime/run_profiling_demo.sh profile  # terminal 2: capture a trace once train is ready
#
set -euo pipefail
VIME_ROOT="${VIME_ROOT:-/root/vime}"
HF_CKPT="${HF_CKPT:-/root/models/Qwen3-4B}"
REF_LOAD="${REF_LOAD:-/root/models/Qwen3-4B_torch_dist}"
PROMPT_DATA="${PROMPT_DATA:-/root/data/gsm8k/train.parquet}"
LOG_ROOT="${LOG_ROOT:-/root/logs/vime_profiling}"
PROFILE_DIR="${PROFILE_DIR:-/root/logs/vllm_profile}"
TRAIN_LOG="${LOG_ROOT}/train_profiling.log"
ROUTER_HOST="${ROUTER_HOST:-127.0.0.1}"
mkdir -p "${LOG_ROOT}" "${PROFILE_DIR}"
VLLM_PROFILER_CONFIG_JSON="$(printf \
'{"profiler":"torch","torch_profiler_dir":"%s","max_iterations":3,"ignore_frontend":true}' \
"${PROFILE_DIR}")"
launch_train_for_profiling() {
  cd "${VIME_ROOT}"
  # Clean up stale Ray / vLLM processes (comment out if not needed)
  ray stop --force || true
  pkill -9 -f '[v]llm serve|VLL[M]::' || true
  sleep 2
  ray start --head --node-ip-address 127.0.0.1 --num-gpus 2 --disable-usage-stats
  source "${VIME_ROOT}/scripts/models/qwen3-4B.sh"
  export VLLM_RPC_TIMEOUT="${VLLM_RPC_TIMEOUT:-1800000}"
  RUNTIME_ENV_JSON="{
    \"env_vars\": {
      \"PYTHONPATH\": \"/root/Megatron-LM\",
      \"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
      \"VLLM_RPC_TIMEOUT\": \"${VLLM_RPC_TIMEOUT}\"
    }
  }"
  echo "=== Launching train; log: ${TRAIN_LOG} ==="
  echo "=== After engines are up, run: bash $0 profile ==="
  ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json="${RUNTIME_ENV_JSON}" \
    -- python3 train.py \
    --train-backend megatron \
    --colocate \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 1 \
    --rollout-num-gpus 1 \
    --rollout-num-gpus-per-engine 1 \
    --rollout-function-path vime.rollout.sleep_rollout.sleep \
    --hf-checkpoint "${HF_CKPT}" \
    --ref-load "${REF_LOAD}" \
    --prompt-data "${PROMPT_DATA}" \
    --input-key question \
    --label-key label \
    --apply-chat-template \
    --rm-type deepscaler \
    --num-rollout 1 \
    --rollout-batch-size 4 \
    --n-samples-per-prompt 1 \
    --rollout-max-response-len 512 \
    --global-batch-size 4 \
    --vllm-gpu-memory-utilization 0.7 \
    --vllm-profiler-config "${VLLM_PROFILER_CONFIG_JSON}" \
    "${MODEL_ARGS[@]}" \
    2>&1 | tee "${TRAIN_LOG}"
}
discover_router_url() {
  local line port
  line="$(grep -E 'Router launched at' "${TRAIN_LOG}" | tail -1 || true)"
  if [[ -z "${line}" ]]; then
    echo "ERROR: Router not found in ${TRAIN_LOG}. Is train still starting?" >&2
    exit 1
  fi
  # Router launched at 127.0.0.1:3521, Prometheus port: ...
  port="$(echo "${line}" | sed -n 's/.*Router launched at [^:]*:\([0-9]*\).*/\1/p')"
  echo "http://${ROUTER_HOST}:${port}"
}
discover_worker_url() {
  local router_url="$1"
  # The heredoc body and its terminator must start at column 0.
  python3 - "${router_url}" <<'PY'
import json
import sys
import urllib.request

router = sys.argv[1]
# Ask the router for its registered workers and print the first worker URL.
with urllib.request.urlopen(f"{router}/workers", timeout=10) as r:
    workers = json.load(r).get("workers", [])
if not workers:
    raise SystemExit("No workers registered")
print(workers[0]["url"])
PY
}
run_profiling_session() {
  cd "${VIME_ROOT}"
  local router_url worker_url model="${HF_CKPT}"
  router_url="$(discover_router_url)"
  worker_url="$(discover_worker_url "${router_url}")"
  echo "=== ROUTER=${router_url} WORKER=${worker_url} PROFILE_DIR=${PROFILE_DIR} ==="
  echo "=== 1/3 start_profile (all workers via router) ==="
  python tools/profile_rollout.py --router-url "${router_url}" --action start
  echo "=== 2/3 send completions (direct to worker; 3 requests) ==="
  for i in 1 2 3; do
    curl -sS -X POST "${worker_url}/v1/completions" \
      -H "Content-Type: application/json" \
      -d "{\"model\":\"${model}\",\"prompt\":\"Hello ${i}\",\"max_tokens\":32}" \
      | head -c 400
    echo
  done
  echo "=== 3/3 list trace files (max_iterations=3 auto-stop; add --action stop if needed) ==="
  sleep 2
  find "${PROFILE_DIR}" -type f \( -name '*.json*' -o -name 'profiler_out_*' \) | sort
  echo "Open *.trace.json.gz in https://ui.perfetto.dev/ or run:"
  echo "  python tools/analyze_profile.py --profile-dir ${PROFILE_DIR} --all-ranks"
}
case "${1:-}" in
  launch) launch_train_for_profiling ;;
  profile) run_profiling_session ;;
  *)
    echo "Usage: $0 {launch|profile}" >&2
    exit 1
    ;;
esac
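Steps:
# Terminal 1: launch train (wait until vLLM and the router are ready and the log shows Router launched at ...)
bash /root/vime/run_profiling_demo.sh launch
# Terminal 2: capture the trace
bash /root/vime/run_profiling_demo.sh profile
Adjust /root/models/..., /root/data/..., and the GPU layout (actor-num-gpus-per-node, rollout-num-gpus, and so on) at the top of the script as needed.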
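The wait in terminal 1 can be automated by polling the train log until the router line appears. This is a small sketch under the assumption that the log contains the "Router launched at host:port" line shown in section 3; `wait_for_router_port` is a hypothetical helper, not part of vime.

```python
import re
import time
from pathlib import Path
from typing import Optional

ROUTER_LINE = re.compile(r"Router launched at [^:\s]+:(\d+)")


def wait_for_router_port(log_path: str, timeout_s: float = 600.0,
                         poll_s: float = 2.0) -> Optional[int]:
    """Poll the log until a router port shows up; return None on timeout."""
    deadline = time.monotonic() + timeout_s
    path = Path(log_path)
    while True:
        if path.exists():
            ports = ROUTER_LINE.findall(path.read_text(errors="replace"))
            if ports:
                return int(ports[-1])  # use the most recent launch
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_s)


if __name__ == "__main__":
    port = wait_for_router_port("/root/logs/vime_profiling/train_profiling.log")
    print("router port:", port)
```

A router line only shows that the router is up; workers may still be loading, so checking /workers before starting the profiler remains worthwhile.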