Profiling#
In vime, you can profile the rollout (vLLM inference) path in detail using vLLM’s profiling HTTP API. Profiling targets the vLLM engine side, not the Megatron training side.
Typical flow:
1. Start train (sleep_rollout + vllm-profiler-config)
2. Wait until vLLM engines and the router are ready
3. Read router/worker addresses from logs
4. start_profile
5. Send a few inference requests
6. (Optional) stop_profile; or traces flush automatically when max_iterations is reached
7. Inspect trace files under torch_profiler_dir
1. Put Rollout into a Wait State (sleep_rollout)#
For flexible stress testing and profiling, rollout usually waits after initialization instead of generating immediately.
Set --rollout-function-path in the train.py startup arguments; no code changes are required:
python train.py \
--rollout-function-path vime.rollout.sleep_rollout.sleep \
... (other arguments)
This puts the rollout process in an infinite wait loop so you can send HTTP requests or run stress tools manually.
2. Enable the vLLM Profiler (at train startup)#
vLLM registers /start_profile and /stop_profile only when started with --profiler-config. In vime, pass --vllm-profiler-config through to the vllm serve subprocess.
2.1 Pass the full config as JSON#
--vllm-profiler-config '{"profiler":"torch","torch_profiler_dir":"/root/logs/vllm_profile","max_iterations":3,"ignore_frontend":true}'
Common JSON fields:
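| Field | Description |
|---|---|
| profiler | Profiler backend (torch in the example above) |
| torch_profiler_dir | Trace output directory (absolute path) |
| max_iterations | Worker auto-stops and flushes after more than N steps |
| ignore_frontend | Recommended; the example above sets it to true |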
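Hand-written JSON on a command line is easy to get wrong, so the value can also be generated. The following is a minimal sketch, assuming only the four fields used in the example above; the helper name build_profiler_config is hypothetical and not part of vime or vLLM.

```python
import json
import os


def build_profiler_config(trace_dir, max_iterations=3, ignore_frontend=True):
    """Return a JSON string suitable for --vllm-profiler-config.

    Hypothetical helper: only the four fields shown in this guide are emitted.
    """
    # The guide asks for an absolute trace directory, so reject relative paths early.
    if not os.path.isabs(trace_dir):
        raise ValueError(f"torch_profiler_dir must be an absolute path: {trace_dir!r}")
    config = {
        "profiler": "torch",
        "torch_profiler_dir": trace_dir,
        "max_iterations": max_iterations,
        "ignore_frontend": ignore_frontend,
    }
    # Compact separators keep the argument short on the command line.
    return json.dumps(config, separators=(",", ":"))


print(build_profiler_config("/root/logs/vllm_profile"))
```

The printed string can be passed as a single quoted argument to --vllm-profiler-config.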
Avoid RPC timeout on stop_profile: vLLM APIServer talks to EngineCore/workers over internal RPC. Manually calling stop_profile to flush traces can take minutes, while the default VLLM_RPC_TIMEOUT is only 10 seconds (10000 ms), which can interrupt flush or leave traces incomplete. For profiling, set 30 minutes (1800000 ms).
Set this variable before starting train and launching vLLM, in the Ray worker environment (a local shell export may not reach the Ray job). Pass it via runtime-env-json on ray job submit, for example:
export VLLM_RPC_TIMEOUT="${VLLM_RPC_TIMEOUT:-1800000}"
RUNTIME_ENV_JSON="{
\"env_vars\": {
\"PYTHONPATH\": \"/root/Megatron-LM\",
\"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
\"VLLM_RPC_TIMEOUT\": \"${VLLM_RPC_TIMEOUT}\"
}
}"
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
-- python3 train.py \
... \
--vllm-profiler-config '{"profiler":"torch","torch_profiler_dir":"/root/logs/vllm_profile",...}'
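The runtime environment JSON needs careful escaping when built inside a shell string. As an alternative, here is a small sketch that builds the same env_vars with json.dumps and quotes the result for the shell. The function name build_runtime_env_json is an assumption for illustration; the variable names and values come from the example above.

```python
import json
import shlex


def build_runtime_env_json(rpc_timeout_ms=1_800_000):
    """Return the runtime env JSON used in this guide (hypothetical helper)."""
    env = {
        "env_vars": {
            "PYTHONPATH": "/root/Megatron-LM",
            "CUDA_DEVICE_MAX_CONNECTIONS": "1",
            # Environment variable values are strings, so convert the number.
            "VLLM_RPC_TIMEOUT": str(rpc_timeout_ms),
        }
    }
    return json.dumps(env)


runtime_env = build_runtime_env_json()
# shlex.quote makes the JSON safe to paste as one shell argument.
print("--runtime-env-json=" + shlex.quote(runtime_env))
```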
2.2 Verify it took effect#
After train starts, confirm all three in logs (missing any means the profiler is not enabled correctly):
- Args parsed: vllm_profiler_config ... profiler='torch' (and the torch_profiler_dir path).
- Forwarded to vLLM subprocess: Launching vLLM server: ... --profiler-config {"profiler":"torch",...}.
- HTTP routes registered: vLLM startup route list includes /start_profile and /stop_profile (otherwise POST /start_profile returns 404).
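The three checks can be automated against a saved train log. This is a rough sketch, assuming the log contains the fragments quoted above; the exact line formats may differ between versions, so the patterns are deliberately loose and the function name is hypothetical.

```python
import re


def check_profiler_log(log_text):
    """Report which of the three profiler markers appear in the log text.

    The patterns are assumptions based on the fragments quoted in this guide.
    """
    return {
        # Parsed arguments mention vllm_profiler_config with profiler='torch'.
        "args_parsed": re.search(
            r"vllm_profiler_config.*profiler='torch'", log_text
        ) is not None,
        # The vLLM launch line carries --profiler-config.
        "forwarded_to_vllm": re.search(
            r"Launching vLLM server:.*--profiler-config", log_text
        ) is not None,
        # Both profiling routes show up somewhere in the log.
        "routes_registered": "/start_profile" in log_text
        and "/stop_profile" in log_text,
    }
```

A result with any False value suggests the profiler is not enabled correctly.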
3. Get Router and Worker Addresses#
vLLM engines (workers) register on the vllm-router. Example startup log:
Router launched at 127.0.0.1:3521, Prometheus port: 4153
Ports for engine 0: {'host': '127.0.0.1', 'port': 15000, ...}
Starting vLLM server on http://127.0.0.1:15000
Note: the router port may change on every job (random in 3000–4000 by default). Do not reuse the previous port. Verify with curl:
curl http://127.0.0.1:3521/workers
The response lists each worker’s url and is_healthy.
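Because the router port changes between jobs, it helps to read it from the log rather than hard-code it. The sketch below parses the "Router launched at" line shown above and filters a /workers response. The response is assumed to be a JSON object with a "workers" list whose items carry url and is_healthy; both function names are hypothetical.

```python
import re

# Matches e.g. "Router launched at 127.0.0.1:3521, Prometheus port: 4153".
ROUTER_LINE = re.compile(r"Router launched at ([^:\s]+):(\d+)")


def find_router_url(log_text):
    """Return the URL of the most recently launched router, or None."""
    matches = ROUTER_LINE.findall(log_text)
    if not matches:
        return None
    # Use the last match so a reused log file does not yield a stale port.
    host, port = matches[-1]
    return f"http://{host}:{port}"


def healthy_worker_urls(workers_payload):
    """Return the url of every worker reported as healthy (assumed payload shape)."""
    return [
        worker["url"]
        for worker in workers_payload.get("workers", [])
        if worker.get("is_healthy")
    ]
```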
4. Use tools/profile_rollout.py#
The script reads the router’s /workers list and calls /start_profile or /stop_profile on every worker.
Start Profiling#
cd /root/vime
python tools/profile_rollout.py \
--router-url http://127.0.0.1:3521 \
--action start
Stop Profiling (optional)#
If --vllm-profiler-config sets max_iterations, the worker auto-stops and flushes after enough steps. In practice, traces often appear under torch_profiler_dir right after inference, so you do not need to call stop_profile manually. Use the command below only to end collection early:
python tools/profile_rollout.py \
--router-url http://127.0.0.1:3521 \
--action stop
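For readers who want to see the mechanics, the following sketch shows roughly what such a fan-out looks like. It is not the source of tools/profile_rollout.py; the function names, the assumed /workers payload shape ({"workers": [{"url": ...}]}), and the timeout values are illustrative assumptions.

```python
import json
import urllib.request


def profile_urls(workers_payload, action):
    """Build the per-worker profiling URL for action 'start' or 'stop'."""
    if action not in ("start", "stop"):
        raise ValueError(f"unknown action: {action!r}")
    return [
        f"{worker['url'].rstrip('/')}/{action}_profile"
        for worker in workers_payload.get("workers", [])
    ]


def send_profile_action(router_url, action, timeout_s=1800):
    """Read the router's worker list and POST the action to every worker."""
    with urllib.request.urlopen(f"{router_url}/workers", timeout=10) as resp:
        payload = json.load(resp)
    for url in profile_urls(payload, action):
        request = urllib.request.Request(url, data=b"", method="POST")
        # stop_profile can take minutes while traces flush, hence the long timeout.
        with urllib.request.urlopen(request, timeout=timeout_s) as resp:
            print(url, resp.status)
```

Calling send_profile_action("http://127.0.0.1:3521", "start") would correspond to the start command above, under these assumptions.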
5. Send Inference Requests#
While sleep_rollout is waiting:
1. profile_rollout.py --action start
2. Send a few completion requests to the router or directly to a worker (2–4 is enough; traces get large)
3. (Optional) profile_rollout.py --action stop; or wait for max_iterations to auto-flush
4. Inspect traces under torch_profiler_dir
Example request (model is the HF checkpoint path):
curl -X POST http://127.0.0.1:15000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"/root/models/Qwen3-4B","prompt":"Hello","max_tokens":32}'
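The same request can be sent from Python when curl is not convenient. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the request body mirrors the curl example above.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build a POST request for the /v1/completions endpoint."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completions(base_url, model, count=3):
    """Send a handful of short requests; a few are enough for a trace."""
    for i in range(1, count + 1):
        request = build_completion_request(base_url, model, f"Hello {i}")
        with urllib.request.urlopen(request, timeout=120) as resp:
            # Print only the first 400 bytes of each response.
            print(resp.status, resp.read(400).decode("utf-8", "replace"))
```

For example, send_completions("http://127.0.0.1:15000", "/root/models/Qwen3-4B") sends three requests directly to the worker from the log excerpt in section 3.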
6. View Traces#
Perfetto#
- In https://ui.perfetto.dev/, choose Open trace file and pick *.trace.json.gz
- Inspect GPU kernels, CPU ops, and the timeline
Chrome Tracing#
Open chrome://tracing in the browser and Load a trace file.
Analysis Tool#
cd /root/vime
python tools/analyze_profile.py --profile-dir /root/logs/vllm_profile --all-ranks
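For a quick look without opening a viewer, a trace can also be summarized directly. The sketch below is not tools/analyze_profile.py; it assumes the trace is gzip-compressed JSON in the Chrome Trace Event object form, with a top-level "traceEvents" list and durations in microseconds on complete ("X") events. The function names are hypothetical.

```python
import gzip
import json
from collections import Counter


def load_trace(trace_path):
    """Load a gzip-compressed JSON trace file such as *.trace.json.gz."""
    with gzip.open(trace_path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def top_events_by_duration(trace, limit=10):
    """Return (name, total_duration_us) pairs, longest first."""
    totals = Counter()
    for event in trace.get("traceEvents", []):
        # Only complete events ("ph": "X") carry a "dur" field.
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += event["dur"]
    return totals.most_common(limit)
```

A typical call would be top_events_by_duration(load_trace(path)) for one of the files under torch_profiler_dir.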
7. Troubleshooting#
| Symptom | Fix |
|---|---|
| POST /start_profile returns 404 | Pass --vllm-profiler-config |
| Start OK but empty output dir | Confirm curl hits a worker and returns 200; increase … |
| Router 503 | Confirm the current job’s router port; connect directly to a worker |
| Slow or timed-out stop | Increase VLLM_RPC_TIMEOUT |
8. Full Runnable Example#
The script below assumes a container environment, vime at /root/vime, models and data under /root/models and /root/data. Two parts:
- launch_train_for_profiling: start train with the profiler (sleep_rollout; a minimal single-GPU colocate example, adjust the GPU layout for your machine).
- run_profiling_session: run profiling from another terminal after train is ready.
Save as /root/vime/run_profiling_demo.sh and run:
#!/usr/bin/env bash
#
# Full vime rollout profiling example
# Usage:
# bash /root/vime/run_profiling_demo.sh launch # terminal 1: start train
# bash /root/vime/run_profiling_demo.sh profile # terminal 2: capture traces after train is ready
#
set -euo pipefail
VIME_ROOT="${VIME_ROOT:-/root/vime}"
HF_CKPT="${HF_CKPT:-/root/models/Qwen3-4B}"
REF_LOAD="${REF_LOAD:-/root/models/Qwen3-4B_torch_dist}"
PROMPT_DATA="${PROMPT_DATA:-/root/data/gsm8k/train.parquet}"
LOG_ROOT="${LOG_ROOT:-/root/logs/vime_profiling}"
PROFILE_DIR="${PROFILE_DIR:-/root/logs/vllm_profile}"
TRAIN_LOG="${LOG_ROOT}/train_profiling.log"
ROUTER_HOST="${ROUTER_HOST:-127.0.0.1}"
mkdir -p "${LOG_ROOT}" "${PROFILE_DIR}"
VLLM_PROFILER_CONFIG_JSON="$(printf \
'{"profiler":"torch","torch_profiler_dir":"%s","max_iterations":3,"ignore_frontend":true}' \
"${PROFILE_DIR}")"
launch_train_for_profiling() {
cd "${VIME_ROOT}"
# Clean up old Ray / vLLM processes (comment out if not needed)
ray stop --force || true
pkill -9 -f '[v]llm serve|VLL[M]::' || true
sleep 2
ray start --head --node-ip-address 127.0.0.1 --num-gpus 2 --disable-usage-stats
source "${VIME_ROOT}/scripts/models/qwen3-4B.sh"
export VLLM_RPC_TIMEOUT="${VLLM_RPC_TIMEOUT:-1800000}"
RUNTIME_ENV_JSON="{
\"env_vars\": {
\"PYTHONPATH\": \"/root/Megatron-LM\",
\"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
\"VLLM_RPC_TIMEOUT\": \"${VLLM_RPC_TIMEOUT}\"
}
}"
echo "=== Launching train; log: ${TRAIN_LOG} ==="
echo "=== After engines are up, run: bash $0 profile ==="
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
-- python3 train.py \
--train-backend megatron \
--colocate \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 1 \
--rollout-num-gpus 1 \
--rollout-num-gpus-per-engine 1 \
--rollout-function-path vime.rollout.sleep_rollout.sleep \
--hf-checkpoint "${HF_CKPT}" \
--ref-load "${REF_LOAD}" \
--prompt-data "${PROMPT_DATA}" \
--input-key question \
--label-key label \
--apply-chat-template \
--rm-type deepscaler \
--num-rollout 1 \
--rollout-batch-size 4 \
--n-samples-per-prompt 1 \
--rollout-max-response-len 512 \
--global-batch-size 4 \
--vllm-gpu-memory-utilization 0.7 \
--vllm-profiler-config "${VLLM_PROFILER_CONFIG_JSON}" \
"${MODEL_ARGS[@]}" \
2>&1 | tee "${TRAIN_LOG}"
}
discover_router_url() {
local line port
line="$(grep -E 'Router launched at' "${TRAIN_LOG}" | tail -1 || true)"
if [[ -z "${line}" ]]; then
echo "ERROR: Router not found in ${TRAIN_LOG}. Is train still starting?" >&2
exit 1
fi
# Router launched at 127.0.0.1:3521, Prometheus port: ...
port="$(echo "${line}" | sed -n 's/.*Router launched at [^:]*:\([0-9]*\).*/\1/p')"
echo "http://${ROUTER_HOST}:${port}"
}
discover_worker_url() {
local router_url="$1"
python3 - <<'PY' "${router_url}"
import json, sys, urllib.request
router = sys.argv[1]
with urllib.request.urlopen(f"{router}/workers", timeout=10) as r:
workers = json.load(r).get("workers", [])
if not workers:
raise SystemExit("No workers registered")
print(workers[0]["url"])
PY
}
run_profiling_session() {
cd "${VIME_ROOT}"
local router_url worker_url model="${HF_CKPT}"
router_url="$(discover_router_url)"
worker_url="$(discover_worker_url "${router_url}")"
echo "=== ROUTER=${router_url} WORKER=${worker_url} PROFILE_DIR=${PROFILE_DIR} ==="
echo "=== 1/3 start_profile (all workers via router) ==="
python tools/profile_rollout.py --router-url "${router_url}" --action start
echo "=== 2/3 send completions (direct to worker; 3 requests) ==="
for i in 1 2 3; do
curl -sS -X POST "${worker_url}/v1/completions" \
-H "Content-Type: application/json" \
-d "{\"model\":\"${model}\",\"prompt\":\"Hello ${i}\",\"max_tokens\":32}" \
| head -c 400
echo
done
echo "=== 3/3 list trace files (max_iterations=3 auto-stop; add --action stop if needed) ==="
sleep 2
find "${PROFILE_DIR}" -type f \( -name '*.json*' -o -name 'profiler_out_*' \) | sort
echo "Open *.trace.json.gz in https://ui.perfetto.dev/ or run:"
echo " python tools/analyze_profile.py --profile-dir ${PROFILE_DIR} --all-ranks"
}
case "${1:-}" in
launch) launch_train_for_profiling ;;
profile) run_profiling_session ;;
*)
echo "Usage: $0 {launch|profile}" >&2
exit 1
;;
esac
Steps:
# Terminal 1: start train (wait until logs show Router launched at ...)
bash /root/vime/run_profiling_demo.sh launch
# Terminal 2: capture traces
bash /root/vime/run_profiling_demo.sh profile
Adjust paths at the top of the script (/root/models/..., /root/data/...) and GPU layout (actor-num-gpus-per-node, rollout-num-gpus, etc.) as needed.
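The second terminal should only start once the router and engines are up. One way to avoid guessing is to poll the router until every worker reports healthy. The following is a sketch under the same assumptions as the script above (a /workers response with a "workers" list carrying is_healthy); the function names and default timings are hypothetical.

```python
import json
import time
import urllib.request


def fetch_workers(router_url):
    """Return the worker list from the router's /workers endpoint."""
    with urllib.request.urlopen(f"{router_url}/workers", timeout=10) as resp:
        return json.load(resp).get("workers", [])


def wait_for_healthy_workers(fetch, timeout_s=600.0, interval_s=5.0):
    """Call fetch() until all workers are healthy, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            workers = fetch()
        except (OSError, ValueError):
            # Connection refused or malformed JSON: the router is not ready yet.
            workers = []
        if workers and all(worker.get("is_healthy") for worker in workers):
            return workers
        if time.monotonic() >= deadline:
            raise TimeoutError("workers did not become healthy in time")
        time.sleep(interval_s)
```

For example, wait_for_healthy_workers(lambda: fetch_workers("http://127.0.0.1:3521")) blocks until the router from the log excerpt reports all workers healthy. Passing the fetch function as an argument keeps the polling loop easy to test without a live router.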