Profiling vLLM#
Warning
Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. vLLM end-users should never turn on profiling, as it will significantly slow down inference.
Profile with PyTorch Profiler#
We support tracing vLLM workers using the torch.profiler module. You can enable tracing by setting the VLLM_TORCH_PROFILER_DIR environment variable to the directory where you want to save the traces, for example VLLM_TORCH_PROFILER_DIR=/mnt/traces/.
The OpenAI server also needs to be started with the VLLM_TORCH_PROFILER_DIR environment variable set.
When using benchmarks/benchmark_serving.py, you can enable profiling by passing the --profile flag.
Traces can be visualized using https://ui.perfetto.dev/.
Tip
Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, there is no need to untar the traces; they can be viewed directly.
Tip
Stopping the profiler flushes all of the trace files out to the directory. This takes time: for about 100 requests' worth of data for a Llama 70B model, it takes about 10 minutes to flush out on an H100.
Set the environment variable VLLM_RPC_TIMEOUT to a large value, say around 30 minutes, before you start the server:
export VLLM_RPC_TIMEOUT=1800000
Example commands and usage#
Offline Inference#
Refer to examples/offline_inference/simple_profiling.py for an example.
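The referenced script is short; the following is a minimal sketch of the same pattern, assuming the LLM.start_profile() / LLM.stop_profile() helpers and that VLLM_TORCH_PROFILER_DIR is set before the engine is created (the model name is just a placeholder):
import os

from vllm import LLM, SamplingParams

# The profiler is only enabled when this variable is set before the engine starts.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"

llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, max_tokens=16)
prompts = ["Hello, my name is", "The capital of France is"]

llm.start_profile()   # begin collecting torch.profiler traces
outputs = llm.generate(prompts, sampling_params)
llm.stop_profile()    # flush the traces to VLLM_TORCH_PROFILER_DIR

for output in outputs:
    print(output.outputs[0].text)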
OpenAI Server#
VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
benchmark_serving.py:
python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-70B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2
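The --profile flag asks the server to start and stop the torch profiler around the benchmark run. If you instead want to capture arbitrary client traffic, you can call the server's profiling endpoints yourself; the sketch below assumes the /start_profile and /stop_profile routes (only active when the server was launched with VLLM_TORCH_PROFILER_DIR set) and a server listening on localhost:8000.
import urllib.request

def post(url: str) -> None:
    # The profiling endpoints take an empty-body POST.
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request) as response:
        print(url, response.status)

base = "http://localhost:8000"
post(f"{base}/start_profile")   # start tracing
# ... send the requests you want to capture ...
post(f"{base}/stop_profile")    # stop tracing and flush files to VLLM_TORCH_PROFILER_DIR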
Profile with NVIDIA Nsight Systems#
Nsight Systems is an advanced tool that exposes more profiling details, such as register and shared-memory usage, annotated code regions, and low-level CUDA APIs and events.
Install nsight-systems using your package manager. The following block is an example for Ubuntu.
apt update
apt install -y --no-install-recommends gnupg
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install nsight-systems-cli
Example commands and usage#
Offline Inference#
For basic usage, you can simply prepend nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node to any existing script you would run for offline inference.
The following is an example using the benchmarks/benchmark_latency.py script:
nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --num-iters-warmup 5 --num-iters 1 --batch-size 16 --input-len 512 --output-len 8
OpenAI Server#
To profile the server, prepend your vllm serve command with nsys profile just like for offline inference; however, you must specify the --delay XX --duration YY parameters according to the needs of your benchmark. After the duration has elapsed, the server will be killed.
# server
nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node --delay 30 --duration 60 vllm serve meta-llama/Llama-3.1-8B-Instruct
# client
python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1 --dataset-name random --random-input 1024 --random-output 512
In practice, you should set the --duration argument to a large value. Whenever you want the server to stop profiling, run:
nsys sessions list
to get the session id in the form of profile-XXXXX, then run:
nsys stop --session=profile-XXXXX
to manually kill the profiler and generate your nsys-rep report.
Analysis#
You can view these profiles either as summaries in the CLI, using nsys stats [profile-file], or in the GUI by installing Nsight Systems locally (see NVIDIA's installation directions).
CLI example:
nsys stats report1.nsys-rep
...
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- -------- --------- ----------- ----------------------------------------------------------------------------------------------------
46.3 10,327,352,338 17,505 589,965.9 144,383.0 27,040 3,126,460 944,263.8 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
14.8 3,305,114,764 5,152 641,520.7 293,408.0 287,296 2,822,716 867,124.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
12.1 2,692,284,876 14,280 188,535.4 83,904.0 19,328 2,862,237 497,999.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
9.5 2,116,600,578 33,920 62,399.8 21,504.0 15,326 2,532,285 290,954.1 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
5.0 1,119,749,165 18,912 59,208.4 9,056.0 6,784 2,578,366 271,581.7 void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
4.1 916,662,515 21,312 43,011.6 19,776.0 8,928 2,586,205 199,790.1 void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa…
2.6 587,283,113 37,824 15,526.7 3,008.0 2,719 2,517,756 139,091.1 std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
1.9 418,362,605 18,912 22,121.5 3,871.0 3,328 2,523,870 175,248.2 void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
0.7 167,083,069 18,880 8,849.7 2,240.0 1,471 2,499,996 101,436.1 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
...
GUI example:
Profiling vLLM Python Code#
The Python standard library includes cProfile for profiling Python code. vLLM includes a couple of helpers that make it easy to apply it to a section of vLLM.
Both the vllm.utils.cprofile and vllm.utils.cprofile_context functions can be used to profile a section of code.
Example usage - decorator#
The first helper is a Python decorator that can be used to profile a function. If a filename is specified, the profile will be saved to that file. If no filename is specified, profile data will be printed to stdout.
import vllm.utils
@vllm.utils.cprofile("expensive_function.prof")
def expensive_function():
# some expensive code
pass
Example usage - context manager#
The second helper is a context manager that can be used to profile a block of code. Similar to the decorator, the filename is optional.
import vllm.utils
def another_function():
# more expensive code
pass
with vllm.utils.cprofile_context("another_function.prof"):
another_function()
Analyzing Profile Results#
There are multiple tools available that can help analyze the profile results. One example is snakeviz.
pip install snakeviz
snakeviz expensive_function.prof
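If you prefer to stay in the terminal, the standard-library pstats module can read the same .prof files, for example:
import pstats

# Load the profile written by the decorator/context-manager examples above.
stats = pstats.Stats("expensive_function.prof")
stats.sort_stats("cumulative").print_stats(20)   # top 20 entries by cumulative time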