Profiling vLLM¶

Warning

Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. vLLM end-users should never turn on profiling as it will significantly slow down the inference.

Choosing a profiler

Use Nsight Systems for low-overhead, performance-critical profiling.
Use PyTorch Profiler for medium-overhead profiling with richer debugging information (e.g., stack traces, memory, shapes). Note that enabling these features adds overhead and is not recommended for benchmarking.

Profile with PyTorch Profiler¶

We support tracing vLLM workers using different profilers. You can enable profiling by setting the --profiler-config flag when launching the server.

Note

The --profiler-config flag is available in vLLM v0.13.0 and later. If you are using an earlier version, please upgrade to use this feature.

To use the torch.profiler module, set the profiler entry to 'torch' and torch_profiler_dir to the directory where you want to save the traces. Additionally, you can control the profiling content by specifying the following additional arguments in the config:

torch_profiler_record_shapes to enable recording Tensor Shapes, off by default
torch_profiler_with_memory to record memory, off by default
torch_profiler_with_stack to enable recording stack information, on by default
torch_profiler_with_flops to enable recording FLOPs, off by default
torch_profiler_use_gzip to control gzip-compressing profiling files, on by default
torch_profiler_dump_cuda_time_total to control dumping and printing the aggregated CUDA self time table, on by default

When using vllm bench serve, you can enable profiling by passing the --profile flag.

Traces can be visualized using https://ui.perfetto.dev/.

Tip

You can directly call bench module without installing vLLM using python -m vllm.entrypoints.cli.main bench.

Tip

Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, no need to untar the traces, they can be viewed directly.

Tip

To stop the profiler - it flushes out all the profile trace files to the directory. This takes time, for example for about 100 requests worth of data for a llama 70b, it takes about 10 minutes to flush out on a H100. The engine client waits for this flush to complete without timing out, so simply allow the stop call to run to completion.

Example commands and usage¶

Offline Inference¶

Refer to examples/features/profiling/simple_profiling_offline.py for an example.

OpenAI Server¶

vllm serve meta-llama/Llama-3.1-8B-Instruct --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'

vllm bench command:

vllm bench serve \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path sharegpt.json \
    --profile \
    --num-prompts 2

Or use http request:

# We need first call /start_profile api to start profile.
$ curl -X POST http://localhost:8000/start_profile

# Call model generate.
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
                "model": "meta-llama/Llama-3.1-8B-Instruct",
                "messages": [
                        {
                                "role": "user",
                                "content": "San Francisco is a"
                        }
                ]
    }'

# After need call /stop_profile api to stop profile.
$ curl -X POST http://localhost:8000/stop_profile

Profile with NVIDIA Nsight Systems¶

Nsight systems is an advanced tool that exposes more profiling details, such as register and shared memory usage, annotated code regions and low-level CUDA APIs and events.

Install nsight-systems using your package manager. The following block is an example for Ubuntu.

apt update
apt install -y --no-install-recommends gnupg
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install nsight-systems-cli

Tip

When profiling with nsys, it is advisable to set the environment variable VLLM_WORKER_MULTIPROC_METHOD=spawn. The default is to use the fork method instead of spawn. More information on the topic can be found in the Nsight Systems release notes.

The Nsight Systems profiler can be launched with nsys profile ..., with a few recommended flags for vLLM: --trace-fork-before-exec=true --cuda-graph-trace=node.

Example commands and usage¶

Offline Inference¶

For basic usage, you can just append the profiling command before any existing script you would run for offline inference.

The following is an example using the vllm bench latency script:

nsys profile  \
    --trace-fork-before-exec=true \
    --cuda-graph-trace=node \
vllm bench latency \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --num-iters-warmup 5 \
    --num-iters 1 \
    --batch-size 16 \
    --input-len 512 \
    --output-len 8

OpenAI Server¶

To profile the server, you will want to prepend your vllm serve command with nsys profile just like for offline inference, but you will need to specify a few other arguments to enable dynamic capture similarly to the Torch Profiler:

# server
nsys profile \
    --trace-fork-before-exec=true \
    --cuda-graph-trace=node \
    --capture-range=cudaProfilerApi \
    --capture-range-end repeat \
    vllm serve meta-llama/Llama-3.1-8B-Instruct --profiler-config.profiler cuda

# client
vllm bench serve \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path sharegpt.json \
    --profile \
    --num-prompts 2

With --profile, vLLM will capture a profile for each run of vllm bench serve. Once the server is killed, the profiles will all be saved.

Analysis¶

You can view these profiles either as summaries in the CLI, using nsys stats [profile-file], or in the GUI by installing Nsight locally following the directions here.

CLI example

nsys stats report1.nsys-rep
...
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):

Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)                                                  Name
--------  ---------------  ---------  -----------  -----------  --------  ---------  -----------  ----------------------------------------------------------------------------------------------------
    46.3   10,327,352,338     17,505    589,965.9    144,383.0    27,040  3,126,460    944,263.8  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
    14.8    3,305,114,764      5,152    641,520.7    293,408.0   287,296  2,822,716    867,124.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
    12.1    2,692,284,876     14,280    188,535.4     83,904.0    19,328  2,862,237    497,999.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
    9.5    2,116,600,578     33,920     62,399.8     21,504.0    15,326  2,532,285    290,954.1  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
    5.0    1,119,749,165     18,912     59,208.4      9,056.0     6,784  2,578,366    271,581.7  void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
    4.1      916,662,515     21,312     43,011.6     19,776.0     8,928  2,586,205    199,790.1  void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa…
    2.6      587,283,113     37,824     15,526.7      3,008.0     2,719  2,517,756    139,091.1  std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
    1.9      418,362,605     18,912     22,121.5      3,871.0     3,328  2,523,870    175,248.2  void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
    0.7      167,083,069     18,880      8,849.7      2,240.0     1,471  2,499,996    101,436.1  void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
...

GUI example:

Continuous Profiling¶

There is a GitHub CI workflow in the PyTorch infrastructure repository that provides continuous profiling for different models on vLLM. This automated profiling helps track performance characteristics over time and across different model configurations.

How It Works¶

The workflow currently runs weekly profiling sessions for selected models, generating detailed performance traces that can be analyzed using different tools to identify performance regressions or optimization opportunities. But, it can be triggered manually as well, using the Github Action tool.

Adding New Models¶

To extend the continuous profiling to additional models, you can modify the profiling-tests.json configuration file in the PyTorch integration testing repository. Simply add your model specifications to this file to include them in the automated profiling runs.

Viewing Profiling Results¶

The profiling traces generated by the continuous profiling workflow are publicly available on the vLLM Performance Dashboard. Look for the Profiling traces table to access and download the traces for different models and runs.

Profiling vLLM Python Code¶

The Python standard library includes cProfile for profiling Python code.

Example usage - function call¶

If a filename is specified, the profile will be saved to that file. If no filename is specified, profile data can be printed to stdout.

import cProfile


def expensive_function():
    # some expensive code
    pass


profiler = cProfile.Profile()
profiler.runcall(expensive_function)
profiler.dump_stats("expensive_function.prof")

Example usage - context manager style¶

import cProfile


def another_function():
    # more expensive code
    pass


profiler = cProfile.Profile()
profiler.enable()
try:
    another_function()
finally:
    profiler.disable()
    profiler.dump_stats("another_function.prof")

Analyzing Profile Results¶

There are multiple tools available that can help analyze the profile results. One example is snakeviz.

pip install snakeviz
snakeviz expensive_function.prof

Analyzing Garbage Collection Costs¶

Leverage VLLM_GC_DEBUG environment variable to debug GC costs.

VLLM_GC_DEBUG=1: enable GC debugger with gc.collect elapsed times
VLLM_GC_DEBUG='{"top_objects":5}': enable GC debugger to log top 5 collected objects for each gc.collect