Skip to content

Profiling

There are a few approaches available for measuring and analyzing vLLM performance, each offering different levels of detail and suited to specific use cases. This document outlines these approaches to help evaluate execution time, identify performance bottlenecks, and analyze both host and device behavior during inference. The following table lists the available methods of collecting performance traces. Each linked method is described in detail in a separate section.

Profiling method Category Detail level Use case
End-to-end profiling Comprehensive profiling High Capturing all profiling data across host, Python, and device.
High-level profiling High-level profiling Low Debugging prompt/decode structure, batch sizes, and scheduling patterns.
PyTorch profiling via asynchronous server Server-based profiling Medium Measuring latency, host gaps, and server response timing.
PyTorch profiling via script Script-based profiling Medium Profiling within test scripts.
Profiling specific prompt or decode execution Device-level profiling Medium/High Capturing a general execution flow without graph details (no shapes, ops). Optionally, analyzing fused ops, node names, graph structures, and timing.