Request Stats

class vllm_router.stats.request_stats.RequestStats(qps: float, ttft: float, in_prefill_requests: int, in_decoding_requests: int, finished_requests: int, uptime: int, avg_decoding_length: float, avg_latency: float, avg_itl: float, num_swapped_requests: int)

class vllm_router.stats.request_stats.RequestStatsMonitor(*args, **kwargs)

Monitors the request statistics of all serving engines.

get_request_stats(current_time: float) → Dict[str, RequestStats]

Get the request statistics for each serving engine.

Parameters: current_time – The current timestamp, in seconds.
Returns: A dictionary mapping each serving engine URL to the request statistics for that engine. The TTFT and inter-token latency are -1 if no requests finished within the sliding window.
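Because TTFT and inter-token latency are reported as -1 when no request finished in the sliding window, a caller should treat that value as "no data" rather than as a very low latency. The following is a minimal sketch of that handling. It assumes a stand-in `EngineStats` dataclass that mirrors only three of the `RequestStats` fields (it is not the library class), and `pick_lowest_ttft` and the engine URLs are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class EngineStats:
    """Stand-in for RequestStats with only the fields used below (assumption)."""
    qps: float
    ttft: float
    avg_itl: float


def pick_lowest_ttft(stats: Dict[str, EngineStats]) -> Optional[str]:
    """Return the engine URL with the lowest known TTFT, or None.

    Engines reporting ttft == -1 have no finished requests in the sliding
    window, so they are skipped instead of being treated as the fastest.
    """
    known = {url: s.ttft for url, s in stats.items() if s.ttft >= 0}
    if not known:
        return None
    return min(known, key=known.get)


# Hypothetical per-engine statistics, shaped like the returned dictionary.
stats = {
    "http://engine-a:8000": EngineStats(qps=2.0, ttft=0.35, avg_itl=0.02),
    "http://engine-b:8000": EngineStats(qps=0.0, ttft=-1, avg_itl=-1),
    "http://engine-c:8000": EngineStats(qps=1.5, ttft=0.20, avg_itl=0.03),
}
best = pick_lowest_ttft(stats)
print(best)
```

Returning `None` when every engine reports -1 leaves the fallback policy (for example, round-robin) to the caller.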

on_new_request(engine_url: str, request_id: str, timestamp: float)

Tell the monitor that a new request has been created.

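Parameters:
engine_url – The URL of the serving engine.
request_id – The ID of the request.
timestamp – The timestamp at which the request was created.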

on_request_complete(engine_url: str, request_id: str, timestamp: float)

Tell the monitor that a request has been completed.

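Parameters:
engine_url – The URL of the serving engine.
request_id – The ID of the request.
timestamp – The timestamp at which the request was completed.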

on_request_response(engine_url: str, request_id: str, timestamp: float)

Tell the monitor that a response token has been received for a request.

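Parameters:
engine_url – The URL of the serving engine.
request_id – The ID of the request.
timestamp – The timestamp at which the response token was received.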
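The timestamps passed to `on_new_request` and `on_request_response` are enough to derive a time to first token (TTFT) and an inter-token latency for a single request. How the monitor computes these internally is not stated here, so the sketch below shows only one plausible definition. The function name `ttft_and_mean_itl` is hypothetical, and returning -1 for a value that cannot be computed is an assumption that mirrors the sentinel used in the returned statistics.

```python
from typing import List, Tuple


def ttft_and_mean_itl(created_at: float, token_times: List[float]) -> Tuple[float, float]:
    """Compute (TTFT, mean inter-token latency) for one request.

    created_at  : timestamp of the new-request event
    token_times : timestamps of the response-token events, in arrival order
    A value that cannot be computed is returned as -1 (assumed sentinel).
    """
    if not token_times:
        return -1.0, -1.0
    # TTFT is the delay between request creation and the first token.
    ttft = token_times[0] - created_at
    if len(token_times) < 2:
        return ttft, -1.0
    # Inter-token latency is the mean gap between consecutive tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, sum(gaps) / len(gaps)


# Request created at t=10.0 s; four tokens arrive 0.25 s apart after 0.5 s.
ttft, itl = ttft_and_mean_itl(10.0, [10.5, 10.75, 11.0, 11.25])
print(ttft, itl)
```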

on_request_swapped(engine_url: str, request_id: str, timestamp: float)

Tell the monitor that a request has been swapped from GPU to CPU.

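Parameters:
engine_url – The URL of the serving engine.
request_id – The ID of the request.
timestamp – The timestamp at which the request was swapped.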
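The four callbacks above describe a request lifecycle: created, first token received, possibly swapped, and completed. The sketch below illustrates how such events could feed per-engine counters like `in_prefill_requests`, `in_decoding_requests`, `num_swapped_requests`, and `finished_requests`. `ToyMonitor` is a hypothetical stand-in written for this example, not the library's `RequestStatsMonitor`. Its phase transitions (the first response token moves a request from prefill to decoding, and the swap counter only ever increases) are simplifying assumptions.

```python
from typing import Dict, Set


class ToyMonitor:
    """Illustrative stand-in that reuses the callback signatures documented above."""

    def __init__(self) -> None:
        self.prefill: Dict[str, Set[str]] = {}
        self.decoding: Dict[str, Set[str]] = {}
        self.swapped: Dict[str, int] = {}
        self.finished: Dict[str, int] = {}

    def on_new_request(self, engine_url: str, request_id: str, timestamp: float) -> None:
        # Assumption: a new request starts in the prefill phase.
        self.prefill.setdefault(engine_url, set()).add(request_id)

    def on_request_response(self, engine_url: str, request_id: str, timestamp: float) -> None:
        # Assumption: the first token moves the request from prefill to decoding.
        if request_id in self.prefill.get(engine_url, set()):
            self.prefill[engine_url].discard(request_id)
            self.decoding.setdefault(engine_url, set()).add(request_id)

    def on_request_swapped(self, engine_url: str, request_id: str, timestamp: float) -> None:
        # Simplification: count swap events without tracking swap-in.
        self.swapped[engine_url] = self.swapped.get(engine_url, 0) + 1

    def on_request_complete(self, engine_url: str, request_id: str, timestamp: float) -> None:
        # A completed request leaves whichever phase it was in.
        self.prefill.get(engine_url, set()).discard(request_id)
        self.decoding.get(engine_url, set()).discard(request_id)
        self.finished[engine_url] = self.finished.get(engine_url, 0) + 1

    def counts(self, engine_url: str) -> Dict[str, int]:
        return {
            "in_prefill_requests": len(self.prefill.get(engine_url, set())),
            "in_decoding_requests": len(self.decoding.get(engine_url, set())),
            "num_swapped_requests": self.swapped.get(engine_url, 0),
            "finished_requests": self.finished.get(engine_url, 0),
        }


url = "http://engine-a:8000"  # hypothetical engine URL
monitor = ToyMonitor()
monitor.on_new_request(url, "req-1", 0.0)
monitor.on_new_request(url, "req-2", 0.1)
monitor.on_request_response(url, "req-1", 0.4)  # req-1 enters decoding
monitor.on_request_swapped(url, "req-2", 0.5)
monitor.on_request_response(url, "req-1", 0.6)
monitor.on_request_complete(url, "req-1", 0.9)
snapshot = monitor.counts(url)
print(snapshot)
```

After this sequence, `req-2` is still waiting in prefill and `req-1` has finished, so the snapshot reports one prefill request, no decoding requests, one swap, and one finished request.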