Request Stats

class vllm_router.stats.request_stats.RequestStats(qps: float, ttft: float, in_prefill_requests: int, in_decoding_requests: int, finished_requests: int, uptime: int, avg_decoding_length: float, avg_latency: float, avg_itl: float, num_swapped_requests: int)
class vllm_router.stats.request_stats.RequestStatsMonitor(*args, **kwargs)

Monitors the request statistics of all serving engines.

get_request_stats(current_time: float) → Dict[str, RequestStats]

Get the request statistics for each serving engine.

Parameters:

current_time – The current timestamp in seconds

Returns:

A dictionary where the key is the serving engine URL and the value is the request statistics for that engine. The TTFT and inter-token latency will be -1 if no requests finished within the sliding window.
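A minimal sketch of how a caller might consume the returned dictionary while treating -1 as "no measurement yet". The `RequestStats` dataclass here is a local stand-in that mirrors the fields in the signature above so the snippet runs without the router installed; `pick_engine`, the selection policy, and the engine URLs are assumptions made for illustration and are not part of the library.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RequestStats:
    # Local stand-in mirroring the documented fields; not the library class.
    qps: float
    ttft: float
    in_prefill_requests: int
    in_decoding_requests: int
    finished_requests: int
    uptime: int
    avg_decoding_length: float
    avg_latency: float
    avg_itl: float
    num_swapped_requests: int


def pick_engine(stats: Dict[str, RequestStats]) -> Optional[str]:
    """Hypothetical policy: prefer the engine with the lowest measured TTFT.

    Engines reporting ttft == -1 have no finished requests in the window,
    so they are only considered when no engine has a measurement; in that
    case the engine with the lowest QPS is returned.
    """
    if not stats:
        return None
    measured = {url: s for url, s in stats.items() if s.ttft >= 0}
    if measured:
        return min(measured, key=lambda url: measured[url].ttft)
    return min(stats, key=lambda url: stats[url].qps)


# Example values are made up; only ttft and avg_itl use the -1 sentinel.
example = {
    "http://engine-a:8000": RequestStats(
        qps=2.0, ttft=0.35, in_prefill_requests=1, in_decoding_requests=3,
        finished_requests=40, uptime=600, avg_decoding_length=128.0,
        avg_latency=1.9, avg_itl=0.02, num_swapped_requests=0,
    ),
    "http://engine-b:8000": RequestStats(
        qps=0.0, ttft=-1, in_prefill_requests=0, in_decoding_requests=0,
        finished_requests=0, uptime=600, avg_decoding_length=0.0,
        avg_latency=0.0, avg_itl=-1, num_swapped_requests=0,
    ),
}
chosen = pick_engine(example)
```

Checking the sentinel before comparing avoids ranking an idle engine as "fastest" simply because -1 is smaller than any real latency.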

on_new_request(engine_url: str, request_id: str, timestamp: float)

Tell the monitor that a new request has been created.

Parameters:
  • engine_url – The URL of the serving engine

  • request_id – The global request ID

  • timestamp – The timestamp when the request was created

on_request_complete(engine_url: str, request_id: str, timestamp: float)

Tell the monitor that a request has been completed.

Parameters:
  • engine_url – The URL of the serving engine

  • request_id – The global request ID

  • timestamp – The timestamp when the request was completed

on_request_response(engine_url: str, request_id: str, timestamp: float)

Tell the monitor that a response token has been received for a request.

Parameters:
  • engine_url – The URL of the serving engine

  • request_id – The global request ID

  • timestamp – The timestamp when the response token was received
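To illustrate why per-token timestamps matter, the sketch below derives a time-to-first-token (TTFT) and a mean inter-token latency from a request's creation time and its token arrival times. This is one plausible definition, not necessarily how the monitor computes its values; `ttft_and_itl` is a hypothetical helper, and the -1 fallback simply mirrors the sentinel documented for `get_request_stats`.

```python
from typing import List, Tuple


def ttft_and_itl(created_at: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, mean inter-token latency), using -1 where undefined."""
    if not token_times:
        # No token received yet: neither metric can be computed.
        return -1.0, -1.0
    ttft = token_times[0] - created_at
    if len(token_times) < 2:
        # A single token gives a TTFT but no gap between tokens.
        return ttft, -1.0
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, sum(gaps) / len(gaps)


# Request created at t=10.0 s, tokens received at 10.5, 10.75 and 11.0 s.
ttft, itl = ttft_and_itl(10.0, [10.5, 10.75, 11.0])
```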

on_request_swapped(engine_url: str, request_id: str, timestamp: float)

Tell the monitor that a request has been swapped from GPU to CPU.

Parameters:
  • engine_url – The URL of the serving engine

  • request_id – The global request ID

  • timestamp – The timestamp when the request was swapped
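The callbacks above suggest a request lifecycle of creation, one or more response tokens, and completion. The sketch below shows that call order against `ToyMonitor`, a deliberately simplified stand-in written for this example; it is not the `RequestStatsMonitor` implementation, and its internal bookkeeping (prefill and decoding sets, counters) is an assumption chosen only to make the state transitions visible.

```python
from collections import defaultdict
from typing import Dict, Set


class ToyMonitor:
    """Illustrative stand-in with the same callback names; not the library class."""

    def __init__(self) -> None:
        self.prefill: Dict[str, Set[str]] = defaultdict(set)
        self.decoding: Dict[str, Set[str]] = defaultdict(set)
        self.finished: Dict[str, int] = defaultdict(int)
        self.swapped: Dict[str, int] = defaultdict(int)

    def on_new_request(self, engine_url: str, request_id: str, timestamp: float) -> None:
        # A new request starts in the prefill phase.
        self.prefill[engine_url].add(request_id)

    def on_request_response(self, engine_url: str, request_id: str, timestamp: float) -> None:
        # The first token moves the request from prefill to decoding;
        # later tokens leave the state unchanged.
        if request_id in self.prefill[engine_url]:
            self.prefill[engine_url].discard(request_id)
            self.decoding[engine_url].add(request_id)

    def on_request_complete(self, engine_url: str, request_id: str, timestamp: float) -> None:
        # Completion removes the request from whichever phase it was in.
        self.prefill[engine_url].discard(request_id)
        self.decoding[engine_url].discard(request_id)
        self.finished[engine_url] += 1

    def on_request_swapped(self, engine_url: str, request_id: str, timestamp: float) -> None:
        # Count swap events per engine.
        self.swapped[engine_url] += 1


url = "http://engine-a:8000"  # hypothetical engine URL
monitor = ToyMonitor()
monitor.on_new_request(url, "req-1", 100.0)
monitor.on_request_response(url, "req-1", 100.4)  # first token
monitor.on_request_response(url, "req-1", 100.5)  # subsequent token
monitor.on_request_swapped(url, "req-1", 100.6)
monitor.on_request_complete(url, "req-1", 101.0)
```

Each callback takes the same `(engine_url, request_id, timestamp)` triple, so a router can forward events from its request-handling path without reshaping them.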