vllm.entrypoints.utils
VLLM_SERVE_PARSER_EPILOG
module-attribute
VLLM_SERVE_PARSER_EPILOG = "Tip: Use `vllm serve --help=<keyword>` to explore arguments from help.\n - To view a argument group: --help=ModelConfig\n - To view a single argument: --help=max-num-seqs\n - To search by keyword: --help=max\n - To list all groups: --help=listgroup"
_validate_truncation_size
_validate_truncation_size(
max_model_len: int,
truncate_prompt_tokens: Optional[int],
tokenization_kwargs: Optional[dict[str, Any]] = None,
) -> Optional[int]
Source code in vllm/entrypoints/utils.py
cli_env_setup
decrement_server_load
listen_for_disconnect
async
Returns once a disconnect message is received.
load_aware_call
show_filtered_argument_or_group_from_help
with_cancellation
Decorator that allows a route handler to be cancelled by client disconnections.
This does not use request.is_disconnected, which does not work with middleware. Instead, it follows the pattern from starlette.StreamingResponse, which simultaneously awaits two tasks: one that waits for an HTTP disconnect message, and another that does the work we want done. Whichever task finishes first, the other is cancelled.
A core assumption of this method is that the body of the request has already been read. This is a safe assumption for FastAPI handlers, which have already parsed the request body into a Pydantic model for us. The decorator is unsafe to use elsewhere, because it consumes and discards all incoming messages for the request while it looks for a disconnect message.
If the handler returns a StreamingResponse, this wrapper stops listening for disconnects and the response object starts listening for them instead.