AsyncLLMEngine#

class vllm.AsyncLLMEngine(worker_use_ray: bool, engine_use_ray: bool, *args, log_requests: bool = True, max_log_len: int | None = None, start_engine_loop: bool = True, **kwargs)[source]#

An asynchronous wrapper for LLMEngine.

This class wraps LLMEngine to make it asynchronous. It uses asyncio to create a background loop that keeps processing incoming requests. The LLMEngine is kicked by the generate method when there are requests in the waiting queue, and the generate method streams the outputs from the LLMEngine back to the caller. A minimal end-to-end sketch is shown after the parameter list below.

NOTE: For the comprehensive list of arguments, see LLMEngine.

Parameters:
  • worker_use_ray – Whether to use Ray for model workers. Required for distributed execution. Should be the same as parallel_config.worker_use_ray.

  • engine_use_ray – Whether to make LLMEngine a Ray actor. If so, the async frontend will be executed in a separate process as the model workers.

  • log_requests – Whether to log the requests.

  • max_log_len – Maximum number of prompt characters or prompt token IDs printed in the log.

  • start_engine_loop – If True, the background task to run the engine will be automatically started in the generate call.

  • *args – Positional arguments for LLMEngine.

  • **kwargs – Keyword arguments for LLMEngine.
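
Example

A minimal end-to-end sketch (not taken from the vLLM sources; the model name and request id are placeholders):

>>> import asyncio
>>> from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
>>>
>>> async def main():
>>>     engine = AsyncLLMEngine.from_engine_args(
>>>         AsyncEngineArgs(model="facebook/opt-125m"))
>>>     # stream the outputs as the request makes progress
>>>     async for output in engine.generate(
>>>             "What is LLM?", SamplingParams(temperature=0.0), "req-0"):
>>>         print(output.outputs[0].text)
>>>
>>> asyncio.run(main())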

async abort(request_id: str) None[source]#

Abort a request.

Abort a submitted request. If the request is finished or not found, this method will be a no-op.

Parameters:

request_id – The unique id of the request.
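
Example

A minimal sketch, assuming a request was previously submitted under the placeholder id "req-0":

>>> # stop generating for a request the client no longer needs;
>>> # aborting a finished or unknown id is a no-op
>>> await engine.abort("req-0")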

async check_health() None[source]#

Raises an error if the engine is unhealthy.
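
Example

A minimal sketch of a health probe, assuming a running engine:

>>> # returns None when healthy; raises if the engine or its workers
>>> # are in a bad state
>>> try:
>>>     await engine.check_health()
>>> except Exception as exc:
>>>     print(f"engine unhealthy: {exc}")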

async engine_step() bool[source]#

Kick the engine to process the waiting requests.

Returns True if there are in-progress requests.
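
Example

This method is normally driven by the engine's background loop rather than called directly. The sketch below only illustrates that relationship (it is not the actual run_engine_loop implementation):

>>> # illustration: keep stepping while requests are in progress,
>>> # back off briefly when the engine is idle
>>> while True:
>>>     has_requests_in_progress = await engine.engine_step()
>>>     if not has_requests_in_progress:
>>>         await asyncio.sleep(0.01)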

classmethod from_engine_args(engine_args: AsyncEngineArgs, start_engine_loop: bool = True, usage_context: UsageContext = UsageContext.ENGINE_CONTEXT) AsyncLLMEngine[source]#

Creates an async LLM engine from the engine arguments.
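
Example

A minimal sketch (the model name is only a placeholder):

>>> from vllm import AsyncEngineArgs, AsyncLLMEngine
>>> engine = AsyncLLMEngine.from_engine_args(
>>>     AsyncEngineArgs(model="facebook/opt-125m"))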

async generate(prompt: str | None, sampling_params: SamplingParams, request_id: str, prompt_token_ids: List[int] | None = None, lora_request: LoRARequest | None = None, multi_modal_data: MultiModalData | None = None) AsyncIterator[RequestOutput][source]#

Generate outputs for a request.

Generate outputs for a request. This method is a coroutine. It adds the request into the waiting queue of the LLMEngine and streams the outputs from the LLMEngine to the caller.

Parameters:
  • prompt – The prompt string. Can be None if prompt_token_ids is provided.

  • sampling_params – The sampling parameters of the request.

  • request_id – The unique id of the request.

  • prompt_token_ids – The token IDs of the prompt. If None, we use the tokenizer to convert the prompt to token IDs.

  • lora_request – LoRA request to use for generation, if any.

  • multi_modal_data – Multi-modal data for the request, if any.

Yields:

The output RequestOutput objects from the LLMEngine for the request.

Details:
  • If the engine is not running, start the background loop, which iteratively invokes engine_step() to process the waiting requests.

  • Add the request to the engine’s RequestTracker. On the next background loop, this request will be sent to the underlying engine. Also, a corresponding AsyncStream will be created.

  • Wait for the request outputs from AsyncStream and yield them.

Example

>>> # Please refer to entrypoints/api_server.py for
>>> # the complete example.
>>>
>>> from vllm import AsyncLLMEngine, SamplingParams
>>>
>>> # initialize the engine and the example input
>>> # (engine_args is an AsyncEngineArgs instance, as in api_server.py)
>>> engine = AsyncLLMEngine.from_engine_args(engine_args)
>>> example_input = {
>>>     "prompt": "What is LLM?",
>>>     "stream": False, # assume the non-streaming case
>>>     "temperature": 0.0,
>>>     "request_id": "0",
>>> }
>>>
>>> # start the generation
>>> results_generator = engine.generate(
>>>     example_input["prompt"],
>>>     SamplingParams(temperature=example_input["temperature"]),
>>>     example_input["request_id"])
>>>
>>> # get the results
>>> final_output = None
>>> async for request_output in results_generator:
>>>     if await request.is_disconnected():
>>>         # Abort the request if the client disconnects.
>>>         # (`request` is the incoming HTTP request handled by the
>>>         # server; see entrypoints/api_server.py.)
>>>         await engine.abort(example_input["request_id"])
>>>         # Return or raise an error
>>>         ...
>>>     final_output = request_output
>>>
>>> # Process and return the final output
>>> ...

async get_decoding_config() DecodingConfig[source]#

Get the decoding configuration of the vLLM engine.
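
Example

A minimal sketch, assuming a running engine:

>>> decoding_config = await engine.get_decoding_config()
>>> print(decoding_config)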

async get_model_config() ModelConfig[source]#

Get the model configuration of the vLLM engine.
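
Example

A minimal sketch, assuming a running engine (max_model_len is a field of ModelConfig):

>>> model_config = await engine.get_model_config()
>>> print(model_config.max_model_len)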

start_background_loop() None[source]#

Start the background loop.
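
Example

A minimal sketch, assuming the engine was created with start_engine_loop=False and that this runs inside an already-running asyncio event loop:

>>> engine = AsyncLLMEngine.from_engine_args(
>>>     engine_args, start_engine_loop=False)
>>> engine.start_background_loop()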