vllm_omni.engine.output_processor
MultimodalOutputProcessor
Bases: OutputProcessor
Handles multimodal output processing by capturing pooling_output from EngineCoreOutput and accumulating it as multimodal tensors, before delegating to the base vLLM OutputProcessor for text handling.
The actual data flow is:
1. For each EngineCoreOutput that has a pooling_output and a detokenizer:
   - Capture pooling_output via OmniRequestState.add_multimodal_tensor().
   - Clear eco.pooling_output to force the text path in the base processor.
2. The base vLLM OutputProcessor handles text detokenization.
3. On finish, _consolidate_multimodal_tensors() concatenates the accumulated tensors.
4. _new_completion_output() attaches mm_accumulated to the CompletionOutput.
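Steps 1 and 2 of the flow above can be modeled in plain Python to show why clearing pooling_output matters: once it is None, the output is treated as ordinary text. This is a hypothetical sketch, not the vllm_omni implementation. `FakeEngineCoreOutput`, `FakeState`, and `route_outputs` are illustrative names, lists of floats stand in for tensors, and the base processor's text path is reduced to collecting token IDs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeEngineCoreOutput:
    # Hypothetical stand-in for EngineCoreOutput.
    request_id: str
    new_token_ids: list[int]
    pooling_output: Optional[list[float]] = None


@dataclass
class FakeState:
    # Hypothetical stand-in for OmniRequestState.
    has_detokenizer: bool = True
    mm_chunks: list[list[float]] = field(default_factory=list)
    token_ids: list[int] = field(default_factory=list)

    def add_multimodal_tensor(self, chunk: list[float]) -> None:
        self.mm_chunks.append(chunk)


def route_outputs(outputs: list[FakeEngineCoreOutput], states: dict[str, FakeState]) -> None:
    for eco in outputs:
        state = states[eco.request_id]
        # Step 1: divert pooling output into the multimodal accumulator, then
        # clear it so the base processor takes the text path.
        if eco.pooling_output is not None and state.has_detokenizer:
            state.add_multimodal_tensor(eco.pooling_output)
            eco.pooling_output = None
        # Step 2 (greatly simplified): the base processor's text path.
        # Outputs that still carry pooling_output would take the pooling path instead.
        if eco.pooling_output is None:
            state.token_ids.extend(eco.new_token_ids)
```

For example, two engine steps for request "r1" that each carry a pooling chunk leave both chunks in `mm_chunks`, both token IDs in `token_ids`, and `pooling_output` cleared on each output object.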
add_request
add_request(
request: EngineCoreRequest,
prompt: str | None,
parent_req: ParentRequest | None = None,
request_index: int = 0,
queue: RequestOutputCollector | None = None,
) -> None
Add a new request to be processed.
Creates an OmniRequestState for the request and registers it for output processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| request | EngineCoreRequest | Engine core request to add | required |
| prompt | str \| None | Prompt string for the request, or None | required |
| parent_req | ParentRequest \| None | Optional parent request for parallel sampling | None |
| request_index | int | Index of the request in the batch | 0 |
| queue | RequestOutputCollector \| None | Optional queue for collecting outputs | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If the request ID is already registered |
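The registration contract above (one state per request ID, with a ValueError on a duplicate) can be sketched as a dict-backed registry. `RequestRegistry` and `RegisteredState` are hypothetical names, and the sketch omits parent_req and queue; the documented method instead builds an OmniRequestState from the EngineCoreRequest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegisteredState:
    # Hypothetical per-request record; the documented class is OmniRequestState.
    request_id: str
    prompt: Optional[str]
    request_index: int = 0


class RequestRegistry:
    def __init__(self) -> None:
        # Maps request ID -> state, so each ID can be registered only once.
        self.request_states: dict[str, RegisteredState] = {}

    def add_request(self, request_id: str, prompt: Optional[str], request_index: int = 0) -> None:
        # Reject duplicates, matching the documented ValueError.
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} is already registered.")
        self.request_states[request_id] = RegisteredState(request_id, prompt, request_index)
```

Checking for the ID before creating the state keeps a rejected call from overwriting the state of a request that is still in flight.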
OmniRequestState
Bases: RequestState
Request state for omni models, tracking multimodal outputs.
Extends the base RequestState with support for accumulating multimodal tensor outputs (e.g., images, audio, latents) that are produced incrementally during generation.
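A minimal model of the incremental accumulation described above: chunks are appended as they arrive and joined once on finish. This is an illustrative sketch under stated assumptions. `MultimodalAccumulator` and its `consolidate()` method are hypothetical, keying chunks by a modality string is an assumption rather than the documented add_multimodal_tensor() signature, and lists of rows stand in for 2-D tensors. With real tensors the join would presumably be a `torch.cat` along dim 0, though the concatenation axis is also an assumption.

```python
from collections import defaultdict
from typing import Optional


class MultimodalAccumulator:
    """Hypothetical model of per-request multimodal accumulation.

    Each chunk is a list of rows (a stand-in for a 2-D tensor). Consolidation
    concatenates a modality's chunks along the first dimension.
    """

    def __init__(self) -> None:
        self._chunks: dict[str, list[list[list[float]]]] = defaultdict(list)
        self.mm_accumulated: Optional[dict[str, list[list[float]]]] = None

    def add_multimodal_tensor(self, modality: str, chunk: list[list[float]]) -> None:
        # Called once per engine step that produced output for this modality.
        self._chunks[modality].append(chunk)

    def consolidate(self) -> None:
        # On finish: join each modality's chunks into one "tensor".
        self.mm_accumulated = {
            modality: [row for chunk in chunks for row in chunk]
            for modality, chunks in self._chunks.items()
        }
```

Deferring the join until the request finishes avoids repeatedly copying a growing buffer on every step.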
native_text_stats (instance attribute)
native_text_stats = RequestStateStats(
arrival_time=float(arrival_time or 0.0)
)
make_request_output
make_request_output(
new_token_ids: list[int],
pooling_output: Tensor | None,
finish_reason: FinishReason | None,
stop_reason: int | str | None,
kv_transfer_params: dict[str, Any] | None = None,
) -> OmniRequestOutput | PoolingRequestOutput | None
Create a request output from generation results.
Creates an OmniRequestOutput or PoolingRequestOutput from the newly generated tokens and the accumulated multimodal outputs, attaching the multimodal tensors to the completion output when they are available.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| new_token_ids | list[int] | List of newly generated token IDs | required |
| pooling_output | Tensor \| None | Pooling output tensor, or None | required |
| finish_reason | FinishReason \| None | Why generation stopped, or None if it has not stopped | required |
| stop_reason | int \| str \| None | Stop reason (stop token ID or stop string), or None | required |
| kv_transfer_params | dict[str, Any] \| None | Optional KV cache transfer parameters | None |
Returns:
| Type | Description |
|---|---|
| OmniRequestOutput \| PoolingRequestOutput \| None | OmniRequestOutput or PoolingRequestOutput if the output should be emitted (based on finish status and output kind); None otherwise |
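The "emit or return None" decision above can be sketched as follows. The enum is a local mirror of the member names of vLLM's RequestOutputKind (CUMULATIVE, DELTA, FINAL_ONLY) rather than an import, and `should_emit` and `sketch_make_output` are hypothetical helpers. The rule shown, where FINAL_ONLY requests emit nothing until finish_reason is set and DELTA reports only new tokens, follows vLLM's general output-kind convention and is assumed, not confirmed, to match this method exactly.

```python
from enum import Enum
from typing import Optional


class OutputKind(Enum):
    # Local mirror of vLLM's RequestOutputKind member names (not imported).
    CUMULATIVE = 0
    DELTA = 1
    FINAL_ONLY = 2


def should_emit(output_kind: OutputKind, finish_reason: Optional[str]) -> bool:
    finished = finish_reason is not None
    # FINAL_ONLY requests produce nothing until generation has stopped.
    return finished or output_kind is not OutputKind.FINAL_ONLY


def sketch_make_output(
    output_kind: OutputKind,
    all_token_ids: list[int],
    new_token_ids: list[int],
    finish_reason: Optional[str],
) -> Optional[dict]:
    if not should_emit(output_kind, finish_reason):
        return None
    # DELTA reports only this step's tokens; the other kinds report everything so far.
    token_ids = new_token_ids if output_kind is OutputKind.DELTA else all_token_ids
    return {"token_ids": list(token_ids), "finished": finish_reason is not None}
```

For instance, a FINAL_ONLY request with no finish_reason yields None, while the same call with finish_reason="stop" yields the full token list.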