vllm_omni.core.sched ¶
Scheduling components for vLLM-Omni.
Modules:
| Name | Description |
|---|---|
omni_ar_scheduler | |
omni_generation_scheduler | |
omni_scheduler_mixin | |
omni_scheduling_coordinator | Scheduling-side coordination for chunk and full_payload input waiting. |
output | |
utils | Shared utilities for omni schedulers. |
OmniARAsyncScheduler ¶
OmniARScheduler ¶
Bases: OmniSchedulerMixin, Scheduler
Synchronous AutoRegressive scheduler for vLLM-Omni. This class is also used as a base class for the OmniARAsyncScheduler and holds most of the core scheduling logic.
requests_needing_kv_transfer instance-attribute ¶
finish_requests ¶
Handles the finish signal from outside the scheduler.
For example, the API server can abort a request when the client disconnects.
If request_ids is None, all requests will be finished.
Returns:
| Type | Description |
|---|---|
list[tuple[str, int]] | Tuple of (req_id, client_index) for requests that were aborted. Will not |
list[tuple[str, int]] | include any that were already finished. |
get_finished_requests_needing_kv_transfer ¶
Get and clear the list of requests needing KV cache transfer. Returns dict: {req_id: {"seq_len": int, "block_ids": list[int]}}
has_finished_requests ¶
has_finished_requests() -> bool
Check if there are any finished requests (including those needing KV transfer).
has_requests ¶
has_requests() -> bool
Check if there are any requests to process, including KV transfers.
OmniGenerationScheduler ¶
Bases: OmniSchedulerMixin, Scheduler
finish_requests ¶
Handles the finish signal from outside the scheduler.
For example, the API server can abort a request when the client disconnects.
If request_ids is None, all requests will be finished.
Returns:
| Type | Description |
|---|---|
list[tuple[str, int]] | Tuple of (req_id, client_index) for requests that were aborted. Will not |
list[tuple[str, int]] | include any that were already finished. |
schedule ¶
Diffusion fast path: - Feed all input tokens of the request at once (if 0, allocate 1 placeholder token). - If the token budget cannot be satisfied at once, fall back to the default vLLM scheduling.
update_from_output ¶
update_from_output(
scheduler_output: SchedulerOutput,
model_runner_output: OmniModelRunnerOutput,
) -> dict[int, EngineCoreOutputs]
Update the scheduler state based on the model runner output.
This method is modified to stop the request immediately for the diffusion model.
OmniNewRequestData dataclass ¶
Bases: NewRequestData
New request data for omni models with embeddings support.
Extends NewRequestData to include additional information for direct transfer between pipeline stages.
Note: prompt_embeds is inherited from NewRequestData (torch.Tensor | None).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
external_req_id | str | None | Optional external request ID for tracking | None |
additional_information | AdditionalInformationPayload | None | Optional serialized additional information dictionary containing tensors or lists | None |
additional_information class-attribute instance-attribute ¶
additional_information: (
AdditionalInformationPayload | None
) = None
from_request classmethod ¶
from_request(
request: Request,
block_ids: tuple[list[int], ...],
prefill_token_ids: list[int] | None = None,
) -> OmniNewRequestData
Create OmniNewRequestData from a Request object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
request | Request | Request object to convert | required |
block_ids | tuple[list[int], ...] | Tuple of block ID lists for KV cache allocation | required |
prefill_token_ids | list[int] | None | Optional prefill token IDs for v2 model runner | None |
Returns:
| Type | Description |
|---|---|
OmniNewRequestData | OmniNewRequestData instance with data from the request |