Skip to content

vllm_omni.core.sched

Scheduling components for vLLM-Omni.

Modules:

Name Description
omni_ar_scheduler
omni_generation_scheduler
omni_scheduler_mixin
omni_scheduling_coordinator

Scheduling-side coordination for chunk and full_payload input waiting.

output
utils

Shared utilities for omni schedulers.

OmniARAsyncScheduler

Bases: OmniARScheduler, AsyncScheduler

Asynchronous AutoRegressive scheduler.

OmniARScheduler

Bases: OmniSchedulerMixin, Scheduler

Synchronous AutoRegressive scheduler for vLLM-Omni. This class is also used as a base class for the OmniARAsyncScheduler and holds most of the core scheduling logic.

active_kv_transfers instance-attribute

active_kv_transfers: set[str] = set()

chunk_transfer_adapter instance-attribute

chunk_transfer_adapter = None

finished_req_ids_dict instance-attribute

finished_req_ids_dict = defaultdict(set)

input_coordinator instance-attribute

input_coordinator: OmniSchedulingCoordinator | None = None

kv_transfer_criteria instance-attribute

kv_transfer_criteria = _get_kv_transfer_criteria()

pending_stop_after_extraction instance-attribute

pending_stop_after_extraction: set[str] = set()

requests_needing_kv_transfer instance-attribute

requests_needing_kv_transfer: dict[str, dict[str, Any]] = {}

transfer_triggered_requests instance-attribute

transfer_triggered_requests: set[str] = set()

waiting_for_transfer_free instance-attribute

waiting_for_transfer_free: set[str] = set()

finish_requests

finish_requests(
    request_ids: Any, finished_status: RequestStatus
) -> list[tuple[str, int]]

Handles the finish signal from outside the scheduler.

For example, the API server can abort a request when the client disconnects.

If request_ids is None, all requests will be finished.

Returns:

Type Description
list[tuple[str, int]]

Tuple of (req_id, client_index) for requests that were aborted. Will not

list[tuple[str, int]]

include any that were already finished.

get_finished_requests_needing_kv_transfer

get_finished_requests_needing_kv_transfer() -> dict[
    str, dict
]

Get and clear the list of requests needing KV cache transfer. Returns dict: {req_id: {"seq_len": int, "block_ids": list[int]}}

has_finished_requests

has_finished_requests() -> bool

Check if there are any finished requests (including those needing KV transfer).

has_requests

has_requests() -> bool

Check if there are any requests to process, including KV transfers.

has_unfinished_requests

has_unfinished_requests() -> bool

Check if there are any unfinished requests (including those needing KV transfer).

schedule

schedule() -> SchedulerOutput

update_from_output

update_from_output(
    scheduler_output: SchedulerOutput,
    model_runner_output: ModelRunnerOutput,
) -> dict[int, EngineCoreOutputs]

OmniGenerationScheduler

Bases: OmniSchedulerMixin, Scheduler

chunk_transfer_adapter instance-attribute

chunk_transfer_adapter = None

input_coordinator instance-attribute

input_coordinator: OmniSchedulingCoordinator | None = None

finish_requests

finish_requests(
    request_ids, finished_status: RequestStatus
) -> list[tuple[str, int]]

Handles the finish signal from outside the scheduler.

For example, the API server can abort a request when the client disconnects.

If request_ids is None, all requests will be finished.

Returns:

Type Description
list[tuple[str, int]]

Tuple of (req_id, client_index) for requests that were aborted. Will not

list[tuple[str, int]]

include any that were already finished.

schedule

schedule() -> SchedulerOutput

Diffusion fast path: - Feed all input tokens of the request at once (if 0, allocate 1 placeholder token). - If the token budget cannot be satisfied at once, fall back to the default vLLM scheduling.

update_from_output

update_from_output(
    scheduler_output: SchedulerOutput,
    model_runner_output: OmniModelRunnerOutput,
) -> dict[int, EngineCoreOutputs]

Update the scheduler state based on the model runner output.

This method is modified to stop the request immediately for the diffusion model.

OmniNewRequestData dataclass

Bases: NewRequestData

New request data for omni models with embeddings support.

Extends NewRequestData to include additional information for direct transfer between pipeline stages.

Note: prompt_embeds is inherited from NewRequestData (torch.Tensor | None).

Parameters:

Name Type Description Default
external_req_id str | None

Optional external request ID for tracking

None
additional_information AdditionalInformationPayload | None

Optional serialized additional information dictionary containing tensors or lists

None

additional_information class-attribute instance-attribute

additional_information: (
    AdditionalInformationPayload | None
) = None

external_req_id class-attribute instance-attribute

external_req_id: str | None = None

from_request classmethod

from_request(
    request: Request,
    block_ids: tuple[list[int], ...],
    prefill_token_ids: list[int] | None = None,
) -> OmniNewRequestData

Create OmniNewRequestData from a Request object.

Parameters:

Name Type Description Default
request Request

Request object to convert

required
block_ids tuple[list[int], ...]

Tuple of block ID lists for KV cache allocation

required
prefill_token_ids list[int] | None

Optional prefill token IDs for v2 model runner

None

Returns:

Type Description
OmniNewRequestData

OmniNewRequestData instance with data from the request