vllm_omni.worker.gpu_ar_model_runner

AR GPU Model Runner for vLLM-Omni.

Exposes per-request hidden representations via ModelRunnerOutput.pooler_output, alongside the sampled tokens.
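As a rough illustration of what "per-request hidden representations" can look like, the sketch below splits a flat batch of hidden states into one chunk per request and attaches the chunks to an output object. `ToyRunnerOutput` and `split_hidden_by_request` are hypothetical stand-ins, not vLLM-Omni APIs, and the request-contiguous token layout is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRunnerOutput:
    """Hypothetical stand-in for ModelRunnerOutput; only three fields are modelled."""
    req_ids: list[str]
    sampled_token_ids: list[list[int]]
    pooler_output: list[list[list[float]]] = field(default_factory=list)


def split_hidden_by_request(hidden, num_tokens_per_req):
    """Slice a flat [total_tokens, hidden_size] list into one chunk per request.

    Assumes tokens of the same request are stored contiguously, in request order.
    """
    chunks, start = [], 0
    for n in num_tokens_per_req:
        chunks.append(hidden[start:start + n])
        start += n
    return chunks


# Three tokens in the batch: two belong to request "a", one to request "b".
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
out = ToyRunnerOutput(
    req_ids=["a", "b"],
    sampled_token_ids=[[7], [9]],
    pooler_output=split_hidden_by_request(hidden, [2, 1]),
)
```

Here `out.pooler_output[i]` holds the hidden rows for `out.req_ids[i]`, so a consumer can read tokens and hidden states for the same request from one object.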

logger module-attribute

logger = init_logger(__name__)

ExecuteModelState

Bases: NamedTuple

aux_hidden_states instance-attribute

aux_hidden_states: list[Tensor] | None

cudagraph_stats instance-attribute

cudagraph_stats: Any

ec_connector_output instance-attribute

ec_connector_output: Any

hidden_states instance-attribute

hidden_states: Tensor

hidden_states_cpu instance-attribute

hidden_states_cpu: Tensor | None

logits instance-attribute

logits: Tensor | None

multimodal_outputs instance-attribute

multimodal_outputs: Any

sample_hidden_states instance-attribute

sample_hidden_states: Tensor

scheduler_output instance-attribute

scheduler_output: SchedulerOutput

slot_mappings class-attribute instance-attribute

slot_mappings: (
    dict[str, Tensor] | list[dict[str, Tensor]] | None
) = None

spec_decode_common_attn_metadata instance-attribute

spec_decode_common_attn_metadata: Any

spec_decode_metadata instance-attribute

spec_decode_metadata: Any
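Because ExecuteModelState is a NamedTuple, it behaves as an immutable record: fields are read by name or by unpacking, and a field with a default (such as slot_mappings) can be omitted at construction. The sketch below shows that behaviour on a reduced, hypothetical `ToyExecuteState`. The field subset, the field order, and the plain-list types are assumptions for illustration, not the real definition.

```python
from typing import Any, NamedTuple, Optional


class ToyExecuteState(NamedTuple):
    """Hypothetical, reduced stand-in for ExecuteModelState."""
    scheduler_output: Any
    hidden_states: list
    sample_hidden_states: list
    logits: Optional[list]
    # Fields with defaults must come after the fields without defaults.
    slot_mappings: Optional[dict] = None


state = ToyExecuteState(
    scheduler_output="step-0",
    hidden_states=[[1.0], [2.0]],
    sample_hidden_states=[[2.0]],
    logits=None,
)

# Positional unpacking works like any tuple.
scheduler_output, hidden_states, *_ = state

# NamedTuples are immutable; _replace returns a modified copy.
updated = state._replace(logits=[[0.5, 0.5]])
```

The original `state` is left unchanged by `_replace`, which makes a record like this safe to hand from one phase of a step to the next.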

GPUARModelRunner

Bases: OmniGPUModelRunner, OmniConnectorModelRunnerMixin

Autoregressive GPU model runner that returns hidden states per request.

Follows the v0.12 two-phase execute/sample flow of GPUModelRunner and reuses the Omni hooks for additional_information and multimodal outputs. This class overrides only sample_tokens, exposing per-request hidden states and multimodal outputs while preserving async output semantics.
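The two-phase execute/sample flow can be pictured as one call that runs the forward pass and stashes its results, followed by a second call that consumes them and builds the output. The toy class below is a minimal sketch of that pattern only. `ToyTwoPhaseRunner`, its dict-shaped result, and the fake forward pass and sampling rule are all hypothetical, and the real runner's internals may differ.

```python
from typing import NamedTuple, Optional


class _State(NamedTuple):
    """Hypothetical state carried from the execute phase to the sample phase."""
    req_ids: list
    hidden_states: list  # one row per request in this toy


class ToyTwoPhaseRunner:
    def __init__(self):
        self._state: Optional[_State] = None

    def execute_model(self, req_ids):
        # Phase 1: a fake forward pass. The "hidden state" of a request is
        # just the length of its id, to keep the example deterministic.
        hidden = [[float(len(r))] for r in req_ids]
        self._state = _State(req_ids, hidden)
        return None  # nothing to report until sampling has run

    def sample_tokens(self, grammar_output=None):
        # Phase 2: consume the stashed state exactly once.
        if self._state is None:
            raise RuntimeError("sample_tokens called before execute_model")
        state, self._state = self._state, None
        tokens = [[int(h[0])] for h in state.hidden_states]  # fake sampling
        return {
            "req_ids": state.req_ids,
            "sampled_token_ids": tokens,
            "pooler_output": state.hidden_states,
        }


runner = ToyTwoPhaseRunner()
first = runner.execute_model(["r1", "req2"])
result = runner.sample_tokens(None)
```

In this sketch `first` is `None` and `result` carries both the sampled tokens and the per-request hidden states, mirroring the idea of returning hidden states next to tokens from the sampling phase.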

hidden_size instance-attribute

hidden_size = hidden_size

input_ids instance-attribute

input_ids = _make_buffer(max_num_tokens, dtype=int32)

inputs_embeds instance-attribute

inputs_embeds = _make_buffer(
    max_num_tokens, hidden_size, dtype=dtype, numpy=False
)

kv_transfer_manager instance-attribute

kv_transfer_manager = from_vllm_config(
    vllm_config, model_config
)

capture_model

capture_model() -> int

execute_model

execute_model(
    scheduler_output: SchedulerOutput,
    intermediate_tensors: IntermediateTensors | None = None,
) -> (
    OmniModelRunnerOutput
    | AsyncModelRunnerOutput
    | IntermediateTensors
    | None
)

sample_tokens

sample_tokens(
    grammar_output: GrammarOutput | None,
) -> (
    OmniModelRunnerOutput
    | AsyncModelRunnerOutput
    | IntermediateTensors
)