vllm_omni.worker.gpu_ar_model_runner ¶
AR GPU Model Runner for vLLM-Omni.
Exposes per-request hidden representations via ModelRunnerOutput.pooler_output, alongside the sampled tokens.
ExecuteModelState ¶
Bases: NamedTuple
slot_mappings class-attribute instance-attribute ¶
GPUARModelRunner ¶
Bases: OmniGPUModelRunner, OmniConnectorModelRunnerMixin
Autoregressive GPU model runner that returns hidden states per request.
Follows the v0.12 two-phase execute/sample flow from GPUModelRunner and reuses the Omni hooks for additional_information and multimodal outputs. This class only overrides sample_tokens, so that hidden states and multimodal outputs are exposed per request while the async output semantics are preserved.
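The two-phase flow described above can be illustrated with a toy stand-in. This is a minimal sketch, not the vLLM-Omni implementation: `ToyTwoPhaseRunner`, `ToyOutput`, and the fake hidden-state values are hypothetical, and it assumes that the first phase stores its results and returns None so that the second phase can attach per-request hidden states to the output.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToyOutput:
    # Hypothetical stand-in for the runner output; not the real class.
    req_ids: list[str]
    sampled_token_ids: list[list[int]]
    pooler_output: list[Any] = field(default_factory=list)


class ToyTwoPhaseRunner:
    """Toy runner: execute_model stashes state, sample_tokens consumes it."""

    def __init__(self) -> None:
        self._state: Optional[dict[str, Any]] = None

    def execute_model(self, req_ids: list[str]) -> None:
        # Phase 1: pretend to run the forward pass and keep the results.
        # The "hidden state" here is a fake one-element vector per request.
        hidden = {rid: [float(len(rid))] for rid in req_ids}
        self._state = {"req_ids": list(req_ids), "hidden": hidden}
        return None

    def sample_tokens(self, grammar_output: Any = None) -> ToyOutput:
        # Phase 2: sample (here: always token 0) and expose the hidden
        # states per request, in the same order as req_ids.
        if self._state is None:
            raise RuntimeError("execute_model must run before sample_tokens")
        state, self._state = self._state, None
        req_ids = state["req_ids"]
        return ToyOutput(
            req_ids=req_ids,
            sampled_token_ids=[[0] for _ in req_ids],
            pooler_output=[state["hidden"][rid] for rid in req_ids],
        )


runner = ToyTwoPhaseRunner()
first = runner.execute_model(["a", "bb"])
out = runner.sample_tokens(None)
```

In this sketch the stored state is cleared once it has been consumed, so calling sample_tokens twice in a row raises instead of silently reusing stale hidden states.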
inputs_embeds instance-attribute ¶
kv_transfer_manager instance-attribute ¶
kv_transfer_manager = from_vllm_config(
    vllm_config, model_config
)
execute_model ¶
execute_model(
    scheduler_output: SchedulerOutput,
    intermediate_tensors: IntermediateTensors | None = None,
) -> (
    OmniModelRunnerOutput
    | AsyncModelRunnerOutput
    | IntermediateTensors
    | None
)
sample_tokens ¶
sample_tokens(
    grammar_output: GrammarOutput | None,
) -> (
    OmniModelRunnerOutput
    | AsyncModelRunnerOutput
    | IntermediateTensors
)
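Because sample_tokens returns a union type, a caller has to branch on what it received. The following is a rough sketch using hypothetical toy classes (`ToyRunnerOutput`, `ToyAsyncOutput`, `ToyIntermediateTensors`, and the `resolve` helper are all illustrative). It assumes the async wrapper exposes a `get_output()` method that yields the final output, and that intermediate tensors carry no per-request output; verify both against the real classes before relying on this pattern.

```python
from typing import Any, Optional, Union


class ToyRunnerOutput:
    # Hypothetical stand-in for the final runner output.
    def __init__(self, pooler_output: list[Any]) -> None:
        self.pooler_output = pooler_output


class ToyAsyncOutput:
    # Hypothetical async wrapper; get_output() is an assumed method name.
    def __init__(self, inner: ToyRunnerOutput) -> None:
        self._inner = inner

    def get_output(self) -> ToyRunnerOutput:
        return self._inner


class ToyIntermediateTensors:
    # Hypothetical stand-in for tensors passed between pipeline stages.
    def __init__(self, tensors: dict[str, Any]) -> None:
        self.tensors = tensors


ToyResult = Union[ToyRunnerOutput, ToyAsyncOutput, ToyIntermediateTensors]


def resolve(result: ToyResult) -> Optional[ToyRunnerOutput]:
    """Return the final output, or None when there is nothing to read yet."""
    if isinstance(result, ToyAsyncOutput):
        # Unwrap the async wrapper to reach the final output.
        return result.get_output()
    if isinstance(result, ToyIntermediateTensors):
        # In this sketch, intermediate tensors mean no final output exists.
        return None
    return result


final = ToyRunnerOutput(pooler_output=[[0.5]])
resolved_sync = resolve(final)
resolved_async = resolve(ToyAsyncOutput(final))
resolved_mid = resolve(ToyIntermediateTensors({"hidden_states": [1.0]}))
```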