vllm_omni.worker.gpu_ar_model_runner ¶
AR GPU Model Runner for vLLM-Omni.
Exposes per-request hidden representations through ModelRunnerOutput.pooler_output, alongside the sampled tokens.
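As an illustration of what "per-request hidden representations" can mean in practice, the sketch below slices a token-major list of hidden vectors into one chunk per request. The helper name `split_per_request` and the plain-list representation are assumptions made for this example; the actual runner works on GPU tensors and its slicing logic is not shown in this page.

```python
from itertools import accumulate


def split_per_request(hidden_states, num_tokens_per_req):
    """Slice token-major hidden vectors into one chunk per request.

    Hypothetical helper: `hidden_states` holds one vector per scheduled
    token, and `num_tokens_per_req` gives the token count of each request
    in batch order.
    """
    if sum(num_tokens_per_req) != len(hidden_states):
        raise ValueError("token counts do not match the number of hidden vectors")
    # Cumulative end offsets, e.g. [2, 1] -> [2, 3]
    ends = list(accumulate(num_tokens_per_req))
    # Each chunk starts where the previous one ended
    starts = [0] + ends[:-1]
    return [hidden_states[s:e] for s, e in zip(starts, ends)]
```

Each returned chunk would correspond to one entry of a per-request output list, in the same order as the requests in the batch.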
ExecuteModelState ¶
Bases: NamedTuple
slot_mappings class-attribute instance-attribute ¶
GPUARModelRunner ¶
Bases: OmniGPUModelRunner, OmniConnectorModelRunnerMixin
Autoregressive GPU model runner that returns hidden states per request.
Follows the v0.12 two-phase execute/sample flow from GPUModelRunner and reuses the Omni hooks for additional_information and multimodal outputs. This class only overrides sample_tokens, exposing hidden states and multimodal outputs per request while keeping async output semantics.
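The two-phase flow can be pictured with a toy runner: the first call runs the forward pass and stashes its result, and the second call consumes that state to produce tokens together with the hidden states. This is a minimal sketch under stated assumptions; `ToyExecuteState`, `ToyOutput`, and `ToyARRunner` are hypothetical names and do not mirror the real class's fields or signatures.

```python
from typing import NamedTuple, Optional


class ToyExecuteState(NamedTuple):
    """Hypothetical state carried from the execute phase to the sample phase."""
    req_ids: list
    hidden_states: list  # one vector per request


class ToyOutput(NamedTuple):
    """Hypothetical output holding tokens and per-request hidden states."""
    req_ids: list
    sampled_token_ids: list
    pooler_output: list


class ToyARRunner:
    def __init__(self) -> None:
        self._state: Optional[ToyExecuteState] = None

    def execute_model(self, req_ids):
        # Phase 1: stand-in "forward pass"; the result is stashed, not returned.
        hidden = [[float(len(r)), 1.0] for r in req_ids]
        self._state = ToyExecuteState(list(req_ids), hidden)
        return None

    def sample_tokens(self) -> ToyOutput:
        # Phase 2: consume the stashed state exactly once.
        if self._state is None:
            raise RuntimeError("execute_model must run before sample_tokens")
        state, self._state = self._state, None
        # Stand-in "sampling": derive one token id from each hidden vector.
        tokens = [int(sum(h)) for h in state.hidden_states]
        return ToyOutput(state.req_ids, tokens, state.hidden_states)
```

Splitting the step this way lets the caller do other work, such as preparing grammar constraints, between the forward pass and sampling.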
input_ids instance-attribute ¶
inputs_embeds instance-attribute ¶
inputs_embeds = self._make_buffer(
    self.max_num_tokens,
    self.hidden_size,
    dtype=self.dtype,
    numpy=False,
)
kv_transfer_manager instance-attribute ¶
kv_transfer_manager = (
    OmniKVTransferManager.from_vllm_config(
        self.vllm_config, self.model_config
    )
)
execute_model ¶
execute_model(
    scheduler_output: SchedulerOutput,
    intermediate_tensors: IntermediateTensors | None = None,
) -> (
    OmniModelRunnerOutput
    | AsyncModelRunnerOutput
    | IntermediateTensors
    | None
)
sample_tokens ¶
sample_tokens(
    grammar_output: GrammarOutput | None,
) -> (
    OmniModelRunnerOutput
    | AsyncModelRunnerOutput
    | IntermediateTensors
)
shutdown ¶
Release omni-specific GPU resources before upstream shutdown.
Order of operations (must match upstream's expectation):
1. Unfreeze Python GC so model weights are collected immediately when self.model is set to None (upstream Worker.init_device calls gc.freeze() / freeze_gc_heap()).
2. Destroy omni-specific CUDA graphs (talker MTP) so references to model parameters are released before self.model = None.
3. Clear GPU-side buffers (input_ids, inputs_embeds) and per-request caches that may hold GPU tensor references.
4. Call CUDAGraphWrapper.clear_all_graphs() unconditionally (not just on ROCm) to ensure all CUDA graphs including talker MTP are released before model weight teardown.
5. Call BreakableCUDAGraphWrapper.clear_all_graphs() as well, to match the upstream ROCm-only pattern but also protect CUDA.
6. Delegate to upstream GPUModelRunner.shutdown() which sets self.model = None, clears KV caches, resets workspace, etc.
This prevents abrupt GPU memory release during EngineCore subprocess exit that can trigger GPU OOM signals when the parent process concurrently cleans up its own GPU state.
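The teardown order above can be sketched with a toy subclass that releases its own references before delegating to the base class. All class and attribute names here (`ToyBaseRunner`, `ToyOmniRunner`, `graphs`, `request_cache`, `log`) are hypothetical; the global graph-wrapper clears of steps 4 and 5 are represented only by a log entry, since their real behavior depends on the upstream library.

```python
import gc


class ToyBaseRunner:
    """Hypothetical stand-in for the upstream runner."""

    def __init__(self):
        self.model = object()
        self.log = []

    def shutdown(self):
        # Step 6: the base class drops the model last.
        self.log.append("base_shutdown")
        self.model = None


class ToyOmniRunner(ToyBaseRunner):
    def __init__(self):
        super().__init__()
        self.graphs = {"talker_mtp": object()}
        self.input_ids = [0] * 4
        self.inputs_embeds = [[0.0] for _ in range(4)]
        self.request_cache = {"req-0": object()}

    def shutdown(self):
        # Step 1: unfreeze the GC so dropped objects can be collected.
        gc.unfreeze()
        self.log.append("gc_unfreeze")
        # Step 2: drop subclass-owned graphs that reference the model.
        self.graphs.clear()
        self.log.append("destroy_graphs")
        # Step 3: clear buffers and per-request caches.
        self.input_ids = None
        self.inputs_embeds = None
        self.request_cache.clear()
        self.log.append("clear_buffers")
        # Steps 4-5: placeholder for the global graph-wrapper clears.
        self.log.append("clear_all_graphs")
        # Step 6: delegate to the base class, which sets model to None.
        super().shutdown()
```

The point of the ordering is that every reference owned by the subclass is gone before the base class drops the model, so the weights can be freed in one controlled step.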