
vllm_omni.worker.gpu_ar_model_runner

AR GPU Model Runner for vLLM-Omni.

Exposes per-request hidden representations via ModelRunnerOutput.pooler_output, in addition to returning the sampled tokens.

logger module-attribute

logger = init_logger(__name__)

ExecuteModelState

Bases: NamedTuple

aux_hidden_states instance-attribute

aux_hidden_states: list[Tensor] | None

cudagraph_stats instance-attribute

cudagraph_stats: Any

ec_connector_output instance-attribute

ec_connector_output: Any

hidden_states instance-attribute

hidden_states: Tensor

hidden_states_cpu instance-attribute

hidden_states_cpu: Tensor | None

logits instance-attribute

logits: Tensor | None

multimodal_outputs instance-attribute

multimodal_outputs: Any

sample_hidden_states instance-attribute

sample_hidden_states: Tensor

scheduler_output instance-attribute

scheduler_output: SchedulerOutput

slot_mappings class-attribute instance-attribute

slot_mappings: (
    dict[str, Tensor] | list[dict[str, Tensor]] | None
) = None

spec_decode_common_attn_metadata instance-attribute

spec_decode_common_attn_metadata: Any

spec_decode_metadata instance-attribute

spec_decode_metadata: Any
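Because ExecuteModelState is a NamedTuple, its fields are read by name and the tuple is immutable. A NamedTuple also requires defaulted fields to come after all non-defaulted ones, so slot_mappings (the only field with a default) must be declared last in the source; the alphabetical listing above is not the declaration order. The sketch below illustrates this with a hypothetical, reduced stand-in (MiniExecuteState is not part of vllm_omni, and plain Python lists replace tensors):

```python
from typing import Any, Dict, List, NamedTuple, Optional


class MiniExecuteState(NamedTuple):
    """Hypothetical, reduced stand-in for ExecuteModelState."""

    scheduler_output: Any
    hidden_states: List[float]
    logits: Optional[List[float]]
    # The defaulted field has to follow every non-defaulted field.
    slot_mappings: Optional[Dict[str, List[int]]] = None


# slot_mappings may be omitted because it has a default.
state = MiniExecuteState(
    scheduler_output="step-0",
    hidden_states=[0.1, 0.2],
    logits=None,
)

# NamedTuples are immutable; _replace returns a modified copy.
updated = state._replace(slot_mappings={"layer.0": [3, 4]})
```

Here state.slot_mappings stays None, while updated carries the new mapping.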

GPUARModelRunner

Bases: OmniGPUModelRunner, OmniConnectorModelRunnerMixin

Autoregressive GPU model runner that returns hidden states per request.

Follows the v0.12 two-phase execute/sample flow from GPUModelRunner, and reuses Omni hooks for additional_information / multimodal outputs. This class only overrides sample_tokens to expose hidden states and multimodal outputs per request, while keeping async output semantics.
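The two-phase flow described above can be pictured with a toy runner. This is a minimal sketch, not the actual implementation: ToyRunner, ToyState, and the dict returned by sample_tokens are hypothetical stand-ins, plain lists replace tensors, and it assumes that the execute phase stashes its results and returns None so that the sample phase can consume them afterwards.

```python
from typing import NamedTuple, Optional


class ToyState(NamedTuple):
    """Hypothetical state carried from the execute phase to the sample phase."""

    req_ids: list
    num_tokens: list     # scheduled tokens per request
    hidden_states: list  # one row per scheduled token, requests concatenated
    logits: list         # one row of vocabulary scores per request


class ToyRunner:
    def __init__(self) -> None:
        self._state: Optional[ToyState] = None

    def execute_model(self, req_ids, num_tokens, hidden_states, logits) -> None:
        # Phase 1: keep the forward-pass results; nothing is sampled yet.
        self._state = ToyState(req_ids, num_tokens, hidden_states, logits)
        return None

    def sample_tokens(self) -> dict:
        # Phase 2: consume the stashed state exactly once.
        state = self._state
        if state is None:
            raise RuntimeError("sample_tokens called before execute_model")
        self._state = None

        # Greedy sampling: index of the largest logit for each request.
        sampled = {
            rid: max(range(len(row)), key=row.__getitem__)
            for rid, row in zip(state.req_ids, state.logits)
        }

        # Split the flat hidden-state rows back into per-request slices.
        per_request = {}
        start = 0
        for rid, n in zip(state.req_ids, state.num_tokens):
            per_request[rid] = state.hidden_states[start:start + n]
            start += n

        # Stand-in for an output object that carries both results.
        return {"sampled_token_ids": sampled, "pooler_output": per_request}


runner = ToyRunner()
runner.execute_model(
    req_ids=["a", "b"],
    num_tokens=[2, 1],
    hidden_states=[[0.1], [0.2], [0.3]],
    logits=[[0.0, 1.5, 0.2], [2.0, 0.1, 0.3]],
)
output = runner.sample_tokens()
```

Request "a" gets the first two hidden-state rows and request "b" the third, alongside one sampled token id each. Calling sample_tokens a second time without another execute_model raises RuntimeError in this sketch.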

hidden_size instance-attribute

hidden_size = self.model_config.hf_text_config.hidden_size

input_ids instance-attribute

input_ids = self._make_buffer(
    self.max_num_tokens, dtype=torch.int32
)

inputs_embeds instance-attribute

inputs_embeds = self._make_buffer(
    self.max_num_tokens,
    self.hidden_size,
    dtype=self.dtype,
    numpy=False,
)

kv_transfer_manager instance-attribute

kv_transfer_manager = (
    OmniKVTransferManager.from_vllm_config(
        self.vllm_config, self.model_config
    )
)

capture_model

capture_model() -> int

execute_model

execute_model(
    scheduler_output: SchedulerOutput,
    intermediate_tensors: IntermediateTensors | None = None,
) -> (
    OmniModelRunnerOutput
    | AsyncModelRunnerOutput
    | IntermediateTensors
    | None
)

sample_tokens

sample_tokens(
    grammar_output: GrammarOutput | None,
) -> (
    OmniModelRunnerOutput
    | AsyncModelRunnerOutput
    | IntermediateTensors
)

shutdown

shutdown() -> None

Release omni-specific GPU resources before upstream shutdown.

Order of operations (must match upstream's expectation):

1. Unfreeze Python GC so model weights are collected immediately when self.model is set to None (upstream Worker.init_device calls gc.freeze() / freeze_gc_heap()).
2. Destroy omni-specific CUDA graphs (talker MTP) so references to model parameters are released before self.model = None.
3. Clear GPU-side buffers (input_ids, inputs_embeds) and per-request caches that may hold GPU tensor references.
4. Call CUDAGraphWrapper.clear_all_graphs() unconditionally (not just on ROCm) to ensure all CUDA graphs, including talker MTP, are released before model weight teardown.
5. Call BreakableCUDAGraphWrapper.clear_all_graphs() as well, to match the upstream ROCm-only pattern but also protect CUDA.
6. Delegate to upstream GPUModelRunner.shutdown(), which sets self.model = None, clears KV caches, resets workspace, etc.

This prevents abrupt GPU memory release during EngineCore subprocess exit that can trigger GPU OOM signals when the parent process concurrently cleans up its own GPU state.
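The teardown order can be sketched with standard-library pieces only. This is an illustrative model under stated assumptions, not the runner's code: BaseRunner and ToyOmniRunner are hypothetical classes, a plain list stands in for captured CUDA graphs (covering the graph-clearing steps in one go), and the log attribute exists only to make the ordering observable. gc.freeze(), gc.unfreeze(), and gc.get_freeze_count() are the real Python functions.

```python
import gc


class BaseRunner:
    """Hypothetical stand-in for the upstream runner."""

    def __init__(self) -> None:
        self.model = object()
        self.log = []

    def shutdown(self) -> None:
        # Upstream teardown drops the model last.
        self.model = None
        self.log.append("upstream shutdown")


class ToyOmniRunner(BaseRunner):
    def __init__(self) -> None:
        super().__init__()
        self.graphs = ["talker_mtp"]  # stand-in for captured graphs
        self.input_ids = [0] * 8
        self.inputs_embeds = [[0.0] * 4 for _ in range(8)]
        self.request_cache = {"req-0": [1.0]}

    def shutdown(self) -> None:
        # Unfreeze first so objects released below are collectable again.
        gc.unfreeze()
        self.log.append("gc unfrozen")

        # Drop graph references before the model goes away.
        self.graphs.clear()
        self.log.append("graphs cleared")

        # Release buffers and per-request caches.
        self.input_ids = None
        self.inputs_embeds = None
        self.request_cache.clear()
        self.log.append("buffers cleared")

        # Finally delegate to the base class, which sets model = None.
        super().shutdown()


gc.freeze()  # mimic a worker that froze the GC heap at startup
runner = ToyOmniRunner()
runner.shutdown()
```

After shutdown, runner.log lists the four stages in the order they ran, and gc.get_freeze_count() reports 0 because the permanent generation was emptied by gc.unfreeze().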