vllm_omni.worker.gpu_model_runner ¶

logger `module-attribute` ¶

logger = init_logger(__name__)

xgr `module-attribute` ¶

xgr = LazyLoader('xgr', globals(), 'xgrammar')

xgr_torch_compile `module-attribute` ¶

xgr_torch_compile = LazyLoader(
    "xgr_torch_compile",
    globals(),
    "xgrammar.kernels.apply_token_bitmask_inplace_torch_compile",
)

OmniGPUModelRunner ¶

Bases: GPUModelRunner

model_intermediate_buffer `instance-attribute` ¶

model_intermediate_buffer: dict[str, dict[str, Any]] = {}

omni_prefix_cache `instance-attribute` ¶

omni_prefix_cache = None

extract_multimodal_outputs ¶

extract_multimodal_outputs(
    hidden_states: Tensor | list[Tensor] | OmniOutput,
) -> dict

initialize_metadata_builders ¶

initialize_metadata_builders(
    kv_cache_config, kernel_block_sizes
)

Initialize metadata builders and keep the FA3 graph metadata buffers correctly sized.

FlashAttentionMetadataBuilder can pre-allocate scheduler_metadata for only max_num_seqs + 1 entries, whereas FA3 with split scheduling may need max_num_seqs * max_num_splits + 1 entries during CUDA graph capture. Because this runner is shared across Omni models, it preserves the existing workaround for non-Higgs models that still use FA3.
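
The gap between the two buffer sizes can be illustrated with a small arithmetic sketch. This is not the runner's code: the helper name `scheduler_metadata_entries` and the values 256 and 32 are assumptions chosen for illustration, and only the two formulas come from the description above.

```python
def scheduler_metadata_entries(max_num_seqs: int, max_num_splits: int = 1) -> int:
    """Return max_num_seqs * max_num_splits + 1 (hypothetical helper).

    With max_num_splits == 1 this reduces to max_num_seqs + 1, the size
    the description above says may be pre-allocated by default.
    """
    if max_num_seqs < 0 or max_num_splits < 1:
        raise ValueError("expected max_num_seqs >= 0 and max_num_splits >= 1")
    return max_num_seqs * max_num_splits + 1


# Illustrative values only; real settings depend on the deployment.
default_entries = scheduler_metadata_entries(256)
split_entries = scheduler_metadata_entries(256, max_num_splits=32)
shortfall = split_entries - default_entries

print(default_entries, split_entries, shortfall)
```

Under these assumed values, a buffer sized by the default formula would be thousands of entries short of what split scheduling may need during graph capture, which is the situation the workaround addresses.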

load_model ¶

load_model(*args, **kwargs) -> None