vllm_omni.worker.gpu_model_runner

logger module-attribute

logger = init_logger(__name__)

xgr module-attribute

xgr = LazyLoader('xgr', globals(), 'xgrammar')

xgr_torch_compile module-attribute

xgr_torch_compile = LazyLoader(
    "xgr_torch_compile",
    globals(),
    "xgrammar.kernels.apply_token_bitmask_inplace_torch_compile",
)

OmniGPUModelRunner

Bases: GPUModelRunner

model_intermediate_buffer instance-attribute

model_intermediate_buffer: dict[str, dict[str, Any]] = {}

omni_prefix_cache instance-attribute

omni_prefix_cache = None

extract_multimodal_outputs

extract_multimodal_outputs(
    hidden_states: Tensor | list[Tensor] | OmniOutput,
) -> dict

initialize_metadata_builders

initialize_metadata_builders(
    kv_cache_config, kernel_block_sizes
)

Override to fix the scheduler_metadata buffer size when FA3 is used with CUDA graphs.

The upstream FlashAttentionMetadataBuilder pre-allocates scheduler_metadata with (max_num_seqs + 1) entries, but FA3's get_scheduler_metadata() can return up to (max_num_seqs * max_num_splits + 1) entries, which causes a RuntimeError during CUDA graph capture. After calling the parent implementation, this override resizes any buffers that are too small.
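
The size mismatch described above can be illustrated with a small, self-contained sketch. It is not the runner's actual code: the helper names `required_scheduler_metadata_len` and `grow_if_too_small` are hypothetical, the buffer is modelled as a plain Python list rather than a GPU tensor, and the values 256 and 4 are assumed for illustration only.

```python
def required_scheduler_metadata_len(max_num_seqs: int, max_num_splits: int) -> int:
    """Upper bound quoted in the docstring: max_num_seqs * max_num_splits + 1."""
    return max_num_seqs * max_num_splits + 1


def grow_if_too_small(buffer: list[int], required_len: int) -> list[int]:
    """Return a buffer with at least required_len entries (hypothetical helper).

    A buffer that is already large enough is returned unchanged; otherwise a
    new, zero-padded buffer of exactly required_len entries is returned.
    """
    if len(buffer) >= required_len:
        return buffer
    return buffer + [0] * (required_len - len(buffer))


# Assumed example values, not defaults taken from the library.
max_num_seqs, max_num_splits = 256, 4

# Size of the upstream pre-allocation: max_num_seqs + 1 entries.
upstream = [0] * (max_num_seqs + 1)

# Worst-case size that FA3 may need, per the docstring.
needed = required_scheduler_metadata_len(max_num_seqs, max_num_splits)

# Resize only when the pre-allocated buffer is too small.
resized = grow_if_too_small(upstream, needed)
```

With these assumed values, the upstream buffer holds 257 entries while the worst case needs 1025, so the buffer is grown. When max_num_splits is 1 the two sizes coincide and the buffer is left alone.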

load_model

load_model(*args, **kwargs) -> None