vllm_omni.worker.gpu_model_runner ¶
xgr_torch_compile module-attribute ¶
xgr_torch_compile = LazyLoader(
    "xgr_torch_compile",
    globals(),
    "xgrammar.kernels.apply_token_bitmask_inplace_torch_compile",
)
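For readers unfamiliar with the pattern, the following is a minimal sketch of a deferred-import proxy that takes the same three arguments as the call above (local name, parent globals, dotted module name). It is a generic illustration, not the LazyLoader class used by this module: the class name LazyLoaderSketch and the use of the standard-library json module as a stand-in target are assumptions made for the example.

```python
import importlib
import types


class LazyLoaderSketch(types.ModuleType):
    """Hypothetical stand-in: imports the real module on first attribute access."""

    def __init__(self, local_name: str, parent_globals: dict, name: str):
        super().__init__(name)
        self._local_name = local_name
        self._parent_globals = parent_globals

    def _load(self) -> types.ModuleType:
        # Import the real module and rebind the parent's name to it,
        # so later lookups bypass this proxy entirely.
        module = importlib.import_module(self.__name__)
        self._parent_globals[self._local_name] = module
        self.__dict__.update(module.__dict__)
        return module

    def __getattr__(self, item: str):
        # Only reached when normal lookup fails, i.e. before the first load.
        return getattr(self._load(), item)


# Usage with a stdlib module standing in for the xgrammar kernel module.
demo_globals: dict = {}
lazy_json = LazyLoaderSketch("lazy_json", demo_globals, "json")
encoded = lazy_json.dumps([1, 2])  # triggers the actual import
```

The point of the pattern is that an optional or expensive dependency is only imported if code actually touches it.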
OmniGPUModelRunner ¶
Bases: GPUModelRunner
model_intermediate_buffer instance-attribute ¶
extract_multimodal_outputs ¶
extract_multimodal_outputs(
    hidden_states: Tensor | list[Tensor] | OmniOutput,
) -> dict
initialize_metadata_builders ¶
Initialize metadata builders and keep FA3 graph metadata buffers correctly sized.
FlashAttentionMetadataBuilder can pre-allocate scheduler_metadata for only max_num_seqs + 1 entries, whereas FA3 with split scheduling may need max_num_seqs * max_num_splits + 1 entries during CUDA graph capture. Because this runner is shared across Omni models, it preserves the existing workaround for non-Higgs models that still use FA3.
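The size gap described above can be made concrete with a small calculation. This is only a sketch of the two formulas quoted in the text: the helper name scheduler_metadata_entries and the sample values (256 sequences, 32 splits) are hypothetical and are not taken from the runner's code or configuration.

```python
def scheduler_metadata_entries(max_num_seqs: int, max_num_splits: int = 1) -> int:
    """Return max_num_seqs * max_num_splits + 1, the entry count quoted in the text.

    With max_num_splits == 1 this reduces to the default max_num_seqs + 1.
    """
    if max_num_seqs < 0 or max_num_splits < 1:
        raise ValueError("max_num_seqs must be >= 0 and max_num_splits >= 1")
    return max_num_seqs * max_num_splits + 1


# Illustrative values only (assumed, not read from any real config).
default_entries = scheduler_metadata_entries(256)    # default pre-allocation
split_entries = scheduler_metadata_entries(256, 32)  # FA3 with split scheduling
shortfall = split_entries - default_entries          # entries the default buffer lacks
```

Under these assumed values the default buffer would be far smaller than what split scheduling may index during graph capture, which is the gap the workaround is meant to close.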