vllm_omni.worker.gpu_model_runner
xgr_torch_compile module-attribute
xgr_torch_compile = LazyLoader(
    "xgr_torch_compile",
    globals(),
    "xgrammar.kernels.apply_token_bitmask_inplace_torch_compile",
)
OmniGPUModelRunner
Bases: GPUModelRunner
model_intermediate_buffer instance-attribute
extract_multimodal_outputs
extract_multimodal_outputs(
    hidden_states: Tensor | list[Tensor] | OmniOutput,
) -> dict
initialize_metadata_builders
Overrides the parent method to fix the scheduler_metadata buffer size when FA3 is used with CUDA graphs.
The upstream FlashAttentionMetadataBuilder pre-allocates scheduler_metadata with (max_num_seqs + 1) entries, but FA3's get_scheduler_metadata() can return up to (max_num_seqs * max_num_splits + 1) entries, which causes a RuntimeError during CUDA graph capture. After calling the parent implementation, this override resizes any buffers that are too small.
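The sizing rule described above can be illustrated with a minimal, pure-Python sketch. It is not the vllm_omni implementation: `FakeBuilder`, `required_scheduler_metadata_len`, and `resize_if_too_small` are hypothetical names introduced here, a plain list stands in for the real tensor buffer, and the values 256 and 32 are arbitrary examples.

```python
from dataclasses import dataclass
from typing import List, Optional


def required_scheduler_metadata_len(max_num_seqs: int, max_num_splits: int) -> int:
    """Upper bound on entries, as stated in the description above."""
    return max_num_seqs * max_num_splits + 1


@dataclass
class FakeBuilder:
    """Hypothetical stand-in for a metadata builder (not the vLLM class)."""
    scheduler_metadata: Optional[List[int]]
    max_num_splits: int


def resize_if_too_small(builder: FakeBuilder, max_num_seqs: int) -> bool:
    """Grow the buffer if it is below the bound; return True if it was resized."""
    buf = builder.scheduler_metadata
    if buf is None:
        # Nothing was pre-allocated, so there is nothing to resize.
        return False
    needed = required_scheduler_metadata_len(max_num_seqs, builder.max_num_splits)
    if len(buf) >= needed:
        return False
    # Replace the undersized buffer with a zero-filled one of the needed length.
    builder.scheduler_metadata = [0] * needed
    return True


# Buffer sized the "upstream" way: max_num_seqs + 1 entries.
builder = FakeBuilder(scheduler_metadata=[0] * (256 + 1), max_num_splits=32)
resized = resize_if_too_small(builder, max_num_seqs=256)
```

In this sketch the check runs once after the buffers already exist, mirroring the "call the parent, then resize" order described above. A second call on the same builder is a no-op because the buffer already meets the bound.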