vllm.forward_context
batchsize_logging_interval module-attribute

batchsize_logging_interval: float = VLLM_LOG_BATCHSIZE_INTERVAL
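The value comes from the VLLM_LOG_BATCHSIZE_INTERVAL environment variable. A minimal sketch of enabling it, assuming the variable is parsed as a float, that the default negative value disables periodic batch-size logging, and that the value is interpreted as an interval in seconds:

```python
# Sketch only: set the environment variable before importing vLLM so the
# module-level attribute picks it up. Interpreting the value as a logging
# interval in seconds is an assumption, not confirmed by this page.
import os

os.environ["VLLM_LOG_BATCHSIZE_INTERVAL"] = "30"

from vllm.forward_context import batchsize_logging_interval

print(batchsize_logging_interval)  # 30.0 if the env var was set before import
```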
DPMetadata dataclass
make staticmethod
make(
    parallel_config: ParallelConfig,
    attn_metadata: Any,
    num_tokens: int,
    num_tokens_across_dp: Optional[Tensor] = None,
) -> DPMetadata
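DPMetadata is normally constructed for you inside set_forward_context when data parallelism is enabled. A direct call might look like the sketch below, where vllm_config, attn_metadata, and num_input_tokens are assumed to already exist in the caller's scope:

```python
# Illustrative sketch, not the canonical call site: set_forward_context()
# builds DPMetadata internally when data parallelism is in use.
from vllm.forward_context import DPMetadata

dp_metadata = DPMetadata.make(
    parallel_config=vllm_config.parallel_config,  # assumed VllmConfig in scope
    attn_metadata=attn_metadata,                  # per-pass attention metadata
    num_tokens=num_input_tokens,                  # tokens scheduled on this DP rank
    # num_tokens_across_dp may be passed if the gather was already performed
)
```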
num_tokens_across_dp staticmethod
Gather the num_tokens across all DP ranks and return results in a CPU tensor of size dp_size.
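Conceptually, the gather can be pictured as in the sketch below. This is an illustrative torch.distributed snippet under an assumed CPU process group handle (dp_cpu_group), not the exact vLLM implementation:

```python
# Illustrative sketch of gathering per-rank token counts into a CPU tensor
# of size dp_size. The `dp_cpu_group` handle is an assumption here.
import torch
import torch.distributed as dist

def gather_num_tokens(num_tokens: int, dp_size: int, dp_rank: int,
                      dp_cpu_group) -> torch.Tensor:
    counts = torch.zeros(dp_size, dtype=torch.int32, device="cpu")
    counts[dp_rank] = num_tokens
    # Summing zero-filled tensors with each rank's count in its own slot
    # leaves every rank holding [n_0, n_1, ..., n_{dp_size-1}].
    dist.all_reduce(counts, group=dp_cpu_group)
    return counts
```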
ForwardContext dataclass
attn_metadata instance-attribute
attn_metadata: Union[AttentionMetadata, dict[str, AttentionMetadata]]

An AttentionMetadata for v0; for v1, a dict[str, AttentionMetadata] mapping the layer_name of each attention layer to its attention metadata. Set dynamically for each forward pass.
no_compile_layers instance-attribute

no_compile_layers: dict[str, Any]
__init__
__init__(
    no_compile_layers: dict[str, Any],
    attn_metadata: Union[
        AttentionMetadata, dict[str, AttentionMetadata]
    ],
    virtual_engine: int,
    dp_metadata: Optional[DPMetadata] = None,
    skip_cuda_graphs: bool = False,
) -> None
get_forward_context
get_forward_context() -> ForwardContext
Get the current forward context.
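A typical use is to look up the per-pass attention metadata from inside a layer's forward pass. A sketch, assuming a set_forward_context block is active and handling both the v0 single-object and v1 dict forms described above:

```python
# Sketch: reading the current forward context inside a custom layer.
# Must be called while a set_forward_context(...) block is active.
from vllm.forward_context import get_forward_context

def lookup_attn_metadata(layer_name: str):
    forward_context = get_forward_context()
    attn_metadata = forward_context.attn_metadata
    if isinstance(attn_metadata, dict):   # v1: map layer_name -> metadata
        return attn_metadata.get(layer_name)
    return attn_metadata                  # v0: a single AttentionMetadata
```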
set_forward_context
set_forward_context(
    attn_metadata: Any,
    vllm_config: VllmConfig,
    virtual_engine: int = 0,
    num_tokens: Optional[int] = None,
    num_tokens_across_dp: Optional[Tensor] = None,
    skip_cuda_graphs: bool = False,
)
A context manager that stores the current forward context, which can include attention metadata and related per-pass state. It also provides a single place to inject common logic for every model forward pass.
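A minimal sketch of wrapping a model call in the context manager, assuming model, input_ids, attn_metadata, and vllm_config already exist in scope:

```python
# Sketch: everything executed inside the `with` block (e.g. attention layers
# calling get_forward_context()) sees this attn_metadata and vllm_config.
from vllm.forward_context import set_forward_context

with set_forward_context(attn_metadata, vllm_config):
    hidden_states = model(input_ids)  # assumed model/inputs in scope
```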