vllm.v1.kv_cache_interface
AttentionSpec
dataclass
Bases: KVCacheSpec
Source code in vllm/v1/kv_cache_interface.py
FullAttentionSpec
dataclass
Bases: AttentionSpec
sliding_window
class-attribute
instance-attribute
When the hybrid allocator is disabled and the model contains both full attention layers and sliding window attention layers, the sliding window attention layers are treated as full attention in the KV cache manager (blocks are allocated for all tokens) but are still computed as sliding window attention in the model runner. In this case, FullAttentionSpec is used and the sliding window size is recorded in this field. Defaults to None, meaning sliding window attention is not used.
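The split between allocation and computation can be sketched as follows. This is a minimal illustration, not vLLM code: `FullAttentionSpecSketch`, `blocks_allocated`, and `tokens_attended` are hypothetical names, and only `block_size` and `sliding_window` mirror fields documented on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FullAttentionSpecSketch:
    """Hypothetical stand-in for FullAttentionSpec (not the vLLM class)."""
    block_size: int
    sliding_window: Optional[int] = None  # None: no sliding window


def blocks_allocated(spec: FullAttentionSpecSketch, num_tokens: int) -> int:
    # Manager side: the window is ignored and every token gets a slot,
    # so the block count is ceil(num_tokens / block_size).
    return -(-num_tokens // spec.block_size)


def tokens_attended(spec: FullAttentionSpecSketch, num_tokens: int) -> int:
    # Runner side: attention is still limited to the recorded window.
    if spec.sliding_window is None:
        return num_tokens
    return min(num_tokens, spec.sliding_window)


spec = FullAttentionSpecSketch(block_size=16, sliding_window=128)
print(blocks_allocated(spec, 1000), tokens_attended(spec, 1000))
```

The point of the sketch is that the recorded window changes what is attended to, not how many blocks are reserved.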
__init__
__init__(
block_size: int,
num_kv_heads: int,
head_size: int,
dtype: dtype,
use_mla: bool,
sliding_window: Optional[int] = None,
) -> None
max_memory_usage_bytes
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
merge
classmethod
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
KVCacheConfig
dataclass
The KV cache configuration of a model.
num_blocks
instance-attribute
num_blocks: int
The number of KV cache blocks.
tensors
instance-attribute
tensors: dict[str, KVCacheTensor]
Maps each layer name to a description of how to initialize the KV cache for that layer.
kv_cache_groups
instance-attribute
kv_cache_groups: list[KVCacheGroupSpec]
The KV cache groups of the model. The layers in a model usually repeat in a pattern. For example, a model with 10 full attention layers and 20 sliding window attention layers can be regarded as repeating the pattern (1 * full, 2 * sw) 10 times. The KVCacheManager allocates a separate block table for each of the 3 layers in the pattern and reuses each of them 10 times to produce the block tables for all 30 layers in the model. The layers can therefore be divided into 3 groups of 10 layers each. The KVCacheManager allocates the block table for each group based on its KV cache spec, and the model runner applies that block table to every layer in the group.
For example:
1. A model that uses only full attention. The pattern is (num_hidden_layers * full), so there is only one group and the block table is shared by all layers.
2. (WIP) A model with 10 full attention layers and 20 sliding window attention layers. The pattern (1 * full, 2 * sw) has 3 layers, so there are 3 groups, each of which represents 10 layers in the model.
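The grouping idea can be illustrated with a small sketch. It assumes that layers are assigned to groups by their position within the repeating pattern (index modulo pattern length), which is one reading of the description above. `group_layers_by_pattern` and the `layers.N` names are hypothetical and not part of vLLM.

```python
def group_layers_by_pattern(layer_names: list[str],
                            pattern_size: int) -> list[list[str]]:
    """Assign each layer to the group of its position in the pattern."""
    if pattern_size <= 0 or len(layer_names) % pattern_size != 0:
        raise ValueError("layer count must be a multiple of the pattern size")
    groups: list[list[str]] = [[] for _ in range(pattern_size)]
    for index, name in enumerate(layer_names):
        # Layers at the same offset in each repetition share a block table.
        groups[index % pattern_size].append(name)
    return groups


layer_names = [f"layers.{i}" for i in range(30)]

# Pattern (1 * full, 2 * sw): 3 groups of 10 layers each.
hybrid_groups = group_layers_by_pattern(layer_names, pattern_size=3)

# Full attention only: a single group shared by all layers
# (modeled here, as an assumption, with a pattern size of 1).
full_groups = group_layers_by_pattern(layer_names, pattern_size=1)

print([len(g) for g in hybrid_groups], [len(g) for g in full_groups])
```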
__init__
__init__(
num_blocks: int,
tensors: dict[str, KVCacheTensor],
kv_cache_groups: list[KVCacheGroupSpec],
) -> None
KVCacheGroupSpec
dataclass
Represents a group of model layers that share the same KV cache block table. These layers are regarded as one layer in the KV cache manager.
KVCacheSpec
dataclass
A base class for specifying the KV cache format of one layer.
type_id
property
type_id: str
The type identifier of this KV cache. Returns different strings for layers with different KV cache types, e.g., layers that cache a different number of tokens (full attention vs. sliding window attention) or that have a different KV cache size per token (layers with a different number of heads).
Returns:
| Type | Description |
|---|---|
| str | The type identifier of this KV cache. |
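As a rough illustration of what such an identifier could look like, the sketch below builds a string from the fields that distinguish one cache type from another. The class `TypedSpecSketch`, its `kind` field, and the string format are assumptions made for this example. The actual format used by vLLM is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedSpecSketch:
    """Hypothetical spec; the fields are chosen for illustration only."""
    kind: str          # e.g. "full_attention" or "sliding_window"
    block_size: int
    num_kv_heads: int
    head_size: int

    @property
    def type_id(self) -> str:
        # Any field that changes the cache layout is folded into the id,
        # so layers with different layouts never compare as the same type.
        return (f"{self.kind}_{self.block_size}_"
                f"{self.num_kv_heads}_{self.head_size}")


full = TypedSpecSketch("full_attention", 16, 8, 128)
sliding = TypedSpecSketch("sliding_window", 16, 8, 128)
fewer_heads = TypedSpecSketch("full_attention", 16, 4, 128)
print(full.type_id, sliding.type_id, fewer_heads.type_id)
```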
max_memory_usage_bytes
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
The maximum possible memory usage of this KV cache in bytes.
Returns:
| Type | Description |
|---|---|
| int | The KV cache size in bytes. |
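One way to reason about such an upper bound for a full attention layer is sketched below. It assumes that both a key and a value vector are stored per token and per KV head, and that the maximum sequence length bounds the number of cached tokens. The helper names and the `max_model_len` and `dtype_size_bytes` parameters are hypothetical, and the actual computation in vLLM may differ.

```python
def page_size_bytes(block_size: int, num_kv_heads: int,
                    head_size: int, dtype_size_bytes: int) -> int:
    # Factor 2: one key and one value vector per token and head (assumption).
    return 2 * block_size * num_kv_heads * head_size * dtype_size_bytes


def full_attention_max_bytes(max_model_len: int, block_size: int,
                             num_kv_heads: int, head_size: int,
                             dtype_size_bytes: int) -> int:
    # Blocks are allocated whole, so round the token count up.
    num_blocks = -(-max_model_len // block_size)
    return num_blocks * page_size_bytes(block_size, num_kv_heads,
                                        head_size, dtype_size_bytes)


# 16-token blocks, 8 KV heads of size 128, 2-byte elements (e.g. float16).
print(full_attention_max_bytes(4096, 16, 8, 128, 2))
```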
merge
classmethod
Merge a list of KVCacheSpec objects into a single KVCacheSpec object.
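A minimal sketch of a merge along these lines is shown below, assuming that specs can only be merged when they are identical and that the result is a copy of the first one. `LayerSpecSketch` and `merge_specs` are hypothetical names, and subclasses in vLLM may apply additional rules.

```python
import copy
from dataclasses import dataclass


@dataclass
class LayerSpecSketch:
    """Hypothetical stand-in for a KVCacheSpec subclass."""
    block_size: int
    num_kv_heads: int
    head_size: int


def merge_specs(specs: list[LayerSpecSketch]) -> LayerSpecSketch:
    if not specs:
        raise ValueError("cannot merge an empty list of specs")
    first = specs[0]
    # Dataclass equality compares all fields, so this rejects any mismatch.
    if any(spec != first for spec in specs[1:]):
        raise ValueError("all specs must be identical to be merged")
    # Return a copy so later changes do not affect the input specs.
    return copy.deepcopy(first)


a = LayerSpecSketch(16, 8, 128)
b = LayerSpecSketch(16, 8, 128)
merged = merge_specs([a, b])
print(merged)
```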
KVCacheTensor
dataclass
A dataclass for specifying how the workers should initialize the KV cache for a layer. For now, it contains only the size of the KV cache for that layer. It will be extended to support multiple layers sharing the same memory pool.
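The role of this per-layer size can be sketched as follows. The example assumes the size is given in bytes and uses a plain `bytearray` in place of a device tensor. `KVCacheTensorSketch`, `initialize_kv_caches`, and the layer names are hypothetical and only illustrate the mapping from layer name to buffer.

```python
from dataclasses import dataclass


@dataclass
class KVCacheTensorSketch:
    """Hypothetical stand-in for KVCacheTensor."""
    size: int  # assumed to be the buffer size in bytes


def initialize_kv_caches(
        tensors: dict[str, KVCacheTensorSketch]) -> dict[str, bytearray]:
    # One zero-filled buffer per layer; a real worker would allocate
    # device memory instead of a bytearray.
    return {name: bytearray(tensor.size) for name, tensor in tensors.items()}


tensors = {
    "layers.0.attn": KVCacheTensorSketch(size=1024),
    "layers.1.attn": KVCacheTensorSketch(size=2048),
}
buffers = initialize_kv_caches(tensors)
print({name: len(buf) for name, buf in buffers.items()})
```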
SlidingWindowSpec
dataclass
Bases: AttentionSpec
__init__
__init__(
block_size: int,
num_kv_heads: int,
head_size: int,
dtype: dtype,
use_mla: bool,
sliding_window: int,
) -> None
__post_init__
max_memory_usage_bytes
max_memory_usage_bytes(vllm_config: VllmConfig) -> int