vllm.model_executor
Modules:
Name | Description |
---|---|
custom_op | |
guided_decoding | |
layers | |
model_loader | |
models | |
parameter | |
pooling_metadata | |
sampling_metadata | |
utils | Utils for model executor. |
__all__
module-attribute
¶
__all__ = [
"SamplingMetadata",
"SamplingMetadataCache",
"set_random_seed",
"BasevLLMParameter",
"PackedvLLMParameter",
]
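Everything in __all__ is re-exported at the package level, so these names can be imported directly from vllm.model_executor. A minimal sketch (the seed value is arbitrary, and the exact RNGs touched by set_random_seed are an assumption based on its name):

# Names listed in __all__ are importable from the package itself.
from vllm.model_executor import (
    BasevLLMParameter,
    PackedvLLMParameter,
    SamplingMetadata,
    SamplingMetadataCache,
    set_random_seed,
)

set_random_seed(0)  # make model-executor randomness reproducible (assumed behavior)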
BasevLLMParameter
¶
Bases: Parameter
Base parameter for vLLM linear layers. Extends torch.nn.Parameter by taking in a linear weight loader. Will copy the loaded weight into the parameter when the provided weight loader is called.
Source code in vllm/model_executor/parameter.py
__init__
¶
Initialize the BasevLLMParameter
:param data: torch tensor with the parameter data
:param weight_loader: weight loader callable
:returns: a torch.nn.Parameter
Source code in vllm/model_executor/parameter.py
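A minimal sketch of the loading pattern described above. Only the data and weight_loader arguments come from the signature documented here; copy_weight_loader is a hypothetical loader written for illustration:

import torch

from vllm.model_executor.parameter import BasevLLMParameter


def copy_weight_loader(param: torch.nn.Parameter,
                       loaded_weight: torch.Tensor) -> None:
    # Hypothetical loader: copy the checkpoint tensor into the parameter in place.
    assert param.data.shape == loaded_weight.shape
    param.data.copy_(loaded_weight)


# The parameter is created with (typically empty) storage plus a loader callable;
# the loader copies the checkpoint weight into it when invoked during model loading.
weight = BasevLLMParameter(
    data=torch.empty(256, 128, dtype=torch.float16),
    weight_loader=copy_weight_loader,
)
copy_weight_loader(weight, torch.randn(256, 128, dtype=torch.float16))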
PackedvLLMParameter
¶
Bases: ModelWeightParameter
Parameter for model weights which are packed on disk. Example: GPTQ Marlin weights are int4 or int8, packed into int32. Extends ModelWeightParameter to take in the packed factor, the packed dimension, and, optionally, the Marlin tile size for Marlin kernels. Adjusts the shard_size and shard_offset for fused linear layer weight loading by accounting for packing and, optionally, the Marlin tile size.
Source code in vllm/model_executor/parameter.py
__init__
¶
__init__(
packed_factor: Union[int, Fraction],
packed_dim: int,
marlin_tile_size: Optional[int] = None,
bitblas_tile_size: Optional[int] = None,
**kwargs,
)
Source code in vllm/model_executor/parameter.py
adjust_shard_indexes_for_packing
¶
Source code in vllm/model_executor/parameter.py
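The adjustment described above can be illustrated with a small standalone sketch. This mirrors the prose description (shrink shard extents by the packed factor, then account for the Marlin tile), not the actual implementation in parameter.py; the Marlin scaling step in particular is an assumption based on the marlin_tile_size argument:

from fractions import Fraction
from typing import Optional, Union


def adjust_for_packing(shard_size: int,
                       shard_offset: int,
                       packed_factor: Union[int, Fraction],
                       marlin_tile_size: Optional[int] = None) -> tuple[int, int]:
    # Packed weights store several logical elements per physical element
    # (e.g. eight int4 values in one int32), so shard extents shrink by the
    # packed factor along the packed dimension.
    shard_size = int(shard_size // packed_factor)
    shard_offset = int(shard_offset // packed_factor)
    # Marlin kernels additionally tile the packed dimension; scaling by the
    # tile size here is an assumption for illustration.
    if marlin_tile_size is not None:
        shard_size *= marlin_tile_size
        shard_offset *= marlin_tile_size
    return shard_size, shard_offset


# int4 weights packed into int32 give a packed factor of 8.
print(adjust_for_packing(4096, 8192, packed_factor=8))  # (512, 1024)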
SamplingMetadata
¶
Metadata for input sequences. Used in sampler.
The usage is as follows:
hidden_states = execute_model(...)
logits = hidden_states[sampling_metadata.selected_token_indices]
sample(logits)
def sample(logits):
# Use categorized_sample_indices for sampling....
Parameters:
Name | Type | Description | Default |
---|---|---|---|
seq_groups | list[SequenceGroupToSample] | List of batched sequence groups. | required |
selected_token_indices | Tensor | (num_query_tokens_to_logprob). Indices to find logits from the initial model output hidden states. | required |
categorized_sample_indices | dict[SamplingType, Tensor] | SamplingType -> token indices to sample. Each token indices tensor is a 2D tensor of (num_indices, num_indices), where the first item is the sample index within the returned logits (before pruning padding) and the second item is the sample index after pruning using selected_token_indices. For example, if the returned logits are [1, 2, 3] and we select [1, 2] for sampling, the pruned logits are [2, 3]. In this case, the first tuple is [1, 2] (sampled indices within the original logits) and the second tuple is [0, 1] (sampled indices within the pruned logits). | required |
num_prompts | int | Number of prompt sequence groups in seq_groups. | required |
skip_sampler_cpu_output | bool | Indicates if we want to skip the GPU=>CPU serialization of token outputs. | False |
reuse_sampling_tensors | bool | Indicates if we want to reuse sampling tensors that are part of the sampler forward pass. Currently, it is mainly used for multi-step decode. | False |
Source code in vllm/model_executor/sampling_metadata.py
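The pruning behaviour of selected_token_indices can be reproduced with plain tensors, following the [1, 2, 3] example from the description above:

import torch

# "Returned logits" from the docstring example: three token positions.
logits = torch.tensor([1.0, 2.0, 3.0])

# Only positions 1 and 2 are selected for sampling.
selected_token_indices = torch.tensor([1, 2])

pruned_logits = logits[selected_token_indices]
print(pruned_logits.tolist())  # [2.0, 3.0]

# Within the original logits the sampled indices are [1, 2];
# within the pruned logits the same tokens sit at indices [0, 1],
# which is the pairing categorized_sample_indices records per SamplingType.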
categorized_sample_indices
instance-attribute
¶
__init__
¶
__init__(
seq_groups: list[SequenceGroupToSample],
selected_token_indices: Tensor,
categorized_sample_indices: dict[SamplingType, Tensor],
num_prompts: int,
skip_sampler_cpu_output: bool = False,
reuse_sampling_tensors: bool = False,
) -> None
Source code in vllm/model_executor/sampling_metadata.py
prepare
staticmethod
¶
prepare(
seq_group_metadata_list: list[SequenceGroupMetadata],
seq_lens: list[int],
query_lens: list[int],
device: str,
pin_memory: bool,
generators: Optional[dict[str, Generator]] = None,
cache: Optional[SamplingMetadataCache] = None,
) -> SamplingMetadata
Source code in vllm/model_executor/sampling_metadata.py
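A sketch of how a caller might wrap prepare. The arguments are assumed to come from the scheduler; only the keyword names shown in the signature above are taken from this page:

from typing import Optional

import torch

from vllm.model_executor import SamplingMetadata, SamplingMetadataCache


def build_sampling_metadata(seq_group_metadata_list,
                            seq_lens: list[int],
                            query_lens: list[int],
                            cache: Optional[SamplingMetadataCache] = None,
                            ) -> SamplingMetadata:
    # Reusing a SamplingMetadataCache across scheduler iterations avoids
    # rebuilding per-sequence-group bookkeeping from scratch each step.
    return SamplingMetadata.prepare(
        seq_group_metadata_list=seq_group_metadata_list,
        seq_lens=seq_lens,
        query_lens=query_lens,
        device="cuda" if torch.cuda.is_available() else "cpu",
        pin_memory=torch.cuda.is_available(),
        cache=cache,
    )

The returned object is then consumed exactly as in the usage snippet of the class docstring: index the hidden states with selected_token_indices and sample from the resulting logits.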
SamplingMetadataCache
¶
Used to cache SamplingMetadata objects between scheduler iterations