Sampling Parameters#

class vllm.SamplingParams(n: int = 1, best_of: int | None = None, presence_penalty: float = 0.0, frequency_penalty: float = 0.0, repetition_penalty: float = 1.0, temperature: float = 1.0, top_p: float = 1.0, top_k: int = -1, min_p: float = 0.0, seed: int | None = None, use_beam_search: bool = False, length_penalty: float = 1.0, early_stopping: bool | str = False, stop: str | List[str] | None = None, stop_token_ids: List[int] | None = None, ignore_eos: bool = False, max_tokens: int | None = 16, min_tokens: int = 0, logprobs: int | None = None, prompt_logprobs: int | None = None, detokenize: bool = True, skip_special_tokens: bool = True, spaces_between_special_tokens: bool = True, logits_processors: Any | None = None, include_stop_str_in_output: bool = False, truncate_prompt_tokens: int | None = None, output_text_buffer_length: int = 0, _all_stop_token_ids: Set[int] = <factory>)[source]#

Sampling parameters for text generation.

Overall, we follow the sampling parameters from the OpenAI text completion API (https://platform.openai.com/docs/api-reference/completions/create). In addition, we support beam search, which is not supported by OpenAI.
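
A minimal usage sketch, assuming the offline LLM entry point; the model name, prompt, and parameter values below are illustrative, not recommended defaults:

    from vllm import LLM, SamplingParams

    # Moderately random sampling with a length cap and a stop string.
    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.95,
        max_tokens=64,
        stop=["\n\n"],
    )

    llm = LLM(model="facebook/opt-125m")  # any model supported by vLLM
    outputs = llm.generate(["The capital of France is"], sampling_params)
    print(outputs[0].outputs[0].text)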

Parameters:
  • n – Number of output sequences to return for the given prompt.

  • best_of – Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width when use_beam_search is True. By default, best_of is set to n.

  • presence_penalty – Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.

  • frequency_penalty – Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.

  • repetition_penalty – Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.

  • temperature – Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.

  • top_p – Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.

  • top_k – Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.

  • min_p – Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

  • seed – Random seed to use for the generation.

  • use_beam_search – Whether to use beam search instead of sampling.

  • length_penalty – Float that penalizes sequences based on their length. Used in beam search.

  • early_stopping – Controls the stopping condition for beam search. It accepts the following values: True, where the generation stops as soon as there are best_of complete candidates; False, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates; “never”, where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).

  • stop – List of strings that stop the generation when they are generated. The returned output will not contain the stop strings.

  • stop_token_ids – List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens.

  • include_stop_str_in_output – Whether to include the stop strings in output text. Defaults to False.

  • ignore_eos – Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.

  • max_tokens – Maximum number of tokens to generate per output sequence.

  • min_tokens – Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated.

  • logprobs – Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. Note that the implementation follows the OpenAI API: The API will always return the log probability of the sampled token, so there may be up to logprobs+1 elements in the response.

  • prompt_logprobs – Number of log probabilities to return per prompt token.

  • detokenize – Whether to detokenize the output. Defaults to True.

  • skip_special_tokens – Whether to skip special tokens in the output.

  • spaces_between_special_tokens – Whether to add spaces between special tokens in the output. Defaults to True.

  • logits_processors – List of functions that modify logits based on previously generated tokens, optionally taking the prompt tokens as an additional first argument (see the sketch after this parameter list).

  • truncate_prompt_tokens – If set to an integer k, will use only the last k tokens from the prompt (i.e., left truncation). Defaults to None (i.e., no truncation).
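
To illustrate the logits_processors contract described above, the sketch below suppresses a single (hypothetical) token id by pushing its logit to -inf. The two-argument form (generated token ids, logits) is assumed here; the optional three-argument form takes the prompt token ids as an additional first argument:

    from typing import List

    import torch
    from vllm import SamplingParams

    BANNED_TOKEN_ID = 42  # hypothetical token id to suppress

    def ban_token(generated_token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
        # Called once per decoding step with the logits for the next token;
        # must return a tensor of the same shape.
        logits[BANNED_TOKEN_ID] = float("-inf")
        return logits

    params = SamplingParams(temperature=0.7, logits_processors=[ban_token])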

clone() → SamplingParams[source]#

Deep copy excluding LogitsProcessor objects.

LogitsProcessor objects are excluded because they may contain an arbitrary, nontrivial amount of data. See vllm-project/vllm#3087
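
A small sketch of the intended behavior; the shared-reference assertion reflects the “excluding LogitsProcessor objects” note above and is an assumption about the copy semantics rather than a guarantee:

    from vllm import SamplingParams

    def noop_processor(token_ids, logits):
        return logits  # no-op logits processor, for illustration only

    params = SamplingParams(temperature=0.8, logits_processors=[noop_processor])
    copied = params.clone()

    # Ordinary fields are deep-copied, so the clone can diverge safely.
    copied.temperature = 0.0
    assert params.temperature == 0.8

    # LogitsProcessor objects are excluded from the deep copy, i.e. the clone
    # is assumed to reference the same processor objects.
    assert copied.logits_processors[0] is noop_processor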

update_from_generation_config(generation_config: Dict[str, Any], model_eos_token_id: int | None = None) → None[source]#

Update the sampling parameters with any non-default values from generation_config.
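
A hedged sketch of how this might be invoked, assuming a dictionary in the style of a model's generation_config.json; which keys the method actually consumes is implementation-defined, and the values shown are illustrative:

    from vllm import SamplingParams

    params = SamplingParams(max_tokens=128)

    # Illustrative generation_config contents; in practice this would be
    # loaded from the model's generation_config.json.
    generation_config = {"eos_token_id": [2, 32000]}

    # Non-default values recognized by the method are merged into params in
    # place; passing the model's primary EOS token id is assumed to help fold
    # extra EOS ids into stop handling.
    params.update_from_generation_config(generation_config, model_eos_token_id=2)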