Engine Arguments#

Below, you can find an explanation of every engine argument for vLLM:
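These arguments are passed on the command line when launching the vLLM server; the same options are also exposed programmatically as keyword arguments of the offline LLM class (dashes become underscores, e.g. --gpu-memory-utilization becomes gpu_memory_utilization). A minimal sketch, assuming vLLM is installed and using the small facebook/opt-125m model purely as an example:

    from vllm import LLM, SamplingParams

    # Each engine argument maps to a keyword argument:
    #   --model -> model, --dtype -> dtype, --seed -> seed, ...
    llm = LLM(model="facebook/opt-125m", dtype="auto", seed=0)

    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)

The examples under the individual arguments below follow the same pattern.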

--model <model_name_or_path>#

Name or path of the Hugging Face model to use.

--tokenizer <tokenizer_name_or_path>#

Name or path of the Hugging Face tokenizer to use.

--revision <revision>#

The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

--tokenizer-revision <revision>#

The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

--tokenizer-mode {auto,slow}#

The tokenizer mode.

  • “auto” will use the fast tokenizer if available.

  • “slow” will always use the slow tokenizer.

--trust-remote-code#

Trust remote code (custom modeling code) from Hugging Face.
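Model repositories that ship custom modeling code can only be loaded when this flag is set. A hedged sketch in the same style as above (the repository name is a placeholder, not a real model):

    from vllm import LLM

    # Placeholder repository; substitute a model whose Hugging Face repo
    # contains custom modeling code that you trust.
    llm = LLM(model="your-org/model-with-custom-code", trust_remote_code=True)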

--download-dir <directory>#

Directory to download and load the weights. Defaults to the Hugging Face cache directory.

--load-format {auto,pt,safetensors,npcache,dummy}#

The format of the model weights to load.

  • “auto” will try to load the weights in the safetensors format and fall back to the pytorch bin format if the safetensors format is not available.

  • “pt” will load the weights in the pytorch bin format.

  • “safetensors” will load the weights in the safetensors format.

  • “npcache” will load the weights in the pytorch bin format and store a numpy cache to speed up loading.

  • “dummy” will initialize the weights with random values, mainly for profiling.
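The “dummy” format is handy for profiling memory usage or parallelism settings without downloading real weights. A minimal sketch, assuming load_format is forwarded to the engine as a keyword argument like the other options:

    from vllm import LLM

    # Initialize random weights instead of downloading a checkpoint;
    # the outputs are meaningless, so use this only for profiling.
    llm = LLM(model="facebook/opt-125m", load_format="dummy")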

--dtype {auto,half,float16,bfloat16,float,float32}#

Data type for model weights and activations.

  • “auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.

  • “half” for FP16. Recommended for AWQ quantization.

  • “float16” is the same as “half”.

  • “bfloat16” for a balance between precision and range.

  • “float” is shorthand for FP32 precision.

  • “float32” for FP32 precision.
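For example, to force BF16 regardless of the precision the checkpoint was saved in (this assumes a GPU with bfloat16 support, e.g. Ampere or newer):

    from vllm import LLM

    # Run weights and activations in bfloat16; use dtype="half" on GPUs
    # without BF16 support.
    llm = LLM(model="facebook/opt-125m", dtype="bfloat16")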

--max-model-len <length>#

Model context length. If unspecified, will be automatically derived from the model config.

--worker-use-ray#

Use Ray for distributed serving. This is set automatically when more than one GPU is used.

--pipeline-parallel-size (-pp) <size>#

Number of pipeline stages.

--tensor-parallel-size (-tp) <size>#

Number of tensor parallel replicas.
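For example, to shard a model across two GPUs on a single node (this assumes two visible CUDA devices; with more than one GPU, Ray is used automatically, as described under --worker-use-ray):

    from vllm import LLM

    # Split the model across 2 GPUs with tensor parallelism.
    llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)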

--max-parallel-loading-workers <workers>#

Load the model sequentially in multiple batches to avoid CPU RAM out-of-memory errors when using tensor parallelism with large models.

--block-size {8,16,32}#

Token block size for contiguous chunks of tokens.

--enable-prefix-caching#

Enable automatic prefix caching.

--seed <seed>#

Random seed for operations.

--swap-space <size>#

CPU swap space size (GiB) per GPU.

--gpu-memory-utilization <fraction>#

The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.
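For example, to leave roughly half of the GPU memory free for other processes while giving preempted sequences more CPU swap space (the values are illustrative and should be tuned to your workload):

    from vllm import LLM

    # Use at most ~50% of GPU memory for weights, activations, and KV cache,
    # and allow 8 GiB of CPU swap space per GPU.
    llm = LLM(model="facebook/opt-125m",
              gpu_memory_utilization=0.5,
              swap_space=8)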

--max-num-batched-tokens <tokens>#

Maximum number of batched tokens per iteration.

--max-num-seqs <sequences>#

Maximum number of sequences per iteration.
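Together with --max-num-batched-tokens, this bounds how much work the scheduler packs into each iteration, trading throughput against per-step latency. A hedged sketch with deliberately small, illustrative limits (both options are assumed to be forwarded to the engine as keyword arguments):

    from vllm import LLM

    # Smaller batches: lower per-step latency at some cost in throughput.
    # 2048 matches opt-125m's context length; the token budget must not be
    # smaller than the model's maximum length.
    llm = LLM(model="facebook/opt-125m",
              max_num_batched_tokens=2048,
              max_num_seqs=16)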

--max-paddings <paddings>#

Maximum number of paddings in a batch.

--disable-log-stats#

Disable logging statistics.

--quantization (-q) {awq,squeezellm,None}#

Method used to quantize the weights.
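For example, to serve an AWQ-quantized checkpoint with the FP16 dtype recommended under --dtype above (the repository name is illustrative; substitute any AWQ-quantized model):

    from vllm import LLM

    # AWQ-quantized weights with FP16 activations.
    llm = LLM(model="TheBloke/Llama-2-7B-AWQ",  # illustrative AWQ checkpoint
              quantization="awq",
              dtype="half")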