Engine Arguments#

Below, you can find an explanation of every engine argument for vLLM:
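These arguments are passed on the command line when launching the vLLM server; the same options are also exposed programmatically as keyword arguments of the offline LLM class (dashes become underscores, e.g. --gpu-memory-utilization becomes gpu_memory_utilization). A minimal sketch, assuming vLLM is installed and using the small facebook/opt-125m model purely as an example:

    from vllm import LLM, SamplingParams

    # Each engine argument maps to a keyword argument:
    #   --model -> model, --dtype -> dtype, --seed -> seed, ...
    llm = LLM(model="facebook/opt-125m", dtype="auto", seed=0)

    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)

The examples under the individual arguments below follow the same pattern.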

--model <model_name_or_path>#

Name or path of the Hugging Face model to use.

--tokenizer <tokenizer_name_or_path>#

Name or path of the Hugging Face tokenizer to use.

--revision <revision>#

The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

--tokenizer-revision <revision>#

The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

--tokenizer-mode {auto,slow}#

The tokenizer mode.

  • “auto” will use the fast tokenizer if available.

  • “slow” will always use the slow tokenizer.

--trust-remote-code#

Trust remote code (custom modeling code) from Hugging Face.
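Model repositories that ship custom modeling code can only be loaded when this flag is set. A hedged sketch in the same style as above (the repository name is a placeholder, not a real model):

    from vllm import LLM

    # Placeholder repository; substitute a model whose Hugging Face repo
    # contains custom modeling code that you trust.
    llm = LLM(model="your-org/model-with-custom-code", trust_remote_code=True)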

--download-dir <directory>#

Directory to download and load the weights. Defaults to the Hugging Face cache directory.

--load-format {auto,pt,safetensors,npcache,dummy}#

The format of the model weights to load.

  • “auto” will try to load the weights in the safetensors format and fall back to the pytorch bin format if the safetensors format is not available.

  • “pt” will load the weights in the pytorch bin format.

  • “safetensors” will load the weights in the safetensors format.

  • “npcache” will load the weights in the pytorch bin format and store a numpy cache to speed up loading.

  • “dummy” will initialize the weights with random values, mainly for profiling.
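The “dummy” format is handy for profiling memory usage or parallelism settings without downloading real weights. A minimal sketch, assuming load_format is forwarded to the engine as a keyword argument like the other options:

    from vllm import LLM

    # Initialize random weights instead of downloading a checkpoint;
    # the outputs are meaningless, so use this only for profiling.
    llm = LLM(model="facebook/opt-125m", load_format="dummy")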

--dtype {auto,half,float16,bfloat16,float,float32}#

Data type for model weights and activations.

  • “auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.

  • “half” for FP16. Recommended for AWQ quantization.

  • “float16” is the same as “half”.

  • “bfloat16” for a balance between precision and range.

  • “float” is shorthand for FP32 precision.

  • “float32” for FP32 precision.
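For example, to force BF16 regardless of the precision the checkpoint was saved in (this assumes a GPU with bfloat16 support, e.g. Ampere or newer):

    from vllm import LLM

    # Run weights and activations in bfloat16; use dtype="half" on GPUs
    # without BF16 support.
    llm = LLM(model="facebook/opt-125m", dtype="bfloat16")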

--max-model-len <length>#

Model context length. If unspecified, will be automatically derived from the model config.

--worker-use-ray#

Use Ray for distributed serving. This is set automatically when more than one GPU is used.

--pipeline-parallel-size (-pp) <size>#

Number of pipeline stages.

--tensor-parallel-size (-tp) <size>#

Number of tensor parallel replicas.
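For example, to shard a model across two GPUs on a single node (this assumes two visible CUDA devices; with more than one GPU, Ray is used automatically, as described under --worker-use-ray):

    from vllm import LLM

    # Split the model across 2 GPUs with tensor parallelism.
    llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)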

--max-parallel-loading-workers <workers>#

Load the model sequentially in multiple batches to avoid CPU RAM out-of-memory errors when using tensor parallelism with large models.

--block-size {8,16,32}#

Token block size for contiguous chunks of tokens.

--enable-prefix-caching#

Enable automatic prefix caching.

--seed <seed>#

Random seed for operations.

--swap-space <size>#

CPU swap space size (GiB) per GPU.

--gpu-memory-utilization <fraction>#

The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.
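For example, to leave roughly half of the GPU memory free for other processes while giving preempted sequences more CPU swap space (the values are illustrative and should be tuned to your workload):

    from vllm import LLM

    # Use at most ~50% of GPU memory for weights, activations, and KV cache,
    # and allow 8 GiB of CPU swap space per GPU.
    llm = LLM(model="facebook/opt-125m",
              gpu_memory_utilization=0.5,
              swap_space=8)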

--max-num-batched-tokens <tokens>#

Maximum number of batched tokens per iteration.

--max-num-seqs <sequences>#

Maximum number of sequences per iteration.
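Together with --max-num-batched-tokens, this bounds how much work the scheduler packs into each iteration, trading throughput against per-step latency. A hedged sketch with deliberately small, illustrative limits (both options are assumed to be forwarded to the engine as keyword arguments):

    from vllm import LLM

    # Smaller batches: lower per-step latency at some cost in throughput.
    # 2048 matches opt-125m's context length; the token budget must not be
    # smaller than the model's maximum length.
    llm = LLM(model="facebook/opt-125m",
              max_num_batched_tokens=2048,
              max_num_seqs=16)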

--max-paddings <paddings>#

Maximum number of paddings in a batch.

--disable-log-stats#

Disable logging statistics.

--quantization (-q) {awq,squeezellm,None}#

Method used to quantize the weights.
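For example, to serve an AWQ-quantized checkpoint with the FP16 dtype recommended under --dtype above (the repository name is illustrative; substitute any AWQ-quantized model):

    from vllm import LLM

    # AWQ-quantized weights with FP16 activations.
    llm = LLM(model="TheBloke/Llama-2-7B-AWQ",  # illustrative AWQ checkpoint
              quantization="awq",
              dtype="half")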