Distributed Inference and Serving

vLLM supports distributed tensor-parallel inference and serving. Currently, we support Megatron-LM's tensor parallel algorithm. We manage the distributed runtime with either Ray or Python native multiprocessing. Multiprocessing can be used when deploying on a single node; multi-node inference currently requires Ray.

Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured tensor_parallel_size; otherwise, Ray will be used. This default can be overridden via the LLM class distributed_executor_backend argument or the --distributed-executor-backend API server argument. Set it to mp for multiprocessing or ray for Ray. Ray is not required to be installed for the multiprocessing case.
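
For example, to explicitly select the multiprocessing backend with the LLM class, you can pass the argument described above as a keyword argument (a minimal sketch; the mp value corresponds to the option names listed above):

from vllm import LLM
# Force the multiprocessing backend instead of relying on the default selection.
llm = LLM("facebook/opt-13b",
          tensor_parallel_size=4,
          distributed_executor_backend="mp")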

To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:

from vllm import LLM
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")

To run multi-GPU serving, pass in the --tensor-parallel-size argument when starting the server. For example, to run the API server on 4 GPUs:

$ python -m vllm.entrypoints.api_server \
$     --model facebook/opt-13b \
$     --tensor-parallel-size 4

To scale vLLM beyond a single machine, install and start a Ray runtime via the CLI before running vLLM:

$ pip install ray

$ # On head node
$ ray start --head

$ # On worker nodes
$ ray start --address=<ray-head-address>

After that, you can run multi-node inference and serving by launching the vLLM process on the head node and setting tensor_parallel_size to the total number of GPUs across all machines.
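
For example, assuming a hypothetical cluster of 2 machines with 8 GPUs each (16 GPUs total) and the Ray cluster started as above, the launch on the head node might look like this sketch:

from vllm import LLM
# Assumed setup: 2 nodes x 8 GPUs = 16 GPUs in total; run this on the Ray head node.
llm = LLM("facebook/opt-13b", tensor_parallel_size=16)
output = llm.generate("San Francisco is a")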