Distributed Inference and Serving#
vLLM supports distributed tensor-parallel inference and serving. Currently, we support Megatron-LM’s tensor parallel algorithm. We manage the distributed runtime with Ray. To run distributed inference, install Ray with:
$ pip install ray
To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
from vllm import LLM
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
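The call returns one RequestOutput object per prompt. For example, you can print the prompt and its generated completion like this:
# Each RequestOutput holds the prompt and its generated completions.
for request_output in output:
    print(request_output.prompt, request_output.outputs[0].text)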
To run multi-GPU serving, pass in the --tensor-parallel-size argument when starting the server. For example, to run the API server on 4 GPUs:
$ python -m vllm.entrypoints.api_server \
$     --model facebook/opt-13b \
$     --tensor-parallel-size 4
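Once the server is up, you can send it a request to check that generation works. A minimal sketch, assuming the server's default host and port and its /generate endpoint (adjust the fields to match your setup):
$ curl http://localhost:8000/generate \
$     -d '{"prompt": "San Francisco is a", "max_tokens": 16}'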
To scale vLLM beyond a single machine, start a Ray runtime via CLI before running vLLM:
$ # On head node
$ ray start --head
$ # On worker nodes
$ ray start --address=<ray-head-address>
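You can check that all nodes have joined the cluster with Ray's status command:
$ # On any node in the cluster
$ ray status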
After that, you can run inference and serving on multiple machines by launching the vLLM process on the head node and setting tensor_parallel_size to the total number of GPUs across all machines.
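For example, with a hypothetical setup of 2 machines of 4 GPUs each, you would set tensor_parallel_size to 8 on the head node:
from vllm import LLM

# 2 nodes x 4 GPUs each -> 8-way tensor parallelism across the Ray cluster.
llm = LLM("facebook/opt-13b", tensor_parallel_size=8)
output = llm.generate("San Francisco is a")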