CPU

vLLM is a Python library that supports the following CPU variants. Vendor-specific instructions for each are given below.

Warning

There are no pre-built wheels or images for these devices, so you must build vLLM from source.

x86

vLLM initially supports basic model inference and serving on the x86 CPU platform, with the FP32, FP16 and BF16 data types.

ARM AArch64

vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. The ARM CPU backend currently supports the FP32, FP16 and BF16 data types.

Apple silicon

vLLM has experimental support for macOS with Apple silicon. For now, users must build vLLM from source to run it natively on macOS. Currently, the CPU implementation for macOS supports the FP32 and FP16 data types.

IBM Z (s390x)

vLLM has experimental support for the s390x architecture on the IBM Z platform. For now, users must build vLLM from source to run it natively on IBM Z. Currently, the CPU implementation for s390x supports the FP32 data type only.

Requirements

  • Python: 3.9 -- 3.12

x86

  • OS: Linux
  • Compiler: gcc/g++ >= 12.3.0 (optional, recommended)
  • Instruction Set Architecture (ISA): AVX512 (optional, recommended)

Tip

Intel Extension for PyTorch (IPEX) extends PyTorch with up-to-date features and optimizations for an extra performance boost on Intel hardware.
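
IPEX is optional; a minimal install sketch using the package name published on PyPI is shown below. Check the IPEX documentation for the wheel that matches your installed PyTorch version.

pip install intel-extension-for-pytorch  # optional; pick the release matching your PyTorch build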

ARM AArch64

  • OS: Linux
  • Compiler: gcc/g++ >= 12.3.0 (optional, recommended)
  • Instruction Set Architecture (ISA): NEON support is required

Apple silicon

  • OS: macOS Sonoma or later
  • SDK: Xcode 15.4 or later with Command Line Tools
  • Compiler: Apple Clang >= 15.0.0

IBM Z (s390x)

  • OS: Linux
  • Compiler: gcc/g++ >= 12.3.0
  • Instruction Set Architecture (ISA): VXE support is required. Works with z14 and above.
  • Python packages built from source: pyarrow, torch and torchvision

Set up using Python

Create a new Python environment

It's recommended to use uv, a very fast Python environment manager, to create and manage Python environments. Please follow the uv documentation to install it. After installing uv, you can create and activate a new Python environment with the following commands; vLLM itself is installed in the build-from-source step below:

uv venv --python 3.12 --seed
source .venv/bin/activate

Pre-built wheels

Currently, there are no pre-built CPU wheels.

Build wheel from source

x86

Note

  • AVX512_BF16 is an extension ISA that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script checks the host CPU flags to determine whether to enable AVX512_BF16.
  • If you want to force-enable AVX512_BF16 for cross-compilation, set the environment variable VLLM_CPU_AVX512BF16=1 before building, as sketched below.
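
The x86 build itself follows the same pattern as the macOS and s390x builds shown later in this section. Here is a minimal sketch, assuming the requirements/cpu.txt file used in the macOS steps also applies here and using the CPU-only PyTorch index; consult the vLLM repository for the exact, version-specific commands:

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu  # CPU-only PyTorch wheels
# export VLLM_CPU_AVX512BF16=1   # optional: force AVX512_BF16 kernels when cross-compiling (see the note above)
VLLM_TARGET_DEVICE=cpu pip install -e .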

ARM AArch64

Testing has been conducted on AWS Graviton3 instances for compatibility.

Apple silicon

After installing Xcode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from source.

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/cpu.txt
pip install -e . 
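
After the editable install finishes, a quick sanity check that the package imports and reports its version:

python -c "import vllm; print(vllm.__version__)"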

Note

On macOS, VLLM_TARGET_DEVICE is automatically set to cpu, which is currently the only supported device.

Troubleshooting

If the build fails with errors like the following snippet, where standard C++ headers cannot be found, try removing and reinstalling your Command Line Tools for Xcode.

[...] fatal error: 'map' file not found
          1 | #include <map>
            |          ^~~~~
      1 error generated.
      [2/8] Building CXX object CMakeFiles/_C.dir/csrc/cpu/pos_encoding.cpp.o

[...] fatal error: 'cstddef' file not found
         10 | #include <cstddef>
            |          ^~~~~~~~~
      1 error generated.
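
A common way to do that, assuming the Command Line Tools live in their default location:

sudo rm -rf /Library/Developer/CommandLineTools  # remove the existing Command Line Tools
xcode-select --install                           # trigger a fresh installation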

IBM Z (s390x)

Install the following packages from the package manager before building vLLM. For example, on RHEL 9.4:

dnf install -y \
    which procps findutils tar vim git gcc g++ make patch cython zlib-devel \
    libjpeg-turbo-devel libtiff-devel libpng-devel libwebp-devel freetype-devel harfbuzz-devel \
    openssl-devel openblas openblas-devel wget autoconf automake libtool cmake numactl-devel

Install Rust >= 1.80, which is needed to build the outlines-core and uvloop Python packages.

curl https://sh.rustup.rs -sSf | sh -s -- -y && \
    . "$HOME/.cargo/env"

Execute the following commands to build and install vLLM from source.

Tip

Build the dependencies torchvision and pyarrow from source before building vLLM.

    sed -i '/^torch/d' requirements-build.txt    # remove torch from requirements-build.txt since we use nightly builds
    pip install -v \
        --extra-index-url https://download.pytorch.org/whl/nightly/cpu \
        -r requirements-build.txt \
        -r requirements-cpu.txt && \
    VLLM_TARGET_DEVICE=cpu python setup.py bdist_wheel && \
    pip install dist/*.whl

Set up using Docker

Pre-built images

Currently, there are no pre-built CPU images.

Build image from source

$ docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .

# Launching OpenAI server 
$ docker run --rm \
             --privileged=true \
             --shm-size=4g \
             -p 8000:8000 \
             -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
             -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
             vllm-cpu-env \
             --model=meta-llama/Llama-3.2-1B-Instruct \
             --dtype=bfloat16 \
             <other vLLM OpenAI server arguments>

Tip

For ARM or Apple silicon, use docker/Dockerfile.arm

Tip

For IBM Z (s390x), use docker/Dockerfile.s390x and pass --dtype float to docker run, roughly as sketched below.
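
A rough sketch of how the commands above change on IBM Z; build stage and target names may differ from the x86 Dockerfile, so check docker/Dockerfile.s390x before relying on this:

$ docker build -f docker/Dockerfile.s390x --tag vllm-cpu-env .   # --target may differ; check the Dockerfile
$ docker run --rm \
             --privileged=true \
             --shm-size=4g \
             -p 8000:8000 \
             -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
             vllm-cpu-env \
             --model=meta-llama/Llama-3.2-1B-Instruct \
             --dtype=float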

Supported features

vLLM CPU backend supports the following vLLM features:

  • Tensor Parallel
  • Model Quantization (INT8 W8A8, AWQ, GPTQ)
  • Chunked-prefill
  • Prefix-caching
  • FP8-E5M2 KV cache

Related runtime environment variables

  • VLLM_CPU_KVCACHE_SPACE: specifies the KV cache size (e.g., VLLM_CPU_KVCACHE_SPACE=40 means 40 GiB of space for the KV cache); a larger setting allows vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. Default value is 0.
  • VLLM_CPU_OMP_THREADS_BIND: specifies the CPU cores dedicated to the OpenMP threads. For example, VLLM_CPU_OMP_THREADS_BIND=0-31 means 32 OpenMP threads bound to CPU cores 0-31. VLLM_CPU_OMP_THREADS_BIND=0-31|32-63 means there will be 2 tensor parallel processes: the 32 OpenMP threads of rank 0 are bound to CPU cores 0-31, and the OpenMP threads of rank 1 are bound to CPU cores 32-63. When set to auto, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node. When set to all, the OpenMP threads of each rank use all CPU cores available on the system. Default value is auto.
  • VLLM_CPU_NUM_OF_RESERVED_CPU: specifies the number of CPU cores which are not dedicated to the OpenMP threads for each rank. The variable only takes effect when VLLM_CPU_OMP_THREADS_BIND is set to auto. Default value is 0.
  • VLLM_CPU_MOE_PREPACK: whether to use prepack for MoE layers. This is passed to ipex.llm.modules.GatedMLPMOE. Default is 1 (True). On unsupported CPUs, you might need to set this to 0 (False).

Performance tips

  • We highly recommend using TCMalloc for high-performance memory allocation and better cache locality. For example, on Ubuntu 22.04, you can run:
sudo apt-get install libtcmalloc-minimal4 # install the TCMalloc library
find / -name "*libtcmalloc*" # find the dynamic library path
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
python examples/offline_inference/basic/basic.py # run vLLM
  • When using online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPUs 30 and 31 for the framework and use CPUs 0-29 for OpenMP:
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=0-29
vllm serve facebook/opt-125m

or using default auto thread binding:

export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_NUM_OF_RESERVED_CPU=2
vllm serve facebook/opt-125m
  • If using the vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread to each physical CPU core via VLLM_CPU_OMP_THREADS_BIND, or to rely on the default auto thread binding. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
$ lscpu -e # check the mapping between logical CPU cores and physical CPU cores

# The "CPU" column means the logical CPU core IDs, and the "CORE" column means the physical core IDs. On this platform, two logical cores are sharing one physical core.
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ      MHZ
0    0      0    0 0:0:0:0          yes 2401.0000 800.0000  800.000
1    0      0    1 1:1:1:0          yes 2401.0000 800.0000  800.000
2    0      0    2 2:2:2:0          yes 2401.0000 800.0000  800.000
3    0      0    3 3:3:3:0          yes 2401.0000 800.0000  800.000
4    0      0    4 4:4:4:0          yes 2401.0000 800.0000  800.000
5    0      0    5 5:5:5:0          yes 2401.0000 800.0000  800.000
6    0      0    6 6:6:6:0          yes 2401.0000 800.0000  800.000
7    0      0    7 7:7:7:0          yes 2401.0000 800.0000  800.000
8    0      0    0 0:0:0:0          yes 2401.0000 800.0000  800.000
9    0      0    1 1:1:1:0          yes 2401.0000 800.0000  800.000
10   0      0    2 2:2:2:0          yes 2401.0000 800.0000  800.000
11   0      0    3 3:3:3:0          yes 2401.0000 800.0000  800.000
12   0      0    4 4:4:4:0          yes 2401.0000 800.0000  800.000
13   0      0    5 5:5:5:0          yes 2401.0000 800.0000  800.000
14   0      0    6 6:6:6:0          yes 2401.0000 800.0000  800.000
15   0      0    7 7:7:7:0          yes 2401.0000 800.0000  800.000

# On this platform, it is recommended to bind OpenMP threads only to logical CPU cores 0-7 or 8-15
$ export VLLM_CPU_OMP_THREADS_BIND=0-7
$ python examples/offline_inference/basic/basic.py
  • If using the vLLM CPU backend on a multi-socket machine with NUMA, be sure to set the CPU cores via VLLM_CPU_OMP_THREADS_BIND to avoid cross-NUMA-node memory access.

Other considerations

  • The CPU backend significantly differs from the GPU backend since the vLLM architecture was originally optimized for GPU use. A number of optimizations are needed to enhance its performance.

  • Decouple the HTTP serving components from the inference components. In a GPU backend configuration, the HTTP serving and tokenization tasks operate on the CPU, while inference runs on the GPU, which typically does not pose a problem. However, in a CPU-based setup, the HTTP serving and tokenization can cause significant context switching and reduced cache efficiency. Therefore, it is strongly recommended to segregate these two components for improved performance.

  • On a CPU-based setup with NUMA enabled, memory access performance may be heavily impacted by the topology. For NUMA architectures, Tensor Parallel is an option for better performance.

  • Tensor Parallel is supported for serving and offline inference. In general, each NUMA node is treated as one GPU card. Below is an example to enable Tensor Parallel = 2 for serving:

    VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
    

    or using default auto thread binding:

    VLLM_CPU_KVCACHE_SPACE=40 vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
    
  • For each thread id list in VLLM_CPU_OMP_THREADS_BIND, users should guarantee that the threads in the list belong to the same NUMA node.

  • Meanwhile, users should also take care of the memory capacity of each NUMA node. The memory usage of each TP rank is the sum of the weight shard size and VLLM_CPU_KVCACHE_SPACE; if it exceeds the capacity of a single NUMA node, the TP worker will be killed due to out-of-memory. A quick way to check per-node capacity is sketched below.
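
To see how much memory each NUMA node actually has before choosing VLLM_CPU_KVCACHE_SPACE, numactl (its development package already appears in the build dependencies above) prints per-node sizes:

numactl --hardware   # lists each NUMA node's CPUs plus "node N size" and "node N free"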