CPU Binding#
Starting from vllm-ascend v0.18.0rc1, CPU binding is enabled by default on ARM-based Ascend servers.
You usually do not need to configure it manually. Set enable_cpu_binding
only when you want to disable it or make the default explicit.
Benefits of CPU Binding#
CPU Binding improves host-side scheduling for multi-socket ARM servers with Ascend NPUs. It is designed to solve three common host-side inference performance issues:
Lower cross-NUMA traffic. Worker processes stay closer to the CPU and memory resources selected for their active NPU, reducing remote NUMA access.
Lower context-switch overhead from thread preemption. Key runtime threads run on stable CPU ranges, reducing scheduler movement and CPU contention on busy hosts.
Better latency stability and multi-worker isolation. Independent workers avoid sharing the same CPU/NUMA resources, which helps reduce tail-latency jitter and makes throughput more predictable during multi-NPU serving.
This feature is a host-side performance optimization. It does not change model execution logic or numerical outputs. When memory migration support is unavailable, CPU affinity still works, but memory locality may be worse and latency or throughput may degrade.
Usage#
Online Serving#
Default behavior:
vllm serve Qwen/Qwen2.5-7B-Instruct
Disable CPU binding:
vllm serve Qwen/Qwen2.5-7B-Instruct \
--additional-config '{"enable_cpu_binding": false}'
Offline Inference#
Default behavior:
from vllm import LLM
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
Disable CPU binding:
from vllm import LLM
llm = LLM(
model="Qwen/Qwen2.5-7B-Instruct",
additional_config={"enable_cpu_binding": False},
)
Requirements#
Official vllm-ascend images have already included util-linux and procps /
procps-ng in v0.18.0rc1 and earlier releases. Starting from v0.18.0rc1, the
official images also include numactl.
If you are not using the official image, install the host tools manually:
# Ubuntu/Debian
sudo apt-get install -y util-linux numactl procps
# RHEL/CentOS/Alma/Rocky
sudo yum install -y util-linux numactl procps-ng
# openEuler
sudo dnf install -y util-linux numactl procps-ng
Without numactl / migratepages, vLLM Ascend skips only memory migration.
The worker process and runtime threads are still pinned, but pages already
placed on remote NUMA nodes are not migrated, which can reduce locality and
degrade latency or throughput.
For optimal locality, use a cpuset that is evenly distributed across NUMA nodes. Unbalanced cpusets may reduce the locality benefit of CPU binding.
For IRQ binding, the process also needs permission to read /proc/interrupts
and write /proc/irq/*/smp_affinity. If irqbalance is running and the process
can use systemctl, vLLM Ascend stops it before applying IRQ affinity. In
containers where systemctl is unavailable, stop irqbalance on the host when
IRQ affinity matters.
On the host, stop irqbalance before starting vLLM when you need stable IRQ
affinity:
sudo systemctl stop irqbalance
After the vLLM service exits, restart it if the host should return to the default IRQ balancing policy:
sudo systemctl start irqbalance
Troubleshooting#
Message |
Meaning |
Action |
|---|---|---|
|
CPU binding only runs on ARM. |
No action needed on x86_64. |
|
No running NPU was found, or |
Check visible NPU IDs and |
|
Fewer than 5 CPUs are available per logical NPU. |
Expand the cpuset or reduce visible NPUs. |
|
Topology affinity is unavailable. |
vLLM Ascend falls back to |
|
Memory migration is skipped, while CPU thread binding still proceeds. |
Install |
|
A binding step failed and CPU binding was skipped for that rank. |
Check |