Optimization and Tuning

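This guide aims to help users improve vLLM Ascend performance at the system level. It covers operating system configuration, library optimization, deployment guidance, and more. Any feedback is welcome.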

Preparation

Run the container:

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the cann base image
export IMAGE=m.daocloud.io/quay.io/ascend/cann:9.0.1-910b-ubuntu22.04-py3.12
docker run --rm \
--name performance-test \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
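
The command above maps a single NPU. When several NPUs are needed, the `--device` flags can be generated instead of typed by hand. The following is a minimal sketch; `npu_device_flags` is a hypothetical helper name, not part of vLLM Ascend or Docker.

```shell
# Hypothetical helper: print one "--device /dev/davinciN" flag per NPU index.
# Indices outside 0-7 are rejected to match the /dev/davinci[0-7] range above.
npu_device_flags() {
    flags=""
    for idx in "$@"; do
        case "$idx" in
            [0-7]) flags="${flags:+$flags }--device /dev/davinci$idx" ;;
            *) echo "invalid NPU index: $idx" >&2; return 1 ;;
        esac
    done
    printf '%s\n' "$flags"
}

# Example: flags for the first two NPUs.
npu_device_flags 0 1
```

The output can replace the single `--device $DEVICE` line, for example `docker run --rm $(npu_device_flags 0 1) --device /dev/davinci_manager ...`, leaving the substitution unquoted so the shell splits it into separate arguments.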

Configure the environment:

# Configure the mirror
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" > /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list

# Install os packages
apt update && apt install wget gcc g++ libnuma-dev git vim -y

Install vLLM and vLLM Ascend:

# Install necessary dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install modelscope pandas datasets gevent sacrebleu rouge_score pybind11 pytest

# Configure this var to speed up model download
export VLLM_USE_MODELSCOPE=True

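Follow the installation guide to make sure vLLM and vLLM Ascend are installed correctly.

Note

Install vLLM and vLLM Ascend only after the Python configuration is complete, because these packages build their binaries with the Python in the current environment. If you install vLLM and vLLM Ascend before completing section 1.1, the binaries will not use the optimized Python.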

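Optimizations

1. Memory Allocator Optimization

1.1. jemalloc

jemalloc is a memory allocator that improves performance in multi-threaded scenarios and reduces memory fragmentation. It serves allocations from thread-local caches, which avoids lock contention between threads and can significantly improve performance.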

# Install jemalloc
sudo apt update
sudo apt install libjemalloc2

# Configure jemalloc
# "uname -m" is used because "uname -i" is non-portable and may print "unknown"
export LD_PRELOAD=/usr/lib/"$(uname -m)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
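
A wrong path in `LD_PRELOAD` makes the dynamic loader print an error for every command that follows, so it can help to confirm that the library exists before exporting it. The sketch below assumes the Debian/Ubuntu multiarch layout (`/usr/lib/<arch>-linux-gnu/`); `with_preload` is a hypothetical helper name.

```shell
# Hypothetical helper: print an LD_PRELOAD value with <lib> placed first,
# or fail without output if <lib> does not exist.
with_preload() {
    lib="$1"
    current="$2"
    [ -f "$lib" ] || { echo "library not found: $lib" >&2; return 1; }
    printf '%s\n' "$lib${current:+:$current}"
}

# Assumed location of libjemalloc.so.2 on Debian/Ubuntu (x86_64 or aarch64).
jemalloc_lib="/usr/lib/$(uname -m)-linux-gnu/libjemalloc.so.2"
if new_preload="$(with_preload "$jemalloc_lib" "${LD_PRELOAD:-}")"; then
    export LD_PRELOAD="$new_preload"
fi
```

Placing the library first matters when `LD_PRELOAD` already contains other entries, because earlier entries take precedence during symbol lookup.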

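1.2. TCMalloc

TCMalloc (Thread-Caching Malloc) is a general-purpose memory allocator. By introducing a multi-level cache structure, reducing mutex contention, and optimizing the handling of large objects, it improves overall performance while keeping latency low. See the TCMalloc documentation for more details.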

# Install tcmalloc
sudo apt update
sudo apt install libgoogle-perftools4 libgoogle-perftools-dev

# Get the location of libtcmalloc.so*
# The pattern is quoted so the shell does not expand it before find sees it
find /usr -name 'libtcmalloc.so*'

# Preload tcmalloc so that it takes precedence over the default glibc allocator
# <path> is the location of libtcmalloc.so reported by the previous command
# Example: "$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
export LD_PRELOAD="$LD_PRELOAD:<path>"

# Verify your configuration
# The path of libtcmalloc.so will be contained in the result if your configuration is valid
ldd `which python`
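
`ldd` shows what would be loaded for a new process. To check a process that is already running, its memory map under `/proc/<pid>/maps` can be searched instead. This is a minimal sketch for Linux; `maps_has_lib` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the given maps file mentions <name>
# (for example "tcmalloc" or "jemalloc") in one of its mapped paths.
maps_has_lib() {
    maps_file="$1"
    name="$2"
    grep -q -- "$name" "$maps_file" 2>/dev/null
}

# Example: check the current shell; replace $$ with the PID of the serving process.
if maps_has_lib "/proc/$$/maps" tcmalloc; then
    echo "tcmalloc is loaded"
else
    echo "tcmalloc is not loaded"
fi
```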

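2. `torch_npu` Optimization

Some performance tuning features in torch_npu are controlled by environment variables. Some of these features and their corresponding environment variables are listed below.

Memory optimization: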

# Upper limit of memory block splitting allowed (MB): Setting this parameter can prevent large memory blocks from being split.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"

Or:

# Enable expandable segments: the allocator grows existing memory segments through virtual memory mapping instead of allocating new fixed-size blocks, which reduces fragmentation.
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

Scheduling optimization:

# Optimize the operator dispatch queue. This raises peak memory usage, so performance may degrade when memory is tight.
export TASK_QUEUE_ENABLE=2

# Enable CPU core binding. This can greatly improve CPU-bound models and should not reduce performance for NPU-bound models.
export CPU_AFFINITY_CONF=1
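
Exporting these variables affects every later command in the shell. When comparing runs with and without a setting, it can be cleaner to scope the variables to a single command. The sketch below uses the values from this section; `run_with_npu_tuning` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: run a command with the torch_npu tuning variables set
# only for that command, leaving the current shell environment untouched.
run_with_npu_tuning() {
    env \
        PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" \
        TASK_QUEUE_ENABLE=2 \
        CPU_AFFINITY_CONF=1 \
        "$@"
}

# Example: show what the child process sees.
run_with_npu_tuning sh -c 'echo "$PYTORCH_NPU_ALLOC_CONF $TASK_QUEUE_ENABLE $CPU_AFFINITY_CONF"'
```

Any launch command can be passed in place of the `sh -c` example, so a baseline run and a tuned run can be started from the same shell.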

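3. CANN Optimization

3.1. HCCL Optimization

HCCL provides some performance tuning features that are controlled by environment variables.

You can configure HCCL to use "AIV" mode to optimize performance by setting the environment variable shown below. In "AIV" mode, communication is scheduled directly by the AI Vector cores over RoCE rather than by the AI CPU.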

export HCCL_OP_EXPANSION_MODE="AIV"

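In addition, more performance optimization features are available for specific scenarios, as listed below.

HCCL_INTRA_ROCE_ENABLE: use RDMA links instead of SDMA links as the mesh interconnect between two 8Ps (see the HCCL documentation for details).
HCCL_RDMA_TC: configures the traffic class of the RDMA NIC (see the HCCL documentation for details).
HCCL_RDMA_SL: configures the service level of the RDMA NIC (see the HCCL documentation for details).
HCCL_BUFFSIZE: controls the size of the buffer used for sharing data between two NPUs (see the HCCL documentation for details).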
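
Because all of these features are driven by environment variables, it is easy to lose track of which ones a launch shell actually carries. A small sketch for listing them before starting a job follows; `show_hccl_env` is a hypothetical helper name, and only the "AIV" setting from this section is assumed.

```shell
# Hypothetical helper: print every HCCL_* variable in the current environment,
# sorted by name, or a short notice when none is set.
show_hccl_env() {
    vars="$(env | grep '^HCCL_' | sort)"
    if [ -n "$vars" ]; then
        printf '%s\n' "$vars"
    else
        echo "no HCCL_* variables set"
    fi
}

export HCCL_OP_EXPANSION_MODE="AIV"
show_hccl_env
```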

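4. Kernel Optimization

This section describes OS-level optimizations applied on the host machine (bare metal or Kubernetes node) to improve the performance stability, latency, and throughput of inference workloads.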

Note

These settings must be applied on the host operating system with root privileges, not inside a container.

4.1 Set the CPU Frequency Governor to `performance`

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

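Purpose

Force all CPU cores to run under the performance governor
Disable dynamic frequency scaling (e.g., ondemand, powersave)

Benefits

Keeps CPU cores at their maximum frequency
Reduces latency jitter
Improves the predictability of inference workloads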
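
To confirm that the governor was applied to every core, the same sysfs files can be read back and summarized. This is a minimal sketch; `governor_summary` is a hypothetical helper name, and its optional argument exists only so the function can be pointed at a different directory.

```shell
# Hypothetical helper: list the distinct governors in use with a CPU count each.
# The optional argument overrides the sysfs root (default: /sys/devices/system/cpu).
governor_summary() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu*/cpufreq/scaling_governor; do
        if [ -r "$f" ]; then
            cat "$f"
        fi
    done | sort | uniq -c
}

governor_summary
```

A single output line naming `performance` with a count equal to the number of cores indicates the setting took effect everywhere. Empty output suggests that cpufreq is not exposed on the machine, which is common inside containers and some virtual machines.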

4.2 Disable Swap Usage

sysctl -w vm.swappiness=0

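Purpose

Minimize the kernel's tendency to swap memory pages out to disk

Benefits

Prevents severe latency spikes caused by swapping
Improves stability for models with a large memory footprint

Notes

For inference workloads, swapping can introduce latency on the order of seconds
The recommended value is 0 or 1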
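
Each sysctl key corresponds to a file under `/proc/sys`, with the dots in the key replaced by slashes, so the current value can be read without root before or after changing it. The following sketch shows the mapping; `sysctl_path` is a hypothetical helper name.

```shell
# Hypothetical helper: map a sysctl key such as vm.swappiness to its /proc/sys path.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Reading the value does not require root privileges.
swappiness_file="$(sysctl_path vm.swappiness)"
if [ -r "$swappiness_file" ]; then
    echo "vm.swappiness = $(cat "$swappiness_file")"
fi
```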

4.3 Disable Automatic NUMA Balancing

sysctl -w kernel.numa_balancing=0

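Purpose

Disable the kernel's automatic NUMA page migration mechanism

Benefits

Prevents background migration of memory pages
Reduces unpredictable memory access latency
Improves performance stability on NUMA systems

Recommended scenarios

Multi-socket servers
Ascend / NPU deployments with explicit NUMA binding
Systems that manage CPU and memory affinity manually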
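
Whether this setting matters depends on the host having more than one NUMA node. A quick check is to count the node directories that the kernel exposes in sysfs. This is a minimal sketch; `numa_node_count` is a hypothetical helper name, and its optional argument exists only so the function can be pointed at a different directory.

```shell
# Hypothetical helper: count NUMA nodes by listing nodeN directories in sysfs.
# The optional argument overrides the sysfs root (default: /sys/devices/system/node).
numa_node_count() {
    root="${1:-/sys/devices/system/node}"
    count=0
    for d in "$root"/node[0-9]*; do
        if [ -d "$d" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

if [ "$(numa_node_count)" -gt 1 ]; then
    echo "multiple NUMA nodes detected"
else
    echo "single NUMA node, or NUMA information unavailable"
fi
```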

4.4 Increase the Scheduler Migration Cost

sysctl -w kernel.sched_migration_cost_ns=50000

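Purpose

Increase the cost the scheduler assigns to migrating tasks between CPU cores

Benefits

Reduces frequent thread migration
Improves CPU cache locality
Lowers latency jitter for inference workloads

Parameter details

Unit: nanoseconds (ns)
Typical recommended range: 50000–100000
Higher values make threads more likely to stay on the same CPU core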
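
Values set with `sysctl -w` are lost on reboot. One common way to keep them is a drop-in file under `/etc/sysctl.d/`. The sketch below writes the three settings from sections 4.2 to 4.4 to a scratch file for review; `write_inference_sysctls` and the file name mentioned afterwards are hypothetical.

```shell
# Hypothetical helper: write the sysctl settings from sections 4.2-4.4
# to the given file in sysctl.d format.
write_inference_sysctls() {
    out="$1"
    cat > "$out" <<'EOF'
vm.swappiness = 0
kernel.numa_balancing = 0
kernel.sched_migration_cost_ns = 50000
EOF
}

# Write to a scratch file first so the content can be reviewed.
conf_file="$(mktemp)"
write_inference_sysctls "$conf_file"
cat "$conf_file"
```

After review, the file can be copied as root to a location such as `/etc/sysctl.d/99-inference-tuning.conf` and loaded with `sysctl --system`. On some newer kernels the `kernel.sched_migration_cost_ns` key may not be available through sysctl, in which case that line should be removed.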