Optimization and Tuning#

This guide helps users improve vLLM Ascend performance at the system level. It covers OS configuration, library optimization, and deployment guidance. Feedback is welcome.

Preparation#

Run the container:

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the cann base image
export IMAGE=m.daocloud.io/quay.io/ascend/cann:9.0.0-910b-ubuntu22.04-py3.11
docker run --rm \
--name performance-test \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
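
Before choosing a value for DEVICE, it can help to see which device nodes the host actually exposes. The following is a minimal sketch, not part of the official tooling: `list_davinci_devices` is a hypothetical helper name, and the optional directory argument exists only so the function can be tried against a scratch directory.

```shell
# List the davinci device nodes found under a directory (default: /dev).
# list_davinci_devices is a hypothetical helper, not an Ascend tool.
list_davinci_devices() {
    local dev_dir="${1:-/dev}"
    local node
    for node in "$dev_dir"/davinci[0-9]*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [ -e "$node" ] || continue
        printf '%s\n' "$node"
    done
}

# Prints nothing on a host without Ascend devices.
list_davinci_devices
```

Any path printed by the function can be used as the value of DEVICE. Control nodes such as /dev/davinci_manager are deliberately excluded by the glob.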

Configure your environment:

# Configure the apt mirror (ubuntu-ports serves non-x86 architectures such as aarch64)
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" > /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list

# Install OS packages
apt update && apt install wget gcc g++ libnuma-dev git vim -y

Install vLLM and vLLM Ascend:

# Install necessary dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install modelscope pandas datasets gevent sacrebleu rouge_score pybind11 pytest

# Configure this var to speed up model download
export VLLM_USE_MODELSCOPE=True

Please follow the Installation Guide to make sure vLLM and vLLM Ascend are installed correctly.

Note

Install vLLM and vLLM Ascend only after the Python configuration is complete, because these packages build binary files with the Python in the current environment. If you install them before completing section 1.1, the binary files will not use the optimized Python.

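Optimizations#

1. Compilation Optimization#

1.1. Install optimized `python`#

Python supports LTO and PGO optimization from version 3.6 onward; both can be enabled at compile time. For convenience, we provide a prebuilt optimized python package. You can also build python yourself for your specific scenario by following the referenced tutorial.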

mkdir -p /workspace/tmp
cd /workspace/tmp

# Download prebuilt lib and packages
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libcrypto.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libomp.so
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libssl.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/python/py311_bisheng.tar.gz

# Configure python and pip
cp ./*.so* /usr/local/lib
tar -zxvf ./py311_bisheng.tar.gz -C /usr/local/
mv  /usr/local/py311_bisheng/  /usr/local/python
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3.11
ln -sf  /usr/local/python/bin/python3  /usr/bin/python
ln -sf  /usr/local/python/bin/python3  /usr/bin/python3
ln -sf  /usr/local/python/bin/python3.11  /usr/bin/python3.11
ln -sf  /usr/local/python/bin/pip3  /usr/bin/pip3
ln -sf  /usr/local/python/bin/pip3  /usr/bin/pip

export PATH=/usr/bin:/usr/local/python/bin:$PATH
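
After the symlinks are in place, it is worth confirming which interpreter `python3` resolves to and how it was configured. The sketch below assumes a CPython build configured with the standard `--enable-optimizations` (PGO) and `--with-lto` switches; `check_build_flags` is a hypothetical helper. A prebuilt package may enable these optimizations through other compiler settings, so a "no" here does not prove the build is unoptimized.

```shell
# Report whether a CPython CONFIG_ARGS string mentions PGO and LTO.
# check_build_flags is a hypothetical helper, not part of any package.
check_build_flags() {
    local config_args="$1"
    case "$config_args" in
        *--enable-optimizations*) echo "PGO: yes" ;;
        *) echo "PGO: no" ;;
    esac
    case "$config_args" in
        *--with-lto*) echo "LTO: yes" ;;
        *) echo "LTO: no" ;;
    esac
}

# Query the interpreter that "python3" resolves to, if one is available.
if command -v python3 >/dev/null 2>&1; then
    command -v python3
    check_build_flags "$(python3 -c 'import sysconfig; print(sysconfig.get_config_var("CONFIG_ARGS") or "")')"
fi
```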

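2. OS Optimization#

2.1. jemalloc#

jemalloc is a memory allocator that improves performance in multi-threaded scenarios and reduces memory fragmentation. It serves allocations from thread-local memory managers, which avoids lock contention between threads and can substantially improve performance.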

# Install jemalloc
sudo apt update
sudo apt install libjemalloc2

# Configure jemalloc (uname -m gives the architecture prefix of the Ubuntu multiarch library directory, e.g. aarch64)
export LD_PRELOAD=/usr/lib/"$(uname -m)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
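
The export above leaves a trailing empty entry in LD_PRELOAD when the variable was previously unset, and it does not check that the library exists. A slightly more defensive variant might look like the sketch below. `prepend_preload` is a hypothetical helper, and the library path assumes the Ubuntu multiarch layout used in this guide.

```shell
# Prepend a library to an LD_PRELOAD-style value without leaving an empty entry.
# prepend_preload is a hypothetical helper; it only prints the new value.
prepend_preload() {
    local lib="$1"
    local current="${2:-}"
    if [ -n "$current" ]; then
        printf '%s:%s\n' "$lib" "$current"
    else
        printf '%s\n' "$lib"
    fi
}

# Assumed location of the library installed by the libjemalloc2 package.
jemalloc_lib="/usr/lib/$(uname -m)-linux-gnu/libjemalloc.so.2"
if [ -f "$jemalloc_lib" ]; then
    LD_PRELOAD="$(prepend_preload "$jemalloc_lib" "${LD_PRELOAD:-}")"
    export LD_PRELOAD
else
    echo "jemalloc not found at $jemalloc_lib; LD_PRELOAD left unchanged" >&2
fi
```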

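2.2. TCMalloc#

TCMalloc (Thread Caching Malloc) is a general-purpose memory allocator. It improves overall performance while keeping latency low by introducing a multi-level cache structure, reducing mutex contention, and optimizing the handling of large objects.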

# Install tcmalloc
sudo apt update
sudo apt install libgoogle-perftools4 libgoogle-perftools-dev

# Get the location of libtcmalloc.so* (the pattern is quoted so the shell does not expand it)
find /usr -name "libtcmalloc.so*"

# Give tcmalloc higher priority by placing it first in LD_PRELOAD
# <path> is the location of libtcmalloc.so returned by the command above
# Example: "/usr/lib/aarch64-linux-gnu/libtcmalloc.so:$LD_PRELOAD"
export LD_PRELOAD="<path>:$LD_PRELOAD"

# Verify your configuration
# The path of libtcmalloc.so appears in the output if the configuration is valid
ldd "$(which python)"
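
`ldd` shows what would be loaded for a new process. To check a process that is already running, its memory map can be inspected instead. This is a minimal sketch under the assumption of a Linux host with /proc mounted; `detect_allocator` is a hypothetical helper name.

```shell
# Print which known allocator libraries appear in a /proc/<pid>/maps style file.
# detect_allocator is a hypothetical helper name.
detect_allocator() {
    local maps_file="$1"
    local found="none"
    if grep -q 'libjemalloc' "$maps_file" 2>/dev/null; then
        found="jemalloc"
    fi
    if grep -q 'libtcmalloc' "$maps_file" 2>/dev/null; then
        if [ "$found" = "none" ]; then
            found="tcmalloc"
        else
            found="$found+tcmalloc"
        fi
    fi
    echo "$found"
}

# Inspect the current shell as a smoke test; replace $$ with the PID of the
# running inference process to check a real deployment.
detect_allocator "/proc/$$/maps"
```

A result that names both allocators suggests that both libraries were preloaded, which is usually not intended.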

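3. `torch_npu` Optimization#

Some performance tuning features in torch_npu are controlled by environment variables. Several of these features and their environment variables are shown below.

Memory optimization: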

# Upper limit (MB) on the size of memory blocks that may be split. Setting it prevents large memory blocks from being split.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"

# When operators on the communication stream have dependencies, all of them must finish before their memory is released for reuse. Multi-stream reuse releases the memory on the communication stream early so that the computing stream can reuse it.
# Note: this export replaces the value set by the previous export of PYTORCH_NPU_ALLOC_CONF.
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

Scheduling optimization:

# Optimize the operator delivery queue. This affects peak memory usage, and performance may degrade when memory is tight.
export TASK_QUEUE_ENABLE=2

or

# This can greatly improve CPU-bound models while keeping performance unchanged for NPU-bound models.
export CPU_AFFINITY_CONF=1
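
Because each `export` of a variable replaces its earlier value, it can be useful to print the effective settings right before launching a workload. The following is a small sketch; `show_tuning_env` is a hypothetical helper that only reads the three variables named in this section and changes nothing.

```shell
# Print the tuning variables named in this section and whether each is set.
# show_tuning_env is a hypothetical helper; it changes nothing.
show_tuning_env() {
    local name value
    for name in PYTORCH_NPU_ALLOC_CONF TASK_QUEUE_ENABLE CPU_AFFINITY_CONF; do
        # eval gives a portable indirect lookup of the variable called $name.
        eval "value=\${$name-unset}"
        printf '%s=%s\n' "$name" "$value"
    done
}

show_tuning_env
```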

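4. CANN Optimization#

4.1. HCCL Optimization#

HCCL provides some performance tuning features that are controlled by environment variables.

You can configure HCCL to use "AIV" mode by setting the environment variable shown below. In "AIV" mode, communication is scheduled directly by the AI vector cores over RoCE rather than by the AI CPU.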

export HCCL_OP_EXPANSION_MODE="AIV"

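In addition, more performance optimization features are available for specific scenarios, as listed below.

HCCL_INTRA_ROCE_ENABLE: Use RDMA links instead of SDMA links as the mesh interconnect between two 8P units.
HCCL_RDMA_TC: Use this variable to configure the traffic class of the RDMA NIC.
HCCL_RDMA_SL: Use this variable to configure the service level of the RDMA NIC.
HCCL_BUFFSIZE: Use this variable to control the size of the buffer used for sharing data between two NPUs.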

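5. OS Optimization#

This section describes operating-system-level optimizations applied on the host (bare metal or Kubernetes node) to improve the performance stability, latency, and throughput of inference workloads.

Note

These settings must be applied on the host OS with root privileges, not inside the container.

5.1. Set the CPU frequency governor to performance#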

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

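Purpose

Force all CPU cores to run under the performance governor
Disable dynamic frequency scaling (for example, ondemand or powersave)

Benefits

Keep CPU cores at their highest frequency
Reduce latency jitter
Improve predictability for inference workloads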
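
To confirm that the governor change took effect on every core, the per-CPU sysfs files can be summarized. This is a minimal sketch: `summarize_governors` is a hypothetical helper, and the directory argument defaults to the standard sysfs location. Hosts without cpufreq support (many virtual machines, for example) expose no such files, in which case nothing is printed.

```shell
# Count how many CPUs use each governor under a sysfs CPU directory.
# summarize_governors is a hypothetical helper name.
summarize_governors() {
    local cpu_dir="${1:-/sys/devices/system/cpu}"
    local file
    for file in "$cpu_dir"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip the literal pattern when no file matches.
        [ -r "$file" ] || continue
        cat "$file"
    done | sort | uniq -c
}

summarize_governors
```

A single output line naming `performance` with a count equal to the number of cores indicates that the setting was applied everywhere.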

5.2. Disable swap usage#

sysctl -w vm.swappiness=0

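Purpose

Minimize the kernel's tendency to swap memory pages out to disk

Benefits

Prevent severe latency spikes caused by swapping
Improve stability for models with a large memory footprint

Notes

For inference workloads, swapping can cause latency on the order of seconds
The recommended value is 0 or 1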

5.3. Disable automatic NUMA balancing#

sysctl -w kernel.numa_balancing=0

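Purpose

Disable the kernel's automatic NUMA page migration mechanism

Benefits

Prevent background migration of memory pages
Reduce unpredictable memory access latency
Improve performance stability on NUMA systems

Recommended for

Multi-socket servers
Ascend / NPU deployments with explicit NUMA binding
Systems where CPU and memory affinity are managed manually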
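
Whether this setting matters depends on how many NUMA nodes the host has. The sketch below counts the nodes exposed in sysfs and prints the current balancing value. `numa_node_count` is a hypothetical helper, and the paths are the standard Linux locations.

```shell
# Count NUMA nodes exposed under a sysfs node directory.
# numa_node_count is a hypothetical helper name.
numa_node_count() {
    local node_dir="${1:-/sys/devices/system/node}"
    local count=0
    local node
    for node in "$node_dir"/node[0-9]*; do
        [ -d "$node" ] || continue
        count=$((count + 1))
    done
    echo "$count"
}

echo "NUMA nodes: $(numa_node_count)"
# The file is absent on kernels built without automatic NUMA balancing.
if [ -r /proc/sys/kernel/numa_balancing ]; then
    echo "numa_balancing: $(cat /proc/sys/kernel/numa_balancing)"
fi
```

On a host that reports a single node, automatic NUMA balancing has nothing to migrate between, so the setting is unlikely to change behavior.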

5.4. Increase the scheduler migration cost#

sysctl -w kernel.sched_migration_cost_ns=50000

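Purpose

Increase the scheduler's cost for migrating tasks between CPU cores

Benefits

Reduce frequent thread migration
Improve CPU cache locality
Lower latency jitter for inference workloads

Parameter details

Unit: nanoseconds (ns)
Typical recommended range: 50000–100000
Higher values encourage threads to stay on the same CPU core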
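
Values set with `sysctl -w` are lost on reboot. One common way to persist them is a fragment under /etc/sysctl.d, sketched below with the values from sections 5.2 to 5.4. `render_sysctl_conf` and the file name `99-inference-tuning.conf` are hypothetical. Some newer kernels may not expose `kernel.sched_migration_cost_ns` through sysctl, in which case `sysctl` reports that key as unknown and the line should be removed.

```shell
# Print a sysctl.d fragment with the values used in sections 5.2 to 5.4.
# render_sysctl_conf and the file name below are hypothetical.
render_sysctl_conf() {
    cat <<'EOF'
vm.swappiness = 0
kernel.numa_balancing = 0
kernel.sched_migration_cost_ns = 50000
EOF
}

# Show the fragment without touching the system.
render_sysctl_conf

# To persist it on the host (as root), one option would be:
#   render_sysctl_conf > /etc/sysctl.d/99-inference-tuning.conf
#   sysctl --system
```

The CPU frequency governor from section 5.1 is not a sysctl and would need a separate mechanism, such as a boot-time service, to persist.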