Installation

This document describes how to install vllm-ascend manually.

Requirements

OS: Linux
Python: >= 3.9, < 3.12
Hardware with Ascend NPUs, typically the Atlas 800 A2 series.
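As a quick pre-flight check, the Python range above can be tested from the interpreter you plan to install into. This is a minimal sketch; the function name `python_version_supported` is illustrative and not part of vllm-ascend.

```python
import sys

# Supported range taken from the requirements above: >= 3.9 and < 3.12.
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 12)


def python_version_supported(version=None):
    """Return True if (major, minor) falls in the documented range."""
    if version is None:
        version = sys.version_info[:2]
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if python_version_supported() else "NOT supported"
    print(f"Python {current}: {status}")
```

Only the major and minor components are compared, so any 3.9.x, 3.10.x, or 3.11.x patch release passes.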

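Software dependencies:

Software	Supported version	Notes
Ascend HDK	Refer to here	Required for CANN
CANN	== 8.3.RC2	Required for vllm-ascend and torch-npu
torch-npu	== 2.7.1.post1	Required for vllm-ascend. No manual installation is needed; it is installed automatically in a later step
torch	== 2.7.1	Required for torch-npu and vllm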

There are two ways to install:

Using pip: prepare the environment manually or with the CANN image, then install vllm-ascend with pip.
Using Docker: use the prebuilt vllm-ascend Docker image directly.

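Set up a new environment

Before installing, make sure the firmware/driver and CANN are installed correctly. For details, refer to the Ascend Environment Setup Guide.

Configure the hardware environment

To verify that the Ascend NPU firmware and driver are installed correctly, run: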

npu-smi info

For more information, refer to the Ascend Environment Setup Guide.
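The same check can be scripted, for example in a provisioning job. The sketch below assumes that `npu-smi info` exits with status 0 when the firmware and driver are healthy; that exit-code behavior is an assumption, and the helper name `npu_smi_ok` is illustrative.

```python
import shutil
import subprocess


def npu_smi_ok(command="npu-smi", timeout=30):
    """Return True if `<command> info` runs and exits with status 0."""
    if shutil.which(command) is None:
        # The tool is not on PATH, so the driver is likely not installed.
        return False
    try:
        result = subprocess.run(
            [command, "info"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Assumption: a zero exit status means the driver responded normally.
    return result.returncode == 0


if __name__ == "__main__":
    if npu_smi_ok():
        print("npu-smi info succeeded")
    else:
        print("npu-smi info failed or is unavailable")
```

A False result only says the command was missing, timed out, or failed; reading the `npu-smi info` output directly is still the way to see which devices are present.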

Configure the software environment

Before using pip

The easiest way to prepare the software environment is to use the CANN image directly:

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the CANN image
export IMAGE=quay.io/ascend/cann:8.3.rc2-910b-ubuntu22.04-py3.11
docker run --rm \
    --name vllm-ascend-env \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
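The command above exposes a single NPU through DEVICE. Docker accepts a repeated `--device` flag, so one way to expose several NPUs is to generate the argument list. The sketch below is illustrative: the helper `build_docker_command` is hypothetical, and whether a given multi-NPU layout works on your host is an assumption to verify.

```python
import shlex

# Device nodes and bind mounts copied from the command above.
COMMON_DEVICES = ["/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"]
MOUNTS = [
    "/usr/local/dcmi",
    "/usr/local/bin/npu-smi",
    "/usr/local/Ascend/driver/lib64/",
    "/usr/local/Ascend/driver/version.info",
    "/etc/ascend_install.info",
    "/root/.cache",
]


def build_docker_command(image, npu_ids, name="vllm-ascend-env"):
    """Build the `docker run` argument list for the given NPU indices."""
    args = ["docker", "run", "--rm", "--name", name]
    for npu_id in npu_ids:
        # One --device flag per NPU, following the /dev/davinci[0-7] pattern.
        args += ["--device", f"/dev/davinci{npu_id}"]
    for device in COMMON_DEVICES:
        args += ["--device", device]
    for path in MOUNTS:
        # Each path is mounted at the same location inside the container.
        args += ["-v", f"{path}:{path}"]
    args += ["-it", image, "bash"]
    return args


if __name__ == "__main__":
    command = build_docker_command(
        "quay.io/ascend/cann:8.3.rc2-910b-ubuntu22.04-py3.11", npu_ids=[7]
    )
    print(shlex.join(command))
```

Printing the command instead of executing it lets you review the flags before starting the container.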

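Before using Docker

If you use the prebuilt vllm-ascend Docker image, no extra steps are needed.

Once the steps above are complete, you can set up vllm and vllm-ascend.

Install vllm and vllm-ascend

Using pip

First install the system dependencies and configure the pip mirror: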

# Using apt-get with mirror
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq
# Or using yum
# yum update -y && yum install -y gcc g++ cmake numactl-devel wget git curl jq
# Config pip mirror
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

[Optional] If you are working on an x86 machine, or using a development version of torch-npu, configure an extra index for pip:

# For torch-npu post version or x86 machine
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"

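Then install vllm and vllm-ascend. As the comments in the commands note, vllm v0.11.0 is installed from source, while vllm-ascend is installed from the prebuilt wheel on PyPI: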

# Install vllm-project/vllm. The newest supported version is v0.11.0.
# Because v0.11.0 has not been archived on PyPI, you need to install it from source.
git clone --depth 1 --branch v0.11.0 https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install -v -e .
cd ..

# Install vllm-project/vllm-ascend from pypi.
pip install vllm-ascend==0.11.0
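After the commands above finish, it can help to confirm that the installed packages match the pins in the dependency table. The following is a minimal sketch using the standard library's `importlib.metadata`. The distribution names in `EXPECTED` are assumed to match what pip installed, and a local suffix such as `+cpu` is ignored when comparing.

```python
from importlib import metadata

# Pins taken from the dependency table and the install command in this document.
EXPECTED = {
    "torch": "2.7.1",
    "torch-npu": "2.7.1.post1",
    "vllm-ascend": "0.11.0",
}


def public_version(version):
    """Drop a local version suffix such as '+cpu' before comparing."""
    return version.split("+", 1)[0]


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None  # Package is not installed in this environment.
        if found is None or public_version(found) != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(EXPECTED)
    for package, (wanted, found) in problems.items():
        print(f"{package}: expected {wanted}, found {found or 'not installed'}")
    if not problems:
        print("All pinned packages match.")
```

The `get_version` parameter exists only so the comparison logic can be exercised without the real packages installed.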

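Note

If you need the sleep mode feature, set COMPILE_CUSTOM_KERNELS=1 manually. Building the custom operators requires a gcc/g++ version higher than 8 with support for C++17 or later. If you hit a torch-npu version conflict when using pip install -e ., use pip install --no-build-isolation -e . instead so that the build runs in the system environment. If other problems occur during compilation, the cause is usually an unexpected compiler; before compiling, set the CXX_COMPILER and C_COMPILER environment variables to the paths of g++ and gcc.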
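Before building the custom operators, the compiler requirement in the note can be checked with a short script. This sketch relies on `gcc -dumpversion` and `g++ -dumpversion`, and it reads "higher than 8" as a major version of at least 9, which is an interpretation rather than something the note states. The helper names are illustrative.

```python
import re
import shutil
import subprocess


def parse_major(version_text):
    """Extract the leading major version number from text like '11.4.0'."""
    match = re.match(r"\s*(\d+)", version_text)
    return int(match.group(1)) if match else None


def compiler_major(compiler):
    """Return the major version reported by `<compiler> -dumpversion`, or None."""
    if shutil.which(compiler) is None:
        return None
    try:
        result = subprocess.run(
            [compiler, "-dumpversion"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return parse_major(result.stdout) if result.returncode == 0 else None


def meets_requirement(major, minimum_exclusive=8):
    """Interpretation used here: the major version must be greater than 8."""
    return major is not None and major > minimum_exclusive


if __name__ == "__main__":
    for compiler in ("gcc", "g++"):
        major = compiler_major(compiler)
        print(f"{compiler}: major={major}, ok={meets_requirement(major)}")
```

If the compiler found on PATH is too old, pointing CXX_COMPILER and C_COMPILER at a newer toolchain, as the note describes, is the documented remedy.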

Using Docker

You can pull the prebuilt image directly and run it with bash.

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0
docker run --rm \
    --name vllm-ascend-env \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash

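The default working directory is /workspace. The vLLM and vLLM Ascend code is located in /vllm-workspace and installed in development mode (pip install -e), so code changes take effect immediately without reinstalling.

Additional information

Verify the installation

Create and run a simple inference test. An example example.py: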

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="Qwen/Qwen3-0.6B")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Then run:

# Try `export VLLM_USE_MODELSCOPE=true` and `pip install modelscope`
# to speed up download if huggingface is not reachable.
python example.py

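The output is similar to the following. Because the example samples with temperature=0.8, the generated text will vary between runs: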

Prompt: 'Hello, my name is', Generated text: " Shinji, a teenage boy from New York City. I'm a computer science"
Prompt: 'The president of the United States is', Generated text: ' a very important person. When he or she is elected, many people think that'
Prompt: 'The capital of France is', Generated text: ' Paris. The oldest part of the city is Saint-Germain-des-Pr'
Prompt: 'The future of AI is', Generated text: ' not bright\n\nThere is no doubt that the evolution of AI will have a huge'
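The comment in the run step suggests setting VLLM_USE_MODELSCOPE=true when Hugging Face is not reachable. A small launcher can apply that setting to the child process only, leaving your shell environment unchanged. This is a sketch: `build_env` and `run_example` are hypothetical helpers, and the modelscope package still has to be installed separately, as the comment says.

```python
import os
import subprocess
import sys


def build_env(use_modelscope, base_env=None):
    """Return a copy of the environment, optionally enabling ModelScope downloads."""
    env = dict(os.environ if base_env is None else base_env)
    if use_modelscope:
        # Same setting as `export VLLM_USE_MODELSCOPE=true` in the run step.
        env["VLLM_USE_MODELSCOPE"] = "true"
    return env


def run_example(script="example.py", use_modelscope=False):
    """Run the script with the current interpreter and return its exit code."""
    completed = subprocess.run(
        [sys.executable, script], env=build_env(use_modelscope)
    )
    return completed.returncode
```

Calling `run_example(use_modelscope=True)` from the directory that contains example.py runs the test with the setting applied.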