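Quick Start

Introduction

This section walks you through setting up a container-based environment and running large-model inference, using a Qwen3-0.6B offline single-card inference script as the example.

For details on using other models, see the corresponding tutorial in the "Model Tutorials" directory, for example Qwen3-30B-A3B.
For details on using other features, see the corresponding tutorial in the "Feature Tutorials" directory, for example Prefill-Decode disaggregated deployment (DeepSeek).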

Prerequisites

Supported Devices

Atlas A2 training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
Atlas 800I A2 inference series (Atlas 800I A2)
Atlas A3 training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD)
Atlas 800I A3 inference series (Atlas 800I A3)
[Experimental] Atlas 300I inference series (Atlas 300I Duo)

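Environment Requirements

OS: Linux
Python: >= 3.10, < 3.13
Hardware: a machine with Ascend NPUs, typically the Atlas 800 A2 series.

Software:

Software	Supported version	Notes
Ascend HDK	See the CANN 9.0.0 documentation	Required by CANN
CANN	== 9.0.0	Required by vllm-ascend and torch-npu
torch-npu	== 2.10.0	Required by vllm-ascend; no manual installation needed, it is installed automatically in the steps below
torch	== 2.10.0	Required by torch-npu and vllm; no manual installation needed, it is installed automatically in the steps below
NNAL	== 9.0.0	Required by libatb.so, which enables advanced tensor operations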
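
The Python constraint above is easy to check before going further. The following is a minimal sketch, not part of vLLM Ascend: the helper name `python_version_supported` is hypothetical, and the bounds are copied from the requirement listed above.

```python
import sys

# Bounds taken from the requirement above: Python >= 3.10, < 3.13.
MIN_PYTHON = (3, 10)
MAX_PYTHON_EXCLUSIVE = (3, 13)


def python_version_supported(version=None):
    """Return True if (major, minor) falls inside the supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) < MAX_PYTHON_EXCLUSIVE


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if python_version_supported() else "not supported"
    print(f"Python {current} is {status}")
```

Only the major and minor components are compared, so any patch release inside the range passes.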

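Set Up the Environment Using a Container

Before using a container, make sure Docker is installed on your system. If it is not, follow the Docker installation guide.

Choose the commands that match your container image: the first block below is for the Ubuntu image, the second for the openEuler image.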

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
# Atlas A2:
# export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1
# Atlas A3:
# export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1-a3
export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
# Inside the container: install curl
apt-get update -y && apt-get install -y curl

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
# Atlas A2:
# export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1-openeuler
# Atlas A3:
# export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1-a3-openeuler
export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1-openeuler
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
# Inside the container: install curl
yum update -y && yum install -y curl
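
The image tags in the two command blocks differ only by suffix (`-a3` for Atlas A3, `-openeuler` for openEuler), and the `docker run` flags are identical. As a rough sketch, the same command line could be assembled in Python. The helper names `image_for` and `docker_run_args` are hypothetical and exist only for this illustration; the registry, tag suffixes, devices, and mounts are copied from the commands above.

```python
import shlex

REGISTRY = "quay.io/ascend/vllm-ascend"


def image_for(version, series="a2", os_name="ubuntu"):
    """Compose the image reference from the tag suffixes shown above."""
    if series not in ("a2", "a3"):
        raise ValueError(f"unknown series: {series!r}")
    if os_name not in ("ubuntu", "openeuler"):
        raise ValueError(f"unknown OS: {os_name!r}")
    tag = version
    if series == "a3":
        tag += "-a3"
    if os_name == "openeuler":
        tag += "-openeuler"
    return f"{REGISTRY}:{tag}"


def docker_run_args(device, image):
    """Return the `docker run` argument list used in the commands above."""
    devices = [device, "/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"]
    mounts = [
        "/usr/local/dcmi",
        "/usr/local/bin/npu-smi",
        "/usr/local/Ascend/driver/lib64/",
        "/usr/local/Ascend/driver/version.info",
        "/etc/ascend_install.info",
        "/root/.cache",
    ]
    args = ["docker", "run", "--rm", "--name", "vllm-ascend", "--shm-size=1g"]
    for dev in devices:
        args += ["--device", dev]
    for path in mounts:
        args += ["-v", f"{path}:{path}"]  # same path on host and in container
    args += ["-p", "8000:8000", "-it", image, "bash"]
    return args


if __name__ == "__main__":
    image = image_for("v0.22.1rc1", series="a3", os_name="openeuler")
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(docker_run_args("/dev/davinci0", image)))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without Docker or NPUs.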

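The default working directory is /workspace. The vLLM and vLLM Ascend source code is located in /vllm-workspace and installed in development mode (pip install -e), so changes made by developers take effect immediately without reinstalling.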
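
One way to confirm that the packages really resolve to the source tree is to ask the import system where it would load them from. This is a small sketch under assumptions: the helper `package_dir` is hypothetical, and the module names `vllm` and `vllm_ascend` are taken from the package and plugin names used in this guide.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory a top-level package is imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve().parent


if __name__ == "__main__":
    for name in ("vllm", "vllm_ascend"):
        # Inside the container these are expected to sit under /vllm-workspace.
        print(f"{name}: {package_dir(name)}")
```

`find_spec` locates a top-level package without importing it, so the check is fast and has no side effects.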

Usage

You can use the ModelScope mirror to speed up model downloads:

export VLLM_USE_MODELSCOPE=True
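
The same switch can be set from inside a script. The sketch below assumes the variable should be in place before `vllm` is imported, which is a conservative choice rather than a documented requirement; the helper name `enable_modelscope` is hypothetical.

```python
import os


def enable_modelscope(env=os.environ):
    """Request ModelScope downloads unless the caller already chose a value."""
    env.setdefault("VLLM_USE_MODELSCOPE", "True")
    return env["VLLM_USE_MODELSCOPE"]


# In a real script, call enable_modelscope() before `from vllm import LLM`.
```

Using `setdefault` leaves an explicit value exported in the shell untouched.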

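There are two ways to start vLLM on Ascend NPUs:

Offline batched inference
OpenAI Completions API

With vLLM installed, you can generate text for a list of input prompts (that is, offline batched inference).

Create and run a simple inference test. The content of example.py can be as follows: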

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# The first run will take about 3-5 mins (10 MB/s) to download models
llm = LLM(model="Qwen/Qwen3-0.6B")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Then run:

python example.py

If you hit a connection error to Hugging Face (for example, We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.), run the following commands to use ModelScope instead:

export VLLM_USE_MODELSCOPE=True
pip install modelscope
python example.py

The following log lines show that vLLM has detected the Ascend platform:

INFO 05-27 11:40:38 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 05-27 11:40:38 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 05-27 11:40:38 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 05-27 11:40:38 [__init__.py:238] Platform plugin ascend is activated
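
When the run is scripted, the activation line can be checked automatically. This is only a sketch: the pattern is modelled on the sample log line above, and the exact wording may differ between vLLM versions. The helper name `activated_platform` is hypothetical.

```python
import re

# Pattern modelled on the sample line "Platform plugin ascend is activated".
ACTIVATED = re.compile(r"Platform plugin (\S+) is activated")


def activated_platform(log_text):
    """Return the activated platform plugin name, or None if no line matches."""
    match = ACTIVATED.search(log_text)
    return match.group(1) if match else None
```

For example, passing the captured output of `python example.py` to `activated_platform` should give `"ascend"` on a working setup.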

The following lines show the final output:

Prompt: 'Hello, my name is', Generated text: ' Lucy and I am an 8 year old who loves to draw and write stories'
Prompt: 'The president of the United States is', Generated text: " a key leader in the federal government, and the president's role in the executive"
Prompt: 'The capital of France is', Generated text: ' a city. What is the capital of France? The capital of France is Paris'
Prompt: 'The future of AI is', Generated text: ' a topic that is being discussed in various contexts. In the business world, AI'
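
If the results need to be stored rather than printed, the objects returned by `llm.generate` can be flattened into plain dictionaries. The sketch below relies only on the two attributes already used in example.py (`prompt` and `outputs[0].text`); the helper name `to_records` is hypothetical, and the demo uses stand-in objects so that it runs without an NPU.

```python
import json
from types import SimpleNamespace


def to_records(outputs):
    """Flatten generate() results into plain dicts (first completion only)."""
    return [
        {"prompt": item.prompt, "generated_text": item.outputs[0].text}
        for item in outputs
    ]


if __name__ == "__main__":
    # Stand-in objects for demonstration; in example.py pass the real `outputs`.
    fake_outputs = [
        SimpleNamespace(
            prompt="Hello, my name is",
            outputs=[SimpleNamespace(text=" Lucy")],
        )
    ]
    print(json.dumps(to_records(fake_outputs), ensure_ascii=False))
```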

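The following log shows the process exiting after offline inference. The "Engine core proc EngineCore died unexpectedly" error is logged during shutdown and does not affect the inference itself: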

(EngineCore pid=970) INFO 05-12 11:36:00 [core.py:1201] Shutdown initiated (timeout=0)
(EngineCore pid=970) INFO 05-12 11:36:00 [core.py:1224] Shutdown complete
ERROR 05-12 11:36:01 [core_client.py:704] Engine core proc EngineCore died unexpectedly, shutting down client.
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

vLLM can also be deployed as a server that implements the OpenAI API protocol. Run the following command to start the vLLM server with the Qwen/Qwen3-0.6B model:

# Deploy vLLM server (The first run will take about 3-5 mins (10 MB/s) to download models)
vllm serve Qwen/Qwen3-0.6B &

If you see logs like the following:

INFO:     Started server process [3594]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Congratulations, you have successfully started the vLLM server!
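
Because the server is started in the background and the first run may spend several minutes downloading the model, a script usually has to wait until the port answers. The following is a minimal polling sketch using only the standard library. The helper names `http_ok` and `wait_until` are hypothetical, and the 300-second default timeout is an assumption based on the download time mentioned above.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if `url` answers with HTTP 200, False on connection errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        return False  # server not accepting requests yet


def wait_until(probe, timeout=300.0, interval=2.0):
    """Call `probe` repeatedly until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage (requires the server started above):
#   ready = wait_until(lambda: http_ok("http://localhost:8000/v1/models"))
```

Separating the probe from the retry loop keeps the loop testable without a running server.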

You can query the list of models:

curl http://localhost:8000/v1/models | python3 -m json.tool
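
The same query can be issued from Python. This sketch assumes the response follows the OpenAI model-list shape, an object with a `data` array whose entries carry an `id`. The helper names `model_ids` and `list_models` are hypothetical.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model IDs from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url="http://localhost:8000"):
    """Fetch /v1/models from a running server and return the model IDs."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as response:
        return model_ids(json.load(response))


# Usage (requires the server started above):
#   print(list_models())
```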

You can also query the model with an input prompt:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-0.6B",
        "prompt": "Beijing is a",
        "max_tokens": 5,
        "temperature": 0
    }' | python3 -m json.tool
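
For callers that prefer Python over curl, the request can be built with the standard library. This is a sketch under assumptions: it uses `max_tokens`, the length-limit field defined by the OpenAI Completions API, and it reads the generated text from `choices[0].text` as that API specifies. The helper names `build_completion_request` and `first_text` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, model="Qwen/Qwen3-0.6B", max_tokens=5,
                             temperature=0, base_url="http://localhost:8000"):
    """Build a POST request for the /v1/completions endpoint."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_text(response_payload):
    """Return the text of the first choice in a Completions response."""
    return response_payload["choices"][0]["text"]


# Usage (requires the server started above):
#   request = build_completion_request("Beijing is a")
#   with urllib.request.urlopen(request, timeout=60) as response:
#       print(first_text(json.load(response)))
```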

vLLM is running as a background process. You can stop it gracefully with kill -2 $VLLM_PID, which sends SIGINT just as pressing Ctrl-C does for a foreground vLLM process:

VLLM_PID=$(pgrep -f "vllm serve")
kill -2 "$VLLM_PID"

The output looks like this:

INFO:     Shutting down FastAPI HTTP server.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
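
If the server is launched from Python instead of the shell, the equivalent of `kill -2` is sending `signal.SIGINT` to the child process. The sketch below is an illustration, not part of vLLM: the helper name `stop_gracefully` is hypothetical, and the fallback to SIGTERM and then SIGKILL after a timeout is an added assumption that this guide does not describe.

```python
import signal
import subprocess


def stop_gracefully(process, timeout=30.0):
    """Send SIGINT (what `kill -2` sends), then escalate if the process lingers."""
    if process.poll() is None:
        process.send_signal(signal.SIGINT)
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.terminate()  # SIGTERM fallback; an assumption, not from this guide
        try:
            return process.wait(timeout=5)
        except subprocess.TimeoutExpired:
            process.kill()  # last resort
            return process.wait()


# Usage (hypothetical launcher, same command as above):
#   server = subprocess.Popen(["vllm", "serve", "Qwen/Qwen3-0.6B"])
#   ...
#   stop_gracefully(server)
```

Keeping a handle on the `Popen` object avoids the `pgrep` lookup, which can match more than one process.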

Finally, press Ctrl-D to exit the container.