快速入门

快速入门#

先决条件#

支持的设备#

Atlas A2 training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
Atlas 800I A2 inference series (Atlas 800I A2)
Atlas A3 training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD)
Atlas 800I A3 inference series (Atlas 800I A3)
[Experimental] Atlas 300I inference series (Atlas 300I Duo) Currently for 310I Duo, the stable version is vllm-ascend v0.10.0rc1

使用容器设置环境#

Ubuntu

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
# Install curl
apt-get update -y && apt-get install -y curl

openEuler

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0-openeuler
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
# Install curl
yum update -y && yum install -y curl

默认的工作目录是 /workspace，vLLM 和 vLLM Ascend 代码被放置在 /vllm-workspace，并以开发模式（pip install -e）安装，以便开发者能够即时生效更改，而无需重新安装。

用法#

你可以使用 Modelscope 镜像来加速下载：

export VLLM_USE_MODELSCOPE=true

在昇腾 NPU 上启动 vLLM 有两种方式：

离线批量推理

安装了 vLLM 后，您可以开始为一系列输入提示生成文本（即离线批量推理）。

尝试直接运行下面的 Python 脚本，或者使用 python3 交互式命令行来生成文本：

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# The first run will take about 3-5 mins (10 MB/s) to download models
llm = LLM(model="Qwen/Qwen3-0.6B")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

OpenAI Completions API

vLLM can also be deployed as a server that implements the OpenAI API protocol. Run the following command to start the vLLM server with the Qwen/Qwen3-0.6B model:

# Deploy vLLM server (The first run will take about 3-5 mins (10 MB/s) to download models)
vllm serve Qwen/Qwen3-0.6B &

If you see a log as below:

INFO:     Started server process [3594]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

恭喜，你已经成功启动了 vLLM 服务器！

You can query the list of models:

curl http://localhost:8000/v1/models | python3 -m json.tool

你也可以通过输入提示来查询模型：

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-0.6B",
        "prompt": "Beijing is a",
        "max_tokens": 5,
        "temperature": 0
    }' | python3 -m json.tool

vLLM is serving as a background process, you can use kill -2 $VLLM_PID to stop the background process gracefully, which is similar to Ctrl-C for stopping the foreground vLLM process:

  VLLM_PID=$(pgrep -f "vllm serve")
  kill -2 "$VLLM_PID"

The output is as below:

INFO:     Shutting down FastAPI HTTP server.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.

Finally, you can exit the container by using ctrl-D.