Prefill-Decode Disaggregation (Qwen2.5-VL)

Getting Started

vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide walks through verifying the feature step by step in a resource-constrained environment.

The example uses the Qwen2.5-VL-7B-Instruct model with vLLM-Ascend v0.20.2rc1 (with vLLM v0.20.2) on one Atlas 800T A2 server to deploy the "1P1D" architecture (one prefiller and one decoder on the same node). The node's IP address is assumed to be 192.0.0.1.

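Verify the Communication Environment

Verification Process

Single-node verification:

Run the following commands in order. Every result must be success, and every link status must be UP: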

# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
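
Reading eight blocks of output by eye is error-prone, so the link check above can be summarized with a small filter. This is a minimal sketch: check_links is a hypothetical helper name, and it assumes each device prints a line of the form "link status: UP" (or DOWN), which may vary between driver versions.

```shell
# Summarize link-status lines read from stdin.
# Assumption: hccn_tool prints one "link status: <STATE>" line per device.
# Exits non-zero if no such line was seen or if any link is not UP.
check_links() {
    awk '
        tolower($0) ~ /link status/ { total++; if ($0 !~ /UP/) down++ }
        END {
            printf "%d checked, %d not UP\n", total, down
            exit ((total == 0 || down > 0) ? 1 : 0)
        }'
}

# Usage on an NPU host:
#   for i in {0..7}; do hccn_tool -i $i -link -g; done | check_links
```

The helper treats "no status lines at all" as a failure, so a missing hccn_tool or an empty output is not mistaken for a healthy node.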

Check the NPU HCCN configuration:

Make sure the hccn.conf file exists in the environment. If you use Docker, mount it into the container.
```
cat /etc/hccn.conf
```

Get the NPU IP addresses

for i in {0..7}; do hccn_tool -i $i -ip -g; done

Cross-node PING test

# Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address).
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done

Check the NPU TLS configuration

# The TLS settings should be consistent across all nodes
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
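
The consistency requirement can be checked mechanically by counting the distinct switch values. The sketch below is illustrative only: tls_is_consistent is a hypothetical helper, and it assumes each output line contains a fragment such as "tls switch[0]", which should be confirmed against the actual output of your driver version.

```shell
# Succeed only if every line on stdin reports the same TLS switch value.
# Assumption: each device line contains a fragment like "tls switch[0]".
tls_is_consistent() {
    distinct=$(grep -o 'tls switch\[[0-9]*\]' | sort -u | wc -l)
    [ "$distinct" -eq 1 ]
}

# Usage: collect the output from every node into one stream, for example
#   for i in {0..7}; do hccn_tool -i $i -tls -g; done | tls_is_consistent \
#       && echo "TLS settings match"
```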

Run with Docker

Start a Docker container.

# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.20.2rc1
export NAME=vllm-ascend

# Run the container using the defined variables
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash

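Install Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. For installation and build instructions, see kvcache-ai/Mooncake. First, fetch the Mooncake project with the following command: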

git clone -b v0.3.9 --depth 1 https://github.com/kvcache-ai/Mooncake.git

(Optional) If your network connection is poor, replace the Go download URL.

cd Mooncake
sed -i 's|https://go.dev/dl/|https://golang.google.cn/dl/|g' dependencies.sh

Install MPI.

apt-get install mpich libmpich-dev -y

Install the required dependencies. Go does not need to be installed.

bash dependencies.sh -y

Compile and install.

mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install

Set the environment variables.

Note:

Adjust the Python path to match your Python installation.
Make sure /usr/local/lib and /usr/local/lib64 are in your LD_LIBRARY_PATH.

export LD_LIBRARY_PATH=/usr/local/lib64/python3.11/site-packages/mooncake:$LD_LIBRARY_PATH
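
To cover both notes at once without adding duplicate entries when the commands are re-run, the paths can be prepended through a small guard. This is a sketch rather than part of the official setup: prepend_ld_path is a hypothetical helper, and the python3.11 path is the one from the command above, which you should adjust to your installation.

```shell
# Prepend a directory to LD_LIBRARY_PATH unless it is already listed.
prepend_ld_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already present, leave the variable unchanged
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

prepend_ld_path /usr/local/lib64
prepend_ld_path /usr/local/lib
# Path copied from the export above; adjust the Python version as needed.
prepend_ld_path /usr/local/lib64/python3.11/site-packages/mooncake
```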

Prefiller/Decoder Deployment

Run the following scripts to launch a server on the prefiller NPU and the decoder NPU, respectively.

Prefiller

export ASCEND_RT_VISIBLE_DEVICES=0
export HCCL_IF_IP=192.0.0.1  # node ip
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10

vllm serve /model/Qwen2.5-VL-7B-Instruct  \
  --host 0.0.0.0 \
  --port 13700 \
  --no-enable-prefix-caching \
  --tensor-parallel-size 1 \
  --seed 1024 \
  --served-model-name qwen25vl \
  --max-model-len 40000  \
  --max-num-batched-tokens 40000  \
  --trust-remote-code \
  --gpu-memory-utilization 0.9  \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnectorV1",
  "kv_role": "kv_producer",
  "kv_port": "30000",
  "engine_id": "0",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 1,
                    "tp_size": 1
             },
             "decode": {
                    "dp_size": 1,
                    "tp_size": 1
             }
      }
  }'

Decoder

export ASCEND_RT_VISIBLE_DEVICES=1
export HCCL_IF_IP=192.0.0.1  # node ip
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10

vllm serve /model/Qwen2.5-VL-7B-Instruct  \
  --host 0.0.0.0 \
  --port 13701 \
  --no-enable-prefix-caching \
  --tensor-parallel-size 1 \
  --seed 1024 \
  --served-model-name qwen25vl \
  --max-model-len 40000  \
  --max-num-batched-tokens 40000  \
  --trust-remote-code \
  --gpu-memory-utilization 0.9  \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnectorV1",
  "kv_role": "kv_consumer",
  "kv_port": "30100",
  "engine_id": "1",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 1,
                    "tp_size": 1
             },
             "decode": {
                    "dp_size": 1,
                    "tp_size": 1
             }
      }
  }'

To run "2P1D", set a different ASCEND_RT_VISIBLE_DEVICES value and a different port for each P process.
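
When launching several instances, generating the --kv-transfer-config value from a function avoids copy-paste mistakes. The following is a minimal sketch: kv_transfer_config is a hypothetical helper, and the values for the second prefiller (device 2, port 13702, kv_port 30200, engine_id 2) are illustrative assumptions. The guide only requires a distinct ASCEND_RT_VISIBLE_DEVICES and port; varying kv_port and engine_id as well is an assumption made here to avoid collisions on a shared node. The parallel sizes are kept as in the scripts above.

```shell
# Print a --kv-transfer-config JSON value.
# $1: kv_role (kv_producer or kv_consumer), $2: kv_port, $3: engine_id
kv_transfer_config() {
    cat <<EOF
{"kv_connector": "MooncakeConnectorV1",
 "kv_role": "$1",
 "kv_port": "$2",
 "engine_id": "$3",
 "kv_connector_extra_config": {
     "prefill": {"dp_size": 1, "tp_size": 1},
     "decode": {"dp_size": 1, "tp_size": 1}
 }
}
EOF
}

# Hypothetical second prefiller (all values are assumptions):
#   export ASCEND_RT_VISIBLE_DEVICES=2
#   vllm serve /model/Qwen2.5-VL-7B-Instruct --port 13702 ... \
#       --kv-transfer-config "$(kv_transfer_config kv_producer 30200 2)"
```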

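Deploy the Example Proxy

Run the proxy server on the same node as the prefiller service instance. The proxy program is available in the repository's examples: load_balance_proxy_server_example.py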

python load_balance_proxy_server_example.py \
    --host 192.0.0.1 \
    --port 8080 \
    --prefiller-hosts 192.0.0.1 \
    --prefiller-port 13700 \
    --decoder-hosts 192.0.0.1 \
    --decoder-ports 13701

Parameter	Meaning
--port	Proxy port
--prefiller-port	All prefiller ports
--decoder-ports	All decoder ports

Verification

Check the health of the service through the proxy server endpoint.

curl http://192.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen25vl",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
                {"type": "text", "text": "What is the text in the illustration?"}
            ]}
        ],
        "max_completion_tokens": 100,
        "temperature": 0
    }'
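
If the multimodal request fails, a text-only request can help separate proxy or PD problems from image-download problems. The sketch below is an assumption-laden convenience, not part of the official procedure: text_payload, smoke_test, and the PROXY_URL variable are hypothetical names, and the endpoint and model name are taken from the commands above.

```shell
# Proxy endpoint from this guide; override by exporting PROXY_URL.
PROXY_URL="${PROXY_URL:-http://192.0.0.1:8080}"

# Build a text-only chat request body.
# $1: served model name, $2: prompt (must not contain double quotes or backslashes)
text_payload() {
    printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_completion_tokens": 32, "temperature": 0}' "$1" "$2"
}

# Send the request and print only the HTTP status code.
smoke_test() {
    curl -s -o /dev/null -w '%{http_code}' \
        -H "Content-Type: application/json" \
        -d "$(text_payload qwen25vl 'Say hello.')" \
        "$PROXY_URL/v1/chat/completions"
}

# Usage: [ "$(smoke_test)" = "200" ] && echo "proxy reachable, text path OK"
```

A 200 status here combined with a failure on the image request suggests looking at image access from the serving node rather than at the PD setup.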