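Prefill-Decode Disaggregation Mooncake Verification (Qwen)#
Getting Started#
vLLM-Ascend now supports prefill-decode (PD) disaggregation, with an expert-parallel (EP) option. This guide walks you step by step through verifying these features in a resource-constrained environment.
Taking the Qwen3-235B model as an example, we use 4 Atlas 800T A3 servers to deploy a "2P1D" architecture: two prefill instances, one per server, and one decode instance that spans two servers. Assume the prefill servers have the IP addresses 192.0.0.1 (prefiller 1) and 192.0.0.2 (prefiller 2), and the decode servers have 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). Each server contributes 8 NPUs (16 chips) to its service instance.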
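To make the layout concrete, the following minimal sketch models the deployment as data and checks that every instance uses exactly the chips its servers provide. The `Instance` class and its field names are illustrative, not part of vLLM-Ascend; the parallel sizes are the ones passed to `vllm serve` in the launch scripts later in this guide.

```python
from dataclasses import dataclass

CHIPS_PER_NODE = 16  # 8 NPUs with 2 chips each, per the description above


@dataclass(frozen=True)
class Instance:
    """Illustrative model of one service instance (not a vLLM-Ascend type)."""
    name: str
    role: str      # "prefill" or "decode"
    nodes: tuple   # IP addresses of the servers backing this instance
    dp_size: int   # data-parallel size across the whole instance
    tp_size: int   # tensor-parallel size

    def chips_needed(self) -> int:
        return self.dp_size * self.tp_size

    def chips_available(self) -> int:
        return CHIPS_PER_NODE * len(self.nodes)


TOPOLOGY = [
    Instance("prefiller 1", "prefill", ("192.0.0.1",), dp_size=2, tp_size=8),
    Instance("prefiller 2", "prefill", ("192.0.0.2",), dp_size=2, tp_size=8),
    Instance("decoder", "decode", ("192.0.0.3", "192.0.0.4"), dp_size=32, tp_size=1),
]


def summarize(topology) -> str:
    """Return the xPyD shorthand for a list of instances."""
    prefill = sum(1 for inst in topology if inst.role == "prefill")
    decode = sum(1 for inst in topology if inst.role == "decode")
    return f"{prefill}P{decode}D"


for inst in TOPOLOGY:
    # Each instance should occupy every chip on the servers assigned to it.
    assert inst.chips_needed() == inst.chips_available(), inst.name

print(summarize(TOPOLOGY))  # 2P1D
```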
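Verify the Multi-Node Communication Environment#
Physical Layer Requirements#
The physical servers must be on the same LAN and able to reach each other over the network.
All NPUs must be interconnected: within a node over HCCS, and across nodes over RDMA.
Verification Process#
Single-node verification:
Run the following commands on each node in turn. Every result must be success and every link status must be UP: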
# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
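Scanning 16 devices' worth of output by eye is error-prone, so a small filter can help. The sketch below assumes you have captured the output of one of the loops above as text and that each line of interest contains the expected keyword (`UP` or `success`) as a whole word; the sample text is hypothetical, and the real `hccn_tool` output may be formatted differently.

```python
import re


def failing_lines(output: str, expected: str) -> list:
    """Return the non-empty lines that do not contain `expected` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(expected)}\b", re.IGNORECASE)
    return [
        line.strip()
        for line in output.splitlines()
        if line.strip() and not pattern.search(line)
    ]


# Hypothetical captured output, used only to exercise the helper.
sample_link_output = """\
link status: UP
link status: UP
link status: DOWN
"""

print(failing_lines(sample_link_output, "UP"))  # ['link status: DOWN']
```

An empty list means every line carried the expected keyword; anything else points at the device to investigate.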
Check the NPU network configuration:
# Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.
cat /etc/hccn.conf
Get the NPU IP addresses:
for i in {0..15}; do hccn_tool -i $i -ip -g | grep ipaddr; done
Cross-node PING test:
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
for i in {0..15}; do hccn_tool -i $i -ping -g address x.x.x.x;done
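The ping loop above targets a single remote address. To cover every NPU on a remote node, you can generate one command per local device and remote address. This sketch assumes the `-ip -g` output contains lines of the form `ipaddr:<IPv4>` (the sample addresses are hypothetical) and reuses the ping command exactly as written above.

```python
import re

# Assumed line format: "ipaddr:<IPv4 address>"; adjust if your output differs.
IPADDR_RE = re.compile(r"ipaddr\s*[:=]\s*(\d{1,3}(?:\.\d{1,3}){3})")


def parse_npu_ips(output: str) -> list:
    """Extract the NPU IP addresses from captured `hccn_tool ... -ip -g` output."""
    return IPADDR_RE.findall(output)


def ping_commands(remote_ips, num_devices: int = 16) -> list:
    """Build one ping command per (local device, remote NPU address) pair."""
    return [
        f"hccn_tool -i {dev} -ping -g address {ip}"
        for dev in range(num_devices)
        for ip in remote_ips
    ]


sample = "ipaddr:10.20.0.1\nipaddr:10.20.0.2\n"  # hypothetical remote-node output
remote_ips = parse_npu_ips(sample)
commands = ping_commands(remote_ips, num_devices=2)

print(len(commands))  # 4
print(commands[0])    # hccn_tool -i 0 -ping -g address 10.20.0.1
```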
Install Mooncake#
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. First, obtain the Mooncake project with the following command:
git clone https://github.com/kvcache-ai/Mooncake.git
Update the package index and install Python:
apt-get update
apt-get install python3
Modify the Mooncake build options:
cd Mooncake
vi mooncake-common/common.cmake
# find this row and set USE_ASCEND_DIRECT ON.
option(USE_ASCEND_DIRECT "option for using ascend npu with adxl engine" ON)
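Before building, it can be worth confirming that the edit took effect. The helper below is a minimal sketch that looks for an `option(<NAME> "<description>" <VALUE>)` line in CMake text and returns its value; it only handles that single-line form and ignores commented-out lines.

```python
import re


def cmake_option_value(cmake_text: str, name: str):
    """Return the value of option(<name> "..." <value>), or None if not found."""
    pattern = re.compile(
        r'^\s*option\(\s*' + re.escape(name) + r'\s+"[^"]*"\s+(\w+)\s*\)',
        re.MULTILINE,
    )
    match = pattern.search(cmake_text)
    return match.group(1).upper() if match else None


sample = 'option(USE_ASCEND_DIRECT "option for using ascend npu with adxl engine" ON)\n'
print(cmake_option_value(sample, "USE_ASCEND_DIRECT"))  # ON
```

To check the real file, read it first, for example with `pathlib.Path("mooncake-common/common.cmake").read_text()`, and pass the text to the helper.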
Install MPI:
apt-get install mpich libmpich-dev -y
Install the remaining dependencies. Go does not need to be installed.
bash dependencies.sh -y
Build and install:
mkdir build
cd build
cmake ..
make -j
make install
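Prefiller/Decoder Deployment#
Run the following scripts on the prefiller and decoder nodes to launch the servers. Note that each P/D node occupies the port range from kv_port to kv_port + num_chips to initialize its socket listeners, so choose kv_port values whose ranges do not collide with each other or with other services on the same host. Also make sure engine_id values are assigned uniquely to avoid conflicts; in the scripts below, the two decoder nodes together form a single decode instance and therefore share engine_id 2.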
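The port and engine_id rules above can be checked mechanically before launching anything. The following is a minimal sketch, not part of vLLM-Ascend: the dictionary layout and the `instance` grouping key are assumptions made for illustration, introduced because the two decoder nodes in this guide's scripts belong to one data-parallel instance and use the same engine_id. Port ranges are only compared for services on the same host.

```python
from itertools import combinations


def port_range(kv_port: int, num_chips: int) -> range:
    """Ports kv_port through kv_port + num_chips, inclusive."""
    return range(kv_port, kv_port + num_chips + 1)


def find_conflicts(services) -> list:
    """Report overlapping port ranges on one host and engine_ids reused across instances."""
    problems = []
    for a, b in combinations(services, 2):
        if a["host"] == b["host"]:
            ra = port_range(a["kv_port"], a["num_chips"])
            rb = port_range(b["kv_port"], b["num_chips"])
            if ra.start < rb.stop and rb.start < ra.stop:
                problems.append(
                    f"port overlap on {a['host']}: kv_port {a['kv_port']} vs {b['kv_port']}"
                )
        if a["instance"] != b["instance"] and a["engine_id"] == b["engine_id"]:
            problems.append(
                f"engine_id {a['engine_id']} reused by {a['instance']} and {b['instance']}"
            )
    return problems


# Values taken from the launch scripts in this guide.
services = [
    {"instance": "prefiller-1", "host": "192.0.0.1", "kv_port": 30000, "num_chips": 16, "engine_id": "0"},
    {"instance": "prefiller-2", "host": "192.0.0.2", "kv_port": 30100, "num_chips": 16, "engine_id": "1"},
    {"instance": "decoder", "host": "192.0.0.3", "kv_port": 30200, "num_chips": 16, "engine_id": "2"},
    {"instance": "decoder", "host": "192.0.0.4", "kv_port": 30200, "num_chips": 16, "engine_id": "2"},
]
print(find_conflicts(services))  # []

# Hypothetical misconfiguration: two services co-located on one host.
colocated = [
    {"instance": "a", "host": "192.0.0.1", "kv_port": 30000, "num_chips": 16, "engine_id": "0"},
    {"instance": "b", "host": "192.0.0.1", "kv_port": 30010, "num_chips": 16, "engine_id": "0"},
]
print(len(find_conflicts(colocated)))  # 2
```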
Layerwise Mode#
Prefiller node 1 (192.0.0.1):
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.1
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address 192.0.0.1 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--enforce-eager \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
Prefiller node 2 (192.0.0.2):
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.2
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address 192.0.0.2 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--enforce-eager \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
Decoder node 1 (192.0.0.3):
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.3
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=2048
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 32 \
--data-parallel-size-local 16 \
--data-parallel-address 192.0.0.3 \
--data-parallel-rpc-port 5964 \
--tensor-parallel-size 1 \
--enable-expert-parallel \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 512 \
--max-num-seqs 16 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_capture_sizes":[16]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "2",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
Decoder node 2 (192.0.0.4):
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.4
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=2048
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--data-parallel-size 32 \
--data-parallel-size-local 16 \
--data-parallel-start-rank 16 \
--data-parallel-address 192.0.0.3 \
--data-parallel-rpc-port 5964 \
--tensor-parallel-size 1 \
--enable-expert-parallel \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 512 \
--max-num-seqs 16 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_capture_sizes":[16]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "2",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
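The four scripts differ only in a handful of values inside the `--kv-transfer-config` JSON, and hand-editing a multi-line quoted string invites quoting mistakes. As a minimal sketch, the JSON can be generated instead; the helper name and its parameters are illustrative, while the keys and values are the ones used in this guide's scripts.

```python
import json
import shlex


def kv_transfer_config(role: str, kv_port: int, engine_id: int, layerwise: bool = True) -> dict:
    """Build the kv-transfer-config dictionary used by the scripts in this guide."""
    if layerwise:
        connector = "MooncakeLayerwiseConnector"
        module = "vllm_ascend.distributed.mooncake_layerwise_connector"
    else:
        connector = "MooncakeConnector"
        module = "vllm_ascend.distributed.mooncake_connector"
    return {
        "kv_connector": connector,
        "kv_role": role,
        "kv_port": str(kv_port),       # the scripts pass these as strings
        "engine_id": str(engine_id),
        "kv_connector_module_path": module,
        "kv_connector_extra_config": {
            "prefill": {"dp_size": 2, "tp_size": 8},
            "decode": {"dp_size": 32, "tp_size": 1},
        },
    }


config = kv_transfer_config("kv_producer", kv_port=30000, engine_id=0)
# shlex.quote wraps the JSON so a POSIX shell passes it as one argument.
argument = shlex.quote(json.dumps(config))
print(f"--kv-transfer-config {argument}")
```

Passing `role="kv_consumer"`, `kv_port=30200`, and `engine_id=2` yields the decoder-side configuration.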
Non-Layerwise Mode#
Prefiller node 1 (192.0.0.1):
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.1
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address 192.0.0.1 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--enforce-eager \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
Prefiller node 2 (192.0.0.2):
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.2
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address 192.0.0.2 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--enforce-eager \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_producer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
Decoder node 1 (192.0.0.3):
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.3
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=2048
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 32 \
--data-parallel-size-local 16 \
--data-parallel-address 192.0.0.3 \
--data-parallel-rpc-port 5964 \
--tensor-parallel-size 1 \
--enable-expert-parallel \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 512 \
--max-num-seqs 16 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_capture_sizes":[16]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "2",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
Decoder node 2 (192.0.0.4):
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.4
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=2048
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--data-parallel-size 32 \
--data-parallel-size-local 16 \
--data-parallel-start-rank 16 \
--data-parallel-address 192.0.0.3 \
--data-parallel-rpc-port 5964 \
--tensor-parallel-size 1 \
--enable-expert-parallel \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 512 \
--max-num-seqs 16 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_capture_sizes":[16]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "2",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
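In the decoder scripts, `--data-parallel-size 32` is split into `--data-parallel-size-local 16` per node, and the second node additionally passes `--data-parallel-start-rank 16`. The sketch below shows the arithmetic behind those values, assuming ranks are assigned contiguously in node order (which matches the numbers in the scripts); the function name is illustrative.

```python
def start_ranks(local_sizes):
    """Return (start rank per node, total data-parallel size) for contiguous ranks."""
    ranks, offset = [], 0
    for size in local_sizes:
        ranks.append(offset)
        offset += size
    return ranks, offset


# Two decoder nodes with 16 local data-parallel ranks each.
ranks, total = start_ranks([16, 16])
print(ranks, total)  # [0, 16] 32
```

The first node starts at rank 0, which is why only the second decoder script sets a start rank explicitly.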
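Example Proxy for Deployment#
Run a proxy server on the same node as a prefiller service instance. The proxy programs are available in the repository's examples: use load_balance_proxy_layerwise_server_example.py for layerwise mode, or load_balance_proxy_server_example.py for non-layerwise mode.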
python load_balance_proxy_layerwise_server_example.py \
--host 192.0.0.1 \
--port 8080 \
--prefiller-hosts 192.0.0.1 192.0.0.2 \
--prefiller-port 8004 8004 \
--decoder-hosts 192.0.0.3 \
--decoder-ports 8004
python load_balance_proxy_server_example.py \
--host 192.0.0.1 \
--port 8080 \
--prefiller-hosts 192.0.0.1 192.0.0.2 \
--prefiller-port 8004 8004 \
--decoder-hosts 192.0.0.3 \
--decoder-ports 8004
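The host and port lists in the commands above have the same length, which suggests they are paired by position. The following sketch illustrates that pairing and uses a simple round-robin as a stand-in for request distribution. Both are assumptions made for illustration: the example proxy scripts may pair endpoints or select backends differently, so consult their source for the actual policy.

```python
from itertools import cycle


def pair_endpoints(hosts, ports) -> list:
    """Pair hosts and ports by position, rejecting lists of different lengths."""
    if len(hosts) != len(ports):
        raise ValueError(f"{len(hosts)} hosts but {len(ports)} ports")
    return [f"http://{host}:{port}" for host, port in zip(hosts, ports)]


prefillers = pair_endpoints(["192.0.0.1", "192.0.0.2"], [8004, 8004])
decoders = pair_endpoints(["192.0.0.3"], [8004])

# Round-robin stand-in for backend selection (not the proxy's actual algorithm).
next_prefiller = cycle(prefillers)
picks = [next(next_prefiller) for _ in range(3)]
print(picks)
# ['http://192.0.0.1:8004', 'http://192.0.0.2:8004', 'http://192.0.0.1:8004']
```

Only decoder node 1 is listed because decoder node 2 runs with `--headless` and is reached through the data-parallel group hosted on 192.0.0.3.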
Verification#
Send a completion request to the proxy server's endpoint to check that the service is healthy.
curl http://192.0.0.1:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-moe",
"prompt": "Who are you?",
"max_tokens": 100,
"temperature": 0
}'
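The same check can be scripted, for example as part of a smoke test. This is a minimal sketch using only the Python standard library; the function names are illustrative, and it assumes the proxy forwards an OpenAI-compatible completions response, in which the generated text is found under `choices[0].text`.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=100, temperature=0):
    """Build the same POST request that the curl command above sends."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_completion_request("http://192.0.0.1:8080", "qwen3-moe", "Who are you?")
print(request.full_url)  # http://192.0.0.1:8080/v1/completions

# With the deployment running, uncomment to send the request:
# result = send(request)
# print(result["choices"][0]["text"])
```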