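Prefill-Decode Disaggregation Mooncake Verification (Qwen)

Getting Started

vLLM-Ascend now supports prefill-decode (PD) disaggregation, with an option for expert parallelism (EP). This guide walks through verifying these features step by step in a resource-constrained environment.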

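Using the Qwen3-235B model as an example, this guide deploys a "2P1D" architecture on four Atlas 800T A3 servers. Assume the prefill servers have the IP addresses 192.0.0.1 (prefiller 1) and 192.0.0.2 (prefiller 2), and the decode servers have 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). Each server uses 8 NPUs (16 chips). Each prefill server runs its own service instance, while in the scripts in this guide the two decode servers jointly run a single decode instance.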
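As a quick sanity check on this topology, the sketch below compares the parallel sizes used in this guide's launch scripts with the chips available. The `Instance` class and its field names are hypothetical helpers for illustration only. The numbers come from the scripts in this guide (prefill: DP 2 x TP 8 on one server; decode: DP 32 x TP 1 across two servers). The rule that DP x TP should equal the number of chips is an assumption inferred from those values, not a documented requirement.

```python
from dataclasses import dataclass

CHIPS_PER_SERVER = 16  # 8 NPUs x 2 chips, as stated in this guide


@dataclass(frozen=True)
class Instance:
    """Hypothetical description of one service instance (illustration only)."""
    name: str
    servers: int   # number of physical servers the instance spans
    dp_size: int   # value passed to --data-parallel-size
    tp_size: int   # value passed to --tensor-parallel-size

    def chips_needed(self) -> int:
        return self.dp_size * self.tp_size

    def chips_available(self) -> int:
        return self.servers * CHIPS_PER_SERVER

    def fits(self) -> bool:
        # Assumption: one worker per chip, so DP x TP must match the chip count.
        return self.chips_needed() == self.chips_available()


TOPOLOGY_2P1D = [
    Instance("prefiller-1", servers=1, dp_size=2, tp_size=8),
    Instance("prefiller-2", servers=1, dp_size=2, tp_size=8),
    Instance("decoder", servers=2, dp_size=32, tp_size=1),
]

for inst in TOPOLOGY_2P1D:
    print(f"{inst.name}: needs {inst.chips_needed()}, has {inst.chips_available()}")
```

If you change the parallel sizes for a different cluster, rerunning a check like this before launching can catch a mismatch early.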

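Verify the Multi-Node Communication Environment

Physical Layer Requirements

  • The physical servers must be on the same LAN and able to reach one another over the network.

  • All NPUs must be interconnected: through HCCS within a node and through RDMA between nodes.

Verification Procedure

  1. Single-node verification:

Run the following commands on each node in turn. Every result must be success and every link status must be UP.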

# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done 
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
  2. Check the NPU network configuration:

# Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.
cat /etc/hccn.conf
  3. Get the NPU IP addresses:

for i in {0..15}; do hccn_tool -i $i -ip -g | grep ipaddr; done
  4. Cross-node PING test:

# Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address of the remote node)
for i in {0..15}; do hccn_tool -i $i -ping -g address x.x.x.x; done
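When several remote NPU addresses need to be tested, the ping commands can be generated rather than typed by hand. The sketch below only builds command strings in the same form as the loop in step 4. The function name `build_ping_commands` is hypothetical, and pinging every remote address from every local device is an assumption about how thorough the check should be.

```python
from typing import List


def build_ping_commands(remote_npu_ips: List[str], num_devices: int = 16) -> List[str]:
    """Return one hccn_tool ping command per (remote NPU IP, local device) pair."""
    return [
        f"hccn_tool -i {dev} -ping -g address {ip}"
        for ip in remote_npu_ips
        for dev in range(num_devices)
    ]


# Placeholder address; substitute the values reported by `hccn_tool -i $i -ip -g`
# on the remote node.
example_ips = ["x.x.x.x"]
for cmd in build_ping_commands(example_ips, num_devices=2):
    print(cmd)
```

The generated lines can be written to a shell script and run on the node under test.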

Install Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. First, obtain the Mooncake project with the following command:

git clone https://github.com/kvcache-ai/Mooncake.git

Update the package index and install Python.

apt-get update
apt-get install python3

Modify the Mooncake build options.

cd Mooncake
vi mooncake-common/common.cmake
# Find this line and set USE_ASCEND_DIRECT to ON:
option(USE_ASCEND_DIRECT "option for using ascend npu with adxl engine" ON)
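For scripted setups, the same edit can be made without opening an editor. The following is a minimal sketch, assuming the option line in mooncake-common/common.cmake has the form shown above with OFF as its current value. The helper name `enable_cmake_option` is hypothetical. It operates on the file's text, so reading and writing the file is left to the caller.

```python
import re


def enable_cmake_option(cmake_text: str, option_name: str) -> str:
    """Switch `option(<option_name> "..." OFF)` to ON; leave all other text untouched."""
    pattern = re.compile(
        r'(option\(\s*' + re.escape(option_name) + r'\s+"[^"]*"\s+)OFF(\s*\))'
    )
    return pattern.sub(r"\1ON\2", cmake_text)


# Example on an in-memory string (the OFF default is an assumption).
sample = 'option(USE_ASCEND_DIRECT "option for using ascend npu with adxl engine" OFF)'
print(enable_cmake_option(sample, "USE_ASCEND_DIRECT"))
```

Applying the function a second time leaves the text unchanged, so it is safe to run repeatedly.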

Install MPI.

apt-get install mpich libmpich-dev -y

Install the dependencies. Go does not need to be installed.

bash dependencies.sh -y

Build and install.

mkdir build
cd build
cmake ..
make -j
make install

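Prefiller/Decoder Deployment

Run the following scripts on the prefiller and decoder nodes, respectively, to launch the servers. Note that each P/D node occupies the ports from kv_port through kv_port + num_chips to initialize its socket listeners, so choose kv_port values whose ranges do not conflict. Also make sure engine_id is assigned uniquely to avoid conflicts. In the scripts below, the two decoder nodes form a single data-parallel decode instance and therefore share one kv_port and one engine_id.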
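The port-range rule above can be checked before launch. The sketch below treats each instance as occupying the inclusive span from kv_port to kv_port + num_chips and reports any pair of spans that overlap. The function names are hypothetical. The check is deliberately conservative: it flags overlaps even between instances on different hosts, where they would presumably be harmless.

```python
from itertools import combinations
from typing import Dict, List, Tuple

NUM_CHIPS = 16  # chips per server in this guide


def kv_port_span(kv_port: int, num_chips: int = NUM_CHIPS) -> Tuple[int, int]:
    """Inclusive (first, last) port span occupied by one instance."""
    return kv_port, kv_port + num_chips


def find_conflicts(kv_ports: Dict[str, int], num_chips: int = NUM_CHIPS) -> List[Tuple[str, str]]:
    """Return pairs of instance names whose port spans overlap."""
    conflicts = []
    for (a, port_a), (b, port_b) in combinations(kv_ports.items(), 2):
        a_lo, a_hi = kv_port_span(port_a, num_chips)
        b_lo, b_hi = kv_port_span(port_b, num_chips)
        if a_lo <= b_hi and b_lo <= a_hi:
            conflicts.append((a, b))
    return conflicts


# kv_port values used by the scripts in this guide.
guide_ports = {"prefiller-1": 30000, "prefiller-2": 30100, "decoder": 30200}
print(find_conflicts(guide_ports))  # prints []
```

With a spacing of 100 between kv_port values and 16 chips per server, the spans used in this guide do not overlap.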

Layerwise Mode

# Prefiller node 1 (192.0.0.1)
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.1
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH

vllm serve /model/Qwen3-235B-A22B-W8A8 \
  --host 0.0.0.0 \
  --port 8004 \
  --api-server-count 1 \
  --data-parallel-size 2 \
  --data-parallel-size-local 2 \
  --data-parallel-address 192.0.0.1 \
  --data-parallel-rpc-port 13389 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --seed 1024 \
  --enforce-eager \
  --distributed-executor-backend mp \
  --served-model-name qwen3-moe \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeLayerwiseConnector",
  "kv_role": "kv_producer",
  "kv_port": "30000",
  "engine_id": "0",
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 2,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 32,
                    "tp_size": 1
             }
      }
  }'

# Prefiller node 2 (192.0.0.2)
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.2
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH

vllm serve /model/Qwen3-235B-A22B-W8A8 \
  --host 0.0.0.0 \
  --port 8004 \
  --api-server-count 1 \
  --data-parallel-size 2 \
  --data-parallel-size-local 2 \
  --data-parallel-address 192.0.0.2 \
  --data-parallel-rpc-port 13389 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --seed 1024 \
  --enforce-eager \
  --distributed-executor-backend mp \
  --served-model-name qwen3-moe \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeLayerwiseConnector",
  "kv_role": "kv_producer",
  "kv_port": "30100",
  "engine_id": "1",
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 2,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 32,
                    "tp_size": 1
             }
      }
  }'

# Decoder node 1 (192.0.0.3)
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.3
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=2048
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH

vllm serve /model/Qwen3-235B-A22B-W8A8 \
  --host 0.0.0.0 \
  --port 8004 \
  --api-server-count 1 \
  --data-parallel-size 32 \
  --data-parallel-size-local 16 \
  --data-parallel-address 192.0.0.3 \
  --data-parallel-rpc-port 5964  \
  --tensor-parallel-size 1 \
  --enable-expert-parallel \
  --seed 1024 \
  --distributed-executor-backend mp \
  --served-model-name qwen3-moe \
  --max-model-len 32768 \
  --max-num-batched-tokens 512 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
  --compilation-config '{"cudagraph_capture_sizes":[16]}' \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeLayerwiseConnector",
  "kv_role": "kv_consumer",
  "kv_port": "30200",
  "engine_id": "2",
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 2,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 32,
                    "tp_size": 1
             }
      }
  }'

# Decoder node 2 (192.0.0.4), headless; joins the instance started on decoder node 1
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.4
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=2048
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH

vllm serve /model/Qwen3-235B-A22B-W8A8 \
  --host 0.0.0.0 \
  --port 8004 \
  --headless \
  --data-parallel-size 32 \
  --data-parallel-size-local 16 \
  --data-parallel-start-rank 16 \
  --data-parallel-address 192.0.0.3 \
  --data-parallel-rpc-port 5964  \
  --tensor-parallel-size 1 \
  --enable-expert-parallel \
  --seed 1024 \
  --distributed-executor-backend mp \
  --served-model-name qwen3-moe \
  --max-model-len 32768 \
  --max-num-batched-tokens 512 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
  --compilation-config '{"cudagraph_capture_sizes":[16]}' \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeLayerwiseConnector",
  "kv_role": "kv_consumer",
  "kv_port": "30200",
  "engine_id": "2",
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 2,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 32,
                    "tp_size": 1
             }
      }
  }'

Non-Layerwise Mode

# Prefiller node 1 (192.0.0.1)
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.1
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH

vllm serve /model/Qwen3-235B-A22B-W8A8 \
  --host 0.0.0.0 \
  --port 8004 \
  --api-server-count 1 \
  --data-parallel-size 2 \
  --data-parallel-size-local 2 \
  --data-parallel-address 192.0.0.1 \
  --data-parallel-rpc-port 13389 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --seed 1024 \
  --enforce-eager \
  --distributed-executor-backend mp \
  --served-model-name qwen3-moe \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnector",
  "kv_role": "kv_producer",
  "kv_port": "30000",
  "engine_id": "0",
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 2,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 32,
                    "tp_size": 1
             }
      }
  }'

# Prefiller node 2 (192.0.0.2)
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.2
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH

vllm serve /model/Qwen3-235B-A22B-W8A8 \
  --host 0.0.0.0 \
  --port 8004 \
  --api-server-count 1 \
  --data-parallel-size 2 \
  --data-parallel-size-local 2 \
  --data-parallel-address 192.0.0.2 \
  --data-parallel-rpc-port 13389 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --seed 1024 \
  --enforce-eager \
  --distributed-executor-backend mp \
  --served-model-name qwen3-moe \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnector",
  "kv_role": "kv_producer",
  "kv_port": "30100",
  "engine_id": "1",
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 2,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 32,
                    "tp_size": 1
             }
      }
  }'

# Decoder node 1 (192.0.0.3)
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.3
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=2048
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH

vllm serve /model/Qwen3-235B-A22B-W8A8 \
  --host 0.0.0.0 \
  --port 8004 \
  --api-server-count 1 \
  --data-parallel-size 32 \
  --data-parallel-size-local 16 \
  --data-parallel-address 192.0.0.3 \
  --data-parallel-rpc-port 5964  \
  --tensor-parallel-size 1 \
  --enable-expert-parallel \
  --seed 1024 \
  --distributed-executor-backend mp \
  --served-model-name qwen3-moe \
  --max-model-len 32768 \
  --max-num-batched-tokens 512 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
  --compilation-config '{"cudagraph_capture_sizes":[16]}' \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnector",
  "kv_role": "kv_consumer",
  "kv_port": "30200",
  "engine_id": "2",
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 2,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 32,
                    "tp_size": 1
             }
      }
  }'

# Decoder node 2 (192.0.0.4), headless; joins the instance started on decoder node 1
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.4
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=2048
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH

vllm serve /model/Qwen3-235B-A22B-W8A8 \
  --host 0.0.0.0 \
  --port 8004 \
  --headless \
  --data-parallel-size 32 \
  --data-parallel-size-local 16 \
  --data-parallel-start-rank 16 \
  --data-parallel-address 192.0.0.3 \
  --data-parallel-rpc-port 5964  \
  --tensor-parallel-size 1 \
  --enable-expert-parallel \
  --seed 1024 \
  --distributed-executor-backend mp \
  --served-model-name qwen3-moe \
  --max-model-len 32768 \
  --max-num-batched-tokens 512 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
  --compilation-config '{"cudagraph_capture_sizes":[16]}' \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnector",
  "kv_role": "kv_consumer",
  "kv_port": "30200",
  "engine_id": "2",
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 2,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 32,
                    "tp_size": 1
             }
      }
  }'
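In the scripts in this guide, the layerwise and non-layerwise variants differ only in kv_connector and kv_connector_module_path, and the nodes differ only in kv_role, kv_port, and engine_id. The sketch below generates the --kv-transfer-config JSON from those few inputs to reduce copy-and-paste errors. The function name `make_kv_transfer_config` and the mode labels are hypothetical. All keys and values are copied from the scripts above.

```python
import json

# Mode label -> (connector name, module path), copied from the launch scripts.
CONNECTORS = {
    "layerwise": (
        "MooncakeLayerwiseConnector",
        "vllm_ascend.distributed.mooncake_layerwise_connector",
    ),
    "non_layerwise": (
        "MooncakeConnector",
        "vllm_ascend.distributed.mooncake_connector",
    ),
}


def make_kv_transfer_config(mode: str, kv_role: str, kv_port: int, engine_id: int) -> str:
    """Build the JSON string passed to --kv-transfer-config."""
    # Only the two roles that appear in this guide are accepted here.
    if kv_role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"unexpected kv_role: {kv_role}")
    connector, module_path = CONNECTORS[mode]
    config = {
        "kv_connector": connector,
        "kv_role": kv_role,
        "kv_port": str(kv_port),      # the scripts pass kv_port as a string
        "engine_id": str(engine_id),  # the scripts pass engine_id as a string
        "kv_connector_module_path": module_path,
        "kv_connector_extra_config": {
            "prefill": {"dp_size": 2, "tp_size": 8},
            "decode": {"dp_size": 32, "tp_size": 1},
        },
    }
    return json.dumps(config)


# Configuration for prefiller node 1 in layerwise mode.
print(make_kv_transfer_config("layerwise", "kv_producer", 30000, 0))
```

The printed string can be passed, in single quotes, as the argument of --kv-transfer-config.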

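Example Proxy for Deployment

Run a proxy server on the same node as a prefiller service instance. The proxy programs are available in the repository's examples: load_balance_proxy_layerwise_server_example.py for layerwise mode and load_balance_proxy_server_example.py for non-layerwise mode.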

# Layerwise mode
python load_balance_proxy_layerwise_server_example.py \
    --host 192.0.0.1 \
    --port 8080 \
    --prefiller-hosts 192.0.0.1 192.0.0.2 \
    --prefiller-ports 8004 8004 \
    --decoder-hosts 192.0.0.3 \
    --decoder-ports 8004

# Non-layerwise mode
python load_balance_proxy_server_example.py \
    --host 192.0.0.1 \
    --port 8080 \
    --prefiller-hosts 192.0.0.1 192.0.0.2 \
    --prefiller-ports 8004 8004 \
    --decoder-hosts 192.0.0.3 \
    --decoder-ports 8004
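The proxy takes hosts and ports as two parallel lists, so the i-th port presumably belongs to the i-th host. That pairing is an assumption drawn from the command shape above, not from the proxy's source. The sketch below, with the hypothetical helper `pair_endpoints`, shows the pairing and rejects lists of different lengths, which is the easiest mistake to make when adding a node.

```python
from typing import List, Tuple


def pair_endpoints(hosts: List[str], ports: List[int]) -> List[Tuple[str, int]]:
    """Pair each host with the port at the same position; lengths must match."""
    if len(hosts) != len(ports):
        raise ValueError(
            f"got {len(hosts)} hosts but {len(ports)} ports; pass one port per host"
        )
    return list(zip(hosts, ports))


# Values from the proxy command in this guide.
prefillers = pair_endpoints(["192.0.0.1", "192.0.0.2"], [8004, 8004])
decoders = pair_endpoints(["192.0.0.3"], [8004])
print(prefillers, decoders)
```

Only decoder node 1 is listed because decoder node 2 runs with --headless and serves no API endpoint of its own.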

Verification

Send a completion request to the proxy server's endpoint to check that the service is working.

curl http://192.0.0.1:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-moe",
        "prompt": "Who are you?",
        "max_tokens": 100,
        "temperature": 0
    }'
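The same request can be issued from Python using only the standard library, which is convenient for scripted checks. This is a minimal sketch: the helper names are hypothetical, and the response parsing assumes an OpenAI-style completions body with the text under choices[0].text. The network call itself is left commented out because it requires the proxy to be running.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 100,
                             temperature: float = 0.0) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_body: bytes) -> str:
    """Extract the generated text, assuming an OpenAI-style completions response."""
    return json.loads(response_body)["choices"][0]["text"]


request = build_completion_request("http://192.0.0.1:8080", "qwen3-moe", "Who are you?")
# To send it (requires the proxy and all P/D instances to be running):
#   with urllib.request.urlopen(request, timeout=60) as resp:
#       print(first_completion_text(resp.read()))
```

A non-empty completion indicates that the request passed through the proxy, a prefiller, and the decoder.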