Distributed DP Server With Large Scale Expert Parallelism#

Getting Started#

vLLM-Ascend now supports prefill-decode (PD) disaggregation in large-scale Expert Parallelism (EP) scenarios. To achieve better performance, vLLM-Ascend uses a distributed DP server. With PD disaggregation, prefiller and decoder nodes can each be tuned for their own characteristics, which makes model deployment more flexible.
Take the DeepSeek model as an example and deploy it on 8 Atlas 800T A3 servers. Assume the server IPs range from 192.0.0.1 to 192.0.0.8. The first 4 servers are used as prefiller nodes and the last 4 as decoder nodes. Each prefiller node is deployed independently as its own master node, while all decoder nodes use 192.0.0.5 as their master node.
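
The topology described above can be written down as a small sketch. The function name `plan_cluster` and the dictionary keys are hypothetical and only illustrate the example layout; they are not part of vLLM-Ascend.

```python
import ipaddress


def plan_cluster(first_ip="192.0.0.1", num_servers=8, num_prefillers=4):
    """Assign a role and a DP master IP to each server of the example cluster."""
    base = ipaddress.IPv4Address(first_ip)
    ips = [str(base + i) for i in range(num_servers)]
    decoder_master = ips[num_prefillers]  # the first decoder node
    plan = []
    for idx, ip in enumerate(ips):
        if idx < num_prefillers:
            # Each prefiller node is its own master node.
            plan.append({"ip": ip, "role": "prefiller", "dp_master_ip": ip})
        else:
            # All decoder nodes share the first decoder node as master.
            plan.append({"ip": ip, "role": "decoder", "dp_master_ip": decoder_master})
    return plan


if __name__ == "__main__":
    for node in plan_cluster():
        print(node)
```

With the default arguments, the four decoder entries all report 192.0.0.5 as their master, while each prefiller entry reports its own IP.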

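Verify Multi-Node Communication Environment#

Physical Layer Requirements:#

The physical machines must be located on the same local area network (LAN) and have network connectivity.
All NPUs must be interconnected. For the Atlas A2 generation, NPUs are connected through HCCS within a node and through RDMA between nodes. For the Atlas A3 generation, NPUs are connected through HCCS both within a node and between nodes.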

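Verification Process:#

A3

Single-node verification:

Execute the following commands on each node in sequence. All results must be success and the status must be UP: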

# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf

Get the NPU IP addresses

for i in {0..15}; do hccn_tool -i $i -vnic -g;done

Get the superpodid and SDID

for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done

Cross-node PING test

# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done

A2

Single-node verification:

Execute the following commands on each node in sequence. All results must be success and the status must be UP:

# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf

Get the NPU IP addresses

for i in {0..7}; do hccn_tool -i $i -ip -g;done

Cross-node PING test

# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
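
The A3 and A2 loops above differ only in the number of NPUs per node (16 versus 8), the IP query flag (`-vnic` versus `-ip`), and the ping flag (`-hccs_ping` versus `-ping`). As a rough sketch, those differences can be captured in one helper that only builds the command strings and does not run them. The helper names are hypothetical, and the `| grep Ifname` filter of the LLDP check is omitted here.

```python
# Per-generation values taken from the shell loops above.
NPUS_PER_NODE = {"A3": 16, "A2": 8}
IP_QUERY_FLAG = {"A3": "-vnic", "A2": "-ip"}
PING_FLAG = {"A3": "-hccs_ping", "A2": "-ping"}

# Single-node checks shared by both generations.
COMMON_CHECKS = ["-lldp", "-link", "-net_health", "-netdetect", "-gateway"]


def build_check_commands(generation):
    """Return the single-node hccn_tool check commands for every NPU."""
    n = NPUS_PER_NODE[generation]
    return [f"hccn_tool -i {dev} {check} -g"
            for check in COMMON_CHECKS for dev in range(n)]


def build_ip_query_commands(generation):
    """Return the commands that print each NPU's IP address."""
    flag = IP_QUERY_FLAG[generation]
    return [f"hccn_tool -i {dev} {flag} -g"
            for dev in range(NPUS_PER_NODE[generation])]


def build_ping_commands(generation, target_ip):
    """Return the cross-node ping commands towards one remote NPU IP."""
    flag = PING_FLAG[generation]
    return [f"hccn_tool -i {dev} {flag} -g address {target_ip}"
            for dev in range(NPUS_PER_NODE[generation])]


if __name__ == "__main__":
    for cmd in build_ping_commands("A3", "x.x.x.x"):
        print(cmd)
```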

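Large Scale EP Model Deployment#

Generate Script with Configurations#

For the PD disaggregation scenario, we provide optimized configurations. You can use the following shell scripts to configure the prefiller nodes and the decoder nodes respectively.

Prefiller node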

#!/bin/sh
# run_dp_template.sh

# nic_name and local_ip can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"

# basic configuration for HCCL and connection
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=256

# obtain parameters from distributed DP server
export VLLM_DP_SIZE=$1
export VLLM_DP_MASTER_IP=$2
export VLLM_DP_MASTER_PORT=$3
export VLLM_DP_RANK_LOCAL=$4
export VLLM_DP_RANK=$5
export VLLM_DP_SIZE_LOCAL=$7

#pytorch_npu settings and vllm settings
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export VLLM_USE_MODELSCOPE="True"

# enable the distributed DP server
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1

# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable features from vllm-ascend
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
    --host 0.0.0.0 \
    --port $6 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek_r1 \
    --max-model-len 17000 \
    --max-num-batched-tokens 16384 \
    --trust-remote-code \
    --max-num-seqs 4 \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
    --enforce-eager \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_buffer_device": "npu",
      "kv_role": "kv_producer",
      "kv_parallel_size": "1",
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
    }' \
    --additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'

Decoder node

#!/bin/sh
# run_dp_template.sh

# nic_name and local_ip can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"

# basic configuration for HCCL and connection
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024

# obtain parameters from distributed DP server
export VLLM_DP_SIZE=$1
export VLLM_DP_MASTER_IP=$2
export VLLM_DP_MASTER_PORT=$3
export VLLM_DP_RANK_LOCAL=$4
export VLLM_DP_RANK=$5
export VLLM_DP_SIZE_LOCAL=$7

#pytorch_npu settings and vllm settings
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export VLLM_USE_MODELSCOPE="True"

# enable the distributed DP server
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1

# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable features from vllm-ascend
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
    --host 0.0.0.0 \
    --port $6 \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek_r1 \
    --max-model-len 17000 \
    --max-num-batched-tokens 256 \
    --trust-remote-code \
    --max-num-seqs 28 \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
    --kv-transfer-config \
        '{"kv_connector": "MooncakeConnectorV1",
        "kv_buffer_device": "npu",
        "kv_role": "kv_consumer",
        "kv_parallel_size": "1",
        "kv_port": "20001",
        "engine_id": "0",
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
        }' \
    --additional-config '{"enable_weight_nz_layout":true}'
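
Both templates read seven positional parameters: `$1` to `$5` and `$7` become `VLLM_DP_*` environment variables, and `$6` is passed to `--port`. The following sketch makes that ordering explicit. The function names are hypothetical, and the sketch only builds the argument list; it does not start vLLM.

```python
import shlex

# Meaning of the positional parameters $1..$7 in run_dp_template.sh.
TEMPLATE_ARGS = [
    "VLLM_DP_SIZE",         # $1
    "VLLM_DP_MASTER_IP",    # $2
    "VLLM_DP_MASTER_PORT",  # $3
    "VLLM_DP_RANK_LOCAL",   # $4
    "VLLM_DP_RANK",         # $5
    "--port",               # $6, the HTTP port of this engine
    "VLLM_DP_SIZE_LOCAL",   # $7
]


def template_command(dp_size, dp_ip, dp_port, dp_rank_local, dp_rank,
                     engine_port, dp_size_local,
                     template_path="./run_dp_template.sh"):
    """Build the argv list that launches one DP engine from the template."""
    values = [dp_size, dp_ip, dp_port, dp_rank_local, dp_rank,
              engine_port, dp_size_local]
    return ["bash", template_path] + [str(v) for v in values]


def describe(argv):
    """Map each positional value back to the setting it configures."""
    return dict(zip(TEMPLATE_ARGS, argv[2:]))


if __name__ == "__main__":
    argv = template_command(2, "192.0.0.1", 13395, 0, 0, 9000, 2)
    print(shlex.join(argv))
    print(describe(argv))
```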

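Start the Distributed DP Server for PD Disaggregation#

Run the following Python file on every node to use the distributed DP server. (We recommend using this feature on the official v0.9.1 release.)

Prefiller node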

import multiprocessing
import os
import sys

dp_size = 2  # total number of DP engines for decode/prefill
dp_size_local = 2  # number of DP engines on the current node
dp_rank_start = 0  # starting DP rank for the current node
# dp_ip is different on prefiller nodes in this example
dp_ip = "192.0.0.1"  # master node ip for DP communication
dp_port = 13395  # port used for DP communication
engine_port = 9000  # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"

if not os.path.exists(template_path):
    print(f"Template file {template_path} does not exist.")
    sys.exit(1)


def run_command(dp_rank_local, dp_rank, engine_port_):
    # The positional arguments match $1..$7 in run_dp_template.sh
    command = f"bash {template_path} {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}"
    os.system(command)


# Launch one process per local DP engine and wait for all of them
processes = []
for i in range(dp_size_local):
    dp_rank = dp_rank_start + i
    dp_rank_local = i
    engine_port_ = engine_port + i
    process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
    processes.append(process)
    process.start()

for process in processes:
    process.join()

Decoder node

import multiprocessing
import os
import sys

dp_size = 64  # total number of DP engines for decode/prefill
dp_size_local = 16  # number of DP engines on the current node
dp_rank_start = 0  # starting DP rank for the current node. e.g. 0/16/32/48
# dp_ip is the same on decoder nodes in this example
dp_ip = "192.0.0.5"  # master node ip for DP communication.
dp_port = 13395  # port used for DP communication
engine_port = 9000  # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"

if not os.path.exists(template_path):
    print(f"Template file {template_path} does not exist.")
    sys.exit(1)


def run_command(dp_rank_local, dp_rank, engine_port_):
    # The positional arguments match $1..$7 in run_dp_template.sh
    command = f"bash {template_path} {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}"
    os.system(command)


# Launch one process per local DP engine and wait for all of them
processes = []
for i in range(dp_size_local):
    dp_rank = dp_rank_start + i
    dp_rank_local = i
    engine_port_ = engine_port + i
    process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
    processes.append(process)
    process.start()

for process in processes:
    process.join()

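Note that prefiller nodes and decoder nodes may have different configurations. In this example, each prefiller node is deployed independently as its own master node, whereas all decoder nodes use the first decoder node as the master node. This leads to differences in 'dp_size_local' and 'dp_rank_start'.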
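
To make the difference concrete, the sketch below derives the launch parameters for one node from its role and its index within that role. It assumes, as in this example, that a prefiller node forms a DP group on its own while the decoder nodes share one DP group with consecutive ranks. The function name `node_launch_params` is hypothetical.

```python
def node_launch_params(role, node_index, engines_per_node, num_nodes,
                       engine_port=9000):
    """Derive dp_size, dp_rank_start and the per-engine ranks for one node.

    node_index is the 0-based position of the node among nodes of its role.
    """
    if role == "prefiller":
        # Independent master: the DP group spans only this node.
        dp_size = engines_per_node
        dp_rank_start = 0
    elif role == "decoder":
        # Shared master: the DP group spans all decoder nodes.
        dp_size = engines_per_node * num_nodes
        dp_rank_start = node_index * engines_per_node
    else:
        raise ValueError(f"unknown role: {role}")
    engines = [
        {"dp_rank_local": i, "dp_rank": dp_rank_start + i, "port": engine_port + i}
        for i in range(engines_per_node)
    ]
    return {
        "dp_size": dp_size,
        "dp_size_local": engines_per_node,
        "dp_rank_start": dp_rank_start,
        "engines": engines,
    }


if __name__ == "__main__":
    # Third decoder node (192.0.0.7 in this example).
    params = node_launch_params("decoder", 2, 16, 4)
    print(params["dp_size"], params["dp_rank_start"])
```

For the four decoder nodes this yields the starting ranks 0, 16, 32 and 48 mentioned in the script comment, and for every prefiller node it yields a starting rank of 0 with a DP size of 2.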

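Example Proxy for the Distributed DP Server#

In the PD disaggregation scenario, a proxy is needed to distribute requests. Run the following command to start the example proxy: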

python load_balance_proxy_server_example.py \
  --port 8000 \
  --host 0.0.0.0 \
  --prefiller-hosts \
    192.0.0.1 \
    192.0.0.2 \
    192.0.0.3 \
    192.0.0.4 \
  --prefiller-hosts-num \
    2 2 2 2 \
  --prefiller-ports \
    9000 9000 9000 9000 \
  --prefiller-ports-inc \
    2 2 2 2 \
  --decoder-hosts \
    192.0.0.5 \
    192.0.0.6 \
    192.0.0.7 \
    192.0.0.8 \
  --decoder-hosts-num \
    16 16 16 16 \
  --decoder-ports \
    9000 9000 9000 9000 \
  --decoder-ports-inc \
    16 16 16 16

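Parameter	Meaning
--port	Proxy service port
--host	Proxy service host IP
--prefiller-hosts	List of prefiller node hosts
--prefiller-hosts-num	Number of repetitions for each prefiller node host
--prefiller-ports	List of prefiller node ports
--prefiller-ports-inc	Number of increments for each prefiller node port
--decoder-hosts	List of decoder node hosts
--decoder-hosts-num	Number of repetitions for each decoder node host
--decoder-ports	List of decoder node ports
--decoder-ports-inc	Number of increments for each decoder node port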

You can get the proxy program from the examples directory of the repository: load_balance_proxy_server_example.py
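
One plausible reading of the repetition and increment parameters is sketched below: each host is repeated `hosts-num` times, each base port is expanded into `ports-inc` consecutive ports, and the two expanded lists are paired up. This is an assumption based on the parameter descriptions above and on the `engine_port + i` pattern in the launch scripts, not a copy of the proxy's code. The function name `expand_endpoints` is hypothetical.

```python
def expand_endpoints(hosts, hosts_num, ports, ports_inc):
    """Expand host/port lists into one (host, port) pair per DP engine."""
    expanded_hosts = [host for host, num in zip(hosts, hosts_num)
                      for _ in range(num)]
    expanded_ports = [port + i for port, inc in zip(ports, ports_inc)
                      for i in range(inc)]
    if len(expanded_hosts) != len(expanded_ports):
        raise ValueError("expanded host and port lists differ in length")
    return list(zip(expanded_hosts, expanded_ports))


if __name__ == "__main__":
    prefillers = expand_endpoints(
        ["192.0.0.1", "192.0.0.2", "192.0.0.3", "192.0.0.4"],
        [2, 2, 2, 2], [9000, 9000, 9000, 9000], [2, 2, 2, 2])
    for endpoint in prefillers:
        print(endpoint)
```

Under this reading, the command above describes 8 prefiller endpoints (ports 9000 and 9001 on each of 4 hosts) and 64 decoder endpoints (ports 9000 to 9015 on each of 4 hosts).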

Benchmark#

We recommend using the aisbench tool to evaluate performance. Run the following commands to install aisbench:

git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./

Before evaluating performance, unset the HTTP proxy as follows:

# unset proxy
unset http_proxy
unset https_proxy

You can place your datasets in the directory benchmark/ais_bench/datasets.
You can modify the configuration in the directory benchmark/ais_bench/benchmark/configs/models/vllm_api. Take vllm_api_stream_chat.py as an example.

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="vllm-ascend/DeepSeek-R1-W8A8",
        model="deepseek_r1",  # must match --served-model-name in the serve scripts
        request_rate = 28,
        retry = 2,
        host_ip = "192.0.0.1", # Proxy service host IP
        host_port = 8000,  # Proxy service Port
        max_out_len = 10,
        batch_size=1536,
        trust_remote_code=True,
        generation_kwargs = dict(
            temperature = 0,
            seed = 1024,
            ignore_eos=False,
        )
    )
]

Taking the gsm8k dataset as an example, run the following command to evaluate performance.

ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf  --debug  --mode perf

For more details on the aisbench commands and parameters, refer to aisbench.

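Prefill & Decode Configuration Details#

In the PD disaggregation scenario, we provide a set of optimized configurations.

Prefiller node

Set HCCL_BUFFSIZE=256
Add the '--enforce-eager' option to 'vllm serve'
The '--kv-transfer-config' option is as follows: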

--kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_buffer_device": "npu",
      "kv_role": "kv_producer",
      "kv_parallel_size": "1",
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
    }'

The '--additional-config' option is as follows:

--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'

Decoder node

Set HCCL_BUFFSIZE=1024
The '--kv-transfer-config' option is as follows:

--kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_buffer_device": "npu",
      "kv_role": "kv_consumer",
      "kv_parallel_size": "1",
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
    }'

The '--additional-config' option is as follows:

--additional-config '{"enable_weight_nz_layout":true}'
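
The role-specific settings listed above differ in only a few places: the HCCL buffer size, whether '--enforce-eager' is set, the "kv_role" value, and the additional config. The sketch below collects them in one table and renders the corresponding 'vllm serve' flags. The names `ROLE_SETTINGS`, `kv_transfer_config` and `role_flags` are hypothetical helpers for illustration; the option values are the ones shown in this section.

```python
import json

# Role-specific values taken from the configuration lists above.
ROLE_SETTINGS = {
    "prefiller": {
        "HCCL_BUFFSIZE": 256,
        "enforce_eager": True,
        "kv_role": "kv_producer",
        "additional_config": {"enable_weight_nz_layout": True,
                              "enable_prefill_optimizations": True},
    },
    "decoder": {
        "HCCL_BUFFSIZE": 1024,
        "enforce_eager": False,
        "kv_role": "kv_consumer",
        "additional_config": {"enable_weight_nz_layout": True},
    },
}


def kv_transfer_config(kv_role):
    """Return the KV transfer settings; only kv_role differs between roles."""
    return {
        "kv_connector": "MooncakeConnectorV1",
        "kv_buffer_device": "npu",
        "kv_role": kv_role,
        "kv_parallel_size": "1",
        "kv_port": "20001",
        "engine_id": "0",
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
    }


def role_flags(role):
    """Render the role-specific 'vllm serve' flags as an argv fragment."""
    settings = ROLE_SETTINGS[role]
    flags = ["--enforce-eager"] if settings["enforce_eager"] else []
    flags += ["--kv-transfer-config",
              json.dumps(kv_transfer_config(settings["kv_role"]))]
    flags += ["--additional-config", json.dumps(settings["additional_config"])]
    return flags


if __name__ == "__main__":
    for role in ROLE_SETTINGS:
        print(role, ROLE_SETTINGS[role]["HCCL_BUFFSIZE"], role_flags(role))
```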

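Parameters Description#

1. '--additional-config' parameter introduction:

"enable_weight_nz_layout": whether to convert quantized weights to NZ format to accelerate matrix multiplication.
"enable_prefill_optimizations": whether to enable the prefill optimizations for DeepSeek models.

2. Enable MTP: add the following option to your configuration.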

--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}'

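FAQ#

1. Prefiller nodes need warmup#

Because some NPU operators need several rounds of warmup to reach their best performance, we recommend warming up the service with some requests before running performance tests, in order to achieve the best end-to-end throughput.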
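
A warmup pass could look like the sketch below, which sends a few short completion requests through the proxy before the benchmark starts. It assumes that the proxy from this guide listens on 192.0.0.1:8000, that it forwards the OpenAI-compatible /v1/completions route, and that the model is served as deepseek_r1; the function names, prompt text, and request count are arbitrary choices for illustration.

```python
import json
import urllib.request


def build_warmup_request(prompt, host="192.0.0.1", port=8000,
                         model="deepseek_r1", max_tokens=8):
    """Build one short completion request addressed to the proxy."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(num_requests=8, **kwargs):
    """Send the warmup requests one after another and discard the replies."""
    for i in range(num_requests):
        request = build_warmup_request(f"Warmup request {i}: say hello.", **kwargs)
        with urllib.request.urlopen(request, timeout=600) as response:
            response.read()


# warm_up(num_requests=8)  # uncomment once the proxy and all nodes are running
```

How many requests are enough depends on the deployment; a reasonable check is to repeat the warmup until the observed latency stops improving.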

Node type	DP	TP	EP	max-model-len	max-num-batched-tokens	max-num-seqs	gpu-memory-utilization
prefill	2	8	16	17000	16384	4	0.9
decode	64	1	64	17000	256	28	0.9
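
In both rows of the table, the EP value equals DP × TP, which is also the number of NPUs in one DP group. The short check below verifies that arithmetic and converts it into a node count, assuming 16 NPUs per Atlas 800T A3 server as implied by the {0..15} device range used in the A3 verification commands. The helper name `summarize` is hypothetical.

```python
NPUS_PER_A3_NODE = 16  # assumption: matches the {0..15} range in the A3 checks

# DP, TP and EP values copied from the table above.
CONFIGS = {
    "prefill": {"dp": 2, "tp": 8, "ep": 16},
    "decode": {"dp": 64, "tp": 1, "ep": 64},
}


def summarize(cfg, npus_per_node=NPUS_PER_A3_NODE):
    """Return the NPU count of one DP group and the nodes it occupies."""
    npus = cfg["dp"] * cfg["tp"]
    nodes = -(-npus // npus_per_node)  # ceiling division
    return {"npus": npus, "ep_matches": npus == cfg["ep"], "nodes": nodes}


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, summarize(cfg))
```

The prefill row fills exactly one server per DP group, which is consistent with each prefiller node acting as its own master, and the decode row spans the four decoder servers.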