Distributed DP Server With Large-Scale Expert Parallelism#
Getting Started#
vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large-scale Expert Parallelism (EP) scenario. To achieve better performance, vLLM-Ascend uses the distributed DP server. In the PD disaggregation scenario, different optimization strategies can be applied to prefill and decode nodes according to their distinct characteristics, which allows more flexible model deployment.
Take the DeepSeek model as an example and deploy it on 8 Atlas 800T A3 servers. Assume the server IPs run from 192.0.0.1 to 192.0.0.8. Use the first 4 servers as prefiller nodes and the last 4 servers as decoder nodes. Each prefiller node is deployed independently as its own master node, while the decoder nodes all use 192.0.0.5 as the master node.
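The node layout described above can be written down as a small Python sketch. The function name `plan_nodes` and the dictionary keys are illustrative assumptions and not part of vLLM-Ascend; only the IP range and the prefiller/decoder split come from the text.

```python
import ipaddress


def plan_nodes(first_ip="192.0.0.1", total=8, prefillers=4):
    """Return role and DP master IP for each server in the example layout."""
    base = ipaddress.IPv4Address(first_ip)
    ips = [str(base + i) for i in range(total)]
    decoder_master = ips[prefillers]  # first decoder node, 192.0.0.5 here
    plan = []
    for i, ip in enumerate(ips):
        if i < prefillers:
            # each prefiller node is its own DP master
            plan.append({"ip": ip, "role": "prefiller", "dp_master": ip})
        else:
            # all decoder nodes share the first decoder node as DP master
            plan.append({"ip": ip, "role": "decoder", "dp_master": decoder_master})
    return plan
```

Calling `plan_nodes()` yields one entry per server, which can serve as a checklist when filling in `dp_ip` on each node later in this guide.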
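Verify Multi-Node Communication Environment#
Physical Layer Requirements:#
The physical machines must be on the same local area network (LAN) and have network connectivity.
All NPUs must be interconnected. For the Atlas A2 generation, intra-node connectivity is via HCCS and inter-node connectivity is via RDMA. For the Atlas A3 generation, both intra-node and inter-node connectivity are via HCCS.
Verification Process:#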
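Single-node verification (Atlas A3, devices 0 to 15):
Execute the following commands on each node in sequence. The results must all be success and the status must be UP: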
# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
Get NPU IP addresses
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
Get superpodid and SDID
for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
Cross-node PING test
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
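Single-node verification (Atlas A2, devices 0 to 7):
Execute the following commands on each node in sequence. The results must all be success and the status must be UP: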
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
Get NPU IP addresses
for i in {0..7}; do hccn_tool -i $i -ip -g;done
Cross-node PING test
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
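The per-device loops above can also be driven from Python, which makes it easier to collect the output of every check in one place. The sketch below only reuses the `hccn_tool` sub-commands already shown; the helper names `build_check_commands` and `run_checks` are hypothetical, and the output is printed unparsed because its exact format is not specified here.

```python
import subprocess

# Sub-commands taken from the checks above; -g reads the current setting.
CHECKS = ["-lldp", "-link", "-net_health", "-netdetect", "-gateway"]


def build_check_commands(npu_count):
    """Return one hccn_tool argv list per (NPU, check) pair."""
    return [
        ["hccn_tool", "-i", str(i), check, "-g"]
        for i in range(npu_count)
        for check in CHECKS
    ]


def run_checks(npu_count):
    """Run every check and print its raw output; requires hccn_tool on PATH."""
    for argv in build_check_commands(npu_count):
        result = subprocess.run(argv, capture_output=True, text=True)
        output = result.stdout.strip() or result.stderr.strip()
        print(" ".join(argv), "->", output)
```

On a node, `run_checks(16)` would correspond to the `{0..15}` loops and `run_checks(8)` to the `{0..7}` loops.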
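Large-Scale EP Model Deployment#
Generate Script with Configurations#
In the PD disaggregation scenario, we provide an optimized configuration. You can use the following shell scripts to configure the prefiller and decoder nodes respectively.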
# run_dp_template.sh (prefiller nodes)
#!/bin/sh
# local_ip is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"
# basic configuration for HCCL and connection
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=256
# obtain parameters from distributed DP server
export VLLM_DP_SIZE=$1
export VLLM_DP_MASTER_IP=$2
export VLLM_DP_MASTER_PORT=$3
export VLLM_DP_RANK_LOCAL=$4
export VLLM_DP_RANK=$5
export VLLM_DP_SIZE_LOCAL=$7
#pytorch_npu settings and vllm settings
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export VLLM_USE_MODELSCOPE="True"
# enable the distributed DP server
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable characteristics from vllm-ascend
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
--host 0.0.0.0 \
--port $6 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseek_r1 \
--max-model-len 17000 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--max-num-seqs 4 \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--enforce-eager \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
}' \
--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
# run_dp_template.sh (decoder nodes)
#!/bin/sh
# local_ip is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"
# basic configuration for HCCL and connection
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
# obtain parameters from distributed DP server
export VLLM_DP_SIZE=$1
export VLLM_DP_MASTER_IP=$2
export VLLM_DP_MASTER_PORT=$3
export VLLM_DP_RANK_LOCAL=$4
export VLLM_DP_RANK=$5
export VLLM_DP_SIZE_LOCAL=$7
#pytorch_npu settings and vllm settings
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export VLLM_USE_MODELSCOPE="True"
# enable the distributed DP server
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable characteristics from vllm-ascend
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
--host 0.0.0.0 \
--port $6 \
--tensor-parallel-size 1 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseek_r1 \
--max-model-len 17000 \
--max-num-batched-tokens 256 \
--trust-remote-code \
--max-num-seqs 28 \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
}' \
--additional-config '{"enable_weight_nz_layout":true}'
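Both templates read seven positional arguments, and the order is easy to get wrong because `$6` is the serving port while `$7` is the local DP size. The sketch below records that order in one place; `TEMPLATE_ARGS` and `template_argv` are hypothetical names introduced for illustration.

```python
# Order of positional arguments expected by run_dp_template.sh ($1..$7).
TEMPLATE_ARGS = (
    "dp_size",        # $1 -> VLLM_DP_SIZE
    "dp_ip",          # $2 -> VLLM_DP_MASTER_IP
    "dp_port",        # $3 -> VLLM_DP_MASTER_PORT
    "dp_rank_local",  # $4 -> VLLM_DP_RANK_LOCAL
    "dp_rank",        # $5 -> VLLM_DP_RANK
    "engine_port",    # $6 -> vllm serve --port
    "dp_size_local",  # $7 -> VLLM_DP_SIZE_LOCAL
)


def template_argv(script="./run_dp_template.sh", **values):
    """Build the argv for the template, failing early on a missing value."""
    missing = [name for name in TEMPLATE_ARGS if name not in values]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return ["bash", script] + [str(values[name]) for name in TEMPLATE_ARGS]
```

Passing the values by keyword means a caller cannot silently swap the port and the local DP size.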
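Start Distributed DP Server for Prefill-Decode Disaggregation#
Execute the following Python file on all nodes to use the distributed DP server. (We recommend using this feature on the official v0.9.1 release.)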
import multiprocessing
import os
import sys

dp_size = 2  # total number of DP engines for decode/prefill
dp_size_local = 2  # number of DP engines on the current node
dp_rank_start = 0  # starting DP rank for the current node
# dp_ip is different on prefiller nodes in this example
dp_ip = "192.0.0.1"  # master node ip for DP communication
dp_port = 13395  # port used for DP communication
engine_port = 9000  # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"

if not os.path.exists(template_path):
    print(f"Template file {template_path} does not exist.")
    sys.exit(1)


def run_command(dp_rank_local, dp_rank, engine_port_):
    command = f"bash ./run_dp_template.sh {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}"
    os.system(command)


processes = []
for i in range(dp_size_local):
    dp_rank = dp_rank_start + i
    dp_rank_local = i
    engine_port_ = engine_port + i
    process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
    processes.append(process)
    process.start()

for process in processes:
    process.join()
import multiprocessing
import os
import sys

dp_size = 64  # total number of DP engines for decode/prefill
dp_size_local = 16  # number of DP engines on the current node
dp_rank_start = 0  # starting DP rank for the current node. e.g. 0/16/32/48
# dp_ip is the same on decoder nodes in this example
dp_ip = "192.0.0.5"  # master node ip for DP communication.
dp_port = 13395  # port used for DP communication
engine_port = 9000  # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"

if not os.path.exists(template_path):
    print(f"Template file {template_path} does not exist.")
    sys.exit(1)


def run_command(dp_rank_local, dp_rank, engine_port_):
    command = f"bash ./run_dp_template.sh {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}"
    os.system(command)


processes = []
for i in range(dp_size_local):
    dp_rank = dp_rank_start + i
    dp_rank_local = i
    engine_port_ = engine_port + i
    process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
    processes.append(process)
    process.start()

for process in processes:
    process.join()
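Note that the prefiller nodes and the decoder nodes may have different configurations. In this example, each prefiller node is deployed independently as a master node, but all decoder nodes take the first decoder node as the master node. This leads to differences in 'dp_size_local' and 'dp_rank_start'.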
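The difference can be made concrete with a short sketch that derives `dp_size` and `dp_rank_start` for a given node. The function `node_dp_settings` and its arguments are hypothetical; the values it produces match the two launcher files in this example (a standalone group of 2 on each prefiller node, one group of 64 across 4 decoder nodes).

```python
def node_dp_settings(role, node_index, dp_size_local, decoder_nodes=4):
    """Return (dp_size, dp_rank_start) for one node in the example layout.

    role: "prefiller" or "decoder"
    node_index: 0-based position of the node within its role
    """
    if role == "prefiller":
        # standalone DP group per node: ranks always start at 0
        return dp_size_local, 0
    if role == "decoder":
        # one DP group spanning all decoder nodes
        return dp_size_local * decoder_nodes, node_index * dp_size_local
    raise ValueError(f"unknown role: {role}")
```

For the decoder at 192.0.0.7 (index 2), `node_dp_settings("decoder", 2, 16)` gives `(64, 32)`, so that node sets `dp_rank_start = 32`.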
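Example Proxy for Distributed DP Server#
In the PD disaggregation scenario, we need a proxy to distribute requests. Execute the following command to start the example proxy: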
python load_balance_proxy_server_example.py \
--port 8000 \
--host 0.0.0.0 \
--prefiller-hosts \
192.0.0.1 \
192.0.0.2 \
192.0.0.3 \
192.0.0.4 \
--prefiller-hosts-num \
2 2 2 2 \
--prefiller-ports \
9000 9000 9000 9000 \
--prefiller-ports-inc \
2 2 2 2 \
--decoder-hosts \
192.0.0.5 \
192.0.0.6 \
192.0.0.7 \
192.0.0.8 \
--decoder-hosts-num \
16 16 16 16 \
--decoder-ports \
9000 9000 9000 9000 \
--decoder-ports-inc \
16 16 16 16
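| Parameter | Meaning |
|---|---|
| --port | Proxy service port |
| --host | Proxy service host IP |
| --prefiller-hosts | Hosts of the prefiller nodes |
| --prefiller-hosts-num | Number of repetitions for each prefiller node host |
| --prefiller-ports | Ports of the prefiller nodes |
| --prefiller-ports-inc | Number of increments for each prefiller node port |
| --decoder-hosts | Hosts of the decoder nodes |
| --decoder-hosts-num | Number of repetitions for each decoder node host |
| --decoder-ports | Ports of the decoder nodes |
| --decoder-ports-inc | Number of increments for each decoder node port |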
You can get the proxy program from the examples directory of the repository: load_balance_proxy_server_example.py
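One way to read the "repetitions" and "increments" options is that each host is repeated N times and each base port yields N consecutive ports, which matches the `engine_port + i` scheme in the launcher files. The sketch below illustrates that reading only; `expand_endpoints` is a hypothetical helper and is not taken from load_balance_proxy_server_example.py.

```python
def expand_endpoints(hosts, hosts_num, ports, ports_inc):
    """Expand per-host settings into a flat list of (host, port) pairs."""
    # repeat each host hosts_num[i] times
    host_list = [h for h, n in zip(hosts, hosts_num) for _ in range(n)]
    # for each base port, emit ports_inc[i] consecutive ports
    port_list = [p + k for p, n in zip(ports, ports_inc) for k in range(n)]
    if len(host_list) != len(port_list):
        raise ValueError("host and port lists expand to different lengths")
    return list(zip(host_list, port_list))


prefillers = expand_endpoints(
    ["192.0.0.1", "192.0.0.2", "192.0.0.3", "192.0.0.4"],
    [2, 2, 2, 2],
    [9000, 9000, 9000, 9000],
    [2, 2, 2, 2],
)
```

Under this reading the command above addresses 8 prefiller engines (ports 9000 and 9001 on each host) and 64 decoder engines (ports 9000 to 9015 on each host).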
Benchmark#
We recommend using the aisbench tool to assess performance. Execute the following commands to install aisbench:
git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./
Before assessing performance, you need to unset the HTTP proxy as follows:
# unset proxy
unset http_proxy
unset https_proxy
You can place your datasets in the directory benchmark/ais_bench/datasets.
You can change the configuration in the directory benchmark/ais_bench/benchmark/configs/models/vllm_api. Take vllm_api_stream_chat.py as an example:
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="vllm-ascend/DeepSeek-R1-W8A8",
        model="dsr1",
        request_rate = 28,
        retry = 2,
        host_ip = "192.0.0.1", # Proxy service host IP
        host_port = 8000, # Proxy service Port
        max_out_len = 10,
        batch_size=1536,
        trust_remote_code=True,
        generation_kwargs = dict(
            temperature = 0,
            seed = 1024,
            ignore_eos=False,
        )
    )
]
Take the gsm8k dataset as an example and execute the following command to assess performance.
ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf --debug --mode perf
For more details on aisbench commands and parameters, please refer to aisbench.
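Prefill & Decode Configuration Details#
In the PD disaggregation scenario, we provide an optimized configuration.
Prefiller node
Set HCCL_BUFFSIZE=256
Add the '--enforce-eager' option to 'vllm serve'
The '--kv-transfer-config' parameter is as follows: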
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
}'
The '--additional-config' parameter is as follows:
--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
Decoder node
Set HCCL_BUFFSIZE=1024
The '--kv-transfer-config' parameter is as follows:
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
}'
The '--additional-config' parameter is as follows:
--additional-config '{"enable_weight_nz_layout":true}'
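The two '--kv-transfer-config' values above differ only in "kv_role". Generating the JSON from one dictionary avoids copy-paste drift between node types. This is a minimal sketch; the helper names `kv_transfer_config` and `as_cli_option` are hypothetical, while the keys and values are copied from the settings above.

```python
import json
import shlex


def kv_transfer_config(role):
    """Return the --kv-transfer-config JSON string for one node role."""
    kv_roles = {"prefiller": "kv_producer", "decoder": "kv_consumer"}
    config = {
        "kv_connector": "MooncakeConnectorV1",
        "kv_buffer_device": "npu",
        "kv_role": kv_roles[role],
        "kv_parallel_size": "1",
        "kv_port": "20001",
        "engine_id": "0",
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
    }
    return json.dumps(config)


def as_cli_option(role):
    """Quote the JSON so it can be pasted after `vllm serve`."""
    return "--kv-transfer-config " + shlex.quote(kv_transfer_config(role))
```

`print(as_cli_option("decoder"))` produces a single-line, shell-quoted option that is equivalent to the multi-line form shown above.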
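Parameter Description#
1. '--additional-config' parameter introduction:
"enable_weight_nz_layout": whether to convert quantized weights to NZ format to accelerate matrix multiplication.
"enable_prefill_optimizations": whether to enable the prefill optimizations for DeepSeek models.
3. Enable MTP: add the following option to your configuration.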
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}'
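Recommended Configuration Example#
For example, suppose the average input length is 3.5k, the output length is 1.1k, the context length is 16k, and the maximum length of the input dataset is 7k. For this scenario, we give a recommended configuration for the distributed DP server with high EP. Here we use 4 nodes for prefill and 4 nodes for decode.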
| Node type | DP | TP | EP | max-model-len | max-num-batched-tokens | max-num-seqs | gpu-memory-utilization |
|---|---|---|---|---|---|---|---|
| prefill | 2 | 8 | 16 | 17000 | 16384 | 4 | 0.9 |
| decode | 64 | 1 | 64 | 17000 | 256 | 28 | 0.9 |
Note
These configurations are not related to optimization. You need to adjust these parameters based on your actual scenario.
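In the table, EP equals DP × TP for both node types, and that product also equals the number of NPU devices the group occupies. The sketch below checks this arithmetic. It assumes 16 devices per node (taken from the `{0..15}` device loops earlier) and treats each prefiller node as its own group of one node; the names `RECOMMENDED` and `check_layout` are hypothetical.

```python
NPUS_PER_NODE = 16  # assumption: matches the {0..15} device loops used earlier

RECOMMENDED = {
    # nodes = how many servers share one DP group in this example
    "prefill": {"dp": 2, "tp": 8, "ep": 16, "nodes": 1},
    "decode": {"dp": 64, "tp": 1, "ep": 64, "nodes": 4},
}


def check_layout(cfg):
    """Check that DP x TP matches both EP and the NPUs available."""
    world_size = cfg["dp"] * cfg["tp"]
    return world_size == cfg["ep"] and world_size == cfg["nodes"] * NPUS_PER_NODE
```

Running `check_layout` on an adjusted row is a quick way to catch a DP/TP combination that does not fit the hardware before launching anything.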
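FAQ#
1. Prefiller nodes need to warm up#
Since the computation of some NPU operators requires several rounds of warm-up to reach best performance, we recommend warming up the service with some requests before running performance tests, in order to achieve the best end-to-end throughput.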
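A warm-up pass can be as simple as sending a handful of short requests through the proxy before the benchmark starts. The sketch below is one possible way to do that with the standard library only. It assumes the proxy at 192.0.0.1:8000 forwards an OpenAI-style `/v1/completions` endpoint, which is not confirmed by this guide; the helper names, prompt, and round count are illustrative.

```python
import json
import urllib.request


def build_warmup_request(host="192.0.0.1", port=8000, prompt="Hello", max_tokens=16):
    """Build (but do not send) one completion request for the proxy."""
    body = {
        "model": "deepseek_r1",  # --served-model-name from the serve scripts
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/completions",  # assumed endpoint path
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(rounds=8, **kwargs):
    """Send a few requests in sequence and return the HTTP status codes."""
    statuses = []
    for _ in range(rounds):
        request = build_warmup_request(**kwargs)
        with urllib.request.urlopen(request, timeout=120) as resp:
            statuses.append(resp.status)
    return statuses
```

Calling `warm_up()` once after all engines are up, and only then starting aisbench, keeps the warm-up rounds out of the measured results.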