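Distributed DP Server With Large-Scale Expert Parallelism#

Getting Started#

vLLM-Ascend now supports prefill-decode (PD) disaggregation in large-scale **Expert Parallelism (EP)** scenarios. For better performance, vLLM-Ascend uses a distributed DP server. With PD disaggregation, different optimization strategies can be applied to prefill and decode nodes according to their characteristics, which makes model deployment more flexible.

This guide takes the DeepSeek model as an example and deploys it on 8 Atlas 800T A3 servers. Assume the server IPs run from 192.0.0.1 to 192.0.0.8. The first 4 servers act as prefill nodes and the last 4 as decode nodes. Each prefill node is deployed independently as its own master node, while the decode nodes all use 192.0.0.5 as their master node.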

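Verify the Multi-Node Communication Environment#

Physical Layer Requirements#

  • The physical machines must be on the same LAN and have network connectivity to each other.

  • All NPUs must be interconnected. On the Atlas A2 generation, intra-node connectivity is via HCCS and inter-node connectivity is via RDMA. On the Atlas A3 generation, both intra-node and inter-node connectivity are via HCCS.

Verification Process#

Atlas A3 (the commands below iterate over NPU indices 0 to 15):

  1. Single-node verification:

    Run the following commands on each node in turn. Every result must be success and every link status must be UP.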

    # Check the remote switch ports
    for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
    # Get the link status of the Ethernet ports (UP or DOWN)
    for i in {0..15}; do hccn_tool -i $i -link -g ; done
    # Check the network health status
    for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
    # View the network detected IP configuration
    for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
    # View gateway configuration
    for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
    # View NPU network configuration
    cat /etc/hccn.conf
    
  2. Get the NPU IP addresses

    for i in {0..15}; do hccn_tool -i $i -vnic -g;done
    
  3. Get the superpodid and SDID

    for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
    
  4. Cross-node PING test

    # Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
    for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
    
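Atlas A2 (the commands below iterate over NPU indices 0 to 7):

  1. Single-node verification:

    Run the following commands on each node in turn. Every result must be success and every link status must be UP.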

    # Check the remote switch ports
    for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
    # Get the link status of the Ethernet ports (UP or DOWN)
    for i in {0..7}; do hccn_tool -i $i -link -g ; done
    # Check the network health status
    for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
    # View the network detected IP configuration
    for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
    # View gateway configuration
    for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
    # View NPU network configuration
    cat /etc/hccn.conf
    
  2. Get the NPU IP addresses

    for i in {0..7}; do hccn_tool -i $i -ip -g;done
    
  3. Cross-node PING test

    # Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
    for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
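
The link-status check above can be automated across all NPUs of a node. The following is a minimal Python sketch, not part of vLLM-Ascend: the helper names are hypothetical, and it assumes that the output of `hccn_tool -i <id> -link -g` contains the word UP when the link is up (please verify against the output on your own machines).

```python
import subprocess


def link_status_command(device_id: int) -> list[str]:
    """Build the hccn_tool link-status command used in the steps above."""
    return ["hccn_tool", "-i", str(device_id), "-link", "-g"]


def is_link_up(output: str) -> bool:
    """Assumption: the tool prints the word UP for a healthy link and DOWN otherwise."""
    tokens = output.upper().replace(":", " ").split()
    return "UP" in tokens and "DOWN" not in tokens


def down_devices(num_devices: int, run=None) -> list[int]:
    """Return the NPU indices whose link is not reported as UP."""
    if run is None:
        # Default runner: execute the real command and capture its stdout.
        def run(cmd):
            return subprocess.run(cmd, capture_output=True, text=True).stdout
    return [
        i for i in range(num_devices)
        if not is_link_up(run(link_status_command(i)))
    ]
```

On an A3 node you would call `down_devices(16)`, on an A2 node `down_devices(8)`; an empty list means every link was reported as UP under the assumed output format.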
    

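Large-Scale EP Model Deployment#

Generate the Configuration Scripts#

In the PD disaggregation scenario, we provide optimized configurations. Use the following shell scripts to configure the prefill nodes and the decode nodes respectively.

Prefill node script (kv_role is kv_producer):

# run_dp_template.sh
#!/bin/sh

# nic_name and local_ip can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip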
nic_name="xxxx"
local_ip="xxxx"

# basic configuration for HCCL and connection
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=256

# obtain parameters from distributed DP server
export VLLM_DP_SIZE=$1
export VLLM_DP_MASTER_IP=$2
export VLLM_DP_MASTER_PORT=$3
export VLLM_DP_RANK_LOCAL=$4
export VLLM_DP_RANK=$5
export VLLM_DP_SIZE_LOCAL=$7

#pytorch_npu settings and vllm settings
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export VLLM_USE_MODELSCOPE="True"

# enable the distributed DP server
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1

# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable characteristics from vllm-ascend
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
    --host 0.0.0.0 \
    --port $6 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek_r1 \
    --max-model-len 17000 \
    --max-num-batched-tokens 16384 \
    --trust-remote-code \
    --max-num-seqs 4 \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
    --enforce-eager \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_buffer_device": "npu",
      "kv_role": "kv_producer",
      "kv_parallel_size": "1",
      "kv_port": "20001",
      "engine_id": "0"
    }' \
    --additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'

Decode node script (kv_role is kv_consumer):

# run_dp_template.sh
#!/bin/sh

# nic_name and local_ip can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"

# basic configuration for HCCL and connection
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024

# obtain parameters from distributed DP server
export VLLM_DP_SIZE=$1
export VLLM_DP_MASTER_IP=$2
export VLLM_DP_MASTER_PORT=$3
export VLLM_DP_RANK_LOCAL=$4
export VLLM_DP_RANK=$5
export VLLM_DP_SIZE_LOCAL=$7

#pytorch_npu settings and vllm settings
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export VLLM_USE_MODELSCOPE="True"

# enable the distributed DP server
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1

# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable characteristics from vllm-ascend
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
    --host 0.0.0.0 \
    --port $6 \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek_r1 \
    --max-model-len 17000 \
    --max-num-batched-tokens 256 \
    --trust-remote-code \
    --max-num-seqs 28 \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
    --kv-transfer-config \
        '{"kv_connector": "MooncakeConnectorV1",
        "kv_buffer_device": "npu",
        "kv_role": "kv_consumer",
        "kv_parallel_size": "1",
        "kv_port": "20001",
        "engine_id": "0"
        }' \
    --additional-config '{"enable_weight_nz_layout":true}'

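Start the Distributed DP Server for Prefill-Decode Disaggregation#

Run the following Python script on every node to start the distributed DP server. (We recommend using this feature with the official v0.9.1 release.)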

Prefill node (dp_ip differs per prefill node, since each one is its own master):

import multiprocessing
import os
import sys

dp_size = 2  # total number of DP engines for decode/prefill
dp_size_local = 2  # number of DP engines on the current node
dp_rank_start = 0  # starting DP rank for the current node
# dp_ip is different on prefiller nodes in this example
dp_ip = "192.0.0.1"  # master node IP for DP communication
dp_port = 13395  # port used for DP communication
engine_port = 9000  # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"

if not os.path.exists(template_path):
    print(f"Template file {template_path} does not exist.")
    sys.exit(1)


def run_command(dp_rank_local, dp_rank, engine_port_):
    # The positional arguments map to $1..$7 in run_dp_template.sh
    command = f"bash {template_path} {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}"
    os.system(command)


# Launch one process per local DP engine, each on its own port
processes = []
for i in range(dp_size_local):
    dp_rank = dp_rank_start + i
    dp_rank_local = i
    engine_port_ = engine_port + i
    process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
    processes.append(process)
    process.start()

for process in processes:
    process.join()

Decode node (dp_ip is the same on all decode nodes; dp_rank_start differs per node):

import multiprocessing
import os
import sys

dp_size = 64  # total number of DP engines for decode/prefill
dp_size_local = 16  # number of DP engines on the current node
dp_rank_start = 0  # starting DP rank for the current node. e.g. 0/16/32/48
# dp_ip is the same on decoder nodes in this example
dp_ip = "192.0.0.5"  # master node IP for DP communication.
dp_port = 13395  # port used for DP communication
engine_port = 9000  # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"

if not os.path.exists(template_path):
    print(f"Template file {template_path} does not exist.")
    sys.exit(1)


def run_command(dp_rank_local, dp_rank, engine_port_):
    # The positional arguments map to $1..$7 in run_dp_template.sh
    command = f"bash {template_path} {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}"
    os.system(command)


# Launch one process per local DP engine, each on its own port
processes = []
for i in range(dp_size_local):
    dp_rank = dp_rank_start + i
    dp_rank_local = i
    engine_port_ = engine_port + i
    process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
    processes.append(process)
    process.start()

for process in processes:
    process.join()

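Note that prefill nodes and decode nodes may be configured differently. In this example, each prefill node is deployed independently as its own master node, while the decode nodes all use 192.0.0.5 as the master node. This is why 'dp_size_local' and 'dp_rank_start' differ between prefill and decode nodes.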
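
To make the per-node differences concrete, here is a small Python sketch that derives the launch parameters for each of the 8 example servers. The function name `launch_params` and the two node lists are hypothetical helpers for illustration; the values themselves come from the two launch scripts and the example IP layout in this guide.

```python
PREFILL_NODES = ["192.0.0.1", "192.0.0.2", "192.0.0.3", "192.0.0.4"]
DECODE_NODES = ["192.0.0.5", "192.0.0.6", "192.0.0.7", "192.0.0.8"]

PREFILL_DP_SIZE_LOCAL = 2   # DP engines per prefill node in this example
DECODE_DP_SIZE_LOCAL = 16   # DP engines per decode node in this example


def launch_params(node_ip: str) -> dict:
    """Return the values to set at the top of the launch script for one node."""
    if node_ip in PREFILL_NODES:
        # Each prefill node is its own master, so its DP group is local only.
        return {
            "dp_size": PREFILL_DP_SIZE_LOCAL,
            "dp_size_local": PREFILL_DP_SIZE_LOCAL,
            "dp_rank_start": 0,
            "dp_ip": node_ip,
        }
    if node_ip in DECODE_NODES:
        # All decode nodes share one DP group mastered by the first decode node.
        index = DECODE_NODES.index(node_ip)
        return {
            "dp_size": DECODE_DP_SIZE_LOCAL * len(DECODE_NODES),
            "dp_size_local": DECODE_DP_SIZE_LOCAL,
            "dp_rank_start": index * DECODE_DP_SIZE_LOCAL,
            "dp_ip": DECODE_NODES[0],
        }
    raise ValueError(f"Unknown node: {node_ip}")
```

For example, `launch_params("192.0.0.7")` gives a dp_rank_start of 32 with dp_ip 192.0.0.5, matching the 0/16/32/48 pattern noted in the decode script.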

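Example Proxy for the Distributed DP Server#

In the PD disaggregation scenario, a proxy is needed to distribute requests. Run the following command to start the example proxy: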

python load_balance_proxy_server_example.py \
  --port 8000 \
  --host 0.0.0.0 \
  --prefiller-hosts \
    192.0.0.1 \
    192.0.0.2 \
    192.0.0.3 \
    192.0.0.4 \
  --prefiller-hosts-num \
    2 2 2 2 \
  --prefiller-ports \
    9000 9000 9000 9000 \
  --prefiller-ports-inc \
    2 2 2 2 \
  --decoder-hosts \
    192.0.0.5 \
    192.0.0.6 \
    192.0.0.7 \
    192.0.0.8 \
  --decoder-hosts-num \
    16 16 16 16 \
  --decoder-ports \
    9000 9000 9000 9000 \
  --decoder-ports-inc \
    16 16 16 16

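| Parameter | Meaning |
| --- | --- |
| --port | Proxy service port |
| --host | Proxy service host IP |
| --prefiller-hosts | Hosts of the prefiller nodes |
| --prefiller-hosts-num | Number of times each prefiller host is repeated |
| --prefiller-ports | Ports of the prefiller nodes |
| --prefiller-ports-inc | Number of port increments for each prefiller node |
| --decoder-hosts | Hosts of the decoder nodes |
| --decoder-hosts-num | Number of times each decoder host is repeated |
| --decoder-ports | Ports of the decoder nodes |
| --decoder-ports-inc | Number of port increments for each decoder node |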

You can find the proxy program, load_balance_proxy_server_example.py, in the examples of the repository.
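
The host and port parameters above are easier to follow when expanded into explicit endpoints. The sketch below shows one plausible reading of the parameter descriptions: each host is repeated `hosts_num` times and its ports count up from the base port, which matches `engine_port + i` in the launch scripts. The function `expand_endpoints` is hypothetical and is not taken from load_balance_proxy_server_example.py; check the proxy source for the authoritative behavior.

```python
def expand_endpoints(hosts, hosts_num, ports, ports_inc):
    """Expand per-host repeat counts and port increments into (host, port) pairs."""
    endpoints = []
    for host, num, base_port, inc in zip(hosts, hosts_num, ports, ports_inc):
        if num != inc:
            # Assumption: one port per engine, so both counts must agree.
            raise ValueError(f"{host}: repeat count {num} != port increments {inc}")
        for offset in range(num):
            endpoints.append((host, base_port + offset))
    return endpoints


prefillers = expand_endpoints(
    ["192.0.0.1", "192.0.0.2", "192.0.0.3", "192.0.0.4"],
    [2, 2, 2, 2],
    [9000, 9000, 9000, 9000],
    [2, 2, 2, 2],
)
decoders = expand_endpoints(
    ["192.0.0.5", "192.0.0.6", "192.0.0.7", "192.0.0.8"],
    [16, 16, 16, 16],
    [9000, 9000, 9000, 9000],
    [16, 16, 16, 16],
)
```

Under this reading the proxy in the example balances across 8 prefill engines and 64 decode engines.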

Benchmark#

We recommend using the aisbench tool to evaluate performance. Run the following commands to install aisbench:

git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./

Before evaluating performance, unset the HTTP proxy settings as follows:

# unset proxy
unset http_proxy
unset https_proxy

  • You can place your datasets in the directory benchmark/ais_bench/datasets.

  • You can change the configuration in the directory benchmark/ais_bench/benchmark/configs/models/vllm_api. Take vllm_api_stream_chat.py as an example:

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="vllm-ascend/DeepSeek-R1-W8A8",
        model="deepseek_r1",  # must match --served-model-name in run_dp_template.sh
        request_rate = 28,
        retry = 2,
        host_ip = "192.0.0.1", # Proxy service host IP
        host_port = 8000,  # Proxy service Port
        max_out_len = 10,
        batch_size=1536,
        trust_remote_code=True,
        generation_kwargs = dict(
            temperature = 0,
            seed = 1024,
            ignore_eos=False,
        )
    )
]

  • Take the gsm8k dataset as an example. Run the following command to evaluate performance:

ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf  --debug  --mode perf

  • For more details on aisbench commands and parameters, refer to the aisbench documentation.

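Prefill and Decode Configuration Details#

In the PD disaggregation scenario, we provide optimized configurations.

  • Prefill node

  1. Set HCCL_BUFFSIZE=256

  2. Add the '--enforce-eager' flag to 'vllm serve'

  3. Set '--kv-transfer-config' as follows: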

    --kv-transfer-config \
        '{"kv_connector": "MooncakeConnectorV1",
          "kv_buffer_device": "npu",
          "kv_role": "kv_producer",
          "kv_parallel_size": "1",
          "kv_port": "20001",
          "engine_id": "0"
        }'
    
  4. Set '--additional-config' as follows:

    --additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
    
  • Decode node

  1. Set HCCL_BUFFSIZE=1024

  2. Set '--kv-transfer-config' as follows:

    --kv-transfer-config \
        '{"kv_connector": "MooncakeConnectorV1",
          "kv_buffer_device": "npu",
          "kv_role": "kv_consumer",
          "kv_parallel_size": "1",
          "kv_port": "20001",
          "engine_id": "0"
        }'
    
  3. Set '--additional-config' as follows:

    --additional-config '{"enable_weight_nz_layout":true}'
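
The differences between the two roles listed above can be captured in one place. The following Python sketch builds the role-specific settings as data; the function names `kv_transfer_config` and `node_settings` are hypothetical helpers, while the keys and values are copied from the configuration shown in this section.

```python
import json


def kv_transfer_config(role: str) -> str:
    """Return the JSON string for --kv-transfer-config for 'prefill' or 'decode'."""
    kv_role = {"prefill": "kv_producer", "decode": "kv_consumer"}[role]
    config = {
        "kv_connector": "MooncakeConnectorV1",
        "kv_buffer_device": "npu",
        "kv_role": kv_role,
        "kv_parallel_size": "1",
        "kv_port": "20001",
        "engine_id": "0",
    }
    return json.dumps(config)


def node_settings(role: str) -> dict:
    """Collect the settings that differ between prefill and decode nodes."""
    is_prefill = role == "prefill"
    additional = {"enable_weight_nz_layout": True}
    if is_prefill:
        additional["enable_prefill_optimizations"] = True
    return {
        "HCCL_BUFFSIZE": "256" if is_prefill else "1024",
        "enforce_eager": is_prefill,
        "kv_transfer_config": kv_transfer_config(role),
        "additional_config": json.dumps(additional),
    }
```

Generating both JSON strings from one helper avoids the two template scripts drifting apart when a shared field such as kv_port changes.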
    

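Parameter Description#

  1. '--additional-config' parameters:

    • "enable_weight_nz_layout": whether to convert quantized weights to the NZ format to accelerate matrix multiplication.

    • "enable_prefill_optimizations": whether to enable prefill optimizations for DeepSeek models.

  2. To enable MTP, add the following option to your configuration: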

    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}'
    

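FAQ#

1. Prefill nodes need to warm up#

Some NPU operators need several rounds of warm-up before they reach their best performance. We therefore recommend sending some warm-up requests to the service before running performance tests, in order to achieve the best end-to-end throughput.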
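
A warm-up pass can be as simple as sending a handful of short requests through the proxy before the benchmark starts. The sketch below uses only the Python standard library. It assumes that the proxy listens on 192.0.0.1:8000 (as in the benchmark configuration above), that it forwards the OpenAI-compatible `/v1/completions` route, and that the model is served as deepseek_r1; the function names and the number of rounds are hypothetical choices, not recommendations from vLLM-Ascend.

```python
import json
import urllib.request


def build_warmup_request(base_url: str, model: str, prompt: str, max_tokens: int = 16):
    """Build a POST request for the OpenAI-compatible completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(base_url: str, model: str, rounds: int = 8, timeout: float = 600.0) -> None:
    """Send a few sequential requests and discard the responses."""
    for i in range(rounds):
        request = build_warmup_request(base_url, model, f"Warm-up request {i}: say hello.")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()


# Example call (requires the proxy to be running):
# warm_up("http://192.0.0.1:8000", "deepseek_r1")
```

Longer prompts exercise the prefill path more heavily, so you may want to adjust the prompt to resemble your benchmark inputs.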