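Multi-Node Testing
Multi-node CI is designed to test distributed scenarios for very large models, for example disaggregated prefill (disaggregated_prefill) and multi data parallelism (DP) across multiple nodes.
How It Works
The figure below shows the basic deployment view of the multi-node CI mechanism and how the GitHub Action interacts with lws (a Kubernetes CRD resource).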

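From the workflow perspective, we can follow how the final test script is executed. The key pieces are the shared files tests/e2e/nightly/multi_node/scripts/lws.yaml.jinja2 and tests/e2e/nightly/multi_node/scripts/run.sh, which define the cluster template and the Pod entry script respectively. Each node runs different logic depending on the LWS_WORKER_INDEX environment variable, which lets multiple nodes form a distributed cluster and carry out the task together. run.sh selects the pytest entrypoint from the config path: internal DP configs use internal_dp/scripts/test_multi_node.py, and external DP configs use external_dp/scripts/test_external_dp.py.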
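The dispatch described above can be sketched in shell. This is a minimal illustration under stated assumptions, not the actual run.sh: the function names pick_pytest_entry and node_role are hypothetical, and matching on the substring external_dp is only a guess at how the config path might be inspected.

```shell
#!/bin/sh
# Hypothetical sketch of the dispatch described above; NOT the real run.sh.
MULTI_NODE_ROOT="tests/e2e/nightly/multi_node"

# Pick the pytest entrypoint from the config directory.
# Assumption: an external DP config directory has "external_dp" in its path.
pick_pytest_entry() {
    case "${1:-}" in
        *external_dp*) echo "$MULTI_NODE_ROOT/external_dp/scripts/test_external_dp.py" ;;
        *)             echo "$MULTI_NODE_ROOT/internal_dp/scripts/test_multi_node.py" ;;
    esac
}

# Map LWS_WORKER_INDEX to a role: index 0 is the leader, any other index a worker.
node_role() {
    if [ "${1:-0}" -eq 0 ]; then echo "leader"; else echo "worker"; fi
}

echo "node role:  $(node_role "${LWS_WORKER_INDEX:-0}")"
echo "entrypoint: $(pick_pytest_entry "${CONFIG_BASE_PATH:-}")"
```

With neither variable set, the sketch reports the leader role and the internal DP entrypoint, which mirrors the default described in the text.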

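How to Contribute
Upload custom weights
If you need custom weights, for example you quantized w8a8 weights for DeepSeek-V3 and want to run them on CI, you are welcome to upload them to the vllm-ascend organization on ModelScope. If you do not have upload permission, please contact @Potabk.
Add a config file
For the regular internal DP multi-node flow, add the config file to tests/e2e/nightly/multi_node/internal_dp/config/, for example DeepSeek-V3.yaml. External DP cases use a separate directory, tests/e2e/nightly/multi_node/external_dp/config/, which must be passed through config_base_path in the workflow or CONFIG_BASE_PATH locally. Suppose you have 2 nodes running a 1P1D setup (1 prefiller + 1 decoder).
You can add a config file like this: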
test_name: "test DeepSeek-V3 disaggregated_prefill"
# the model being tested
model: "vllm-ascend/DeepSeek-V3-W8A8"
# how large the cluster is
num_nodes: 2
npu_per_node: 16
# All env vars you need should add it here
env_common: &env_common
  VLLM_USE_MODELSCOPE: true
  OMP_PROC_BIND: false
  OMP_NUM_THREADS: 100
  HCCL_BUFFSIZE: 1024
  SERVER_PORT: 8080
disaggregated_prefill:
  enabled: true
  # node index(a list) which meet all the conditions:
  #  - prefiller
  #  - no headless(have api server)
  prefiller_host_index: [0]
  # node index(a list) which meet all the conditions:
  #  - decoder
  decoder_host_index: [1]
# Add each node's vllm serve cli command just like you run locally
# Add each node's individual envs like follow
deployment:
  - name: prefiller node # optional: just for description, not used in code
    envs:
      <<: *env_common
      VLLM_ASCEND_ENABLE_FLASHCOMM1: 1
      # Continue to add other envs if needed
    server_cmd: >
      vllm serve ...
  - name: decoder node # optional: just for description, not used in code
    envs:
      <<: *env_common
      VLLM_ASCEND_ENABLE_FLASHCOMM1: 1
      # Continue to add other envs if needed
    server_cmd: >
      vllm serve ...
benchmarks:
  perf:
    # fill with performance test kwargs
  acc:
    # fill with accuracy test kwargs
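Before committing a new config, a quick pre-flight check can catch a mismatch between num_nodes and the number of per-node commands. The helper below is a hypothetical sketch, not part of the repository: check_node_count is an invented name, it assumes one deployment entry (and therefore one server_cmd key) per node as in the example config, and it uses plain text matching rather than a real YAML parser.

```shell
#!/bin/sh
# Hypothetical pre-flight check; plain text matching, not a YAML parser.
# Assumption: each node has exactly one deployment entry with one server_cmd key.
check_node_count() {
    _config_file="$1"
    # Value of the top-level "num_nodes:" key.
    _expected=$(awk '/^num_nodes:/ {print $2; exit}' "$_config_file")
    # Number of indented "server_cmd:" keys (grep -c exits 1 on zero matches).
    _actual=$(grep -c '^[[:space:]]*server_cmd:' "$_config_file" || true)
    if [ "$_expected" = "$_actual" ]; then
        echo "ok: $_actual deployment entries for num_nodes=$_expected"
    else
        echo "mismatch: num_nodes=$_expected but $_actual server_cmd entries" >&2
        return 1
    fi
}
```

A typical call would be check_node_count tests/e2e/nightly/multi_node/internal_dp/config/DeepSeek-V3.yaml from the repository root.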
Add the test case to the nightly workflow
Currently, the multi-node test workflow is defined in .github/workflows/schedule_nightly_test_a3.yaml.
```yaml
multi-node-tests:
  name: multi-node
  if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
  strategy:
    fail-fast: false
    max-parallel: 1
    matrix:
      test_config:
        - name: multi-node-deepseek-pd
          config_file_path: DeepSeek-V3.yaml
          size: 2
        - name: multi-node-qwen3-dp
          config_file_path: Qwen3-235B-A22B.yaml
          size: 2
        - name: GLM5_1-W8A8-EP-external
          config_file_path: GLM5_1-W8A8-EP-external.yaml
          config_base_path: tests/e2e/nightly/multi_node/external_dp/config/
          size: 4
  uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
  with:
    soc_version: a3
    runner: linux-aarch64-a3-0
    image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3'
    replicas: 1
    size: ${{ matrix.test_config.size }}
    config_file_path: ${{ matrix.test_config.config_file_path }}
    config_base_path: ${{ matrix.test_config.config_base_path || '' }}
    name: ${{ matrix.test_config.name }}
  secrets:
    KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
```
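The matrix above defines every parameter needed to add a multi-machine case. The ones to pay attention to are size, config_file_path, and config_base_path. size is the number of nodes the case requires. config_file_path is the yaml file name, and config_base_path tells the loader which config directory to use. For internal DP cases, leave config_base_path empty and the loader falls back to the default internal DP config directory. For external DP cases, set it to tests/e2e/nightly/multi_node/external_dp/config/.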
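The fallback rule for config_base_path can be illustrated with a small shell sketch. It is a hypothetical stand-in for the loader's behavior, not its actual code: resolve_config_path is an invented name, and the default directory is assumed to be the internal DP config path named earlier in this page.

```shell
#!/bin/sh
# Hypothetical sketch of the path resolution described above.
# Assumption: the default directory is the internal DP config directory.
DEFAULT_CONFIG_BASE_PATH="tests/e2e/nightly/multi_node/internal_dp/config/"

resolve_config_path() {
    # Empty or unset base path falls back to the internal DP default.
    _base="${1:-$DEFAULT_CONFIG_BASE_PATH}"
    _file="$2"
    # Strip one trailing slash so the joined path has a single separator.
    echo "${_base%/}/$_file"
}

resolve_config_path "" DeepSeek-V3.yaml
resolve_config_path tests/e2e/nightly/multi_node/external_dp/config/ GLM5_1-W8A8-EP-external.yaml
```

The first call corresponds to a matrix entry without config_base_path, the second to the external DP entry.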
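Run Multi-Node Tests Locally
1. Use Kubernetes
This section assumes that you already have a local Kubernetes NPU cluster environment, so you can start the tests in a single step.
Step 1. Install the LWS CRD resources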
Step 2. Deploy the following lws.yaml file as needed
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
  namespace: vllm-project
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: None
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            imagePullPolicy: Always
            image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
            env:
              - name: CONFIG_YAML_PATH
                value: DeepSeek-V3.yaml
              - name: CONFIG_BASE_PATH
                value: tests/e2e/nightly/multi_node/internal_dp/config/
              - name: WORKSPACE
                value: "/vllm-workspace"
              - name: FAIL_TAG
                value: FAIL_TAG
            command:
              - sh
              - -c
              - |
                bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
            resources:
              limits:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
              requests:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
                cpu: 125
            ports:
              - containerPort: 8080
            # readinessProbe:
            #   tcpSocket:
            #     port: 8080
            #   initialDelaySeconds: 15
            #   periodSeconds: 10
            volumeMounts:
              - mountPath: /root/.cache
                name: shared-volume
              - mountPath: /usr/local/Ascend/driver/tools
                name: driver-tools
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
          - name: shared-volume
            persistentVolumeClaim:
              claimName: nv-action-vllm-benchmarks-v2
          - name: driver-tools
            hostPath:
              path: /usr/local/Ascend/driver/tools
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            imagePullPolicy: Always
            image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
            env:
              - name: CONFIG_YAML_PATH
                value: DeepSeek-V3.yaml
              - name: CONFIG_BASE_PATH
                value: tests/e2e/nightly/multi_node/internal_dp/config/
              - name: WORKSPACE
                value: "/vllm-workspace"
              - name: FAIL_TAG
                value: FAIL_TAG
            command:
              - sh
              - -c
              - |
                bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
            resources:
              limits:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
              requests:
                huawei.com/ascend-1980: 16
                ephemeral-storage: 100Gi
                cpu: 125
            volumeMounts:
              - mountPath: /root/.cache
                name: shared-volume
              - mountPath: /usr/local/Ascend/driver/tools
                name: driver-tools
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
          - name: shared-volume
            persistentVolumeClaim:
              claimName: nv-action-vllm-benchmarks-v2
          - name: driver-tools
            hostPath:
              path: /usr/local/Ascend/driver/tools
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
  namespace: vllm-project
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
kubectl apply -f lws.yaml
Verify the Pod status:
kubectl get pods -n vllm-project
You should get output similar to the following:
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s
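If you would rather check the table programmatically than by eye, the following sketch reads kubectl get pods output on stdin and succeeds only when every pod is Running with all containers ready. The function name all_pods_ready is hypothetical, and the check assumes the default column layout shown above (NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods` table output on stdin.
# Assumption: default columns, i.e. READY is column 2 and STATUS is column 3.
all_pods_ready() {
    awk '
        NR == 1 { next }                  # skip the header row
        {
            seen = 1
            split($2, ready, "/")         # READY column, e.g. "1/1"
            if ($3 != "Running" || ready[1] != ready[2]) bad = 1
        }
        END {
            if (seen && !bad) exit 0      # at least one pod, none unhealthy
            exit 1
        }
    '
}
```

For example, kubectl get pods -n vllm-project | all_pods_ready && echo "cluster is up" prints the message only once both the leader and the worker pod are ready.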
Verify that distributed inference works:
kubectl logs -f vllm-0 -n vllm-project
You should get a result similar to the following:
INFO 12-30 11:00:57 [__init__.py:43] Available plugins for group vllm.platform_plugins:
INFO 12-30 11:00:57 [__init__.py:45] - ascend -> vllm_ascend:register
INFO 12-30 11:00:57 [__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 12-30 11:00:57 [__init__.py:217] Platform plugin ascend is activated
INFO 12-30 11:00:57 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
================================================================================================== test session starts ===================================================================================================
platform linux -- Python 3.12.13, pytest-8.4.2, pluggy-1.6.0 -- /usr/local/python3.12.13/bin/python3
cachedir: .pytest_cache
rootdir: /vllm-workspace/vllm-ascend
configfile: pyproject.toml
plugins: cov-7.0.0, asyncio-1.3.0, mock-3.15.1, anyio-4.12.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item
tests/e2e/nightly/multi_node/internal_dp/scripts/test_multi_node.py::test_multi_node
[2025-12-30 11:01:01] INFO multi_node_config.py:294: Loading config yaml: tests/e2e/nightly/multi_node/internal_dp/config/DeepSeek-V3.yaml
[2025-12-30 11:01:01] INFO multi_node_config.py:348: Resolving cluster IPs via DNS...
[2025-12-30 11:01:01] INFO multi_node_config.py:212: Node 0 envs: {'VLLM_USE_MODELSCOPE': 'True', 'OMP_PROC_BIND': 'False', 'OMP_NUM_THREADS': '100', 'HCCL_BUFFSIZE': '1024', 'SERVER_PORT': '8080', 'NUMEXPR_MAX_THREADS': '128', 'DISAGGREGATED_PREFILL_PROXY_SCRIPT': 'examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py', 'HCCL_IF_IP': '10.0.0.102', 'HCCL_SOCKET_IFNAME': 'eth0', 'GLOO_SOCKET_IFNAME': 'eth0', 'TP_SOCKET_IFNAME': 'eth0', 'LOCAL_IP': '10.0.0.102', 'NIC_NAME': 'eth0', 'MASTER_IP': '10.0.0.102'}
[2025-12-30 11:01:01] INFO multi_node_config.py:159: Launching proxy: python examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py --host 10.0.0.102 --port 6000 --prefiller-hosts 10.0.0.102 --prefiller-ports 8080 --decoder-hosts 10.0.0.138 --decoder-ports 8080
[2025-12-30 11:01:01] INFO conftest.py:107: Starting server with command: vllm serve vllm-ascend/DeepSeek-V3-W8A8 --host 0.0.0.0 --port 8080 --data-parallel-size 2 --data-parallel-size-local 2 --tensor-parallel-size 8 --seed 1024 --enforce-eager --enable-expert-parallel --max-num-seqs 16 --max-model-len 8192 --max-num-batched-tokens 8192 --quantization ascend --trust-remote-code --no-enable-prefix-caching --gpu-memory-utilization 0.9 --kv-transfer-config {"kv_connector": "MooncakeConnectorV1", "kv_role": "kv_producer", "kv_port": "30000", "kv_connector_extra_config": { "prefill": { "dp_size": 2, "tp_size": 8 }, "decode": { "dp_size": 2, "tp_size": 8 } } }
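2. Test without Kubernetes
The same tests/e2e/nightly/multi_node/scripts/run.sh entrypoint can be used on prepared bare-metal or container hosts. Without LWS, you must set the values that Kubernetes normally injects yourself:
- cluster_hosts in the config yaml, using IPs that every node can reach.
- LWS_WORKER_INDEX on each node, starting from 0.
- CONFIG_YAML_PATH as the config file name and CONFIG_BASE_PATH as the config directory.
Use host NIC IPs that can reach each other, such as the addresses that ip addr or ifconfig shows on the active network interface. Do not use per-host Docker bridge addresses such as 172.17.0.1, because every host has its own local bridge.
Remove your local cluster_hosts edits before submitting a PR, unless those hosts are part of a committed test environment.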
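The warning about Docker bridge addresses can be turned into a quick pre-flight check on the IPs you plan to put in cluster_hosts. This is a hypothetical sketch: the function names are invented, and it assumes Docker is using its default bridge subnet 172.17.0.0/16, so a customized bridge network would not be detected.

```shell
#!/bin/sh
# Hypothetical pre-flight check for cluster_hosts entries.
# Assumption: Docker uses its default bridge subnet, 172.17.0.0/16.
is_docker_bridge_ip() {
    case "$1" in
        172.17.*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Returns non-zero if any of the given IPs looks like a Docker bridge address.
check_cluster_hosts() {
    _rc=0
    for _ip in "$@"; do
        if is_docker_bridge_ip "$_ip"; then
            echo "warning: $_ip looks like a Docker bridge address" >&2
            _rc=1
        fi
    done
    return "$_rc"
}
```

Calling check_cluster_hosts with the candidate addresses before editing the config gives an early warning, although it cannot confirm that the remaining addresses are actually reachable from every node.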
2.1 Internal DP Local Run
2.1.1 Add cluster hosts
Edit the internal DP config you want to run, for example:
tests/e2e/nightly/multi_node/internal_dp/config/DeepSeek-V3.yaml
Add cluster_hosts as a top-level field, for example near num_nodes and npu_per_node:
cluster_hosts:
- "172.22.0.xxx"
- "172.22.0.xxx"
2.1.2 Prepare the environment
Install the vllm-ascend development dependencies on every cluster host:
cd /vllm-workspace/vllm-ascend
python3 -m pip install -r requirements-dev.txt
Install AISBench on the first host (the node with LWS_WORKER_INDEX=0):
export AIS_BENCH_TAG="v3.1-20260330-master"
export AIS_BENCH_URL="https://github.com/AISBench/benchmark.git"
export BENCHMARK_HOME=/vllm-workspace/vllm-ascend/benchmark
git clone -b ${AIS_BENCH_TAG} --depth 1 ${AIS_BENCH_URL} $BENCHMARK_HOME
cd $BENCHMARK_HOME
pip install -e . -r requirements/api.txt -r requirements/extra.txt
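If your local image already contains the model, benchmark data, Ascend runtime, and AISBench, you only need the runtime exports in the next step.
2.1.3 Start each node
Run the script separately on each node. Start the worker nodes first, then node 0.
On node 1: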
export WORKSPACE=/vllm-workspace
export IS_PR_TEST=false
export CONFIG_YAML_PATH=DeepSeek-V3.yaml
export CONFIG_BASE_PATH=tests/e2e/nightly/multi_node/internal_dp/config/
export LWS_WORKER_INDEX=1
cd $WORKSPACE/vllm-ascend
bash tests/e2e/nightly/multi_node/scripts/run.sh
On node 0:
export WORKSPACE=/vllm-workspace
export IS_PR_TEST=false
export CONFIG_YAML_PATH=DeepSeek-V3.yaml
export CONFIG_BASE_PATH=tests/e2e/nightly/multi_node/internal_dp/config/
export LWS_WORKER_INDEX=0
cd $WORKSPACE/vllm-ascend
bash tests/e2e/nightly/multi_node/scripts/run.sh
Internal DP logs are mainly printed to the terminal that runs run.sh. When LOG_PREFIX is set, the shared script also backs up the Ascend logs to:
$LOG_PREFIX/node_<LWS_WORKER_INDEX>_plogs/
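The per-node exports in section 2.1.3 differ only in LWS_WORKER_INDEX, so they are easy to mistype when copied by hand. The sketch below generates them from arguments. It is a hypothetical convenience wrapper, not part of the repository: node_exports is an invented name, and defaulting the third argument to the internal DP config directory is an assumption made for this example.

```shell
#!/bin/sh
# Hypothetical wrapper around the manual exports above; NOT part of the repository.
# Usage: node_exports <LWS_WORKER_INDEX> <config yaml name> [config base path]
node_exports() {
    _index="$1"
    _config_yaml="$2"
    # Assumption: default to the internal DP config directory.
    _base="${3:-tests/e2e/nightly/multi_node/internal_dp/config/}"
    cat <<EOF
export WORKSPACE=/vllm-workspace
export IS_PR_TEST=false
export CONFIG_YAML_PATH=$_config_yaml
export CONFIG_BASE_PATH=$_base
export LWS_WORKER_INDEX=$_index
EOF
}

# Print the exports for node 1 so they can be reviewed before use.
node_exports 1 DeepSeek-V3.yaml
```

On a given node you could then run eval "$(node_exports 1 DeepSeek-V3.yaml)", change into $WORKSPACE/vllm-ascend, and start run.sh as shown above.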
2.2 External DP Local Run
2.2.1 Add cluster hosts
Edit the external DP config you want to run, for example:
tests/e2e/nightly/multi_node/external_dp/config/GLM5_1-W8A8-EP-external.yaml
Add cluster_hosts as a top-level field, for example near num_nodes and npu_per_node:
cluster_hosts:
- "172.22.0.xxx"
- "172.22.0.xxx"
- "172.22.0.xxx"
- "172.22.0.xxx"
2.2.2 Prepare the environment
Install the vllm-ascend development dependencies on every cluster host:
cd /vllm-workspace/vllm-ascend
python3 -m pip install -r requirements-dev.txt
Install AISBench on node 0:
export AIS_BENCH_TAG="v3.1-20260330-master"
export AIS_BENCH_URL="https://github.com/AISBench/benchmark.git"
export BENCHMARK_HOME=/vllm-workspace/vllm-ascend/benchmark
git clone -b ${AIS_BENCH_TAG} --depth 1 ${AIS_BENCH_URL} $BENCHMARK_HOME
cd $BENCHMARK_HOME
pip install -e . -r requirements/api.txt -r requirements/extra.txt
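If your local image already contains the model, benchmark data, Ascend runtime, and AISBench, you only need the runtime exports in the next step.
2.2.3 Start each node
External DP uses the same shared run.sh. Set CONFIG_BASE_PATH to the external DP config directory so that the script selects external_dp/scripts/test_external_dp.py.
Then start the non-master nodes first and node 0 last. The following example uses GLM5_1-W8A8-EP-external.yaml, a 4-node disaggregated prefill case.
On node 1, node 2, and node 3, set the matching LWS_WORKER_INDEX: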
export WORKSPACE=/vllm-workspace
export IS_PR_TEST=false
export CONFIG_BASE_PATH=tests/e2e/nightly/multi_node/external_dp/config/
export CONFIG_YAML_PATH=GLM5_1-W8A8-EP-external.yaml
export LWS_WORKER_INDEX=1 # Use 2 on node 2, and 3 on node 3.
cd $WORKSPACE/vllm-ascend
bash tests/e2e/nightly/multi_node/scripts/run.sh
On node 0:
export WORKSPACE=/vllm-workspace
export IS_PR_TEST=false
export CONFIG_BASE_PATH=tests/e2e/nightly/multi_node/external_dp/config/
export CONFIG_YAML_PATH=GLM5_1-W8A8-EP-external.yaml
export LWS_WORKER_INDEX=0
cd $WORKSPACE/vllm-ascend
bash tests/e2e/nightly/multi_node/scripts/run.sh
For GLM5_1-W8A8-EP-external.yaml, node 0 and node 1 start prefiller processes, node 2 and node 3 start decoder processes, and node 0 additionally starts the proxy and the benchmark.
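The start order for this section (every non-master node first, node 0 last) can be written down as a tiny helper, which is handy when the launch is scripted over several hosts. launch_order is a hypothetical name used only for this sketch, and the loop is a dry run that prints the order rather than starting anything.

```shell
#!/bin/sh
# Hypothetical helper: prints node indices in start order,
# i.e. every non-master node first and node 0 last.
launch_order() {
    _n="$1"
    _i=1
    while [ "$_i" -lt "$_n" ]; do
        printf '%s ' "$_i"
        _i=$((_i + 1))
    done
    printf '0\n'
}

# Dry run for the 4-node case: show which index to start on each turn.
for idx in $(launch_order 4); do
    echo "start node $idx (LWS_WORKER_INDEX=$idx)"
done
```

Replacing the echo with your own remote-execution command is left to the reader, since how hosts are reached depends on the local environment.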
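2.2.4 Read logs while the test is running
The terminal that runs run.sh prints the pytest orchestration logs. For external DP, AISBench output is also printed on node 0, while the stdout and stderr of the ranks and the proxy are written to EXTERNAL_DP_LOG_DIR. The default layout is: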
/tmp/external_dp_logs/
  node-0/
    rank-0.log
    rank-1.log
    proxy.log
  node-1/
    rank-0.log
    rank-1.log
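The first line of each rank log records the exact command and environment used to start that rank. proxy.log exists only on the configured proxy node, which is usually node 0.
When running several experiments locally, use a separate log directory for each: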
export EXTERNAL_DP_LOG_DIR=/tmp/external_dp_logs_pd_local
To follow the logs in real time, run the following commands in another terminal on the corresponding node:
# node 0: ranks and proxy
tail -F /tmp/external_dp_logs/node-0/rank-0.log \
/tmp/external_dp_logs/node-0/rank-1.log \
/tmp/external_dp_logs/node-0/proxy.log
# node 1: ranks
tail -F /tmp/external_dp_logs/node-1/rank-0.log \
/tmp/external_dp_logs/node-1/rank-1.log
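Because the first line of each rank log records the launch command and environment, printing just that line is a quick way to confirm how each rank was started. The helper below is a hypothetical sketch: show_rank_launch_lines is an invented name, and it assumes the default layout shown above, with EXTERNAL_DP_LOG_DIR as the root and /tmp/external_dp_logs as the fallback.

```shell
#!/bin/sh
# Hypothetical helper: print the first line of every rank log for one node.
# Assumption: logs follow the layout <log dir>/node-<N>/rank-<R>.log.
show_rank_launch_lines() {
    _dir="${EXTERNAL_DP_LOG_DIR:-/tmp/external_dp_logs}/node-$1"
    for _log in "$_dir"/rank-*.log; do
        [ -f "$_log" ] || continue      # the glob matched nothing
        printf '%s: ' "${_log##*/}"     # file name without the directory
        head -n 1 "$_log"
    done
}
```

For example, show_rank_launch_lines 0 on node 0 prints one line per rank log found there, and prints nothing if the directory has no rank logs yet.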