Multi-Node Testing

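Multi-node CI tests distributed scenarios for very large models, for example disaggregated prefill (disaggregated_prefill) and data parallelism (DP) spanning multiple nodes.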

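How It Works

The figure below shows the basic deployment view of the multi-node CI mechanism and how the GitHub Action interacts with lws (a Kubernetes CRD resource).

Multi-node CI deployment architecture diagram

From the workflow's point of view, we can follow how the final test script gets executed. The key pieces are the shared files tests/e2e/nightly/multi_node/scripts/lws.yaml.jinja2 and tests/e2e/nightly/multi_node/scripts/run.sh, which define the cluster template and the Pod entry script respectively. Each node runs different logic depending on the LWS_WORKER_INDEX environment variable, which lets multiple nodes form a distributed cluster and carry out the task. run.sh selects the pytest entry point from the config path: internal DP configs use internal_dp/scripts/test_multi_node.py, and external DP configs use external_dp/scripts/test_external_dp.py.

Multi-node test workflow diagram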
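The entry-point selection described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the code in run.sh: the function name select_pytest_entry and the matching rule (a path component named external_dp) are assumptions made for this example.

```python
# Illustrative sketch only; the real selection logic lives in run.sh and may differ.
# Paths are relative to tests/e2e/nightly/multi_node/.
INTERNAL_ENTRY = "internal_dp/scripts/test_multi_node.py"
EXTERNAL_ENTRY = "external_dp/scripts/test_external_dp.py"


def select_pytest_entry(config_base_path: str) -> str:
    """Pick the pytest entry point from the config directory.

    Assumption: a path with an "external_dp" component means external DP;
    anything else, including an empty path, falls back to internal DP.
    """
    components = [part for part in config_base_path.split("/") if part]
    return EXTERNAL_ENTRY if "external_dp" in components else INTERNAL_ENTRY


if __name__ == "__main__":
    for base in ("", "tests/e2e/nightly/multi_node/external_dp/config/"):
        print(repr(base), "->", select_pytest_entry(base))
```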

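How to Contribute

  1. Upload custom weights

    If you need custom weights, for example w8a8 weights you quantized for DeepSeek-V3 that you want to run in CI, you are welcome to upload them to the vllm-ascend organization on ModelScope. If you do not have upload permission, contact @Potabk.

  2. Add a config file

    For the regular internal DP multi-node flow, add the config file to tests/e2e/nightly/multi_node/internal_dp/config/, for example DeepSeek-V3.yaml. External DP cases use the separate tests/e2e/nightly/multi_node/external_dp/config/ directory, which must be passed through config_base_path in the workflow or CONFIG_BASE_PATH locally.

    Suppose you have 2 nodes running a 1P1D setup (1 prefiller + 1 decoder).

    You can add a config file like this: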

    test_name: "test DeepSeek-V3 disaggregated_prefill"
    # the model being tested
    model: "vllm-ascend/DeepSeek-V3-W8A8"
    # how large the cluster is
    num_nodes: 2
    npu_per_node: 16
    # Add every environment variable the test needs here
    env_common: &env_common
      VLLM_USE_MODELSCOPE: true
      OMP_PROC_BIND: false
      OMP_NUM_THREADS: 100
      HCCL_BUFFSIZE: 1024
      SERVER_PORT: 8080
    disaggregated_prefill:
      enabled: true
      # list of node indices that meet all of these conditions:
      #  - prefiller
      #  - not headless (runs an API server)
      prefiller_host_index: [0]
      # list of node indices that meet all of these conditions:
      #  - decoder
      decoder_host_index: [1]
    
    # Give each node the same vllm serve CLI command you would run locally
    # Put each node's own environment variables under envs, as follows
    deployment:
    - name: prefiller node # optional: just for description, not used in code
      envs:
        <<: *env_common
        VLLM_ASCEND_ENABLE_FLASHCOMM1: 1
        # Continue to add other envs if needed
      server_cmd: >
        vllm serve ...
    - name: decoder node # optional: just for description, not used in code
      envs:
        <<: *env_common
        VLLM_ASCEND_ENABLE_FLASHCOMM1: 1
        # Continue to add other envs if needed
      server_cmd: >
        vllm serve ...
    benchmarks:
      perf:
        # fill with performance test kwargs
      acc:
        # fill with accuracy test kwargs
    
  3. Add the test case to the nightly workflow

Currently, the multi-node test workflow is defined in .github/workflows/schedule_nightly_test_a3.yaml.

```yaml
multi-node-tests:
  name: multi-node
  if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
  strategy:
    fail-fast: false
    max-parallel: 1
    matrix:
      test_config:
        - name: multi-node-deepseek-pd
          config_file_path: DeepSeek-V3.yaml
          size: 2
        - name: multi-node-qwen3-dp
          config_file_path: Qwen3-235B-A22B.yaml
          size: 2
        - name: GLM5_1-W8A8-EP-external
          config_file_path: GLM5_1-W8A8-EP-external.yaml
          config_base_path: tests/e2e/nightly/multi_node/external_dp/config/
          size: 4
  uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
  with:
    soc_version: a3
    runner: linux-aarch64-a3-0
    image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3'
    replicas: 1
    size: ${{ matrix.test_config.size }}
    config_file_path: ${{ matrix.test_config.config_file_path }}
    config_base_path: ${{ matrix.test_config.config_base_path || '' }}
    name: ${{ matrix.test_config.name }}
  secrets:
    KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
```

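The matrix above defines every parameter needed to add a multi-node case. The parameters worth noting are size, config_file_path, and config_base_path. size is the number of nodes the case requires. config_file_path is the YAML file name, and config_base_path tells the loader which config directory to use. For internal DP cases, leave config_base_path empty and the loader falls back to the default internal DP config directory. For external DP cases, set it to tests/e2e/nightly/multi_node/external_dp/config/.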
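As a rough illustration of how a matrix entry maps to a config file, the sketch below resolves the effective path and checks size against the config's num_nodes. The helper names are hypothetical, and the default directory is assumed to be the internal DP config directory named earlier; the real loader may resolve paths differently.

```python
# Hypothetical helpers for illustration; not part of the repository.
# Assumption: the loader's default directory is the internal DP config directory.
DEFAULT_CONFIG_BASE = "tests/e2e/nightly/multi_node/internal_dp/config/"


def resolve_config_path(entry: dict) -> str:
    """Join the (possibly empty) config_base_path with config_file_path."""
    base = entry.get("config_base_path") or DEFAULT_CONFIG_BASE
    return base.rstrip("/") + "/" + entry["config_file_path"]


def size_matches(entry: dict, num_nodes: int) -> bool:
    """The matrix size should equal num_nodes in the referenced config."""
    return entry["size"] == num_nodes


internal = {
    "name": "multi-node-deepseek-pd",
    "config_file_path": "DeepSeek-V3.yaml",
    "size": 2,
}
external = {
    "name": "GLM5_1-W8A8-EP-external",
    "config_file_path": "GLM5_1-W8A8-EP-external.yaml",
    "config_base_path": "tests/e2e/nightly/multi_node/external_dp/config/",
    "size": 4,
}

if __name__ == "__main__":
    print(resolve_config_path(internal))
    print(resolve_config_path(external))
    print(size_matches(internal, num_nodes=2))
```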

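Running Multi-Node Tests Locally

1. Using Kubernetes

This section assumes you already have a local Kubernetes NPU cluster. With that in place, a single command starts the tests.

  • Step 1. Install the LWS CRD resources

    See https://lws.sigs.k8s.io/docs/installation/

  • Step 2. Deploy the following lws.yaml file, adjusting it as needed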

    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: vllm
      namespace: vllm-project
    spec:
      replicas: 1
      leaderWorkerTemplate:
        size: 2
        restartPolicy: None
        leaderTemplate:
          metadata:
            labels:
              role: leader
          spec:
            containers:
              - name: vllm-leader
                imagePullPolicy: Always
                image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
                env:
                  - name: CONFIG_YAML_PATH
                    value: DeepSeek-V3.yaml
                  - name: CONFIG_BASE_PATH
                    value: tests/e2e/nightly/multi_node/internal_dp/config/
                  - name: WORKSPACE
                    value: "/vllm-workspace"
                  - name: FAIL_TAG
                    value: FAIL_TAG
                command:
                  - sh
                  - -c
                  - |
                    bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
                resources:
                  limits:
                    huawei.com/ascend-1980: 16
                    memory: 512Gi
                    ephemeral-storage: 100Gi
                  requests:
                    huawei.com/ascend-1980: 16
                    memory: 512Gi
                    ephemeral-storage: 100Gi
                    cpu: 125
                ports:
                  - containerPort: 8080
                # readinessProbe:
                #   tcpSocket:
                #     port: 8080
                #   initialDelaySeconds: 15
                #   periodSeconds: 10
                volumeMounts:
                  - mountPath: /root/.cache
                    name: shared-volume
                  - mountPath: /usr/local/Ascend/driver/tools
                    name: driver-tools
                  - mountPath: /dev/shm
                    name: dshm
            volumes:
              - name: dshm
                emptyDir:
                  medium: Memory
                  sizeLimit: 15Gi
              - name: shared-volume
                persistentVolumeClaim:
                  claimName: nv-action-vllm-benchmarks-v2
              - name: driver-tools
                hostPath:
                  path: /usr/local/Ascend/driver/tools
        workerTemplate:
          spec:
            containers:
              - name: vllm-worker
                imagePullPolicy: Always
                image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
                env:
                  - name: CONFIG_YAML_PATH
                    value: DeepSeek-V3.yaml
                  - name: CONFIG_BASE_PATH
                    value: tests/e2e/nightly/multi_node/internal_dp/config/
                  - name: WORKSPACE
                    value: "/vllm-workspace"
                  - name: FAIL_TAG
                    value: FAIL_TAG
                command:
                  - sh
                  - -c
                  - |
                    bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
                resources:
                  limits:
                    huawei.com/ascend-1980: 16
                    memory: 512Gi
                    ephemeral-storage: 100Gi
                  requests:
                    huawei.com/ascend-1980: 16
                    ephemeral-storage: 100Gi
                    cpu: 125
                volumeMounts:
                  - mountPath: /root/.cache
                    name: shared-volume
                  - mountPath: /usr/local/Ascend/driver/tools
                    name: driver-tools
                  - mountPath: /dev/shm
                    name: dshm
            volumes:
              - name: dshm
                emptyDir:
                  medium: Memory
                  sizeLimit: 15Gi
              - name: shared-volume
                persistentVolumeClaim:
                  claimName: nv-action-vllm-benchmarks-v2
              - name: driver-tools
                hostPath:
                  path: /usr/local/Ascend/driver/tools
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-leader
      namespace: vllm-project
    spec:
      ports:
        - name: http
          port: 8080
          protocol: TCP
          targetPort: 8080
      selector:
        leaderworkerset.sigs.k8s.io/name: vllm
        role: leader
      type: ClusterIP
    
    kubectl apply -f lws.yaml
    

    Check the Pod status:

    kubectl get pods -n vllm-project
    

    The output should look similar to this:

    NAME       READY   STATUS    RESTARTS   AGE
    vllm-0     1/1     Running   0          2s
    vllm-0-1   1/1     Running   0          2s
    

    Verify that distributed inference is working:

    kubectl logs -f vllm-0 -n vllm-project
    

    The result should look similar to this:

    INFO 12-30 11:00:57 [__init__.py:43] Available plugins for group vllm.platform_plugins:
    INFO 12-30 11:00:57 [__init__.py:45] - ascend -> vllm_ascend:register
    INFO 12-30 11:00:57 [__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
    INFO 12-30 11:00:57 [__init__.py:217] Platform plugin ascend is activated
    INFO 12-30 11:00:57 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
    ================================================================================================== test session starts ===================================================================================================
    platform linux -- Python 3.12.13, pytest-8.4.2, pluggy-1.6.0 -- /usr/local/python3.12.13/bin/python3
    cachedir: .pytest_cache
    rootdir: /vllm-workspace/vllm-ascend
    configfile: pyproject.toml
    plugins: cov-7.0.0, asyncio-1.3.0, mock-3.15.1, anyio-4.12.0
    asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
    collected 1 item
    
    tests/e2e/nightly/multi_node/internal_dp/scripts/test_multi_node.py::test_multi_node [2025-12-30 11:01:01] INFO multi_node_config.py:294: Loading config yaml: tests/e2e/nightly/multi_node/internal_dp/config/DeepSeek-V3.yaml
    [2025-12-30 11:01:01] INFO multi_node_config.py:348: Resolving cluster IPs via DNS...
    [2025-12-30 11:01:01] INFO multi_node_config.py:212: Node 0 envs: {'VLLM_USE_MODELSCOPE': 'True', 'OMP_PROC_BIND': 'False', 'OMP_NUM_THREADS': '100', 'HCCL_BUFFSIZE': '1024', 'SERVER_PORT': '8080', 'NUMEXPR_MAX_THREADS': '128', 'DISAGGREGATED_PREFILL_PROXY_SCRIPT': 'examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py', 'HCCL_IF_IP': '10.0.0.102', 'HCCL_SOCKET_IFNAME': 'eth0', 'GLOO_SOCKET_IFNAME': 'eth0', 'TP_SOCKET_IFNAME': 'eth0', 'LOCAL_IP': '10.0.0.102', 'NIC_NAME': 'eth0', 'MASTER_IP': '10.0.0.102'}
    [2025-12-30 11:01:01] INFO multi_node_config.py:159: Launching proxy: python examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py --host 10.0.0.102 --port 6000 --prefiller-hosts 10.0.0.102 --prefiller-ports 8080 --decoder-hosts 10.0.0.138 --decoder-ports 8080
    [2025-12-30 11:01:01] INFO conftest.py:107: Starting server with command: vllm serve vllm-ascend/DeepSeek-V3-W8A8 --host 0.0.0.0 --port 8080 --data-parallel-size 2 --data-parallel-size-local 2 --tensor-parallel-size 8 --seed 1024 --enforce-eager --enable-expert-parallel --max-num-seqs 16 --max-model-len 8192 --max-num-batched-tokens 8192 --quantization ascend --trust-remote-code --no-enable-prefix-caching --gpu-memory-utilization 0.9 --kv-transfer-config {"kv_connector": "MooncakeConnectorV1", "kv_role": "kv_producer", "kv_port": "30000", 
    "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 2,
                    "tp_size": 8
            },
            "decode": {
                    "dp_size": 2,
                    "tp_size": 8
            }
        }
    }
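
The Pod names in the kubectl output above (vllm-0 and vllm-0-1) follow the LeaderWorkerSet naming pattern: one leader per replica group, plus size - 1 workers. The sketch below reproduces that pattern so you know which Pod to pass to kubectl logs. The function name is hypothetical, and the pattern is stated as commonly documented for LWS; check it against your installed LWS version.

```python
# Illustrative helper; not part of the repository or of LWS itself.
def lws_pod_names(name: str, replicas: int, size: int) -> list[str]:
    """List the expected Pod names for a LeaderWorkerSet.

    Assumed pattern: "<name>-<group>" for the leader and
    "<name>-<group>-<worker>" for workers, with worker indices starting at 1.
    """
    names = []
    for group in range(replicas):
        names.append(f"{name}-{group}")  # leader Pod of this group
        names.extend(f"{name}-{group}-{worker}" for worker in range(1, size))
    return names


if __name__ == "__main__":
    # replicas: 1 and size: 2, as in the lws.yaml example
    print(lws_pod_names("vllm", replicas=1, size=2))
```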
    

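2. Testing Without Kubernetes

The same tests/e2e/nightly/multi_node/scripts/run.sh entry point can be used on prepared bare-metal or container hosts. Without LWS, you have to set the values Kubernetes would normally inject:

  • cluster_hosts in the config YAML, using IPs that every node can reach.

  • LWS_WORKER_INDEX on each node, starting from 0.

  • CONFIG_YAML_PATH as the config file name and CONFIG_BASE_PATH as the config directory.

Use host NIC IPs that can reach each other, for example the addresses shown by ip addr or ifconfig on the active network interface. Do not use a per-host Docker bridge address such as 172.17.0.1, because every host has its own local bridge.

Before submitting a PR, remove your local cluster_hosts edits unless those hosts are part of the committed test environment.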
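A quick sanity check of cluster_hosts before launching can catch the Docker bridge mistake described above. The sketch below is a hypothetical helper, not part of the test suite. It assumes cluster_hosts holds one IP per node and that Docker uses its usual default bridge range 172.17.0.0/16; custom bridge networks are not detected.

```python
import ipaddress

# Docker's usual default bridge range; custom bridges use other subnets.
DOCKER_DEFAULT_BRIDGE = ipaddress.ip_network("172.17.0.0/16")


def check_cluster_hosts(cluster_hosts, num_nodes):
    """Return a list of human-readable problems (empty when none are found)."""
    problems = []
    # Assumption: cluster_hosts lists exactly one address per node.
    if len(cluster_hosts) != num_nodes:
        problems.append(
            f"expected {num_nodes} hosts, found {len(cluster_hosts)}"
        )
    for host in cluster_hosts:
        if ipaddress.ip_address(host) in DOCKER_DEFAULT_BRIDGE:
            problems.append(f"{host} looks like a Docker bridge address")
    return problems


if __name__ == "__main__":
    # Example addresses are placeholders, not real cluster hosts.
    for problem in check_cluster_hosts(["172.22.0.10", "172.17.0.1"], num_nodes=2):
        print("WARNING:", problem)
```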

2.1 Running Internal DP Locally

2.1.1 Add cluster hosts

Edit the internal DP config you want to run, for example:

tests/e2e/nightly/multi_node/internal_dp/config/DeepSeek-V3.yaml

Add cluster_hosts as a top-level field, for example near num_nodes and npu_per_node:

cluster_hosts:
  - "172.22.0.xxx"
  - "172.22.0.xxx"

2.1.2 Prepare the environment

Install the vllm-ascend development dependencies on every cluster host:

cd /vllm-workspace/vllm-ascend
python3 -m pip install -r requirements-dev.txt

Install AISBench on the first host (the node with LWS_WORKER_INDEX=0):

export AIS_BENCH_TAG="v3.1-20260330-master"
export AIS_BENCH_URL="https://github.com/AISBench/benchmark.git"
export BENCHMARK_HOME=/vllm-workspace/vllm-ascend/benchmark

git clone -b ${AIS_BENCH_TAG} --depth 1 ${AIS_BENCH_URL} $BENCHMARK_HOME
cd $BENCHMARK_HOME
pip install -e . -r requirements/api.txt -r requirements/extra.txt

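If your local image already contains the model, the benchmark data, the Ascend runtime, and AISBench, you only need the runtime exports in the next step.

2.1.3 Start each node

Run the script on each node separately. Start the worker nodes first, then node 0.

On node 1: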

export WORKSPACE=/vllm-workspace
export IS_PR_TEST=false
export CONFIG_YAML_PATH=DeepSeek-V3.yaml
export CONFIG_BASE_PATH=tests/e2e/nightly/multi_node/internal_dp/config/
export LWS_WORKER_INDEX=1

cd $WORKSPACE/vllm-ascend
bash tests/e2e/nightly/multi_node/scripts/run.sh

On node 0:

export WORKSPACE=/vllm-workspace
export IS_PR_TEST=false
export CONFIG_YAML_PATH=DeepSeek-V3.yaml
export CONFIG_BASE_PATH=tests/e2e/nightly/multi_node/internal_dp/config/
export LWS_WORKER_INDEX=0

cd $WORKSPACE/vllm-ascend
bash tests/e2e/nightly/multi_node/scripts/run.sh

Internal DP logs are printed mainly to the terminal running run.sh. When LOG_PREFIX is set, the shared script also backs up the Ascend logs to:

$LOG_PREFIX/node_<LWS_WORKER_INDEX>_plogs/
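
The per-node exports and the launch order from section 2.1.3 (workers first, node 0 last) can be generated rather than typed by hand. The sketch below is a hypothetical convenience script, not something the repository ships; it only prints the export lines shown above for each node.

```python
import shlex


def launch_plan(num_nodes, config_yaml_path, config_base_path,
                workspace="/vllm-workspace"):
    """Return (node_index, env) pairs in launch order: workers first, node 0 last."""
    order = list(range(1, num_nodes)) + [0]
    plan = []
    for index in order:
        env = {
            "WORKSPACE": workspace,
            "IS_PR_TEST": "false",
            "CONFIG_YAML_PATH": config_yaml_path,
            "CONFIG_BASE_PATH": config_base_path,
            "LWS_WORKER_INDEX": str(index),
        }
        plan.append((index, env))
    return plan


def export_lines(env):
    """Render an env mapping as shell export statements."""
    return [f"export {key}={shlex.quote(value)}" for key, value in env.items()]


if __name__ == "__main__":
    plan = launch_plan(
        2, "DeepSeek-V3.yaml", "tests/e2e/nightly/multi_node/internal_dp/config/"
    )
    for index, env in plan:
        print(f"# node {index}")
        print("\n".join(export_lines(env)))
```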

2.2 Running External DP Locally

2.2.1 Add cluster hosts

Edit the external DP config you want to run, for example:

tests/e2e/nightly/multi_node/external_dp/config/GLM5_1-W8A8-EP-external.yaml

Add cluster_hosts as a top-level field, for example near num_nodes and npu_per_node:

cluster_hosts:
  - "172.22.0.xxx"
  - "172.22.0.xxx"
  - "172.22.0.xxx"
  - "172.22.0.xxx"

2.2.2 Prepare the environment

Install the vllm-ascend development dependencies on every cluster host:

cd /vllm-workspace/vllm-ascend
python3 -m pip install -r requirements-dev.txt

Install AISBench on node 0:

export AIS_BENCH_TAG="v3.1-20260330-master"
export AIS_BENCH_URL="https://github.com/AISBench/benchmark.git"
export BENCHMARK_HOME=/vllm-workspace/vllm-ascend/benchmark

git clone -b ${AIS_BENCH_TAG} --depth 1 ${AIS_BENCH_URL} $BENCHMARK_HOME
cd $BENCHMARK_HOME
pip install -e . -r requirements/api.txt -r requirements/extra.txt

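If your local image already contains the model, the benchmark data, the Ascend runtime, and AISBench, you only need the runtime exports in the next step.

2.2.3 Start each node

External DP uses the same shared run.sh. Set CONFIG_BASE_PATH to the external DP config directory so that the script selects external_dp/scripts/test_external_dp.py.

Then start the non-primary nodes first and node 0 last. The example below uses GLM5_1-W8A8-EP-external.yaml, a 4-node disaggregated prefill case.

On node 1, node 2, and node 3, set the matching LWS_WORKER_INDEX: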

export WORKSPACE=/vllm-workspace
export IS_PR_TEST=false
export CONFIG_BASE_PATH=tests/e2e/nightly/multi_node/external_dp/config/
export CONFIG_YAML_PATH=GLM5_1-W8A8-EP-external.yaml
export LWS_WORKER_INDEX=1  # Use 2 on node 2, and 3 on node 3.

cd $WORKSPACE/vllm-ascend
bash tests/e2e/nightly/multi_node/scripts/run.sh

On node 0:

export WORKSPACE=/vllm-workspace
export IS_PR_TEST=false
export CONFIG_BASE_PATH=tests/e2e/nightly/multi_node/external_dp/config/
export CONFIG_YAML_PATH=GLM5_1-W8A8-EP-external.yaml
export LWS_WORKER_INDEX=0

cd $WORKSPACE/vllm-ascend
bash tests/e2e/nightly/multi_node/scripts/run.sh

For GLM5_1-W8A8-EP-external.yaml, node 0 and node 1 start prefiller processes, node 2 and node 3 start decoder processes, and node 0 also starts the proxy and the benchmark.
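
The role split for this 4-node case can be written down as a small lookup, which helps when deciding which node's logs to inspect. The function and constant names below are hypothetical; the index lists mirror the description above and are not read from the config file.

```python
# Illustrative mapping for GLM5_1-W8A8-EP-external.yaml, per the text above.
GLM_PREFILLER_NODES = (0, 1)
GLM_DECODER_NODES = (2, 3)


def node_roles(index, prefiller_nodes, decoder_nodes, proxy_node=0):
    """Return the roles a node plays, given the index lists for each role.

    Assumption: the proxy node is also the node that runs the benchmark.
    """
    roles = []
    if index in prefiller_nodes:
        roles.append("prefiller")
    if index in decoder_nodes:
        roles.append("decoder")
    if index == proxy_node:
        roles.extend(["proxy", "benchmark"])
    return roles


if __name__ == "__main__":
    for index in range(4):
        print(index, node_roles(index, GLM_PREFILLER_NODES, GLM_DECODER_NODES))
```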

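2.2.4 Read logs while the test is running

The terminal running run.sh prints the pytest orchestration logs. For external DP, AISBench output is also printed on node 0, while the stdout and stderr of each rank and of the proxy are written to EXTERNAL_DP_LOG_DIR. The default layout is: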

/tmp/external_dp_logs/
  node-0/
    rank-0.log
    rank-1.log
    proxy.log
  node-1/
    rank-0.log
    rank-1.log

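The first line of each rank log records the exact command and environment used to launch that rank. proxy.log exists only on the configured proxy node, usually node 0.

When running several experiments locally, use a separate log directory for each: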

export EXTERNAL_DP_LOG_DIR=/tmp/external_dp_logs_pd_local

To follow the logs live, run the following in another terminal on the corresponding node:

# node 0: ranks and proxy
tail -F /tmp/external_dp_logs/node-0/rank-0.log \
        /tmp/external_dp_logs/node-0/rank-1.log \
        /tmp/external_dp_logs/node-0/proxy.log

# node 1: ranks
tail -F /tmp/external_dp_logs/node-1/rank-0.log \
        /tmp/external_dp_logs/node-1/rank-1.log
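
To avoid hard-coding paths when EXTERNAL_DP_LOG_DIR changes, the log locations can be derived from the layout shown above. The sketch below is a hypothetical helper that assumes the node-<index>/rank-<n>.log naming from that layout; the number of ranks per node is a parameter you supply, and launch_line simply returns the first line of a rank log, which records the launch command and environment.

```python
import os
from pathlib import Path

DEFAULT_LOG_DIR = "/tmp/external_dp_logs"


def node_log_files(node_index, ranks_per_node, has_proxy, log_dir=None):
    """List the log files expected for one node under the default layout."""
    root = Path(log_dir or os.environ.get("EXTERNAL_DP_LOG_DIR", DEFAULT_LOG_DIR))
    node_dir = root / f"node-{node_index}"
    files = [node_dir / f"rank-{rank}.log" for rank in range(ranks_per_node)]
    if has_proxy:
        # proxy.log exists only on the configured proxy node
        files.append(node_dir / "proxy.log")
    return files


def launch_line(path):
    """Return the first line of a rank log, or None if the file is missing."""
    path = Path(path)
    if not path.exists():
        return None
    with path.open(encoding="utf-8", errors="replace") as handle:
        return handle.readline().rstrip("\n")


if __name__ == "__main__":
    for log_file in node_log_files(0, ranks_per_node=2, has_proxy=True):
        print(log_file, "->", launch_line(log_file))
```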