Qwen3-30B-A3B with 8xH100#
Environment Preparation#
The environment setup, model download, data, and checkpoint conversion are the same as for the Qwen3-4B model. You can refer to Example: Qwen3-4B Model, replacing mentions of Qwen3-4B with Qwen3-30B-A3B.
To convert the Hugging Face checkpoint to torch_dist format, run:
cd vime/
pip install -e . --no-deps
source scripts/models/qwen3-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/Qwen3-30B-A3B/ \
--save /root/Qwen3-30B-A3B_torch_dist/
Run Training#
Execute the training script:
cd /root/vime
bash scripts/run-qwen3-30B-A3B.sh
Parameter Introduction#
Here, we will briefly introduce the MoE-related parts in the run-qwen3-30B-A3B.sh script.
To support running Qwen3-30B-A3B in an 8xH800 environment, we need to enable Megatron’s CPU Adam to save GPU memory. The corresponding configuration is:
OPTIMIZER_ARGS=(
   ...
   --optimizer-cpu-offload
   --overlap-cpu-optimizer-d2h-h2d
   --use-precision-aware-optimizer
)
Enable MoE optimization supported by Megatron. The current configuration is tp4, ep8:
PERF_ARGS=(
   --tensor-model-parallel-size 4
   --sequence-parallel
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 8
   --expert-tensor-parallel-size 1
   ...
)
Enable MoE expert parallelism in vLLM. EP size is auto-derived as tensor_parallel_size × data_parallel_size, so for an 8-GPU engine, --vllm-enable-expert-parallel alone gives you EP=8:
VLLM_ARGS=(
   --rollout-num-gpus-per-engine 8
   --vllm-gpu-memory-utilization 0.7
   --vllm-enable-expert-parallel
   --vllm-cudagraph-capture-sizes 1 2 4 8 $(seq 16 8 256)
)
For DP on the attention block plus EP on the experts, combine --vllm-data-parallel-size N with --vllm-enable-expert-parallel.
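The EP derivation described above can be sketched as a small helper. This is only an illustration: the function name `vllm_parallel_layout` is hypothetical, and it assumes that the engine's tensor-parallel size is the GPUs per engine divided by the data-parallel size, which should be checked against the launcher's actual behavior.

```shell
#!/usr/bin/env bash
# Sketch: print the parallel layout of one vLLM rollout engine.
# Assumption (not confirmed by the script): TP = GPUs per engine / DP,
# and EP = TP x DP when --vllm-enable-expert-parallel is set.

vllm_parallel_layout() {
  local gpus_per_engine=$1   # value of --rollout-num-gpus-per-engine
  local dp_size=${2:-1}      # value of --vllm-data-parallel-size (default 1)

  if (( gpus_per_engine % dp_size != 0 )); then
    echo "error: ${gpus_per_engine} GPUs not divisible by DP=${dp_size}" >&2
    return 1
  fi

  local tp_size=$(( gpus_per_engine / dp_size ))
  echo "TP=${tp_size} DP=${dp_size} EP=$(( tp_size * dp_size ))"
}

vllm_parallel_layout 8      # prints: TP=8 DP=1 EP=8
vllm_parallel_layout 8 2    # prints: TP=4 DP=2 EP=8
```

Under this assumption, changing the data-parallel size moves GPUs between TP and DP for the attention block while the EP size stays equal to the GPUs per engine.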
Multi-Node Support#
The following uses two machines with 8 GPUs each (16 GPUs total) as the starting example; scripts and parameters scale to N nodes. Key differences from single-node:
Place weights, checkpoints, and data on storage visible to every node (e.g. NFS).
Set MASTER_ADDR to the head LAN IP (not 127.0.0.1).
Omit CPU Adam (multi-node uses a distributed optimizer; do not use --optimizer-cpu-offload).
global-batch-size must equal rollout-batch-size × n-samples-per-prompt.
Topology#
| Component | Dual-node defaults |
|---|---|
| Cluster | 2 nodes × 8 GPUs (16 GPUs total) |
| Megatron training | TP=8, EP=8, CP=2 (experts sharded across nodes) |
| vLLM rollout | Cross-node TP=16 (…) |
| Scheduling | Ray cluster + … |
Convert checkpoints with Megatron parallelism matching training (dual-node: TP=8, EP=8). Checkpoint EP must match --expert-model-parallel-size, or load_checkpoint may hang or resharding may be extremely slow.
Start the Ray Cluster#
Start Ray outside the training script on each node. Join all workers first; verify ray status reports the expected GPU count, then submit training from the head. Dual-node example:
# === Head node ===
export MASTER_ADDR=<head_lan_ip>
ray start --head --node-ip-address="${MASTER_ADDR}" --num-gpus 8 --disable-usage-stats \
--dashboard-host=0.0.0.0 --dashboard-port=8265
# === Each worker node ===
export MASTER_ADDR=<head_lan_ip>
ray start --address="${MASTER_ADDR}:6379" --node-ip-address=<this_node_lan_ip> --num-gpus 8
See Quick Start — Multi-node training for more details.
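The "verify ray status reports the expected GPU count" step can be automated with a small check. This is a sketch only: the helper names `ray_gpu_total` and `check_ray_gpus` are hypothetical, and the parsing assumes the resource section of `ray status` contains a line shaped like ` 0.0/16.0 GPU`, which may differ between Ray versions.

```shell
#!/usr/bin/env bash
# Sketch: compare the GPU total in `ray status` output with the expected count.
# Assumption: the usage section has a line shaped like " 0.0/16.0 GPU".

ray_gpu_total() {
  # Reads `ray status` text on stdin and prints the integer GPU total.
  awk '$2 == "GPU" { split($1, parts, "/"); printf "%d\n", parts[2]; exit }'
}

check_ray_gpus() {
  local expected=$1
  local actual
  actual=$(ray_gpu_total)
  if [[ "${actual}" == "${expected}" ]]; then
    echo "ok: Ray reports ${actual} GPUs"
  else
    echo "mismatch: expected ${expected} GPUs, Ray reports ${actual:-none}" >&2
    return 1
  fi
}

# Usage on the head node (requires a running Ray cluster):
#   ray status | check_ray_gpus 16
```

Running the check before submitting training avoids starting a job while a worker is still missing from the cluster.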
Run Training#
After the Ray cluster is ready, on the head node set multi-node env vars and run the same script as single-node (ACTOR_NUM_NODES>1 skips Ray startup and applies multi-node defaults):
export MASTER_ADDR=<head_lan_ip>
export ACTOR_NUM_NODES=2
export ACTOR_NUM_GPUS_PER_NODE=8
cd /root/vime
bash scripts/run-qwen3-30B-A3B.sh
2-step smoke test:
NUM_ROLLOUT=2 ENABLE_R3=0 bash scripts/run-qwen3-30B-A3B.sh
To scale to N nodes (e.g. 4×8), join all workers to Ray, set ACTOR_NUM_NODES=4 on the head, and tune MEGATRON_TP / MEGATRON_EP / MEGATRON_CP / ROLLOUT_NUM_GPUS_PER_ENGINE for total GPU count.
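When tuning these values for a new node count, a quick divisibility check can catch an invalid layout before launch. The sketch below uses a hypothetical helper, `check_megatron_layout`, and assumes pipeline parallel size 1 and expert tensor parallel size 1 as in the PERF_ARGS shown earlier. The divisibility rules are the usual Megatron ones and should be confirmed against your Megatron version.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check Megatron parallel sizes against the total GPU count.
# Assumptions: pipeline parallel size = 1, expert tensor parallel size = 1.

check_megatron_layout() {
  local nodes=$1 gpus_per_node=$2 tp=$3 ep=$4 cp=$5
  local world=$(( nodes * gpus_per_node ))

  # Dense layers: the world size must split evenly into TP x CP groups.
  if (( world % (tp * cp) != 0 )); then
    echo "invalid: ${world} GPUs not divisible by TP*CP=$(( tp * cp ))" >&2
    return 1
  fi
  # Expert layers: the world size must split evenly into EP groups (ETP=1).
  if (( world % ep != 0 )); then
    echo "invalid: ${world} GPUs not divisible by EP=${ep}" >&2
    return 1
  fi
  echo "world=${world} DP=$(( world / (tp * cp) )) expert_DP=$(( world / ep ))"
}

check_megatron_layout 2 8 8 8 2   # prints: world=16 DP=1 expert_DP=2
check_megatron_layout 1 8 4 8 1   # prints: world=8 DP=2 expert_DP=1
```

The two calls correspond to the dual-node defaults (TP=8, EP=8, CP=2) and the single-node tp4, ep8 configuration.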
Key Multi-Node Parameters#
| Variable | Dual-node default | Description |
|---|---|---|
| ACTOR_NUM_NODES | 2 (default 1 for single-node) | Total nodes including head; script skips Ray startup when >1 |
| ACTOR_NUM_GPUS_PER_NODE | 8 | GPUs per node |
| MEGATRON_TP / MEGATRON_EP / MEGATRON_CP | 8 / 8 / 2 | Megatron parallelism |
| ROLLOUT_NUM_GPUS_PER_ENGINE | total GPUs | vLLM engine GPU count |
| ENABLE_R3 | 1 | Set to 0 to disable R3 |
Default batch: rollout-batch-size=4, n-samples-per-prompt=2, global-batch-size=8; vLLM uses --vllm-moe-backend triton.
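The batch-size constraint can be verified before launch with simple arithmetic. The helper name `check_batch_sizes` below is hypothetical and not part of the training script. It only illustrates the rule that global-batch-size must equal rollout-batch-size × n-samples-per-prompt.

```shell
#!/usr/bin/env bash
# Sketch: verify global-batch-size == rollout-batch-size x n-samples-per-prompt.

check_batch_sizes() {
  local rollout_batch_size=$1 n_samples_per_prompt=$2 global_batch_size=$3
  local samples=$(( rollout_batch_size * n_samples_per_prompt ))

  if (( samples != global_batch_size )); then
    echo "mismatch: ${rollout_batch_size} x ${n_samples_per_prompt} = ${samples}," \
         "but global-batch-size=${global_batch_size}" >&2
    return 1
  fi
  echo "ok: global-batch-size=${global_batch_size}"
}

check_batch_sizes 4 2 8   # defaults from this page; prints: ok: global-batch-size=8
```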
Multi-Node Troubleshooting#
Worker cannot join Ray / NCCL failures: check MASTER_ADDR, container /etc/hosts (hostname must not map to 127.0.0.1), and NCCL_SOCKET_IFNAME / GLOO_SOCKET_IFNAME.
Not enough samples X for global_batch_size Y: keep global-batch-size equal to rollout-batch-size × n-samples-per-prompt.
GPU memory full but no processes: restart the container or run ray stop --force to clear stale vLLM contexts.
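The /etc/hosts check from the first item can be scripted. This is a minimal sketch with a hypothetical helper, `hostname_maps_to_loopback`. It treats any 127.x.x.x address as loopback, which is slightly broader than the 127.0.0.1 case named above.

```shell
#!/usr/bin/env bash
# Sketch: detect whether a hostname is mapped to a loopback address in a hosts file.

hostname_maps_to_loopback() {
  local host=$1 hosts_file=${2:-/etc/hosts}
  # Exit status 0 if a non-comment line maps $host to a 127.x.x.x address.
  awk -v host="${host}" '
    /^[[:space:]]*#/ { next }            # skip full-line comments
    $1 ~ /^127\./ {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^#/) break             # stop at a trailing comment
        if ($i == host) found = 1
      }
    }
    END { exit(found ? 0 : 1) }
  ' "${hosts_file}"
}

# Usage inside the container on each node:
if hostname_maps_to_loopback "${HOSTNAME}"; then
  echo "warning: ${HOSTNAME} resolves to loopback in /etc/hosts" >&2
fi
```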
EPLB#
When the total number of GPUs is not a multiple or divisor of the total number of experts, enable vLLM’s EPLB (Expert Parallelism Load Balancer) and configure redundant experts via --vllm-eplb-config. For example, in a 24-GPU scenario:
VLLM_ARGS=(
--rollout-num-gpus-per-engine 24
--vllm-gpu-memory-utilization 0.7
--vllm-data-parallel-size 3
--vllm-enable-expert-parallel
--vllm-enable-eplb
--vllm-eplb-config '{"num_redundant_experts": 16}'
)
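The value 16 in this example can be reproduced with a short calculation. The sketch assumes that Qwen3-30B-A3B has 128 routed experts and that vLLM needs the number of experts plus redundant experts to divide evenly by the EP size. The helper name `min_redundant_experts` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: smallest num_redundant_experts that makes the expert count
# divisible by the EP size.
# Assumption: (num_experts + redundant) % ep_size must be 0.

min_redundant_experts() {
  local num_experts=$1 ep_size=$2
  # Experts per GPU after rounding up (integer ceiling division).
  local per_gpu=$(( (num_experts + ep_size - 1) / ep_size ))
  echo $(( per_gpu * ep_size - num_experts ))
}

min_redundant_experts 128 24   # prints: 16  (144 experts, 6 per GPU)
min_redundant_experts 128 8    # prints: 0   (no EPLB padding needed)
```

With 24 GPUs, 128 experts round up to 6 per GPU, or 144 in total, which leaves 16 redundant slots and matches the configuration above.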